After implementing the new slot optimisation strategy and running a full onAB Blocks World experiment (learning modules and all), the new slot optimisation has been shown to be at least as good as the old strategy. Both clear and clear&clear modules are optimal and the onAB task is solved nearly immediately (the algorithm quickly finds that the clear a and b rule is best to use).
The new slot optimisation strategy works not by the algorithm suggested in the previous post, but by a similar update function to cross entropy. The slots exist in a new structure called an ordered distribution, where each element has a value between 0 and 1 (inclusive), relating to where in the order the element is best placed. So an element with a value 0 is best placed at the front, while an elemen with 0.5 is better in the middle. An ordered distribution is not normalised, but the values will always remain between 0 and 1.
This ordered distribution is used to create policies by sampling slots from the distribution with removal (by creating a probability distribution with custom weights), on an index by index basis. So for index 0, elements with value 0 are more likely to be sampled than elements with value 1. Element values are updated using the same cross-entropy elite sample update function, except instead of how many times a slot occurs, the algorithm uses the average placing of the slot (assuming the slot was used during exploration, otherwise it receives a placing of 1). Gradually the slot placing converges and finds an optimal placement.
The benefit of using this distribution over the previously mentioned indexed distribution are that this distribution doesn’t care about the number of slots needed, so slots can be added and removed on the fly. It also is much smaller than an indexed distribution which was O(n^2), whereas this one is O(n). And finally this one copes with the use of automatically added LGG rules better.
The benefits the ordered distribution have over the probability distribution are that an ordered distribution can find an ordering, rather than just find useful slots.
Next up, I need to implement the cross-entrobeam optimisation algorithm, which should hopefully speed up rule optimsation even faster. Though perhaps before this I’ll just add in the action weighting in PacMan, which shouldn’t take long to implement, but perhaps a little longer to ensure it’s working correctly.