Experiments have shown the algorithm to be effective at learning an optimal policy for Blocks World (onAB, anyway), and it can create reasonable (possibly optimal) modules too.
The problem is, it is far too slow on Ms. PacMan (a naturally slow environment). Furthermore, the games presented on the Dresden GGP server are tailored for agents that can quickly pick up learning strategies and play effectively. In fact, I’d say they’re better tailored for forward-planning agents, but this current algorithm isn’t there yet. So I need to look at options for speeding up the learning process.
The sliding-window learning was the first step. While it is currently only as fast as the previous version in theory, it could be made faster, perhaps by experimenting with an update parameter equal to the regular learner’s (as opposed to 1/10th of it). However, this may lead to premature convergence. Only experimentation will tell for sure.
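For reference, the update both variants share is the standard smoothed cross-entropy rule: blend a rule’s old selection probability with its frequency among the elite samples, weighted by a step-size parameter. The numeric values below are illustrative only, not the actual parameters used.

```python
def ce_update(p_old, elite_freq, alpha):
    """Smoothed cross-entropy update: blend a rule's old selection
    probability with its observed frequency in the elite samples."""
    return (1 - alpha) * p_old + alpha * elite_freq

# A full-strength step moves the probability much further per update
# than a 1/10th-strength one -- which is where the risk of premature
# convergence comes from.
fast = ce_update(0.5, 1.0, 0.6)   # regular step size (illustrative value)
slow = ce_update(0.5, 1.0, 0.06)  # 1/10th step size
```

With the larger step the probability jumps from 0.5 to 0.8 in one update; with the smaller one it only creeps to 0.53, so many more samples are needed before convergence either way.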
An alternative update system could be to use an elite sample (beam?) whose size equals the maximum number of rules in a slot (say 7) and use the policies within it as a continual update. The update parameter would start off small, but grow with slot experience.
The algorithm would be something like this:
– Sample policies until we have S samples (S equals the maximum number of rules in the slots).
– Once we have S samples, perform an update, with the update parameter proportional to the average rule experience (Re) for the slot the rule occupies.
– Rule experience (Re) is simply how many times the rule has been part of a policy (and triggered, so unused rules gain no experience).
– Continue to generate policies, increasing Re and swapping out bad policies in S for better ones. Update after every step.
– The average rule experience can be calculated at update time, or perhaps simply when a rule is updated (as each rule has a home-slot member to ‘talk’ to).
This algorithm should be much faster than the regular one, or even the sliding-window one. However, I can make no guarantees about it converging to the optimal policy. It is likely that it will find a good policy, simply through random-ish exploration and its ability to ‘hang on’ to good policies. A possible addition to this algorithm is to ‘down-weight’ policies that didn’t make it into the elite beam. Furthermore, this algorithm approaches the holy grail of ‘anytime’ algorithms, which produce better results the longer they run.
Something to note about the cross-entropy method is that 95% of the samples are ‘thrown away.’ This seems a bit odd and wasteful. Sure, it ensures that you have the true ‘elite’ samples, but because the update parameter isn’t dramatic, the cost doesn’t quite balance.
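To make the waste concrete, here is the standard cross-entropy selection step with the usual 5% elite fraction; the 100-sample batch and scoring are just for illustration.

```python
def ce_elite(samples, rho=0.05):
    """Standard cross-entropy selection: keep only the top rho fraction
    of sampled (value, policy) pairs and discard the rest."""
    k = max(1, int(len(samples) * rho))
    return sorted(samples, key=lambda s: s[0], reverse=True)[:k]

# 100 sampled policies with illustrative scores: only 5 survive the
# selection step, so 95 evaluations contribute nothing to the update.
scores = [(v, "policy%d" % v) for v in range(100)]
elite = ce_elite(scores)
```

The elite-beam variant above avoids this by letting every sample compete for a beam slot, so no evaluation is discarded outright.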
The algorithm seems very similar to beam search, which may warrant some investigation… Perhaps I have come up with a new spin on the cross-entropy method. Crossbeam search? Entrobeam search? The name can wait until I have a paper and successful results.