PhD Progress: Controlling the Cross Entropy Parameters

On reading Tom’s thesis and its detailing of the TG and TGR algorithms, I realised that they’d probably be quite fast (faster than 24 hours, anyway). Of course, they have only been tested on Blocks World, which is a very different domain from Pacman.

A possible way to speed up the cross-entropy process is to adjust the parameters of the algorithm, namely the policy size and the rule number. By default, both are 100. These numbers are largely ‘magic’ (or standard, as far as cross-entropy methods go), but they could be adjusted to match the environment. Although I will need to find the proper values by experimentation, I am proposing that the policy size be set equal to the number of permutations of conditions, or perhaps conditions plus actions.
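To make it concrete where these parameters enter, here is a minimal, generic cross-entropy sketch in Python. It is not my actual implementation: it assumes each policy slot holds a categorical distribution over the rule base, and the names `population`, `selection_ratio`, `step_size` and `evaluate`, along with their default values, are placeholders chosen for illustration. Only the defaults of 100 for the policy size and rule number come from the text above.

```python
import random

def cross_entropy_search(evaluate, rule_count=100, policy_size=100,
                         population=50, selection_ratio=0.1,
                         step_size=0.6, iterations=20, seed=0):
    """Generic cross-entropy loop over rule-list policies (illustrative only)."""
    rng = random.Random(seed)
    # probs[slot][rule]: probability that `slot` holds `rule`; start uniform.
    probs = [[1.0 / rule_count] * rule_count for _ in range(policy_size)]
    n_elite = max(1, int(population * selection_ratio))
    rules = range(rule_count)
    for _ in range(iterations):
        # Sample a population of policies, one rule index per slot.
        policies = [[rng.choices(rules, weights=probs[s])[0]
                     for s in range(policy_size)]
                    for _ in range(population)]
        # Keep the best-scoring fraction of the population (the elite).
        elite = sorted(policies, key=evaluate, reverse=True)[:n_elite]
        # Shift each slot's distribution toward the elite's rule frequencies.
        for s in range(policy_size):
            counts = [0] * rule_count
            for policy in elite:
                counts[policy[s]] += 1
            probs[s] = [(1 - step_size) * p + step_size * c / n_elite
                        for p, c in zip(probs[s], counts)]
    return probs
```

In this sketch each iteration samples `population * policy_size` slots, and each slot's distribution has `rule_count` entries to update, which is why shrinking the policy size or rule number should cut the work per iteration. Whether it also speeds up convergence is exactly what the experiments need to show.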

In the Pacman world, there are a total of 11 predicates, with the total number of predicate permutations at 97. The total number of action permutations is 5. So, perhaps both rule number and policy size could be 97.

In a Blocks World case, where the goal is on(a,b) (or something similar), the predicate permutation number is approximately 18, using just on2 and clear1 as observations, plus move2 (the second argument can be free, tied or floor). Illegal states and such may change this figure somewhat. The action number is approximately 18 also. Maybe the policy size should depend on the maximum of the predicate and action permutation counts. Having smaller policies and a smaller random rule base may make the convergence much faster. It’ll all come down to experiments, really.
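The two sizing ideas floated so far (the maximum of the two counts, or conditions plus actions) can be written down in a few lines of Python. The function names are made up for illustration, and the counts are just the figures quoted above, with the Blocks World ones being approximate.

```python
def size_by_max(predicate_permutations, action_permutations):
    """Policy size as the larger of the two permutation counts."""
    return max(predicate_permutations, action_permutations)

def size_by_sum(predicate_permutations, action_permutations):
    """Alternative: policy size as conditions plus actions."""
    return predicate_permutations + action_permutations

# Figures quoted in the text (Blocks World values are approximate).
pacman_max = size_by_max(97, 5)          # 97
pacman_sum = size_by_sum(97, 5)          # 102
blocks_world_max = size_by_max(18, 18)   # 18
blocks_world_sum = size_by_sum(18, 18)   # 36
```

For Pacman both options land close to the current default of 100, so the real saving from this heuristic would show up in Blocks World.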

So, I have some experiments to run: decreasing the policy size, decreasing the random rule size, increasing the selection ratio, decreasing the population size, and decreasing the population size while increasing the selection ratio to balance.
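One way to keep these runs organised is a small table of configurations, sketched below in Python. The baseline policy size and rule number are the defaults of 100 mentioned earlier. The baseline population size and selection ratio, and every changed value, are placeholders I have not settled on. Reading "balance" as "keep the number of elite samples the same" is also an assumption.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CEConfig:
    policy_size: int = 100        # default from the text
    rule_count: int = 100         # default from the text
    population: int = 50          # placeholder value, not from the text
    selection_ratio: float = 0.1  # placeholder value, not from the text

BASELINE = CEConfig()

# One entry per planned experiment; each changes the baseline minimally.
EXPERIMENTS = {
    "smaller_policy": replace(BASELINE, policy_size=50),
    "smaller_rule_base": replace(BASELINE, rule_count=50),
    "higher_selection_ratio": replace(BASELINE, selection_ratio=0.2),
    "smaller_population": replace(BASELINE, population=25),
    "smaller_population_balanced": replace(
        BASELINE, population=25, selection_ratio=0.2),
}

def elite_count(cfg):
    """Number of samples kept per iteration under this configuration."""
    return max(1, round(cfg.population * cfg.selection_ratio))
```

With these placeholder numbers, halving the population while doubling the selection ratio keeps `elite_count` at 5, the same as the baseline, so the update is still based on the same number of samples.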

A related idea is splitting the policy into priorities. This would be useless in Blocks World, where actions are performed instantaneously. Perhaps it could be a parameter for the algorithm? Blocks World will almost certainly still function with a split policy, but it will be like shrinking the policy threefold. Experiments will tell…
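As a rough illustration of what a split might look like, the Python sketch below cuts a policy (a list of rule slots) into contiguous priority levels. The three levels are inferred from the "threefold" remark, and treating the levels as contiguous blocks of near-equal size is purely an assumption. `split_by_priority` is a hypothetical helper, not part of the existing code.

```python
def split_by_priority(policy, levels=3):
    """Split a list of rule slots into `levels` contiguous groups.

    Earlier groups get one extra slot when the length does not divide evenly.
    """
    size, extra = divmod(len(policy), levels)
    groups, start = [], 0
    for i in range(levels):
        end = start + size + (1 if i < extra else 0)
        groups.append(policy[start:end])
        start = end
    return groups

# A 100-slot policy becomes three groups of 34, 33 and 33 slots.
groups = split_by_priority(list(range(100)), levels=3)
group_sizes = [len(g) for g in groups]
```

Making `levels` a parameter means `levels=1` leaves the policy unsplit, which would be the natural setting for Blocks World.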