While I am still debating the wisdom of this approach (whether it will prove useful), I’ll note down the general algorithm.
So the current pre-goal strategy just notes the pre-goal, which means it is only really useful for getting to that pre-goal. But what about getting to the pre-pre-goal, and so on? It works well in blocks world because that is a simple environment, but it makes little sense in Ms. PacMan, which has several distinct actions. So instead of a single pre-goal state, note down action states for every action used.
For example, in Ms. PacMan there is toDot behaviour, fromGhost, fromPowerDot, etc. The key to finding the best times to use these actions (i.e. discovering the best rules) is to note down when a good policy uses them. Of course, there may be a problem in Ms. PacMan, which tends to use every action at every step; that will have to be dealt with.
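To make this concrete, here is a minimal Python sketch, assuming a trace is just a list of (action, state) pairs; the action names come from the example above and the state facts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sketch: collect, for each action a policy uses, the state
# observations that held when it chose that action. These per-action
# collections are the "action states" described above.

def record_action_states(trace):
    """trace: list of (action, state_observations) pairs from one episode."""
    action_states = defaultdict(list)
    for action, state in trace:
        action_states[action].append(state)
    return dict(action_states)

# Toy trace with made-up state facts:
trace = [
    ("toDot",        {"dotNear", "ghostFar"}),
    ("fromGhost",    {"ghostNear"}),
    ("fromPowerDot", {"powerDotNear", "edibleGhost"}),
]
print(record_action_states(trace))
```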
There is also the problem of continuous values. But for now, I need to outline the algorithm (a rough code sketch follows the list):
1. Run the optimal policy once and record the action-state trace. Store the values of the action-state trace as probabilistic action-states (one for each action).
2. Generate S samples (as given by the cross-entrobeam algorithm) and update rule probabilities.
3. Generate another sample (as given by the cross-entrobeam algorithm). If it is added to the elites, use its action-state trace to update the probabilistic action-states. Also update the rule probabilities.
4. Repeat. At each action-state update, attempt to generate new rules, using the probabilities contained within the pre-goal.
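Here is a rough, hedged sketch of that loop. Every helper in it (run_policy, propose_rules, the elite cutoff, the fact and action names) is a stand-in stub of my own; only the control flow follows steps 1-4, and the rule-probability update is left out.

```python
import random

ACTIONS = ["toDot", "fromGhost", "fromPowerDot"]
FACTS = ["dotNear", "ghostNear", "powerDotNear", "edibleGhost"]

def run_policy(policy=None):
    """Stub environment run: returns (score, action-state trace)."""
    trace = [(random.choice(ACTIONS), set(random.sample(FACTS, 2)))
             for _ in range(10)]
    return random.random(), trace

def merge_trace(prob_action_states, trace):
    """Fold a trace into the probabilistic action-states by counting
    how often each state fact co-occurs with each action."""
    for action, state in trace:
        counts = prob_action_states.setdefault(action, {})
        for fact in state:
            counts[fact] = counts.get(fact, 0) + 1

def propose_rules(prob_action_states):
    """Step 4 stub: turn the most frequent fact per action into a rule guess."""
    return [f"{max(counts, key=counts.get)} => {action}"
            for action, counts in prob_action_states.items() if counts]

def learn(S=20, iterations=100, elite_fraction=0.1):
    # Step 1: one optimal-policy run seeds the probabilistic action-states.
    prob_action_states = {}
    _, seed_trace = run_policy()
    merge_trace(prob_action_states, seed_trace)
    candidate_rules = propose_rules(prob_action_states)

    # Step 2: S initial samples (rule-probability update omitted here).
    scores = [run_policy()[0] for _ in range(S)]

    # Steps 3-4: keep sampling; elite samples also update the action-states,
    # and each action-state update re-proposes candidate rules.
    for _ in range(iterations):
        score, trace = run_policy()
        scores.append(score)
        elite_cutoff = sorted(scores, reverse=True)[int(len(scores) * elite_fraction)]
        if score >= elite_cutoff:
            merge_trace(prob_action_states, trace)
            candidate_rules = propose_rules(prob_action_states)
    return candidate_rules

print(learn())
```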
The problem with maintaining a probabilistic action-state is that the size of the action-state explodes, becoming as big as the state observations themselves.
After noting this down, I don't think it is such a good idea to pursue. Furthermore, there is no clear benefit to using it for the Hanoi problem.