Ms. PacMan finally seems to be working, not in terms of learning (as far as I know) but in terms of not crashing. While looking at the code, though, I noticed a possible issue with the toGhost action. By itself it seems counter-intuitive: moving towards a ghost is only desirable when that ghost is edible.
However, due to the strict way the pre-goal state is created, toGhost is likely never to be used meaningfully. While an optimal (or even a lucky) agent's pre-goal may contain an edible ghost, it also may not; the agent could conceivably reach the goal without using toGhost at all, in which case its pre-goal would not include edible ghosts. Strictly speaking toGhost is 'used', because the agent essentially samples every policy at once, but it is used to no consequence.
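As a rough sketch of why the strict approach loses edible (predicate names here are hypothetical, not the actual state spec): if pre-goal unification amounts to intersecting the predicate sets of observed pre-goal states, then any predicate absent from even one episode's pre-goal state is dropped for good.

```python
# Hypothetical sketch: a strict pre-goal as the intersection of the
# predicate sets seen in each episode's final (pre-goal) state.
def unify_pregoal(pregoal_states):
    """Keep only predicates present in every observed pre-goal state."""
    states = iter(pregoal_states)
    pregoal = set(next(states))
    for state in states:
        pregoal &= state  # a predicate missing once is dropped permanently
    return pregoal

# One episode ended with an edible ghost nearby, another did not:
ep1 = {"nearPellet(p)", "edible(g)", "nearGhost(g)"}
ep2 = {"nearPellet(p)", "nearGhost(g)"}
print(unify_pregoal([ep1, ep2]))  # edible(g) is gone
```

One goal-reaching episode without an edible ghost is enough to erase edible(g) from the unified pre-goal, and with it any pressure to specialise toGhost.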
There are a couple of ways to fix this. One is to modify the action requirements for toGhost so that they must include edible. However, this violates the terms of the action, because it's not technically legal; this is the old 'how much can I help?' problem that I keep coming across.
Another is to only generate pre-goals for actions EXAMINED in the final step: while the agent returns a full list of all actions, a pre-goal is only formed using the top few actions that actually mattered.
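Something like the following sketch (function and action names hypothetical): filter the agent's full ranked action list down to the actions that were actually examined in the final step, and only those contribute to pre-goal formation.

```python
# Hypothetical sketch: restrict pre-goal formation to the top few actions
# that were actually examined (fired) in the final step, rather than the
# agent's full returned action list.
def pregoal_actions(ranked_actions, examined, top_n=3):
    """Return the top-ranked actions that actually mattered."""
    return [a for a in ranked_actions if a in examined][:top_n]

ranked = ["toPellet", "toGhost", "fromGhost", "toPowerDot"]
examined = {"toPellet", "toPowerDot"}
print(pregoal_actions(ranked, examined))  # ['toPellet', 'toPowerDot']
```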
The above should be implemented anyway, but it still may not help. The toGhost action will never progress beyond its LGG form (barring a miraculous coincidence) unless the optimal policy can create a pre-goal for it. This leads me to believe that the optimal policy framework should perhaps be tweaked to ensure a pre-goal is present for every possible goal-fulfilling action, for example by providing a state description of the pre-goal. But then we fall back into the problem of basically designing the agent.
Perhaps for Ms. PacMan the agent could be given several optimal traces, which should hopefully help it out. This will require explicit state spec notes on which actions can achieve the goal. This form of learning is more of a 'follow the teacher' method, which should still be acceptable as a form of learning.
Anyway, I haven’t really remarked on the ‘S’ title.
S (Statistical) was the original aim of this research, and may be required for proper learning. As in the toGhost case, the edible predicate may not always be present in a pre-goal state. This could be problematic, so instead of a hard unification step I could soften it and attach probabilistic values to predicates. Each predicate could start optimistically (100% seen) and gradually lower its probability with each pre-goal unification in which it goes unseen. Any predicate dropping below, say, 50% would be removed fully.
However, in the opposite case, where a predicate is at first absent and later frequently seen, its probability will need to rise back towards ~100%. Perhaps all predicates should start at 50% and move either way from there.
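The soft scheme above could be sketched roughly as follows (class name, step size, and predicate names are all hypothetical; only the start-at-50%, move-either-way, prune-below-threshold behaviour comes from the notes):

```python
# Hypothetical sketch of soft pre-goal unification: every predicate starts
# at 0.5 and moves towards 1.0 when seen in a pre-goal state, towards 0.0
# when absent. Predicates falling below the threshold are removed fully.
class SoftPregoal:
    def __init__(self, step=0.1, threshold=0.5):
        self.step = step            # probability shift per unification
        self.threshold = threshold  # predicates below this are removed
        self.probs = {}             # predicate -> estimated presence probability

    def unify(self, state):
        # predicates present in this pre-goal state gain probability;
        # previously unseen ones start at 0.5 and go either way from there
        for pred in state:
            self.probs[pred] = min(1.0, self.probs.get(pred, 0.5) + self.step)
        # predicates absent from this pre-goal state lose probability
        for pred in set(self.probs) - set(state):
            self.probs[pred] -= self.step
        # anything dropping below the threshold is removed fully
        self.probs = {p: v for p, v in self.probs.items() if v >= self.threshold}

    def pregoal(self):
        return set(self.probs)

sp = SoftPregoal()
sp.unify({"nearPellet(p)", "edible(g)"})  # both rise to 0.6
sp.unify({"nearPellet(p)"})               # edible(g) falls to 0.5, survives
sp.unify({"nearPellet(p)"})               # edible(g) falls to 0.4, removed
print(sp.pregoal())                       # {'nearPellet(p)'}
```

Unlike hard intersection, edible(g) survives an occasional absence here; whether one fixed step size is right, or frequencies should be estimated from counts instead, is an open question.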