Some time ago, I started some Ms. PacMan experiments on the regular domain and the two reduced domains. I think it was before I left for Aus, so just after slot optimisation was implemented. I have some results from the experiments, although all of them were cut short by a ‘write’ failure.
PacMan Regular:
Performance from 80% of a single experiment (not converged):
3492.5334
3953.9
3786.3
6411.2334
6768.1333
6809.6333
6479.3
6784.1665
6806.8335
6867.7
6817.1665
6985.7
6876.7666
6747.567
6435.2
6797.3335
6694.3667
6904.6665
6471.033
6703.1333
6915.6665
6707.467
6528.967
6613.5
6421.3667
6603.3667
6636.9
5877.6665
6166.3335
6822.3335
6502.1333
6862.3667
6938.1333
7054.433
6807.6333
6974.9
6174.967
6185.967
5865.0
6112.933
As can be seen, the agent does perform some learning, though I am unsure which rule was responsible. It doesn’t seem to progress beyond the 6800 mark, however. Perhaps it would be a good idea to load an agent up with the generator file and observe its behaviour.
A typical (readable) policy:
(distanceDot player ?X ?Y) (pacman player) => (toDot ?X ?Y)
(distanceFruit player ?X ?Y) (pacman player) => (toFruit ?X ?Y)
(distancePowerDot player ?X ?Y) (pacman player) => (fromPowerDot ?X ?Y)
(distanceGhost player ?X ?Y) (pacman player) => (fromGhost ?X ?Y) / (distanceGhost player ?X ?__Num2&:(betweenRange ?__Num2 0.0 10.25)) (pacman player) => (fromGhost ?X ?__Num2)
(distanceGhost player ?X ?Y) (pacman player) => (toGhost ?X ?Y) / (distanceGhost player ?X ?__Num3&:(betweenRange ?__Num3 22.0 33.0)) (pacman player) => (toGhost ?X ?__Num3)
(distancePowerDot player ?X ?Y) (pacman player) => (toPowerDot ?X ?Y)
None of these slot values have converged to 0; toDot is closest at about 0.15, followed by toFruit at 0.33. The fromPowerDot rule is curious, and implies that the agent doesn’t waste the powerdots; they would typically be eaten as part of toDot anyway. I would have hoped fromGhost would be higher (it sits at 0.59), but then again the agent does seek to maximise reward, and running from ghosts only has implicit benefits. The toGhost behaviour lacks rules for eating edible ghosts, but that is likely due to the inflexible pre-goal unification.
Judging by this policy, the agent concerns itself less with ghost avoidance and more with amassing reward.
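For reference, here is a rough sketch of how I read one of these condition-action rules matching against the current state, including the betweenRange-style numeric test used by the specialised fromGhost/toGhost rules. The fact encoding and function names below are invented for illustration; the actual system uses a rule engine rather than anything like this code.

# Illustrative sketch only: hand-rolled matching of the specialised rule
#   (distanceGhost player ?X ?__Num2&:(betweenRange ?__Num2 0.0 10.25))
#   (pacman player) => (fromGhost ?X ?__Num2)
# against a set of observed facts. Fact tuples and helper names are invented.

def between_range(value, low, high):
    """Stand-in for the betweenRange test seen in the specialised rules."""
    return low <= value <= high

# Observed state: (predicate, arg1, arg2, ...) tuples, numeric distance last.
state = [
    ("pacman", "player"),
    ("distanceDot", "player", "dot_3", 4.0),
    ("distanceGhost", "player", "blinky", 8.5),
    ("distanceGhost", "player", "pinky", 27.0),
    ("distancePowerDot", "player", "powerDot_1", 15.0),
]

def match_from_ghost(state, low=0.0, high=10.25):
    """Return the grounded (fromGhost ghost distance) actions the rule proposes."""
    if ("pacman", "player") not in state:
        return []
    actions = []
    for fact in state:
        if fact[0] == "distanceGhost" and fact[1] == "player":
            ghost, distance = fact[2], fact[3]
            if between_range(distance, low, high):
                actions.append(("fromGhost", ghost, distance))
    return actions

print(match_from_ghost(state))  # [('fromGhost', 'blinky', 8.5)]

The unrefined form of the rule is the same thing without the range check, so it proposes an action for every ghost regardless of distance.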
PacMan NoPowerDots:
A performance file chosen from 3.2 files:
3267.2334
3305.0334
3430.8333
3307.5334
3266.4333
3207.7334
3243.9666
3161.0334
3186.0334
3414.5667
3174.4666
3153.6333
3141.9666
3187.0334
3197.0
3105.9333
3190.8333
3127.2334
The agent appears to have converged in this environment, although to a somewhat lacking policy. But perhaps that is the best it can do. The other two completed performance files show similar values, but at different convergence points.
A typical policy:
(distanceDot player ?X ?Y) (pacman player) => (toDot ?X ?Y)
(aggressive ?X) (nonblinking ?X) (distanceGhost player ?X ?Y) (pacman player) => (fromGhost ?X ?Y)
(distanceFruit player ?X ?Y) (pacman player) => (toFruit ?X ?Y)
(aggressive ?X) (nonblinking ?X) (distanceGhost player ?X ?Y) (pacman player) => (toGhost ?X ?Y)
There is a clear first rule here, judging by the slots, with toDot nearly at 0 for slot selection. fromGhost is next at 0.33, and the others sit at higher values. Because this environment is much more dangerous, the agent places higher emphasis on avoiding ghosts than on eating fruit. However, the agent still prefers eating dots to avoiding ghosts, but perhaps that is only because fromGhost is largely deterministic behaviour and would only result in the agent scrambling about without gaining reward. The last two rules are more or less interchangeable, which is odd, as toGhost behaviour is pointless in this domain.
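To make the ‘clear first rule’ comment concrete, the way I picture the slot values is roughly this: a value near 0 means the rule lands in its position almost deterministically, while larger values leave the ordering noisier. The sampling scheme below is only an illustration of that picture (and the toFruit/toGhost numbers are made up), not the actual generator code.

# Illustrative sketch only: lower slot values behave like more certain (earlier)
# placements in the policy ordering; higher values let the order vary more.
# toDot and fromGhost use the values quoted above; the other two are invented.
import random

slot_values = {
    "toDot": 0.05,     # "nearly at 0": effectively always first
    "fromGhost": 0.33,
    "toFruit": 0.6,    # invented for illustration
    "toGhost": 0.65,   # invented for illustration
}

def sample_policy_order(slot_values, rng=random):
    """Order rules by slot value plus noise scaled by that value (assumed model)."""
    noisy = {rule: value + rng.uniform(0.0, value) for rule, value in slot_values.items()}
    return sorted(noisy, key=noisy.get)

random.seed(0)
for _ in range(3):
    print(sample_policy_order(slot_values))
# toDot reliably comes out first; the later rules can swap positions between samples.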
PacMan NoDots:
A performance file chosen from 4.7 files:
2631.7666
3001.9666
2229.3667
2331.6667
2032.6666
5316.7666
5606.4
10341.866
9931.0
9868.233
10084.434
10025.634
10329.233
9936.7
10132.267
10204.5
10048.267
10082.7
10190.033
10205.4
9945.6
10238.866
This domain seems quite easy to finish; I’d attribute this to the fruit. I may have to run some experiments with the fruit removed. As can be seen, the agent takes a little while to get going, but eventually figures out the best rules to follow. The other performance files are similar to this one, though some actually get worse before getting better.
A typical policy:
(distanceFruit player ?X ?Y) (pacman player) => (toFruit ?X ?Y)
(distancePowerDot player ?X ?Y) (pacman player) => (toPowerDot ?X ?Y)
(distanceGhost player ?X ?Y) (pacman player) => (fromGhost ?X ?Y)
(distancePowerDot player ?X ?Y) (pacman player) => (fromPowerDot ?X ?Y)
(distanceGhost player ?X ?Y) (pacman player) => (toGhost ?X ?Y)
As stated above, the fruit is what the agent goes for here. The strategy is something like: if the fruit is available, go get it; otherwise, disrupt the ghosts with powerdots; and if a ghost is close, run from it. Some of the policy files actually had smarter rules for the ghost behaviour, such as ‘if the ghost is aggressive, run from it’ or ‘if the ghost is edible, go to it’.
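Written out as the decision list the rules above imply, the strategy looks something like the sketch below. The observation format and helper names are made up for illustration; only the priority order comes from the typical policy itself.

# Illustrative sketch only: the NoDots strategy above as a fixed priority list.
# The observation dictionary and field names are invented; the rule order is
# the one from the typical policy (fruit, then powerdots, then ghost avoidance).

def choose_action(obs):
    """obs maps feature names to the nearest matching object, or omits them."""
    if obs.get("fruit") is not None:          # if the fruit is available, go get it
        return ("toFruit", obs["fruit"])
    if obs.get("power_dot") is not None:      # otherwise disrupt the ghosts with powerdots
        return ("toPowerDot", obs["power_dot"])
    if obs.get("ghost") is not None:          # if a ghost is about, run from it
        return ("fromGhost", obs["ghost"])
    return None                               # nothing applicable: no action proposed

print(choose_action({"ghost": "blinky"}))                     # ('fromGhost', 'blinky')
print(choose_action({"fruit": "cherry", "ghost": "blinky"}))  # ('toFruit', 'cherry')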
So that’s that. Clearly the agent is learning (even before cross-entropy beam search), so that’s a good sign. But alas, the regular approach still isn’t as good as it could be. I should visually inspect the agent’s behaviour by loading the generator files to see what’s going wrong in each domain.