I have finally run and completed the experiments for the hand-coded and random rule agents. At the end of the experiments, I output the average reward among the elite samples for each episode. Here are the results:
Hand-coded (34 episodes, as 16 were done separately)
100.00000% complete. Elapsed: 51:32:15, Remaining: 0:0:0
Average episode elite scores:
78745.87
83027.48
81085.53
83940.06
87843.28
83304.2
90278.41
87169.45
90495.6
89847.68
88119.0
91488.61
92209.88
89397.05
92143.13
89283.4
90389.52
92378.53
91848.4
91277.81
89553.67
90381.74
90205.13
91268.2
90408.6
90153.33
91356.15
91904.46
90901.41
92954.88
93016.99
91931.26
91215.41
93314.6
Random rule policy (Full 50 episodes)
100.00000% complete. Elapsed: 55:11:30, Remaining: 0:0:0
Average episode elite scores:
7742.2
11269.332
13557.135
15936.4
17129.734
19037.537
19051.463
20580.531
22265.332
27127.592
32455.928
38833.8
41077.996
42968.46
42417.07
43838.266
44584.13
45201.266
44522.59
46518.195
45115.203
45193.066
45517.465
46797.996
46707.605
45971.066
46797.465
47212.61
47904.266
47408.066
47441.926
47419.195
48914.734
47392.93
48966.8
49081.406
48081.344
49223.8
48085.594
49496.06
48561.336
48784.27
49795.066
48970.07
48311.33
48795.734
48623.27
49217.863
49241.27
49554.8
The hand-coded results show growth, but not as much growth as the random results, which start from a far lower baseline. Even so, the hand-coded rules finish with a clear advantage over the randomly-generated ones, and closing that gap is the issue I intend to remedy over the coming weeks.
I have created a graph of the results, but I don’t think it’ll be uploaded, thanks to Mac’s outstanding performance in taking screenshots. Nonetheless, when the two runs are aligned, they show the same sort of behaviour (after the 16th episode, anyway; I assume the earlier hand-coded growth looks much the same). Performance gradually levels out to a plateau, with the random rules converging to about 50000 and the hand-coded rules to about 92500.
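Since the graph itself didn’t survive, here is a minimal matplotlib sketch of how it can be reproduced from the numbers above. The score lists are truncated for brevity, and the hand-coded curve is shifted to start at episode 17 so the two runs line up; nothing else about the actual plotting code is implied.

import matplotlib.pyplot as plt

# The per-episode average elite scores printed above; only the first few values
# are repeated here, so paste in the full lists to reproduce the graph.
hand_coded = [78745.87, 83027.48, 81085.53, 83940.06]    # ... through 93314.6
random_rules = [7742.2, 11269.332, 13557.135, 15936.4]   # ... through 49554.8

# The hand-coded run covers episodes 17 to 50 (the first 16 were run
# separately), so it is shifted along the x-axis to align the two curves.
plt.plot(range(17, 17 + len(hand_coded)), hand_coded, label='Hand-coded rules')
plt.plot(range(1, 1 + len(random_rules)), random_rules, label='Random rules')
plt.xlabel('Episode')
plt.ylabel('Average elite score')
plt.legend()
plt.show()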
The absolute best score an agent could achieve is approximately 186890 (I don’t think it’s exact). However, 120000 of that comes from optimally eating ghosts. The agent is unlikely to be able to optimally eat every ghost with every powerdot on every level, so a more realistic estimate assumes it eats all of the ghosts with at least one of the powerdots on each level (3000 points of ghost-eating per level), which gives 96890. The other variable is fruit, which, if eaten on every level, is worth a total of 41250 points; this is already factored into those figures.
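To spell out that arithmetic as a quick sanity check (the level count and the number of powerdots per level are inferred from the totals above rather than stated anywhere):

# Rough score-ceiling check in Python.  3000 points per powerdot comes from
# eating all four ghosts on it (200 + 400 + 800 + 1600); the level count is
# inferred from the 120000 figure, not stated in the post.
levels = 120000 // (4 * 3000)                  # -> 10 levels
optimal_ghost_points = levels * 4 * 3000       # every ghost on every powerdot      = 120000
likely_ghost_points = levels * 1 * 3000        # all ghosts on one powerdot per level = 30000

absolute_best = 186890                         # approximate ceiling (the 41250 of fruit included)
likely_best = absolute_best - optimal_ghost_points + likely_ghost_points
print(likely_best)                             # 96890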
From these results, the agent clearly does a good job of eating both ghosts and fruit: the best hand-coded policy achieves an average score of 102290, above the 96890 estimate, while the best random policy achieves an average score of 62657.
These policies are:
Hand-coded
[1]: if CONSTANT>=0.0 then TO_DOT+
[1]: if NEAREST_ED_GHOST<99.0 then TO_ED_GHOST+
[1]: if NEAREST_FRUIT<99.0 then TO_FRUIT+
[1]: if NEAREST_ED_GHOST<99.0 then TO_ED_GHOST+
[1]: if NEAREST_GHOST<4.0 then FROM_GHOST+
[2]: if MAX_JUNCTION_SAFETY<3.0 then TO_SAFE_JUNCTION+
[2]: if NEAREST_POWER_DOT<2.0 and NEAREST_GHOST<5.0 then TO_POWER_DOT+
[3]: if NEAREST_ED_GHOST>=99.0 then FROM_POWER_DOT-
[3]: if CONSTANT>=0.0 then TO_SAFE_JUNCTION+
[3]: if CONSTANT>=0.0 then TO_SAFE_JUNCTION+
This policy looks quite well formed. The first priority is the one that changes most with the situation, and it is essentially read in reverse order: if a ghost is close, run from it; else if there are edible ghosts, eat them; else if there is a fruit, eat it; otherwise default to eating dots. The second priority deals mainly with tie-breaking when ghosts are near: it chooses the direction of maximum junction safety, or heads towards the powerdot if one is close. Finally, the last priority provides default tie-breaking by choosing the safest junction.
In short, the default behaviour of this policy is to eat edible ghosts and fruit when available and dots otherwise, always eating in the safest direction, and above all avoiding ghosts.
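To make the priority and override behaviour concrete, here is a minimal sketch of how I read the rule format. It is only an illustration of the description above, not the actual rule engine: the function names, the direction handling and the way modules express their preferred directions are all invented for the example.

DIRECTIONS = ('up', 'down', 'left', 'right')

def active_module(rules, state):
    # Apply the rules of one priority level in order: '+' switches a module on,
    # '-' switches it off, so later rules override earlier ones for the same
    # module.  The level's behaviour comes from the last-listed module that is
    # still on, which is why the priority-1 rules read naturally in reverse.
    switches, order = {}, []
    for condition, module, sign in rules:
        if condition(state):
            switches[module] = (sign == '+')
            if module not in order:
                order.append(module)
    for module in reversed(order):
        if switches[module]:
            return module
    return None

def choose_direction(priorities, prefers, state):
    # Priority 1 picks the main behaviour; lower priorities are only consulted
    # to break ties among equally preferred directions.
    candidates = set(DIRECTIONS)
    for rules in priorities:
        module = active_module(rules, state)
        if module is None:
            continue
        narrowed = candidates & prefers[module](state)
        if narrowed:
            candidates = narrowed
    return sorted(candidates)[0]

# Tiny example: a ghost is 3 tiles away, so FROM_GHOST overrides TO_DOT.
state = {'NEAREST_GHOST': 3.0}
priorities = [[(lambda s: True,                     'TO_DOT',     '+'),
               (lambda s: s['NEAREST_GHOST'] < 4.0, 'FROM_GHOST', '+')]]
prefers = {'TO_DOT':     lambda s: {'left', 'up'},
           'FROM_GHOST': lambda s: {'up'}}
print(choose_direction(priorities, prefers, state))    # -> 'up'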
Random rule policy
[1]: if FROM_GHOST_CENTRE+ then TO_DOT+
[1]: if CONSTANT>=1.0 and GHOST_CENTRE_DIST<15.793810320642846 then FROM_POWER_DOT-
[1]: if DOT_CENTRE_DIST>=17.11724276862369 and TO_CENTRE_OF_DOTS- then FROM_GHOST_CENTRE-
[1]: if NEAREST_GHOST>=15.0 and TO_ED_GHOST+ then TO_DOT+
[1]: if NEAREST_GHOST>=21.0 and MAX_JUNCTION_SAFETY<4.0 then TO_FRUIT-
[1]: if NEAREST_ED_GHOST>=10.0 then TO_DOT+
[1]: if NEAREST_FRUIT>=8.0 then TO_DOT+
[1]: if NEAREST_POWER_DOT>=9.0 and FROM_POWER_DOT+ then TO_DOT+
[2]: if FROM_GHOST+ then TO_DOT+
[2]: if FROM_POWER_DOT- then FROM_GHOST-
[2]: if KEEP_DIRECTION+ then TO_SAFE_JUNCTION-
[2]: if NEAREST_GHOST<6.0 then TO_DOT-
[2]: if TO_DOT+ then KEEP_DIRECTION+
[2]: if NEAREST_ED_GHOST>=10.0 and TO_CENTRE_OF_DOTS- then TO_POWER_DOT-
[2]: if FROM_GHOST_CENTRE+ then FROM_GHOST+
[2]: if DOT_CENTRE_DIST<17.11724276862369 and TO_DOT+ then FROM_GHOST_CENTRE-
[2]: if TO_SAFE_JUNCTION+ then TO_FRUIT-
[3]: if NEAREST_ED_GHOST>=3.0 and NEAREST_DOT>=1.0 then FROM_GHOST+
[3]: if MAX_JUNCTION_SAFETY<13.0 and NEAREST_DOT>=12.0 then FROM_GHOST+
[3]: if NEAREST_ED_GHOST>=10.0 then TO_ED_GHOST-
[3]: if NEAREST_FRUIT>=2.0 then TO_CENTRE_OF_DOTS-
[3]: if CONSTANT<1.0 then TO_DOT-
[3]: if MAX_JUNCTION_SAFETY<4.0 and GHOST_DENSITY>=0.0 then KEEP_DIRECTION+
[3]: if TO_DOT- then TO_FRUIT+
[3]: if TO_FRUIT+ then TO_SAFE_JUNCTION-
[3]: if FROM_POWER_DOT+ then FROM_GHOST+
[3]: if NEAREST_FRUIT>=18.0 and FROM_POWER_DOT- then KEEP_DIRECTION-
This policy is much bigger, unfortunately, and probably contains a lot of redundant rules. However, when watching the agent, it appears to rely mainly on TO_DOT at priority 1 and FROM_GHOST at priority 2: TO_DOT is deactivated when a ghost draws near (presumably via the NEAREST_GHOST<6.0 rule), letting the priority 2 behaviour take over.