The experiments have finished running and the results are here:
Fired Policy, 25 episodes, Hand-coded:
23382.932
35534.4
42656.73
43254.67
50027.86
55115.996
60153.664
60357.0
62273.797
65428.336
68832.66
69526.4
70930.92
73594.94
76653.14
81189.01
80561.41
83494.73
84098.19
87400.14
86716.28
86398.41
89340.27
88578.99
88555.38
Reintegration strategy, constant, 25 episodes, Hand-coded:
24976.2
38775.336
48561.266
52978.336
58631.195
61405.54
62353.67
63225.27
66774.22
67535.6
70723.27
73677.26
76765.94
78708.48
80170.336
81938.73
81456.93
81710.586
86589.94
85712.54
85214.2
84442.46
84234.54
85138.74
84843.2
Reintegration strategy, decaying, 25 episodes, Hand-coded:
24219.207
36947.4
44144.87
52567.34
57010.4
63731.74
64325.21
65092.066
68576.41
71655.07
70928.8
74391.4
72944.01
76654.62
75102.06
74518.46
75871.2
75655.48
79737.53
79338.67
82021.27
81515.53
82974.13
86753.47
87630.38
The results, when viewed on a graph, show that there isn’t a significant difference between the strategies. However, if the results are representative, regular hand-coding (the plain fired-policy run without reintegration) would seem to work best.
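For anyone who wants to re-plot the comparison, here is a minimal Python sketch that collects the three runs into lists and draws them on one set of axes. The numbers are copied from the listings above; matplotlib is assumed to be available, and none of this is the actual experiment code.

```python
import matplotlib.pyplot as plt

# Scores per episode, copied from the listings above.
fired = [
    23382.932, 35534.4, 42656.73, 43254.67, 50027.86,
    55115.996, 60153.664, 60357.0, 62273.797, 65428.336,
    68832.66, 69526.4, 70930.92, 73594.94, 76653.14,
    81189.01, 80561.41, 83494.73, 84098.19, 87400.14,
    86716.28, 86398.41, 89340.27, 88578.99, 88555.38,
]
constant = [
    24976.2, 38775.336, 48561.266, 52978.336, 58631.195,
    61405.54, 62353.67, 63225.27, 66774.22, 67535.6,
    70723.27, 73677.26, 76765.94, 78708.48, 80170.336,
    81938.73, 81456.93, 81710.586, 86589.94, 85712.54,
    85214.2, 84442.46, 84234.54, 85138.74, 84843.2,
]
decaying = [
    24219.207, 36947.4, 44144.87, 52567.34, 57010.4,
    63731.74, 64325.21, 65092.066, 68576.41, 71655.07,
    70928.8, 74391.4, 72944.01, 76654.62, 75102.06,
    74518.46, 75871.2, 75655.48, 79737.53, 79338.67,
    82021.27, 81515.53, 82974.13, 86753.47, 87630.38,
]

episodes = range(1, 26)
plt.plot(episodes, fired, label="Fired policy")
plt.plot(episodes, constant, label="Reintegration (constant)")
plt.plot(episodes, decaying, label="Reintegration (decaying)")
plt.xlabel("Episode")
plt.ylabel("Score")
plt.title("Hand-coded ruleset, 25 episodes")
plt.legend()
plt.show()
```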
One note of interest is that reintegrating the rules appears to provide a small boost in the early episodes (roughly 1-10).
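To put a rough number on that early boost, one could average the first ten episodes of each run. A small helper along these lines would do it; it reuses the fired, constant, and decaying lists from the plotting sketch above, which are only a transcription of the results and not part of the experiment code.

```python
def early_mean(scores, n=10):
    """Average of the first n episode scores."""
    return sum(scores[:n]) / n

# Assumes fired, constant, and decaying are the lists defined in the
# plotting sketch above.
for name, scores in [("fired", fired),
                     ("constant", constant),
                     ("decaying", decaying)]:
    print(f"{name}: mean of episodes 1-10 = {early_mean(scores):.1f}")
```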
I should also have run these experiments over random rulesets.
There is still one more strategy to try: reintegration with a delay.