Progress: Improved Results

I have been working on the agent since the meeting with Bernhard (7th May), but failed to post about it until now.

I tried using O(x^3) height rather than O(x^2) to simulate emergency play mode. I also turned off emergency play, as it only seems to get in the way.

The results weren’t very good with this new height mode, even after the parameters were tuned and tweaked.
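
To make the difference between the two height modes concrete, here is a minimal sketch. It assumes the height parameter is driven by the tallest column, which is a guess; the function name and the board representation (a list of column heights) are hypothetical and not taken from the agent.

```python
def height_penalty(column_heights, exponent=2):
    """Penalty that grows as (tallest column) ** exponent.

    Assumption: "height" means the tallest column on the board.
    exponent=2 stands in for the O(x^2) mode, exponent=3 for O(x^3).
    """
    tallest = max(column_heights, default=0)
    return tallest ** exponent


# Compare the two modes on a low stack and on a tall stack.
low_stack = [1, 4, 2, 3]
tall_stack = [10, 16, 12, 9]

quadratic = [height_penalty(b, 2) for b in (low_stack, tall_stack)]
cubic = [height_penalty(b, 3) for b in (low_stack, tall_stack)]
```

The cubic penalty is the quadratic one multiplied by the height itself, so it weighs tall stacks much more heavily than low ones. That is the sense in which it could stand in for a separate emergency mode.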

So I also tried O(x^3) height with a max well depth parameter instead of chasms. This works by finding the deepest 1-wide well in the playing field, thus giving a chasm a minimum value of 3. It differs from the chasm parameter in that it only counts one chasm (the deepest one), whereas the chasm parameter counted the number (but not the depth) of chasms. Perhaps having both parameters would give better results. Even with only the one, the results obtained were better than before:

Episode: 95 steps: 100000
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.46125244085931005, over 146977 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.4531672089316614, over 242027 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.45102040816326533, over 980 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 194915
Total reward: 249718.0
Episode: 96 steps: 14528
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.45953288806658954, over 149750 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.4531672089316614, over 242027 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.45102040816326533, over 980 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 196007
Total reward: 251102.0
Episode: 97 steps: 13463
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.45793100943523535, over 152282 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.454, over 1000 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.4531672089316614, over 242027 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 197031
Total reward: 252379.0
Episode: 98 steps: 56079
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.4583597411637819, over 161883 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.45311711251886133, over 242067 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.45, over 1020 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 201319
Total reward: 257858.0
Episode: 99 steps: 100000
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.463458363684197, over 180516 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.45308380902775947, over 242087 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.4519230769230769, over 1040 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 208969
Total reward: 267466.0
totalSteps is: 2744386

Note that some episodes maxed out at the 100000-move limit. This run averages about 2090 lines per episode (208969 lines over 100 episodes), and would average more if the game weren’t limited.
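
For reference, the max well depth parameter used in this run, and how it differs from the chasm count, can be sketched as below. This is only an illustration under assumptions: the board is reduced to a list of column heights, the side walls count as infinitely tall, and the "minimum value of 3" is read as a depth threshold below which a gap is not treated as a chasm. The function names are hypothetical.

```python
def well_depths(heights):
    """Depth of each column relative to its lower neighbour.

    Assumption: the walls are infinitely tall, so an edge column is
    measured against its single neighbour. Non-wells get depth <= 0.
    """
    if len(heights) < 2:
        return [0] * len(heights)
    depths = []
    for i, h in enumerate(heights):
        left = heights[i - 1] if i > 0 else float("inf")
        right = heights[i + 1] if i < len(heights) - 1 else float("inf")
        depths.append(min(left, right) - h)
    return depths


def max_well_depth(heights, min_depth=3):
    """Depth of the single deepest 1-wide well, or 0 if none qualifies."""
    qualifying = [d for d in well_depths(heights) if d >= min_depth]
    return max(qualifying, default=0)


def count_chasms(heights, min_depth=3):
    """Number of qualifying wells, ignoring how deep each one is."""
    return sum(1 for d in well_depths(heights) if d >= min_depth)


board = [5, 1, 5, 6, 2, 6, 6, 0, 7, 7]
deepest = max_well_depth(board)   # only the deepest well matters
chasms = count_chasms(board)      # every qualifying well counts once
```

On this board the two parameters disagree in the way described above: there are three chasms, but max well depth reports only the deepest one, which has depth 6.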

So I tried using O(x^2) height and maxWellDepth together and got these results:
Episode: 95 steps: 26234
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5282893660552973, over 250603 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 132196
Total reward: 178962.0
Episode: 96 steps: 1960
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5280204477722164, over 250981 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 132334
Total reward: 179162.0
Episode: 97 steps: 13947
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.527880141640585, over 253616 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 133383
Total reward: 180581.0
Episode: 98 steps: 1484
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5275962773124102, over 253902 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 133482
Total reward: 180730.0
Episode: 99 steps: 16534
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5277445183819767, over 256402 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5228478140898207, over 12557 steps.
Mults: {2981250, 4, 36, 19800000, 1}, worth 0.5176470588235295, over 340 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 134732
Total reward: 182450.0
totalSteps is: 1782982

Not as good, but not too bad either. I am running it again to see if it was simply a bad game. Also, the agent was initialised with bad parameters.

Results:
Episode: 95 steps: 20720
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.531827427907115, over 200577 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172158407551178, over 74870 steps.
Mults: {3900000, 12, 40, 19048046, 1}, worth 0.5084507042253521, over 1420 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 165272
Total reward: 221100.0
Episode: 96 steps: 100000
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5325592361258179, over 219503 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172245960386657, over 74890 steps.
Mults: {3000000, 5, 248, 19699262, 1}, worth 0.5113636363636364, over 440 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 172886
Total reward: 231390.0
Episode: 97 steps: 25557
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5326349387853656, over 224324 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172245960386657, over 74890 steps.
Mults: {3000000, 5, 248, 19699262, 1}, worth 0.5113636363636364, over 440 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 174816
Total reward: 234033.0
Episode: 98 steps: 9683
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5322204536768287, over 226170 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172245960386657, over 74890 steps.
Mults: {3000000, 5, 248, 19699262, 1}, worth 0.5113636363636364, over 440 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 175538
Total reward: 234987.0
Episode: 99 steps: 49490
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5320201145374196, over 235452 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5171799492368933, over 74910 steps.
Mults: {3900000, 12, 40, 19048046, 1}, worth 0.5084507042253521, over 1420 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 179287
Total reward: 240007.0
totalSteps is: 2367240

Somewhat better, but the O(x^3) run still looks better on total lines (208969 against 179287). I will have to run that one again to see.

One thing to note: the best parameters found for these domains contain very large values. Smaller values might work better, as they could suit a larger range of domains.
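
One reason smaller values may cost nothing: if the multipliers are combined linearly into a single score (an assumption, since the evaluation function is not spelled out here), then dividing them all by a common factor leaves the ranking of candidate moves unchanged. The sketch below illustrates this with made-up feature vectors; `evaluate`, `rescale`, and `best_move` are hypothetical names, and lower scores are assumed to be better.

```python
def evaluate(mults, features):
    """Weighted sum of feature values (assumed linear evaluation)."""
    return sum(m * f for m, f in zip(mults, features))


def rescale(mults):
    """Divide every multiplier by the largest, so all lie in (0, 1]."""
    largest = max(mults)
    return [m / largest for m in mults]


def best_move(mults, candidates):
    """Index of the candidate with the lowest score (assumed best)."""
    return min(range(len(candidates)),
               key=lambda i: evaluate(mults, candidates[i]))


mults = [1125000, 1, 10, 5500000, 1]   # top policy from the O(x^3) run
candidates = [                          # made-up feature vectors
    [2, 3, 1, 0, 4],
    [1, 5, 2, 1, 0],
    [3, 0, 0, 0, 9],
]

same_choice = best_move(mults, candidates) == best_move(rescale(mults), candidates)
```

Only the ratios between multipliers matter under this assumption, so the large absolute values are a product of how the search steps through parameter space rather than something the policy needs.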

Another idea: if the value of the top parameter state falls below a certain threshold, the eTemp should be raised so the agent can explore its way out. This would only happen if the agent was ill-suited to a particular domain, and it allows for a less linear approach to shifting problems. The idea came from thinking about the proving run: it is unclear whether the agent is re-initialised on each domain or whether the same agent runs over all domains. I shall ask on the forum.
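
A rough sketch of the eTemp idea follows. Nothing here is implemented in the agent yet: the threshold, the boost factor, the cap, and the function name are all placeholder assumptions, and eTemp is treated simply as a positive number where higher means more exploration.

```python
def adjust_e_temp(e_temp, top_worth, worth_floor=0.3, boost=2.0, max_temp=10.0):
    """Raise the exploration temperature when the best policy is doing badly.

    e_temp      -- current exploration temperature
    top_worth   -- "worth" of the current top regular policy
    worth_floor -- hypothetical threshold below which we explore more
    boost       -- hypothetical multiplier applied to e_temp
    max_temp    -- hypothetical cap so the temperature cannot run away
    """
    if top_worth < worth_floor:
        return min(e_temp * boost, max_temp)
    return e_temp


# A policy worth 0.46 (as in the runs above) leaves the temperature alone;
# a policy worth 0.2 on an ill-suited domain would double it.
unchanged = adjust_e_temp(1.0, 0.46)
raised = adjust_e_temp(1.0, 0.2)
```

The cap matters if the check runs every episode, since otherwise a long bad streak would keep doubling the temperature.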