I have been working on the agent between the meeting with Bernhard (7th May) and now; I just failed to post about it.

I tried using O(x^3) height rather than O(x^2) to simulate emergency play mode. I also turned off the actual emergency play mode, as it only seems to get in the way.

The results weren't very good using this new height mode, even after the parameters were tuned and tweaked.
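To be explicit about what I mean by the height modes, here is a sketch (the exact feature in the agent may differ slightly) of a height penalty that sums each column height raised to a power:

```python
def height_penalty(heights, power):
    # power=2 is the O(x^2) mode, power=3 the O(x^3) mode;
    # the cubic mode punishes tall stacks much more steeply.
    return sum(h ** power for h in heights)
```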

So I also tried O(x^3) height with a max-well-depth parameter in place of the chasm parameter. It works by finding the deepest 1-wide well in the playing field, so a chasm has a minimum depth of 3. It differs from the chasm parameter in that it only counts **one** well (the deepest one), whereas the chasm parameter counted the number (but not the depth) of chasms. Perhaps having both parameters would give better results. In any case, the results obtained were even better than before:

```
Episode: 95 steps: 100000
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.46125244085931005, over 146977 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.4531672089316614, over 242027 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.45102040816326533, over 980 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 194915
Total reward: 249718.0

Episode: 96 steps: 14528
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.45953288806658954, over 149750 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.4531672089316614, over 242027 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.45102040816326533, over 980 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 196007
Total reward: 251102.0

Episode: 97 steps: 13463
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.45793100943523535, over 152282 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.454, over 1000 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.4531672089316614, over 242027 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 197031
Total reward: 252379.0

Episode: 98 steps: 56079
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.4583597411637819, over 161883 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.45311711251886133, over 242067 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.45, over 1020 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 201319
Total reward: 257858.0

Episode: 99 steps: 100000
Top three regular policies:
Mults: {1125000, 1, 10, 5500000, 1}, worth 0.463458363684197, over 180516 steps.
Mults: {1125000, 1, 7, 5500000, 1}, worth 0.45308380902775947, over 242087 steps.
Mults: {1125000, 1, 8, 5500000, 1}, worth 0.4519230769230769, over 1040 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 208969
Total reward: 267466.0

totalSteps is: 2744386
```

Note that some episodes maxed out at the 100000-move limit. This run gives an average of 2089 lines per episode (208969 lines over 100 episodes), and it would be higher if the games weren't capped.
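For reference, a minimal sketch of how the max-well-depth feature described earlier could be computed from a board's column heights (the names here are my own, not the agent's actual code; board walls are treated as infinitely high neighbours):

```python
def max_well_depth(heights):
    """Depth of the deepest 1-wide well: a column strictly lower
    than both of its neighbours (walls count as infinitely high)."""
    deepest = 0
    n = len(heights)
    for i, h in enumerate(heights):
        left = heights[i - 1] if i > 0 else float("inf")
        right = heights[i + 1] if i < n - 1 else float("inf")
        depth = min(left, right) - h
        if depth > deepest:
            deepest = depth
    return deepest
```

Unlike the chasm count, this returns a single depth, so two shallow wells score lower than one deep well.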

So I tried using O(x^2) height and maxWellDepth together and obtained these results:

```
Episode: 95 steps: 26234
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5282893660552973, over 250603 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 132196
Total reward: 178962.0

Episode: 96 steps: 1960
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5280204477722164, over 250981 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 132334
Total reward: 179162.0

Episode: 97 steps: 13947
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.527880141640585, over 253616 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 133383
Total reward: 180581.0

Episode: 98 steps: 1484
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5275962773124102, over 253902 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5257557630634027, over 12537 steps.
Mults: {4837500, 4, 62, 34100000, 1}, worth 0.5235294117647059, over 680 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 133482
Total reward: 180730.0

Episode: 99 steps: 16534
Top three regular policies:
Mults: {4837500, 8, 62, 34100000, 1}, worth 0.5277445183819767, over 256402 steps.
Mults: {4837500, 8, 31, 34100000, 1}, worth 0.5228478140898207, over 12557 steps.
Mults: {2981250, 4, 36, 19800000, 1}, worth 0.5176470588235295, over 340 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 134732
Total reward: 182450.0

totalSteps is: 1782982
```

Not as good, but not too bad either. I'm running it again to see whether it was simply a bad game; the agent was also initialised with bad parameters this time.

Results:

```
Episode: 95 steps: 20720
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.531827427907115, over 200577 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172158407551178, over 74870 steps.
Mults: {3900000, 12, 40, 19048046, 1}, worth 0.5084507042253521, over 1420 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 165272
Total reward: 221100.0

Episode: 96 steps: 100000
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5325592361258179, over 219503 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172245960386657, over 74890 steps.
Mults: {3000000, 5, 248, 19699262, 1}, worth 0.5113636363636364, over 440 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 172886
Total reward: 231390.0

Episode: 97 steps: 25557
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5326349387853656, over 224324 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172245960386657, over 74890 steps.
Mults: {3000000, 5, 248, 19699262, 1}, worth 0.5113636363636364, over 440 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 174816
Total reward: 234033.0

Episode: 98 steps: 9683
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5322204536768287, over 226170 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5172245960386657, over 74890 steps.
Mults: {3000000, 5, 248, 19699262, 1}, worth 0.5113636363636364, over 440 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 175538
Total reward: 234987.0

Episode: 99 steps: 49490
Top three regular policies:
Mults: {3000000, 5, 124, 19699262, 1}, worth 0.5320201145374196, over 235452 steps.
Mults: {3000000, 5, 62, 19699262, 1}, worth 0.5171799492368933, over 74910 steps.
Mults: {3900000, 12, 40, 19048046, 1}, worth 0.5084507042253521, over 1420 steps.
Top three emergency policies:
Mults: {1, 1, 32, 1, 256}, worth 0.0, over 0 steps.
Total lines: 179287
Total reward: 240007.0

totalSteps is: 2367240
```

Somewhat better, but the earlier O(x^3) run still looks better. I'll have to run that one again to see.

Something to note: the optimal parameters for these domains have very large values in them, and the agent might do better with smaller values, so that they suit a larger range of domains.
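If policy selection depends only on the ratios between the multipliers (something I should verify), each parameter vector could be rescaled to small values without changing its behaviour, e.g.:

```python
def normalise_mults(mults):
    # Dividing by the largest entry gives an equivalent vector
    # with values in (0, 1], assuming only the ratios matter.
    m = max(mults)
    return [x / m for x in mults]
```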

Another idea: if the value of the top parameter state falls below a certain value, the eTemp should be raised so the agent can explore its way out. This would only happen if the agent were ill-suited to a particular domain, and it allows for a less linear approach to shifting problems. This idea came as a result of thinking about the proving run. It is unclear whether the agent is re-initialised on each domain or whether the same agent runs over all domains; I shall ask on the forum.
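A rough sketch of the idea, assuming softmax-style policy selection (all names and threshold values here are hypothetical, not the agent's actual code):

```python
import math
import random

def select_policy(policies, worths, e_temp):
    """Softmax selection over policy worths; a higher e_temp
    flattens the distribution, i.e. more exploration."""
    weights = [math.exp(w / e_temp) for w in worths]
    r = random.uniform(0, sum(weights))
    for policy, weight in zip(policies, weights):
        r -= weight
        if r <= 0:
            return policy
    return policies[-1]

def adjust_e_temp(top_worth, e_temp, threshold=0.3, boost=2.0):
    """Raise the temperature when the best policy's worth drops
    below a threshold, so an ill-suited agent can explore its
    way out rather than staying stuck on a bad optimum."""
    return e_temp * boost if top_worth < threshold else e_temp
```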