Progress: Testing for Optimal Trial Period

Decided to go into the lab today to tinker with the agent. I noticed last time that emergency play appears to be either broken or not doing what I expect it to do, so I’ve turned it off for now and am instead optimising the length of the trials the agent runs itself on.

Using the 5 machines in the lab, I am trialling 5 different trial lengths, each over 100 episodes with a cooling rate of 0.99. The lengths are 20, 50, 100, 200 and 500 pieces. At a preliminary glance, the longer trials appear to be doing worse, as their episodes are ending sooner. This makes sense: if the agent happens to mutate a bad parameter set, it’ll be stuck with it for longer.
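As a rough illustration of what a cooling rate of 0.99 means over 100 episodes, here is a minimal sketch. It assumes the rate is applied as a multiplicative decay once per episode and that the temperature starts at 1.0; both are assumptions for illustration, and the agent’s actual schedule may differ.

```python
# Assumption: the temperature is multiplied by the cooling rate once per episode.
COOLING_RATE = 0.99
EPISODES = 100

temperature = 1.0  # hypothetical starting value, not taken from the agent
for _ in range(EPISODES):
    temperature *= COOLING_RATE

# Fraction of the starting temperature left after all episodes (about 0.366).
print(f"Temperature after {EPISODES} episodes: {temperature:.3f}")
```

Under those assumptions, roughly a third of the initial exploration is still left by the final episode.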

Here are some summary results for the runs:
20 piece trials
Average steps: 17035.51
Average lines: 1285.05
Average reward: 1735.36
Final policy top 3:
Mults: {3, 1, 12, 24, 1}, worth 0.5370409028638529, over 153290 steps.
Mults: {3, 1, 12, 25, 2}, worth 0.5368683651804671, over 94200 steps.
Mults: {3, 1, 12, 24, 2}, worth 0.5361183475995535, over 53740 steps.

As a side note, I noticed an incredible run of 87378 steps, which completed 6647 lines for a reward of 9022.

50 piece trials
Average steps: 9359.25
Average lines: 697.78
Average reward: 953.62
Final policy top 3:
Mults: {2, 1, 4, 36, 2}, worth 0.5422344689378757, over 19960 steps.
Mults: {1, 1, 4, 36, 2}, worth 0.541953488372093, over 10750 steps.
Mults: {1, 1, 1, 9, 1}, worth 0.5419444444444445, over 10800 steps.

100 piece trials
Average steps: 6759.58
Average lines: 498.90
Average reward: 647.33
Final policy top 3:
Mults: {1, 3, 1, 18, 6}, worth 0.5321988655321989, over 2997 steps.
Mults: {1, 16, 2, 12, 6}, worth 0.5114893617021277, over 42300 steps.
Mults: {1, 16, 1, 15, 6}, worth 0.51, over 100 steps.

200 piece trials
Average steps: 4741.16
Average lines: 345.33
Average reward: 469.90
Final policy top 3:
Mults: {1, 9, 8, 78, 1}, worth 0.56125, over 4800 steps.
Mults: {1, 10, 8, 78, 3}, worth 0.5356818181818181, over 4400 steps.
Mults: {1, 11, 8, 78, 5}, worth 0.5352631578947369, over 3800 steps.

500 piece trials
Average steps: 5182.45
Average lines: 380.23
Average reward: 511.10
Final policy top 3:
Mults: {2, 6, 6, 28, 1}, worth 0.5514579759862779, over 3498 steps.
Mults: {1, 4, 4, 21, 1}, worth 0.5504615384615384, over 6500 steps.
Mults: {1, 3, 3, 14, 1}, worth 0.53975, over 4000 steps.
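To compare the five settings side by side, here is a small sketch that ranks them by average steps, using the averages reported above. The trials-per-episode figure assumes that one step is one piece placed, which is my reading of the numbers rather than something the agent reports.

```python
# Average steps per episode for each trial length, copied from the results above.
avg_steps = {20: 17035.51, 50: 9359.25, 100: 6759.58, 200: 4741.16, 500: 5182.45}

# Longest-surviving setting first.
ranked = sorted(avg_steps, key=avg_steps.get, reverse=True)

for length in ranked:
    # Assumption: one step == one piece, so this estimates how many
    # parameter sets get trialled in an average episode.
    trials_per_episode = avg_steps[length] / length
    print(f"{length:>3} pieces: {avg_steps[length]:>9.2f} avg steps, "
          f"~{trials_per_episode:.0f} trials per episode")
```

Under that assumption, a 500 piece trial only gets through about 10 parameter sets in an average episode, against several hundred for a 20 piece trial.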

As can be seen, short trials of 20 pieces prove to be a better choice than longer ones. The policies generally emphasise not having holes, while the rest of the values are roughly equal. Perhaps this is because the default parameter set is {1,2,2,8,2}, but I suspect it is simply advantageous to avoid holes. I shall now trial more values based around 20: 5, 10, 15, 25 and 30 (no need to do 20 again).

5 piece trials
Average steps: 7627.97
Average lines: 566.58
Average reward: 769.91
Final policy top 3:
Mults: {1, 1, 9, 13, 1}, worth 0.541863428528596, over 13341 steps.
Mults: {1, 1, 9, 12, 1}, worth 0.5414783764441263, over 48905 steps.
Mults: {1, 1, 9, 14, 1}, worth 0.54, over 4100 steps.

10 piece trials
Average steps: 7631.96
Average lines: 566.89
Average reward: 782.26
Final policy top 3:
Mults: {9, 9, 1, 90, 9}, worth 0.5504954179652974, over 21497 steps.
Mults: {9, 9, 1, 180, 9}, worth 0.5494186046511628, over 1720 steps.
Mults: {9, 18, 1, 90, 9}, worth 0.5458333333333333, over 240 steps.

15 piece trials
Average steps: 4149.02
Average lines: 301.25
Average reward: 395.55
Final policy top 3:
Mults: {1, 24, 16, 18, 8}, worth 0.5094253917783329, over 30821 steps.
Mults: {1, 16, 12, 17, 8}, worth 0.5081148564294632, over 4005 steps.
Mults: {1, 24, 8, 18, 8}, worth 0.5080841638981174, over 9030 steps.

25 piece trials
Average steps: 7998.17
Average lines: 596.04
Average reward: 809.51
Final policy top 3:
Mults: {1, 2, 2, 13, 1}, worth 0.5385885374354674, over 76318 steps.
Mults: {1, 2, 2, 13, 2}, worth 0.5342857142857143, over 350 steps.
Mults: {1, 3, 3, 20, 1}, worth 0.5342608695652173, over 2875 steps.

30 piece trials
Average steps: 6991.40
Average lines: 517.67
Average reward: 710.36
Final policy top 3:
Mults: {2, 2, 1, 64, 2}, worth 0.5402094144379279, over 32567 steps.
Mults: {2, 2, 1, 32, 2}, worth 0.5400738688827331, over 32490 steps.
Mults: {2, 1, 1, 64, 2}, worth 0.5400625978090767, over 12780 steps.

The results are interesting. Most of them are roughly equal in ability (15 being noticeably worse), yet none come close to the initial 20 piece run. I’m gonna put that run down to luck and have started another.

20 piece trials (Round 2)
Average steps: 5395.61
Average lines: 395.64
Average reward: 539.95
Final policy top 3:
Mults: {1, 17, 8, 110, 7}, worth 0.5360375239479421, over 15137 steps.
Mults: {1, 14, 4, 28, 2}, worth 0.5352860411899314, over 43700 steps.
Mults: {1, 7, 4, 64, 4}, worth 0.5351851851851852, over 1620 steps.

Ergh. Dismal results. The fluctuations in the data aren’t gonna let me find an optimal value, so I think 20 will do. It did at least as well as 500 in both rounds anyway.
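To put a number on those fluctuations, here is a quick sketch using only the average steps reported in this post. It compares the gap between the two 20 piece rounds with the spread across all the other trial lengths.

```python
# Average steps for the two 20 piece rounds, copied from the results above.
round_one, round_two = 17035.51, 5395.61

# Average steps for every other trial length tested in this post.
other_settings = {
    5: 7627.97, 10: 7631.96, 15: 4149.02, 25: 7998.17, 30: 6991.40,
    50: 9359.25, 100: 6759.58, 200: 4741.16, 500: 5182.45,
}

# Gap between two runs of the same setting.
same_setting_gap = round_one - round_two

# Gap between the best and worst of the other settings.
between_setting_range = max(other_settings.values()) - min(other_settings.values())

print(f"Gap between the two 20 piece rounds: {same_setting_gap:.2f}")
print(f"Range across the other settings:     {between_setting_range:.2f}")
```

The two rounds of the same setting differ by more than the best and worst of the other settings do, so a single 100 episode run per value can’t reliably separate them.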