After implementing the evaluation-based action selection method and fixing a minor bug that caused bad play (the agent was looking for holes horizontally), I have created a much better agent. In a console trainer run, after 100 episodes it had completed a total of 525647 movements, which is 5256 movements per episode on average. If a piece takes at most 7 movements to place, giving a (heavily) estimated average of 3.5 moves per piece, then the agent places roughly 1502 pieces per episode.
Continuing the estimate: if the agent needs on average 4 pieces to make a line (the true number is likely lower, but I'm being generous here), that's roughly 375 lines per episode. Not too shabby…
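The back-of-envelope arithmetic above can be sanity-checked in a few lines (the 3.5 moves per piece and 4 pieces per line figures are the rough estimates from the text, not measured values):

```python
# Sanity check of the estimates above.
total_moves = 525647
episodes = 100

moves_per_episode = total_moves / episodes        # ~5256
moves_per_piece = 3.5                             # rough average (7 max)
pieces_per_episode = moves_per_episode / moves_per_piece

pieces_per_line = 4                               # generous estimate
lines_per_episode = pieces_per_episode / pieces_per_line

print(round(pieces_per_episode))  # ~1502
print(round(lines_per_episode))   # ~375
```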
Note that this is while the agent is still playing exploratorily. If I turn on the cooling rate, the agent gets worse and worse, ending up with results no better than it had before.
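For readers unfamiliar with the setup: the cooling rate gradually reduces the exploration probability, shifting the agent from random moves toward purely greedy play driven by the evaluation function. The post doesn't show the actual schedule, so this is a minimal sketch assuming a standard epsilon-greedy policy with multiplicative decay; `choose_action`, `evaluate`, and the constants are illustrative, not the real code:

```python
import random

def choose_action(actions, evaluate, epsilon):
    """Epsilon-greedy: with probability epsilon pick a random action
    (explore), otherwise pick the one the evaluation function rates
    highest (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=evaluate)

# Cooling: decay epsilon each episode so play becomes greedier over time.
epsilon, cooling_rate, min_epsilon = 1.0, 0.99, 0.05
for episode in range(1000):
    # ... run one episode, calling choose_action(..., epsilon) each step ...
    epsilon = max(min_epsilon, epsilon * cooling_rate)
```

If greedy play is worse than exploratory play, that strongly suggests the evaluation function itself is the weak point, which is exactly the conclusion drawn next.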
This leads me to believe that my greedy function needs changing. Or rather, that my evaluation function does. So the plan is to discard the substate data (which I worked so hard on!!! 🙁 ) and, instead of storing data there, somehow evolve my evaluation function toward optimal play. The main idea is to mix a genetic algorithm with reinforcement learning. This will need further thought. But what is known is that substates weren't the ideal method of play. They're too short-sighted.
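One common shape for this idea is an evaluation function that scores a board as a weighted sum of hand-picked features, with a genetic algorithm evolving the weights using lines cleared per episode as fitness. This is only a sketch of that direction, not the plan from the post; the feature names and GA parameters are all assumptions:

```python
import random

def evaluate(features, weights):
    """Score a board state as a weighted sum of its features.
    Features might be things like aggregate height, hole count,
    and bumpiness (hypothetical choices, not from the post)."""
    return sum(w * f for w, f in zip(weights, features))

def evolve(population, fitness, mutation_rate=0.1):
    """One GA generation: keep the fittest half of the weight vectors,
    refill the population with mutated copies of survivors."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    children = [
        [w + random.gauss(0, mutation_rate) for w in random.choice(survivors)]
        for _ in range(len(population) - len(survivors))
    ]
    return survivors + children
```

Here `fitness` would run an episode with the given weights and return lines cleared, which is where the reinforcement-learning signal enters: the environment's reward drives selection instead of a stored value table.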