Progress: Explore vs. Exploit Strategy

Just a small implementation for now of the choice between exploration and exploitation. At first, the agent will explore fully, then become more exploitative over time.

This is done by starting E at 1 (a probability of 1 of exploring) and ‘cooling’ it by a constant factor (currently 0.99) every time an exploratory action is chosen. This means that as E becomes small, it cools more slowly, since exploratory actions are chosen less and less often. To stop it from cooling below a sensible minimum, a floor will be put on E.
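As a minimal sketch of this cooling scheme (the 0.05 floor and the function/variable names here are placeholders for illustration, not values taken from the actual program):

```python
import random

EPSILON_START = 1.0    # start fully exploratory
EPSILON_DECAY = 0.99   # cooling constant, applied only after an exploratory action
EPSILON_FLOOR = 0.05   # assumed floor; the real minimum hasn't been decided yet

epsilon = EPSILON_START

def choose_action(greedy_action, random_action):
    """Pick an action epsilon-greedily and cool epsilon only when exploring."""
    global epsilon
    if random.random() < epsilon:
        # Exploratory action chosen: cool E, but never let it drop below the floor.
        epsilon = max(EPSILON_FLOOR, epsilon * EPSILON_DECAY)
        return random_action
    return greedy_action
```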

A fault I have thought of is that when the agent is exploring, it will always take a good action and ignore bad actions. Thus, the values of those bad actions won’t ever be updated, and the greedy side of things may not be able to find a decent place to put a Tetromino and will resort to using one of those unseen bad actions.
Perhaps a method of remedying this is to update similar states when updating by a small amount. For instance, {0,0,0} is similar to {0,0,1} and the latter could receive a small amount of reward whenever the former receives reward and vice-versa.
Probably not a big deal, and finding the right amount of reward could be bothersome, so I’ll leave it for now.
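If I ever come back to it, a rough sketch of that spillover update could look like the following, assuming states are small integer tuples like the ones above; the neighbour definition, learning rate, and spill fraction are all made up for illustration:

```python
def neighbours(state):
    """Yield states that differ from `state` in exactly one slot by one unit.

    The state representation (a tuple of relative depths) and the +/-1 notion
    of 'similar' are assumptions, matching the {0,0,0} vs {0,0,1} example.
    """
    for i, v in enumerate(state):
        for delta in (-1, 1):
            yield state[:i] + (v + delta,) + state[i + 1:]

def update_with_spillover(values, state, reward, alpha=0.1, spill=0.05):
    """Apply the usual update to `state`, plus a small fraction to similar states."""
    values[state] = values.get(state, 0.0) + alpha * reward
    for s in neighbours(state):
        values[s] = values.get(s, 0.0) + alpha * spill * reward
```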

More to come as I get lab results.

Lab results!
The program performed okay-ish. It basically did what I expected, but it definitely needs improvement. Some problem areas were:
– The standard bug that I haven’t got around to fixing: the agent cannot find the first piece on a newly cleared field. Must fix this eventually, or at the very least spam DROP until it can find one.
– The I-pieces didn’t always have a place to go. Need to give the agent more options for placing them vertically (filling in holes 1-2 deep, or aligning them next to walls/towers). EMPHASIS! They are really messing up the whole field. A horizontal I-piece in the wrong place is catastrophic.
– The agent would sometimes choose less-than-perfect positions for pieces. For instance, given a vertical S-piece and a field of {…,-1,-1,-1,-1,-1}, the best place to put it would be at the end, the lowest point. But I have seen it placed at the second-to-lowest position, creating a big chasm.
– L- and J-pieces have been placed in less-than-perfect positions too. Just witnessed: a field of {0,0,0,-1…}, and the agent put the piece vertically upright on the 0s rather than on the -1.
– Pieces that are 3 or 2 wide depending on their orientation need to be biased towards the 3-wide orientation, to make lines quicker. So for T-, L-, and J-pieces, if there’s room, the agent should lay them horizontally to cover the most ground, unless there’s a clearly better vertical position (a rough sketch of such a tie-break follows below).
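One way that bias could slot in, assuming candidate placements have already been scored somewhere, is as a tie-break on width; this is just a guess at how it might look, not the agent’s actual selection code:

```python
def pick_placement(placements):
    """Among candidate placements, prefer the widest when values are tied.

    `placements` is assumed to be a list of (value, width, placement) tuples.
    Sorting by value first and width second means a 3-wide T/L/J placement
    wins over an equally valued narrower one.
    """
    return max(placements, key=lambda p: (p[0], p[1]))[2]
```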

As for learning and the switch to greedy behaviour, it doesn’t appear to be doing much yet, but it’s hard to tell at any given moment whether the agent is being greedy or exploratory. More to come once the problems above have been sorted.