While thinking about my agent's learning method, I realised it isn't as simple as I believed it to be. Standard TD learning uses eligibility traces (of length 1, but eligibility traces nonetheless), and at first glance I don't use any form of TD learning. However, I do use my own form of eligibility trace, one that can update multiple states at once. The superstate updating covers this, and it shows me the learning isn't as basic as I thought it was.
Still, I would like to make it better.
After a bit of reading, I have thought of a way to improve the agent's learning. Using a backward eligibility trace, fired every time a line is made, multiple previous states could be updated as the lead-up to that reward. Of course, they wouldn't get the same reward as the winning state, but they would still receive some. To do this, the policy should experience no change on a reward of 0, but a list of state/piece pairs should be updated on a reward of 1.
Initial Eligibility Trace outline:
First, a constant for the size of the eligibility trace needs to be initialised. 10 oughta do it. The reward received at position n in the list is then R(n) = r * (1 − n/10), where n = 0 is the latest element, which receives the full reward. The last element (n = 9) receives 0.1 of r.
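To sanity-check the scaling, here is a minimal sketch of that formula in Python (the function name and trace size constant are my own placeholders, not part of the agent):

```python
TRACE_SIZE = 10  # assumed trace length from the outline above

def scaled_reward(r, n, size=TRACE_SIZE):
    """Reward for the n-th most recent entry: R(n) = r * (1 - n/size)."""
    return r * (1 - n / size)

# n = 0 (latest) gets the full reward r; n = 9 (oldest) gets 0.1 * r.
rewards = [scaled_reward(1.0, n) for n in range(TRACE_SIZE)]
```

The linear falloff means every entry in the trace always gets some credit, unlike an exponential decay where distant entries become negligible.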
The idea is to maintain a stack of size 10 containing the last 10 SubState-Tetromino pairs for updating. When a positive, non-zero reward is obtained, all 10 of them are updated using the formula above for eligibility scaling. This should net faster reward gain: if rewards arrive in quick succession, the stack items receive a large amount of updating, because the agent is playing well.
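A minimal sketch of how such a trace might look, assuming a toy value table keyed by (substate, tetromino) pairs; the class and attribute names are hypothetical, not the agent's actual code:

```python
from collections import deque

TRACE_SIZE = 10  # assumed trace length from the outline above

class EligibilityTrace:
    """Keeps the last 10 (substate, tetromino) pairs and scales
    their updates by recency when a line-clearing reward arrives."""

    def __init__(self, size=TRACE_SIZE):
        self.trace = deque(maxlen=size)  # oldest pairs fall off automatically
        self.values = {}                 # toy value table: (substate, tetromino) -> value

    def push(self, substate, tetromino):
        self.trace.appendleft((substate, tetromino))  # index 0 = most recent

    def reward(self, r):
        if r <= 0:
            return  # no change to the policy on zero reward
        for n, pair in enumerate(self.trace):
            # R(n) = r * (1 - n/size): latest pair gets full reward
            self.values[pair] = self.values.get(pair, 0.0) + r * (1 - n / self.trace.maxlen)
```

Using a `deque` with `maxlen` means old pairs are discarded for free as new ones are pushed, so the "stack" never grows past 10.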
The stack-size constant of 10 may need modifying for optimal reward gain, though I should think a constant of no less than 5 would suffice.