I received replies from the RL guys and they confirmed my suspicions. There is no data for the currently falling piece (other than the overall observation), but on the good side, the actions remain constant so the piece can be tracked.
This requires me to be able to distinguish the piece from the observation and track its progress down the field. One way to do this is to take a snapshot of the first observation and then on the second observation, find which parts are different and capture them.
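The diffing idea could look something like this minimal sketch, assuming observations arrive as 2-D grids of 0/1 occupancy values (the grid format is my assumption, not confirmed from the actual environment):

```python
def extract_piece(prev_obs, curr_obs):
    """Return the set of (x, y) cells occupied in curr_obs but not in prev_obs.

    Assuming observations are 2-D grids of 0/1 values, the cells that newly
    became occupied between two consecutive observations are taken to be
    the falling piece.
    """
    piece = set()
    for y, row in enumerate(curr_obs):
        for x, cell in enumerate(row):
            if cell and not prev_obs[y][x]:
                piece.add((x, y))
    return piece
```

For example, if a new piece appears in the top row between two snapshots, the returned set is exactly those top-row cells.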
Once they are captured, the agent can use the current details of the piece to modify its orientation so it goes where the agent wants it to go.
So now there are 2 problems:
- Finding the best substate for a particular piece.
- Finding the piece and moving/rotating it to go to the right substate.
At least with the piece finding/manipulating, that can be hard-coded and doesn’t need to be learnt as everything remains consistent.
Now, from the start, the agent must first distinguish the piece from the environment, choose a substate and a piece orientation in that substate, then place it. A useful consequence of knowing the piece and the old environment is that the new environment = old environment + new piece – lines made. Whether it’s easier to find the new environment or calculate it remains to be seen.
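The “calculate it” option can be sketched as follows, again assuming a 0/1 grid representation (the field layout and the piece being given as a set of cells are my assumptions):

```python
def predict_new_field(old_field, piece_cells):
    """new field = old field + placed piece - completed lines.

    old_field is a 2-D grid of 0/1 values (row 0 at the top); piece_cells
    is the set of (x, y) cells the piece occupies once it has landed.
    Returns the predicted field and how many lines were cleared.
    """
    field = [row[:] for row in old_field]
    for x, y in piece_cells:
        field[y][x] = 1
    width = len(field[0])
    # drop every completely filled row...
    remaining = [row for row in field if not all(row)]
    cleared = len(field) - len(remaining)
    # ...and pad with empty rows at the top for each cleared line
    return [[0] * width for _ in range(cleared)] + remaining, cleared
```

If this matches what the environment actually produces, the calculated field can double as a consistency check on the piece extraction.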
Edit: Later in the lab.
Finding how a piece rotates has confused the hell out of me. However, I managed to work it out. The piece is rotated about its centre, whose coordinates are floating-point numbers that need to be rounded to ints to display the piece properly. The weird thing is that the x coordinate is rounded up and the y coordinate is rounded down.
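A sketch of that rotation rule, with the ceil-x / floor-y rounding from above (the rotation direction and the cells/centre representation are assumptions on my part):

```python
import math

def rotate_piece(cells, centre, clockwise=True):
    """Rotate piece cells 90 degrees about a floating-point centre.

    Per the note above, the rotated x coordinate is rounded up and the
    rotated y coordinate is rounded down to land back on the int grid.
    """
    cx, cy = centre
    rotated = []
    for x, y in cells:
        dx, dy = x - cx, y - cy
        if clockwise:
            rx, ry = cx - dy, cy + dx
        else:
            rx, ry = cx + dy, cy - dx
        rotated.append((math.ceil(rx), math.floor(ry)))
    return rotated
```

For instance, a horizontal I-piece at cells (0,1)..(3,1) with centre (1.5, 1.5) rotates clockwise into the vertical column (2,0)..(2,3).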
I also looked into the actions a little, as I need to know exactly how everything works so I can extract the falling piece. The action appears to be chosen by the agent, then performed, then the piece falls. Note that if an action tries to do something the environment doesn’t allow (such as rotating an I-piece at the top of the field, or moving a piece ‘off’ the field), nothing happens.
As previously posted, during both the exploratory and exploitative stages, the agent should be biased towards states that are low, next to a wall, or fill the final gap in a line.
– During exploration, this provides quicker learning when completing lines. Thus exploration becomes more of a ‘guided exploration’.
– During exploitation, this results in a better play strategy.
The value of the influence will need tweaking to be just right. Something like (Field height – Substate height) + Next-to-wall bonus + line/s completion bonus.
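That influence formula can be written out as a small sketch; the bonus weights here are placeholders I made up, precisely the values that would need tweaking:

```python
def placement_influence(field_height, substate_height, next_to_wall,
                        lines_completed, wall_bonus=2.0, line_bonus=10.0):
    """Heuristic influence for a candidate placement, per the note above:
    (field height - substate height) + next-to-wall bonus + line-completion bonus.

    wall_bonus and line_bonus are assumed placeholder weights to be tuned.
    """
    value = field_height - substate_height  # lower placements score higher
    if next_to_wall:
        value += wall_bonus
    value += line_bonus * lines_completed
    return value
```

With these placeholder weights, a low, wall-adjacent placement that completes one line would score (20 − 18) + 2 + 10 = 14 on a height-20 field.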