Been thinking about possible ideas to improve the agent, and one that Bernhard mentioned was ASA (Adaptive Simulated Annealing). I already had the rough idea when I proposed it, but he knew of an existing strategy and pointed me towards it.

This strategy is needed for ‘heating up’ the eTemperature when the agent is introduced to a new domain. The new domain is likely to differ from the previous one, so the agent needs to re-learn a good strategy for it.

Also, if an agent has quickly learnt a good strategy for a domain, it would be beneficial to cool the temperature quickly rather than risk further exploration.

The algorithm will work a little like this (I’m still thinking how exactly):

• The eTemp has a max value of 1 and a min value of 1 − defaultCoolingRate.
• The cooling rate needs max and min values; otherwise, if a new MDP is introduced, it could take the agent a long time to reverse an extreme cooling rate.
• The cooling rate is defaulted as 0.99 or something.
• The threshold is defaulted as 0.2. The threshold is the point at which the cooling rate either rises or falls, depending on the state of the policy.
• At each pickAction() (every 20 pieces), the eTemperature is multiplied by the cooling rate. In the older versions this only occurred when exploring, so cooling would gradually slow as the agent became more greedy.
• Before this is done, the policy is evaluated for a value (perhaps the top value, or an average of the top 5). This value determines how the cooling rate changes. If the value is above the threshold, cooling speeds up (the multiplier drops further below 1); if below, vice versa. Note that the cooling rate can go above 1, which reheats the temperature.
• The amount the cooling rate changes is determined by how far the value differs from the threshold. Perhaps: coolingRate = coolingRate − (value − threshold) * defaultCoolingRate / 10. So if the value is 0.3, the threshold is 0.2, and both defaultCoolingRate and coolingRate are 0.99, the new value would be:
`coolingRate = 0.99 - (0.3 - 0.2) * 0.99 / 10 = 0.99 - 0.0099 = 0.9801`
Or for a value of 0.1:
`coolingRate = 0.99 - (0.1 - 0.2) * 0.99 / 10 = 0.99 + 0.0099 = 0.9999`
• A static threshold might work OK in some cases, but when the MDP changes, the old threshold might become unreasonable. So the threshold also shifts in proportion to how far the value is from it. Perhaps:
`threshold = threshold + (value - threshold) / 10`
This should be reasonable enough that the threshold will stabilise on the maximum value in time, encouraging values higher than the local maximum to grow.
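The steps above can be sketched as a single update routine. This is only a rough sketch of my current thinking: the class name, the policy-value argument, and the cooling-rate clamp bounds (0.9 and 1.1) are placeholder assumptions, not settled parts of the design.

```python
DEFAULT_COOLING_RATE = 0.99


class AdaptiveAnnealer:
    """Sketch of the adaptive cooling schedule described in the notes above."""

    def __init__(self):
        self.cooling_rate = DEFAULT_COOLING_RATE
        self.threshold = 0.2
        self.e_temperature = 1.0
        # Assumed clamp bounds: keep the cooling rate near 1 so it can be
        # reversed quickly when a new MDP is introduced.
        self.min_cooling = 0.9
        self.max_cooling = 1.1

    def update(self, policy_value):
        """Called at each pickAction() with the evaluated policy value."""
        # Adjust the cooling rate in proportion to how far the policy value
        # sits from the threshold (above the threshold -> faster cooling).
        self.cooling_rate -= (policy_value - self.threshold) * DEFAULT_COOLING_RATE / 10
        self.cooling_rate = max(self.min_cooling, min(self.max_cooling, self.cooling_rate))
        # Drift the threshold a tenth of the way towards the observed value,
        # so it stabilises on the values the policy actually achieves.
        self.threshold += (policy_value - self.threshold) / 10
        # Apply the cooling (or reheating, if cooling_rate > 1) step, keeping
        # eTemperature within [1 - defaultCoolingRate, 1].
        self.e_temperature *= self.cooling_rate
        self.e_temperature = max(1 - DEFAULT_COOLING_RATE, min(1.0, self.e_temperature))
```

With the worked numbers from above: `update(0.3)` from the defaults gives a cooling rate of 0.9801, while `update(0.1)` gives 0.9999 — matching the hand calculations.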

Hopefully this whole system will allow the agent to continue running over different MDPs or on a changing Tetris game.