PhD Progress: Cross-Entropy and Exploration vs. Exploitation

I had an epiphany last night while I was trying to sleep. Perhaps epiphany is too strong a word, but whatever it was has given new drive to my research.

For me, the cross-entropy algorithm has always just been a means to an end, something that could easily be swapped out. But now I have realised that it is the near-perfect algorithm for reinforcement learning. Perhaps it was something I glanced at while reading Szita and Lorincz's paper, though even then, the regular cross-entropy algorithm is too limiting for dynamic rule additions.

The crux is this: RL comes with the problem of exploration vs. exploitation. In value-based methods, this typically takes the obvious form of exploring and exploiting the state-space itself. In policy-search methods though, or at least in this one, the exploration and exploitation happen over the rule-space used to act in the environment.

As I said before, I believed the cross-entropy method could be swapped out for some other search method. Perhaps A* or Beam Search would work? But then I realised that proper exploration could not be performed with those two. CE allows values (or probabilities) to be assigned to rules, essentially giving them weight, and this is performed in a stepwise fashion, so 'true' values are found. A* or Beam Search would require several experiments over the rules to get the true values before searching further.

CE allows us to explore randomly at first, but then start following useful-looking paths by mutating the rule further. To avoid becoming stuck on local maxima, all rules retain a chance of being mutated. So the key point here is that exploration in CERRL is through rule-space, not state-space.
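To make this concrete for myself, here is a rough sketch of the kind of stepwise probability update I have in mind. This is illustrative only, not the actual CERRL code: `evaluate_policy`, the idea that a sampled policy is just a subset of rules, and all the parameter values are placeholder assumptions.

```python
import random

def cross_entropy_step(rules, probs, evaluate_policy,
                       num_samples=50, elite_frac=0.1, step_size=0.6):
    """One CE iteration: sample rule sets, keep the elites, and nudge each
    rule's probability toward its frequency among the elite samples."""
    samples = []
    for _ in range(num_samples):
        # Sample a candidate policy: each rule is included with its current probability.
        policy = [r for r in rules if random.random() < probs[r]]
        samples.append((evaluate_policy(policy), policy))

    # Keep the top-performing fraction of sampled policies (the elites).
    samples.sort(key=lambda s: s[0], reverse=True)
    num_elite = max(1, int(elite_frac * num_samples))
    elites = [policy for _, policy in samples[:num_elite]]

    # Stepwise update: move each rule's probability toward its elite frequency,
    # so 'true' rule values emerge gradually rather than in one jump.
    for r in rules:
        freq = sum(r in policy for policy in elites) / num_elite
        probs[r] = (1 - step_size) * probs[r] + step_size * freq
    return probs
```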

This summary of the algorithm gives rise to a new structure: a tree with probability distributions. The tree may in fact already be implicit in the algorithm, but it is not explicitly represented. Each node in the tree is a rule, with lower-level rules being more specialised versions of their parents. Technically, it is not a tree but an acyclic graph.
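Roughly, I imagine the explicit structure looking something like the sketch below. The field names and defaults are my own placeholders, not anything from the current code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RuleNode:
    """A node in the (currently implicit) rule graph."""
    rule: str                                       # the rule itself, e.g. a condition-action pair
    probability: float = 0.5                        # current CE sampling probability
    experience: int = 0                             # how many times the rule has been evaluated
    parents: List["RuleNode"] = field(default_factory=list)   # more general rules
    children: List["RuleNode"] = field(default_factory=list)  # specialisations

def specialise(parent: RuleNode, new_rule: str) -> RuleNode:
    """Create a specialised child rule. Because a specialisation can later gain
    further parents, the structure is an acyclic graph rather than a strict tree."""
    child = RuleNode(rule=new_rule, parents=[parent])
    parent.children.append(child)
    return child
```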

Note: Something I have noted on paper at the lab is a new, better-controlled way to mutate rules. The problem first showed up in the Stack goal, where the best rule was found but then specialised again and again, swamping this perfect rule with imperfect specialisations and converging on an imperfect solution. I was aware of this problem, but it had never reared its head before (probably because my testing was focused on Ms. Pac-Man).

This new mutation strategy uses rule probabilities as limiters. When a rule is about to be mutated, it is first checked for sufficient experience (as per normal) and then checked against its parent rule. Each rule stores its parent rule, and for a rule to be mutated it must perform better than that parent. Newly created rules start at an average probability, so to be mutated further they must perform better than their parent rule. Again, this is like Beam Search, which only specialises when necessary. I just hope this isn't too limiting for minor (low-probability) rule specialisations. I believe it should be ok.
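Sketched in code, building on the node structure above, the guard might look like the following. The experience threshold is an arbitrary placeholder rather than a tuned value, and for now it only compares against the original (first) parent.

```python
MIN_EXPERIENCE = 30  # placeholder threshold, not a tuned value

def should_mutate(node: RuleNode) -> bool:
    """Only specialise a rule once it has enough experience and a higher
    probability than the rule it was specialised from."""
    if node.experience < MIN_EXPERIENCE:
        return False
    if not node.parents:
        return True  # top-level rules have no parent to beat
    # New rules start at an average probability, so they must earn a
    # probability above their parent's before being specialised again.
    return node.probability > node.parents[0].probability
```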

A possible issue with this is the existence of multiple parents, because the specialisations are connected as a graph. This could simply be noted when a rule is mutated: every mutation records any new parents it picks up and uses the (average? minimal?) parent probability to decide when to mutate.
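In code, the open choice between the two aggregates might look like this; both options are only sketched, and I have not decided which (if either) is right yet.

```python
def parent_threshold(node: RuleNode, mode: str = "minimal") -> float:
    """Aggregate the probabilities of all recorded parents into one threshold.
    'minimal' is the most permissive gate (beat only the weakest parent);
    'average' sits between the weakest and strongest parent."""
    probs = [p.probability for p in node.parents]
    if not probs:
        return 0.0  # no parents: nothing to beat
    return min(probs) if mode == "minimal" else sum(probs) / len(probs)
```

The guard above would then compare a rule's probability against parent_threshold(node) instead of a single parent's probability.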