PhD Progress: Cross-Entropy and Exploration vs. Exploitation

I had an epiphany last night, while I was attempting to sleep. Perhaps epiphany is too strong a word, but whatever it is has given new drive to my research.

The cross-entropy algorithm has, for me, always just been a means to an end – something that could easily be swapped out. But now I have realised that it is the near-perfect algorithm for reinforcement learning. Perhaps it was something I glanced at while reading Szita and Lorincz’s paper, though even then, the regular cross-entropy algorithm is too limited for dynamic rule additions.

The crux is this: RL comes with the problem of exploration vs. exploitation. In value-based methods, this is typically straightforward exploration/exploitation of the state-space itself. In policy-search methods, though – or at least in this one – the exploration/exploitation is over the rule-space used to act in the environment.

As I said before, I believed the cross-entropy method could be swapped out for some other search method. Perhaps A* or Beam Search would work? But then I realised that proper exploration could not be performed with those two. CE allows values (or probabilities) to be assigned to rules, essentially giving them weight, and this is performed in a stepwise fashion so the ‘true’ values are found. A* or Beam Search would require several experiments over the rules to get the true values before searching further.

CE allows us to explore randomly at first, then start following useful-looking paths by mutating those rules further. To avoid getting stuck on local maxima, though, all rules retain a chance of being mutated. So the key point here is that exploration in CERRL is through rule-space – not state-space.
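To make this concrete for myself, here is a minimal sketch (Python, not the actual CERRL code) of the stepwise update I have in mind: rule probabilities bias which rules end up in sampled policies, and each generation nudges those probabilities towards the rules used by the best-scoring policies. The names alpha, elite_frac and evaluate_policy are placeholders of my own, not real parameters of the system.

```python
import random

def ce_update(rule_probs, evaluate_policy, n_samples=50, elite_frac=0.1, alpha=0.6):
    """One CE generation over a dict of {rule: probability} (sketch only)."""
    samples = []
    for _ in range(n_samples):
        # Sample a policy: each rule is included with its current probability.
        policy = [r for r, p in rule_probs.items() if random.random() < p]
        samples.append((evaluate_policy(policy), policy))

    # Keep only the elite (best-scoring) policies.
    samples.sort(key=lambda s: s[0], reverse=True)
    elite = [policy for _, policy in samples[:max(1, int(n_samples * elite_frac))]]

    # Stepwise update: move each rule's probability towards its frequency
    # among the elite policies, rather than replacing it outright.
    for rule in rule_probs:
        freq = sum(rule in policy for policy in elite) / len(elite)
        rule_probs[rule] = (1 - alpha) * rule_probs[rule] + alpha * freq
    return rule_probs
```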

This summary of the algorithm gives rise to a new structure: a tree with probability distributions. The tree may in fact already be in the algorithm, but it is not explicitly represented. Each node in the tree is a rule, with lower-level rules being more specialised versions of their parents. Technically it’s not a tree, it’s an acyclic graph, since a specialisation can have more than one parent.
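A rough sketch of that structure, assuming plain Python objects (the field names are mine, not anything in the actual code):

```python
from dataclasses import dataclass, field

# Each node is a rule with a probability; children are more specialised rules.
# Because a specialisation can be reached from more than one rule, nodes keep
# a list of parents, making this an acyclic graph rather than a strict tree.
@dataclass(eq=False)  # identity comparison is enough for linking nodes
class RuleNode:
    rule: str                   # e.g. "(moveTo X) :- (edible X)"
    probability: float = 0.5    # new rules start at an average value
    parents: list = field(default_factory=list)   # list of RuleNode
    children: list = field(default_factory=list)  # specialisations

    def add_specialisation(self, child: "RuleNode"):
        # Link both ways; the same child may later gain further parents.
        self.children.append(child)
        if all(p is not self for p in child.parents):
            child.parents.append(self)
```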

Note: something I have noted on paper at the lab is a new, better-controlled way to mutate. The problem first showed up in the Stack goal, where the best rule was found but then specialised again and again, swamping this perfect rule with imperfect specialisations and converging on an imperfect solution. I was aware of this problem, but it had never reared its head before (probably because my testing was focused on Ms. Pac-Man).

This new mutation strategy uses rule probabilities as limiters. When a rule is to be mutated, it is first checked for enough experience (as per normal) and also checked against its parent rule: each rule stores its parent, and for a rule to be mutated it must be performing better than that parent. Newly created rules start with average probabilities, so to be mutated further they must outperform their parent rule. Again, this is like Beam Search, which only specialises when necessary. I just hope this isn’t too limiting for minor (low-probability) rule specialisations. I believe it should be ok.
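Roughly, the gate looks like this (continuing the sketch above; MIN_EXPERIENCE and the experience counter are placeholders for whatever the usual experience check uses):

```python
MIN_EXPERIENCE = 20  # placeholder threshold for "enough experience"

def can_mutate(node, experience):
    """A rule may only be specialised further if it has enough experience
    and is performing better (has a higher probability) than its parent."""
    if experience[node.rule] < MIN_EXPERIENCE:
        return False
    if not node.parents:
        return True  # covering rules have no parent to beat
    parent = node.parents[0]  # single-parent case; multi-parent case below
    return node.probability > parent.probability
```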

A possible issue with this is the existence of multiple parents, since the specialisations are connected as a graph. This could simply be noted when a rule is mutated: every mutation records any new parents it picks up and uses the (average? minimal?) parent probability to decide when to mutate.
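Continuing the same sketch, the multi-parent case could aggregate the parent probabilities either way; since the choice between minimal and average is still open, both options are shown:

```python
def parent_threshold(node, mode="min"):
    """Probability the rule must beat before it may be specialised further."""
    probs = [p.probability for p in node.parents]
    if not probs:
        return float("-inf")        # no parent: always allowed
    if mode == "min":
        return min(probs)           # minimal parent probability (easiest bar)
    return sum(probs) / len(probs)  # average over all recorded parents

def can_mutate_multi(node, experience, mode="min"):
    return (experience[node.rule] >= MIN_EXPERIENCE
            and node.probability > parent_threshold(node, mode))
```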

PhD Progress: Combinatorial Learning Note

Ms. Pac-Man experiments are chugging along but, as stated in an earlier post, the score doesn’t seem to change much. Something interesting to note, though, is the presence of general toGhost rules.

Typically, going towards a ghost is a bad idea unless the ghost is edible, hence a toGhost rule with an edible condition is perfect. But due to synchronous learning, the agent learns the general rule too, because the toPowerDot rule is always in effect at the same time. This is good evidence that learning rules in parallel works, but annoying because Ms. Pac-Man learns less-than-perfect rules.

What needs to happen is focused testing to find the better of two rules.
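Something like the following, perhaps: run the two candidate rules inside otherwise identical policies and compare average scores. This is only how I imagine such a test; evaluate_policy, base_policy and the episode count are placeholders, not part of the current system.

```python
import statistics

def compare_rules(rule_a, rule_b, base_policy, evaluate_policy, episodes=30):
    """Head-to-head test: same base policy, only the candidate rule differs."""
    scores_a = [evaluate_policy(base_policy + [rule_a]) for _ in range(episodes)]
    scores_b = [evaluate_policy(base_policy + [rule_b]) for _ in range(episodes)]
    return {
        "rule_a_mean": statistics.mean(scores_a),
        "rule_b_mean": statistics.mean(scores_b),
        "better": rule_a if statistics.mean(scores_a) > statistics.mean(scores_b) else rule_b,
    }
```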

PhD Progress: Slot Splitting over all Predicates

Slot splitting has previously been proposed for splitting based on the type of action being performed, but this could be extended even further. A possible problem found in Ms. Pac-Man is lumping the plain toGhost rules together with the edible toGhost rules, so if the agent chooses a bad toGhost rule, it ends up committing suicide. This is fine for removing bad toGhost rules, but it also drags down the slot probability, which in turn hurts the edible toGhost rules.

An alternative is to split the slots at the first specialisation from the covering rule, so that each specialisation (of a condition) of the rule becomes its own slot, which is then specialised again. Technically this could go on and on, but then the policy becomes a task of finding the best rules in general. (Would that be so bad…? Yes, because then it’s a brute-force search.) Granted, there will be a larger number of slots, but the slot-removal mechanic should deal with that.

Each of these specialised slots has a base condition common to every rule in it (plus the obvious base condition for the action in general). So, for example, toGhost will have a slot each for rules where X is edible, blinking, not edible, or not blinking. On a more general level, (moveTo X) can have a number of types associated with it (X can be dot, fruit, ghost…) and conditions (edible, blinking – each of which implicitly also implies X is a ghost). Note that conditions can also be negated, but types cannot (they COULD, but it seems a bit silly; maybe best to talk it over with Bernhard).
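A toy sketch of that splitting for (moveTo X), using only the types and conditions named above (the real type list is longer than what I have written here; this is just to show the shape of the slots):

```python
from itertools import chain

TYPES = ["dot", "fruit", "ghost"]    # incomplete; only the examples from above
CONDITIONS = ["edible", "blinking"]  # each implicitly implies X is a ghost

def split_slots(action="moveTo"):
    """One slot per type (not negated) and one per condition and its negation."""
    type_slots = [f"({action} X) & ({t} X)" for t in TYPES]
    cond_slots = list(chain.from_iterable(
        (f"({action} X) & ({c} X)", f"({action} X) & (not ({c} X))")
        for c in CONDITIONS))
    return type_slots + cond_slots

# e.g. split_slots() ->
#   ["(moveTo X) & (dot X)", ..., "(moveTo X) & (not (blinking X))"]
```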

A problem with this system is that there WILL be double-ups (if not more), and it is technically limited to splitting only once. However, type hierarchies can be taken advantage of, and identical rules between slots could be linked as well. Or perhaps rules could be restricted to particular slots using a placement heuristic (conditionalised rules go to condition slots? Or type slots? Gah. Who knows…).

I don’t think the double-ups will be too damaging to the system, so they could probably be left alone.