PhD Progress: Guided Testing of Rules

While implementing the Mario environment, I had an idea for a different way of doing the preliminary testing of rules. Initially, the agent could simply test single-rule policies (each rule being either the RLGG or a single step from the RLGG). This determines which slots to split (there is no point splitting on rules with no use) and allows the agent to quickly learn the initially useful rules.

This will result in a minimal number of slots. As the agent tests policies in the normal fashion, new slots can be created from useful rules within existing slots (rules that may not have had an initial use, but gain one later on). This is much like beam search, which only expands on useful candidates.
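As a minimal sketch of how this preliminary pass might look (the function and parameter names here are hypothetical, not taken from the actual implementation):

```python
def preliminary_rule_test(candidate_rules, evaluate_policy,
                          episodes=10, usefulness_threshold=0.0):
    """Evaluate each candidate rule (the RLGG or a single step from it)
    as a one-rule policy, keeping only the rules that show some use."""
    useful = []
    for rule in candidate_rules:
        # Average return over a few episodes of a policy containing only this rule.
        returns = [evaluate_policy([rule]) for _ in range(episodes)]
        avg_return = sum(returns) / len(returns)
        if avg_return > usefulness_threshold:
            useful.append((rule, avg_return))
    # Only slots containing at least one useful rule need splitting now; the
    # rest can be revisited later, beam-search style, if their rules gain a use.
    return sorted(useful, key=lambda pair: pair[1], reverse=True)
```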

This strategy will only work if there is an intermediate reward or an easily attainable goal. My feeling is that the current strategy swamps the agent early on (which it does, only letting up once slots/rules are found to be useless).

PhD Progress: Poisson vs. Normal Distribution

After some thought on the Poisson distribution implementation and the unfair testing I was performing, I think I might have to return to the Normal distribution. While the Poisson simplifies things by only using one parameter, it is insufficient for choosing exact slot values.

I have been unfairly testing my agent by fixing values in place, making the learning process different from the testing (turning the slot selection process from a randomised voting process into a decision list and temporarily ignoring the low-probability rules). While this isn't too big a deal, the idea of an online learning agent is that it will have a result at any point, not only after saying "it's testing time, so gimme your best policy". The reason the Poisson distribution is ill-suited for this is that its spread cannot be controlled independently of its mean (the variance equals the parameter). For example, if a particular slot was to be used with selection probability 1, the Poisson would morph this into: the slot will be used 0 times with probability ~0.37, once per policy with probability ~0.37, twice with probability ~0.18… However, a Normal distribution with a given SD value can shape the spread of its distribution, so a slot with selection probability 1 and SD near 0 will be used exactly once with near 100% probability.
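To make the numbers concrete, here is a small sketch comparing the two distributions for a slot with selection probability 1 (the helper functions are mine, not part of the agent's code):

```python
from math import exp, factorial, erf, sqrt

def poisson_pmf(k, lam):
    # Probability of exactly k uses under a Poisson with parameter lam.
    return exp(-lam) * lam ** k / factorial(k)

def normal_count_prob(k, mu, sd):
    # Probability that a Normal(mu, sd) sample rounds to the integer count k.
    cdf = lambda x: 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))
    return cdf(k + 0.5) - cdf(k - 0.5)

# A slot with selection probability 1:
for k in range(3):
    print(k, round(poisson_pmf(k, 1.0), 2), round(normal_count_prob(k, 1.0, 0.05), 2))
# Poisson(1):      0 uses -> 0.37, 1 use -> 0.37, 2 uses -> 0.18
# Normal(1, 0.05): 1 use  -> 1.0 (effectively always used exactly once)
```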

Also, the results seem to suggest that the Poisson is slower to learn than the Normal, but this could be due to other implementation changes that were made.

PhD Progress: Problem Domains and Meta-Goal Modules

As I was reviewing a paper for NZCSRSC, I noticed how well the problem definition was set out. I cannot recall any other problem definitions in RRL being set out quite as well, though there is Martijn's fingerprint recognition example. So this got me thinking about possible domains for the agent to learn in.

One of these was (not so) simply driving from A to B. There are a great many actions to perform in driving, and multiple goals and conditions must be satisfied for the agent to perform well. These goals and conditions mirror my previous musings on general modules for achieving them.

Anyway, perhaps my work needs to focus on this module/meta-goal achievement direction: breaking a problem into a number of smaller, possibly prioritised, simultaneous goals (a rough sketch follows below). For instance, in Pac-Man, the goal is to get a high score, but to do this Pac-Man must remain alive. The agent needs to discover how to break a problem down into these areas.

StarCraft is a much bigger and more important example. The agent needs to keep that overall goal in mind, as well as the low-level goals and perhaps the planned ramifications of achieving those goals.
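As a rough sketch of what such a prioritised decomposition could look like (using Pac-Man, with invented module names and state features; this only illustrates the idea, not the agent's actual architecture):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GoalModule:
    name: str
    priority: int                     # lower number = higher priority
    applies: Callable[[Any], bool]    # is this sub-goal relevant in the current state?
    act: Callable[[Any], Any]         # action proposed by this module's policy

def select_action(state, modules):
    """Let the highest-priority applicable sub-goal module choose the action."""
    for module in sorted(modules, key=lambda m: m.priority):
        if module.applies(state):
            return module.act(state)
    return None

# Hypothetical Pac-Man decomposition: staying alive outranks chasing score.
pacman_modules = [
    GoalModule("avoid_ghosts", 0,
               lambda s: s["nearest_ghost_dist"] < 3,
               lambda s: s["away_from_ghost_move"]),
    GoalModule("eat_dots", 1,
               lambda s: True,
               lambda s: s["toward_dot_move"]),
]

state = {"nearest_ghost_dist": 2,
         "away_from_ghost_move": "LEFT",
         "toward_dot_move": "RIGHT"}
print(select_action(state, pacman_modules))  # -> LEFT (survival takes priority)
```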