While implementing the Mario environment, I had an idea for a different approach to preliminary rule testing. Initially, the agent could simply test single-rule policies (each rule being either the RLGG or a single step from the RLGG). This can determine which slots to split (don’t bother splitting rules with no use) and allows the agent to quickly learn which rules are useful early on.
This will result in a minimal number of slots. As the agent tests out policies in the normal fashion, new slots can be created from handy rules that may not have had an initial use but gain one later on. This is much like beam search, which expands only on useful rules.
This strategy will only work if there is an intermediate reward or easily attainable goal. I just feel that the current strategy swamps the agent early on (which it does, and only lets up when slots/rules are found to be useless).
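The preliminary pass described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function `preliminary_rule_test`, the toy evaluator, the rule names and the reward values are all hypothetical stand-ins, not the agent's actual code.

```python
from statistics import mean


def preliminary_rule_test(rules, evaluate_policy, episodes=3, threshold=0.0):
    """Test each rule as a single-rule policy and keep the ones that earn reward.

    `evaluate_policy` is a hypothetical callable that runs one episode with the
    given policy (a list of rules) and returns the reward obtained.
    """
    useful = {}
    for rule in rules:
        average_reward = mean(evaluate_policy([rule]) for _ in range(episodes))
        # Only rules with some demonstrated use get a slot to split later.
        if average_reward > threshold:
            useful[rule] = average_reward
    return useful


# Toy stand-in for the environment: made-up rules with made-up rewards.
TOY_REWARDS = {"moveRight": 2.0, "jumpOver(enemy)": 1.0, "moveLeft": 0.0}


def toy_evaluate(policy):
    """Return the summed toy reward of every rule in the policy."""
    return sum(TOY_REWARDS[rule] for rule in policy)


slots = preliminary_rule_test(TOY_REWARDS, toy_evaluate)
print(sorted(slots))
```

In this toy run `moveLeft` never earns a reward, so no slot is created for it. It would only get one if it proved useful during later, normal policy testing.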
After some thought on the Poisson distribution implementation and the unfair testing I was performing, I think I might have to return to the Normal distribution. While the Poisson simplifies things by only using one parameter, it is insufficient for choosing exact slot values.
I have been unfairly testing my agent by fixing values in place, making the learning process different from the testing (turning the slot selection process from a randomised voting process into a decision list and temporarily ignoring the low probability rules). While this isn’t too big of a deal, the idea of an online learning agent is that it will have a result at any point, not only after being told “it’s testing time, so gimme your best policy”. The Poisson distribution is ill-suited for this because the spread of the distribution cannot be controlled separately from its mean (the single parameter fixes both). For example, if a particular slot was to be used with selection probability 1, the Poisson would morph this into: the slot will be used 0 times with probability ~0.37, once per policy with probability ~0.37, twice with probability ~0.18… However, a Normal distribution with a given SD value can shape the size of its distribution, so a slot with selection probability 1 and SD near 0 will be used once with near 100% probability.
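To make the comparison concrete, here is a small sketch that computes the usage-count probabilities for both distributions. It assumes the Normal sample is rounded to the nearest integer to get a usage count, which is my simplification for illustration; the helper names are likewise made up for this example.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k uses under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def rounded_normal_pmf(k, mean, sd):
    """Probability that a Normal sample rounds to k (assumed rounding scheme)."""
    return normal_cdf(k + 0.5, mean, sd) - normal_cdf(k - 0.5, mean, sd)


# Slot with selection probability (mean usage) 1.
for k in range(4):
    print(k,
          round(poisson_pmf(k, 1.0), 3),
          round(rounded_normal_pmf(k, 1.0, 0.1), 3))
```

With a mean of 1, the Poisson spreads its mass over 0, 1, 2 and more uses no matter what, while the Normal with SD 0.1 puts almost all of its mass on exactly one use.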
Also, the results seem to suggest that the Poisson learns more slowly than the Normal, but this could be due to other implementation changes made along the way.