PhD Progress: Massive Revamp

Phew. It’s been a while since I’ve posted here. I have Evernote and various other note taking applications taking care of my progress notes these days, but this one didn’t particularly seem note-ish.

So, the biggest change has been the removal of the pre-goal and the addition of parameterisable goal terms, such that onAB now uses any valid variables instead of ‘a’ and ‘b’. This took quite some time to sort out, especially with the possibilities of more rules due to the larger set of possible specialisations these goal predicates can be used in. A result of using goal terms has split the agent observations class into two: the global (or environmental) observations which are true regardless of the goal, and the local agent observations which deal solely with terms concerning the goal the predicate.

This has adversely affected the modular learning, which essentially had to be rewritten to cope with and also use these new goal terms. I’ve only just now got them working again after about a week’s worth of work and I’m happy to say they’re working very well (well, there are still some bugs, but in principle they’re working well).

Because modular rules used to be learned using the pre-goal (well using constant terms in rules – but these came about from the pre-goal anyway), this had to change. Now, the possible modules to learn can be found using the local observed goal predicates – that is, the possible predicates the goal terms can take on (e.g. ?G_0 can be highest, clear, etc.). However, because modules are recursive, they need to be learned in a particular order. This has always been an issue, but only now has it come to the fore. Probably because every module is learned, rather than just modules concerning the pre-goal. This ordering can be estimated using the number of other modules the module depends on (via the observed goal predicates). This means that each module has to be run in a prelimiary phase to get an idea of the goal observations the module uses.

Anyway, the ordering for learning from the onAB goal is: clearA, highestA, aboveAB. Learning clearA first is important as it is used in every other module. In a quick experiment, the agent learned the following policies for the modules:
(clear ?X) (clear ?Y) (above ?X ?G_0) => (move ?X ?Y)
Not ideal (I’d prefer moveFloor behaviour, but in terms of reward, the two rules are equal), but it works in optimal steps.

(clear ?X) (clear ?Y) (above ?X ?G_0) => (move ?X ?Y)
(clear ?G_0) (highest ?Y) (block ?G_0) => (move ?G_0 ?Y)
This is perfect behaviour for highest. I don’t think it learned two rules, rather it learned the one and used the clear module (but the rule isn’t showing as a module -> BUG).

(clear ?X) (highest ?Y) (above ?Y ?G_1) (not (highest ?X)) => (move ?X ?Y)
This is a crappy rule which only works sometimes. A perfect above policy would have at least two rules, each of which would use clear to clear the ?G_0 block.

(clear ?X) (above ?X ?G_0) (above ?X ?Y) (floor ?Y) => (move ?X ?Y)
(clear ?G_0) (clear ?G_1) (block ?G_0) => (move ?G_0 ?G_1)
While this isn’t what the module output (though it was present in the elites about 70% of the time), it’s what should have been output. This is quite a good case for CERRLA, in that the clear module isn’t ideal, so it adds in that first rule to keep ?G_0 clear. While it isn’t a perfect policy, it’s pretty close and will work in most cases.

So, next up is finding and fixing the bugs, then probably working on my CIG paper. Or maybe testing out the dynamic slot splitting algorithm.

PhD Progress: Guided Testing of Rules

While implementing the Mario environment, I had an idea of differing preliminary testing of rules. Initially, the agent could simply test single rule policies (each rule being either the RLGG or a single step from the RLGG). This can determine which slots to split (don’t bother splitting rules with no use) and allows the agent to quickly learn initially useful rules.

This will result in a minimal number of slots. As the agent tests out policies in a normal fashion, new slots can be created from handy rules in the slots which may not have had an initial use, but gain one later on. This is much like beam search, which expands on useful rules.

This strategy will only work if there is an intermediate reward or easily attainable goal. I just feel that the current strategy swamps the agent early on (which it does, and only lets up when slots/rules are found to be useless).

PhD Progress: Poisson vs. Normal Distribution

After some thought on the Poisson distribution implementation and the unfair testing I was performing, I think I might have to return to the Normal distribution. While the Poisson simplifies things by only using one parameter, it is insufficient in choosing exact slot values.

I have been unfairly testing my agent by fixing values in place, making the learning process different from the testing (turning the slot selection process from a randomised voting process into a decision list and temporarily ignoring the low probability rules). While this isn’t too big of a deal, the idea of an online learning agent is that it will have a result at any point – not after saying “it’s testing time, so gimme your best policy”. The reason the Poisson distribution is ill-suited for this is because the size of the distribution remains the same, regardless of parameter (in a sense). For example, if a particular slot was to be used with selection probability 1, the Poisson would morph this into: the slot will b used 0 times with probability ~0.33, once per policy with probability ~0.33, twice with probability ~0.16… However, a Normal distribution with given SD value can shape the size of it’s distribution, so a slot with selection probability 1 and SD near 0 will be used once with near 100% probability.

Also, the results seem to speak that Poisson is slower than Normal in learning, but this could be due to other implementation changes made.