As I was reviewing a paper for NZCSRSC, I noticed how well its problem definition was set out. I cannot recall any other problem definitions in RRL set out quite as well, though there is Martijn’s fingerprint recognition example. This got me thinking about possible domains for the agent to learn in.
One such domain was (not so) simply driving from A to B. Driving involves a great number of actions, and multiple goals and conditions must be satisfied for the agent to perform well. These goals and conditions mirror my previous musings on general modules for achieving them.
Anyway, perhaps my work needs to focus on this module/meta-goal achievement direction – breaking a problem into a number of possibly prioritised smaller simultaneous goals. For instance, in Pac-Man, the goal is to get a high score. But to do this, Pac-Man must remain alive. The agent needs to discover how to break a problem down into these areas.
StarCraft is a much bigger and more important example. The agent needs to keep that overall goal in mind, as well as the low-level goals and perhaps the planned ramifications of achieving those goals.
Recently I swapped the slot ordering process from a normal distribution (mean and standard deviation) to a Poisson distribution (mean only). This is a simpler way of representing slot ordering. However, according to the onAB blocks results, it is slower to converge. This may be due to the Poisson distribution alone, or to other changes I made at the same time (I think I also included the code for retesting stale policies). I may have to put the normal distribution code back in and run a back-to-back comparison.
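To make the difference concrete, here is a minimal sketch of the two sampling options. All names here are illustrative, not the actual implementation; I'm assuming each slot carries an ordering value that is sampled per episode, with lower values placing the slot earlier.

```python
import math
import random

def sample_order_normal(mean, sd):
    """Normal version: needs two parameters per slot (mean and sd)."""
    return random.gauss(mean, sd)

def sample_order_poisson(mean):
    """Poisson version: needs only the mean. Knuth's inversion method,
    which is fine for the small means a slot ordering would use."""
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# Hypothetical ordering step: sort slots by their sampled value.
slot_means = {"moveFloor": 1.2, "move": 0.4}
order = sorted(slot_means, key=lambda s: sample_order_poisson(slot_means[s]))
```

The trade-off is exactly the one above: the Poisson halves the parameters to learn per slot, but its variance is tied to its mean, so it may explore orderings less flexibly than a normal with an independently-tuned standard deviation.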
Gamma is the size of the elite set. If gamma is too small, the probabilities for the select few samples that make the elites shoot up; too big, and the process takes too long. I have previously examined setting it to the square of the largest slot size (or perhaps the average slot size). The problem is that in the onAB task, the size of a slot can quickly explode. Currently, 3.7% into an onAB run, there are 58 rules in a slot. That’s a (proposed) gamma of 3364! That will take quite some time to gather. Well, maybe not HEAPS of time, but some.
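A quick sketch of the sizing rule and the elite selection it feeds into (function names are mine, and the scored-sample representation is an assumption):

```python
def gamma_from_slots(slot_sizes):
    """Proposed sizing: elite size = (largest slot size) squared."""
    return max(slot_sizes) ** 2

def select_elites(scored_samples, gamma):
    """Cross-entropy style elite selection: keep the top-gamma
    (score, policy) samples by score."""
    return sorted(scored_samples, key=lambda sp: sp[0], reverse=True)[:gamma]

# With 58 rules in the biggest slot, that's 58^2 = 3364 samples to gather.
gamma = gamma_from_slots([58, 12, 7])
```

An average-slot-size variant would shrink this considerably (e.g. an average of 25 rules gives 625 rather than 3364), at the risk of an elite set too small for the big slots.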
This may not be as large a problem once slot splitting is implemented, but it still seems wrong. In a sense, because the sample set is floating and the population size N no longer matters in Cross-Entrobeam learning, the exact value of gamma doesn’t particularly matter either. But values will likely take quite some time to converge.
The current experiment is also testing restricted specialisations, which seem to be slowing specialisation. But I failed to take pruning into account. If a rule’s parent is pruned, then the parent is obviously bad, and the rule itself will likely have a higher probability than its parent (because rules are introduced with the average probability). Maybe further restrictions are required: the rule must have an average or better probability. And when new rules are created, maybe they should receive the same probability as their parent rule.
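A minimal sketch of what those two extra restrictions might look like together, assuming a rule is just a string and probabilities are plain floats (all names hypothetical):

```python
def specialise(parent_rule, parent_prob, slot_avg_prob):
    """Only specialise parents at or above the slot's average probability,
    and give the child its parent's probability instead of the average."""
    if parent_prob < slot_avg_prob:
        return None  # restriction: don't specialise below-average rules
    child_rule = parent_rule + " & extraCondition"  # placeholder specialisation
    return child_rule, parent_prob  # child inherits the parent's probability
```

Inheriting the parent's probability would at least stop a fresh child of a doomed parent from starting out looking better than the parent ever did.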
The system needs a little more tweaking to deal with pruned parents and low probability rules.