Previous experiments have shown that the population constant doesn’t particularly matter in Ms. PacMan, assuming infinite iterations, but it does matter in Blocks World, or at least in module creation. One of the problems seems to arise while creating the clear module: the resulting module is sub-optimal because the elite samples can be so unstable.
I think it is safest to leave the population constant at 50, which is a reasonable number of samples for Blocks World. The only problem is Ms. PacMan, which requires many more elites.
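To make the instability concrete, here is a minimal sketch of elite selection. It assumes the elites are simply the top-scoring fraction of the population; the function name select_elites and the elite_fraction parameter are hypothetical and not taken from the actual system. With a population of 50 and a small elite fraction, only a handful of samples drive each update, so a single lucky or unlucky episode can change the elite set noticeably.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_elites(scored: Sequence[Tuple[T, float]], elite_fraction: float) -> List[T]:
    """Return the best-scoring samples (hypothetical helper, not the real system).

    `scored` pairs each sample with its score; higher scores are better.
    `elite_fraction` is an assumed parameter: the share of the population kept.
    """
    if not 0.0 < elite_fraction <= 1.0:
        raise ValueError("elite_fraction must be in (0, 1]")
    # Always keep at least one elite, even for tiny populations.
    n_elite = max(1, math.ceil(len(scored) * elite_fraction))
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sample for sample, _ in ranked[:n_elite]]


# A population of 50 samples whose score is just their index.
population = [(i, float(i)) for i in range(50)]
elites = select_elites(population, elite_fraction=0.1)
```

In this toy setup only five samples survive as elites, which gives a sense of how few observations feed each distribution update at a population constant of 50.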
As noted in the previous post, the current algorithm has specified values for when a rule mutates, is removed, has converged, or can be used to create new modules. These values currently come with no guarantees and will be the first thing to be attacked if presented in a paper.
I need to work on formalising the algorithm and recognising key points where new rules can be created. Certainly, every single possible rule could be created, but that isn’t very prudent and would degrade into nothing more than a search through all possible solutions. Ideally, the agent needs a heuristic search that only expands interesting rules and removes useless ones. When the agent has finished looking at the rules (that is, it cannot find any more interesting rules), it can begin thinking about convergence upon an optimal distribution.
I have finally finished refactoring the system to deal with StringFacts instead of Strings, which cuts down on the String splitting and joining operations performed. Furthermore, StringFacts are much more flexible than Strings, allowing the real work of condition noting and other observations to begin.
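As an illustration of why a structured fact saves work, here is a rough sketch of what such a type might look like. Everything in it is an assumption: the fields, the parse method, and the "predicate(arg,arg)" text format are hypothetical, and the real StringFact class almost certainly differs. The point is only that the string is split once at construction, after which the predicate and arguments can be read directly.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StringFact:
    """Hypothetical structured fact: a predicate name plus its arguments."""

    predicate: str
    arguments: Tuple[str, ...]

    @classmethod
    def parse(cls, text: str) -> "StringFact":
        # Assumed text format: "predicate(arg1,arg2,...)".
        name, _, rest = text.strip().partition("(")
        # Drop the closing bracket, then split the argument list exactly once.
        args = tuple(a.strip() for a in rest.rstrip(")").split(",") if a.strip())
        return cls(name.strip(), args)

    def __str__(self) -> str:
        return f"{self.predicate}({','.join(self.arguments)})"


fact = StringFact.parse("on(a, b)")
```

Because the sketch is frozen and holds a tuple, facts are hashable and can be stored in sets or used as dictionary keys, which is one way the extra flexibility over raw Strings could show up.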
Something I trialled during the testing of the system was removing the averaging procedure, in which a policy is run over three environment episodes and the results are averaged. The learning should still theoretically converge to a good solution (probably the same one arrived at with averaging, assuming such an optimal solution exists), but there is a hiccup. The modularisation and mutation procedures (among others) depend on settling periods during which the agent gets a more clearly defined view of the environment. Because each rule now only sees a third of the states, settling takes three times longer, which has resulted in the agent converging to sub-optimal solutions before the optimal rule for Blocks World can even be created.
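The averaging procedure described above can be sketched as follows. This is a minimal illustration rather than the system's code: evaluate_policy, run_episode, and make_scripted_episode are hypothetical names, and the episode is stubbed with a scripted list of rewards so the example is deterministic.

```python
from statistics import mean
from typing import Callable, Iterable, List


def evaluate_policy(run_episode: Callable[[], float], episodes: int = 3) -> float:
    """Run one policy for `episodes` episodes and return the mean reward.

    `run_episode` is a hypothetical callback that plays a single episode
    with the policy under test and returns its total reward.
    """
    if episodes < 1:
        raise ValueError("episodes must be at least 1")
    rewards: List[float] = [run_episode() for _ in range(episodes)]
    return mean(rewards)


def make_scripted_episode(rewards: Iterable[float]) -> Callable[[], float]:
    """Stand-in for a real environment: replays a fixed list of rewards."""
    it = iter(rewards)
    return lambda: next(it)


averaged = evaluate_policy(make_scripted_episode([1.0, 2.0, 6.0]))
single = evaluate_policy(make_scripted_episode([1.0, 2.0, 6.0]), episodes=1)
```

With episodes=1 the policy is scored on the first episode alone, so each sample exposes its rules to a third as many states as the three-episode version, which matches the slowdown in settling described above.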
Clearly this needs to change. There is no ‘magic’ number at which the agent should create new mutations (or remove old ones, for that matter). I need to develop a formalism that states when a rule is ready to mutate, when a module is ready to be tested, or when the agent can truly believe that there is no better policy to learn.