I have finally finished refactoring the system to work with StringFacts instead of Strings, which cuts down on the number of String splitting and joining operations performed. StringFacts are also much more flexible than Strings, so the real work of condition noting and other observations can now begin.
Something I trialled while testing the system was removing the averaging procedure, in which a policy is run over three environment episodes and the results are averaged. In theory the learning should still converge to a good solution (probably the same one reached with averaging, assuming such an optimal solution exists), but there is a hiccup. The modularisation and mutation procedures (among others) depend on settling periods, during which the agent gets a more clearly defined view of the environment. Without averaging, each rule sees only a third as many states, so settling takes three times longer. As a result, the agent has been converging to sub-optimal solutions before the optimal rule for Blocks World can even be created.
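To make the trade-off concrete, here is a minimal sketch of the two evaluation schemes. It is an illustration only, not the system's actual code: `evaluate_policy`, `evaluations_needed`, `fake_episode` and the reward values are all hypothetical names and numbers made up for this example.

```python
from statistics import mean
from typing import Callable


def evaluate_policy(policy: object,
                    run_episode: Callable[[object, int], float],
                    episodes: int = 3) -> float:
    """Run `policy` for `episodes` episodes and return the mean reward.

    episodes=3 corresponds to the averaging procedure; episodes=1 to the
    trial without averaging. (Hypothetical helper, not the real system.)
    """
    return mean(run_episode(policy, i) for i in range(episodes))


def evaluations_needed(target_episodes: int, episodes_per_evaluation: int) -> int:
    """Policy evaluations required before a rule has been observed over
    `target_episodes` episodes (ceiling division)."""
    return -(-target_episodes // episodes_per_evaluation)


# Made-up rewards for three episodes of the same policy.
FAKE_REWARDS = [-10.0, -4.0, -7.0]


def fake_episode(policy: object, episode_index: int) -> float:
    """Stand-in for a real environment episode."""
    return FAKE_REWARDS[episode_index]


averaged = evaluate_policy("some-policy", fake_episode, episodes=3)  # -7.0
single = evaluate_policy("some-policy", fake_episode, episodes=1)    # -10.0

# Assuming a settling period that needs 30 episodes' worth of observations:
with_averaging = evaluations_needed(30, 3)     # 10 evaluations
without_averaging = evaluations_needed(30, 1)  # 30 evaluations
```

Under these assumptions, a settling period measured in episodes of experience needs three times as many policy evaluations once averaging is removed, which matches the slowdown described above.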
Clearly this needs to change. There is no ‘magic’ number at which the agent should create new mutations (or remove old ones, for that matter). I need to develop a formalism that states when a rule is ready to mutate, when a module is ready to be tested, or when the agent can truly believe that there is no better policy to learn.