I have now completed the implementation of the reintegration strategy (assuming no bugs). As far as I’m aware, it works, although the tests have not yet been run through in their entirety.
Briefly, the reintegration strategy works as follows: after the weight update, which occurs at the end of each episode, each policy slot’s distribution is modified by removing N of its worst-performing rules and replacing them with the best-performing rules from its neighbouring slots. The number of replacements gathered from each neighbour is a fixed ratio of the total rule count (currently set at 0.08), reduced by a constant decay rate (0.4) for each additional slot of distance, and rounded to the nearest integer.
Take the hand-coded rules as an example: there are 38 hand-coded rules. Slot x will receive 38*0.08 ≈ 3 rules from slot x+1 and another 3 from slot x-1. It will also receive 38*0.08*0.4 ≈ 1 rule from each of slots x+2 and x-2, and 38*0.08*0.4*0.4 ≈ 0 rules from any slot further away (which halts the gathering process). In total, slot x receives 2*3 + 2*1 = 8 rules, so it removes 8 of its worst-performing rules and replaces them with the 8 neighbouring rules. The newly added rules are each given a probability equal to the average of the top 2N rules at slot x (where N = 8 in this case), and the probabilities are then normalised.
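As a sanity check on the arithmetic above, here is a minimal sketch of the gathering calculation in Python; the names (num_rules, share_ratio, decay) are placeholders rather than identifiers from the actual code.

```python
def shares_per_offset(num_rules=38, share_ratio=0.08, decay=0.4):
    """Rules gathered per neighbour at each offset, until rounding hits zero."""
    shares = []
    offset, amount = 1, num_rules * share_ratio
    while round(amount) > 0:
        shares.append((offset, round(amount)))
        offset += 1
        amount *= decay
    return shares

# For the 38 hand-coded rules this gives [(1, 3), (2, 1)]:
# 3 rules from each of slots x-1 and x+1, 1 from each of x-2 and x+2,
# for a total of 2*3 + 2*1 = 8.
total = 2 * sum(count for _, count in shares_per_offset())
```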
There are special cases to this algorithm, such as x=0 (which has no slots x-y to look at) and, at the other end, x=max-1. Furthermore, neighbouring slots are only used if they are within the same priority level, as the rule-making process differs too drastically among the priorities.
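Putting the pieces together, the per-slot update might look roughly like the sketch below. This is a guess at the structure, not the actual code: it assumes a rule’s probability in its slot’s distribution doubles as the measure of its performance, the names (slot_probs, priority_of, share_counts) are made up for illustration, and the swapping of rule identities into the rulebase is glossed over in favour of the probability bookkeeping.

```python
import numpy as np

def reintegrate(slot_probs, slot_index, priority_of, share_counts):
    """Reintegration step for one slot (illustrative sketch only).

    slot_probs   : list of 1-D numpy arrays, one probability vector per slot
    slot_index   : index x of the slot being updated
    priority_of  : maps a slot index to its priority level
    share_counts : [(offset, count), ...] as returned by shares_per_offset()
    """
    probs = slot_probs[slot_index].copy()
    incoming = []

    # Gather the best-performing rules from neighbours on both sides,
    # skipping slots that fall off either end or sit in another priority level.
    for offset, count in share_counts:
        for neighbour in (slot_index - offset, slot_index + offset):
            if not 0 <= neighbour < len(slot_probs):
                continue
            if priority_of(neighbour) != priority_of(slot_index):
                continue
            incoming.extend(np.argsort(slot_probs[neighbour])[-count:].tolist())

    n = len(incoming)  # N: 8 in the hand-coded example above
    if n == 0:
        return probs

    # Remove the N worst-performing rules at this slot; the incoming rules
    # take their places in the rulebase (not shown here) and are each given
    # a probability equal to the average of the slot's top 2N rules.
    worst = np.argsort(probs)[:n]
    probs[worst] = np.sort(probs)[-2 * n:].mean()
    return probs / probs.sum()  # renormalise
```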
This algorithm promotes better rules by removing useless ones, which should hopefully speed up convergence, although I am unsure whether it actually will. The changes necessary for this algorithm to work include saving the rulebases, since reintegration modifies them and they can no longer simply be reloaded from the generator files. A downside to this algorithm is that many of the rule distributions in the policies will end up looking the same, which, in a way, negates the diversity of solutions that cross-entropy brings about. I guess only experiments will show what works best.
I have several experiments to run:
Fired Policy: Testing the fired rules strategy of weight updating. The first experiment was a failure, due to a critical bug, but that has been fixed and the code should be ready to run.
Reintegration – Constant: Testing the reintegration strategy with a constant reintegration amount throughout the entire experiment. This may work well by providing a global boost among the rules, making the sampling of low-probability rules less damaging to the agent’s score.
Reintegration – Decreasing: Testing the reintegration strategy with a decreasing reintegration amount. The decreasing factor will be equal to the slot decay rate and will be applied to the sharing ratio value (0.08), as sketched after this list. This will allow the rules to settle into fixed distributions, decreasing the effect of ‘sameness’ among the distributions.
Reintegration – Delayed: Testing the reintegration strategy only once the distributions have had time to settle (i.e. when performance starts to flatten out). This removes the possibility of losing potentially good rules during the removal process early on.
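For the decreasing variant, the schedule I have in mind might look something like the sketch below; the slot_decay value is only a placeholder, since the actual slot decay rate is not given in this post.

```python
def decayed_share_ratio(update_number, base_ratio=0.08, slot_decay=0.9):
    """Sharing ratio after a given number of weight updates (placeholder decay)."""
    return base_ratio * slot_decay ** update_number
```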
There will also be a further three experiments on the regeneration strategy (to be detailed in a later post), as well as any follow-up experiments should two or more of the strategies prove beneficial.