With the extensive changes to the system come new ideas and methods of doing things. Because the policy will look radically different to the previous policies, it will need a different method of creation.
In the old CE system, the policy was of fixed size, with each slot containing a large number of random rules to optimise. In the new system (still utilising CE), the policy is of adaptive size, initially set to the number of actions in the state specification. Each slot in this adaptive policy initially contains no rules; slots are then filled with rules covered from the environment, though there is unlikely to be a large number of them. Also, each slot is bound to an action, so every rule within a slot leads to the same action.
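As a rough sketch of the structure described above (not the actual implementation), the adaptive policy can be pictured as one initially empty slot per action. The names `Slot`, `make_policy` and `add_covered_rule`, the example actions, and the tuple representation of a rule condition are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One slot of the adaptive policy, bound to a single action."""
    action: str
    rules: list = field(default_factory=list)  # starts empty, filled by covering

    def add_covered_rule(self, condition):
        # Hypothetical covering hook: store a rule condition once only.
        if condition not in self.rules:
            self.rules.append(condition)


def make_policy(actions):
    """Create one empty slot per action in the state specification."""
    return {action: Slot(action) for action in actions}


# Assumed example actions; the real ones come from the state specification.
policy = make_policy(["move", "stack", "unstack"])
policy["move"].add_covered_rule(("clear", "?X"))
```

The policy size here is simply the number of actions, and every rule stored in a slot implicitly leads to that slot's action.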
A problem with an adaptive policy using action-bound slots is that it has no inherent order of rules, so a deterministic policy will fail if a bad rule is at the top. One possible ordering is by specificity, such that the most specific rules are checked first. This may still cause problems, as a general rule may never be checked, even if it is right for the job. This fault, combined with a Bernoulli distribution, may result in useful slots being turned off.
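The specificity ordering can be sketched as below. It assumes a rule is a (conditions, action) pair and that specificity is simply the number of conditions; both are assumptions for illustration, and `order_by_specificity` and the example conditions are hypothetical.

```python
def order_by_specificity(rules):
    """Return rules sorted so that the most specific ones are checked first.

    Specificity is assumed here to be the number of conditions in a rule.
    """
    return sorted(rules, key=lambda rule: len(rule[0]), reverse=True)


# Hypothetical rules, all bound to the same action.
rules = [
    (("clear(X)",), "move"),
    (("clear(X)", "on(X,Y)", "highest(Y)"), "move"),
    (("clear(X)", "on(X,Y)"), "move"),
]
ordered = order_by_specificity(rules)
```

Under this ordering the single-condition rule is always checked last, which is exactly the weakness noted above: if a more specific rule always fires first, the general rule never gets evaluated.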
The CE distribution can still be utilised for slot ordering: weight each slot by its usefulness, then build an ordered policy by sampling from the slot distribution. A slot may only occur once in a policy, so sampling is done without replacement. Initially every slot (i.e. every action) has an equal chance of being selected for the top of the policy, but as the weights change (through updates from firing rules), more useful slots will tend to be placed near the top, allowing useful actions to be evaluated quickly. This strategy still results in every slot being used, but because the overall number of slots is likely to be low, this shouldn’t be an issue. Perhaps slots with probability < epsilon could be discarded, and slots with probability > 1 - epsilon fixed.
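A minimal sketch of weighted sampling without replacement for slot ordering follows. It assumes each slot carries a strictly positive weight; `sample_slot_order` and the example weights are hypothetical, and the epsilon-based discarding and fixing of slots is left out.

```python
import random


def sample_slot_order(weights, rng=random):
    """Sample an ordering of slots, removing each slot once it is chosen.

    weights maps slot (action) name -> positive weight; higher-weighted
    slots are more likely to land near the top of the policy.
    """
    remaining = dict(weights)
    order = []
    while remaining:
        names = list(remaining)
        chosen = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        order.append(chosen)
        del remaining[chosen]  # a slot may only occur once in a policy
    return order


# Equal weights: every slot is equally likely to be at the top.
uniform_order = sample_slot_order({"move": 1.0, "stack": 1.0, "unstack": 1.0})
```

Every sampled policy is a permutation of all slots, so no slot is ever dropped; only its position changes as the weights change.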
Something to note is that while every slot will be present in the policy, updates will still favour particular slots over one another. This is achieved by only looking at which slots (and rules) were actually used in experimentation. Otherwise, with every slot present in every policy, nothing would effectively be updated.
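One way this could work, sketched under assumptions: record which slots fired in each elite episode, then move only those slots' weights towards their observed firing frequency using the usual smoothed CE-style update. The function `update_slot_weights`, the `step_size` value and the set-per-episode record are hypothetical, not the system's actual update.

```python
def update_slot_weights(weights, elite_fired, step_size=0.6):
    """Update only the slots that fired in at least one elite episode.

    weights:     dict of slot name -> current weight
    elite_fired: list of sets, one per elite episode, of slots that fired
    step_size:   assumed smoothing parameter for the update
    """
    new_weights = dict(weights)
    if not elite_fired:
        return new_weights
    touched = set().union(*elite_fired)
    for slot in touched:
        # Fraction of elite episodes in which this slot actually fired.
        freq = sum(slot in fired for fired in elite_fired) / len(elite_fired)
        new_weights[slot] = step_size * freq + (1 - step_size) * weights[slot]
    return new_weights


weights = {"move": 0.5, "stack": 0.5, "unstack": 0.5}
updated = update_slot_weights(weights, [{"move"}, {"move", "stack"}])
```

Slots that never fired (here `unstack`) keep their previous weight, even though they were present in every sampled policy.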
A future problem is how to deal with the dynamics of covering and updates. However, I will defer this until I’m at that stage in the code.