Another issue has arisen regarding the interaction between covering and policies. FOXCS can use covering however it likes, as it is based on a rule voting system. But I intend for this work to be based on a strictly deterministic policy, or perhaps a probabilistic one, which is nearly the same anyway.
The problem lies with triggering covering when using CE to generate the policy. A policy generated by CE will only use some of the rules within the rulebase generated by covering. This could be a problem, as covering may be triggered again if a bad rule is chosen.
A possible solution to this, or at least a step in the right direction, is to modify the whole policy creation process. A policy is created which contains at least one rule for each possible action in the environment, so the CE process now applies only to optimising the conditions for an action.
However, typical policies contain multiple copies of the same action, with different parameters (the onAB policy contains three rules: one has a constant-ised action, the other two are moveFloor actions regarding blocks on top of a and b). To get around this problem, the following strategy is proposed.
Use an adaptive policy (with firing rules) of initial size |A|, each slot corresponding to an action in the state description. When covering is triggered, these slots are filled, using maximally generalised covering where possible. Once a suitable rule is found for the action (the weighting of a single rule is > 1-epsilon), fix the slot as that rule and create another slot of the same action, using all other rules from the slot. Note that we want the system to converge quickly, so the previously proposed CE algorithm using a sliding window for updates may work well.
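The slot-splitting step can be sketched in Python as below. This is only a rough illustration under my own assumptions: the names `Slot`, `initial_policy` and `split_converged_slots` are hypothetical, rules are plain strings, each slot holds a weight per candidate rule, and the leftover rules are renormalised when they move to the new slot (the notes do not say how their weights should carry over).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    """One policy slot: candidate rules for a single action (hypothetical structure)."""
    action: str
    weights: Dict[str, float] = field(default_factory=dict)  # rule -> weight in this slot
    fixed_rule: Optional[str] = None  # set once the slot has converged


def initial_policy(actions: List[str]) -> List[Slot]:
    """Start with |A| empty slots, one per action; covering fills them later."""
    return [Slot(action) for action in actions]


def split_converged_slots(policy: List[Slot], epsilon: float = 0.1) -> List[Slot]:
    """Fix any slot whose best rule has weight > 1 - epsilon and open a new slot
    of the same action holding the remaining rules.

    Converged slots are updated in place; the returned list is the new policy.
    """
    result = []
    for slot in policy:
        result.append(slot)
        if slot.fixed_rule is not None or not slot.weights:
            continue  # already fixed, or still waiting for covering
        best, weight = max(slot.weights.items(), key=lambda kv: kv[1])
        if weight > 1 - epsilon:
            rest = {r: w for r, w in slot.weights.items() if r != best}
            slot.fixed_rule = best
            slot.weights = {best: 1.0}
            # Assumption: renormalise the leftover rules for the new slot.
            total = sum(rest.values())
            if rest:
                rest = {r: (w / total if total > 0 else 1 / len(rest))
                        for r, w in rest.items()}
            result.append(Slot(slot.action, rest))
    return result
```

The new slot is placed directly after the fixed one here purely for convenience; the ordering of slots is a separate question.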
Note that during this process, new rules are being covered, mutated and optimised. New rules can be created by taking almost-true rules and covering them to fit the current state, or by mutating a rule whose preconditions fire but whose action does not. A mutation such as this removes the erroneous rule and adds mutations of it which fit the current state space. Mutations must follow the goal description in that they only use constants mentioned in the goal.
Regarding the order of slots within an adaptive policy, a specificity measure can be used. It is usually better to check if specific rules fire before looking at the general ones, and if the policy is deterministic, this is especially useful. As such, the more specific a rule is, the higher it will sit within the agent's policy. Specificity can be measured by the # of preconditions (disregarding type and inequality predicates) + the # of constants used in the preconditions + 0.5 for every re-use of a constant. So the rule
on(a,b) & clear(a) = 2 + 2 + 0.5 = 4.5 (two preconditions, two constants, one re-use of a), and the rule
on(X,Y) & on(Y,Z) & clear(X) = 3 (three preconditions, no constants).
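A small Python sketch of the specificity measure, to check the two worked values. Several things here are my assumptions rather than anything settled: rules are written as strings of atoms joined by `&`, constants start with a lowercase letter and variables with an uppercase one, the predicate names in `IGNORED_PREDICATES` are placeholders for the type and inequality predicates, ignored predicates contribute no constants either, and "re-used" is read as every occurrence of a constant after its first.

```python
import re
from collections import Counter

# Placeholder names for type and inequality predicates (hypothetical).
IGNORED_PREDICATES = frozenset({"block", "neq"})

# Matches one atom such as on(a,b): predicate name, then comma-separated arguments.
ATOM = re.compile(r"(\w+)\(([^)]*)\)")


def specificity(rule, ignored=IGNORED_PREDICATES):
    """Score = counted preconditions + distinct constants + 0.5 per constant re-use."""
    atoms = [(name, [a.strip() for a in args.split(",") if a.strip()])
             for name, args in ATOM.findall(rule)]
    counted = [(name, args) for name, args in atoms if name not in ignored]
    # Constants are assumed to begin with a lowercase letter.
    occurrences = Counter(arg for _, args in counted
                          for arg in args if arg[0].islower())
    reuses = sum(count - 1 for count in occurrences.values())
    return len(counted) + len(occurrences) + 0.5 * reuses


def order_by_specificity(rules):
    """Most specific rules first, so they are checked before the general ones."""
    return sorted(rules, key=specificity, reverse=True)


print(specificity("on(a,b) & clear(a)"))            # 4.5
print(specificity("on(X,Y) & on(Y,Z) & clear(X)"))  # 3.0
```

Sorting with `order_by_specificity` puts on(a,b) & clear(a) ahead of on(X,Y) & on(Y,Z) & clear(X), which is the ordering wanted for a deterministic policy.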