I believe I’ve discussed this strategy for Blocks World before, but it may be a viable path for StarCraft too. In the Blocks World example, it might be useful to learn a policy for bringing about an observation, e.g. clear(X). If we want to be able to clear a given block X, we need to move every block off the top of it.
The same could perhaps be done for StarCraft. To bring about an observation such as marine(X) (or whatever the corresponding observation is), we first need to build a barracks and then train a marine. A sequence like that is unlikely to occur through random action selection, so it will have to be learned heuristically. That part is another issue entirely. While I could use the cross-entropy approach, it’ll certainly need tuning to speed it up, and to remove its randomness (by using covering instead).
Anyway, I’ve sidetracked myself. This automatic modularisation can be used to create modules (or options), which the agent will make increasing use of to create further module policies. As for combining these modules for concurrent goals, perhaps they can make use of the cross-entropy distribution (assuming I continue to note the rules’ probabilities). The policy generator distributions of two modules can be directly combined, then normalised, to create a combined policy generator. The combined generator will likely be bigger than either of the two individual generators, but no bigger (and possibly smaller) than the two put together. It is also likely that subsumption will take place during combination, where one rule is a more or less general version of another.
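As a rough sketch of the combination step, assume each policy generator is simply a map from rule to probability. The rule strings and the combine_generators name below are made up for illustration, and only identical rules are merged; proper subsumption of more or less general rules is not modelled here.

```python
from collections import Counter


def combine_generators(gen_a, gen_b):
    """Merge two rule -> probability maps and renormalise so they sum to 1."""
    merged = Counter()
    for gen in (gen_a, gen_b):
        for rule, prob in gen.items():
            # Identical rules share one entry; their probabilities add up.
            merged[rule] += prob
    total = sum(merged.values())
    if total == 0:
        raise ValueError("cannot normalise an empty or all-zero generator")
    return {rule: prob / total for rule, prob in merged.items()}


# Hypothetical module generators (rule strings are placeholders).
worker = {"hasMinerals -> trainWorker": 0.7, "idle(X) -> gather(X)": 0.3}
marine = {"hasBarracks -> trainMarine": 0.6, "idle(X) -> gather(X)": 0.4}

combined = combine_generators(worker, marine)
for rule, prob in combined.items():
    print(f"{prob:.2f}  {rule}")
```

Because the shared rule collapses into a single entry, the combined generator here has three rules rather than four, which matches the "no bigger than the two together" intuition.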
The modules are discovered at the beginning, during an experimentation stage (unless the agent starts off learning from a teacher, in which case it has some extra data to learn from). Given a state, it may be worth exploring the bounds of it. For instance, some blocks are clear, while others aren’t. The agent can attempt to learn a general clear module (if clear(X) then success, else if above(Y,X) & clear(Y) then moveFloor(Y)). The same goes for on and other observations. Once it has a good idea of how these work (100% accuracy among the rules in the module over a timeframe), it can proceed with the problem, utilising these modules over the lower-level observations.
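To make the clear module concrete, here is a minimal hand-written version of the two rules, assuming a state is a map from each block to whatever it sits on (another block or "floor") and that at most one block sits directly on any other. The representation and function names are my own placeholders; the point of the module is that the agent would learn this behaviour rather than have it coded.

```python
def block_on(state, x):
    """Return the block sitting directly on x, or None if x is clear."""
    for block, support in state.items():
        if support == x:
            return block
    return None


def clear(state, x):
    """Apply the clear rules until clear(x) holds; return the moves made."""
    moves = []
    while True:
        y = block_on(state, x)
        if y is None:
            # Rule 1: clear(X) already holds, so the module has succeeded.
            return moves
        # Rule 2: find a Y with above(Y, X) & clear(Y), i.e. the top of the stack.
        while block_on(state, y) is not None:
            y = block_on(state, y)
        state[y] = "floor"  # moveFloor(Y)
        moves.append(("moveFloor", y))


# Stack c-on-b-on-a, with d alone on the floor.
state = {"a": "floor", "b": "a", "c": "b", "d": "floor"}
moves = clear(state, "a")
print(moves)
```

Each pass through the loop fires the second rule once, so the number of moves equals the number of blocks originally stacked above X.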
Ergh. Too many ideas and not enough space to write them down. I’ll try and listify my future plans:
- Implement covering, such that the rule base is dynamically generated. Basically reimplement FOXCS, but maintain the probability distribution of rules.
- As above, use a measure similar to FOXCS accuracy for solidifying, mutating, or removing rules, based on how often they fire.
- Implement a state (and rule) similarity measure for use in generalising across states and rules. State similarity can be used for behavioural cloning (more applicable in StarCraft, using pro replays). Rule similarity can be used in module combination, including subsumption.
- Automated option/module discovery. Hard to tell what to call it. I guess the end product determines its name. Module implies that the behaviour has minimal side effects, which may not be guaranteed. Perhaps use of clean-up rules will be handy.
- Module recombination. Will need a more complex world than Blocks World for this. StarCraft should have ample opportunities for following several modules at once (e.g. create worker and create marine).
- Behavioural cloning. Given a string of states, the agent can use these to create an internal reward function for learning behaviour. Of course, the agent has to know what it is that it’s solving.
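The state similarity measure in the list above is still undefined. One simple candidate, purely as an assumption on my part, is Jaccard overlap between the sets of ground facts that hold in two states. The fact strings and the state_similarity name are placeholders.

```python
def state_similarity(state_a, state_b):
    """Jaccard overlap between two states, each a collection of ground facts."""
    a, b = set(state_a), set(state_b)
    if not a and not b:
        # Two empty states are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two Blocks World states that differ only in which block a sits on.
s1 = {"clear(a)", "on(a,b)", "onFloor(b)"}
s2 = {"clear(a)", "on(a,c)", "onFloor(c)"}

print(state_similarity(s1, s2))
```

These two states are structurally the same up to renaming a block, yet they share only one fact out of five, so they score low. A usable measure would probably need to compare states after generalising over variables, which ties back to the rule similarity idea.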