PhD Progress: Mario Problems

I suppose it was inevitable that I would have problems with Mario. I was just hoping it would be ready by the conference. Perhaps some nasty mock-up will be ready anyway. Hopefully it can learn some rudimentary strategies too.

The problem facing me is one of environment definition. Currently, Mario is defined by jumpOn and other such actions. But when playing the game myself, I don’t perform every action as a jump-on action. What I’m saying is that the language I have defined is constricting: the agent cannot fully achieve human-level or better performance when bound to jumping on particular things. What would be ideal is a language defined exactly at Mario’s level, in which the agent has 4 actions (3 if you consider left and right the same action in different directions), and each observation of the environment concerns Mario directly and the relations between objects.

When using such low-level actions, Mario would have to learn higher-level behaviour, like jumping on an enemy. But to do that, the agent needs a reward, or incentive. Unfortunately, I don’t think a reward is provided when an enemy is killed, and even if it is, it pales compared to the reward gained by time. The behaviour may be achievable using modules: !enemy(X), which removes a particular enemy.
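A !enemy(X)-style module could be approximated with reward shaping: a wrapper that pays a bonus whenever an enemy disappears between steps. This is only a sketch under assumptions — the environment interface, state representation, and bonus value are all made up for illustration, not the actual framework.

```python
# Hypothetical reward-shaping wrapper. The wrapped env's step() signature
# and the set-of-object-names state are assumptions for illustration.
class EnemyKillShaping:
    """Pays a shaped bonus whenever an enemy vanishes from the state,
    giving the agent the kill incentive the raw time reward lacks."""

    def __init__(self, env, bonus=50.0):
        self.env = env
        self.bonus = bonus
        self.prev_enemies = set()

    def step(self, action):
        state, reward, done = self.env.step(action)
        enemies = {obj for obj in state if obj.startswith("enemy")}
        # Each enemy removed since the last step earns the bonus
        # (the !enemy(X) module, expressed as shaped reward).
        killed = self.prev_enemies - enemies
        reward += self.bonus * len(killed)
        self.prev_enemies = enemies
        return state, reward, done
```

The wrapper only sees object names, so it stays agnostic about *how* the enemy was removed — stomping, fireballs, or falling off-screen all count, which may or may not be desirable.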

Another problem is policy determinism. Under the current system, Mario immediately moves to the left-most part of the level and jumps up and down on the left edge because it is closest. The only way for him to progress is to find closer things to jump on, rather than simply proceeding right like a human would. The policies are built from probabilistic rule sets, so perhaps it would be smarter to somehow make the policies themselves stochastic as well. I got around this issue using weighted sums of decisions in Ms. PacMan, but in Mario the lines are less blurred. If an enemy is approaching, you want to stomp it, not ‘almost’ stomp it.
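The difference between the deterministic trap and a stochastic policy can be sketched in a few lines. The slot structure and rule names here are invented for illustration; only the two selection schemes matter.

```python
import random

# Sketch of rule selection within a slot. A slot is a list of
# (rule, probability) pairs whose probabilities sum to 1.
def select_rule(slot, stochastic=True, rng=random):
    if not stochastic:
        # Deterministic: always fire the most probable rule — the
        # behaviour that traps Mario bouncing on the nearest edge.
        return max(slot, key=lambda rp: rp[1])[0]
    # Stochastic: sample proportionally to the rule probabilities,
    # so lower-probability rules occasionally fire and the agent
    # can escape repetitive behaviour.
    r = rng.random()
    cumulative = 0.0
    for rule, prob in slot:
        cumulative += prob
        if r < cumulative:
            return rule
    return slot[-1][0]  # guard against floating-point shortfall
```

Proportional sampling keeps the ‘stomp, don’t almost-stomp’ property within a single decision — each fired rule is still executed fully — while the blending happens across decisions rather than within one, unlike the weighted-sum approach.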

An alternative is to change behaviour on-the-fly if the agent appears to be getting nowhere (e.g. jumping up and down on an edge), perhaps by re-sampling the rule being fired that leads to the repetitive behaviour.
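Detecting ‘getting nowhere’ could look something like the following: track recent positions and flag the fired rule for re-sampling when the agent keeps revisiting the same few spots. The window size, threshold, and position representation are all assumptions.

```python
from collections import deque

# Hypothetical on-the-fly loop detector: if the last `window` positions
# contain very few distinct values, the agent is probably stuck
# (e.g. jumping up and down on one edge).
class LoopDetector:
    def __init__(self, window=20, max_distinct=3):
        self.window = window              # recent positions to remember
        self.max_distinct = max_distinct  # few distinct positions => stuck
        self.history = deque(maxlen=window)

    def update(self, position):
        """Record a position; return True once behaviour looks repetitive."""
        self.history.append(position)
        if len(self.history) < self.window:
            return False
        return len(set(self.history)) <= self.max_distinct
```

When `update` returns True, the agent could discard the currently firing rule and re-sample from its slot.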

This environment is frustrating me, but who said PhD research was ever easy?

PhD Progress: Formalising Mario into a Relational Environment

One of the main problems currently facing me with implementing Mario is how to represent it. There are different tracks available to me, and I don’t really know which one would be best.

The first and possibly easiest track is to do the same with Mario as I did with PacMan – turn possible groups of actions into a single action (toDot, etc). In Mario, this would be like jumpOnGoomba, collectCoin. The problem with this is that it is practically defining the environment for a baby to learn in.

The second step up the generalisation ladder is to use only general actions that are still clear in their intent: jumpOver, jumpOnto, shoot. These actions are enough to play Mario suitably (there is still the problem of obstacles, when exactly to run, grabbing, hitting boxes, etc., but they should be enough to evidence some form of play). These actions can theoretically be achieved by implementing a slot-splitting mechanic, which splits a slot into slots of the identical action but with different type predicates in their rules. So there will be a jumpOnto(coin) slot and a jumpOnto(enemy) slot. The problem with this is type hierarchies. All things in Mario are objects; some are coins, some enemies. Some enemies are also non-jumpable, which could be an observation the environment provides. Who knows… perhaps the agent can learn which enemies are not jump-ontoable.
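The slot-splitting mechanic itself is simple to sketch: group a slot’s rules by the type predicate they mention. The rule strings and type names below are illustrative assumptions, not the actual rule language.

```python
# Sketch of slot-splitting: one general-action slot becomes one slot
# per type predicate appearing in its rules.
def split_slot(action, rules):
    """rules: list of (type_predicate, rule) pairs for a single action.
    Returns a dict mapping each specialised slot name to its rules."""
    slots = {}
    for type_pred, rule in rules:
        slot_name = f"{action}({type_pred})"
        slots.setdefault(slot_name, []).append(rule)
    return slots
```

A real implementation would have to walk the type hierarchy (object → coin/enemy → jumpable/non-jumpable enemy) rather than treat predicates as flat labels, which is exactly where the difficulty described above comes in.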

Note that this system was theorised for Ms. PacMan once too, and may have to be implemented before Mario is. The only question is how much the slot-splitting mechanic will slow or degrade learning.

The third step is at keystroke level. There are numerous problems with this, chiefly that the keys are not relationally linked to the objects in the game, save Mario himself. Rules could be created for such things, but not with the rules as currently constructed, which use action objects to specialise the rules.

PhD Progress: PacMan Results and Fair Rule Sampling

Ran some PacMan experiments over the weekend (well, a PacMan experiment). After the refactoring, Ms. PacMan is running much faster, though that is probably also because the experiments are level-limited.

Anyway, I had some results from previous runs, but there is little point in displaying their graphs. Suffice it to say, early convergence (where the total update size is 1 × the alpha update) and low population constants (10) are not the best for learning.

Here are some results from an agent learning with a convergence value of 0.1 (which seems reasonable, given the graph) and a population constant of 30. The selection ratio was set at 0.1, but perhaps it could be even lower with a greater population constant.

Something else I need to experiment with is a more precise measure of population constants and testing amount. At the moment, the experiment uses the population constant directly, which can result in small elite update sizes that may not be ideal. To fairly test every rule in the policy generator, each rule needs an equal chance of being included in a meta-iteration (preferably earlier rather than later). This relates to a maximum value for every slot, where the value of a slot is given by its number of rules squared. This gives every rule an equal chance:

For example, a slot has 3 rules, all initially at probability 0.33. Assuming Bernoulli sampling, fairly sampling each rule would require 1/0.33 = 3 samples per rule, totalling 9 samples altogether. Given a fair distribution, each rule will have been sampled 3 times; if not, each was at least given a fair chance.
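The arithmetic above generalises directly: with n rules at uniform probability 1/n, one expected hit per rule costs 1/(1/n) = n samples, so a fair pass over the slot costs n² samples in total.

```python
# Worked version of the fairness bound: n rules, uniform probability 1/n,
# n samples expected per rule => n * n samples for the whole slot.
def fair_sample_count(num_rules):
    prob = 1.0 / num_rules                # uniform initial rule probability
    samples_per_rule = round(1.0 / prob)  # ~num_rules samples per rule
    return num_rules * samples_per_rule   # = num_rules ** 2 in total
```

This is what makes the squared slot value grow quickly — a 10-rule slot already needs 100 samples for one fair pass.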

The reasoning behind this is that sometimes a rule simply has no chance of being re-recognised as useful. Say a rule has a sampling probability of 0.1 in a slot of x rules. Odds are it will only be sampled one time in 10, placing it in the elite solutions sample once for every 10 samples. If the elite sample size is 10, then the rule will only ever average a sample distribution value of 0.1, meaning it can never grow unless it happens to be sampled more often through random sampling. I may be off in a few areas here, but the system makes sense to me.
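The stagnation argument can be checked numerically, assuming a standard cross-entropy-style update (my assumption here, not a statement about the actual update rule in use): p ← (1 − α)·p + α·(elite count / elite size). If a rule lands in the elite sample at exactly the rate it is drawn, its probability is a fixed point and never moves.

```python
# Cross-entropy-style probability update (assumed form, for illustration):
# the rule's new probability blends its old value with its rate of
# appearance in the elite sample.
def ce_update(p, elite_rate, alpha=0.6):
    return (1 - alpha) * p + alpha * elite_rate

# A rule drawn with probability 0.1 that appears in the elite at rate 0.1
# stays at 0.1 forever — it can never grow without extra sampling.
p = 0.1
for _ in range(100):
    p = ce_update(p, elite_rate=0.1)
```

Only a rule whose elite appearance rate exceeds its sampling probability can climb, which is exactly why every rule needs a fair number of draws in the first place.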

Granted, the square of a slot’s size can get big, but the pruning mechanic should keep slot sizes low.