PhD Progress: Mario Problems

I suppose it was inevitable that I would have problems with Mario. I was just hoping that it would be ready by the conference. Perhaps some nasty mock-up will be ready anyway. Hopefully it can learn some rudimentary strategies too.

The problem facing me is one of environment definition. Currently, Mario is defined by jumpOn and other such high-level actions. But when playing the game myself, I don’t perform every move as a jumpOn action. What I’m saying is that the language I have defined constricts the agent: it cannot reach human-level or better performance while bound to jumping on particular things. What would be ideal is a language defined exactly at Mario’s level, in which the agent has 4 actions (3 if you consider left and right the same action in different directions), and each observation of the environment concerns Mario directly and the relations between objects.
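To make the idea concrete, here is a minimal sketch of what such a low-level interface might look like. The action names, the `Relation` structure, and the `observe` fields are all my own illustrative inventions, not the actual Mario benchmark API:

```python
from dataclasses import dataclass

# The four primitive actions: move left, move right, jump, do nothing
# (three if left/right are treated as one move action with a direction).
ACTIONS = ["left", "right", "jump", "noop"]

@dataclass
class Relation:
    """A relational observation, e.g. distance(mario, goomba1) = 3."""
    predicate: str
    args: tuple
    value: float

def observe(state) -> list[Relation]:
    """Return relations between Mario and each nearby object.

    `state` is assumed to expose mario_x, mario_y, and nearby_objects
    (each with name, x, y) -- a hypothetical wrapper, not the real API.
    """
    relations = []
    for obj in state.nearby_objects:
        relations.append(Relation("distance", ("mario", obj.name),
                                  abs(obj.x - state.mario_x)))
        relations.append(Relation("above", ("mario", obj.name),
                                  float(state.mario_y < obj.y)))
    return relations
```

The point is that nothing here names a behaviour like jumpOn; strategies would have to emerge from sequences of the four primitives.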

When using such low-level actions, Mario would have to learn higher-level behaviour, like jumping on an enemy. But to do that the agent needs a reward, or incentive. Unfortunately, I don’t think a reward is provided when an enemy is killed, and even if one is, it pales in comparison to the reward gained from time. The behaviour may be achievable using modules: !enemy(X), which removes a particular enemy.
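One way around the missing incentive would be to shape the reward myself: layer a bonus for each enemy removed on top of whatever the environment returns. A rough sketch, where the bonus size and the enemy-count bookkeeping are assumptions of mine rather than anything the benchmark provides:

```python
def shaped_reward(env_reward: float,
                  enemies_before: int,
                  enemies_after: int,
                  kill_bonus: float = 10.0) -> float:
    """Augment the raw environment reward with a per-kill bonus.

    Counts enemies before and after the step; any drop is assumed to
    be a kill. kill_bonus is an arbitrary illustrative value that would
    need tuning against the time-based reward.
    """
    kills = max(0, enemies_before - enemies_after)
    return env_reward + kills * kill_bonus
```

This is exactly the kind of incentive a !enemy(X) module would need in order to learn when removing an enemy is worthwhile.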

Another problem is policy determinism. Under the current system, Mario immediately moves to the left-most part of the level and jumps up and down on the left edge because it is closest. The only way for him to progress is to find closer things to jump on, rather than simply proceeding right like a human would. The policies are formed from probabilistic rule sets, so perhaps it would be smarter to make the policies themselves stochastic as well. I got around this issue using weighted sums of decisions in Ms. PacMan, but in Mario the lines are less blurred: if an enemy is approaching, you want to stomp it, not ‘almost’ stomp it.
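One form a stochastic policy could take: instead of always firing the best-matching rule, sample a rule in proportion to a softmax over its weight, with a temperature to control how close to deterministic it stays (low temperature keeps the stomp-or-don't decisions sharp). The rule names and weights below are illustrative only:

```python
import math
import random

def sample_rule(rules, temperature=1.0, rng=None):
    """Sample a rule name with probability proportional to
    exp(weight / temperature). `rules` maps rule name -> weight."""
    rng = rng or random.Random()
    names = list(rules)
    exps = [math.exp(rules[n] / temperature) for n in names]
    total = sum(exps)
    r = rng.random() * total
    for name, e in zip(names, exps):
        r -= e
        if r <= 0:
            return name
    return names[-1]  # guard against floating-point leftover
```

At high temperature this behaves like the blurred weighted-sum approach from Ms. PacMan; as the temperature drops it converges on the deterministic policy, so the sharpness of decisions is tunable rather than all-or-nothing.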

An alternative is to change behaviour on-the-fly if the agent appears to be getting nowhere (jumping up and down on an edge), perhaps by re-sampling the rule being fired that leads to the repetitive behaviour.
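Detecting "getting nowhere" might look something like this: watch a sliding window of (rule fired, position) pairs, and flag when a single rule has been firing while Mario's position barely changes. The window size and the stuck criterion are assumptions of mine:

```python
from collections import deque

class LoopDetector:
    """Flags when one rule fires repeatedly without progress."""

    def __init__(self, window=20):
        self.history = deque(maxlen=window)

    def record(self, rule, position):
        """Record a step; return True if the agent appears stuck."""
        self.history.append((rule, position))
        if len(self.history) < self.history.maxlen:
            return False
        rules = {r for r, _ in self.history}
        positions = {p for _, p in self.history}
        # Stuck: one rule firing with (near-)constant position.
        # <= 2 positions allows for the up/down of a jump cycle.
        return len(rules) == 1 and len(positions) <= 2
```

When `record` returns True, the offending rule could be excluded and a replacement re-sampled from the rule set, breaking the edge-jumping loop without abandoning the policy as a whole.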

This environment is frustrating me, but who said PhD research was ever easy?