PhD progress: This week’s work

I’ve finished reading the book, so now I will be reading related papers and articles. I will keep a record of which articles I read, along with comments and any ideas each paper inspires.

Using background knowledge to speed reinforcement learning in physical agents
This paper talks about Icarus, a hierarchical reactive reinforcement learning agent. The agent is given a general hierarchy describing what to do, and it fine-tunes how to carry those behaviours out within the hierarchy. The paper uses the example of driving a car on an infinite highway, which could serve as a possible example domain for my own work. The paper also looks at what happens when the agent’s knowledge is obscured by restricting the number of questions it can ask, rendering the domain a POMDP. The results show the hierarchical Icarus agent performing better than a general reinforcement learner, even after a great number of iterations, which highlights the benefits of domain knowledge for learning.

Relational Reinforcement Learning
The thesis that started my interest (sort of). Though I only skimmed it, it raised important issues about environments. How do you convert a typical environment into relations? Take Digger, for example: the player has a position and the state is laid out in a coordinate-based fashion. The answer lies in how the agent receives the information. The environment appears to send the agent the answers to a set of questions it can ask (closest emerald, is there a monster in my line of fire, etc.). I am beginning to understand that the language bias is indeed the strongest form of guidance an agent can receive.

In setting up my environments, I need to store them in whatever way is convenient, but give the agent relational observations to work with. This lets the agent operate on abstractions rather than directly on the environment itself. However, there is the problem of numbers. For instance, when measuring the distance between the agent and something, a raw number would usually be given, but first-order logic doesn’t really handle raw numbers. This sort of problem has been addressed by machine learning algorithms in Weka, so I’ll need to investigate some of them. I have a feeling it will involve typed logic. E.g. for distanceBetween/3: distanceBetween(agent, wall, 5.3) could be abstracted as distanceBetween(agent, X, numberLessThan(6)) or distanceBetween(agent, X, numberBetween(2.5, 7.5)). It will certainly be a stumbling block.
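
To make this concrete for myself, here is a minimal sketch of how a raw measurement might be abstracted into interval-style predicates before the agent sees it. The predicate names and thresholds are just the illustrative ones from above, not taken from any paper or library.

    # Minimal sketch: abstract the numeric argument of a relational fact into
    # interval predicates, so the agent only ever sees symbolic observations.
    # Thresholds and predicate names are illustrative placeholders.
    def abstract_number(value, thresholds=(2.5, 6.0, 7.5)):
        """Map a raw number onto symbolic interval predicates."""
        preds = [f"numberLessThan({t})" for t in thresholds if value < t]
        preds += [f"numberBetween({lo}, {hi})"
                  for lo, hi in zip(thresholds, thresholds[1:])
                  if lo <= value <= hi]
        return preds

    def relational_observation(subject, obj, distance):
        """Turn a raw measurement into a list of relational facts."""
        return [f"distanceBetween({subject}, {obj}, {p})"
                for p in abstract_number(distance)]

    # distanceBetween(agent, wall, 5.3) comes out as, e.g.,
    # distanceBetween(agent, wall, numberLessThan(6.0)) and
    # distanceBetween(agent, wall, numberBetween(2.5, 6.0)).
    for fact in relational_observation("agent", "wall", 5.3):
        print(fact)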

Combining Model-Based and Instance-Based Learning for First Order Regression
This publication talks about Trendi, the child of TG and RIB. The paper states that it performs better than both TG and RIB, but slightly worse than KBR; however, it runs much faster than KBR. Trendi could be a possible starting point for my own implementation.

Transfer Learning in Reinforcement Learning Problems Through Partial Policy Recycling
This publication outlines the TGR algorithm, an extension of the TG algorithm that deals with transfer learning and concept drift. The results were quite good, making transfer learning a viable option. The learner’s flexibility comes from additional tree-restructuring operations, which allow it to re-adapt its tree.

TD(λ) networks: temporal-difference networks with eligibility traces
This paper talks about using a predictive representation as a solution to the POMDP problem: it attempts to predict future observations using a network of interrelated predictions. The paper was a bit math-intensive (or I was just getting tired), so I didn’t follow the exact algorithm.

Reinforcement Learning in Relational Domains: A Policy-Language Approach
This paper concerns the LRW-API learner. This approximate policy iteration technique uses several tricks to achieve fast learning at low computational cost.
One of the tricks is policy rollout (previously seen in the book, but never fully grasped). Policy rollout computes π^, an approximation of π’, the improved policy (which would normally be computed by iterating between π and Vπ(s)). The approximation is computed by estimating Qπ(s,a) for every action: for each action, take w trajectories of length h that start with that action and then follow π. The estimated Q-value is the average of the cumulative discounted rewards obtained over those trajectories (remembering that rewards obtained later are discounted more heavily than earlier ones). The w samples average out the transition probabilities, and the horizon h accounts for long-term reward.
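
To check my understanding, here is a rough sketch of the rollout estimate itself. The environment interface, the toy chain domain, and the parameter values are placeholders of my own, not the paper’s.

    # Rough sketch of policy rollout: estimate Qpi(s, a) for each action by
    # taking w trajectories of horizon h that start with a and then follow pi.
    import random

    def rollout_q(env, state, actions, policy, w=10, h=20, gamma=0.9):
        """Return approximate Qpi(state, a) for every action a."""
        q_hat = {}
        for a in actions:
            total = 0.0
            for _ in range(w):                      # w samples average out transitions
                s, act, ret, discount = state, a, 0.0, 1.0
                for _ in range(h):                  # h captures long-term reward
                    s, r = env.sample_next(s, act)  # stochastic step and reward
                    ret += discount * r             # later rewards discounted more
                    discount *= gamma
                    act = policy(s)                 # after the first step, follow pi
                total += ret
            q_hat[a] = total / w
        return q_hat

    class ChainEnv:
        """Toy 1-D chain: states 0..9, actions -1/+1, reward 1.0 for reaching 9."""
        def sample_next(self, s, a):
            s2 = max(0, min(9, s + a + random.choice([0, 0, -1])))  # noisy move
            return s2, (1.0 if s2 == 9 else 0.0)

    pi = lambda s: +1                              # base policy: always move right
    q = rollout_q(ChainEnv(), 0, [-1, +1], pi)
    print(q, max(q, key=q.get))                    # the improved action at state 0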

Policy rollout is used in the Improved-Trajectories procedure, which creates n trajectories of length h starting from random initial states. These trajectories store more than just the action selected by π; they also store the Q^-values (approximate Q-values) for every action at each state along the trajectory. For example, at step i, the information recorded about the state is (s, π(s), {Q^(s,a1), …, Q^(s,am)}). The Q^-values are filled in by the rollouts, allowing an approximate improved policy π^ to be extracted from them. Storing the Q^-values also allows the learner to make trade-offs, such as avoiding states with heavy penalties for mistakes.
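
And a correspondingly rough sketch of the Improved-Trajectories bookkeeping, where each visited state records π’s choice together with the Q^-values for every action. The interfaces are again my own placeholders; estimate_q would be something like the rollout_q sketched above.

    # Sketch: collect n trajectories of length h, each step recording
    # (state, pi(state), {a: Q^(state, a)}) so that pi^ can later be read off
    # the stored Q^-values. Names and interfaces are placeholders.
    import random

    def improved_trajectories(env, initial_states, actions, policy, estimate_q,
                              n=5, h=20):
        """estimate_q(env, s, actions, policy) -> {a: Q^(s, a)}, e.g. rollout_q."""
        trajectories = []
        for _ in range(n):
            s = random.choice(list(initial_states))    # random initial state
            steps = []
            for _ in range(h):
                q_hat = estimate_q(env, s, actions, policy)
                steps.append((s, policy(s), q_hat))    # keep all Q^-values
                s, _ = env.sample_next(s, policy(s))   # continue the trajectory
            trajectories.append(steps)
        return trajectories

    # Extracting pi^ is then argmax over the stored Q^(s, a) at each recorded
    # state, and the full Q^-values expose states where a mistake is costly.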

A further minor technique is the use of predicates defined relative to the goal. These take the form gsomething(X,Y) for goal predicates (gon(a,b) means that a is on b in the goal state), and csomething(X,Y) for comparison predicates (con(a,b) means that a is on b in both the goal state and the current state).
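
A quick sketch of how I imagine these derived predicates being produced from the current and goal states; representing states as sets of ground facts is my own choice here.

    # Sketch: derive goal (g-) and comparison (c-) predicates from the current
    # and goal states, each represented as a set of ground facts.
    def goal_and_comparison_facts(current, goal):
        gfacts = {f"g{fact}" for fact in goal}            # gon(a,b): true in the goal
        cfacts = {f"c{fact}" for fact in goal & current}  # con(a,b): true in both
        return gfacts | cfacts

    current = {"on(a,b)", "on(b,table)", "on(c,table)"}
    goal = {"on(a,b)", "on(c,a)"}
    print(sorted(goal_and_comparison_facts(current, goal)))
    # -> ['con(a,b)', 'gon(a,b)', 'gon(c,a)']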

Taxonomic syntax is a language for denoting sets of objects. The syntax is made up of predicates (with a maximum arity of 2) and variables. For instance, the class expression (on on-table) denotes the set of blocks that are on blocks on the table (the blocks directly above the blocks that sit on the table). The * operator denotes a chain of objects: (on* a) denotes all blocks above a (linked by the on relation). In a similar vein, (min on) denotes all objects that are on something but have nothing on them (a minimal chain).
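
To check my reading of the syntax, here is a little sketch of evaluating class expressions over a blocks-world on relation. The set-based representation and helper names are mine, not the paper’s.

    # Sketch: evaluating taxonomic class expressions over a blocks world.
    # 'on' is a set of (x, y) pairs meaning x is on y.
    def image(relation, targets):
        """(R C): objects x with R(x, y) for some y in the class C."""
        return {x for (x, y) in relation if y in targets}

    def image_star(relation, targets):
        """(R* C): objects linked to C by a chain of one or more R steps."""
        result, frontier = set(), set(targets)
        while True:
            new = image(relation, frontier) - result
            if not new:
                return result
            result |= new
            frontier = new

    # c is on b, b is on a, a is on the table; d is on the table.
    on = {("c", "b"), ("b", "a"), ("a", "table"), ("d", "table")}
    on_table = image(on, {"table"})       # blocks on the table: {'a', 'd'}
    print(image(on, on_table))            # (on on-table) -> {'b'}
    print(image_star(on, {"a"}))          # (on* a) -> {'b', 'c'}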

An example of a taxonomic decision-list policy put together is as follows. The goal is clear(red), so a policy of the following form will solve any such problem:
putdown(x1) : x1 ∈ holding
pickup(x1) : x1 ∈ clear, x1 ∈ (on* (on red))
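
A rough sketch of how such a decision list might be evaluated against a concrete state. The state encoding and helpers are my own placeholders, and I have treated on* as allowing zero steps so that the block directly on red also counts as being above it; the rule bodies mirror the two rules above.

    # Sketch: evaluating the two-rule decision list above for the goal clear(red).
    def image(relation, targets):
        """(R C): objects x with R(x, y) for some y in the class C."""
        return {x for (x, y) in relation if y in targets}

    def image_star(relation, targets):
        """(R* C): objects linked to C by zero or more R steps."""
        result, frontier = set(targets), set(targets)
        while frontier:
            frontier = image(relation, frontier) - result
            result |= frontier
        return result

    def choose_action(on, clear, holding):
        """First matching rule wins, as in a decision list."""
        for x1 in holding:                                 # putdown(x1) : x1 in holding
            return ("putdown", x1)
        above_red = image_star(on, image(on, {"red"}))     # (on* (on red))
        for x1 in clear & above_red:                       # pickup(x1) : clear and above red
            return ("pickup", x1)
        return None                                        # clear(red) already holds

    on = {("b", "red"), ("c", "b"), ("red", "table")}
    print(choose_action(on, clear={"c"}, holding=set()))   # ('pickup', 'c')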

This is getting a bit big, so I’ve moved it into PowerPoint format, stored in my Google Docs.

An extra note: if I implement a bootstrapping system similar to the one used in LRW-API, I could use the bootstrapping process to help realise sub-goals in a hierarchical manner.