Reinforcement Learning and Artificial Intelligence (RLAI) research
projects under consideration at the University of Alberta

The ambition of this page is to list RLAI research projects as they
were envisioned at UofA in Fall 2003 and Jan 2005. Its long-term
ambition is to be refactored into other pages or to provide a
readable page listing active projects. See the relevant meeting notes
for details on individual projects.
- From TD learning to TD networks (TD Nets), and why it's a great idea.
- The Grounded World Modeling Project (GWMP). See the announcement
of an ICML Workshop to get the general idea. This is likely to
include work on representing a broad range of world knowledge in a
predictive, sensori-motor form, so that conventional ontologies and
knowledge representation can be grounded in experience. This work can
be broken down into two challenges:
  - Predictive State Representations
  - The bit2bit problem
- Making RL work without assuming state is available,
i.e., without the Markov assumption.
- Off-policy learning with function approximation. It has long been
known that it is difficult to find sound RL algorithms with all three
of these desirable characteristics:
  - Off-policy learning - the ability to learn about one policy
while following another, just as Q-learning can learn the optimal
policy while behaving randomly. This characteristic is critical to
being able to learn a variety of things at once: you can only follow
one policy, but you would like to learn about many, in parallel.
  - Bootstrapping - the ability to estimate some quantity (typically
a prediction) based on an existing estimate, "learning a guess from a
guess," as in temporal-difference learning. Bootstrapping seems
essential for off-policy learning, and in practice it seems essential
in order for on-policy learning to be efficient.
  - Function approximation - the ability to generalize, in at least a
linear sense, from observed states or state-action pairs to
unobserved ones.
For example, many example tasks are known in which Q-learning with
linear function approximation will diverge to infinity with time.
Several ideas have been proposed for solving this problem:
  - Use an averaging function approximator. Is this practical? Can we
get good function approximation with an averager, or is the
additional generality of the full linear case needed?
  - Use a second-order method, such as LSTD(lambda). Some have
suggested that such methods somehow avoid the problem. If so,
probably the key ideas that enable this can be extended to the
conventional first-order setting.
  - Use importance sampling, as in Precup, Sutton & Dasgupta.
  - Two other ideas from Rich's notes.
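The instability referred to above can be reproduced in a few lines with
the well-known two-state counterexample, here in its off-policy TD(0)
form (the specific features, discount, and step size are illustrative
choices, not from this page): two states share one weight w through
linear features x(s1)=1 and x(s2)=2, and off-policy updates repeat only
the s1 -> s2 transition.

```python
# Two-state off-policy divergence sketch: repeated updates of the
# s1 -> s2 transition (reward 0) under linear TD(0) multiply w by
# (1 + alpha * (2*gamma - 1)) each step, which exceeds 1 here.

gamma = 0.9   # discount factor (illustrative)
alpha = 0.1   # step size (illustrative)
w = 1.0       # single shared weight

for t in range(100):
    td_error = 0.0 + gamma * (2 * w) - (1 * w)  # r + gamma*v(s2) - v(s1)
    w += alpha * td_error * 1.0                 # feature of s1 is 1

print(w)  # grows geometrically -> diverges to infinity with time
```

Because the updated value v(s2) = 2w rises faster than the value being
updated, each correction makes the error larger, which is exactly the
interaction of off-policy updating, bootstrapping, and linear function
approximation described above.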
- Generalization Sculpting. What can we do to make function
approximation work better? How can we learn the right biases in a
life-long learning context? Online cross-validation. Adaptive step
sizes.
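One well-known approach to adapting step sizes online is Sutton's IDBD
algorithm, which learns a separate step size for each weight of a
linear predictor. The sketch below is only illustrative of that idea:
the target weights, meta step size, and feature distribution are all
assumptions, not details from this page.

```python
import math
import random

# IDBD sketch (after Sutton, 1992): LMS-style linear regression in
# which each weight's step size alpha_i = exp(beta_i) is itself
# adapted online using a memory trace h_i.

n = 5
w = [0.0] * n                   # learned weights
beta = [math.log(0.05)] * n     # log per-weight step sizes
h = [0.0] * n                   # per-weight memory traces
theta = 0.01                    # meta step size (assumed)

target_w = [1.0, -2.0, 0.5, 0.0, 3.0]  # hypothetical true weights

random.seed(0)
for t in range(2000):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = sum(ti * xi for ti, xi in zip(target_w, x))   # noiseless target
    delta = y - sum(wi * xi for wi, xi in zip(w, x))  # prediction error
    for i in range(n):
        beta[i] += theta * delta * x[i] * h[i]        # meta-level update
        alpha_i = math.exp(beta[i])                   # current step size
        w[i] += alpha_i * delta * x[i]                # base-level update
        h[i] = h[i] * max(0.0, 1.0 - alpha_i * x[i] * x[i]) + alpha_i * delta * x[i]
```

Features whose step sizes have been helpful recently (positive
correlation between the current error and the trace h_i) get their
step sizes increased, which is one concrete way to learn biases
online in a life-long learning setting.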
- Robots learning from interacting with people.
- Policy gradient methods
- Stopping mu.
- Turnpike-horizon idea.
- See also Rich's research
summary (NSERC proposal).