Reinforcement Learning and
Artificial
Intelligence (RLAI)
Some
RLAI research projects under consideration at the University of Alberta
The immediate
ambition of this
page is to list RLAI research projects as they were envisioned at UofA
in Fall 2003 and Jan 2005. Its long-term
ambition is to be refactored into other pages or to provide a
readable page listing active projects.
The Grounded World Modeling Project (GWMP). See the announcement
of an ICML Workshop to get the general idea. This is likely to
include work on:
options
Predictive State Representations
The bit2bit problem
This work can be broken down into two challenges:
Representing
a broad range of world knowledge in a predictive, sensori-motor form,
so that conventional ontologies and knowledge representation can be
avoided (see the sketch after this list).
Making RL work without assuming state is available,
i.e., without the Markov assumption.
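As a toy illustration of the first challenge (this sketch is ours, not part of the project, and all names in it are hypothetical), a piece of knowledge in predictive, sensori-motor form can be written as a "test": an action sequence together with the observation predicted to follow it, with the prediction estimated directly from experience.

    # Toy sketch only: knowledge as an action-conditional prediction about observations,
    # in the spirit of predictive state representations. It ignores history/state entirely,
    # so it illustrates only the form of the knowledge, not a working PSR.
    class PredictiveTest:
        def __init__(self, actions, observation):
            self.actions = tuple(actions)   # the way of behaving, e.g. ("forward", "forward")
            self.observation = observation  # the sensation predicted to follow, e.g. "bump"
            self.tries = 0
            self.successes = 0

        def update(self, executed_actions, resulting_observation):
            # Learn only from experience in which the test's action sequence was followed.
            if tuple(executed_actions) == self.actions:
                self.tries += 1
                if resulting_observation == self.observation:
                    self.successes += 1

        def prediction(self):
            # Estimated probability that doing self.actions yields self.observation.
            return self.successes / self.tries if self.tries else 0.5

For example, PredictiveTest(("forward", "forward"), "bump") expresses "if I go forward twice I will feel a bump" purely in terms of actions and sensations, with no ontology of walls or objects.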
Off-policy learning with function approximation. It has long been
known that it is difficult to find sound RL algorithms with all three
of these desirable characteristics:
Off-policy - the ability to learn about one policy
while following another, just as Q-learning can learn the optimal
policy while behaving randomly. This characteristic is critical to
being able to learn a variety of things at once. You can only follow
one policy, but you'd like to learn about many, in parallel.
Bootstrapping
- estimating some quantity (typically a prediction) based on an
existing estimate. "Learning a guess from a guess," as in
temporal-difference learning. Bootstrapping seems essential for
off-policy learning, and in practice it seems essential in order for
on-policy learning to be efficient.
Function approximation
- the ability to generalize, in at least a linear sense, from observed states
or state-action pairs to unobserved ones.
For example, many example tasks are known in which Q-learning with
linear function approximation will diverge to infinity with time.
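To make the combination concrete, here is a minimal sketch (ours; the environment interface and feature function are assumptions of the illustration, not specified here) of semi-gradient Q-learning with linear function approximation, learning off-policy from a random behaviour policy. Nothing in the update rule guards against divergence; on known counterexamples such as Baird's, the weights can grow without bound.

    import numpy as np

    # Sketch only: the off-policy + bootstrapping + linear function approximation combination.
    # env.reset() -> state, env.step(a) -> (next_state, reward, done), and
    # features(state, action) -> 1-D numpy array are assumed interfaces.
    def linear_q_learning(env, features, num_actions, alpha=0.01, gamma=0.99, episodes=100):
        w = np.zeros_like(features(env.reset(), 0), dtype=float)
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = np.random.randint(num_actions)            # behaviour policy: uniform random
                s_next, r, done = env.step(a)
                x = features(s, a)
                # Bootstrapping: the target is built from the current estimate at s_next,
                # for the greedy (target-policy) action -- hence off-policy.
                target = r if done else r + gamma * max(w @ features(s_next, b)
                                                        for b in range(num_actions))
                w += alpha * (target - w @ x) * x             # semi-gradient update
                s = s_next
        return w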
Several ideas have been proposed for solving this problem:
Use an averaging function approximator. Is this
practical? Can we get good FA with an averager? Or is the additional
generality of the full linear case needed?
Use a second-order method, such as LSTD(lambda) (a sketch appears after
this list). Some have suggested that such methods somehow avoid the
problem. If so, probably the key ideas that enable this can be extended
to the conventional first-order setting.
Use importance sampling, as in Precup, Sutton & Dasgupta.
Two other ideas from Rich's notes.
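As a concrete point of reference for the second-order idea above, here is a minimal sketch of Boyan-style, on-policy LSTD(lambda) for linear value prediction (our sketch; the regularization term and interfaces are choices of the illustration, and whether its stability carries over to the off-policy case is exactly the open question).

    import numpy as np

    # Sketch only: LSTD(lambda) for estimating a linear state-value function.
    # transitions: list of (s, r, s_next, done); phi(s) -> 1-D feature vector.
    def lstd_lambda(transitions, phi, gamma=0.99, lam=0.9, reg=1e-6):
        n = len(phi(transitions[0][0]))
        A = np.zeros((n, n))
        b = np.zeros(n)
        z = np.zeros(n)                                  # eligibility trace
        for s, r, s_next, done in transitions:
            z = gamma * lam * z + phi(s)
            next_phi = np.zeros(n) if done else phi(s_next)
            A += np.outer(z, phi(s) - gamma * next_phi)
            b += z * r
            if done:
                z = np.zeros(n)                          # traces do not cross episodes
        # Solve A theta = b; the small regularizer keeps A invertible.
        return np.linalg.solve(A + reg * np.eye(n), b)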
Generalization Sculpting. What can we do to make function
approximation work better? How can we learn the right biases in a
life-long learning context? Online cross validation. Adaptive step
sizes.
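One well-known adaptive step-size method along these lines is Sutton's IDBD, which learns a separate step size for each weight of a linear predictor. A minimal sketch (the meta step size value and the names are choices of this illustration):

    import numpy as np

    # Sketch only: one IDBD step for the linear predictor y_hat = w @ x.
    # w: weights, h: auxiliary memory, beta: per-weight log step sizes (all same shape as x);
    # theta is the meta step size.
    def idbd_update(w, h, beta, x, y, theta=0.01):
        delta = y - w @ x                              # prediction error
        beta = beta + theta * delta * x * h            # meta-learning: adapt the log step sizes
        alpha = np.exp(beta)                           # per-weight step sizes
        w = w + alpha * delta * x                      # LMS-style update with per-weight step sizes
        h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
        return w, h, beta

    # Typical use: start w and h at zeros and beta at the log of a small initial step size,
    # then call idbd_update once per training example (x, y).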