First, a quick guide to the highlights, roughly in order of the talks' potential current interest:

- The Future of Artificial Intelligence: UBC, Puerto Rico, AI&Humanity

- Reinforcement Learning Tutorial at NIPS-2015

- Reinforcement Learning and Psychology
- Emphatic TD learning

- Learning About Sensorimotor Data

- Toward Learning Human-level Predictive Knowledge
- Deconstructing Reinforcement Learning

- New gradient-TD algorithms

- Critterbot project

- Micro-stimulus model of Dopamine at CalTech
- Tracking talk at ICML'07

- Experience-Oriented Artificial Intelligence
- overview of predictive representations
- Step-size adaptation: IDBD

- Provocative remarks at the DARPA Cognitive Systems Conference, 5/20/05
- RL Past, Present, and Future (ICML/COLT/UAI 7/25/98, SEAL98)
- RL's Computational Theory of Mind 2/14/03 Rutgers/Indiana

- Artificial Intelligence Should Be About Predictions 12/7/01 AT&T, CEC2000
- From MDPs to AI 5/14/03 U Alberta

- Overcoming the Curse of Dimensionality with RL (MIT ORC 4/19/01)
- From Reflex to Reason (Cornell 12/8/00)
- Temporal Difference Networks, presented at NIPS-04
- Constructive Induction Needs a Methodology based on Continuing Learning (ICML94)
- RL Tutorial 7/23/98

The Future of Artificial Intelligence (University of British Columbia, Feb 25, 2016)

Video.
Pdf slides.

When mankind finally comes to understand the principles of intelligence and how they can be embodied in machines, it will be the most important discovery of our age, perhaps of any age. In recent years, with progress in deep learning and other areas, this great scientific prize has begun to appear almost within reach. Artificial superintelligences are not imminent, but they may well occur within our lifetimes. The consequences, benefits, and dangers for humanity have become popular topics in the press, at public policy meetings (e.g., Davos) and at scientific meetings; luminaries such as Stephen Hawking and Elon Musk have weighed in. Is it all hyperbole and fear mongering, or are there genuine scientific advances underlying the current excitement? In this talk, I try to provide some perspective on these issues, informed and undoubtedly biased by my 38 years of research in AI. I seek to contribute to the conversation in two ways: 1) by seeing current developments as part of the longest trend in AI---towards cheaper computation and thus a greater role for search, learning, and all things meta, and 2) by sketching one possible path to AI (the one I am currently treading) and what it might look like for it to succeed.

Introduction to Reinforcement Learning with Function Approximation (NIPS Dec 7, 2015)

Pdf slides. Video.

Reinforcement learning is a body of theory and techniques for optimal sequential decision making developed in the last thirty years primarily within the machine learning and operations research communities, and which has separately become important in psychology and neuroscience. This tutorial will develop an intuitive understanding of the underlying formal problem (Markov decision processes) and its core solution methods, including dynamic programming, Monte Carlo methods, and temporal-difference learning. It will focus on how these methods have been combined with parametric function approximation, including deep learning, to find good approximate solutions to problems that are otherwise too large to be addressed at all. Finally, it will briefly survey some recent developments in function approximation, eligibility traces, and off-policy learning.
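The core solution methods surveyed in this tutorial are easy to make concrete. Below is a minimal sketch of tabular TD(0) prediction on the classic five-state random walk; the function name and all parameter values are illustrative choices, not taken from the tutorial itself.

```python
import numpy as np

def td0_random_walk(episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    """Tabular TD(0) prediction on a 5-state random walk.

    States 1..5 are nonterminal; 0 and 6 are terminal. Reward is 1 for
    terminating on the right, 0 otherwise. True values are 1/6 .. 5/6.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(7)  # terminal states keep value 0 by convention
    for _ in range(episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # TD(0): move V[s] toward the one-step bootstrapped target
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:6]

values = td0_random_walk()
```

The same update, with `V` replaced by a parametric function approximator, is the starting point for the function-approximation material in the tutorial.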

The Future of Artificial Intelligence (LABMP class Sep 10, 2015)

More on the Future of AI (LABMP class Jan 14, 2016)

Creating Human-level AI: How and When? (Future of AI Conference, Puerto Rico, Jan 3, 2015)

Deep Questions (Neural Computation and Adaptive Perception Workshop, Dec 5, 2015)

Emphatic Temporal-difference Learning (Google Deepmind, July 16 2015)

This
talk will present a new algorithmic idea—emphasis—that may have
wide-ranging implications for all kinds of temporal-difference
learning. With function approximation, it is not possible to get the
value estimates of all states exactly correct. If some are more
accurate, then others must be less so. How are the function
approximation resources to be allocated? As I formulate it, this
question comes down to asking about the intensity or emphasis of the
learning update at each time step. The emphasis is a positive number
M_{t} multiplied by the learning rate
(step size) of each learning
update. As the emphasis varies from step to step, it changes the
effective distribution of the learning updates. In recent work,
Mahmood, White, Yu, and I have shown that emphasis alone can cause the
TD(lambda) algorithm to become stable and convergent under general
off-policy learning. This is possibly a landmark result, but it is
technical and already available on arxiv. In this talk I will focus on
a further interesting possibility that is more accessible but has not
yet been properly worked out: that emphasis changes and may improve the
asymptotic solution of temporal-difference learning algorithms.
Emphasis is based on how the values of different states are
interrelated by bootstrapping; if this can be understood, it seems
likely that a bound on asymptotic mean-square error can be found that
improves on the classic bound by Tsitsiklis and Van Roy (1997). This
talk is an invitation to join with me in thinking about this issue and
discovering the better bound.
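As a concrete sketch of the emphasis idea, here is one reading of emphatic TD(lambda) with linear function approximation. The names (F for the follow-on trace, M for the emphasis) follow the talk; the parameter values are illustrative, and for simplicity the follow-on update uses the current step's importance-sampling ratio rather than the previous step's.

```python
import numpy as np

class EmphaticTD:
    """Illustrative emphatic TD(lambda) with linear features."""

    def __init__(self, n_features, alpha=0.01, gamma=0.99, lam=0.9, interest=1.0):
        self.w = np.zeros(n_features)   # value-function weights
        self.e = np.zeros(n_features)   # eligibility trace
        self.F = 0.0                    # follow-on trace
        self.alpha, self.gamma, self.lam, self.i = alpha, gamma, lam, interest

    def update(self, x, r, x_next, rho=1.0):
        """x, x_next: feature vectors; r: reward; rho: importance-sampling ratio."""
        self.F = rho * self.gamma * self.F + self.i
        M = self.lam * self.i + (1 - self.lam) * self.F   # the emphasis M_t
        # the emphasis scales each state's contribution to the trace
        self.e = rho * (self.gamma * self.lam * self.e + M * x)
        delta = r + self.gamma * self.w @ x_next - self.w @ x
        self.w += self.alpha * delta * self.e
        return delta
```

With interest and rho both 1 this reduces to ordinary on-policy TD(lambda) up to a constant rescaling of the step size; the interesting behavior appears off-policy, where the emphasis is what restores stability.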

Multi-step Prediction (Neural Computation and Adaptive Perception Workshop, Dec 6, 2014)

Reinforcement Learning and Psychology (or Neuroscience): A Personal Story (Annual Meeting of the Society for Mathematical Psychology July 20 2014, and other places)

The modern field of reinforcement
learning (RL) has a long, intertwined relationship with psychology.
Almost all the powerful ideas of RL came originally from psychology,
and today they are recognized as having significantly increased our
ability to solve difficult engineering problems such as playing
backgammon, flying helicopters, and optimal placement of internet
advertisements. Psychology should celebrate this and take credit for
it! RL has also begun to give something back to the study of natural
minds, as RL algorithms are providing insights into classical
conditioning, the neuroscience of brain reward systems, and the role of
mental replay in thought. I have been working in the field of RL for
much of this journey, back and forth between nature and engineering,
and have played a role in some of the key steps. In this talk I tell
the story as it seemed to happen from my point of view, summarizing it
in four things that I think every psychologist should know about RL: 1)
that it is a formalization of learning by trial and error, with
engineering uses, 2) that it is a formalization of the propagation of
reward predictions which closely matches behavioral and neuroscience
data, 3) that it is a formalization of thought as learning from
replayed experience that again matches data from natural systems, and
4) that there is a beautiful confluence of psychology, neuroscience,
and computational theory on common ideas and elegant algorithms.

The AI Singularity and Prospects for Digital Immortality. (Kim Solez's LABMP course Dec 3, 2013)

Gradient Temporal-Difference Learning Algorithms (Gatsby London 2011)

Learning About Sensorimotor Data (NIPS Dec 13, 2011 and JNNS Dec 15 2011)

Pdf slides. Quicktime. Full keynote source with videos. Associated paper on the Horde architecture. Associated paper on on-policy nexting. Maei's thesis on gradient-TD algorithms. On videolectures.net.

Temporal-difference (TD) learning of
reward predictions underlies both reinforcement-learning algorithms and
the standard dopamine model of reward-based learning in the brain. This
confluence of computational and neuroscientific ideas is perhaps the
most successful since the Hebb synapse. Can it be extended beyond
reward? The brain certainly predicts many things other than
reward---such as in a forward model of the consequences of various ways
of behaving---and TD methods can be used to make these predictions. The
idea and advantages of using TD methods to learn large numbers of
predictions about many states and stimuli, in parallel, have been
apparent since the 1990s, but technical issues have prevented this
vision from being practically implemented...until now. A key
breakthrough was the development of a new family of gradient-TD
methods, introduced at NIPS in 2008 (by Maei, Szepesvari, and myself).
Using these methods, and other ideas, we are now able to learn
thousands of non-reward predictions in real-time at 10Hz from a single
sensorimotor data stream from a physical robot. These predictions are
temporally extended (ranging up to tens of seconds of anticipation),
goal oriented, and policy contingent. The new algorithms enable
learning to be off-policy and in parallel, resulting in dramatic
increases in the amount that can be learned in a given amount of time.
Our effective learning rate scales linearly with computational
resources. On a consumer laptop we can learn thousands of predictions
in real-time. On a larger computer, or on a comparable laptop in a few
years, the same methods could learn millions of meaningful predictions
about different alternate ways of behaving. These predictions in
aggregate constitute a rich detailed model of the world that can
support planning methods such as approximate dynamic programming.
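A toy version of this parallel-prediction architecture might look as follows: each "demon" learns its own general value function (its own cumulant signal and discount) from one shared sensorimotor stream. For brevity each demon here uses on-policy TD(0) rather than the gradient-TD methods that make off-policy parallel learning sound; all names are illustrative, not from the Horde code.

```python
import numpy as np

class Demon:
    """One predictive question: a cumulant, a discount, and linear weights."""

    def __init__(self, n_features, cumulant, gamma, alpha=0.1):
        self.w = np.zeros(n_features)
        self.cumulant, self.gamma, self.alpha = cumulant, gamma, alpha

    def update(self, x, x_next, obs_next):
        z = self.cumulant(obs_next)       # demon-specific prediction target
        delta = z + self.gamma * self.w @ x_next - self.w @ x
        self.w += self.alpha * delta * x
        return delta

def run_horde(demons, stream, featurize):
    """stream yields raw observations; every demon learns from every step."""
    obs = next(stream)
    x = featurize(obs)
    for obs_next in stream:
        x_next = featurize(obs_next)
        for d in demons:                  # all predictions updated in parallel
            d.update(x, x_next, obs_next)
        x, obs = x_next, obs_next
```

The point of the architecture is that the per-step cost is linear in the number of demons times features, so the number of predictions learned scales directly with available computation.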

Toward Learning Human-level Predictive Knowledge (AGI keynote March 5, 2010)

Slides. Video of the talk part1, part2.

This talk worked out pretty well for content -- about handling knowledge predictively with general value functions. The slides and video are best viewed side by side. In part 2, skip from 5:50 to 15:50: this portion has bad sound and is repeated with better sound starting at 15:50. (Perhaps this was done to show the slides better during those ten minutes, but downloading the slides serves that purpose better.) The total talk lasts about one hour and 25 minutes if 5:50-15:50 is skipped. The sound is very poor in the lengthy question period starting at 27:20 of part 2.

Deconstructing Reinforcement Learning (MSRL June 18, 2009)

The premise of this symposium is that
the ideas of reinforcement learning have impacted many fields,
including artificial intelligence, neuroscience, control theory,
psychology, and economics. But what are these ideas and which of them
is key? Is it the idea of reward and reward prediction as a way of
structuring the problem facing both natural and artificial systems? Is
it temporal-difference learning as a sample-based algorithm for
approximating dynamic programming? Or is it the idea of learning
online, by trial and error, searching to find a way of behaving that
might not be known by any human supervisor? Or is it all of these ideas
and others, all coming to renewed prominence and significance as these
fields focus on the common problem that faces animals, machines, and
societies---how to predict and control a hugely complex world that can
never be understood completely, but only as a gross, ever-changing
approximation? In this talk I seek to start the process of phrasing and
answering these questions. In some cases, from my own experience, I can
identify which ideas have been the most important, and guess which will
be in the future. For others I can only ask the other speakers and
attendees to provide informed perspectives from their own fields.

Fast gradient-descent methods for temporal-difference learning with linear function approximation (ICML-09 June 16, 2009)

Associated paper.

Sutton, Szepesvari and Maei (2009)
recently introduced the first temporal-difference learning algorithm
compatible with both linear function approximation and off-policy
training, and whose complexity scales only linearly in the size of the
function approximator. Although their gradient temporal difference
(GTD) algorithm converges reliably, it can be very slow compared to
conventional linear TD (on on-policy problems where TD is convergent),
calling into question its practical utility. In this paper we
introduce two new related algorithms with better convergence
rates. The first algorithm, GTD2, is derived and proved
convergent just as GTD was, but uses a different objective function and
converges significantly faster (but still not as fast as conventional
TD). The second new algorithm, linear TD with gradient correction, or
TDC, uses the same update rule as conventional TD except for an
additional term which is initially zero. In our experiments on small
test problems and in a Computer Go application with a million features,
the learning rate of this algorithm was comparable to that of
conventional TD. This algorithm appears to extend linear TD to
off-policy learning with no penalty in performance while only doubling
computational requirements.
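In code, the two updates described in the abstract might be sketched as below (our reading of the paper; `theta` holds the value weights, `w` the auxiliary weights both methods maintain, and `rho` an importance-sampling ratio, 1 for on-policy data).

```python
import numpy as np

def gtd2_step(theta, w, x, r, x_next, gamma, alpha, beta, rho=1.0):
    """One GTD2 update for linear function approximation (illustrative)."""
    delta = r + gamma * theta @ x_next - theta @ x
    theta = theta + alpha * rho * (x - gamma * x_next) * (w @ x)
    w = w + beta * rho * (delta - w @ x) * x
    return theta, w

def tdc_step(theta, w, x, r, x_next, gamma, alpha, beta, rho=1.0):
    """One TDC update: conventional TD plus a correction term (illustrative)."""
    delta = r + gamma * theta @ x_next - theta @ x
    # the second term is the gradient correction; it is zero while w is zero,
    # so early on TDC behaves exactly like conventional linear TD
    theta = theta + alpha * rho * (delta * x - gamma * x_next * (w @ x))
    w = w + beta * rho * (delta - w @ x) * x
    return theta, w
```

Both updates cost O(number of features) per step, the same order as conventional TD, with the auxiliary vector `w` doubling memory and computation.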

Core learning algorithms for intrinsically motivated robots (Venice IM-CLeVeR Workshop Nov 17, 2009)

This talk will present recent progress
in the development of core learning algorithms that may be useful in
creating systems with nontrivial cognitive-developmental trajectories.
The most distinctive feature of such systems is that their learning is
continual and cumulative. They never stop learning, and new learning
builds upon the old. For such learning, the algorithms must be
incremental and operate in real-time, and in the past the most suitable
algorithms for such cases have been gradient-descent algorithms. Our
new algorithms extend temporal-difference learning so that
it is a true gradient-descent algorithm, which greatly extends its
robustness and generality. In particular, we have obtained for the
first time temporal-difference methods for off-policy learning with
function approximation, including nonlinear function approximation, and
for intra-option learning. This talk will not present these algorithms
in technical detail, but instead stress the several natural roles that
they could play in systems that set their own goals and explore the
world in a structured, intrinsically motivated way.

This is joint work with Hamid Maei, Csaba Szepesvari, Doina Precup, Shalabh Bhatnagar, David Silver, Michael Delp, and Eric Wiewiora.

Online Representation Learning in the State Update Function (Tea-time 2009)

New Temporal-Difference Methods Based on Gradient Descent (USC 2/18/09)

ABSTRACT: Temporal-difference methods
based on gradient descent and parameterized function approximators form
a core part of the modern field of reinforcement learning and are
essential to many of its large-scale applications. However, the
most popular methods, including TD(lambda), Q-learning, and Sarsa, are
not true gradient-descent methods and, as a result, the conditions
under which they converge are narrower and less robust than can usually
be guaranteed for gradient-descent methods. In this paper we
introduce a new family of temporal-difference algorithms whose expected
updates are in the direction of the gradient of a natural performance
measure that we call the "mean squared projected Bellman error".
Because these are true gradient-descent methods, we are able to apply
standard techniques to prove them convergent and stable under general
conditions including, for the first time, off-policy training. The new
methods are of the same order of complexity as TD(lambda) and, when
TD(lambda) converges, they converge at a similar rate to the same
fixpoints. The new methods are similar to GTD(0) (Sutton,
Szepesvari & Maei, 2009), but based on a different objective
function and much more efficient, as we demonstrate in a series of
computational experiments.

Applications of Reinforcement Learning in the Power Systems Industry (Invited Talk at the First Annual Power and Energy Forum, Nov 6, 2008)

Critterbot Project Overview (Aug 6, 2008)

Mind and Time: A View of Constructivist Reinforcement Learning

This an invited talk I gave at the
European Workshop on Reinforcement Learning in summer 2008. The
basic idea is that in order to learn fast it is necessary to learn
slow, that the key to fast reinforcement learning is to prepare for it
by a slow continual process of constructing a model of the world's
state and dynamics. Although I don't know exactly how to do this,
I have many ideas and suggestions, and an outline of how to
proceed. I try to communicate these in this talk.

How simple can mind be? (International Workshop on Natural and Artificial Cognition, University of Oxford 6/26/07)

On the Role of Tracking in Stationary Environments (ICML'07 6/21/07) Associated paper.

ABSTRACT: It is often thought that
learning algorithms that track the best
solution, as opposed to converging to it, are important only on
nonstationary problems. We present three results suggesting that this
is not so. First we illustrate in a simple concrete example, the Black
and White problem, that tracking can perform better than any converging
algorithm on a stationary problem. Second, we show the same point on a
larger, more realistic problem, an application of temporal-difference
learning to computer Go. Our third result suggests that tracking in
stationary problems could be important for meta-learning research
(e.g., learning to learn, feature selection, transfer). We apply a
meta-learning algorithm for step-size adaptation, IDBD, to the Black
and White problem, showing that meta-learning has a dramatic long-term
effect on performance whereas, on an analogous converging problem,
meta-learning has only a small second-order effect. This result
suggests a way of eventually overcoming a major obstacle to
meta-learning research: the lack of an independent methodology for task
selection.

Stimulus Representation in Temporal-Difference Models of the Dopamine System (Cal Tech 6/4/07)

The neurotransmitter dopamine plays an
important role in the processing of reward-related information in the
brain. A prominent theory of this function is that the phasic firing of
dopamine neurons encodes a reward prediction error as formalized by the
temporal-difference (TD) algorithm in reinforcement learning. Most of
these TD models of the dopamine system have assumed a "complete serial
compound" representation in which every moment within a trial is
represented distinctly with no similarity to neighboring moments. In
this paper we present a more realistic temporal representation in which
external stimuli spawn a series of internal microstimuli which grow
weaker and more diffuse over time. We show that if these microstimuli
are used as inputs to the TD model, then its match to experimental data
is improved for hitherto problematic cases in which reward is omitted
or received early. We also note that the new model never produces
large negative errors, suggesting that a second neurotransmitter for
representing negative errors may not be necessary. Generally, we
conclude that choosing a stimulus representation with a more realistic
temporal profile can significantly alter the predictions of the TD
model of dopamine function.
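A minimal sketch of such a microstimulus representation, with our own guessed constants (trace decay rate, number and width of microstimuli): a stimulus onset launches a memory trace that decays exponentially, and each microstimulus is a Gaussian basis function of the trace's height, so the representation grows weaker and more diffuse as time passes.

```python
import numpy as np

def microstimuli(t_since_onset, n=10, decay=0.985, sigma=0.08):
    """Feature vector at t time steps after a stimulus onset (illustrative)."""
    y = decay ** t_since_onset            # decaying memory trace in (0, 1]
    centers = np.linspace(1.0, 0.0, n)    # microstimulus centers on trace height
    # each feature is the trace height passed through a Gaussian basis function,
    # scaled by the trace itself so later microstimuli are weaker
    return y * np.exp(-((y - centers) ** 2) / (2 * sigma ** 2))
```

These features would feed a standard TD(lambda) learner in place of the complete-serial-compound representation, which is the substitution whose behavioral consequences the talk examines.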

Experience-Oriented Artificial Intelligence (Machine Learning Seminar at the University of Toronto, 4/3/06)

If intelligence is a computation, then
the temporal stream of sensations is its input, and the temporal stream
of actions is its output. These two intermingled time series make up
experience. They are the basis on which all intelligent decisions
are made and the basis on which those decisions are judged. A focus on
experience has implications for many aspects of AI; in this talk we
consider its implications for knowledge representation. I propose that
it is possible and desirable for an AI agent's knowledge of the world
to be expressed entirely as predictions about its low-level experience.
Even abstract concepts, such as the concept of a chair, can be
expressed as predictions, e.g., about what will happen if we try to
sit. The predictive approach is appealing because it connects knowledge
directly to data, allowing knowledge to be autonomously verified and
tuned, perhaps even learned. However, there is a tremendous gap between
human-level knowledge (e.g., about space, objects, people, or water)
and low-level experience. The purpose of this talk is to present
some recent work suggesting how this gap might someday be
bridged. I describe a series of small experiments in which
extensions of reinforcement learning methods are used to learn
predictive representations of abstract commonsense knowledge in
micro-worlds. These are first steps on a long journey toward
understanding how a mind might make sense of the blooming, buzzing
confusion of its sensori-motor experience.

Predictive Representations of State and Knowledge (ICML'05 workshop on Rich Representations for Reinforcement Learning, 8/7/05)

What is knowledge? The empiricist
answer, dating back to the 19th century, is that knowledge is the
ability to predict. In a modern version of this idea, reinforcement
learning researchers have proposed that artificial agents should
represent their knowledge as predictions of their low-level sensations
and actions. This predictive representations (PR) approach is
appealing because it connects knowledge directly to data, thereby
facilitating learning and clarifying semantics. Most PR research
has emphasized representing the world's state. In this talk I
will survey the main results and mathematical ideas of that work. A
natural follow-on, just beginning to be explored, is to use PRs for all
kinds of world knowledge, of dynamics as well as of state, of
abstractions as well as specifics. I will survey this work as
well and attempt to make vivid the potential of PRs for artificial
intelligence.

Grounding knowledge in subjective experience (provocative remarks at the 2nd Cognitive Systems Conference, 5/20/05)

Experience-Oriented Artificial Intelligence (McGill 11/30/05)

I propose that experience - the explicit sequence of actions and sensations over an agent's life - should play a central role in all aspects of artificial intelligence. In particular:

1. Knowledge representation should be in terms of experience. Recent work has shown that a surprisingly wide range of world knowledge can be expressed as predictions of experience, enabling it to be automatically verified and tuned, and grounding its meaning in data rather than in human understanding.

2. Planning/reasoning should be in terms of experience. It is natural to think of planning as comparing alternative future experiences. General methods, such as dynamic programming, can be used to plan using knowledge expressed in the aforementioned predictive form.

3. State representation should be in terms of experience. Rather than talk about objects and their metric or even topological relationships, we represent states by the predictions that can be made from them. For example, the state "John is in the coffee room" corresponds to the prediction that going to the coffee room will produce the sight of John.

Much here has yet to be worked out. Each of the "should"s above can also be read as a "could", or even a "perhaps could". I am optimistic and enthusiastic because of the potential for developing a compact and powerful theory of AI in the long run, and for many easy experimental tests in the short run.

Grounding Commonsense Knowledge in Question Networks. (University of Michigan 9/28/04)

A long-standing challenge in artificial
intelligence has been to relate the kind of commonsense knowledge that
people have about the world (for example, about space, objects, people,
trees and water) to the low-level stream of sensations and
actions. In this talk, we present new work that brings us a few
steps closer to realizing this goal. We introduce the idea of
question networks, a way of expressing arbitrary machine-readable
questions about future sensations and actions, and a
temporal-difference algorithm for learning answers to the
questions. In a series of small experiments, we illustrate the
learning efficiency of these methods and their ability to handle
non-Markov problems. Finally, we present their extension to
temporally abstract knowledge in terms of closed-loop macro-actions
known as options. Overall, we argue that these steps bring us
qualitatively closer to understanding the blooming, buzzing confusion
of sensori-motor experience.

Temporal Difference Networks. Presented at NIPS-04. Larger Version.

We introduce a generalization of
temporal-difference (TD) learning to networks of interrelated
predictions.
Rather than relating a single prediction to itself at a later time, as
in conventional TD methods, a TD network relates each prediction in a
set of predictions to other predictions in the set at a later time. TD
networks can represent and apply TD learning to a much wider class of
predictions than has previously been possible. Using a random-walk
example, we show that these networks can be used to learn to predict by
a fixed interval, which is not possible with conventional TD methods.
Secondly, we show that when actions are introduced, and the
inter-prediction relationships made contingent on them, the usual
learning-efficiency advantage of TD methods over Monte Carlo
(supervised learning) methods becomes particularly pronounced. Thirdly,
we demonstrate that TD networks can learn predictive state
representations that enable exact solution of a non-Markov problem. A
very broad range of inter-predictive temporal relationships can be
expressed in these networks. Overall we argue that TD networks
represent a substantial extension of the abilities of TD methods and
bring us closer to the goal of representing world knowledge in entirely
predictive, grounded terms.

Knowledge Representation in TD Networks (AAAI Symposium on MDPs and POMDPs: Advances and Challenges, 7/26/04). Large (1024 x 768) version.

We introduce a generalization of
temporal-difference (TD) learning to networks of interrelated
predictions.
Rather than relating a single prediction to itself at a later time, as
in conventional TD methods, a TD network relates each prediction in a
set of predictions to other predictions in the set at a later time. TD
networks can represent and apply TD learning to a much wider class of
predictions than has previously been possible. Using a random-walk
example, we show that these networks can be used to learn to predict by
a fixed interval, which is not possible with conventional TD methods.
Secondly, we show that when actions are introduced, and the
inter-prediction relationships made contingent on them, the usual
learning-efficiency advantage of TD methods over Monte Carlo
(supervised learning) methods becomes particularly pronounced. Thirdly,
we demonstrate that TD networks can learn predictive state
representations that enable exact solution of a non-Markov problem. A
very broad range of inter-predictive temporal relationships can be
expressed in these networks. Overall we argue that TD networks
represent a substantial extension of the abilities of TD methods and
bring us closer to the goal of representing world knowledge in entirely
predictive, grounded terms.

Toward a Computational Theory of Intelligence -- iCORE talk on Reinforcement Learning and Artificial Intelligence (University of Calgary 2/25/04). Video here.

This talk was to a general university audience (videocast to U. Alberta and U. Lethbridge). To showcase the ideas and power of RL, I collected a bunch of videos from other people's work. It's not often you can do this appropriately, but I think it was OK this time, and certainly it was fun. The accompanying videos:

- Robot dogs learning to walk fast (Kohl & Stone, U. Texas, 2004)
- Finnegan Southey devilsticking
- Robot devilsticking (Stefan Schaal & Chris Atkeson Univ. of Southern California)
- Helicopter by Ng (Stanford), Kim, Jordan, & Sastry (UC Berkeley) 2004
- Adversarial Robot Learning (Bowling & Veloso, CMU)

All but the last are QuickTime files and will play directly in Safari.
The last seems to require mplayer.

Adapting bias by gradient descent: An incremental version of delta-bar-delta (University of Alberta 2/2/04)

Appropriate bias is widely viewed as the key to efficient learning and generalization. I present a new algorithm, the Incremental Delta-Bar-Delta (IDBD) algorithm, for the learning of appropriate biases based on previous learning experience. The IDBD algorithm is developed for the case of a simple, linear learning system---the LMS or delta rule with a separate learning-rate parameter for each input. The IDBD algorithm adjusts the learning-rate parameters, which are an important form of bias for this system. Because bias in this approach is adapted based on previous learning experience, the appropriate testbeds are drifting or non-stationary learning tasks. For particular tasks of this type, I show that the IDBD algorithm performs better than ordinary LMS and in fact finds the optimal learning rates. The IDBD algorithm extends and improves over prior work by Jacobs and by me in that it is fully incremental and has only a single free parameter. This paper also extends previous work by presenting a derivation of the IDBD algorithm as gradient descent in the space of learning-rate parameters. Finally, I offer a novel interpretation of the IDBD algorithm as an incremental form of hold-one-out cross validation.
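The IDBD update rules are compact enough to sketch directly. The rendering below follows the description above: the LMS rule with a separate learning rate per input, each adapted by gradient descent on previous learning experience. The meta-step-size `theta` and the initialization are illustrative choices.

```python
import numpy as np

class IDBD:
    """Incremental Delta-Bar-Delta: LMS with adapted per-input step sizes."""

    def __init__(self, n, theta=0.02, init_alpha=0.05):
        self.w = np.zeros(n)                        # LMS weights
        self.beta = np.full(n, np.log(init_alpha))  # log learning rates
        self.h = np.zeros(n)                        # trace of recent updates
        self.theta = theta                          # the single free meta parameter

    def update(self, x, target):
        delta = target - self.w @ x                 # delta-rule error
        # meta step: raise beta_i when the current correction delta*x_i
        # correlates with recent corrections h_i (i.e., learning was too slow)
        self.beta += self.theta * delta * x * self.h
        alpha = np.exp(self.beta)                   # per-input learning rates
        self.w += alpha * delta * x
        # decay the trace and add the new update; the clipped factor keeps
        # the decay nonnegative, as in the original derivation
        self.h = self.h * np.clip(1 - alpha * x * x, 0, None) + alpha * delta * x
        return delta
```

On a drifting task, the learning rates for relevant inputs rise while those for irrelevant (pure-noise) inputs fall, which is the sense in which IDBD learns appropriate biases.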

From Markov Decision Processes to Artificial Intelligence (University of Alberta 5/14/03)

The path to general, human-level intelligence may go through Markov decision processes (MDPs), a discrete-time, probabilistic formulation of sequential decision problems in terms of states, actions, and rewards. Developed in the 1950s, MDPs were extensively explored and applied in operations research and engineering before coming to the attention of artificial intelligence researchers about 15 years ago. Much of the new interest has come from the field of reinforcement learning, where novel twists on classical dynamic programming methods have enabled the solution of more and vastly larger problems, such as backgammon (Tesauro, 1995) and elevator control (Crites and Barto, 1996). Despite remaining technical issues, real progress seems to have been made toward general learning and planning methods relevant to artificial intelligence. We suggest that the MDP framework can be extended further, to the threshold of human-level intelligence, by abstracting and generalizing each of its three components - actions, states, and rewards. We briefly survey recent work on temporally abstract actions (Precup, 2000; Parr, 1998), predictive representations of state (Littman et al., 2002), and non-reward subgoals (Sutton, Precup & Singh, 1998) in support of this suggestion.
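To make the MDP formulation concrete, here is a minimal sketch: a toy three-state MDP (the transition probabilities and rewards are invented purely for illustration) solved by value iteration, the classical dynamic programming method that the reinforcement learning methods mentioned above build on:

```python
import numpy as np

# Toy MDP, invented for illustration: P[s, a, s'] transition probabilities,
# R[s, a] expected immediate rewards; 3 states, 2 actions.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],   # transitions from state 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9]],   # transitions from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([[0.0, 0.0],
              [0.0, 1.0],                  # reward for action 1 in state 1
              [0.0, 0.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ].
V = np.zeros(3)
for _ in range(1000):
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        V = V_new
        break
    V = V_new
policy = Q.argmax(axis=1)   # greedy policy with respect to V
```

This full sweep over all states each iteration is exactly what becomes infeasible in large problems, motivating the sampled, approximate methods of reinforcement learning.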

Reinforcement Learning's Computational Theory of Mind (Rutgers/Indiana 2/14/03)

The reinforcement learning approach to understanding intelligence is now about 20 years old, which should be time enough for a mature perspective on what it is and what it has contributed. Reinforcement learning methods, particularly temporal-difference learning, have been widely used in control and robotics applications, in playing games such as chess and backgammon, in operations research, and as models of animal learning and neural reward systems. Holding these diverse applications together, and posing as a fundamental statement about cognition and decision-making, is a computational theory (in the sense of Marr) of mind. Reinforcement learning methods are centered around the interaction and simultaneous evolution of two primary functional objects: the policy, which says what to do in each situation, and the value function, which says how desirable it is to be in each situation. In this talk, I will survey several examples of reinforcement learning in the attempt to make this underlying theory vivid. Finally, I will mention some of the theory's limitations and shortcomings, and ongoing efforts to make it relevant to the extremely powerful and flexible cognition that we see in humans.

Experience-Oriented Artificial Intelligence (Nov 2002)

I propose that experience - the explicit sequence of actions and sensations over an agent's life - should play a central role in all aspects of artificial intelligence. In particular:

[some of this is joint work with Doina Precup, Michael Littman, Satinder Singh & Peter Stone]

1. Knowledge representation should be in terms of experience. Recent work has shown that a surprisingly wide range of world knowledge can be expressed as predictions of experience, enabling it to be automatically verified and tuned, and grounding its meaning in data rather than in human understanding.

2. Planning/reasoning should be in terms of experience. It is natural to think of planning as comparing alternative future experiences. General methods, such as dynamic programming, can be used to plan using knowledge expressed in the aforementioned predictive form.

3. State representation should be in terms of experience. Rather than talk about objects and their metric or even topological relationships, we represent states by the predictions that can be made from them. For example, the state "John is in the coffee room" corresponds to the prediction that going to the coffee room will produce the sight of John.

Much here has yet to be worked out. Each of the "should"s above can also be read as a "could", or even a "perhaps could". I am optimistic and enthusiastic because of the potential for developing a compact and powerful theory of AI in the long run, and for many easy experimental tests in the short run.

Artificial Intelligence Should Be About Predictions (AT&T 12/7/01)

What keeps the knowledge in an AI system correct? Usually people do, but that is a dead end; eventually the AI must do it itself. Building AIs that can maintain their own knowledge is probably the greatest single challenge facing AI today.

It would be relatively easy to self-maintain knowledge if it were expressed as predictions: you would predict something and then see what actually happened. In this talk I propose that much of our knowledge of the world can be expressed as predictions that can be verified in this way. Certainly much of our everyday decision-making is based on predictions about alternative courses of action. Even abstract concepts, such as the concept of a chair, can be expressed as predictions, e.g., about what would happen if we try to sit. Emphasizing ideas rather than technical details, I will describe some of the challenges to this predictive view and partial solutions. The main challenge is to be able to express in predictive form the wide variety of knowledge we have of the world. This can be done in large part by allowing the predictions to be conditional on action and to terminate flexibly, as in the "options" framework. A second challenge is to be fully grounded, to relate the meaning of predictions directly to data. Finally, we consider the pragmatic challenges: how to make progress with these ideas? Building a self-maintaining AI based on predictive knowledge is not difficult, but requires new ways of thinking, determination to do it right, and a willingness to proceed slowly.

We Have Not Yet Begun to Learn (19th Reinforcement Learning Workshop, AT&T 9/20/01)

Mind is About Predictions (Northeastern 7/31/01)

In this talk I will describe recent research in artificial intelligence which has given greater credence to the old idea that much of our knowledge of the world is in the form of predictions. From the blooming, buzzing confusion we extract what is predictable, and in so doing discover useful concepts and ways of behaving. Certainly, much of our everyday reasoning and decision making is based on predictions about alternative courses of action. Even abstract concepts, such as the concept of a chair, can be expressed as predictions, e.g., about what will happen if we try to sit. In this talk I will briefly cover three ideas: 1) an expanded notion of prediction capable of expressing a broad range of knowledge, 2) a kind of planning, or reasoning, as the combination of predictions to yield new predictions, and 3) a way of representing the state of the world (as well as its dynamics) as predictions. All this suggests that working with predictions is what the mind is all about---that predictions are the coin of the mental realm.

(Some of the newer bits of this are joint work with Michael Littman, Doina Precup, and Satinder Singh; also many thanks to David McAllester for constructive criticism.)

Off-policy temporal-difference learning with function approximation (ICML 7/1/01)

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(lambda) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any epsilon-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.
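The importance-sampling idea at the core of the algorithm can be shown in a much-simplified form. The sketch below is tabular TD(0) on action values with a per-decision importance-sampling ratio, not the talk's actual TD(lambda)-with-function-approximation algorithm; the names and signature are mine:

```python
import numpy as np

def is_td0_update(Q, s, a, r, s2, a2, pi, b, alpha=0.1, gamma=0.9):
    """One off-policy TD(0) update on action values, with a per-decision
    importance-sampling ratio. `pi` and `b` are the target and behavior
    policies, given as |S| x |A| probability tables.

    The ratio rho = pi(a2|s2) / b(a2|s2) reweights the bootstrapped
    next-step value so that its expectation under the behavior policy
    equals the expected next value under the target policy.
    """
    rho = pi[s2, a2] / b[s2, a2]
    td_error = r + gamma * rho * Q[s2, a2] - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```

Because the reweighting is done per decision rather than over whole trajectories, the ratios do not compound over long episodes, which is the main source of the variance reduction mentioned in the abstract.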

Overcoming the Curse of Dimensionality with Reinforcement Learning (MIT ORC 4/19/01)

Technological advances in the last few decades have made computation and memory vastly cheaper and thus available in massive quantities. The field of reinforcement learning attempts to take advantage of this trend when solving large-scale stochastic optimal control problems. Dynamic programming can solve small instances of such problems, but suffers from Bellman's "curse of dimensionality," the tendency of the state space and thus computational complexity to scale exponentially with the number of state variables (and thus to quickly exceed even the "massive" computational resources now available). Reinforcement learning brings in two new techniques: 1) parametric approximation of the value function, and 2) sampling of state trajectories (rather than sweeps through the state space). These enable finding approximate solutions, improving in quality with the available computational resources, on problems too large to even be attempted with conventional dynamic programming. However, these techniques also complicate theory, and there remain substantial gaps between the reinforcement learning methods proven effective and those that appear most effective in practice.

In this talk, I present results extending the convergence result of Tsitsiklis and Van Roy for on-policy evaluation with linear function approximation to the off-policy case, reviving the possibility of convergence results for value-based off-policy control methods such as Q-learning. I also present an application to RoboCup soccer illustrating the linear approach to function approximation. (This is joint work with Doina Precup, Satinder Singh, Peter Stone, and Sanjoy Dasgupta.)

The Right Way to do Reinforcement Learning with Function Approximation (NIPS'00 12/2/00)

From Reflex to Reason (Cornell 12/8/00)

How close are we to a computational understanding of the mind? Perhaps closer than is usually thought. In this talk I discuss a small set of principles drawn from reinforcement learning and other parts of artificial intelligence that cover a broad range of mental phenomena, from reflexes through various kinds of learning, planning, and reasoning. These principles include rewards, value functions, state-space search, and, as I emphasize in this talk, representing our knowledge of the world as predictions of future observations. First, I show how predictive representations provide a new theory of that simplest of learning phenomena, Pavlovian conditioning or the learning of reflexes. Second, I briefly outline how model-based reinforcement learning with mental simulation can serve as a theory of reasoning. I argue that representing knowledge as predictions, including the possibility of action-contingent and temporally indefinite predictions, solves critical problems in the semantics and grounding of classical symbolic approaches to knowledge representation.

Toward Grounding Knowledge in Prediction (CEC2000 7/18/00)

Any attempt to build intelligent machines must come to grips with the question of knowledge, of what kind of information about the world the machine stores and manipulates. Traditionally there have been two approaches, the horns of a dilemma. One uses verbal statements like "John loves Mary" or "Socrates is a man" whose meaning is clear only to people, not to machines; such knowledge is ungrounded. The other uses mathematical statements like differential equations or transition matrices which, although clear and grounded, have never seemed adequate for expressing the commonsense knowledge we all have about the world and use every day. In this talk we suggest that this dilemma can be broken by grounding knowledge in an enlarged notion of conditional prediction. In particular, if we allow predictions conditional on outcomes (as in Precup, 2000; Parr, 1999) then much more can be expressed as predictions without losing grounding and mathematical clarity. In addition, this approach suggests a radical theory of reasoning---combining knowledge to yield new knowledge---as simple composition of predictions.

A Least Common Denominator for Temporal Abstraction in Reinforcement Learning (NIPS workshop 12/5/98)

Improved Switching Among Temporally Abstract Actions (NIPS 12/2/98)

In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.

Reinforcement Learning: How Far Can It Go? (Past, Present, and Future) (ICML/COLT/UAI 7/25/98, Extended abstract)

Between MDPs and Semi-MDPs (Stanford 3/5/98)

A key challenge for AI is how to learn, plan, and represent knowledge at multiple levels of temporal abstraction. In this talk I develop an approach based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). The usual framework is extended to include closed-loop multi-step *options*---whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options can be used interchangeably with primitive actions in reinforcement learning and planning methods, and can be analyzed in terms of a generalized kind of MDP known as a semi-Markov decision process (SMDP) (e.g., Puterman, 1994; Bradtke and Duff, 1995; Parr, 1998; Precup and Sutton, 1997). In this talk I focus on the interplay between the MDP and SMDP levels of analysis. I show how a set of options can be improved by changing their termination conditions so as to improve over SMDP planning methods with no additional cost. I also present novel *intra-option* temporal-difference methods that substantially improve over SMDP methods. Finally, I discuss how options themselves can be learned, introducing a new notion of subgoal and subtask into reinforcement learning. Overall, I argue that options and models of options provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge. (Joint work with Doina Precup and Satinder Singh.)
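A minimal sketch of the two levels of analysis: an option bundles a closed-loop policy with a termination condition and an initiation set, and SMDP Q-learning backs up the whole multi-step experience in one update. The data structure and function below are illustrative simplifications with names of my own choosing, not the talk's full framework:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """A temporally extended, closed-loop course of behavior."""
    policy: Callable[[int], int]     # state -> primitive action while running
    beta: Callable[[int], float]     # state -> probability of terminating
    init_set: frozenset              # states in which the option may be taken

def smdp_q_update(Q, s, o, rewards, s_next, alpha=0.1, gamma=0.9):
    """SMDP Q-learning backup for option o, started in state s, which ran
    for len(rewards) steps and ended in s_next."""
    k = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted return
    target = g + gamma**k * np.max(Q[s_next])             # bootstrap after k steps
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```

A primitive action is the special case of an option that terminates after one step (`beta` always 1 and `rewards` of length 1), which is what lets options and actions be used interchangeably in the same update.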

Reinforcement Learning: A Tutorial (GP-98 7/23/98)

Reinforcement learning is learning about, from, and while interacting with an environment in order to achieve a goal. In other words, it is a relatively direct model of the learning that people and animals do in their normal lives. In the last two decades, this age-old problem has come to be much better understood by integrating ideas from psychology, optimal control, artificial neural networks, and artificial intelligence. New methods and combinations of methods have enabled much better solutions to large-scale applications than had been possible by all other means. This tutorial will provide a top-down introduction to the field, covering Markov decision processes and approximate value functions as the formulation of the problem, and dynamic programming, temporal-difference learning, and Monte Carlo methods as the principal solution methods. The role of neural networks and planning will also be covered. The emphasis will be on understanding the capabilities and appropriate role of each class of methods within an integrated system for learning and decision making.
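As a small taste of the material, here is tabular TD(0) policy evaluation, the simplest of the temporal-difference methods such a tutorial covers (a toy sketch; the data format and names are my own):

```python
import numpy as np

def td0_evaluate(episodes, n_states, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation. `episodes` is a list of trajectories,
    each a list of (state, reward, next_state) transitions, with next_state
    None at termination."""
    V = np.zeros(n_states)
    for episode in episodes:
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V[s_next]
            V[s] += alpha * (r + gamma * v_next - V[s])   # TD(0) backup
    return V
```

On a two-state chain where state 0 leads to state 1 with reward 0 and state 1 terminates with reward 1, the estimates converge to V = [1, 1] (with gamma = 1): each state's value is learned from the next state's current estimate, the bootstrapping that distinguishes temporal-difference from Monte Carlo methods.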

Reinforcement Learning: Lessons for Artificial Intelligence (IJCAI 8/28/97)

The field of reinforcement learning has recently produced world-class applications and, as we survey in this talk, scientific insights that may be relevant to all of AI. In my view, the main things that we have learned from reinforcement learning are 1) the power of learning from experience as opposed to labeled training examples, 2) the central role of *modifiable* evaluation functions in organizing sequential behavior, and 3) that learning and planning could be radically similar.

Reinforcement Learning and Information Access (AAAI-SS 3/26/96)

Constructive Induction Needs a Methodology based on Continuing Learning (ICML94, Workshop on Constructive Induction, panel remarks)