Rich Sutton's Publications
Last updated June 2021.
First, a quick guide to some of the highlights, roughly in order
of the work's popularity or potential current interest:
Also, some RL pubs that aren't mine, available for researchers:
- Chris Watkins's thesis
- Boyan's LSTD(λ),
1999
- Bradtke and Barto, LSTD, 1996
- Williams, 1992
- Lin, 1992
- Ross, 1983, Chapter 2
- Minsky, 1960, Steps to AI
- Good, 1965,
Speculations concerning the first ultraintelligent machine
- Selfridge, 1958,
Pandemonium
- Samuel, 1959
- Dayan, 1992
- Tesauro, 1992, TD-Gammon
- Watkins and Dayan,
1992
- Gurvits, Lin, and Hanson, 1995
- An, Miller, and Parks
(1991)
- Intro to Andreae
(2017) and Andreae
(2017)
PhD theses
- Doina Precup, 2000
- David Silver, 2009,
Reinforcement Learning and Simulation-Based Search in
Computer Go
- Hamid Maei, 2011, Gradient
Temporal-Difference Learning Algorithms
- Adam White, 2015, Developing a
predictive approach to knowledge
- Rupam Mahmood, 2017, Incremental Off-policy
Reinforcement Learning Algorithms
- Sina Ghiassian, 2022, Online Off-policy
Prediction
- Katya Kudashkina, 2022, Model-based
Reinforcement Learning Methods for Developing Intelligent
Assistants
- Yi Wan, 2023, Learning and Planning with the
Average-Reward Formulation
- Alexandra Kearney, 2023, Letting the Agent Take the
Wheel: Principles for Constructive and Predictive Knowledge
- Banafsheh Rafiee, 2024, State Construction in
Reinforcement Learning
- Kenny Young, 2024, Leveraging Generic Problem
Structure for Efficient Reinforcement Learning
- Abhishek Naik, 2024, Reinforcement Learning
for Continuing Problems using Average Reward
- Kris De Asis, 2024, Explorations in the
Foundations of Value-based Reinforcement Learning
MSc theses
- Brian Tanner, 2005, Temporal-Difference
Networks
- Adam White, 2006, A
Standard Benchmarking System for Reinforcement Learning
- Eddie
Rafols, 2006, Temporal
Abstraction in Temporal-difference Networks
- Cosmin Paduraru, 2008, Planning
with Approximate and Learned Models of Markov Decision
Processes
- Anna Koop, 2008, Investigating Experience:
Temporal Coherence and Empirical Knowledge Representation
- Masoud
Shahamiri, 2008,
Reinforcement Learning in Environments with Independent
Delayed-Sense Dynamics
- Rupam Mahmood, 2010, Automatic Step-size
Adaptation in Incremental Supervised Learning
- Mike Delp, 2011, Experiments in Off-Policy
Reinforcement Learning with the GQ(λ) Algorithm
- MahdiehSadat Mirian
HosseinAbadi, 2012, A
Computational Model of Learning from Replayed Experience in
Spatial Navigation
- Leah Hackman, 2012, Faster Gradient-TD
Algorithms
- Kavosh Asadi, 2015,
Strengths, Weaknesses, and Combinations of Model-based and
Model-free Reinforcement Learning
- Travis Dick, 2015, Policy
Gradient Reinforcement Learning Without Regret
- Vivek Veeriah, 2017, Beyond Clever Hans:
Learning From People Without Their Really Trying
- Shangtong Zhang, 2018, Learning with Artificial
Neural Networks
- Banafsheh Rafiee, 2018, Predictive Knowledge in
Robots: An Empirical Comparison of Learning Algorithms
- Kris De Asis, 2018, A Unified View of
Multi-step Temporal Difference Learning
- Tian Tian, 2018, Extending the Sliding-step
Technique of Stochastic Gradient Descent to Temporal
Difference Learning
- Kenny Young, 2019, Learning What to Remember:
Strategies for Selective External Memory in Online
Reinforcement Learning Agents
- Juan Fernando Hernandez Garcia, 2019, Unifying n-Step
Temporal-Difference Action-Value Methods
- Chen Ma, 2020, An
Exploration of Predictive Representations of State
- Shibhansh Dohare, 2020, The Interplay of Search
and Gradient Descent in Semi-stationary Learning Problems
- Dylan Ashley, 2020,
Understanding Forgetting in Artificial Neural Networks
- Brendan Bennett, 2020, Estimating
Variance of Returns using Temporal Difference Methods
- Parash Rahman, 2020, Toward
Generate-and-Test Algorithms for Continual Feature Discovery
- Jingjiao Ni, 2020, Toward
Emphatic Reinforcement Learning
- Cam Linke, 2020, Adapting
Behavior via Intrinsic Reward
- Valliappa Chockalingam, 2020, Interrelating Prediction and
Control Objectives in Episodic Actor-Critic Reinforcement
Learning
- Amir Samani, 2022, Learning Agent State
Online with Recurrent Generate-and-Test
For any broken links, please send email
to rich@richsutton.com.
Degris, T., Javed, K., Sharifnassab, A., Liu, Y., Sutton, R. S. (2024).
Step-size Optimization for Continual Learning. ArXiv:2401.17401.
ABSTRACT: In continual
learning, a learner has to keep learning from
the data over its whole life time. A key issue
is to decide what knowledge to keep and what
knowledge to let go. In a neural network, this
can be implemented by using a step-size vector
to scale how much gradient samples change
network weights. Common algorithms, like
RMSProp and Adam, use heuristics, specifically
normalization, to adapt this step-size vector.
In this paper, we show that those heuristics
ignore the effect of their adaptation on the
overall objective function, for example by
moving the step-size vector away from better
step-size vectors. On the other hand,
stochastic meta-gradient descent algorithms,
like IDBD (Sutton, 1992), explicitly optimize
the step-size vector with respect to the
overall objective function. On simple
problems, we show that IDBD is able to
consistently improve step-size vectors, where
RMSProp and Adam do not. We explain the
differences between the two approaches and
their respective limitations. We conclude by
suggesting that combining both approaches
could be a promising future direction to
improve the performance of neural networks in
continual learning.
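For readers unfamiliar with IDBD, the following is a minimal sketch of
stochastic meta-gradient step-size adaptation in the spirit of IDBD
(Sutton, 1992) for a linear learner. The squared-error setting, the
variable names, and the default meta step-size are illustrative
assumptions, not code from the paper.

import numpy as np

def idbd_update(w, h, beta, x, y, meta_step=0.01):
    # One IDBD update for a linear prediction y ~ w.x (illustrative sketch).
    # w: weights, h: memory trace, beta: per-feature log step-sizes.
    delta = y - np.dot(w, x)            # prediction error
    beta += meta_step * delta * x * h   # meta-gradient step on the log step-sizes
    alpha = np.exp(beta)                # per-feature step-sizes
    w += alpha * delta * x              # ordinary LMS update with adapted step-sizes
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x  # memory trace
    return w, h, beta

The point of the sketch is that the step-size vector itself is moved
along an estimate of the gradient of the overall error, which is the
property the abstract contrasts with the normalization heuristics of
RMSProp and Adam.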
Dohare, S., Hernandez-Garcia, J.F., Lan, Q., Rahman, P., Sutton, R. S.,
Mahmood, A.R. (2024). Loss of Plasticity in Deep Continual Learning.
Nature 632:768-774.
ABSTRACT:
Artificial neural networks, deep-learning
methods and the backpropagation algorithm form
the foundation of modern machine learning and
artificial intelligence. These methods are
almost always used in two phases, one in which
the weights of the network are updated and one
in which the weights are held constant while the
network is used or evaluated. This contrasts
with natural learning and many applications,
which require continual learning. It has been
unclear whether or not deep learning methods
work in continual learning settings. Here we
show that they do not—that standard
deep-learning methods gradually lose plasticity
in continual-learning settings until they learn
no better than a shallow network. We show such
loss of plasticity using the classic ImageNet
dataset and reinforcement-learning problems
across a wide range of variations in the network
and the learning algorithm. Plasticity is
maintained indefinitely only by algorithms that
continually inject diversity into the network,
such as our continual backpropagation algorithm,
a variation of backpropagation in which a small
fraction of less-used units are continually and
randomly reinitialized. Our results indicate
that methods based on gradient descent are not
enough—that sustained deep learning requires a
random, non-gradient component to maintain
variability and plasticity.
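The selective reinitialization at the heart of continual
backpropagation can be sketched as follows. This is a simplified
illustration: the utility measure, the replacement rate, and the reset
values are assumptions standing in for the specifics given in the
paper.

import numpy as np

def reinitialize_least_used(w_in, w_out, utility, replacement_rate=1e-4, rng=None):
    # Sketch: after an ordinary backprop update, reset a small fraction of the
    # hidden units with the lowest utility. Incoming weights are re-drawn from
    # the initial distribution; outgoing weights are zeroed so a new unit starts
    # with no effect on the network's output.
    rng = rng or np.random.default_rng()
    n_replace = max(1, int(replacement_rate * utility.shape[0]))
    worst = np.argsort(utility)[:n_replace]        # the least-used hidden units
    w_in[:, worst] = rng.normal(0.0, 0.1, size=(w_in.shape[0], n_replace))
    w_out[worst, :] = 0.0
    utility[worst] = 0.0                           # new units start over
    return w_in, w_out, utility

In practice the replacement rate is tiny, so only a small fraction of
the units is ever newly initialized at any one time.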
Javed, K., Sharifnassab, A., Sutton, R. S. (2024). SwiftTD: A Fast and
Robust Algorithm for Temporal Difference Learning. Reinforcement
Learning Journal 1(1). Presented at the Reinforcement Learning
Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.
ABSTRACT:
Learning to make temporal predictions is a key
component of reinforcement learning algorithms.
The dominant paradigm for learning predictions
from an online stream of data is Temporal
Difference (TD) learning. In this work we
introduce a new TD algorithm---SwiftTD---that
learns more accurate predictions than existing
algorithms. SwiftTD combines True Online TD(λ)
with per-feature step-size parameters, step-size
optimization, a bound on the rate of learning,
and step-size decay. Per-feature step-size
parameters and step-size optimization improve
credit assignment by increasing step-size
parameters of important signals and reducing
them for irrelevant signals. The bound on the
rate of learning prevents overcorrections.
Step-size decay reduces step-size parameters if
they are too large. We benchmark SwiftTD on the
Atari Prediction Benchmark and show that even
with linear function approximation it can learn
accurate predictions. We further show that
SwiftTD can be combined with neural networks to
improve their performance. Finally, we show that
all three ideas---step-size optimization, the
bound on the rate of learning, and step-size
decay---contribute to the strong performance of
SwiftTD.
Naik, A., Wan, Y., Tomar, M., Sutton, R. S. (2024). Reward centering.
Reinforcement Learning Journal 1(1). Presented at the Reinforcement
Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.
ABSTRACT: We show that discounted methods
for solving continuing reinforcement learning
problems can perform significantly better if
they center their rewards by subtracting out the
rewards' empirical average. The improvement is
substantial at commonly used discount factors
and increases further as the discount factor
approaches one. In addition, we show that if a problem's
rewards are shifted by a constant, then standard
methods perform much worse, whereas methods with
reward centering are unaffected. Estimating the
average reward is straightforward in the
on-policy setting; we propose a slightly more
sophisticated method for the off-policy setting.
Reward centering is a general idea, so we expect
almost every reinforcement-learning algorithm to
benefit by the addition of reward centering.
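As a concrete illustration of the idea, here is what reward centering
looks like when added to tabular TD(0) in the on-policy case; the
variable names and the simple running-average estimate of the mean
reward are assumptions of this sketch rather than the paper's code.

def centered_td0_update(v, r_bar, s, r, s_next, alpha=0.1, eta=0.1, gamma=0.99):
    # TD(0) with reward centering: the running average reward r_bar is
    # subtracted from every reward before the usual discounted update.
    delta = (r - r_bar) + gamma * v[s_next] - v[s]   # TD error on centered rewards
    v[s] += alpha * delta                            # value update
    r_bar += eta * alpha * (r - r_bar)               # track the average reward
    return v, r_bar

Because r_bar absorbs any constant added to the rewards, shifting all
rewards by a constant leaves the centered updates essentially
unchanged, which is the robustness property described above.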
De Asis, K., Sutton, R. S. (2024). An Idiosyncrasy of
Time-discretization in Reinforcement Learning. Reinforcement Learning
Journal 1(1). Presented at the Reinforcement Learning Conference (RLC),
Amherst, Massachusetts, August 9–12, 2024.
ABSTRACT: Many reinforcement learning
algorithms are built on an assumption that an
agent interacts with an environment over
fixed-duration, discrete time steps. However,
physical systems are continuous in time,
requiring a choice of time-discretization
granularity when digitally controlling them.
Furthermore, such systems do not wait for
decisions to be made before advancing the
environment state, necessitating the study of
how the choice of discretization may affect a
reinforcement learning algorithm. In this work,
we consider the relationship between the
definitions of the continuous-time and
discrete-time returns. Specifically, we
acknowledge an idiosyncrasy with naively
applying a discrete-time algorithm to a
discretized continuous-time environment, and
note how a simple modification can better align
the return definitions. This observation is of
practical consideration when dealing with
environments where time-discretization
granularity is a choice, or situations where
such granularity is inherently stochastic.
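As background for the kind of alignment the abstract refers to, one
standard way to relate the continuous-time return to its
discretization with time step Δt is shown below; the paper's specific
modification may differ, so treat this only as a reminder of why the
reward and the discount must both depend on Δt.

\[
G_t \;=\; \int_{0}^{\infty} e^{-s/\tau}\, r(t+s)\, ds
\;\approx\; \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k\Delta t}\, \Delta t,
\qquad \gamma = e^{-\Delta t/\tau}.
\]

A discrete-time algorithm applied naively, with a fixed γ and rewards
not scaled by Δt, is therefore optimizing a different objective as the
discretization granularity changes.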
Dohare,
S., Hernandez-Garcia, J.F., Rahman, P., Sutton, R. S., Mahmood,
A.R. (2023). Maintaining Plasticity in Deep Continual
Learning. arXiv:2210.14361.
ABSTRACT:
Modern deep-learning systems are specialized to
problem settings in which training occurs once
and then never again, as opposed to
continual-learning settings in which training
occurs continually. If deep-learning systems are
applied in a continual learning setting, then it
is well known that they may fail to remember
earlier examples. More fundamental, but less
well known, is that they may also lose their
ability to learn on new examples, a phenomenon
called loss of plasticity. We provide direct
demonstrations of loss of plasticity using the
MNIST and ImageNet datasets repurposed for
continual learning as sequences of tasks. In
ImageNet, binary classification performance
dropped from 89% accuracy on an early task down
to 77%, about the level of a linear network, on
the 2000th task. Loss of plasticity occurred
with a wide range of deep network architectures,
optimizers, activation functions, batch
normalization, dropout, but was substantially
eased by L2-regularization, particularly when
combined with weight perturbation. Further, we
introduce a new algorithm -- continual
backpropagation -- which slightly modifies
conventional backpropagation to reinitialize a
small fraction of less-used units after each
example and appears to maintain plasticity
indefinitely.
Rafiee, B., Ghiassian, S., Jin, J., Sutton, R.,
Luo, J., & White, A. (2023). Auxiliary task discovery through
generate-and-test. In Proceedings of the Second Conference on
Lifelong Learning Agents. Also arXiv:2210.14361.
ABSTRACT: In
this paper, we explore an approach to auxiliary task discovery
in reinforcement learning based on ideas from representation
learning. Auxiliary tasks tend to improve data efficiency by
forcing the agent to learn auxiliary prediction and control
objectives in addition to the main task of maximizing reward,
and thus producing better representations. Typically these
tasks are designed by people. Meta-learning offers a promising
avenue for automatic task discovery; however, these methods
are computationally expensive and challenging to tune in
practice. In this paper, we explore a complementary approach
to the auxiliary task discovery: continually generating new
auxiliary tasks and preserving only those with high utility.
We also introduce a new measure of auxiliary tasks usefulness
based on how useful the features induced by them are for the
main task. Our discovery algorithm significantly outperforms
random tasks, hand-designed tasks, and learning without
auxiliary tasks across a suite of environments.
De Asis, K., Graves, E., & Sutton, R. S. (2023). Value-aware
Importance Weighting for Off-policy Reinforcement Learning. In Proceedings
of the Second Conference on Lifelong Learning Agents.
Appeared August 22, 2023. Also arXiv:2306.15625.
ABSTRACT: Importance sampling is a
central idea underlying off-policy prediction in reinforcement
learning. It provides a strategy for re-weighting samples from
a distribution to obtain unbiased estimates under another
distribution. However, importance sampling weights tend to
exhibit extreme variance, often leading to stability issues in
practice. In this work, we consider a broader class of
importance weights to correct samples in off-policy learning.
We propose the use of value-aware importance weights
which take into account the sample space to provide lower
variance, but still unbiased, estimates under a target
distribution. We derive how such weights can be computed, and
detail key properties of the resulting importance weights. We
then extend several reinforcement learning prediction
algorithms to the off-policy setting with these weights, and
evaluate them empirically.
Sharifnassab, A., & Sutton, R. S. (2023). Toward
Efficient Gradient-Based Value Estimation. In International
Conference on Machine Learning (pp. 30827-30849). Also Proceedings
of Machine Learning Research 202:30827-30849.
ABSTRACT: Gradient-based methods
for value estimation in reinforcement learning have favorable
stability properties, but they are typically much slower than
Temporal Difference (TD) learning methods. We study the root
causes of this slowness and show that Mean Square Bellman Error
(MSBE) is an ill-conditioned loss function in the sense that its
Hessian has large condition-number. To resolve the adverse
effect of poor conditioning of MSBE on gradient based methods,
we propose a low complexity batch-free proximal method that
approximately follows the Gauss-Newton direction and is
asymptotically robust to parameterization. Our main algorithm,
called RANS, is efficient in the sense that it is significantly
faster than the residual gradient methods while having almost
the same computational complexity, and is competitive with TD on
the classic problems that we tested.
Mathewson,
K. W., Parker, A. S., Sherstan, C., Edwards, A. L., Sutton, R. S.,
& Pilarski, P. M. (2023). Communicative
capital: A key resource for human-machine shared agency and
collaborative capacity. Neural Computing and
Applications 35(23):16805-16819.
ABSTRACT: In this work, we present
a perspective on the role machine intelligence can play in
supporting human abilities. In particular, we consider research
in rehabilitation technologies such as prosthetic devices, as
this domain requires tight coupling between human and machine.
Taking an agent-based view of such devices, we propose that
human–machine collaborations have a capacity to perform tasks
which is a result of the combined agency of the human and the
machine. We introduce communicative capital as a
resource developed by a human and a machine working together in
ongoing interactions. Development of this resource enables the
partnership to eventually perform tasks at a capacity greater
than either individual could achieve alone. We then examine the
benefits and challenges of increasing the agency of prostheses
by surveying literature which demonstrates that building
communicative resources enables more complex, task-directed
interactions. The viewpoint developed in this article extends
current thinking on how best to support the functional use of
increasingly complex prostheses, and establishes insight toward
creating more fruitful interactions between humans and
supportive, assistive, and augmentative technologies.
Javed,
K., Shah, H., Sutton, R. S., & White, M. (2023). Scalable
Real-Time Recurrent Learning Using Columnar-Constructive
Networks. Journal of Machine Learning Research 24(256):1-34.
ABSTRACT: Constructing states from sequences of observations is an
important component of reinforcement learning agents. One solution for
state construction is to use recurrent neural networks.
Back-propagation through time (BPTT) and real-time recurrent learning
(RTRL) are two popular gradient-based methods for recurrent learning.
BPTT requires complete trajectories of observations before it can
compute the gradients and is unsuitable for online updates. RTRL can do
online updates but scales poorly to large networks. In this paper, we
propose two constraints that make RTRL scalable. We show that by either
decomposing the network into independent modules or learning the
network in stages, we can make RTRL scale linearly with the number of
parameters. Unlike prior scalable gradient estimation algorithms, such
as UORO and Truncated-BPTT, our algorithms do not add noise or bias to
the gradient estimate. Instead, they trade off the functional capacity
of the network for computationally efficient learning. We demonstrate
the effectiveness of our approach over Truncated-BPTT on a prediction
benchmark inspired by animal learning and by doing policy evaluation of
pre-trained policies for Atari 2600 games.
Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers,
F., Tanner, B., & White, A. (2023). Reward-respecting subtasks for
model-based reinforcement learning. Artificial Intelligence 324.
ArXiv:2202.03466.
ABSTRACT: To achieve the ambitious goals of artificial intelligence,
reinforcement learning must include planning with a model of the world
that is abstract in state and time. Deep learning has made progress
with state abstraction, but temporal abstraction has rarely been used,
despite extensively developed theory based on the options framework.
One reason for this is that the space of possible options is immense,
and the methods previously proposed for option discovery do not take
into account how the option models will be used in planning. Options
are typically discovered by posing subsidiary tasks, such as reaching a
bottleneck state or maximizing the cumulative sum of a sensory signal
other than reward. Each subtask is solved to produce an option, and
then a model of the option is learned and made available to the
planning process. In most previous work, the subtasks ignore the reward
on the original problem, whereas we propose subtasks that use the
original reward plus a bonus based on a feature of the state at the
time the option terminates. We show that option models obtained from
such reward-respecting subtasks are much more likely to be useful in
planning than eigenoptions, shortest path options based on bottleneck
states, or reward-respecting options generated by the option-critic.
Reward-respecting subtasks strongly constrain the space of options and
thereby also provide a partial solution to the problem of option
discovery. Finally, we show how values, policies, options, and models
can all be learned online and off-policy using standard algorithms and
general value functions.
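One literal way to write the return of such a reward-respecting
subtask, purely as an illustration of the description above (the
paper's exact formulation, including how the stopping value is defined,
may differ), is

\[
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{\,T-t-1} R_{T}
\;+\; \gamma^{\,T-t}\, c\, x_i(S_T),
\]

where T is the time at which the option terminates, x_i is the stopping
feature that defines the subtask, and c weights the bonus. The bonus
term at termination is what distinguishes these subtasks from the
reward-ignoring subtasks of most previous work.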
Wan, Y., Sutton, R. S. (2022). Toward discovering options that achieve
faster planning. ArXiv:2205.12515.
ABSTRACT: We propose a new objective for option discovery that
emphasizes the computational advantage of using options in planning.
For a given set of episodic tasks and a given number of options, the
objective prefers options that can be used to achieve a high return by
composing few options. By composing few options, fast planning can be
achieved. When faced with new tasks similar to the given ones, the
discovered options are also expected to accelerate planning. Our
objective extends the objective proposed by Harb et al. (2018) for the
single-task setting to the multi-task setting. A closer look at Harb et
al.'s objective shows that the best options discovered given one task
are not likely to be useful for future unseen tasks and that the
multi-task setting is indeed necessary for this purpose. In the same
paper, Harb et al. also proposed an algorithm to optimize their
objective, and the algorithm can be naturally extended to the
multi-task setting. We empirically show that in the four-room domain
the extension does not achieve a high objective value and propose a new
algorithm that better optimizes the proposed objective. In the same
four-room domain, we show that 1) a higher objective value is typically
associated with options with which fewer planning iterations are needed
to achieve near-optimal performance, 2) our new algorithm achieves a
high objective value, which is close to the value achieved by a set of
human-designed options, 3) the best number of planning iterations given
the discovered options is much smaller and matches that obtained with
human-designed options, and 4) the options produced by our algorithm
also make intuitive sense because they move to and terminate at cells
near hallways connecting two neighboring rooms.
Tian, T., Young, K., & Sutton, R. S. (2022). Doubly-asynchronous value
iteration: Making value iteration asynchronous in actions. In Advances
in Neural Information Processing Systems 35. Also ArXiv:2207.01613.
ABSTRACT: Value iteration (VI) is a foundational dynamic programming
method, important for learning and planning in optimal control and
reinforcement learning. VI proceeds in batches, where the update to the
value of each state must be completed before the next batch of updates
can begin. Completing a single batch is prohibitively expensive if the
state space is large, rendering VI impractical for many applications.
Asynchronous VI helps to address the large state space problem by
updating one state at a time, in-place and in an arbitrary order.
However, Asynchronous VI still requires a maximization over the entire
action space, making it impractical for domains with large action
spaces. To address this issue, we propose doubly-asynchronous value
iteration (DAVI), a new algorithm that generalizes the idea of
asynchrony from states to states and actions. More concretely, DAVI
maximizes over a sampled subset of actions that can be of any
user-defined size. This simple approach of using sampling to reduce
computation maintains similarly appealing theoretical properties to VI
without the need to wait for a full sweep through the entire action
space in each update. In this paper, we show DAVI converges to the
optimal value function with probability one, converges at a
near-geometric rate with probability 1-delta, and returns a
near-optimal policy in computation time that nearly matches a
previously established bound for VI. We also empirically demonstrate
DAVI's effectiveness in several experiments.
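The core computational idea, replacing the full maximization with a
maximization over a sampled subset of actions, can be sketched as
below. This is an illustration only: the construction of the candidate
set, including retention of the previously greedy action, is an
assumption of the sketch and not necessarily how DAVI is specified in
the paper.

import numpy as np

def sampled_action_vi_update(v, greedy_a, s, P, R, actions, gamma=0.99, k=4, rng=None):
    # Asynchronous-in-actions value-iteration update for one state s (sketch).
    # P[s][a] is a dict {s_next: prob}, R[s][a] is the expected reward.
    rng = rng or np.random.default_rng()
    candidates = set(rng.choice(actions, size=min(k, len(actions)), replace=False))
    candidates.add(greedy_a[s])          # keep the remembered best action (assumption)
    best_a, best_q = greedy_a[s], -np.inf
    for a in candidates:
        q = R[s][a] + gamma * sum(p * v[sp] for sp, p in P[s][a].items())
        if q > best_q:
            best_a, best_q = a, q
    v[s], greedy_a[s] = best_q, best_a   # update the value and the remembered action
    return v, greedy_a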
Sutton, R. S. (2022). A history of meta-gradient: Gradient methods for
meta-learning. ArXiv:2202.09701.
ABSTRACT: Systems for learning parameters often have meta-parameters
such as step sizes, initial weights, or dimensional weightings. With a
given setting of the meta-parameters, the learning system is complete
and capable of finding parameters that are suited to the task, but its
efficiency typically depends on the particular choice of
meta-parameters. This has led to interest in learning processes that
can find good choices for meta-parameters automatically from
experience. These higher-level learning methods are often characterized
as "learning to learn" or, as we shall call them here, meta-learning.
In this article we briefly review the history of meta-learning.
Sutton,
R. S., Bowling, M., Pilarski, P. M.
(2022). The Alberta plan for AI
research. ArXiv:2208.11173.
ABSTRACT:
Herein we describe our
approach to artificial intelligence (AI) research,
which we call the Alberta Plan. The
Alberta Plan is pursued within our research groups
in Alberta and by others who are like minded
throughout the world. We welcome all who would
join us in this pursuit.
Sutton,
R. S. (2022). The quest for a common model of the
intelligent agent. In 5th Multi-disciplinary
Conference on Reinforcement Learning and Decision
Making. ArXiv:2202.13252
ABSTRACT: The premise of the Multi-disciplinary
Conference on Reinforcement Learning and
Decision Making is that multiple disciplines
share an interest in goal-directed decision making
over time. The idea of this paper is to sharpen
and deepen this premise by proposing a perspective
on the decision maker that is substantive and
widely held across psychology, artificial
intelligence, economics, control theory, and
neuroscience, which I call the common model of
the intelligent agent. The common model does
not include anything specific to any organism,
world, or application domain. The common model
does include aspects of the decision maker’s
interaction with its world (there must be input
and output, and a goal) and internal components of
the decision maker (for perception,
decision-making, internal evaluation, and a world
model). I identify these aspects and components,
note that they are given different names in
different disciplines but refer essentially to the
same ideas, and discuss the challenges and
benefits of devising a neutral terminology that
can be used across disciplines. It is time to
recognize and build on the convergence of multiple
diverse disciplines on a substantive common model
of the intelligent agent.
Javed, K., Shah,
H., Sutton, R. S., “Scalable
Online State Construction using Recurrent Networks.”
In Multi-disciplinary Conference on
Reinforcement Learning and Decision Making,
2022.
ABSTRACT: State construction from
sensory observations is an important component of
a reinforcement learning agent. One solution for
state construction is to use recurrent neural
networks. Two popular gradient-based methods for
recurrent learning are back-propagation through
time (BPTT), and real-time recurrent learning
(RTRL). BPTT looks at the complete sequence of
observations before computing gradients and is
unsuitable for online updates. RTRL can do online
updates but scales poorly to large networks. In
this paper, we propose two constraints that make
RTRL scalable. We show that by either decomposing
the network into independent modules or learning a
recurrent network incrementally, we can make RTRL
scale linearly with the number of parameters.
Unlike prior scalable gradient estimation
algorithms, such as UORO and Truncated-BPTT, our
algorithms do not add bias or noise to the
gradient estimate. Instead, they trade off the
functional capacity of the recurrent network to
achieve scalable learning. We demonstrate the
effectiveness of our approach on a prediction
learning benchmark inspired by animal learning.
Wan, Y., Sutton,
R. S., “Discovering
Options by Minimizing the Number of Composed
Options to Solve Multiple Tasks.” In Multi-disciplinary
Conference on Reinforcement Learning and Decision
Making, 2022.
ABSTRACT: We propose a new metric
for discovering options in the tabular setting: a
set of options is better than another set with the same number of
options if, for a given set of episodic
tasks, on average, optimal solutions to the tasks
can be obtained by composing fewer options. By
composing fewer options to obtain solutions,
planning requires less computation and can be
faster. We propose a new objective function and
prove that the set of options that minimizes the
proposed objective results in the least number of
composed options on average in the tabular
setting. In the function approximation setting,
where generally the optimal solutions are not
achievable, this objective trades off higher
returns against fewer composed options. We further
propose a tabular algorithm that optimizes the
proposed objective, based on the option-critic
algorithm (Bacon et al., 2017). In a four-room
domain with 16 tasks, we demonstrate that the
options produced by our algorithm are consistent
with human intuition because they seem to lead to
cells near hallways connecting two rooms.
De Asis, K., Chan, A., Sutton, R. S., “Inverse Policy
Evaluation for Value-based Decision Making.”
In Multi-disciplinary Conference on
Reinforcement Learning and Decision Making,
2022.
ABSTRACT: Value-based methods for
control often involve approximate value iteration
(e.g., Q-learning), and behaving greedily with
respect to the resulting value estimates with some
degree of entropy to ensure the state-space is
sufficiently explored. Such a greedy policy is an
improvement over the previous policy, under which
the current values were estimated. However,
value-iteration may produce value functions that
do not correspond with any policy. This is
especially relevant with function approximation,
when the true value function cannot be perfectly
represented. In this work, we explore the use of
Inverse Policy Evaluation, the process of solving
for a likely policy given a value function. We
derive a simple, incremental algorithm for the
procedure, analyze how inaccuracies in the value
function manifest in the corresponding policy, and
provide empirical results emphasizing key
properties of the mapping from value functions to
policies.
Dohare, S.,
Mahmood, R., Sutton, R. S., “Continual
Backprop: Stochastic Gradient Descent with
Persistent Randomness.” In Multi-disciplinary
Conference on Reinforcement Learning and Decision
Making, 2022.
ABSTRACT: The Backprop algorithm for
learning in neural networks utilizes two
mechanisms: first, stochastic gradient descent and
second, initialization with small random weights,
where the latter is essential to the effectiveness
of the former. We show that in continual learning
setups, Backprop performs well initially, but over
time its performance degrades. Stochastic gradient
descent alone is insufficient to learn
continually; the initial randomness enables only
initial learning but not continual learning. To
the best of our knowledge, ours is the first
result showing this degradation in Backprop’s
ability to learn. To address this degradation in
Backprop’s plasticity, we propose an algorithm
that continually injects random features alongside
gradient descent using a new generate-and-test
process. We call this the Continual Backprop
algorithm. We show that, unlike Backprop,
Continual Backprop is able to continually adapt in
both supervised and reinforcement learning (RL)
problems. Continual Backprop has the same
computational complexity as Backprop and can be
seen as a natural extension of Backprop for
continual learning.
Sutton, R. S.,
Machado, M. C., Holland, G. Z., Szepesvari, D.,
Timbers, F., Tanner, B., White, A. “Reward-Respecting
Subtasks for Model-Based Reinforcement Learning.”
In Multi-disciplinary Conference on
Reinforcement Learning and Decision Making,
2022.
ABSTRACT: To achieve the ambitious
goals of artificial intelligence, reinforcement
learning must include planning with a model of the
world that is abstract in state and time. Deep
learning has made progress in state abstraction,
but, although the theory of time abstraction has
been extensively developed based on the options
framework, in practice options have rarely been
used in planning. One reason for this is that the
space of possible options is immense and the
methods previously proposed for option discovery
do not take into account how the option models
will be used in planning. Options are typically
discovered by posing subsidiary tasks such as
reaching a bottleneck state, or maximizing a
sensory signal other than the reward. Each subtask
is solved to produce an option, and then a model
of the option is learned and made available to the
planning process. The subtasks proposed in most
previous work ignore the reward on the original
problem, whereas we propose subtasks that use the
original reward plus a bonus based on a feature of
the state at the time the option stops. We show
that options and option models obtained from such
reward-respecting subtasks are much more likely to
be useful in planning and can be learned online
and off-policy using existing learning algorithms.
Reward-respecting subtasks strongly constrain the
space of options and thereby also provide a
partial solution to the problem of option
discovery. Finally, we show how the algorithms for
learning values, policies, options, and models can
be unified using general value functions.
Naik, A., Sutton,
R. S. (2022). Multi-Step
Average-Reward Prediction via Differential TD(λ).
In Multi-disciplinary Conference on
Reinforcement Learning and Decision Making.
ABSTRACT: We present Differential
TD(λ), an average-reward algorithm that extends
Wan et al.’s (2021) Differential TD(0) from the
one-step tabular case to that of multi-step linear
function approximation using eligibility traces.
We characterize the algorithm’s fixed point and
present a convergence theorem. Our analysis builds
on prior work by Tsitsiklis and Van Roy (1999),
who proposed a trace-based prediction algorithm
that is restricted to the on-policy setting.
Preliminary experiments show that Differential
TD(λ) converges to the fixed point predicted by
the theory and that an intermediate value of the
bootstrapping parameter λ results in the most
sample efficiency.
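For orientation, an update of the kind the abstract describes has
roughly the following form with linear function approximation. This is
a sketch extrapolated from Differential TD(0); the paper's exact trace
and average-reward updates may differ.

\begin{align*}
\delta_t &= R_{t+1} - \bar R_t + \hat v(S_{t+1}, \mathbf{w}_t) - \hat v(S_t, \mathbf{w}_t),\\
\mathbf{z}_t &= \lambda\, \mathbf{z}_{t-1} + \mathbf{x}(S_t),\\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t,\\
\bar R_{t+1} &= \bar R_t + \eta\, \alpha\, \delta_t,
\end{align*}

where \bar R_t is the average-reward estimate, \mathbf{z}_t the
eligibility trace, \mathbf{x}(S_t) the feature vector, and η the
relative step size of the average-reward update.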
Rafiee,
B., Abbas, Z., Ghiassian, S., Kumaraswamy,
R., Sutton, R., Ludvig, E., White, A.
(2022). From
Eye-blinks to State Construction: Diagnostic
Benchmarks for Online Representation Learning.
Adaptive Behavior. Also arXiv:2011.04590.
ABSTRACT: Experiments in classical
conditioning show that animals such as rabbits, pigeons, and
dogs can make long temporal associations that enable
multi-step prediction. To replicate this remarkable ability,
an agent must construct an internal state representation that
summarizes its interaction history. Recurrent neural networks
can automatically construct state and learn temporal
associations. But the current training methods are
prohibitively expensive for online prediction -- continual
learning on every time step -- which is the focus of this
paper. To facilitate research in online prediction, we present
three new diagnostic prediction problems inspired by
classical-conditioning experiments. The proposed problems test
the learning capabilities that animals readily exhibit and
highlight the current recurrent learning methods' limitations.
While the proposed problems are nontrivial, they are still
amenable to extensive testing and analysis in the
small-compute regime, thereby enabling researchers to study
issues in isolation carefully, ultimately accelerating
progress towards scalable online representation learning
methods.
Wan, Y., Naik, A., Sutton, R. (2021). Average-Reward Learning and
Planning with Options. Advances in Neural Information Processing
Systems 34, 22758-22769.
ABSTRACT: We extend the options
framework for temporal abstraction in reinforcement learning
from discounted Markov decision processes (MDPs) to
average-reward MDPs. Our contributions include general
convergent off-policy inter-option learning algorithms,
intra-option algorithms for learning values and models, as
well as sample based planning variants of our learning
algorithms. Our algorithms and convergence proofs extend those
recently developed by Wan, Naik, and Sutton. We also extend
the notion of option-interrupting behavior from the discounted
to the average-reward formulation. We show the efficacy of the
proposed algorithms with experiments on a continuing version
of the Four-Room domain.
Samani, A., Sutton, R. S. (2021). Learning agent state online with
recurrent generate-and-test. ArXiv:2112.15236.
ABSTRACT: Learning continually and online from a continuous stream of
data is challenging, especially for a reinforcement learning agent with
sequential data. When the environment only provides observations giving
partial information about the state of the environment, the agent must
learn the agent state based on the data stream of experience. We refer
to the state learned directly from the data stream of experience as the
agent state. Recurrent neural networks can learn the agent state, but
the training methods are computationally expensive and sensitive to the
hyper-parameters, making them unideal for online learning. This work
introduces methods based on the generate-and-test approach to learn the
agent state. A generate-and-test algorithm searches for state features
by generating features and testing their usefulness. In this process,
features useful for the agent's performance on the task are preserved,
and the least useful features get replaced with newly generated
features. We study the effectiveness of our methods on two online
multi-step prediction problems. The first problem, trace conditioning,
focuses on the agent's ability to remember a cue for a prediction
multiple steps into the future. In the second problem, trace
patterning, the agent needs to learn patterns in the observation
signals and remember them for future predictions. We show that our
proposed methods can effectively learn the agent state online and
produce accurate predictions.
Ashley, D. R.,
Ghiassian, S., Sutton, R. S. (2021) Does the Adam
Optimizer Exacerbate Catastrophic Forgetting?
ArXiv:2102.07686.
ABSTRACT: Catastrophic
forgetting remains a severe hindrance to the broad
application of artificial neural networks (ANNs), however,
it continues to be a poorly understood phenomenon. Despite
the extensive amount of work on catastrophic forgetting,
we argue that it is still unclear how exactly the
phenomenon should be quantified, and, moreover, to what
degree all of the choices we make when designing learning
systems affect the amount of catastrophic forgetting. We
use various testbeds from the reinforcement learning and
supervised learning literature to (1) provide evidence
that the choice of which modern gradient-based
optimization algorithm is used to train an ANN has a
significant impact on the amount of catastrophic
forgetting and show that, surprisingly, in many instances
classical algorithms such as vanilla SGD experience less
catastrophic forgetting than the more modern algorithms
such as Adam. We empirically compare four different
existing metrics for quantifying catastrophic forgetting
and (2) show that the degree to which the learning systems
experience catastrophic forgetting is sufficiently
sensitive to the metric used that a change from one
principled metric to another is enough to change the
conclusions of a study dramatically. Our results suggest
that a much more rigorous experimental methodology is
required when looking at catastrophic forgetting. Based on
our results, we recommend inter-task forgetting in
supervised learning must be measured with both retention
and relearning metrics concurrently, and intra-task
forgetting in reinforcement learning must---at the very
least---be measured with pairwise interference.
Kudashkina, K., Wan, Y., Naik, A., Sutton, R. S. (2021) Planning with
Expectation Models for Control, ArXiv:2104.08543.
ABSTRACT: In
model-based reinforcement learning (MBRL), Wan et al. (2019)
showed conditions under which the environment model could
produce the expectation of the next feature vector rather than
the full distribution, or a sample thereof, with no loss in
planning performance. Such expectation models are of interest
when the environment is stochastic and non-stationary, and the
model is approximate, such as when it is learned using
function approximation. In these cases a full distribution
model may be impractical and a sample model may be either more
expensive computationally or of high variance. Wan et al.
considered only planning for prediction to evaluate a fixed
policy. In this paper, we treat the control case - planning to
improve and find a good approximate policy. We prove that
planning with an expectation model must update a state-value
function, not an action-value function as previously suggested
(e.g., Sorg & Singh, 2010). This opens the question of how
planning influences action selections. We consider three
strategies for this and present general MBRL algorithms for
each. We identify the strengths and weaknesses of these
algorithms in computational experiments. Our algorithms and
experiments are the first to treat MBRL with expectation
models in a general setting.
Silver,
D., Singh, S., Precup, D., Sutton, R. S. (2021) Reward Is Enough. Artificial
Intelligence 299, 103535.
ABSTRACT: In
this article we hypothesise that intelligence, and its
associated abilities, can be understood as subserving the
maximisation of reward. Accordingly, reward is enough to drive
behaviour that exhibits abilities studied in natural and
artificial intelligence, including knowledge, learning,
perception, social intelligence, language, generalisation and
imitation. This is in contrast to the view that specialised
problem formulations are needed for each ability, based on other
signals or objectives. Furthermore, we suggest that agents that
learn through trial and error experience to maximise reward
could learn behaviour that exhibits most if not all of these
abilities, and therefore that powerful reinforcement learning
agents could constitute a solution to artificial general
intelligence.
Ghiassian, S., Sutton, R. S. (2021) An Empirical Comparison
of Off-policy Prediction Learning Algorithms on the Collision
Task. ArXiv:2106.00922.
ABSTRACT: Off-policy
prediction---learning the value function for one policy from
data generated while following another policy---is one of the
most challenging subproblems in reinforcement learning. This
paper presents empirical results with eleven prominent
off-policy learning algorithms that use linear function
approximation: five Gradient-TD methods, two Emphatic-TD
methods, Off-policy TD(λ), Vtrace, and versions of Tree Backup
and ABQ modified to apply to a prediction setting. Our
experiments used the Collision task, a small idealized
off-policy problem analogous to that of an autonomous car trying
to predict whether it will collide with an obstacle. We assessed
the performance of the algorithms according to their learning
rate, asymptotic error level, and sensitivity to step-size and
bootstrapping parameters. By these measures, the eleven
algorithms can be partially ordered on the Collision task. In
the top tier, the two Emphatic-TD algorithms learned the
fastest, reached the lowest errors, and were robust to parameter
settings. In the middle tier, the five Gradient-TD algorithms
and Off-policy TD(λ) were more sensitive to the bootstrapping
parameter. The bottom tier comprised Vtrace, Tree Backup, and
ABQ; these algorithms were no faster and had higher asymptotic
error than the others. Our results are definitive for this task,
though of course experiments with more tasks are needed before
an overall assessment of the algorithms' merits can be made.
Zhang, S., Wan, Y., Naik, A., Sutton, R.
S., Whiteson, S. (2021). Average-Reward
Off-Policy Policy Evaluation with Function
Approximation. In Proceedings
of the International Conference on Machine Learning.
ABSTRACT: We
consider off-policy policy evaluation with function
approximation (FA) in average-reward MDPs, where the
goal is to estimate both the reward rate and the
differential value function. For this problem,
bootstrapping is necessary and, along with off-policy
learning and FA, results in the deadly triad (Sutton
& Barto, 2018). To address the deadly triad, we
propose two novel algorithms, reproducing the
celebrated success of Gradient TD algorithms in the
average-reward setting. In terms of estimating the
differential value function, the algorithms are the
first convergent off-policy linear function
approximation algorithms. In terms of estimating the
reward rate, the algorithms are the first convergent
off-policy linear function approximation algorithms
that do not require estimating the density ratio. We
demonstrate empirically the advantage of the proposed
algorithms, as well as their nonlinear variants, over
a competitive density-ratio-based approach, in a
simple domain as well as challenging robot simulation
tasks.
Wan,
Y., Naik, A., Sutton, R. S. (2021). Learning and Planning in
Average-Reward Markov Decision Processes. In
Proceedings of the International Conference on
Machine Learning.
ABSTRACT: We
introduce learning and planning algorithms for
average-reward MDPs, including 1) the first
general proven-convergent off-policy model-free
control algorithm without reference states, 2)
the first proven-convergent off-policy
model-free prediction algorithm, and 3) the
first off-policy learning algorithm that
converges to the actual value function rather
than to the value function plus an offset. All
of our algorithms are based on using the
temporal-difference error rather than the
conventional error when updating the estimate of
the average reward. Our proof techniques are a
slight generalization of those by Abounadi,
Bertsekas, and Borkar (2001). In experiments
with an Access-Control Queuing Task, we show
some of the difficulties that can arise when
using methods that rely on reference states and
argue that our new algorithms can be
significantly easier to use.
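As a concrete picture of the first contribution, the tabular control
algorithm (Differential Q-learning) can be sketched as follows, with
the average-reward estimate moved by the TD error rather than by the
observed reward; the names and default parameters are illustrative.

def differential_q_update(q, r_bar, s, a, r, s_next, alpha=0.1, eta=0.1):
    # One-step tabular Differential Q-learning update (sketch).
    # r_bar is updated from the TD error delta, not from the raw reward,
    # which is what removes the need for reference states.
    delta = r - r_bar + max(q[s_next].values()) - q[s][a]
    q[s][a] += alpha * delta
    r_bar += eta * alpha * delta
    return q, r_bar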
Lee,
J. Y., Sutton, R. S. (2021). Policy iterations for
reinforcement learning problems in continuous time and space —
Fundamental theory and methods. Automatica 126, 109421.
ABSTRACT:
Policy iteration (PI) is a recursive process of policy
evaluation and improvement for solving an optimal
decision-making/control problem, or in other words, a
reinforcement learning (RL) problem. PI has also served as the
fundamental to develop RL methods. In this paper, we propose two
PI methods, called differential PI (DPI) and integral PI (IPI),
and their variants to solve a general RL problem in continuous
time and space (CTS), with the environment modeled by a system
of ordinary differential equations (ODEs). The proposed methods
inherit the current ideas of PI in classical RL and optimal
control and theoretically support the existing RL algorithms in
CTS: TD-learning and value-gradient-based (VGB) greedy policy
update. We also provide case studies including 1) discounted RL
and 2) optimal control tasks. Fundamental mathematical
properties — admissibility, uniqueness of the solution to the
Bellman equation, monotone improvement, convergence, and
optimality of the solution to the Hamilton-Jacobi-Bellman
equation (HJBE) — are all investigated in-depth and improved
from the existing theory, along with the general and case
studies. Finally, the proposed ones are simulated with an
inverted-pendulum model and their model-based and partially
model-free implementations to support the theory and further
investigate them.
Barto, A. G., Sutton, R. S., Anderson, C. W.
(2021) Looking
Back on the Actor-Critic Architecture. IEEE Transactions
on Systems, Man, and Cybernetics: Systems. 51(1).
ABSTRACT: This
retrospective describes the overall research project that gave
rise to the authors' paper “Neuronlike adaptive elements that
can solve difficult learning control problems” that was
published in the 1983 Neural and Sensory Information Processing
special issue of the IEEE Transactions on Systems, Man, and
Cybernetics. This look back explains how this project came
about, presents the ideas and previous publications that
influenced it, and describes our most closely related subsequent
research. It concludes by pointing out some noteworthy aspects
of this article that have been eclipsed by its main
contributions, followed by commenting on some of the directions
and cautions that should inform future research.
Chan, A., De Asis, K., & Sutton, R. S. (2020). Inverse policy
evaluation for value-based sequential decision-making.
ArXiv:2008.11329.
ABSTRACT: Value-based methods for reinforcement learning lack generally
applicable ways to derive behavior from a value function. Many
approaches involve approximate value iteration (e.g., Q-learning), and
acting greedily with respect to the estimates with an arbitrary degree
of entropy to ensure that the state-space is sufficiently explored.
Behavior based on explicit greedification assumes that the values
reflect those of some policy, over which the greedy policy will be an
improvement. However, value-iteration can produce value functions that
do not correspond to any policy. This is especially relevant in the
function-approximation regime, when the true value function can't be
perfectly represented. In this work, we explore the use of inverse
policy evaluation, the process of solving for a likely policy given a
value function, for deriving behavior from a value function. We provide
theoretical and empirical results to show that inverse policy
evaluation, combined with an approximate value iteration algorithm, is
a feasible method for value-based control.
Dohare, S., Sutton, R. S., Mahmood, A. R. (2021). Continual backprop:
Stochastic gradient descent with persistent randomness.
ArXiv:2108.06325.
ABSTRACT: The Backprop algorithm for learning in neural networks
utilizes two mechanisms: first, stochastic gradient descent and second,
initialization with small random weights, where the latter is essential
to the effectiveness of the former. We show that in continual learning
setups, Backprop performs well initially, but over time its performance
degrades. Stochastic gradient descent alone is insufficient to learn
continually; the initial randomness enables only initial learning but
not continual learning. To the best of our knowledge, ours is the first
result showing this degradation in Backprop's ability to learn. To
address this degradation in Backprop's plasticity, we propose an
algorithm that continually injects random features alongside gradient
descent using a new generate-and-test process. We call this the
Continual Backprop algorithm. We show that, unlike Backprop, Continual
Backprop is able to continually adapt in both supervised and
reinforcement learning (RL) problems. Continual Backprop has the same
computational complexity as Backprop and can be seen as a natural
extension of Backprop for continual learning.
Ghiassian, S., Sutton, R. S. (2021). An empirical comparison of
off-policy prediction learning algorithms in the four rooms
environment. ArXiv:2109.05110.
ABSTRACT: Many off-policy prediction learning algorithms have been
proposed in the past decade, but it remains unclear which algorithms
learn faster than others. We empirically compare 11 off-policy
prediction learning algorithms with linear function approximation on
two small tasks: the Rooms task, and the High Variance Rooms task. The
tasks are designed such that learning fast in them is challenging. In
the Rooms task, the product of importance sampling ratios can be as
large as 2^14 and can sometimes be two. To control the high variance
caused by the product of the importance sampling ratios, step size
should be set small, which in turn slows down learning. The High
Variance Rooms task is more extreme in that the product of the ratios
can become as large as 2^14 × 2^5. This paper builds upon the empirical
study of off-policy prediction learning algorithms by Ghiassian and
Sutton (2021). We consider the same set of algorithms as theirs and
employ the same experimental methodology. The algorithms considered
are: Off-policy TD(λ), five Gradient-TD algorithms, two Emphatic-TD
algorithms, Tree Backup(λ), Vtrace(λ), and ABTD(ζ). We found that the
algorithms' performance is highly affected by the variance induced by
the importance sampling ratios. The data shows that Tree Backup(λ),
Vtrace(λ), and ABTD(ζ) are not affected by the high variance as much as
other algorithms but they restrict the effective bootstrapping
parameter in a way that is too limiting for tasks where high variance
is not present. We observed that Emphatic TD(λ) tends to have lower
asymptotic error than other algorithms, but might learn more slowly in
some cases. We suggest algorithms for practitioners based on their
problem of interest, and suggest approaches that can be applied to
specific algorithms that might result in substantially improved
algorithms.
De Asis, K., Chan, A., Pitis, S., Sutton, R., &
Graves, D. (2020). Fixed-horizon
temporal difference methods for stable reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence
(Vol. 34, No. 04, pp. 3741-3748).
ABSTRACT: We explore
fixed-horizon temporal difference (TD) methods, reinforcement
learning algorithms for a new kind of value function that
predicts the sum of rewards over a fixed number of future time
steps. To learn the value function for horizon h, these
algorithms bootstrap from the value function for horizon h−1, or
some shorter horizon. Because no value function bootstraps from
itself, fixed-horizon methods are immune to the stability
problems that plague other off-policy TD methods using function
approximation (also known as “the deadly triad”). Although
fixed-horizon methods require the storage of additional value
functions, this gives the agent additional predictive power,
while the added complexity can be substantially reduced via
parallel updates, shared weights, and n-step bootstrapping. We
show how to use fixed-horizon value functions to solve
reinforcement learning problems competitively with methods such
as Q-learning that learn conventional value functions. We also
prove convergence of fixed-horizon temporal difference methods
with linear and general function approximation. Taken together,
our results establish fixed-horizon TD methods as a viable new
way of avoiding the stability problems of the deadly triad.
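A minimal sketch of the tabular prediction version may make the
bootstrapping structure concrete; the data structures and parameters
here are assumptions of the sketch, not the paper's code.

def fixed_horizon_td_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # V[h][s] estimates the expected discounted sum of rewards over the
    # next h steps. Each horizon bootstraps only from the next-shorter
    # horizon, so no value function ever bootstraps from itself.
    H = len(V) - 1
    for h in range(H, 0, -1):
        target = r + gamma * V[h - 1][s_next]
        V[h][s] += alpha * (target - V[h][s])
    # V[0] is identically zero (the return over zero steps) and is never updated.
    return V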
Dalrymple, A. N., Roszko, D. A., Sutton,
R. S., & Mushahwar, V. K. (2020). Pavlovian control of
intraspinal microstimulation to produce over-ground walking.
Journal of Neural Engineering 17(3), 036002.
ABSTRACT: Objective. Neuromodulation technologies
are increasingly used for improving function after neural
injury. To achieve a symbiotic relationship between device and
user, the device must augment remaining function, and
independently adapt to day-to-day changes in function. The goal
of this study was to develop predictive control strategies to
produce over-ground walking in a model of hemisection spinal
cord injury (SCI) using intraspinal microstimulation (ISMS). Approach.
Eight cats were anaesthetized and placed in a sling over a
walkway. The residual function of a hemisection SCI was mimicked
by manually moving one hind-limb through the walking cycle. ISMS
targeted motor networks in the lumbosacral enlargement to
activate muscles in the other, presumably ‘paralyzed’ limb,
using low levels of current (<130 μA). Four people took turns
to move the ‘intact’ limb, generating four different walking
styles. Two control strategies, which used ground reaction force
and angular velocity information about the manually moved
‘intact’ limb to control the timing of the transitions of the
‘paralyzed’ limb through the step cycle, were compared. The
first strategy used thresholds on the raw sensor values to
initiate transitions. The second strategy used reinforcement
learning and Pavlovian control to learn predictions about the
sensor values. Thresholds on the predictions were then used to
initiate transitions. Main results. Both control
strategies were able to produce alternating, over-ground
walking. Transitions based on raw sensor values required manual
tuning of thresholds for each person to produce walking, whereas
Pavlovian control did not. Learning occurred quickly during
walking: predictions of the sensor signals were learned rapidly,
initiating correct transitions after ≤4 steps. Pavlovian control
was resilient to different walking styles and different cats,
and recovered from induced mistakes during walking. Significance.
This work demonstrates, for the first time, that Pavlovian
control can augment remaining function and facilitate
personalized walking with minimal tuning requirements.
Kudashkina, K., Pilarski, P. M., & Sutton, R. S. (2020). Document-editing
Assistants and Model-based Reinforcement Learning as a Path to
Conversational AI. ArXiv:2008.12095.
ABSTRACT: Intelligent assistants that follow
commands or answer simple questions, such as Siri and Google
search, are among the most economically important applications
of AI. Future conversational AI assistants promise even greater
capabilities and a better user experience through a deeper
understanding of the domain, the user, or the user's purposes.
But what domain and what methods are best suited to researching
and realizing this promise? In this article we argue for the
domain of voice document editing and for the methods of
model-based reinforcement learning. The primary advantages of
voice document editing are that the domain is tightly scoped and
that it provides something for the conversation to be about (the
document) that is delimited and fully accessible to the
intelligent assistant. The advantages of reinforcement learning
in general are that its methods are designed to learn from
interaction without explicit instruction and that it formalizes
the purposes of the assistant. Model-based reinforcement
learning is needed in order to genuinely understand the domain
of discourse and thereby work efficiently with the user to
achieve their goals. Together, voice document editing and
model-based reinforcement learning comprise a promising research
direction for achieving conversational AI.
Young, K., & Sutton, R. S. (2020). Understanding the
Pathologies of Approximate Policy Evaluation when Combined with
Greedification in Reinforcement Learning. ArXiv:2010.15268.
ABSTRACT: Despite
empirical success, the theory of reinforcement learning (RL)
with value function approximation remains fundamentally
incomplete. Prior work has identified a variety of pathological
behaviours that arise in RL algorithms that combine approximate
on-policy evaluation and greedification. One prominent example
is policy oscillation, wherein an algorithm may cycle
indefinitely between policies, rather than converging to a fixed
point. What is not well understood however is the quality of the
policies in the region of oscillation. In this paper we present
simple examples illustrating that, in addition to policy
oscillation and multiple fixed points, the same basic issue
can lead to convergence to the worst possible policy for a given
approximation. Such behaviours can arise when algorithms
optimize evaluation accuracy weighted by the distribution of
states that occur under the current policy, but greedify based
on the value of states which are rare or nonexistent under this
distribution. This means the values used for greedification are
unreliable and can steer the policy in undesirable directions.
Our observation that this can lead to the worst possible policy
shows that in a general sense such algorithms are unreliable.
The existence of such examples helps to narrow the kind of
theoretical guarantees that are possible and the kind of
algorithmic ideas that are likely to be helpful. We demonstrate
analytically and experimentally that such pathological
behaviours can impact a wide range of RL and dynamic programming
algorithms; such behaviours can arise both with and without
bootstrapping, and with linear function approximation as well as
with more complex parameterized functions like neural networks.
Tian, T., Sutton, R. S. (2020). “Extending sliding-step
importance weighting from supervised learning to reinforcement
learning.” Best of IJCAI
Workshops 2019, Springer.
ABSTRACT: Stochastic
gradient descent (SGD) has been in the center of many advances
in modern machine learning. SGD processes examples sequentially,
updating a weight vector in the direction that would most reduce
the loss for that example. In many applications, some examples
are more important than others and, to capture this, each
example is given a non-negative weight that modulates its
impact. Unfortunately, if the importance weights are highly
variable they can greatly exacerbate the difficulty of setting
the step-size parameter of SGD. To ease this difficulty,
Karampatziakis and Langford [6] developed a class of elegant
algorithms that are much more robust in the face of highly
variable importance weights in supervised learning. In this
paper we extend their idea, which we call “sliding step,” to
reinforcement learning, where importance weighting can be
particularly variable due to the importance sampling involved in
off-policy learning algorithms. We compare two alternative ways
of doing the extension in the linear function approximation
setting, then introduce specific sliding-step versions of the
TD(0) and Emphatic TD(0) learning algorithms. We prove the
convergence of our algorithms and demonstrate their
effectiveness on both on-policy and off-policy problems.
Overall, our new algorithms appear to be effective in bringing
the robustness of the sliding-step technique from supervised
learning to reinforcement learning.
Sutton, R. S. (2020) John
McCarthy's Definition of Intelligence. Journal of Artificial
General Intelligence 11(2), 66-67. doi:10.2478/jagi-2020-0003, 2
pages.
Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E.,
Saraiva, A., McKinney, K., Lattimore, T., Szepesvari, C., Singh,
S., Van Roy, B., Sutton, R., Silver, D., van Hasselt, H. (2020). “Behaviour suite for
reinforcement learning.” Proceedings of the International Conference on Learning Representations, 2020.
ABSTRACT: This
paper introduces the Behaviour
Suite for Reinforcement Learning, or bsuite for
short. bsuite is a collection of carefully-designed
experiments that investigate core capabilities of reinforcement
learning (RL) agents with two objectives. First, to collect
clear, informative and scalable problems that capture key issues
in the design of general and efficient learning algorithms.
Second, to study agent behaviour through their performance on
these shared benchmarks. To complement this effort, we open
source github.com/deepmind/bsuite, which
automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and
accessible research on the core issues in RL, and ultimately the
design of superior learning algorithms. Our code is Python, and
easy to use within existing projects. We include examples with
OpenAI Baselines, Dopamine as well as new reference
implementations. Going forward, we hope to incorporate more
excellent experiments from the research community, and commit to
a periodic review of bsuite from a committee of prominent researchers.
Hernandez-Garcia, J. F.,
& Sutton, R. S. (2019). Learning Sparse
Representations Incrementally in Deep Reinforcement Learning.
ArXiv:1912.04002.
ABSTRACT: Sparse representations have
been shown to be useful in deep reinforcement learning for
mitigating catastrophic interference and improving the
performance of agents in terms of cumulative reward. Previous
results were based on a two-step process where the
representation was learned offline and the action-value
function was learned online afterwards. In this paper, we
investigate if it is possible to learn a sparse representation
and the action-value function simultaneously and
incrementally. We investigate this question by employing
several regularization techniques and observing how they
affect sparsity of the representation learned by a DQN agent
in two different benchmark domains. Our results show that with
appropriate regularization it is possible to increase the
sparsity of the representations learned by DQN agents.
Moreover, we found that learning sparse representations also
resulted in improved performance in terms of cumulative
reward. Finally, we found that the performance of the agents
that learned a sparse representation was more robust to the
size of the experience replay buffer. This last finding
supports the long standing hypothesis that the overlap in
representations learned by deep neural networks is the leading
cause of catastrophic interference.
Naik, A., Shariff, R., Yasui, N., Yao, H., & Sutton, R.
S. (2019). Discounted
reinforcement learning is not an optimization problem. arXiv
preprint arXiv:1910.02140.
ABSTRACT: Discounted reinforcement
learning is fundamentally incompatible with function
approximation for control in continuing tasks. It is not an
optimization problem in its usual formulation, so when using
function approximation there is no optimal policy. We
substantiate these claims, then go on to address some
misconceptions about discounting and its connection to the
average reward formulation. We encourage researchers to adopt
rigorous optimization approaches, such as maximizing average
reward, for reinforcement learning in continuing tasks.
Wan,
Y., Zaheer, M., White, A., White, M., Sutton, R.S. (2019). “Planning with
expectation models.” In Proceedings
of the International Joint Conference on Artificial Intelligence.
ABSTRACT:
Distribution and sample models are two popular model choices in
model-based reinforcement learning (MBRL). However, learning
these models can be intractable, particularly when the state and
action spaces are large. Expectation models, on the other hand,
are relatively easier to learn due to their compactness and have
also been widely used for deterministic environments. For
stochastic environments, it is not obvious how expectation
models can be used for planning as they only partially
characterize a distribution. In this paper, we propose a sound
way of using expectation models for MBRL. In particular, we 1)
show that planning with an expectation model is equivalent to
planning with a distribution model if the state value function
is linear in state-feature vector, 2) analyze two common
parametrization choices for approximating the expectation:
linear and non-linear expectation models, 3) propose a sound
model-based policy evaluation algorithm and present its
convergence results, and 4) empirically demonstrate the
effectiveness of the proposed planning algorithm.
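A rough sketch of the kind of planning update the paper makes
sound: with a value function that is linear in the state-feature
vector, a Dyna-style backup can be applied directly to the
model's expected next feature vector. The model interface,
constants, and function name below are illustrative assumptions.

    import numpy as np

    def expectation_model_planning_step(w, x, model, alpha=0.1, gamma=0.99):
        # model(x) is assumed to return the expected next feature
        # vector and the expected reward for the sampled feature x.
        expected_next_x, expected_r = model(x)
        delta = expected_r + gamma * w.dot(expected_next_x) - w.dot(x)
        return w + alpha * delta * x   # linear value function v(x) = w.x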
Sutton,
R. (2019). The
bitter lesson. Incomplete Ideas (blog). One and a half
pages.
Hernandez-Garcia, J. F., & Sutton, R. S. (2019). Understanding multi-step
deep reinforcement learning: A systematic study of the DQN
target. arXiv preprint arXiv:1901.07510.
ABSTRACT: Multi-step methods such
as Retrace(λ) and n-step Q-learning have become a crucial
component of modern deep reinforcement learning agents. These
methods are often evaluated as a part of bigger architectures
and their evaluations rarely include enough samples to draw
statistically significant conclusions about their performance.
This type of methodology makes it difficult to understand how
particular algorithmic details of multi-step methods influence
learning. In this paper we combine the n-step action-value
algorithms Retrace, Q-learning, Tree Backup, Sarsa, and Q(σ)
with an architecture analogous to DQN. We test the performance
of all these algorithms in the mountain car environment; this
choice of environment allows for faster training times and
larger sample sizes. We present statistical analyses on the
effects of the off-policy correction, the backup length
parameter n, and the update frequency of the target network on
the performance of these algorithms. Our results show that (1)
using off-policy correction can have an adverse effect on the
performance of Sarsa and Q(σ); (2) increasing the backup length
n consistently improved performance across all the different
algorithms; and (3) the performance of Sarsa and Q-learning was
more robust to the effect of the target network update frequency
than the performance of Tree Backup, Q(σ), and Retrace in this
particular task.
Gu, X., Ghiassian, S., & Sutton, R. S. (2019). Should All Temporal Difference
Learning Use Emphasis? ArXiv:1903.00194.
ABSTRACT: Emphatic Temporal
Difference (ETD) learning has recently been proposed as a
convergent off-policy learning method. ETD was proposed mainly
to address convergence issues of conventional Temporal
Difference (TD) learning under off-policy training but it is
different from conventional TD learning even under on-policy
training. A simple counterexample provided back in 2017 pointed
to a potential class of problems where ETD converges but TD
diverges. In this paper, we empirically show that ETD converges
on a few other well-known on-policy experiments whereas TD
either diverges or performs poorly. We also show that ETD
outperforms TD on the mountain car prediction problem. Our
results, together with a similar pattern observed under
off-policy training in prior works, suggest that ETD might be a
good substitute for conventional TD.
Kearney, A., Veeriah, V., Travnik, J., Pilarski, P. M., &
Sutton, R. S. (2019). Learning
feature relevance through step size adaptation in
temporal-difference learning. ArXiv:1903.03252.
ABSTRACT: There is a long history of using meta learning as
representation learning, specifically for determining the
relevance of inputs. In this paper, we examine an instance of
meta-learning in which feature relevance is learned by adapting
step size parameters of stochastic gradient descent—building on
a variety of prior work in stochastic approximation, machine
learning, and artificial neural networks. In particular, we
focus on stochastic meta-descent introduced in the Incremental
Delta-Bar-Delta (IDBD) algorithm for setting individual step
sizes for each feature of a linear function approximator. Using
IDBD, a feature with large or small step sizes will have a large
or small impact on generalization from training examples. As a
main contribution of this work, we extend IDBD to
temporal-difference (TD) learning—a form of learning which is
effective in sequential, non i.i.d. problems. We derive a
variety of IDBD generalizations for TD learning, demonstrating
that they are able to distinguish which features are relevant
and which are not. We demonstrate that TD IDBD is effective at
learning feature relevance in both an idealized gridworld and a
real-world robotic prediction task.
Rafiee, B., Ghiassian, S., White, A., Sutton, R.S. (2019). “Prediction in intelligence: An
empirical comparison of off-policy algorithms on robots.” In
Proceedings of the 18th
International Conference on Autonomous Agents and Multiagent
Systems, pp. 332-340.
ABSTRACT: The
ability to continually make predictions about the world may be
central to intelligence. Off-policy learning and general value
functions (GVFs) are well-established algorithmic techniques for
learning about many signals while interacting with the world. In
the past couple of years, many ambitious works have used
off-policy GVF learning to improve control performance in both
simulation and robotic control tasks. Many of these works use
semi-gradient temporal-difference (TD) learning algorithms, like
Q-learning, which are potentially divergent. In the last decade,
several TD learning algorithms have been proposed that are
convergent and computationally efficient, but not much is known
about how they perform in practice, especially on robots. In
this work, we perform an empirical comparison of modern
off-policy GVF learning algorithms on three different robot
platforms, providing insights into their strengths and
weaknesses. We also discuss the challenges of conducting fair
comparative studies of off-policy learning on robots and develop
a new evaluation methodology that is successful and applicable
to a relatively complicated robot domain.
Ghiassian, S., Patterson, A., White, M.,
Sutton, R. S., & White, A. (2018). Online off-policy
prediction. arXiv preprint arXiv:1811.02597.
ABSTRACT: This paper investigates the problem of online
prediction learning, where learning proceeds continuously as the
agent interacts with an environment. The predictions made by the
agent are contingent on a particular way of behaving,
represented as a value function. However, the behavior used to
select actions and generate the behavior data might be different
from the one used to define the predictions, and thus the
samples are generated off-policy. The ability to learn
behavior-contingent predictions online and off-policy has long
been advocated as a key capability of predictive-knowledge
learning systems but remained an open algorithmic challenge for
decades. The issue lies with the temporal difference (TD)
learning update at the heart of most prediction algorithms:
combining bootstrapping, off-policy sampling and function
approximation may cause the value estimate to diverge. A
breakthrough came with the development of a new objective
function that admitted stochastic gradient descent variants of
TD. Since then, many sound online off-policy prediction
algorithms have been developed, but there has been limited
empirical work investigating the relative merits of all the
variants. This paper aims to fill these empirical gaps and
provide clarity on the key ideas behind each method. We
summarize the large body of literature on off-policy learning,
focusing on (1) methods that use computation linear in the number
of features and are convergent under off-policy sampling, and (2)
other methods that have proven useful with non-fixed, nonlinear
function approximation. We provide an empirical study of
off-policy prediction methods in two challenging microworlds. We
report each method's parameter sensitivity, empirical
convergence rate, and final performance, providing new insights
that should enable practitioners to successfully extend these
new methods to large-scale applications.
Travnik, J. B., Mathewson, K. W., Sutton, R. S., Pilarski, P. M.
(2018). Reactive
Reinforcement
Learning in Asynchronous Environments. Frontiers in Robotics and AI
5.
ABSTRACT: The
relationship between a reinforcement learning (RL) agent and an
asynchronous environment is often ignored. Frequently used
models of the interaction between an agent and its environment,
such as Markov Decision Processes (MDP) or Semi-Markov Decision
Processes (SMDP), do not capture the fact that, in an
asynchronous environment, the state of the environment may
change during computation performed by the agent. In an
asynchronous environment, minimizing reaction time—the time it
takes for an agent to react to an observation—also minimizes the
time in which the state of the environment may change following
observation. In many environments, the reaction time of an agent
directly impacts task performance by permitting the environment
to transition into either an undesirable terminal state or a
state where performing the chosen action is inappropriate. We
propose a class of reactive
reinforcement learning algorithms that address this
problem of asynchronous environments by immediately acting after
observing new state information. We compare a reactive SARSA
learning algorithm with the conventional SARSA learning
algorithm on two asynchronous robotic tasks (emergency stopping
and impact prevention), and show that the reactive RL algorithm
reduces the reaction time of the agent by approximately the
duration of the algorithm's learning update. This new class of
reactive algorithms may facilitate safer control and faster
decision making without any change to standard learning
guarantees.
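The essential difference between the conventional and reactive
agent-environment loops can be sketched as follows; the env and
agent interfaces are assumptions made for illustration only.

    def conventional_step(env, agent, s, a):
        s_next, r = env.sense()                  # observe
        a_next = agent.select_action(s_next)
        agent.update(s, a, r, s_next, a_next)    # learn first, then act:
        env.act(a_next)                          # reaction time includes the update
        return s_next, a_next

    def reactive_step(env, agent, s, a):
        s_next, r = env.sense()                  # observe
        a_next = agent.select_action(s_next)
        env.act(a_next)                          # act immediately,
        agent.update(s, a, r, s_next, a_next)    # then learn off the critical path
        return s_next, a_next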
Kearney, A., Veeriah, V., Travnik, J. B., Sutton, R. S., Pilarski,
P. M. (2018). TIDBD:
Adapting Temporal-difference Step-sizes Through Stochastic
Meta-descent. ArXiv:1804.03334.
ABSTRACT: In
this paper, we introduce a method for adapting the step-sizes of
temporal difference (TD) learning. The performance of TD methods
often depends on well chosen step-sizes, yet few algorithms have
been developed for setting the step-size automatically for TD
learning. An important limitation of current methods is that
they adapt a single step-size shared by all the weights of the
learning system. A vector step-size enables greater optimization
by specifying parameters on a per-feature basis. Furthermore,
adapting parameters at different rates has the added benefit of
being a simple form of representation learning. We generalize
Incremental Delta Bar Delta (IDBD)—a vectorized adaptive
step-size method for supervised learning—to TD learning, which
we name TIDBD. We demonstrate that TIDBD is able to find
appropriate step-sizes in both stationary and non-stationary
prediction tasks, outperforming ordinary TD methods and TD
methods with scalar step-size adaptation; we demonstrate that it
can differentiate between features which are relevant and
irrelevant for a given task, performing representation learning;
and we show on a real-world robot prediction task that TIDBD is
able to outperform ordinary TD methods and TD methods augmented
with AlphaBound and RMSprop.
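A rough sketch of the kind of per-feature step-size adaptation
TIDBD performs, written here as a semi-gradient TD(0) analogue of
IDBD with linear function approximation; the paper's algorithms
also handle eligibility traces and may differ in detail from this
simplified form.

    import numpy as np

    def tidbd0_update(w, beta, h, x, r, x_next, gamma=0.99, theta=0.01):
        # beta holds per-feature log step-sizes; h is the per-feature
        # memory trace used by the meta-gradient step, as in IDBD.
        delta = r + gamma * w.dot(x_next) - w.dot(x)   # TD error
        beta += theta * delta * x * h                  # adapt log step-sizes
        alpha = np.exp(beta)                           # per-feature step-sizes
        w += alpha * delta * x                         # TD(0) update with vector step-size
        h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
        return w, beta, h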
Ghiassian, S., Yu, H., Rafiee, B., Sutton, R. S. (2018). Two geometric input
transformation methods for fast online reinforcement learning
with neural nets. ArXiv:1805.07476.
ABSTRACT: We
apply neural nets with ReLU gates in online reinforcement
learning. Our goal is to train these networks in an incremental
manner, without the computationally expensive experience replay.
By studying how individual neural nodes behave in online
training, we recognize that the global nature of ReLU gates can
cause undesirable learning interference in each node’s learning
behavior. We propose reducing such interferences with two
efficient input transformation methods that are geometric in
nature and match well the geometric property of ReLU gates. The
first one is tile coding, a classic binary encoding scheme
originally designed for local generalization based on the
topological structure of the input space. The second one (EmECS)
is a new method we introduce; it is based on geometric
properties of convex sets and topological embedding of the input
space into the boundary of a convex set. We discuss the behavior
of the network when it operates on the transformed inputs. We
also compare it experimentally with some neural nets that do not
use the same input transformations, and with the classic
algorithm of tile coding plus a linear function approximator,
and on several online reinforcement learning tasks, we show that
the neural net with tile coding or EmECS can achieve not only
faster learning but also more accurate approximations. Our
results strongly suggest that geometric input transformation of
this type can be effective for interference reduction and takes
us a step closer to fully incremental reinforcement learning
with neural nets.
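Of the two transformations, tile coding is the classical one and
is easy to sketch. The following minimal one-dimensional tile
coder is an illustration only, not the particular implementation
used in the paper; the number of tilings and tiles are assumed.

    import math

    def tile_code(s, n_tilings=8, n_tiles=10, low=0.0, high=1.0):
        # Returns the indices of the active binary features, one per
        # tiling, with each tiling offset by a fraction of a tile width.
        scaled = (s - low) / (high - low) * n_tiles
        active = []
        for t in range(n_tilings):
            idx = int(math.floor(scaled + t / n_tilings))
            idx = min(max(idx, 0), n_tiles)          # each tiling has n_tiles + 1 tiles
            active.append(t * (n_tiles + 1) + idx)   # unique index per tiling
        return active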
De Asis, K., Sutton, R. S. (2018). Per-decision
Multi-step
Temporal Difference Learning with Control Variates. Proceedings of the 2018 Conference on Uncertainty
in Artificial Intelligence.
ABSTRACT:
Multi-step temporal difference (TD) learning is an important
approach in reinforcement learning, as it unifies one-step TD
learning with Monte Carlo methods in a way where intermediate
algorithms can outperform either extreme. They address a
bias-variance trade-off between reliance on current estimates,
which could be poor, and incorporating longer sampled reward
sequences into the updates. Especially in the off-policy
setting, where the agent aims to learn about a policy different
from the one generating its behaviour, the variance in the
updates can cause learning to diverge as the number of sampled
rewards used in the estimates increases. In this paper, we
introduce per-decision control variates for multi-step TD
algorithms, and compare them to existing methods. Our results
show that including the control variates can greatly improve
performance on both on and off-policy multi-step temporal
difference learning tasks.
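The per-decision control-variate idea can be written as a small
recursion on the off-policy n-step return, as in the per-decision
control-variate treatment in the second edition of the Sutton and
Barto textbook:
G_t = rho_t * (R_{t+1} + gamma * G_{t+1}) + (1 - rho_t) * V(S_t).
The sketch below assumes a tabular value function V and
precomputed importance-sampling ratios; it is for illustration
only.

    def cv_nstep_return(rewards, states, rhos, V, n, gamma=0.99):
        # rewards[k] and rhos[k] refer to step t+k; states[k] to S_{t+k}.
        G = V[states[n]]                 # bootstrap at the end of the window
        for k in reversed(range(n)):
            G = rhos[k] * (rewards[k] + gamma * G) + (1.0 - rhos[k]) * V[states[k]]
        return G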
Wan, Y., Zaheer, M., White, M., Sutton, R. S. (2018). Model-based
Reinforcement
Learning with Non-linear Expectation Models and Stochastic
Environments. Workshop of the Federated Artificial
Intelligence Meeting, Stockholm, Sweden.
ABSTRACT: In model-based
reinforcement learning (MBRL), the model of a stochastic
environment provides, for each state and action, either 1) the
complete distribution of possible next states, 2) a sample next
state, or 3) the expectation of the next state’s feature vector.
The third case, that of an expectation
model, is particularly appealing because the
expectation is compact and deterministic; this is the case most
commonly used, but often in a way that is not sound for
non-linear models such as those obtained with deep learning. In
this paper we introduce the first MBRL algorithm that is sound
for non-linear expectation models and stochastic environments.
Key to our algorithm, based on the Dyna architecture, is that
the model is never iterated to produce a trajectory, but only
used to generate single expected transitions to which a Bellman
backup with a linear
approximate value function is applied. In our results, we also
consider the extension of the Dyna architecture to partial
observability. We show the effectiveness of our algorithm by
comparing it with model-free methods on partially-observable
navigation tasks.
De Asis, K., Bennett, B., Sutton, R. S.
(2018). Predicting
Periodicity with Temporal Difference Learning. arXiv:1809.07435.
ABSTRACT:
Temporal difference (TD) learning is an important approach in
reinforcement learning, as it combines ideas from dynamic
programming and Monte Carlo methods in a way that allows for
online and incremental model-free learning. A key idea of TD
learning is that it is learning predictive knowledge about the
environment in the form of value functions, from which it can
derive its behavior to address long-term sequential decision
making problems. The agent’s horizon of interest, that is, how
immediate or long-term a TD learning agent predicts into the
future, is adjusted through a discount rate parameter. In this
paper, we introduce an alternative view on the discount rate,
with insight from digital signal processing, to include
complex-valued discounting. Our results show that setting the
discount rate to appropriately chosen complex numbers allows for
online and incremental estimation of the Discrete Fourier
Transform (DFT) of a signal of interest with TD learning. We
thereby extend the types of knowledge representable by value
functions, which we show are particularly useful for identifying
periodic effects in the reward sequence.
Sherstan, C., Ashley, D. R., Bennett,
B., Young, K., White, A., White, M., Sutton, R. S. (2018). Comparing
Direct
and Indirect Temporal-Difference Methods for Estimating the
Variance of the Return. Proceedings of the 2018 Conference
on Uncertainty in Artificial Intelligence.
ABSTRACT: Temporal-difference (TD) learning methods are
widely used in reinforcement learning to estimate the expected
return for each state, without a model, because of their
significant advantages in computational and data efficiency. For
many applications involving risk mitigation, it would also be
useful to estimate the variance
of the return by TD methods. In this paper, we describe a way of
doing this that is substantially simpler than those proposed by
Tamar, Di Castro, and Mannor in 2012, or those proposed by White
and White in 2016. We show that two TD learners operating in
series can learn expectation and variance estimates. The trick
is to use the square of the TD error of the expectation learner
as the reward of the variance learner, and the square of the
expectation learner’s discount rate as the discount rate of the
variance learner. With these two modifications, the variance
learning problem becomes a conventional TD learning problem to
which standard theoretical results can be applied. Our formal
results are limited to the table lookup case, for which our
method is still novel, but the extension to function
approximation is immediate, and we provide some empirical
results for the linear function approximation case. Our
experimental results show that our direct method behaves just as
well as a comparable indirect method, but is generally more
robust.
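The construction described above is direct enough to sketch in a
few lines for the tabular case; the step sizes and the use of
dictionaries for the two value tables are assumptions made for
the example.

    def direct_variance_update(V, M, s, r, s_next, gamma=0.99,
                               alpha_v=0.1, alpha_m=0.1):
        # V estimates the expected return; M estimates its variance.
        delta = r + gamma * V[s_next] - V[s]   # TD error of the expectation learner
        V[s] += alpha_v * delta
        # The variance learner's reward is delta**2 and its discount
        # rate is gamma**2, exactly as described in the abstract.
        delta_bar = delta ** 2 + gamma ** 2 * M[s_next] - M[s]
        M[s] += alpha_m * delta_bar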
Young, K. J., Sutton, R. S., Yang, S. (2018). Integrating Episodic
Memory into a Reinforcement Learning Agent using Reservoir
Sampling. ArXiv:1806.00540.
ABSTRACT:
Episodic memory is a psychology term which refers to the ability
to recall specific events from the past. We suggest one
advantage of this particular type of memory is the ability to
easily assign credit to a specific state when remembered
information is found to be useful. Inspired by this idea, and
the increasing popularity of external memory mechanisms to
handle long-term dependencies in deep learning systems, we
propose a novel algorithm which uses a reservoir sampling
procedure to maintain an external memory consisting of a fixed
number of past states. The algorithm allows a deep reinforcement
learning agent to learn online to preferentially remember those
states which are found to be useful to recall later on.
Critically this method allows for efficient online computation
of gradient estimates with respect to the write process of the
external memory. Thus unlike most prior mechanisms for external
memory it is feasible to use in an online reinforcement learning
setting.
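The write process is built on classical reservoir sampling, which
keeps a uniform random sample of all states seen so far in a
fixed-size memory. The sketch below shows only that standard
procedure; the paper's preferential, differentiable write process
is not captured here.

    import random

    def reservoir_write(memory, state, t, capacity):
        # After t + 1 states have been seen, each is retained with
        # probability capacity / (t + 1).
        if len(memory) < capacity:
            memory.append(state)
        else:
            j = random.randint(0, t)    # uniform over all states seen so far
            if j < capacity:
                memory[j] = state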
De Asis, K., Hernandez-Garcia, J. F.,
Holland, G. Z., Sutton, R. S. (2018). Multi-step
reinforcement learning: A unifying algorithm. Proceedings of the Conference of the
Association for the Advancement of Artificial Intelligence.
Also ArXiv:1703.01327.
ABSTRACT:
Unifying seemingly disparate algorithmic ideas to produce better
performing algorithms has been a longstanding goal in
reinforcement learning. As a primary example, TD(λ) elegantly
unifies one-step TD prediction with Monte Carlo methods through
the use of eligibility traces and the trace-decay parameter λ.
Currently, there are a multitude of algorithms that can be used
to perform TD control, including Sarsa, Q-learning, and Expected
Sarsa. These methods are often studied in the one-step case, but
they can be extended across multiple time steps to achieve
better performance. Each of these algorithms is seemingly
distinct, and no one dominates the others for all problems. In
this paper, we study a new multi-step action-value algorithm
called Q(σ) that unifies and generalizes these existing
algorithms, while subsuming them as special cases. A new
parameter, σ, is introduced to allow the degree of sampling
performed by the algorithm at each step during its backup to be
continuously varied, with Sarsa existing at one extreme (full
sampling), and Expected Sarsa existing at the other (pure
expectation). Q(σ) is generally applicable to both on- and
off-policy learning, but in this work we focus on experiments in
the on-policy case. Our results show that an intermediate value
of σ, which results in a mixture of the existing algorithms,
performs better than either extreme. The mixture can also be
varied dynamically, which can result in even greater performance.
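The one-step form of the Q(sigma) backup target, which the paper
extends to multiple steps, can be sketched as follows; the
representation of Q as a dictionary keyed by (state, action) and
the pi callable are assumptions made for the example.

    def q_sigma_target(Q, pi, r, s_next, a_next, sigma, actions, gamma=0.99):
        # sigma = 1 recovers the Sarsa target (full sampling);
        # sigma = 0 recovers the Expected Sarsa target (pure expectation).
        expected = sum(pi(s_next, a) * Q[(s_next, a)] for a in actions)
        sampled = Q[(s_next, a_next)]
        return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)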
Yu, H., Mahmood, A. R., Sutton, R. S. (2018). On Generalized Bellman Equations and
Temporal-Difference Learning. Journal of Machine Learning Research 19.
ABSTRACT: We
consider off-policy temporal-difference (TD) learning in
discounted Markov decision processes, where the goal is to
evaluate a policy in a model-free way by using observations of a
state process generated without executing the policy. To curb
the high variance issue in off-policy TD learning, we propose a
new scheme of setting the λ-parameters of TD, based on
generalized Bellman equations. Our scheme is to set λ according
to the eligibility trace iterates calculated in TD, thereby
easily keeping these traces in a desired bounded range. Compared
with prior work, this scheme is more direct and flexible, and
allows much larger λ values for off-policy TD learning with
bounded traces. As to its soundness, using Markov chain theory,
we prove the ergodicity of the joint state-trace process under
nonrestrictive conditions, and we show that associated with our
scheme is a generalized Bellman equation (for the policy to be
evaluated) that depends on both the evolution of λ and the
unique invariant probability measure of the state-trace process.
These results not only lead immediately to a characterization of
the convergence behavior of least-squares based implementation
of our scheme, but also prepare the ground for further analysis
of gradient-based implementations.
Lee, J. Y., Sutton, R. S. (2017). Integral Policy
Iterations for Reinforcement Learning Problems in Continuous
Time and Space. ArXiv:1705.03520.
ABSTRACT:
Policy iteration (PI) is a recursive process of policy
evaluation and improvement for solving an optimal decision-making
problem, e.g., a reinforcement learning (RL) or optimal control
problem, and has served as a foundation for developing RL methods. Motivated
by integral PI (IPI) schemes in optimal control and RL methods
in continuous time and space (CTS), this paper proposes
on-policy IPI to solve the general RL problem in CTS, with its
environment modelled by an ordinary differential equation (ODE).
In such continuous domain, we also propose four off-policy IPI
methods—two are the ideal PI forms that use advantage and
Q-functions, respectively, and the other two are natural
extensions of the existing off-policy IPI schemes to our general
RL framework. Compared to the IPI methods in optimal control,
the proposed IPI schemes can be applied to more general
situations and do not require an initial stabilizing policy to
run; they are also strongly relevant to the RL algorithms in CTS
such as advantage updating, Q-learning, and value-gradient based
(VGB) greedy policy improvement. Our on-policy IPI is basically
model-based but can be made partially model-free; each
off-policy method is also either partially or completely
model-free. The mathematical properties of the IPI
methods—admissibility, monotone improvement, and convergence
towards the optimal solution—are all rigorously proven, together
with the equivalence of on- and off-policy IPI. Finally, the IPI
methods are simulated with an inverted-pendulum model to support
the theory and verify the performance.
Ghiassian, S., Rafiee, B., & Sutton,
R. S. (2017). A first empirical
study of emphatic temporal difference learning. arXiv
preprint arXiv:1705.04185.
ABSTRACT: In this paper we present the first empirical
study of the emphatic temporal-difference learning algorithm
(ETD), comparing it with conventional temporal-difference
learning, in particular, with linear TD(0), on on-policy and
off-policy variations of the Mountain Car problem. The initial
motivation for developing ETD was that it has good convergence
properties under off-policy training (Sutton, Mahmood &
White 2016), but it is also a new algorithm for the on-policy
case. In both our on-policy and off-policy experiments, we found
that each method converged to a characteristic asymptotic level
of error, with ETD better than TD(0). TD(0) achieved a still
lower error level temporarily before falling back to its higher
asymptote, whereas ETD never showed this kind of “bounce”. In
the off-policy case (in which TD(0) is not guaranteed to
converge), ETD was significantly slower.
Mahmood, A. R., Yu, H., Sutton, R. S. (2017). Multi-step
off-policy learning without importance sampling ratios.
ArXiv:1702.03006.
ABSTRACT: To
estimate the value functions of policies from exploratory data,
most model-free off-policy algorithms rely on importance
sampling, where the use of importance sampling ratios often
leads to estimates with severe variance. It is thus desirable to
learn off-policy without using the ratios. However, such an
algorithm does not exist for multi-step learning with function
approximation. In this paper, we introduce the first such
algorithm based on temporal-difference (TD) learning updates. We
show that an explicit use of importance sampling ratios can be
eliminated by varying the amount of bootstrapping in TD updates
in an action-dependent manner. Our new algorithm achieves
stability using a two-timescale gradient-based TD update. A
prior algorithm based on lookup table representation called Tree
Backup can also be retrieved using action-dependent
bootstrapping, becoming a special case of our algorithm. In two
challenging off-policy tasks, we demonstrate that our algorithm
is stable, effectively avoids the large variance issue, and can
perform substantially better than its state-of-the-art
counterpart.
Barto, A. G., Thomas, P. S., Sutton, R.
S. (2017). Some Recent Applications
of Reinforcement Learning. In Proceedings of the Eighteenth Yale Workshop on Adaptive
and Learning Systems.
ABSTRACT: Five
relatively recent applications of reinforcement learning methods
are described. These examples were chosen to illustrate a
diversity of application types, the engineering needed to build
applications, and most importantly, the impressive results that
these methods are able to achieve. This paper is based on a
case-study chapter of the forthcoming second edition of Sutton
and Barto’s 1998 book “Reinforcement Learning: An Introduction.”
Pilarski, P. M., Sutton, R. S.,
Mathewson, K. W., Sherstan, C., Parker, A. S., Edwards, A. L.
(2017). Communicative
Capital for Prosthetic Agents. ArXiv:1711.03676.
ABSTRACT: This
work presents an overarching perspective on the role that
machine intelligence can play in enhancing human abilities,
especially those that have been diminished due to injury or
illness. As a primary contribution, we develop the hypothesis
that assistive devices, and specifically artificial arms and
hands, can and should be viewed as agents in order for us to
most effectively improve their collaboration with their human
users. We believe that increased agency will enable more
powerful interactions between human users and next generation
prosthetic devices, especially when the sensorimotor space of
the prosthetic technology greatly exceeds the conventional
control and communication channels available to a prosthetic
user. To more concretely examine an agency-based view on
prosthetic devices, we propose a new schema for interpreting the
capacity of a human-machine collaboration as a function of both
the human’s and machine’s degrees of agency. We then introduce
the idea of communicative
capital as a way of thinking about the communication
resources developed by a human and a machine during their
ongoing interaction. Using this schema of agency and capacity,
we examine the benefits and disadvantages of increasing the
agency of a prosthetic limb. To do so, we present an analysis of
examples from the literature where building communicative
capital has enabled a progression of fruitful, task-directed
interactions between prostheses and their human users. We then
describe further work that is needed to concretely evaluate the
hypothesis that prostheses are best thought of as agents. The
agent-based viewpoint developed in this article significantly
extends current thinking on how best to support the natural,
functional use of increasingly complex prosthetic enhancements,
and opens the door for more powerful interactions between humans
and their assistive technologies.
Zhang, S., Sutton, R. S. (2017). A Deeper Look at
Experience Replay. Symposium
on Deep Reinforcement Learning at the 31st Conference on Neural
Information Processing Systems. ArXiv:1712.01275.
ABSTRACT:
Experience replay is now widely used in various deep
reinforcement learning (RL) algorithms; in this paper we rethink
its utility. Experience replay introduces a new
hyper-parameter, the memory buffer size, which needs careful
tuning. Unfortunately, the importance of this new
hyper-parameter has long been underestimated in the community.
We present a systematic empirical study of
experience replay under various function representations. We
showcase that a large replay buffer can significantly hurt the
performance. Moreover, we propose a simple O(1) method to remedy
the negative influence of a large replay buffer. We showcase its
utility in both simple grid world and challenging domains like
Atari games.
Veeriah, V., van Seijen, H., Sutton, R.
S. (2017). Forward
actor-critic
for nonlinear function approximation in reinforcement learning.
In Proceedings of the 16th
Conference on Autonomous Agents and Multiagent Systems
(pp. 556-564), Sao Paulo, Brazil.
ABSTRACT:
Multi-step methods are important in reinforcement learning (RL).
Eligibility traces, the usual way of handling them, work well
with linear function approximators. Recently, van Seijen (2016)
introduced a delayed learning approach, without eligibility
traces, for handling the multi-step λ-return with nonlinear
function approximators. However, this was limited to
action-value methods. In this paper, we extend this approach to
handle n-step returns, generalize this approach to policy
gradient methods and empirically study the effect of such
delayed updates in control tasks. Specifically, we introduce two
novel forward actor-critic methods and empirically investigate
our proposed methods with the conventional actor-critic method
on mountain car and pole-balancing tasks. From our experiments,
we observe that forward actor-critic dramatically outperforms
the conventional actor-critic in these standard control tasks.
Notably, this forward actor-critic method has produced a new
class of multi-step RL algorithms without eligibility traces.
Yu, H., Mahmood, A. R., Sutton, R. S.
(2017). On Generalized Bellman
Equations and Temporal-Difference Learning. Proceedings of the Canadian
Artificial Intelligence Conference.
ABSTRACT: We
consider off-policy temporal-difference (TD) learning in
discounted Markov decision processes, where the goal is to
evaluate a policy in a model-free way by using observations of a
state process generated without executing the policy. To curb
the high variance issue in off-policy TD learning, we propose a
new scheme of setting the λ-parameters of TD, based on
generalized Bellman equations. Our scheme is to set λ according
to the eligibility trace iterates calculated in TD, thereby
easily keeping these traces in a desired bounded range. Compared
with prior work, this scheme is more direct and flexible, and
allows much larger λ values for off-policy TD learning with
bounded traces. As to its soundness, using Markov chain theory,
we prove the ergodicity of the joint state-trace process under
nonrestrictive conditions, and we show that associated with our
scheme is a generalized Bellman equation (for the policy to be
evaluated) that depends on both the evolution of λ and the
unique invariant probability measure of the state-trace process.
These results not only lead immediately to a characterization of
the convergence behavior of least-squares based implementation
of our scheme, but also prepare the ground for further analysis
of gradient-based implementations.
Veeriah, V., Zhang, S., Sutton, R. S. (2017). Crossprop: Learning Representations
by Stochastic Meta-Gradient Descent in Neural Networks. European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in
Databases (ECML-PKDD).
ABSTRACT:
Representations are fundamental to artificial intelligence. The
performance of a learning system depends on the type of
representation used for representing the data. Typically, these
representations are hand-engineered using domain knowledge. More
recently, the trend is to learn these representations through
stochastic gradient descent in multi-layer neural networks,
which is called backprop. Learning the representations directly
from the incoming data stream reduces the human labour involved
in designing a learning system. More importantly, this allows a
learning system to scale to difficult tasks. In this paper,
we introduce a new incremental learning algorithm called
crossprop, which learns incoming weights of hidden units based
on the meta-gradient descent approach that was previously
introduced by Sutton (1992) and Schraudolph (1999) for learning
step-sizes. The final update equation introduces an additional
memory parameter for each of these weights and generalizes the
backprop update equation. From our experiments, we show that
crossprop learns and reuses its feature representation while
tackling new and unseen tasks, whereas backprop re-learns a new
feature representation.
Ludvig, E.
A., Mirian, M. S., Kehoe, E. J., Sutton, R. S. Associative
learning from replayed experience. http://biorxiv.org/content/early/2017/01/16/100800.
ABSTRACT: We
develop an extension of the Rescorla-Wagner model of associative
learning. In addition to learning from the current trial, the
new model supposes that animals store and replay previous
trials, learning from the replayed trials using the same
learning rule. This simple idea provides a unified explanation
for diverse phenomena that have proved challenging to earlier
associative models, including spontaneous recovery, latent
inhibition, retrospective revaluation, and trial spacing
effects. For example, spontaneous recovery is explained by
supposing that the animal replays its previous trials during the
interval between extinction and test. These include earlier
acquisition trials as well as recent extinction trials, and thus
there is a gradual re-acquisition of the conditioned response.
We present simulation results for the simplest version of this
replay idea, where the trial memory is assumed empty at the
beginning of an experiment, all experienced trials are stored
and none removed, and sampling from the memory is performed at
random. Even this minimal replay model is able to explain the
challenging phenomena, illustrating the explanatory power of an
associative model enhanced by learning from remembered as well
as real experiences.
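The minimal replay model described above can be sketched
directly; the learning-rate values and the encoding of a trial as
a set of present cues plus a binary outcome are assumptions made
for the example.

    import random

    def rescorla_wagner_update(V, present_cues, outcome, alpha=0.1, lam=1.0):
        # A shared prediction error drives all cues present on the trial.
        error = (lam if outcome else 0.0) - sum(V[c] for c in present_cues)
        for c in present_cues:
            V[c] += alpha * error

    def trial_with_replay(V, memory, present_cues, outcome, n_replays=5):
        # Learn from the real trial, store it, then apply the same rule
        # to trials sampled at random from the (never pruned) memory.
        rescorla_wagner_update(V, present_cues, outcome)
        memory.append((present_cues, outcome))
        for _ in range(n_replays):
            cues, out = random.choice(memory)
            rescorla_wagner_update(V, cues, out)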
Murphy, S. A., Deng, Y., Laber, E. B.,
Maei, H. R., Sutton, R. S., Witkiewitz, K. (2016). A batch, off-policy,
actor-critic algorithm for optimizing the average reward.
arXiv preprint arXiv:1607.05047.
ABSTRACT: We develop an off-policy
actor–critic algorithm for learning an optimal policy from a
training set composed of data from multiple individuals. This
algorithm is developed with a view toward its use in mobile
health.
Veeriah, V., Pilarski, P. M., Sutton, R. S. (2016). Face valuing:
Training user interfaces with facial expressions and reinforcement
learning. Int. J. Conference on
Artificial Intelligence, Workshop on Interactive Machine
Learning. Also arXiv 1606.02807.
ABSTRACT: An
important application of interactive machine learning is
extending or amplifying the cognitive and physical capabilities
of a human. To accomplish this, machines need to learn about
their human users' intentions and adapt to their preferences. In
most current research, a user has conveyed preferences to a
machine using explicit corrective or instructive feedback;
explicit feedback imposes a cognitive load on the user and is
expensive in terms of human effort. The primary objective of the
current work is to demonstrate that a learning agent can reduce
the amount of explicit feedback required for adapting to the
user's preferences pertaining to a task by learning to perceive
a value of its behavior from the human user, particularly from
the user's facial expressions---we call this face valuing. We
empirically evaluate face valuing on a grip selection task. Our
preliminary results suggest that an agent can quickly adapt to a
user's changing preferences with minimal explicit feedback by
learning a value function that maps facial features extracted
from a camera image to expected future reward. We believe that
an agent learning to perceive a value from the body language of
its human user is complementary to existing interactive machine
learning approaches and will help in creating successful
human-machine interactive applications.
Sutton, R. S.,
Mahmood, A. R., White, M. (2016). An
emphatic approach to the problem of off-policy
temporal-difference learning. Journal of Machine Learning Research 17(73):1-29.
Also arXiv:1503.04269.
ABSTRACT: In
this paper we introduce the idea of improving the performance of
parametric temporal-difference (TD) learning algorithms by
selectively emphasizing or de-emphasizing their updates on
different time steps. In particular, we show that varying the
emphasis of linear TD(λ)’s updates in a particular way causes
its expected update to become stable under off-policy training.
The only prior model-free TD methods to achieve this with
per-step computation linear in the number of function
approximation parameters are the gradient-TD family of methods
including TDC, GTD(λ), and GQ(λ). Compared to these methods, our
emphatic TD(λ) is
simpler and easier to use; it has only one learned parameter
vector and one step-size parameter. Our treatment includes
general state-dependent discounting and bootstrapping functions,
and a way of specifying varying degrees of interest in
accurately valuing different states.
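For reference, one step of the emphatic TD(lambda) update can be
sketched as follows for the simplified case of constant
discounting, trace-decay, and interest with linear function
approximation; the paper's algorithm allows all three to be
state-dependent.

    import numpy as np

    def emphatic_td_step(w, e, F, x, r, x_next, rho, rho_prev,
                         gamma=0.99, lam=0.9, interest=1.0, alpha=0.01):
        F = rho_prev * gamma * F + interest            # followon trace
        M = lam * interest + (1.0 - lam) * F           # emphasis
        e = rho * (gamma * lam * e + M * x)            # emphatic eligibility trace
        delta = r + gamma * w.dot(x_next) - w.dot(x)   # TD error
        w = w + alpha * delta * e
        return w, e, F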
Sutton, R. S. (2015). True online Emphatic
TD(λ): Quick Reference and Implementation Guide.
ArXiv. Code is available in Python and C++ by downloading the
source files of this arXiv paper as a zip archive.
ABSTRACT: In
this paper we introduce the idea of improving the performance
of parametric temporal-difference (TD) learning algorithms by
selectively emphasizing or de-emphasizing their updates on
different time steps. In particular, we show that varying the
emphasis of linear TD(λ)’s updates in a particular way causes
its expected update to become stable under off-policy
training. The only prior model-free TD methods to achieve this
with per-step computation linear in the number of function
approximation parameters are the gradient-TD family of methods
including TDC, GTD(λ), and GQ(λ). Compared to these methods,
our emphatic TD(λ)
is simpler and easier to use; it has only one learned
parameter vector and one step-size parameter. Our treatment
includes general state-dependent discounting and bootstrapping
functions, and a way of specifying varying degrees of interest
in accurately valuing different states.
Edwards,
A. L., Dawson, M. R., Hebert, J. S., Sherstan, C., Sutton, R.
S., Chan, K. M., Pilarski, P. M. (2015). Application of real-time
machine learning to myoelectric prosthesis control: A case
series in adaptive switching. Prosthetics and Orthotics International, first
published September 15, 2015, doi:0309364615605373.
ABSTRACT:
Background: Myoelectric
prostheses
currently used by amputees can be difficult to control.
Machine learning, and in particular learned predictions about
user intent, could help to reduce the time and cognitive load
required by amputees while operating their prosthetic device.
Objectives: The goal
of this study was to compare two switching-based methods of
controlling a myoelectric arm: non-adaptive (or conventional)
control and adaptive control (involving real-time prediction
learning).
Study Design: Case
series study.
Methods: We compared
non-adaptive and adaptive control in two different
experiments. In the first, one amputee and one non-amputee
subject controlled a robotic arm to perform a simple task; in
the second, three able-bodied subjects controlled a robotic
arm to perform a more complex task. For both tasks, we
calculated the mean time and total number of switches between
robotic arm functions over three trials.
Results: Adaptive
control significantly decreased the number of switches and
total switching time for both tasks compared with the
conventional control method.
Conclusion: Real-time
prediction learning was successfully used to improve the
control interface of a myoelectric robotic arm during
uninterrupted use by an amputee subject and able-bodied
subjects.
Clinical Relevance: Adaptive
control
using real-time prediction learning has the potential to help
decrease both the time and the cognitive load required by
amputees in real-world functional situations when using
myoelectric prostheses.
van Hasselt, H., Sutton, R. S. (2015)
Learning to Predict
Independent of Span. ArXiv:1508.04582.
ABSTRACT: We
consider how to learn multi-step predictions efficiently.
Conventional algorithms wait until observing actual outcomes
before performing the computations to update their
predictions. If predictions are made at a high rate or span
over a large amount of time, substantial computation can be
required to store all relevant observations and to update all
predictions when the outcome is finally observed. We show that
the exact same predictions can be learned in a much more
computationally congenial way, with uniform per-step
computation that does not depend on the span of the
predictions. We apply this idea to various settings of
increasing generality, repeatedly adding desired properties
and each time deriving an equivalent span-independent
algorithm for the conventional algorithm that satisfies these
desiderata. Interestingly, along the way several known
algorithmic constructs emerge spontaneously from our
derivations, including dutch eligibility traces, temporal
difference errors, and averaging. This allows us to link these
constructs one-to-one to the corresponding desiderata,
unambiguously connecting the ‘how’ to the ‘why’. At each step, we
make sure that the derived algorithm subsumes the previous
algorithms, thereby retaining their properties. Ultimately we
arrive at a single general temporal-difference algorithm that
is applicable to the full setting of reinforcement learning.
Pilarski, P. M., Sutton, R. S., Mathewson, K. W. (2015). Prosthetic Devices
as Goal-Seeking Agents. Workshop on Present and Future of Non-invasive
Peripheral Nervous System Machine Interfaces at the
International Conference on Rehabilitation Robotics,
Singapore.
ABSTRACT: In
this article we develop the perspective that assistive
devices, and specifically artificial arms and hands, may be
beneficially viewed as goal-seeking agents. We further suggest
that taking this perspective enables more powerful
interactions between human users and next generation
prosthetic devices, especially when the sensorimotor space of
the prosthetic technology greatly exceeds the conventional
myoelectric control and communication channels available to a
prosthetic user. As a principal contribution, we propose a
schema for thinking about the capacity of a human-machine
collaboration as a function of both the human and machine’s
degrees of agency. Using this schema, we present a brief
analysis of three examples from the literature where agency or
goal-seeking behaviour by a prosthesis has enabled a
progression of fruitful, task-directed interactions between a
prosthetic assistant and a human director. While preliminary,
the agent-based viewpoint developed in this article extends
current thinking on how best to support the natural,
functional use of increasingly complex prosthetic
enhancements.
van
Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C.,
Sutton, R. S. (2016). True online temporal-difference learning.
Journal of Machine Learning Research 17(145):1-40. Also
arXiv:1502.04087.
ABSTRACT: The
temporal-difference methods TD(λ) and Sarsa(λ) form a core
part of modern reinforcement learning. Their appeal comes from
their good performance, low computational cost, and their
simple interpretation, given by their forward view. Recently,
new versions of these methods were introduced, called true
online TD(λ) and true online Sarsa(λ), respectively (van
Seijen & Sutton, 2014). Algorithmically, these true online
methods only make two small changes to the update rules of the
regular methods, and the extra computational cost is
negligible in most cases. However, they follow the ideas
underlying the forward view much more closely. In particular,
they maintain an exact equivalence with the forward view at
all times, whereas the traditional versions only approximate
it for small step-sizes. We hypothesize that these true online
methods not only have better theoretical properties, but also
dominate the regular methods empirically. In this article, we
put this hypothesis to the test by performing an extensive
empirical comparison. Specifically, we compare the performance
of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on
random MRPs, a real-world myoelectric prosthetic arm, and a
domain from the Arcade Learning Environment. We use linear
function approximation with tabular, binary, and non-binary
features. Our results suggest that the true online methods
indeed dominate the regular methods. Across all
domains/representations the learning speed of the true online
methods are often better, but never worse than that of the
regular methods. An additional advantage is that no choice
between traces has to be made for the true online methods.
Besides the empirical results, we provide an in-depth analysis
of the theory behind true online temporal-difference learning.
In addition, we show that new true online temporal-difference
methods can be derived by making changes to the online forward
view and then rewriting the update equations.
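For concreteness, here is a minimal sketch of the true online TD(λ) prediction update with linear function approximation and a dutch-style eligibility trace, as described above; the episode data format, step size, and other names are illustrative assumptions rather than code from the paper.

import numpy as np

def true_online_td_lambda(episodes, n_features, alpha=0.1, gamma=0.99, lam=0.9):
    """Minimal sketch of true online TD(lambda) with linear function approximation.

    `episodes` is assumed to be an iterable of lists of (x, r, x_next) tuples,
    where x and x_next are feature vectors (np.ndarray) and r is the reward.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        e = np.zeros(n_features)          # dutch-style eligibility trace
        v_old = 0.0
        for x, r, x_next in episode:
            v = w @ x
            v_next = w @ x_next
            delta = r + gamma * v_next - v
            # dutch trace: an accumulating trace with a correction term
            e = gamma * lam * e + x - alpha * gamma * lam * (e @ x) * x
            # weight update with the extra (v - v_old) correction that gives
            # exact equivalence to the online forward view
            w += alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x
            v_old = v_next
    return w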
van
Seijen, H., Mahmood, A. R., Pilarski, P. M., Sutton, R. S.
(2015). An empirical
evaluation of true online TD(λ). In Proceedings of the 2015 European
Workshop on Reinforcement Learning.
ABSTRACT: The
true online TD(λ) algorithm has recently been proposed (van
Seijen and Sutton, 2014) as a universal replacement for the
popular TD(λ) algorithm, in temporal-difference learning and
reinforcement learning. True online TD(λ) has better
theoretical properties than conventional TD(λ), and the
expectation is that it also results in faster learning. In
this paper, we put this hypothesis to the test by empirically
comparing true online TD(λ) with TD(λ) on a variety of
domains. Our domains include a challenging one-state and
two-state example, random Markov reward processes, and a
real-world myoelectric prosthetic arm. We use linear function
approximation with tabular, binary, and non-binary features.
We assess the algorithms along three dimensions: computational
cost, learning speed, and ease of use. Our results confirm the
strength of true online TD(λ): 1) for sparse feature vectors,
the computational overhead with respect to TD(λ) is
negligible; for non-sparse features the computation time is at
most twice that of TD(λ), 2) across all
domains/representations the learning speed of true online
TD(λ) is often better, but never worse than that of TD(λ), and
3) true online TD(λ) is easier to use, because it does not
require choosing between trace types, and it is generally more
stable with respect to the step-size. Overall, our results
suggest that true online TD(λ) should be the first choice when
looking for an efficient, general-purpose TD method.
Mahmood, A. R., Yu, H., White, M., Sutton, R.
S. (2015). Emphatic
temporal-difference learning. In Proceedings of the 2015 European Workshop on
Reinforcement Learning.
ABSTRACT:
Emphatic algorithms are temporal-difference learning
algorithms that change their effective state distribution by
selectively emphasizing and de-emphasizing their updates on
different time steps. Recent works by Sutton, Mahmood and
White (2015) and Yu (2015) show that by varying the emphasis
in a particular way these algorithms become stable and
convergent under off-policy training with linear function
approximation. This paper serves as a unified summary of the
available results from both works. Additionally, we
empirically demonstrate the benefits of emphatic algorithms,
due to the flexible learning using state-dependent
discounting, bootstrapping and a user-specified allocation of
function approximation resources.
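For readers wanting the update itself, below is a sketch of the linear emphatic TD(λ) prediction update summarized in these works, written for the simplified case of constant discounting, constant bootstrapping, and constant interest; the data format and names are illustrative.

import numpy as np

def emphatic_td_lambda(trajectory, n_features, alpha=0.01,
                       gamma=0.99, lam=0.9, interest=1.0):
    """Sketch of linear emphatic TD(lambda) for off-policy prediction.

    `trajectory` is assumed to be a list of (x, r, x_next, rho) tuples, where
    rho = pi(a|s) / b(a|s) is the importance-sampling ratio of the action taken.
    Constant discount, bootstrapping, and interest are assumed for simplicity.
    """
    w = np.zeros(n_features)
    e = np.zeros(n_features)
    F = 0.0               # follow-on trace
    rho_prev = 0.0        # so that the first step gives F_0 = interest
    for x, r, x_next, rho in trajectory:
        F = rho_prev * gamma * F + interest       # follow-on trace
        M = lam * interest + (1.0 - lam) * F      # emphasis on this step
        e = rho * (gamma * lam * e + M * x)       # emphatic eligibility trace
        delta = r + gamma * (w @ x_next) - (w @ x)
        w += alpha * delta * e
        rho_prev = rho
    return w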
Mahmood, A. R., Sutton, R. S. (2015).
Off-policy learning based
on weighted importance sampling with linear computational
complexity. In Proceedings
of the 2015 Conference on Uncertainty in Artificial
Intelligence.
ABSTRACT:
Importance sampling is an essential component of model-free
off-policy learning algorithms. Weighted importance sampling
(WIS) is generally considered superior to ordinary importance
sampling but, when combined with function approximation, it
has hitherto required computational complexity that is O(n^2)
or more in the number of features. In this paper we introduce
new off-policy learning algorithms that obtain the benefits of
WIS with O(n) computational complexity. Our algorithms
maintain for each component of the parameter vector a measure
of the extent to which that component has been used in
previous examples. This measure is used to determine
component-wise step sizes, merging the ideas of stochastic
gradient descent and sample averages. We present our main
WIS-based algorithm first in an intuitive acausal form (the
forward view) and then derive a causal algorithm using
eligibility traces that is equivalent but more efficient (the
backward view). In three small experiments, our algorithms
performed significantly better than prior O(n) algorithms for
off-policy policy evaluation. We also show that our adaptive
step-size technique alone can improve the performance of
on-policy algorithms such as TD(λ) and true online TD(λ).
van Seijen, H., Sutton, R. S. (2015). A deeper look at
planning as learning from replay. In: Proceedings
of the 32nd International Conference on Machine Learning,
Lille, France.
ABSTRACT: In
reinforcement learning, the notions of experience replay, and
of planning as learning from replayed experience, have long
been used to find good policies with minimal training data.
Replay can be seen either as model-based reinforcement
learning, where the store of past experiences serves as the
model, or as a way to avoid a conventional model of the
environment altogether. In this paper, we look more deeply at
how replay blurs the line between model-based and model-free
methods. First, we show for the first time an exact
equivalence between the sequence of value functions found by a
model-based policy-evaluation method and by a model-free
method with replay. Second, we present a general replay method
that can mimic a spectrum of methods ranging from the
explicitly model-free (TD(0)) to the explicitly model-based
(linear Dyna). Finally, we use insights gained from these
relationships to design a new model-based reinforcement
learning algorithm for linear function approximation. This
method, which we call forgetful LSTD(λ), improves upon regular
LSTD(λ) because it extends more naturally to online control,
and improves upon linear Dyna because it is a multi-step
method, enabling it to perform well even in non-Markov
problems or, equivalently, in problems with significant
function approximation.
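As a toy illustration of the replay-as-planning idea (tabular, and deliberately much simpler than the paper's linear forgetful LSTD(λ)), repeatedly applying TD(0) to a fixed buffer of stored transitions drives the values toward those of policy evaluation on the empirical model implied by the buffer; everything in the sketch below is illustrative.

import numpy as np

def replayed_td0(transitions, n_states, alpha=0.1, gamma=0.9, sweeps=500):
    """Tabular TD(0) repeatedly replayed over a fixed buffer of experience.

    `transitions` is assumed to be a list of (s, r, s_next, done) tuples.
    With enough sweeps and a small step size, the values approach those of
    policy evaluation on the empirical model built from the same buffer,
    which is the sense in which replay blurs model-free and model-based learning.
    """
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for s, r, s_next, done in transitions:
            target = r + (0.0 if done else gamma * v[s_next])
            v[s] += alpha * (target - v[s])
    return v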
Kehoe, E. J.,
Ludvig, E. A., Sutton, R. S. (2014). Time course of the rabbit's
nictitating membrane during acquisition, extinction, and
reacquisition. Learning
and Memory 21:585-590. Cold Spring Harbor Press.
ABSTRACT: The
present experiment tested whether or not the time course of a
conditioned eyeblink response, particularly its duration,
would expand and contract, as the magnitude of the conditioned
response (CR) changed massively during acquisition,
extinction, and reacquisition. The CR duration remained
largely constant throughout the experiment, while CR onset and
peak time occurred slightly later during extinction. The
results suggest that computational models can account for
these results by using two layers of plasticity conforming to
the sequence of synapses in the cerebellar pathways that
mediate eyeblink conditioning.
Mahmood, A. R., van Hasselt, H, Sutton,
R. S. (2014). Weighted
importance sampling for off-policy learning with linear
function approximation. Advances in Neural
Information Processing Systems 27, Montreal, Canada.
ABSTRACT: Importance
sampling
is an essential component of off-policy model-free
reinforcement learning algorithms. However, its most
effective variant, weighted
importance sampling, does not carry over easily to
function approximation and, because of this, it is not
utilized in existing off-policy learning algorithms. In
this paper, we take two steps toward bridging this gap.
First, we show that weighted importance sampling can be
viewed as a special case of weighting the error of
individual training samples, and that this weighting has
theoretical and empirical benefits similar to those of
weighted importance sampling. Second, we show that these
benefits extend to a new weighted-importance-sampling
version of off-policy LSTD(λ). We
show empirically that our new WIS-LSTD(λ)
algorithm can result in much more rapid and reliable
convergence than conventional off-policy LSTD(λ) (Yu
2010, Bertsekas & Yu 2009).
Yao, H., Szepesvari, Cs., Sutton, R.,
Modayil, J., Bhatnagar, S. (2014). Universal
Option
Models. Advances in Neural Information Processing
Systems 27, Montreal, Canada.
ABSTRACT: We
consider the problem of learning models of options for
real-time abstract planning,
in the setting where reward functions can be specified at
any time and their expected
returns
must be efficiently computed. We introduce a new model for
an option that is independent of any reward function,
called the universal
option model (UOM). We prove that the UOM of an
option can construct a traditional option model given a
reward function, and also supports efficient computation
of the option-conditional return. We extend the UOM to
linear function approximation, and we show it gives the TD
solution of option returns and value functions of policies
over options. We provide a stochastic approximation
algorithm for incrementally learning UOMs from data and
prove its consistency. We demonstrate our method in two
domains. The first domain is a real-time strategy game,
where the
controller must select the best game unit to accomplish
dynamically-specified
tasks. The second domain is article recommendation, where
each user query defines
a new reward function and an article’s relevance is the
expected return from
following a policy that follows the citations between
articles. Our experiments show that UOMs are substantially
more efficient than previously known methods in evaluating
option returns and policies over options.
Edwards,
A. L.,
Dawson, M. R., Hebert, J. S., Sutton, R. S., Chan, K. M., Pilarski, P.M. (2014). Adaptive
switching in practice: Improving myoelectric prosthesis
performance through reinforcement learning. In: Proceedings of MEC14: Myoelectric
Controls Symposium, Fredericton, New Brunswick, Canada.
van Hasselt, H., Mahmood,
A. R., Sutton, R. S. (2014).
Off-policy TD(λ) with a true
online equivalence. In: Proceedings of the 30th Conference
on Uncertainty in Artificial Intelligence, Quebec City,
Canada.
ABSTRACT: Van
Seijen and Sutton (2014) recently proposed a new version of
the linear TD(λ) learning algorithm that is exactly equivalent
to an online forward view and that empirically performed
better than its classical counterpart in both prediction and
control problems. However, their algorithm is restricted to
on-policy learning. In the more general case of off-policy
learning, in which the policy whose outcome is predicted and
the policy used to generate data may be different, their
algorithm cannot be applied. One reason for this is that the
algorithm bootstraps and thus is subject to instability
problems when function approximation is used. A second reason
true online TD(λ) cannot be used for off-policy learning is
that the off-policy case requires sophisticated importance
sampling in its eligibility traces. To address these
limitations, we generalize their equivalence result and use
this generalization to construct the first online algorithm to
be exactly equivalent to an off-policy forward view. We show
this algorithm, named true online GTD(λ), empirically
outperforms GTD(λ) (Maei, 2011) which was derived from the same
objective as our forward view but lacks the exact online
equivalence. In the general theorem that allows us to derive
this new algorithm, we encounter a new general
eligibility-trace update.
Sutton, R. S., Mahmood, A. R., Precup, D., van
Hasselt, H. (2014).
A new Q(λ) with interim
forward view and Monte Carlo equivalence. In: Proceedings
of the 31st International Conference on Machine Learning,
Beijing, China.
ABSTRACT:
Q-learning, the most popular of reinforcement learning
algorithms, has always included an extension to eligibility
traces to enable more rapid learning and improved asymptotic
performance on non-Markov problems. The λ parameter smoothly
shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a
pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ
= 1). In off-policy algorithms, including Q(λ), GQ(λ), and
off-policy LSTD(λ), the λ parameter is intended to play the
same role, but does not; on every exploratory action these
algorithms bootstrap regardless of the value of λ, and as a
result they fail to approximate Monte Carlo learning when λ =
1. It may seem that this is inevitable for any online
off-policy algorithm; if updates are made on each step on
which the target policy is followed, then how could just the
right updates be ‘un-made’ upon deviation from the target
policy? In this paper, we introduce a new version of Q(λ) that
does exactly that, without significantly increased algorithmic
complexity. En route to our new Q(λ), we introduce a new
derivation technique based on the forward-view/backward-view
analysis familiar from TD(λ) but extended to apply at every
time step rather than only at the end of episodes. We apply
this technique to derive first a new off-policy version of
TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).
van Seijen, H.,
Sutton, R. S. (2014). True online TD(λ). In: Proceedings of the 31st
International Conference on Machine Learning,
Beijing, China. As published. With two
minor errors corrected.
ABSTRACT: TD(λ)
is a core algorithm of modern reinforcement learning. Its
appeal comes from its equivalence to a clear and conceptually
simple forward view, and the fact that it can be implemented
online in an inexpensive manner. However, the equivalence
between TD(λ) and the forward view is exact only for the
off-line version of the algorithm (in which updates are made
only at the end of each episode). In the online version of
TD(λ) (in which updates are made at each step, which generally
performs better and is always used in applications) the match
to the forward view is only approximate. In a sense this is
unavoidable for the conventional forward view, as it itself
presumes that the estimates are unchanging during an episode.
In this paper we introduce a new forward view that takes into
account the possibility of changing estimates and a new
variant of TD(λ) that exactly achieves it. Our algorithm uses
a new form of eligibility trace similar to but different from
conventional accumulating and replacing traces. The overall
computational complexity is the same as TD(λ), even when using
function approximation. In our empirical comparisons, our
algorithm outperformed TD(λ) in all of its variations. It
seems, by adhering more truly to the original goal of
TD(λ)—matching an intuitively clear forward view even in the
online case—that we have found a new algorithm that simply
improves on classical TD(λ).
Modayil, J., White,
A., Sutton, R. S. (2014). Multi-timescale Nexting in
a Reinforcement Learning Robot. Adaptive Behavior 22(2):146-160.
ABSTRACT: The
term ‘nexting’ has been used by psychologists to refer to the
propensity of people and many other animals to continually
predict what will happen next in an immediate, local, and
personal sense. The ability to ‘next’ constitutes a basic kind
of awareness and knowledge of one’s environment. In this paper
we present results with a robot that learns to next in real
time, making thousands of predictions about sensory input
signals at timescales from 0.1 to 8 seconds. Our predictions
are formulated as a generalization of the value functions
commonly used in reinforcement learning, where now an
arbitrary function of the sensory input signals is used as a
pseudo reward, and the discount rate determines the timescale.
We show that six thousand predictions, each computed as a
function of six thousand features of the state, can be learned
and updated online ten times per second on a laptop computer,
using the standard TD(λ) algorithm with linear function approximation.
This approach is sufficiently computationally efficient to be
used for real-time learning on the robot and sufficiently data
efficient to achieve substantial accuracy within 30 minutes.
Moreover, a single tile-coded feature representation suffices
to accurately predict many different signals over a
significant range of timescales. We also extend nexting beyond
simple timescales by letting the discount rate be a function
of the state and show that nexting predictions of this more
general form can also be learned with substantial accuracy.
General nexting provides a simple yet powerful mechanism for a
robot to acquire predictive knowledge of the dynamics of its
environment.
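A minimal sketch of one such nexting prediction is given below: a TD(λ) learner whose pseudo reward is a raw sensor signal and whose discount rate sets the timescale, with many such learners sharing one feature vector per time step; the class structure and parameter choices are illustrative assumptions.

import numpy as np

class NextingPredictor:
    """One 'nexting' prediction: TD(lambda) on a pseudo reward (a sensor signal)
    with a discount rate chosen to give the desired timescale."""

    def __init__(self, n_features, timescale_steps, alpha=0.1, lam=0.9):
        self.gamma = 1.0 - 1.0 / timescale_steps   # e.g. 80 steps -> gamma = 0.9875
        self.alpha = alpha
        self.lam = lam
        self.w = np.zeros(n_features)
        self.e = np.zeros(n_features)

    def update(self, x, pseudo_reward, x_next):
        delta = pseudo_reward + self.gamma * (self.w @ x_next) - (self.w @ x)
        self.e = self.gamma * self.lam * self.e + x
        self.w += self.alpha * delta * self.e

    def predict(self, x):
        return self.w @ x

# Illustrative use: one shared feature vector per time step, many predictors,
# each with its own sensor signal as pseudo reward and its own timescale.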
Modayil, J., Sutton, R. S. (2014). Prediction Driven Behavior:
Learning Predictions that Drive Fixed Responses. The
AAAI-14 Workshop on Artificial Intelligence and Robotics, Quebec
City, Quebec, Canada.
ABSTRACT: We
introduce a new method for robot control that combines
prediction learning with a fixed, crafted response—the robot
learns to make a temporally-extended prediction during its
normal operation, and the prediction is used to select actions
as part of a fixed behavioral response. Our method is inspired
by Pavlovian conditioning experiments in which an animal’s
behavior adapts as it learns to predict an event. Surprisingly
the animal’s behavior changes even in the absence of any
benefit to the animal (i.e.
the animal is not modifying its behavior to maximize reward).
Our method for robot control combines a fixed response with
online prediction learning, thereby producing an adaptive
behavior. This method is different from standard non-adaptive
control methods and also from adaptive reward-maximizing
control methods. We show that this method improves upon the
performance of two reactive controls, with visible benefits
within 2.5 minutes of real-time learning on the robot. In the
first experiment, the robot turns off its motors when it
predicts a future over-current condition, which reduces the
time spent in unsafe over-current conditions and improves
efficiency. In the second experiment, the robot starts to move
when it predicts a human-issued request, which reduces the
apparent latency of the human-robot interface.
White, A., Modayil, J., Sutton, R. S. (2014). Surprise and
curiosity for big data robotics. In Workshop on Sequential
Decision-Making with Big Data at the Twenty-Eighth AAAI
Conference on Artificial Intelligence.
ABSTRACT: This
paper introduces a new perspective on curiosity and
intrinsic motivation, viewed as the problem of generating
behavior data for parallel off-policy learning. We provide
1) the first measure of surprise based on off-policy
general value function learning progress, 2) the first
investigation of reactive behavior control with parallel
gradient temporal difference learning and function
approximation, and 3) the first demonstration of using
curiosity driven control to react to a non-stationary
learning task—all on a mobile robot. Our approach improves
scalability over previous off-policy, robot learning
systems, essential for making progress on the ultimate
big-data decision making problem—life-long robot learning.
Edwards, A. L., Kearney, A., Dawson, M. R., Sutton, R. S.,
Pilarski, P. M. (2013). Temporal-difference
learning to assist human decision making during the control of
an artificial limb. In: Proceedings
of
the First Interdisciplinary Conference on Reinforcement
Learning and Decision Making, and
http://arxiv.org/abs/1309.4714.
ABSTRACT: In
this work we explore the use of reinforcement learning (RL) to
help with human decision making, combining state-of-the-art RL
algorithms with an application to prosthetics. Managing
human-machine interaction is a problem of considerable scope,
and the simplification of human-robot interfaces is especially
important in the domains of biomedical technology and
rehabilitation medicine. For example, amputees who control
artificial limbs are often required to quickly switch between
a number of control actions or modes of operation in order to
operate their devices. We suggest that by learning to
anticipate (predict) a user's behaviour, artificial limbs
could take on an active role in a human's control decisions so
as to reduce the burden on their users. Recently, we showed
that RL in the form of general value functions (GVFs) could be
used to accurately detect a user's control intent prior to
their explicit control choices. In the present work, we
explore the use of temporal-difference learning and GVFs to
predict when users will switch their control influence between
the different motor functions of a robot arm. Experiments were
performed using a multi-function robot arm that was controlled
by muscle signals from a user's body (similar to conventional
artificial limb control). Our approach was able to acquire and
maintain forecasts about a user's switching decisions in real
time. It also provides an intuitive and reward-free way for
users to correct or reinforce the decisions made by the
machine learning system. We expect that when a system is
certain enough about its predictions, it can begin to take
over switching decisions from the user to streamline control
and potentially decrease the time and effort needed to
complete tasks. This preliminary study therefore suggests a
way to naturally integrate human- and machine-based decision
making systems.
Silver, D., Sutton, R. S., Mueller, M. (2013). Temporal-difference search in
computer go. In: Proceedings
of
the ICAPS-13 Workshop on Planning and Learning.
ABSTRACT:
Temporal-difference (TD) learning is one of the most
successful and broadly applied solutions to the reinforcement
learning problem; it has been used to achieve master-level
play in chess, checkers and backgammon. Monte-Carlo tree
search is a recent algorithm for simulation-based search,
which has been used to achieve master-level play in Go. We
have introduced a new approach to high-performance planning
(Silver et al., 2012). Our method, TD search, combines TD
learning with simulation-based search. Like Monte-Carlo tree
search, value estimates are updated by learning online from
simulated experience. Like TD learning, it uses value function
approximation and bootstrapping to efficiently generalise
between related states. We applied TD search to the game of 9
× 9 Go, using a million binary features matching simple
patterns of stones. Without any explicit search tree, our
approach outperformed a vanilla Monte-Carlo tree search with
the same number of simulations. When combined with a simple
alpha-beta search, our program also outperformed all
traditional (pre-Monte-Carlo) search and machine learning
programs on the 9 × 9 Computer Go Server.
Mahmood, A. R., Sutton,
R. S. (2013). Representation
Search
through Generate and Test. In: Proceedings of the AAAI Workshop on Learning Rich
Representations from Low-Level Sensors, Bellevue,
Washington, USA.
ABSTRACT:
Learning representations from data is one of the fundamental
problems of artificial intelligence and machine learning. Many
different approaches exist for learning representations, but
what constitutes a good representation is not yet well
understood. Much less is known about how to learn
representations from an unending stream of data. In this work,
we view the problem of representation learning as one of
learning features (e.g., hidden units of neural networks) such
that performance of the underlying base system continually
improves. We study an important case where learning is done
fully online (i.e., on an example-by-example basis). In the
presence of an unending stream of data, the computational cost
of the learning element should not grow with time and cannot
be much more than that of the performance element. Few methods
can be used effectively in this case. We show that a search
approach to representation learning can naturally fit into
this setting. In this approach good representations are
searched by generating different features and then testing
them for utility. We show that a fully online method can be
developed, which utilizes this generate and test approach,
learns representations continually, and adds only a small
fraction to the overall computation. We believe online
representation search constitutes an important step toward
effective and inexpensive solutions to representation learning
problems.
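A toy sketch of the online generate-and-test idea is given below: random threshold features are generated, a crude utility (the magnitude of each feature's outgoing weight) is tracked, and a small fraction of the lowest-utility features is periodically replaced by fresh random ones; the utility measure, feature generator, and schedule are placeholders rather than the paper's exact choices.

import numpy as np

rng = np.random.default_rng(0)

def random_feature(n_inputs):
    """Generate one random linear-threshold feature over the raw inputs."""
    v = rng.standard_normal(n_inputs)
    b = rng.standard_normal()
    return lambda s, v=v, b=b: float(v @ s + b > 0.0)

def generate_and_test(stream, n_inputs, n_features=100, alpha=0.01,
                      replace_every=1000, replace_frac=0.05):
    """Online LMS learner whose features are continually searched by
    generating random candidates and discarding the least useful ones."""
    features = [random_feature(n_inputs) for _ in range(n_features)]
    w = np.zeros(n_features)
    for t, (s, y) in enumerate(stream, start=1):       # s: raw input, y: target
        x = np.array([f(s) for f in features])
        w += alpha * (y - w @ x) * x                   # base LMS update
        if t % replace_every == 0:                     # the "test" step
            k = max(1, int(replace_frac * n_features))
            worst = np.argsort(np.abs(w))[:k]          # lowest-utility features
            for i in worst:
                features[i] = random_feature(n_inputs)
                w[i] = 0.0
    return w, features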
van
Seijen, H., Sutton, R. S. (2013). Efficient planning in MDPs by
small backups. In: Proceedings
of the 30th International Conference on Machine Learning,
pp. 361-369, Atlanta, USA.
ABSTRACT:
Efficient planning plays a crucial role in model-based
reinforcement learning. This efficiency is largely determined
by the order in which backups are performed. A particularly
effective ordering strategy is the strategy employed by
prioritized sweeping. Prioritized sweeping orders backups
according to a heuristic, such that backups that cause a large
value change are selected first. The Bellman error predicts
the value change caused by a full backup exactly, but its
computation is expensive. Hence, methods do not use the
Bellman error as a heuristic, but try to approximate the
ordering induced by it with a heuristic that is cheaper to
compute. We introduce the first efficient prioritized sweeping
method that uses exact Bellman error ordering. The core of
this method is a novel backup that allows for efficient
computation of the Bellman error, while its effect as well as
its computational complexity is equal to that of a full
backup. We demonstrate empirically that the obtained method
achieves substantially higher convergence rates than other
prioritized sweeping implementations.
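For orientation, here is a naive prioritized-sweeping sketch for tabular policy evaluation that orders backups by the exact Bellman error, recomputing it directly; this direct recomputation is precisely the expensive step the paper's small backups avoid, so the code below illustrates the ordering idea rather than the paper's method, and all names are assumptions.

import heapq
from collections import defaultdict

def prioritized_sweeping(model, V, gamma=0.95, theta=1e-4, max_backups=10000):
    """Naive prioritized sweeping for policy evaluation on a known tabular model.

    `model[s]` is assumed to be a list of (prob, reward, s_next) outcomes, and
    `V` a dict mapping every state (including successors) to its current value.
    """
    predecessors = defaultdict(set)
    for s, outcomes in model.items():
        for _, _, s2 in outcomes:
            predecessors[s2].add(s)

    def bellman_error(s):
        backup = sum(p * (r + gamma * V[s2]) for p, r, s2 in model[s])
        return backup - V[s]

    queue = [(-abs(bellman_error(s)), s) for s in model]
    heapq.heapify(queue)
    for _ in range(max_backups):
        if not queue:
            break
        neg_priority, s = heapq.heappop(queue)
        if -neg_priority < theta:
            break
        V[s] += bellman_error(s)                       # full backup of state s
        for sp in predecessors[s]:                     # re-prioritize predecessors
            heapq.heappush(queue, (-abs(bellman_error(sp)), sp))
    return V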
Pilarski,
P. M., Dick, T. B., Sutton, R. S. (2013). Real-time prediction learning
for the simultaneous actuation of multiple prosthetic joints.
In: Proceedings of the 2013
IEEE International Conference on Rehabilitation Robotics,
Seattle, USA, June 24–26. 8 pages.
ABSTRACT:
Integrating learned predictions into a prosthetic control
system promises to enhance multi-joint prosthesis use by
amputees. In this article, we present a preliminary study of
different cases where it may be beneficial to use a set of
temporally extended predictions—learned and maintained in real
time—within an engineered or learned prosthesis controller.
Our study demonstrates the first successful combination of
actor-critic reinforcement learning with real-time prediction
learning. We evaluate this new approach to control learning
during the myoelectric operation of a robot limb. Our results
suggest that the integration of real-time prediction and
control learning may speed control policy acquisition, allow
unsupervised adaptation in myoelectric controllers, and
facilitate synergies in highly actuated limbs. These
experiments also show that temporally extended prediction
learning enables anticipatory actuation, opening the way for
coordinated motion in assistive robotic devices. Our work
therefore provides initial evidence that real-time prediction
learning is a practical way to support intuitive joint control
in increasingly complex prosthetic systems.
Pilarski,
P. M., Dawson, M. R., Degris, T., Chan, K. M., Hebert, J. S.,
Carey, J. P., Sutton, R. S. (2013). Adaptive Artificial Limbs: A
Real-time Approach to Prediction and Anticipation, IEEE Robotics & Automation
Magazine 20(1):53-64.
ABSTRACT:
Predicting the future has long been regarded as a powerful
means to improvement and success. The ability to make accurate
and timely predictions enhances our ability to control our
situation and our environment. Assistive robotics is one
prominent area where foresight of this kind can bring improved
quality of life. In this article, we present a new approach to
acquiring and maintaining predictive knowledge during the
online, ongoing operation of an assistive robot. The ability
to learn accurate, temporally abstracted predictions is shown
through two case studies—able-bodied subjects engaging in the
myoelectric control of a robot arm, and an amputee
participant’s interactions with a myoelectric training robot.
To our knowledge, this work is the first demonstration of a
practical method for real-time prediction learning during
myoelectric control. Our approach therefore represents a
fundamental tool for addressing one major unsolved problem:
amputee-specific adaptation during the ongoing operation of a
prosthetic device. The findings in this article also
contribute a first explicit look at prediction learning in
prosthetics as an important goal in its own right, independent
of its intended use within a specific controller or system.
Our results suggest that real-time learning of predictions and
anticipations is a significant step towards more intuitive
myoelectric prostheses and other assistive robotic devices.
Kehoe, E. J., Ludvig,
E. A., Sutton, R. S. (2013). Timing
and cue competition in conditioning of the nictitating
membrane response of the rabbit (Oryctolagus cuniculus), Learning and Memory 20:97-102.
Cold Spring Harbor Press.
ABSTRACT: Rabbits
were
classically conditioned using compounds of tone and light
conditioned stimuli (CSs) presented with either simultaneous
onsets (Experiment 1) or serial onsets (Experiment 2) in a
delay conditioning paradigm. Training with the simultaneous
compound reduced the likelihood of a conditioned response (CR)
to the individual CSs (“mutual overshadowing”) but left CR
timing unaltered. CR peaks were consistently clustered around
the time of unconditioned stimulus (US) delivery. Training
with the serial compound (CSA->CSB->US) reduced
responding to CSB (“temporal primacy/information effect”) but
this effect was prevented by prior CSB->US pairings. In
both cases, serial compound training altered CR timing. On
CSA->CSB test trials, the CRs were accelerated; the CR
peaks occurred after CSB onset but well before the time of US
delivery. Conversely, CRs on CSB– trials were decelerated; the
distribution of CR peaks was variable but centered well after
the US. Timing on CSB– trials was at most only slightly
accelerated. The results are discussed with respect to
processes of generalization and spectral timing applicable to
the cerebellar and forebrain pathways in eyeblink
preparations.
White, A., Modayil, J., Sutton, R. S. (2012). Scaling Life-long Off-policy
Learning. In Proceedings
of the Second Joint IEEE International Conference on
Development and Learning and on Epigenetic Robotics
(ICDL-Epirob), San Diego, USA.
ABSTRACT: In
this paper, we pursue an approach to scaling life-long
learning using parallel, off-policy reinforcement-learning
algorithms. In life-long learning, a robot continually learns
from a life-time of experience, slowly acquiring and applying
skills and knowledge to new situations. Many of the benefits
of life-long learning are a result of scaling the amount of
training data, processed by the robot, to long sensorimotor
streams. Another dimension of scaling can be added by allowing
off-policy sampling from the unending stream of sensorimotor
data generated by a long-lived robot. Recent algorithmic
developments have made it possible to apply off-policy
algorithms to life-long learning, in a sound way, for the
first time. We assess the scalability of these off-policy
algorithms on a physical robot. We show that hundreds of
accurate multi-step predictions can be learned about several
policies in parallel and in realtime. We present the first
online measures of off-policy learning progress. Finally we
demonstrate that our robot, using the new off-policy measures,
can learn 8000 predictions about 300 distinct policies, a
substantial increase in scale compared to previous simulated
and robotic life-long learning systems.
Modayil, J., White, A., Pilarski, P.
M., Sutton, R. S. (2012). Acquiring a Broad Range of
Empirical Knowledge in Real Time by Temporal-Difference
Learning. In Proceedings
of the IEEE International Conference on Systems, Man, and
Cybernetics, Seoul, Korea.
ABSTRACT:
Several robot capabilities rely on predictions about the
temporally extended consequences of a robot’s behaviour. We
describe how a robot can both learn and make many such
predictions in real time using a standard algorithm. Our
experiments show that a mobile robot can learn and make
thousands of accurate predictions at 10 Hz. The predictions
are about the future of all of the robot’s sensors and many
internal state variables at multiple time-scales. All the
predictions share a single set of features and learning
parameters. We demonstrate the generality of this method with
an application to a different platform, a robot arm operating
at 50 Hz. Here, learned predictions can be used to measurably
improve the user interface. The temporally extended
predictions learned in real time by this method constitute a
basic form of knowledge about the dynamics of the robot’s
interaction with the environment. We also show how this method
can be extended to express more general forms of knowledge.
Modayil, J., White, A., Pilarski, P.
M., Sutton, R. S. (2012). Acquiring Diverse
Predictive Knowledge in Real Time by Temporal-difference
Learning. International
Workshop on
Evolutionary and Reinforcement Learning for Autonomous Robot
Systems, Montpellier, France.
ABSTRACT:
Existing robot algorithms demonstrate several capabilities
that are enabled by a robot’s knowledge of the temporally
extended consequences of its behaviour. This knowledge
consists of real-time predictions—predictions that are
conventionally computed by iterating a small one-timestep
model of the robot’s dynamics. Given the utility of such
predictions, alternatives are desirable when this conventional
approach is not applicable, for example when an adequate model
of the one-timestep dynamics is either not available or not
computationally tractable. We describe how a robot can both
learn and make many such predictions in real-time using a
standard reinforcement learning algorithm. Our experiments
show that a mobile robot can learn and make thousands of
accurate predictions at 10 Hz about the future of all of its
sensors and many internal state variables at multiple
time-scales. The method uses a single set of features and
learning parameters that are shared across all the
predictions. We demonstrate the generality of these
predictions with an application to a different platform, a
robot arm operating at 50 Hz. Here, the predictions are about
which arm joint the user wants to move next, a difficult
situation to model analytically, and we show how the learned
predictions enable measurable improvements to the user
interface. The predictions learned in real-time by this method
constitute a basic form of knowledge about the robot’s
interaction with the environment, and extensions of this
method can express more general forms of knowledge.
Ludvig,
E. A., Sutton, R. S., Kehoe, E. J.
(2012). Evaluating the TD model of
classical conditioning. Learning
and Behavior 40(3):305-319. Springer.
ABSTRACT: The
temporal-difference (TD) algorithm from reinforcement learning
provides a simple method for incrementally learning
predictions of upcoming events. Applied to classical
conditioning, TD models suppose that animals learn a real-time
prediction of the unconditioned stimulus (US) on the basis of
all available conditioned stimuli (CSs). In the TD model,
similar to other error-correction models, learning is driven
by prediction errors—the difference between the change in US
prediction and the actual US. With the TD model, however,
learning occurs continuously from moment to moment and is not
artificially constrained to occur in trials. Accordingly, a
key feature of any TD model is the assumption about the
representation of a CS on a moment-to-moment basis. Here, we
evaluate the performance of the TD model with a heretofore
unexplored range of classical conditioning tasks. To do so, we
consider three stimulus representations that vary in their
degree of temporal generalization and evaluate how the
representation influences the performance of the TD model on
these conditioning tasks.
Pilarski, P. M., Sutton, R. S. (2012) Between instruction and
reward: Human-prompted switching. Proceedings of the AAAI Fall
Symposium on Robots Learning Interactively from Human Teachers.
AAAI Technical Report FS-12-07, pp. 46–52.
ABSTRACT: Intelligent
systems promise to amplify, augment, and extend innate
human abilities. A principal example is that of assistive
rehabilitation robots—artificial intelligence and machine
learning enable new electromechanical systems that restore
biological functions lost through injury or illness. In
order for an intelligent machine to assist a human user,
it must be possible for a human to communicate their
intentions and preferences to their non-human counterpart.
While there are a number of techniques that a human can
use to direct a machine learning system, most research to
date has focused on the contrasting strategies of
instruction and reward. The primary contribution of our
work is to demonstrate that the middle ground between
instruction and reward is a fertile space for research and
immediate technological progress. To support this idea, we
introduce the setting of human-prompted switching, and
illustrate the successful combination of switching with
interactive learning using a concrete real-world example:
human control of a multi-joint robot arm. We believe
techniques that fall between the domains of instruction
and reward are complementary to existing approaches, and
will open up new lines of rapid progress for interactive
human training of machine learning systems.
Pilarski, P. M., Dawson, M. R.,
Degris, T., Carey, J. P., Chan, K. M., Hebert, J. S., Sutton, R.
S. (2012). Towards
Prediction-Based Prosthetic Control. In Proceedings of the 2012 Conference of the
International Functional Electrical Stimulation Society,
September 9-12, Banff, Canada, 4 pages.
ABSTRACT: Predictions
and anticipations form a basis for
pattern-recognition-based control systems, including those
used in next-generation prostheses and assistive devices.
In this work, we outline how new methods in real-time
prediction learning provide an approach to one of the
principal open problems in multi-function myoelectric
control—robust, ongoing, amputee-specific adaptation.
Techniques from reinforcement learning and general value
functions are applied to learn and update predictions
during continuing human interactions with multi-function
robotic appendages. Tests were performed with a targeted
motor and sensory reinnervation amputee subject and
non-amputee participants. Results show that this online
prediction learning approach is able to accurately
anticipate a diverse set of human and robot signals by
more than two seconds, and at different levels of temporal
abstraction. These methods allow predictions to be adapted
during real-time use, which is an important step toward
the intuitive control of powered prostheses and other
assistive devices.
Degris, T., White, M., Sutton, R. S. (2012). Off-policy
Actor-Critic. In Proceedings of the 29th
International Conference on Machine Learning, Edinburgh, Great Britain. Demo.
ABSTRACT: This paper
presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and
incremental, and its per-time-step complexity scales
linearly with the number of learned weights. Previous work
on actor-critic algorithms is limited to the on-policy
setting and does not take advantage of the recent advances
in off-policy gradient temporal-difference learning.
Off-policy techniques, such as Greedy-GQ, enable a target
policy to be learned while following and obtaining data
from another (behavior) policy. For many problems,
however, actor-critic methods are more practical than
action value methods (like Greedy-GQ) because they
explicitly represent the policy; consequently, the policy
can be stochastic and utilize a large action space. In
this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning
with the flexibility in action selection given by
actor-critic methods. We derive an incremental, linear
time and space complexity algorithm that includes
eligibility traces, prove convergence under assumptions
similar to previous off-policy algorithms, and empirically
show better or comparable performance to existing
algorithms on standard reinforcement-learning benchmark
problems.
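As a rough illustration (not the paper's exact Off-PAC(λ) algorithm), the sketch below combines a gradient-TD style critic with an importance-weighted policy-gradient actor for the one-step (λ = 0) case; the linear-softmax policy and all names are assumptions made for the example.

import numpy as np

def softmax_policy(theta, x, n_actions):
    """Illustrative linear-softmax target policy over discrete actions."""
    prefs = theta.reshape(n_actions, -1) @ x
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def off_policy_ac_step(v, w, theta, x, a, r, x_next, behavior_prob, n_actions,
                       alpha_v=0.01, alpha_w=0.001, alpha_u=0.001, gamma=0.99):
    """One-step sketch of an off-policy actor-critic update (lambda = 0):
    a gradient-TD critic plus an importance-weighted policy-gradient actor."""
    pi = softmax_policy(theta, x, n_actions)
    rho = pi[a] / behavior_prob                     # importance-sampling ratio
    delta = r + gamma * (v @ x_next) - (v @ x)      # TD error under critic v
    # gradient-TD (TDC-style) critic with importance weighting
    v = v + alpha_v * rho * (delta * x - gamma * (w @ x) * x_next)
    w = w + alpha_w * rho * (delta - (w @ x)) * x
    # actor: importance-weighted policy-gradient step
    grad_log = np.zeros_like(theta).reshape(n_actions, -1)
    grad_log[a] = x
    grad_log -= np.outer(pi, x)                     # gradient of log softmax
    theta = theta + alpha_u * rho * delta * grad_log.ravel()
    return v, w, theta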
Pilarski, P. M., Dawson, M. R.,
Degris, T., Carey, J. P., Sutton, R. S. (2012). Dynamic Switching and
Real-time Machine Learning for Improved Human Control of
Assistive Biomedical Robots. In Proceedings of the Fourth IEEE International Conference on
Biomedical Robotics and Biomechatronics (BioRob), June 24-27, Roma,
Italy, 296-302.
ABSTRACT: A general
problem for human-machine interaction occurs when a
machine’s controllable dimensions outnumber the control
channels available to its human user. In this work, we
examine one prominent example of this problem: amputee
switching between the multiple functions of a powered
artificial limb. We propose a dynamic switching approach
that learns during ongoing interaction to anticipate user
behaviour, thereby presenting the most effective control
option for a given context or task. Switching predictions
are learned in real time using temporal difference methods
and reinforcement learning, and demonstrated within the
context of a robotic arm and a multi-function myoelectric
controller. We find that a learned, dynamic switching
order is able to out-perform the best fixed (non-adaptive)
switching regime on a standard prosthetic proficiency
task, increasing the number of optimal switching
suggestions by 23%, and decreasing the expected transition
time between degrees of freedom by more than 14%. These
preliminary results indicate that real-time machine
learning, specifically online prediction and anticipation,
may be an important tool for developing more robust and
intuitive controllers for assistive biomedical robots. We
expect these techniques will transfer well to near-term
use by patients. Future work will describe clinical
testing of this approach with a population of amputee
patients.
Sutton, R. S. (2012). Beyond reward: The problem of
knowledge and data (extended abstract). In: Proceedings of
the 21st International Conference on Inductive Logic
Programming, S.H.
Muggleton, A. Tamaddoni-Nezhad, F.A. Lisi (Eds.): ILP 2011, LNAI
7207, pp. 2--6. Springer, Heidelberg.
Degris, T., Modayil,
J.
(2012). Scaling-up Knowledge for
a Cognizant Robot. In notes of the AAAI Spring Symposium on Designing
Intelligent Robots: Reintegrating AI.
ABSTRACT: This paper
takes a new approach to the old adage that knowledge is
the key for artificial intelligence. A cognizant robot is
a robot with a deep and immediately accessible
understanding of its interaction with the environment—an
understanding the robot can use to flexibly adapt to novel
situations. Such a robot will need a vast amount of
situated, revisable, and expressive knowledge to display
flexible intelligent behaviors. Instead of relying on
human-provided knowledge, we propose that an arbitrary
robot can autonomously acquire pertinent knowledge
directly from everyday interaction with the environment.
We show how existing ideas in reinforcement learning can
enable a robot to maintain and improve its knowledge. The
robot performs a continual learning process that scales-up
knowledge acquisition to cover a large number of facts,
skills and predictions. This knowledge has semantics that
are grounded in sensorimotor experience. We see the
approach of developing more cognizant robots as a
necessary key step towards broadly competent robots.
Degris, T.,
Pilarski, P. M., Sutton, R. S.
(2012). Model-Free Reinforcement Learning
with Continuous Action in Practice. In Proceedings of the 2012 American
Control Conference.
ABSTRACT: Reinforcement
learning methods are often considered as a potential
solution to enable a robot to adapt to changes in real
time to an unpredictable environment. However, with
continuous action, only a few existing algorithms are
practical for real-time learning. In such a setting, most
effective methods have used a parameterized policy
structure, often with a separate parameterized value
function. The goal of this paper is to assess such
actor-critic methods to form a fully specified practical
algorithm. Our specific contributions include 1)
developing the extension of existing incremental
policy-gradient algorithms to use eligibility traces, 2)
an empirical comparison of the resulting algorithms using
continuous actions, 3) the evaluation of a
gradient-scaling technique that can significantly improve
performance. Finally, we apply our actor--critic algorithm
to learn on a robotic platform with a fast sensorimotor
cycle (10ms). Overall, these results constitute an
important step towards practical real-time learning
control with continuous action.
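To make the parameterized-policy case concrete, here is a hedged sketch of one actor-critic step with a scalar Gaussian policy and a linear critic, with eligibility traces omitted for brevity; the parameterization of the mean and standard deviation follows common practice and is an illustrative assumption, not necessarily the paper's exact formulation.

import numpy as np

def gaussian_policy(w_mu, w_sigma, x, rng=np.random):
    """Scalar Gaussian policy: mu = w_mu . x, sigma = exp(w_sigma . x)."""
    mu = w_mu @ x
    sigma = np.exp(w_sigma @ x)
    return rng.normal(mu, sigma)

def continuous_ac_update(w_v, w_mu, w_sigma, x, a, r, x_next,
                         alpha_v=0.1, alpha_u=0.01, gamma=0.99):
    """Actor-critic update for a continuous action already taken (lambda = 0 sketch)."""
    mu, sigma = w_mu @ x, np.exp(w_sigma @ x)
    delta = r + gamma * (w_v @ x_next) - (w_v @ x)   # critic TD error
    w_v = w_v + alpha_v * delta * x                  # linear TD(0) critic

    # log-likelihood gradients of the Gaussian policy parameters
    grad_mu = (a - mu) / sigma**2 * x
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x
    w_mu = w_mu + alpha_u * delta * grad_mu
    w_sigma = w_sigma + alpha_u * delta * grad_sigma
    return w_v, w_mu, w_sigma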
Mahmood, A. R., Sutton, R. S., Degris,
T., Pilarski, P. M. (2012). Tuning-free step-size
adaptation. In Proceedings
of the IEEE International Conference on Acoustics, Speech, and
Signal Processing, Kyoto, Japan.
ABSTRACT: Incremental
learning algorithms based on gradient descent are
effective and popular in online supervised learning,
reinforcement learning, signal processing, and many other
application areas. An oft-noted drawback of these
algorithms is that they include a step-size parameter that
needs to be tuned for best performance, which might
require manual intervention and significant domain
knowledge or additional data. In many cases, an entire
vector of step-size parameters (e.g., one for each input
feature) needs to be tuned in order to attain the best
performance of the algorithm. To address this, several
methods have been proposed for adapting step sizes online.
For example, Sutton’s IDBD method can find the best vector
step size for the LMS algorithm, and Schraudolph’s ELK1
method, an extension of IDBD to neural networks, has
proven effective on large applications, such as 3D hand
tracking. However, to date all such step-size adaptation
methods have included a tunable step-size parameter of
their own, which we call the meta-step-size parameter. In this paper
we show that the performance of existing step-size
adaptation methods is strongly dependent on the choice of
their meta-step-size parameter and that their
meta-step-size parameter cannot be set reliably in a
problem-independent way. We introduce a series of
modifications and normalizations to the IDBD method that
together eliminate the need to tune the meta-step-size
parameter to the particular problem. We show that the
resulting overall algorithm, called Autostep, performs as
well or better than the existing step-size adaptation
methods on a number of idealized and robot prediction
problems and does not require any tuning of its
meta-step-size parameter. The ideas behind Autostep are
not restricted to the IDBD method and the same principles
are potentially applicable to other incremental learning
settings, such as reinforcement learning.
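For reference, a minimal sketch of Sutton's IDBD update, on which Autostep builds, is given below; the Autostep-specific normalizations and bounds are omitted, and the meta_step parameter here is exactly the tunable meta-step-size that the paper aims to eliminate. Names and defaults are illustrative.

import numpy as np

def idbd(stream, n_features, meta_step=0.01, init_alpha=0.05):
    """Sketch of IDBD: per-feature step sizes adapted by meta-gradient descent
    for the LMS rule.

    `stream` is assumed to yield (x, y) pairs with x an np.ndarray feature vector.
    """
    w = np.zeros(n_features)
    beta = np.full(n_features, np.log(init_alpha))   # log step sizes
    h = np.zeros(n_features)                         # memory of recent weight changes
    for x, y in stream:
        delta = y - w @ x                            # prediction error
        beta += meta_step * delta * x * h            # meta-gradient step
        alpha = np.exp(beta)                         # per-feature step sizes
        w += alpha * delta * x                       # base LMS update
        h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, np.exp(beta)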
Silver, D., Sutton, R. S., Müller, M. (2012).
Temporal-difference search in computer Go, Machine Learning 87(2):183-219.
Pdf copy for
personal research purposes only.
ABSTRACT: Temporal-difference
learning is one of the most successful and broadly applied
solutions to the reinforcement learning problem; it has
been used to achieve master-level play in chess, checkers
and backgammon. The key idea is to update a value function
from episodes of real experience, by bootstrapping from
future value estimates, and using value function
approximation to generalise between related states.
Monte-Carlo tree search is a recent algorithm for
high-performance search, which has been used to achieve
master-level play in Go. The key idea is to use the mean
outcome of simulated episodes of experience to evaluate
each state in a search tree. We introduce a new approach
to high-performance search in Markov decision processes
and two-player games. Our method, temporal-difference
search, combines temporal-difference learning with
simulation-based search. Like Monte-Carlo tree search, the
value function is updated from simulated experience; but
like temporal-difference learning, it uses value function
approximation and bootstrapping to efficiently generalise
between related states. We apply temporal-difference
search to the game of 9 × 9 Go, using a million binary
features matching simple patterns of stones. Without any
explicit search tree, our approach outperformed an
unenhanced Monte-Carlo tree search with the same number of
simulations. When combined with a simple alpha-beta
search, our program also outperformed all traditional
(pre-Monte-Carlo) search and machine learning programs on
the 9 × 9 Computer Go Server.
Modayil, J., White, A., Sutton, R. S. (2012). Multi-timescale Nexting in a
Reinforcement Learning Robot. Presented at the 2012
International Conference on Adaptive Behaviour, Odense, Denmark. To appear in: SAB 12, LNAI 7426, pp.
299-309, T. Ziemke, C. Balkenius, and J. Hallam, Eds., Springer
Heidelberg. Also ArXiv preprint 1112.1133.
Pilarski,
P.
M., Dawson, M. R., Degris, T., Fahimi, F., Carey, J. P., Sutton,
R. S. (2011). Online human
training of a myoelectric prosthesis controller via
actor-critic reinforcement learning, Proceedings of the 2011 IEEE International Conference on
Rehabilitation Robotics (ICORR), Zurich, Switzerland,
pp. 134-140.
ABSTRACT: As
a contribution toward the goal of adaptable, intelligent
artificial limbs, this work introduces a continuous
actor-critic reinforcement learning method for optimizing the
control of multi-function myoelectric devices. Using a
simulated upper-arm robotic prosthesis, we demonstrate how it
is possible to derive successful limb controllers from
myoelectric data using only a sparse human-delivered training
signal, without requiring detailed knowledge about the task
domain. This reinforcement-based machine learning framework is
well suited for use by both patients and clinical staff, and
may be easily adapted to different application domains and the
needs of individual amputees. To our knowledge, this is the
first myoelectric control approach that facilitates the online
learning of new amputee-specific motions based only on a
one-dimensional (scalar) feedback signal provided by the user
of the prosthesis.
Sutton, R. S., Modayil, J., Delp, M.,
Degris, T., Pilarski, P. M., White, A., Precup, D. (2011). Horde: A scalable real-time
architecture for learning knowledge from unsupervised
sensorimotor interaction. In Proceedings of the Tenth International Conference on
Autonomous Agents and Multiagent Systems, Taipei,
Taiwan.
ABSTRACT: Maintaining
accurate world knowledge in a complex and changing
environment is a perennial problem for robots and other
artificial intelligence systems. Our architecture
for addressing this problem, called Horde, consists of a
large number of independent reinforcement learning
sub-agents, or demons. Each demon is responsible for
answering a single predictive or goal-oriented question
about the world, thereby contributing in a factored,
modular way to the system's overall knowledge. The
questions are in the form of a value function, but each
demon has its own policy, reward function, termination
function, and terminal-reward function unrelated to those
of the base problem. Learning proceeds in parallel by all
demons simultaneously so as to extract the maximal
training information from whatever actions are taken by
the system as a whole. Gradient-based temporal-difference
learning methods are used to learn efficiently and
reliably with function approximation in this off-policy
setting. Horde runs in constant time and memory per time
step, and is thus suitable for learning online in
real-time applications such as robotics. We present
results using Horde on a multi-sensored mobile robot to
successfully learn goal-oriented behaviors and long-term
predictions from off-policy experience. Horde is a
significant incremental step towards a real-time
architecture for efficient learning of general knowledge
from unsupervised sensorimotor interaction.
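As a rough data-structure sketch of this architecture, each demon below bundles its own question functions (target policy, pseudo reward, termination) with a simple off-policy learner, and all demons update in parallel from the same behavior transition; the importance-weighted TD(0) learner is a stand-in for the gradient-TD (GQ(λ)) learners Horde actually uses, and all names are illustrative.

import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class Demon:
    """One Horde-style demon: a general value function defined by its own
    target policy, pseudo-reward, and termination functions (illustrative)."""
    target_policy: Callable      # pi(a, s)        -> probability of action a in s
    pseudo_reward: Callable      # r(s, a, s_next) -> float
    termination: Callable        # gamma(s_next)   -> float in [0, 1]
    alpha: float = 0.01
    w: np.ndarray = None

    def update(self, x, s, a, s_next, x_next, behavior_prob):
        if self.w is None:
            self.w = np.zeros_like(x)
        rho = self.target_policy(a, s) / behavior_prob    # off-policy correction
        z = self.pseudo_reward(s, a, s_next)
        gamma = self.termination(s_next)
        delta = z + gamma * (self.w @ x_next) - (self.w @ x)
        # simple importance-weighted TD(0) step (stand-in for gradient-TD)
        self.w += self.alpha * rho * delta * x

# All demons learn in parallel from every behavior transition:
# for demon in demons:
#     demon.update(x, s, a, s_next, x_next, behavior_prob)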
Modayil, J., Pilarski, P. M., White, A., Degris, T., Sutton,
R. S. (2010). Off-policy
knowledge maintenance for robots, Robotics Science and
Systems Workshop (Towards Closing the Loop: Active Learning
for Robotics). Extended abstract and poster.
ABSTRACT: A fundamental difficulty in
robotics arises from changes in the experienced
environment—periods when the robot’s current situation
differs from past experience. We present an architecture
whereby many independent reinforcement learning agents (or
demons) observe the
behaviour of a single robot. Each demon learns one piece of
world knowledge represented with a generalized value
function. This architecture allows the demons to update
their knowledge online and off-policy from the robot’s
behaviour. We present one approach to active exploration
using curiosity—an internal measure of learning progress—and
conclude with a preliminary result showing how a robot can
adapt its prediction of the time needed to come to a full
stop.
Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2010). Timing in trace conditioning of
the nictitating membrane response of the rabbit (Oryctolagus cuniculus):
Scalar, nonscalar, and adaptive features. Learning and Memory 17:600-604.
Cold Spring Harbor Press.
ABSTRACT: Using
interstimulus intervals (ISIs) of 125, 250, and 500 msec in
trace conditioning of the rabbit nictitating membrane
response, the offset times and durations of conditioned
responses (CRs) were collected along with onset and peak
latencies. All measures were proportional to the ISI, but only
onset and peak latencies conformed to the criterion for scalar
timing. Regarding the CR’s possible protective overlap of the
unconditioned stimulus (US), CR duration increased with ISI,
while the peak’s alignment with the US declined. Implications
for models of timing and CR adaptiveness are discussed.
Maei, H. R., Szepesvari,
Cs., Bhatnagar, S., Sutton,
R. S. (2010). Toward
off-policy learning control with function approximation.
In Proceedings of the 27th
International Conference on Machine Learning, Haifa, Israel.
ABSTRACT: We present the first
temporal-difference learning algorithm for off-policy
control with unrestricted linear function approximation
whose per-time-step complexity is linear in the number of
features. Our algorithm, Greedy-GQ, is an extension of
recent work on gradient temporal-difference learning, which
has hitherto been restricted to a prediction (policy
evaluation) setting, to a control setting in which the
target policy is greedy with respect to a linear
approximation to the optimal action-value function. A
limitation of our control setting is that we require the
behavior policy to be stationary. We call this setting
latent learning because the optimal policy, though
learned, is not manifest in behavior. Popular off-policy
algorithms such as Q-learning are known to be unstable in
this setting when used with linear function approximation.
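Greedy-GQ itself is not reproduced here; for orientation, below is a hedged sketch of the underlying gradient-TD prediction update (linear TDC/GTD(0)) that this work extends to control, with importance weighting for the off-policy case; names and step sizes are illustrative assumptions.

import numpy as np

def gtd0_step(w, h, x, r, x_next, rho, alpha=0.01, beta=0.001, gamma=0.99):
    """One off-policy TDC / GTD(0) prediction step with linear features.

    w is the value-function weight vector, h the auxiliary weight vector, and
    rho the importance-sampling ratio for the action actually taken.
    """
    delta = r + gamma * (w @ x_next) - (w @ x)
    w = w + alpha * rho * (delta * x - gamma * (h @ x) * x_next)
    h = h + beta * rho * (delta - (h @ x)) * x
    return w, h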
Maei, H. R.,
Sutton, R. S. (2010). GQ(λ): A general gradient
algorithm for temporal-difference prediction learning with
eligibility traces. In Proceedings of the
Third Conference on Artificial General Intelligence,
Lugano, Switzerland. GQ(λ)
quick reference guide. GQ(λ) reference implementation
in java, C++,
and C with header
file.
ABSTRACT: A new family of gradient temporal-difference
learning algorithms has recently been introduced by Sutton,
Maei and others in which function approximation is much more
straightforward. In this paper, we introduce the GQ(λ)
algorithm, which can be seen as an extension of that work to a
more general setting including eligibility traces and
off-policy learning of temporally abstract predictions. These
extensions bring us closer to the ultimate goal of this
work—the development of a universal prediction learning
algorithm suitable for learning experientially grounded
knowledge of the world. Eligibility traces are essential to
this goal because they bridge the temporal gaps in cause and
effect when experience is processed at a temporally fine
resolution. Temporally abstract predictions are also essential
as the means for representing abstract, higher-level knowledge
about courses of action, or options. GQ(λ) can be thought of
as an extension of Q-learning. We extend existing convergence
results for policy evaluation to this setting and carry out a
forward-view/backward-view analysis to derive and prove the
validity of the new algorithm.
Kehoe, E. J., Ludvig, E. A., Sutton, R. S.
(2009). Magnitude and timing of
conditioned responses in delay and trace classical
conditioning of the nictitating membrane response of the rabbit
(Oryctolagus cuniculus),
Behavioral Neuroscience 123(5):1095-1101.
ABSTRACT:
The present experiment characterized conditioned nictitating
membrane (NM) movements as a function of CS duration, using
the full range of discernible movements (>.06 mm) rather
than movements exceeding a conventional criterion (>.50
mm). The CS–US interval was fixed at 500 ms, while across
groups, the duration of the CS was 50 ms (trace), 550 ms
(delay), or 1050 ms (extended delay). The delay group showed
the highest level of acquisition. When tested with the
different CS durations, the delay and extended delay groups
showed large reductions in their responses when their CS was
shortened to 50 ms, but the trace group maintained its
response at all durations. Timing of the conditioned
movements appeared similar across all manipulations. The
results suggest that the CS has both a fine timing function
tied to CS onset and a general predictive function tied to
CS duration, both of which may be mediated by cerebellar
pathways.
Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010). Convergent
Temporal-Difference Learning with Arbitrary Smooth
Function Approximation. In Advances in Neural
Information Processing Systems 22, Vancouver, BC, pp.
1204-1212. December 2009. MIT Press.
ABSTRACT: We introduce the first temporal-difference
learning algorithms that converge with smooth value function
approximators, such as neural networks. Conventional
temporal-difference (TD) methods, such as TD(λ), Q-learning
and Sarsa have been used successfully with function
approximation in many applications. However, it is well
known that off-policy sampling, as well as nonlinear
function approximation, can cause these algorithms to become
unstable (i.e., the parameters of the approximator may
diverge). Sutton et al. (2009a, 2009b) solved the problem of
off-policy learning with linear TD algorithms by introducing
a new objective function, related to the Bellman error, and
algorithms that perform stochastic gradient-descent on this
function. These methods can be viewed as natural
generalizations to previous TD methods, as they converge to
the same limit points when used with linear function
approximation methods. We generalize this work to nonlinear
function approximation. We present a Bellman error objective
function and two gradient-descent TD algorithms that
optimize it. We prove the asymptotic almost-sure convergence
of both algorithms, for any finite Markov decision process
and any smooth value function approximator, to a locally
optimal solution. The algorithms are incremental and the
computational complexity per time step scales linearly with
the number of parameters of the approximator. Empirical
results obtained in the game of Go demonstrate the
algorithms’ effectiveness.
Interview
with
Richard S. Sutton, by Verena Heidrich-Meisner of Kuenstliche
Intelligenz. (2009). pp. 41-43.
Bhatnagar,
S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2009). Natural Actor-Critic Algorithms.
Automatica. Unfortunately, due to a disagreement with
the publisher, and to the mistake of using Bibtex, all the
citations in this paper are in alphabetical order rather than
the order in which credit is due. We regret this. Conference version. TR version with
experiments.
ABSTRACT: We present four new
reinforcement learning algorithms based on actor-critic,
natural-gradient and function-approximation ideas, and we
provide their convergence proofs. Actor-critic reinforcement
learning methods are online approximations to policy
iteration in which the value-function parameters are
estimated using temporal difference learning and the policy
parameters are updated by stochastic gradient descent.
Methods based on policy gradient in this way are of special
interest because of their compatibility with function
approximation methods, which are needed to handle large or
infinite state spaces. The use of temporal difference
learning in this way is of interest because in many
applications it dramatically reduces the variance of the
gradient estimates. The use of the natural gradient is of
interest because it can produce better conditioned
parameterizations and has been shown to further reduce
variance in some cases. Our results extend prior
two-timescale convergence results for actor-critic methods
by Konda and Tsitsiklis by using temporal difference
learning in the actor and by incorporating natural
gradients. Our results extend prior empirical studies of
natural actor-critic methods by Peters, Vijayakumar and
Schaal by providing the first convergence proofs and the
first fully incremental algorithms.
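One way to realize the ingredients listed in this abstract is sketched below: a softmax policy, a linear TD critic, compatible features ψ = ∇θ log π, and an actor that follows the advantage weights w, which under compatible features coincide with the natural gradient. The parameterization, step sizes, and variable names are assumptions for illustration; this is not a reproduction of the paper's four algorithms.

```python
import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def nac_step(theta, v, w, x_actions, f_s, a, r, f_next, gamma=0.99,
             alpha_v=0.1, alpha_w=0.05, alpha_theta=0.01):
    """One incremental natural actor-critic update (illustrative sketch).

    x_actions:   matrix of action features x(s, b) for all actions b
    f_s, f_next: critic features for s and s'
    a:           index of the action actually taken
    """
    pi = softmax(x_actions @ theta)
    psi = x_actions[a] - pi @ x_actions          # compatible features, grad log pi

    delta = r + gamma * (v @ f_next) - v @ f_s   # TD error from the critic
    v = v + alpha_v * delta * f_s                # critic (value) update
    w = w + alpha_w * (delta - w @ psi) * psi    # advantage weights (natural gradient)
    theta = theta + alpha_theta * w              # actor follows the natural gradient
    return theta, v, w
```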
Sutton, R. S. (2009).
The grand challenge of
predictive empirical abstract knowledge. Working Notes
of the IJCAI-09 Workshop on Grand Challenges for Reasoning
from Experiences.
ABSTRACT: We survey ongoing
work at the University of Alberta on an
experience-oriented approach to artificial intelligence
(AI) based on reinforcement learning ideas. We seek to
ground world knowledge in a minimal ontology of just the
signals passing back and forth between the AI agent and
its environment at a fine temporal scale. The challenge is
to connect these low-level signals to higher-level
representations in such a way that the knowledge remains
grounded and autonomously verifiable. The
mathematical ideas of temporally abstract options, option
models, and temporal-difference networks can be applied to
begin to address this challenge. This has been illustrated
in several simple computational worlds, and we seek now to
extend the approach to a physically realized robot. A
recent theoretical development is the extension of simple
temporal-difference methods to off-policy forms using
function approximation; this should enable, for the first
time, efficient and reliable intra-option learning,
bringing the goal of predictive empirical abstract
knowledge closer to achievement.
Sutton, R. S., Maei,
H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods
for temporal-difference learning with linear function
approximation. In Proceedings
of the 26th International Conference on Machine Learning, Montreal, Canada. Presentation
at ICML-09.
Sutton, Szepesvari and Maei
(2009) recently introduced the first temporal-difference
learning algorithm compatible with both linear function
approximation and off-policy training, and whose
complexity scales only linearly in the size of the
function approximator. Although their gradient temporal difference
(GTD) algorithm converges reliably, it can be very slow
compared to conventional linear TD (on on-policy problems
where TD is convergent), calling into question its
practical utility. In this paper we introduce two
new related algorithms with better convergence
rates. The first algorithm, GTD2, is derived and
proved convergent just as GTD was, but uses a different
objective function and converges significantly faster (but
still not as fast as conventional TD). The second new
algorithm, linear TD
with gradient correction, or TDC, uses the same
update rule as conventional TD except for an additional
term which is initially zero. In our experiments on small
test problems and in a Computer Go application with a
million features, the learning rate of this algorithm was
comparable to that of conventional TD. This algorithm
appears to extend linear TD to off-policy learning with no
penalty in performance while only doubling computational
requirements.
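A minimal sketch of the TDC update on one transition with linear features; step sizes are illustrative, and the GTD2 variant differs only in the theta update, as noted in the comment.

```python
import numpy as np

def tdc_step(theta, w, phi, r, phi_next, gamma=0.99, alpha=0.1, beta=0.05):
    """Linear TD with gradient correction (TDC) on a single transition.
    GTD2 would instead use: theta += alpha * (phi - gamma*phi_next) * (w @ phi)."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```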
Kehoe, E. J., Olsen, K. N., Ludvig, E. A., Sutton, R. S.
(2009). Scalar timing varies
with response magnitude in classical conditioning of the
rabbit nictitating membrane response (Oryctolagus
cuniculus). Behavioral
Neuroscience 123, 212–217.
The present experiment was
aimed at characterizing the timing of conditioned
nictitating membrane (NM) movements as function of the
interstimulus interval (ISI) in delay conditioning for
rabbits (Oryctolagus cuniculus). Onset latency and peak
latency were approximately, but not strictly, scalar for
all but the smallest movements (<.10 mm). That is, both
the mean and standard deviation of the timing measures
increased in proportion to the ISI, but their coefficients
of variation (standard deviation/mean) tended to be larger
for shorter ISIs. For all ISIs, the absolute timing of the
NM movements covaried with magnitude. The smaller
movements (approximately .11-.50 mm) were highly
variable, and their peaks tended to occur well after the
time of US delivery. The larger movements (>.50 mm)
were less variable, and their peaks were better aligned
with the time of US delivery. These results are discussed
with respect to their implications for current models of
timing in eyeblink conditioning.
Sutton, R. S.,
Szepesvari, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for
off-policy temporal-difference learning with linear
function approximation. In Advances in Neural
Information Processing Systems 21, Vancouver, BC.
December 2008. MIT Press.
We introduce the first
temporal-difference learning algorithm that is stable with
linear function approximation and off-policy training, for
any finite Markov decision process, behavior policy, and
target policy, and whose complexity scales linearly in the
number of parameters. We consider an i.i.d.
policy-evaluation setting in which the data need not come
from on-policy experience. The gradient
temporal-difference (GTD) algorithm estimates the
expected update vector of the TD(0) algorithm and performs
stochastic gradient descent on its L2 norm.
We prove that this algorithm is stable and convergent
under the usual stochastic approximation conditions to the
same least-squares solution as found by the LSTD, but
without LSTD's quadratic computational complexity.
GTD is online and incremental, and does not involve
multiplying by products of likelihood ratios as in
importance-sampling methods.
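A minimal sketch of the GTD update described here: the auxiliary vector w tracks the expected TD(0) update, and theta descends the gradient of its squared norm. The step sizes are illustrative, and in the off-policy setting the transitions would be drawn from the i.i.d. policy-evaluation distribution described in the abstract.

```python
import numpy as np

def gtd_step(theta, w, phi, r, phi_next, gamma=0.99, alpha=0.05, beta=0.05):
    """One GTD update on a transition with linear features."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi
    w = w + beta * (delta * phi - w)                        # track E[delta * phi]
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    return theta, w
```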
Ludvig,
E., Sutton, R. S., Verbeek, E., Kehoe, E. J. (2009). A computational model of
hippocampal function in trace conditioning. In Advances in Neural Information
Processing Systems 21, pp. 993-1000, Vancouver, BC.
December 2008. MIT Press.
We present a new
reinforcement-learning model for the role of the
hippocampus in classical conditioning, focusing on the
differences between trace and delay conditioning. In the
model, all stimuli are represented both as unindividuated
wholes and as a series of temporal elements with varying
delays. These two stimulus representations interact,
producing different patterns of learning in trace and
delay conditioning. The model proposes that hippocampal
lesions eliminate long-latency temporal elements, but
preserve short-latency temporal elements. For trace
conditioning, with no contiguity between stimulus and
reward, these long-latency temporal elements are vital to
learning adaptively timed responses. For delay
conditioning, in contrast, the continued presence of the
stimulus supports conditioned responding, and the
short-latency elements suppress responding early in the
stimulus. In accord with the empirical data, simulated
hippocampal damage impairs trace conditioning, but not
delay conditioning, at medium-length intervals. With
longer intervals, learning is impaired in both procedures,
and, with shorter intervals, in neither. In addition, the
model makes novel predictions about the response
topography with extended stimuli or post-training lesions.
These results demonstrate how temporal contiguity, as in
delay conditioning, changes the timing problem faced by
animals, rendering it both easier and less susceptible to
disruption by hippocampal lesions.
Cutumisu, M., Szafron, D., Bowling, M., Sutton R. S. (2008).
Agent learning using
action-dependent learning rates in computer role-playing
games. Proceedings
of the 4th Artificial Intelligence and Interactive Digital
Entertainment Conference, pp. 22-29.
We
introduce the ALeRT (Action-dependent Learning Rates with Trends)
algorithm that makes two modifications to the learning
rate and one change to the exploration rate of traditional
reinforcement learning techniques. Our learning rates are
action-dependent and increase or decrease based on trends
in reward sequences. Our exploration rate decreases when
the agent is learning successfully and increases
otherwise. These improvements result in faster learning.
We implemented this algorithm in NWScript, a scripting
language used by BioWare Corp.’s Neverwinter Nights game, with the goal
of improving the behaviours of game agents so that they
react more intelligently to game events. Our goal is to
provide an agent with the ability to (1) discover favourable
policies in a multi-agent computer role-playing game
situation and (2) adapt
to sudden changes in the environment.
Silver, D., Sutton, R. S., and Mueller,
M. (2008). Sample-based learning and search
with permanent and transient memories. In Proceedings
of the 25th International Conference on Machine Learning,
pp. 968-975, Helsinki, Finland. July 2008. Omnipress.
We study a reinforcement
learning architecture that encompasses both learning and
planning, and apply it to high performance Computer Go. In
this domain, the most successful planning methods are
based on sample-based search algorithms, such as UCT, in
which states are individuated, and the most successful
learning methods are based on temporal-difference learning
algorithms, such as Sarsa, in which linear function
approximation is used. In both cases, an estimate of the
value function is formed, but in the first case it is
transient, computed and then discarded after each move,
whereas in the second case it is more permanent, slowly
accumulating over many moves and games. The idea of our
architecture, Dyna-2, is for the transient planning memory
and the permanent learning memory to remain separate, but
for both to be based on linear function approximation and
both to be updated by Sarsa. This architecture subsumes
Sarsa and UCT as special cases. We apply Dyna-2 to 9x9
Computer Go, using a million binary features in the
function approximator, based on templates matching small
fragments of the board. We test our approach against
GnuGo, a standard benchmark opponent, and on the Computer
Go Online Server (CGOS). We show that the combination of
permanent and transient memories performs better than
either memory alone, and significantly outperforms a
traditional approach using alpha-beta search. Our program
based on Dyna-2 achieves a higher rating on CGOS than any
other handcrafted or traditional search based program.
This is the first exploration of the Dyna concept in a
large-scale, high-performance application.
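The separation of permanent and transient memories might be sketched as follows, with both memories linear and both trained by a Sarsa-style update; the names, reset policy, and step sizes below are an illustrative reading of the abstract, not the paper's implementation.

```python
import numpy as np

class Dyna2Sketch:
    """Permanent + transient linear memories, both trained by Sarsa (illustrative)."""

    def __init__(self, n_features, alpha=0.1, alpha_bar=0.1, gamma=1.0):
        self.theta = np.zeros(n_features)      # permanent memory (real experience)
        self.theta_bar = np.zeros(n_features)  # transient memory (search from the current state)
        self.alpha, self.alpha_bar, self.gamma = alpha, alpha_bar, gamma

    def q(self, phi):
        # The action value combines both memories.
        return (self.theta + self.theta_bar) @ phi

    def learn_real(self, phi, r, phi_next):
        delta = r + self.gamma * self.q(phi_next) - self.q(phi)
        self.theta += self.alpha * delta * phi

    def learn_simulated(self, phi, r, phi_next):
        # Sample-based search: the same Sarsa update, but only the transient memory moves.
        delta = r + self.gamma * self.q(phi_next) - self.q(phi)
        self.theta_bar += self.alpha_bar * delta * phi

    def new_search(self):
        self.theta_bar[:] = 0.0   # the transient memory is discarded between searches
```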
Sutton, R. S., Szepesvari, Cs.,
Geramifard, A., Bowling, M., (2008). Dyna-style planning with linear
function approximation and prioritized sweeping,
Proceedings of the 24th Conference on Uncertainty in
Artificial Intelligence, pp. 528-536.
We consider the problem of
efficiently learning optimal control policies and value
functions over large state spaces in an online setting in
which estimates must be available after each interaction
with the world. This paper develops an explicitly
model-based approach extending the Dyna architecture to
linear function approximation. Dyna-style planning
proceeds by generating imaginary experience from the world
model and then applying model-free reinforcement learning
algorithms to the imagined state transitions. Our
main results are to prove that linear Dyna-style planning
converges to a unique solution independent of the
generating distribution, under natural conditions.
In the policy evaluation setting, we prove that the limit
point is the least-squares (LSTD) solution. An
implication of our results is that prioritized-sweeping
can be soundly extended to the linear approximation case,
backing up to preceding features rather than to preceding
states. We introduce two versions of prioritized sweeping
with linear Dyna and briefly illustrate their performance
empirically on the Mountain Car and Boyan Chain problems.
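A minimal sketch of linear Dyna for policy evaluation, assuming a learned linear model (an expected next-feature matrix F and reward vector b) and planning backups applied to unit feature vectors; the paper's prioritized-sweeping versions instead select which feature to back up from a priority queue. Step sizes and the plain sweep order are illustrative.

```python
import numpy as np

def linear_model_update(F, b, phi, r, phi_next, alpha_m=0.1):
    """Learn the linear world model from one real transition."""
    F = F + alpha_m * np.outer(phi_next - F @ phi, phi)
    b = b + alpha_m * (r - b @ phi) * phi
    return F, b

def linear_dyna_plan(theta, F, b, n_sweeps=1, gamma=0.99, alpha=0.1):
    """Dyna-style planning: TD backups on model-imagined feature transitions."""
    n = len(theta)
    for _ in range(n_sweeps):
        for i in range(n):
            x = np.zeros(n); x[i] = 1.0      # plan on the i-th unit feature vector
            r_hat = b @ x                    # model-predicted reward
            x_next = F @ x                   # model-predicted next features
            delta = r_hat + gamma * (theta @ x_next) - theta @ x
            theta = theta + alpha * delta * x
    return theta
```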
Ludvig, E. A., Sutton, R. S.,
Kehoe, E. J. (2008) Stimulus
representation and the timing of reward-prediction errors
in models of the dopamine system, Neural
Computation 20:3034-3054.
The phasic firing of dopamine neurons has been
theorized to encode a reward-prediction error as
formalized by the temporal-difference (TD) algorithm in
reinforcement learning. Most TD models of dopamine have
assumed a stimulus representation, known as the complete
serial compound, in which each moment in a trial is
distinctly represented. We introduce a more realistic
temporal stimulus representation for the TD model. In our
model, all external stimuli, including rewards, spawn a
series of internal microstimuli, which grow weaker and
more diffuse over time. These microstimuli are used by the
TD learning algorithm to generate predictions of future
reward. This new stimulus representation injects temporal
generalization into the TD model and enhances
correspondence between model and data in several
experiments, including those when rewards are omitted or
received early. This improved fit mostly derives from the
absence of large negative errors in the new model,
suggesting that dopamine alone can encode the full range
of TD errors in these situations.
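The microstimulus representation might be sketched roughly as follows: each stimulus starts an exponentially decaying memory trace, and the features are Gaussian basis functions of the trace height scaled by that height, so later microstimuli are weaker and coarser in time. The constants and the exact functional form here are illustrative guesses, not the paper's parameters; the resulting features would then feed an ordinary TD(λ) learner whose prediction error plays the role of the dopamine signal.

```python
import numpy as np

def microstimuli(t_since_onset, n_micro=10, decay=0.985, width=0.08):
    """Illustrative microstimulus features for a single stimulus."""
    trace = decay ** t_since_onset                        # decaying memory trace
    centers = np.linspace(1.0, 1.0 / n_micro, n_micro)    # basis centers on trace height
    return trace * np.exp(-((trace - centers) ** 2) / (2 * width ** 2))
```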
Kehoe, E. J., Ludvig, E. A., Dudeney, J. E., Neufeld, J.,
Sutton, R. S. (2008). Magnitude
and timing of nictitating membrane movements during
classical conditioning of the rabbit (Oryctolagus
cuniculus), Behavioral Neuroscience
122(2):471-476.
A trial-by-trial, subject-by-subject analysis
was conducted to determine whether generation of the
conditioned response (CR) occurs on a continuous or
all-or-none basis. Three groups of rabbits were trained on
different partial reinforcement schedules with the
conditioned stimulus presented alone on 10%, 30%, or 50%,
respectively, of all trials. Plots of each rabbit’s
nictitating membrane movements revealed that their
magnitude rose in a continuous fashion. Response growth
during acquisition followed a sigmoidal curve, and the
timing of CR-sized movements was largely stable throughout
the experiment. The results are discussed with respect to
alternative models of CR generation.
Bhatnagar, S., Sutton, R. S.,
Ghavamzadeh, M., Lee, M. (2008). Incremental Natural Actor-Critic
Algorithms. Twenty First Annual Conference on
Neural Information Processing Systems, pp.
105-112. Conference held in Vancouver, Canada,
December 2007. Journal version. TR version with experiments.
ABSTRACT: We present four
new reinforcement learning algorithms based on
actor-critic and natural-gradient ideas, and provide their
convergence proofs. Actor-critic reinforcement learning
methods are online approximations to policy iteration in
which the value-function parameters are estimated using
temporal difference learning and the policy parameters are
updated by stochastic gradient descent. Methods based on
policy gradient in this way are of special interest
because of their compatibility with function approximation
methods, which are needed to handle large or infinite
state spaces, and the use of temporal difference learning
in this way is of interest because in many applications it
dramatically reduces the variance of the policy gradient
estimates. The use of the natural gradient is of interest
because it can produce better conditioned
parameterizations and has been shown to further reduce
variance in some cases. Our results extend prior
two-timescale convergence results for actor-critic methods
by Konda et al. by using temporal difference learning in
the actor and by incorporating natural gradients, and
extend prior empirical studies of natural-gradient
actor-critic methods by Peters et al. by providing the
first convergence proofs and the first fully incremental
algorithms.
Sutton, R.
S., Koop, A., Silver, D. (2007). On the Role of Tracking in
Stationary Environments. In Proceedings of
the 2007 International Conference on Machine Learning.
Associated talk from 6/21/07.
ABSTRACT: It is often
thought that learning algorithms that track the best
solution, as opposed to converging to it, are important
only on nonstationary problems. We present three results
suggesting that this is not so. First we illustrate in a
simple concrete example, the Black and White problem, that
tracking can perform better than any converging algorithm
on a stationary problem. Second, we show the same point on
a larger, more realistic problem, an application of
temporal-difference learning to computer Go. Our third
result suggests that tracking in stationary problems could
be important for meta-learning research (e.g., learning to
learn, feature selection, transfer). We apply a
meta-learning algorithm for step-size adaptation, IDBD,
to the Black and White problem, showing that meta-learning
has a dramatic long-term effect on performance whereas, on
an analogous converging problem, meta-learning has only a
small second-order effect. This small result suggests a
way of eventually overcoming a major obstacle to
meta-learning research: the lack of an independent
methodology for task selection.
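For reference, the IDBD step-size adaptation used in these experiments can be written in a few lines for the supervised (LMS) case; the meta step size below is an illustrative value.

```python
import numpy as np

def idbd_step(w, beta, h, x, y, meta=0.01):
    """One IDBD update (Sutton, 1992): per-weight step sizes exp(beta_i)
    adapted by a meta-level gradient step, then an ordinary LMS update."""
    delta = y - w @ x                        # prediction error
    beta = beta + meta * delta * x * h       # adapt the log step sizes
    alpha = np.exp(beta)                     # per-weight step sizes
    w = w + alpha * delta * x                # LMS step with the adapted rates
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w, beta, h
```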
Silver, D., Sutton, R. S., and
Mueller, M. (2007) Reinforcement
Learning of Local Shape in the Game of Go, Proceedings of the 20th
International Joint Conference on Artificial Intelligence
(IJCAI-07).
ABSTRACT: We explore an
application to the game of Go of a reinforcement learning
approach based on a linear evaluation function and large
numbers of binary features. This strategy has proved
effective in game playing programs and other reinforcement
learning applications. We apply this strategy to Go by
creating over a million features based on templates for
small fragments of the board, and then use temporal
difference learning and self-play. This method identifies
hundreds of low level shapes with recognisable
significance to expert Go players, and provides quantitative
estimates of their values. We analyse the relative
contributions to performance of templates of different
types and sizes. Our results show that small,
translation-invariant templates are surprisingly
effective. We assess the performance of our program by
playing against the Average Liberty Player and a variety
of computer opponents on the 9x9 Computer Go Server. Our
linear evaluation function appears to outperform all other
static evaluation functions that do not incorporate
substantial domain knowledge.
Geramifard, A., Bowling, M.,
Zinkevich, M., and Sutton, R. S. (2007). iLSTD: Eligibility Traces and
Convergence Analysis, Advances in Neural Information Processing Systems
19 (NIPS*06).
ABSTRACT: We present new
theoretical and empirical results with the iLSTD algorithm
for policy evaluation in reinforcement learning with
linear function approximation. iLSTD is an
incremental method for achieving results similar to LSTD,
the data-efficient, least-squares version of temporal
difference learning, without incurring the full cost of
the LSTD computation. LSTD is O(n²), where n is
the number of parameters in the linear function
approximator, while iLSTD is O(n).
In this paper, we generalize the previous iLSTD algorithm
and present three new results: (1) the first convergence
proof for an iLSTD algorithm; (2) an extension to
incorporate eligibility traces without changing the
asymptotic computational complexity; and (3) the first
empirical results with an iLSTD algorithm for a problem
(mountain car) with feature vectors large enough (n = 10,000) to show
substantial computational advantages over LSTD.
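A sketch of the basic iLSTD step: keep running estimates of the LSTD matrix A and vector b, maintain μ = b − Aθ incrementally, and apply a few greedy coordinate updates to θ each time step. This dense version is for clarity only; the O(n) per-step bound in the paper relies on sparse feature vectors, and the step size and number of descent steps are illustrative.

```python
import numpy as np

def ilstd_step(theta, A, b, mu, phi, r, phi_next, gamma=0.99, alpha=0.01, m=1):
    """One iLSTD step (illustrative dense version)."""
    dA = np.outer(phi, phi - gamma * phi_next)
    db = r * phi
    A += dA
    b += db
    mu += db - dA @ theta                    # keep mu = b - A @ theta incrementally

    for _ in range(m):                       # a few greedy coordinate-descent steps
        j = np.argmax(np.abs(mu))
        step = alpha * mu[j]
        theta[j] += step
        mu -= step * A[:, j]
    return theta, A, b, mu
```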
Geramifard, A., Bowling, M., Sutton, R.
S. (2006). Incremental
Least-Squares Temporal Difference Learning.
In Proceedings of the Twenty-First National Conference
on Artificial Intelligence (AAAI-06), pages 356-361.
ABSTRACT: Approximate policy
evaluation with linear function approximation is a
commonly arising problem in reinforcement learning,
usually solved using temporal difference (TD) algorithms.
In this paper we introduce a new variant of linear TD
learning, called incremental least-squares TD learning, or
iLSTD. This method is more data efficient than
conventional TD algorithms such as TD(0) and is more
computationally efficient than non-incremental
least-squares TD methods such as LSTD (Bradtke & Barto
1996; Boyan 1999). In particular, we show that the
per-time-step complexities of iLSTD and TD(0) are O(n), where n is the number of
features, whereas that of LSTD is O(n²). This
difference can be decisive in modern applications of
reinforcement learning where the use of a large number of
features has proven to be an effective solution strategy.
We present empirical comparisons, using the test problem
introduced by Boyan (1999), in which iLSTD converges
faster than TD(0) and almost as fast as LSTD.
Precup, D., Sutton, R. S.,
Paduraru, C., Koop, A., Singh, S. (2006). Off-policy
Learning with Recognizers. Advances in Neural Information
Processing Systems 18 (NIPS*05).
ABSTRACT: We introduce a new
algorithm for off-policy temporal-difference learning with
function approximation that has much lower variance and
requires less knowledge of the behavior policy than prior
methods. We develop the notion of a recognizer, a filter
on actions that distorts the behavior policy to produce a
related target policy with low-variance
importance-sampling corrections. We also consider target
policies that are deviations from the state distribution
of the behavior policy, such as potential temporally
abstract options, which further reduces variance. This
paper introduces recognizers and their potential
advantages, then develops a full algorithm for MDPs and
proves that its updates are in the same direction as
on-policy TD updates, which implies asymptotic
convergence. Our algorithm achieves this without knowledge
of the behavior policy or even requiring that there exists
a behavior policy.
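The recognizer construction can be sketched in a few lines: given behavior probabilities b(a|s) and recognizer values c(s,a) in [0,1], the induced target policy is proportional to c(s,a)b(a|s), and the importance-sampling correction for a taken action is c(s,a)/μ(s), where μ(s) is the recognition probability. This is a paraphrase of the abstract with illustrative names, not the paper's full algorithm.

```python
import numpy as np

def recognizer_target(b_probs, c_values):
    """Target policy and importance corrections induced by a recognizer."""
    mu = float(c_values @ b_probs)        # probability the behavior is "recognized"
    pi = c_values * b_probs / mu          # induced low-variance target policy
    rho = c_values / mu                   # bounded corrections pi(a|s)/b(a|s)
    return pi, rho

# Example: uniform behavior over three actions; the recognizer accepts the first two.
pi, rho = recognizer_target(np.array([1/3, 1/3, 1/3]), np.array([1.0, 1.0, 0.0]))
```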
Sutton, R. S., Rafols, E. J.,
Koop, A. (2006). Temporal
abstraction in temporal-difference networks. Advances in Neural Information
Processing Systems 18 (NIPS*05).
ABSTRACT:
Temporal-difference (TD) networks have been proposed as a
way of representing and learning a wide variety of
predictions about the interaction between an agent and its
environment (Sutton & Tanner, 2005). These
predictions are compositional
in that their targets are defined in terms of other
predictions, and subjunctive
in that they are about what would happen if an action or
sequence of actions were taken. In conventional TD
networks, the inter-related predictions are at successive
time steps and contingent on a single action; here we
generalize them to accommodate extended time intervals and
contingency on whole ways of behaving. Our
generalization is based on the options framework for
temporal abstraction (Sutton, Precup & Singh,
1999). The primary contribution of this paper is to
introduce a new algorithm for intra-option learning in TD
networks with function approximation and eligibility
traces. We present empirical examples of our
algorithm's effectiveness and of the greater
representational expressiveness of temporally-abstract TD
networks.
Tanner, B., Sutton, R. S. (2005)
TD(lambda) Networks:
Temporal-Difference Networks with Eligibility Traces.
Proceedings of the 2005 International Conference on
Machine Learning, pp 889-896.
ABSTRACT:
Temporal-difference (TD) networks have been introduced as
a formalism for expressing and learning grounded world
knowledge in a predictive form (Sutton & Tanner,
2005). Like conventional TD(0) methods, the learning
algorithm for TD networks uses 1-step backups to train
prediction units about future events. In conventional TD
learning, the TD(lambda) algorithm is often used to do
more general multi-step backups of future predictions. In
our work, we introduce a generalization of the 1-step TD
network specification that is based on the TD(lambda)
learning algorithm, creating TD(lambda) networks. We
present experimental results that show TD(lambda) networks
can learn solutions in more complex environments than TD
networks. We also show that in problems that can be solved
by TD networks, TD(lambda) networks generally learn
solutions much faster than their 1-step counterparts.
Finally, we present an analysis of our algorithm
that shows that the computational cost of TD(lambda)
networks is only slightly more than that of TD networks.
Rafols, E. J., Ring, M. B.,
Sutton, R. S., Tanner, B. (2005) Using Predictive
Representations to Improve Generalization in Reinforcement
Learning. Proceedings of the 2005
International Joint Conference on Artificial Intelligence,
pp 835-840.
ABSTRACT: The predictive
representations hypothesis holds that particularly good
generalization will result from representing the state of
the world in terms of predictions about possible future
experience. This hypothesis has been a central motivation
behind recent research in, for example, PSRs and TD
networks. In this paper we present the first explicit
investigation of this hypothesis. We show in a
reinforcement-learning example (a grid-world navigation
task) that a predictive representation in tabular form can
learn much faster than both the tabular explicit-state
representation and a tabular history-based method.
Tanner, B., Sutton, R. S. (2005)
Temporal-Difference Networks
with History. Proceedings of the 2005
International Joint Conference on Artificial Intelligence,
pp 865-870.
ABSTRACT:
Temporal-difference (TD) networks are a formalism for
expressing and learning grounded world knowledge in a
predictive form [Sutton and Tanner, 2005]. However, not
all partially observable Markov decision processes can be
efficiently learned with TD networks. In this paper, we
extend TD networks by allowing the network-update process
(answer network) to depend on the recent history of
previous actions and observations rather than only on the
most recent action and observation. We show that this
extension enables the solution of a larger class of
problems than can be solved by the original TD networks or
by history-based methods alone. In addition, we apply TD
networks to a problem that, while still simple, is
significantly larger than has previously been considered.
We show that history-extended TD networks can learn much
of the common-sense knowledge of an egocentric gridworld
domain with a single bit of perception.
Stone, P., Sutton, R. S.,
Kuhlmann, G. (2005) Reinforcement
Learning for RoboCup-Soccer Keepaway. Adaptive Behavior.
Some simulations
of
keepaway referenced in the paper and keepaway
software.
ABSTRACT: RoboCup simulated
soccer presents many challenges to reinforcement learning
methods, including a large state space, hidden and
uncertain state, multiple independent agents learning
simultaneously, and long and variable delays in the
effects of actions. We describe our application of
episodic SMDP Sarsa(lambda) with linear tile-coding
function approximation and variable lambda to learning
higher-level decisions in a keepaway subtask of RoboCup
soccer. In keepaway, one team, "the keepers," tries to
keep control of the ball for as long as possible despite
the efforts of "the takers." The keepers learn
individually when to hold the ball and when to pass to a
teammate. Our agents learned policies that significantly
outperform a range of benchmark policies. We demonstrate
the generality of our approach by applying it to a number
of task variations including different field sizes and
different numbers of players on each team.
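The learning method described here is episodic Sarsa(lambda) with linear tile-coding features. A generic sketch with binary features and replacing traces is given below; the `env` interface and the `active_features(state, action)` tile-index function are hypothetical placeholders, and the parameter values are illustrative rather than those used for keepaway.

```python
import numpy as np

def sarsa_lambda_episode(env, active_features, n_weights, n_actions,
                         alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.01):
    """One episode of Sarsa(lambda) with binary features and replacing traces."""
    w = np.zeros(n_weights)
    z = np.zeros(n_weights)                  # eligibility traces

    def q(s, a):
        return w[active_features(s, a)].sum()

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return max(range(n_actions), key=lambda b: q(s, b))

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        z[active_features(s, a)] = 1.0       # replacing traces on the active tiles
        delta = r - q(s, a)
        if not done:
            a_next = eps_greedy(s_next)
            delta += gamma * q(s_next, a_next)
        w += alpha * delta * z
        z *= gamma * lam
        if not done:
            s, a = s_next, a_next
    return w
```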
Sutton, R. S., Tanner, B. (2005) Temporal-Difference
Networks. Advances
in Neural Information Processing Systems 17,
pages 1377-1384. Associated
talk from 12/14/05. Small-size version
of same talk.
ABSTRACT: We introduce a generalization of
temporal-difference (TD) learning to networks of
interrelated predictions. Rather than relating a single
prediction to itself at a later time, as in conventional
TD methods, a TD network relates each prediction in a set
of predictions to other predictions in the set at a later
time. TD networks can represent and apply TD learning to a
much wider class of predictions than has previously been
possible. Using a random-walk example, we show that these
networks can be used to learn to predict by a fixed
interval, which is not possible with conventional TD
methods. Secondly, we show that when actions are
introduced, and the inter-prediction relationships made
contingent on them, the usual learning-efficiency
advantage of TD methods over Monte Carlo (supervised
learning) methods becomes particularly pronounced.
Thirdly, we demonstrate that TD networks can learn
predictive state representations that enable exact
solution of a non-Markov problem. A very broad range of
inter-predictive temporal relationships can be expressed
in these networks. Overall we argue that TD networks
represent a substantial extension of the abilities of TD
methods and bring us closer to the goal of representing
world knowledge in entirely predictive, grounded terms.
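As a rough illustration of the formalism (not the paper's experiments), the sketch below implements a chain-structured question network: node 0 predicts the next observation bit, and node i predicts node i−1 one step later, so node i amounts to a prediction i+1 steps ahead. The answer network is a one-layer sigmoid over the current observation features and the previous predictions, trained with one-step semi-gradient updates; all names and constants are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChainTDNetwork:
    """A tiny TD network with a chain question network (illustrative sketch)."""

    def __init__(self, n_nodes, n_obs_features, alpha=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.1 * rng.standard_normal((n_nodes, n_obs_features + n_nodes))
        self.alpha = alpha
        self.y = np.zeros(n_nodes)            # predictions from the previous step

    def step(self, x_obs, obs_bit_next, x_obs_next):
        x_t = np.concatenate([x_obs, self.y])
        y_t = sigmoid(self.W @ x_t)                                    # predictions at time t
        y_next = sigmoid(self.W @ np.concatenate([x_obs_next, y_t]))   # predictions at t+1

        # Question network: node 0's target is the next observation bit,
        # node i's target is node i-1's prediction at the next step.
        targets = np.concatenate([[obs_bit_next], y_next[:-1]])

        grad = (targets - y_t) * y_t * (1.0 - y_t)     # one-step, semi-gradient error
        self.W += self.alpha * np.outer(grad, x_t)
        self.y = y_t
```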
Littman, M.L., Sutton, R.S., Singh, S.
(2002). Predictive
Representations of State. In: Advances in Neural
Information Processing Systems 14:1555-1561
(Proceedings of the 2001 conference). MIT Press.
Stone, P., Sutton, R. S. (2002). Keepaway
soccer:
A
machine learning testbed. In: Lecture Notes in Computer
Science 2377, pp. 207-237, Springer.
RoboCup simulated soccer
presents many challenges to machine learning (ML) methods,
including a large state space, hidden and uncertain state,
multiple agents, and long and variable delays in the effects
of actions. While there have been many successful ML
applications to portions of the robotic soccer task, it
appears to be still beyond the capabilities of modern machine
learning techniques to enable a team of 11 agents to
successfully learn the full robotic soccer task from sensors
to actuators. Because the successful applications to portions
of the task have been embedded in different teams and have
often addressed different sub-tasks, they have been difficult
to compare. We put forth keepaway soccer as a domain suitable
for directly comparing different machine learning approaches
to robotic soccer. It is complex enough that it can’t be
solved trivially, yet simple enough that complete machine
learning approaches are feasible. In keepaway, one team, “the
keepers,” tries to keep control of the ball for as long as
possible despite the efforts of “the takers.” The keepers
learn individually when to hold the ball and when to pass to a
teammate, while the takers learn when to charge the
ball-holder and when to cover possible passing lanes. We fully
specify the domain and summarize some initial, successful
learning results.
Stone, P., Sutton, R.S. (2001). Scaling reinforcement learning
toward RoboCup soccer. Proceedings of the 18th
International Conference on Machine Learning, pp.
537-544.
Precup, D., Sutton, R.S., Dasgupta,
S. (2001). Off-policy
temporal-difference learning with function approximation.
digitally remastered
version. scanned
version. Proceedings of the 18th International
Conference on Machine Learning. Associated talk from
7/1/01.
Stone, P., Sutton, R.S.,
Singh, S. (2001).
Reinforcement Learning for 3 vs. 2 Keepaway. In: RoboCup-2000:
Robot
Soccer World Cup IV, P. Stone, T. Balch, and G.
Kraetszchmar, Eds., Springer Verlag. An earlier version appeared in
the Proceedings of the RoboCup-2000 Workshop,
Melbourne, Australia.
Sutton, R.S. (2001). Method
and apparatus for machine learning. U.S. Patent
6,249,781.
ABSTRACT: A
method and apparatus is disclosed for machine learning of
a pattern sequence which is derived from a plurality of
inputs. The pattern sequence is predicted from learning
rate parameters that are exponentially related to an
incrementally calculated gain parameter for each input.
The gain parameters are increased or decreased in real time
in correlation with the accuracy of the learning process.
The disclosed method and apparatus are advantageously
utilized in signal processing, adaptive control systems,
and pattern recognition.
Sutton, R.S.,
McAllester, D., Singh, S., Mansour, Y. (2000). Policy Gradient Methods for
Reinforcement Learning with Function Approximation. Advances
in Neural Information Processing Systems 12
(Proceedings of the 1999 conference), pp. 1057-1063. MIT
Press. An earlier version, as submitted May 1999, appeared
as an AT&T
Labs Technical Report. Later we began but never
finished another paper, Comparing
Policy-Gradient Algorithms.
Precup, D., Sutton, R.S., Singh, S.
(2000). Eligibility Traces for
Off-Policy Policy Evaluation. Proceedings of the
17th International Conference on Machine Learning,
pp. 759-766. Morgan Kaufmann.
Sutton, R.S. (1999). Open
theoretical questions in reinforcement learning. In
Proceedings of the Fourth European Conference on
Computational Learning Theory (Proceedings
EuroCOLT'99), pp. 11-17, Fischer, P., Simon, H.U., Eds.
Springer-Verlag.
Sutton, R.S. (1999). Reinforcement
Learning. In Rob Wilson and Frank Keil (Eds.) The MIT
Encyclopedia of the Cognitive Sciences, MIT
Press.
Sutton, R.S., Precup, D., Singh, S.
(1999). Between MDPs and
semi-MDPs: A Framework for Temporal Abstraction in
Reinforcement Learning. Artificial Intelligence
112:181-211. An earlier version appeared as Technical Report 98-74,
Department of Computer Science, University of
Massachusetts, Amherst, MA 01003. April, 1998. Associated talk from
3/5/98. Humorous excerpts
from the JAIR reviews recommending rejection of this
paper.
Sutton, R.S., Singh, S.,
Precup, D., Ravindran, B. (1999). Improved switching among
temporally abstract actions. Djvu
version. Advances in Neural Information
Processing Systems 11 (Proceedings of the 1998
conference), MIT Press. Associated
talk
from 12/2/98.
Moll, R., Barto, A.G., Perkins, T. J., Sutton, R.S.
(1999). Learning
instance-independent value functions to enhance local
search. Advances in Neural Information
Processing Systems 11 (Proceedings of the 1998
conference), pp. 1017-1023. MIT Press.
Sutton, R. S., Reinforcement
learning: Past, present, and future. Extended
abstract in Simulated
Evolution and Learning, McKay, B., Yao, X.,
Newton, C. S., Kim, J.-H., Furuhashi, T., Eds. Published
as Lecture Notes in
Computer Science 1585, pp. 195–197, Springer,
1999.
Sutton, R.S., Barto, A.G. (1998). Reinforcement Learning: An
Introduction. MIT Press.
Sutton, R.S., Precup, D., Singh,
S. (1998). Intra-option
learning about temporally abstract actions. Proceedings
of the 15th International Conference on Machine
Learning, pp. 556-564. Morgan Kaufmann.
Precup, D., Sutton, R.S.
(1998) Multi-time
models for temporally abstract planning. Djvu
version. Advances in Neural Information
Processing Systems 10, MIT Press.
McGovern, A., and Sutton, R.S.
(1998).
Macro-actions in reinforcement learning: An empirical
analysis. Technical Report 98-70, University of
Massachusetts, Department of Computer Science.
Precup, D., Sutton, R.S., Singh,
S. (1998). Theoretical
results on reinforcement learning with temporally
abstract options. In Machine Learning: ECML-98,
Proceedings of the 10th European Conference on Machine
Learning, Chemnitz, Germany, pp. 382-393. Springer
Verlag.
Santamaria, J.C., Sutton, R.S.,
Ram, A. (1998). Experiments
with reinforcement learning in problems with continuous
state and action spaces, Adaptive Behavior 6(2):
163-218. postscript. An
earlier version appeared as
Technical Report UM-CS-1996-088, Department of
Computer Science, University of Massachusetts, Amherst, MA
01003. The source code for all the experiments is also
available here.
McGovern, A., Precup, D.,
Ravindran, B., Singh, S., Sutton, R.S. (1998). Hierarchical Optimal Control
of MDPs. Proceedings of the Tenth Yale Workshop
on Adaptive and Learning Systems, pp. 186-191.
Barto, A. G., Sutton, R. S., Reinforcement learning
in artificial intelligence. In Neural Network Models
of Cognition, Donahoe, J. W., Packard Dorsel, V., Eds., pp.
358–386, Elsevier, 1997.
This chapter provides an overview of an approach
to the study of learning that, in broad terms, has developed
as a part of the field of Artificial Intelligence (AI),
where it is called reinforcement learning due to its
roots in reinforcement theories of animal learning. We
introduce the field from the perspective of AI and
engineering, describing some of its key features, providing
a formal model of the reinforcement-learning problem, and
defining basic concepts that are exploited by solution
methods. Detailed discussion of solution methods themselves
and their history are very broad topics that we do not
attempt to cover here.
Sutton, R.S. (1997). On the significance of Markov
decision processes. In W. Gerstner, A. Germond, M.
Hasler, and J.-D. Nicoud (Eds.) Artificial Neural
Networks -- ICANN'97, pp. 273-282. Springer.
Precup, D., Sutton, R.S.
(1997). Multi-time
models for reinforcement learning. Proceedings
of the ICML'97 Workshop on Modelling in Reinforcement
Learning.
Precup, D., Sutton, R.S. (1997). Exponentiated gradient
methods for reinforcement learning. Proceedings
of the 14th International Conference on Machine Learning,
pp. 272-277, Morgan Kaufmann.
McGovern, A., Sutton,
R.S., Fagg, A.H. (1997). Roles
of macro-actions in accelerating reinforcement learning.
Proceedings of the 1997 Grace Hopper Celebration of
Women in Computing, pp. 13-17.
Precup, D., Sutton, R.S., Singh, S.P.
(1997). Planning with
closed-loop macro actions. Working notes of the
1997 AAAI Fall Symposium on Model-directed Autonomous
Systems, pp. 70-76.
Mehra, R.K., Ravichandran, B., Cabrera, J.B.D., Greve, D.N.,
Sutton, R.S. (1997). Towards self-learning
adaptive scheduling for ATM networks,
Proceedings of the 36th Conference on Decision and Control,
pp. 2393-2398, San Diego, California USA.
Sutton, R.S. (1996).
Generalization in reinforcement learning: Successful
examples using sparse coarse coding. Camera ready postscript.
Digitally remastered pdf.
Djvu
version. Advances in Neural Information
Processing Systems 8 (Proceedings of the 1995
conference), pp. 1038-1044, MIT Press. Errata: there is a
small error in the equations of motion of the acrobot; for
the correct equations, see the
RL textbook page on the acrobot (thanks to Barry
Nichols for spotting this).
Singh, S.P., Sutton, R.S.
(1996). Reinforcement
learning with replacing eligibility traces. Machine
Learning 22: 123-158.
Precup, D., Sutton, R.S. (1996). Empirical comparison of
gradient descent and exponentiated gradient descent in
supervised and reinforcement learning. Technical
Report UM-CS-1996-070, Department of Computer Science,
University of Massachusetts, Amherst, MA 01003.
Kuvayev, L., Sutton, R.S. (1996).
Model-based
reinforcement learning with an approximate, learned
model. Proceedings of the Ninth Yale Workshop
on Adaptive and Learning Systems, pp. 101-105, Yale
University, New Haven, CT. Some additional results are in
this earlier version
of the same paper.
Mehra, R.K., Ravichandran, B., Sutton, R.S. (1996). Adaptive intelligent scheduling
for ATM networks. Proceedings of the Ninth Yale
Workshop on Adaptive and Learning Systems, pp.
106-111, Yale University, New Haven, CT.
Sutton, R.S. (1995). On
the
Virtues of Linear Learning and Trajectory
Distributions. Proceedings
of the Workshop on Value Function Approximation,
Machine Learning Conference.
Sutton, R.S. (1995). TD models: Modeling the
world at a mixture of time scales. Proceedings
of the Twelfth International Conference on Machine
Learning, pp. 531-539, Morgan Kaufmann.
Sutton, R.S., Singh,
S.P. (1994). On bias
and step size in temporal-difference learning. Proceedings
of the Eighth Yale Workshop on Adaptive and Learning
Systems, pp. 91-96, Yale University, New Haven, CT.
Sutton, R.S., Whitehead, S.D.
(1993). Online
learning with random representations. Proceedings
of the Tenth International Conference on Machine
Learning, pp. 314-321, Morgan Kaufmann.
Sutton, R.S. (1992). Adapting bias by gradient
descent: An incremental version of delta-bar-delta.
Proceedings of the Tenth National Conference on
Artificial Intelligence, pp. 171-176, MIT Press. Associated talk from 2/2/04.
Sutton, R.S. (1992). Gain adaptation
beats least squares? Proceedings of the Seventh
Yale Workshop on Adaptive and Learning Systems, pp.
161-166, Yale University, New Haven, CT.
Sutton, R.S. (1992). Machines that
Learn and Mimic the Brain. In ACCESS, GTE's
Journal of Science and Technology, 1992. Reprinted
in Stethoscope Quarterly, Spring.
Sutton, R.S. (1992). Reinforcement
learning architectures. Proceedings ISKIT'92
International Symposium on Neural Information Processing,
Fukuoka, Japan.
Sutton, R.S. (Ed.) (1992).
Reinforcement Learning, book version of a special
issue of Machine Learning on reinforcement
learning (Volume 8, Numbers 3/4). Kluwer. Introduction: The
challenge of reinforcement learning.
Sanger, T.D., Sutton, R.S., Matheus,
C.J. (1992). Iterative
construction of sparse polynomial approximations. Advances
in Neural Information Processing Systems 4, pp.
1064-1071, Morgan Kaufmann.
Gluck, M., Glauthier, P., Sutton, R.S.
(1992). Adaptation of
cue-specific learning rates in network models of human
category learning, Proceedings of the
Fourteenth Annual Conference of the Cognitive Science
Society, pp. 540-545, Erlbaum.
Sutton, R.S. (1991).
Planning by incremental dynamic programming. Proceedings
of the Eighth International Workshop on Machine
Learning, pp. 353-357, Morgan Kaufmann.
Sutton, R.S. (1991). Dyna, an integrated
architecture for learning, planning and reacting. Working
Notes of the 1991 AAAI Spring Symposium on Integrated
Intelligent Architectures and SIGART Bulletin
2, pp. 160-163.
Sutton, R.S. (1991). Integrated modeling and
control based on reinforcement learning and dynamic
programming. In D. S. Touretzky (ed.),
Advances in Neural Information Processing Systems 3,
pages 471-478.
Sutton, R.S. (1991). Reinforcement learning
architectures for animats, Proceedings of the
First International Conference on Simulation of Adaptive
Behavior: From Animals to Animats, pp.
288-296. smaller
gzipped version.
Sutton, R.S., Matheus, C.J. (1991). Learning polynomial
functions by feature construction. Proceedings of
the Eighth International Workshop on Machine Learning,
pp. 208-212, Morgan Kaufmann.
Sutton,
R.S., Barto, A.G., Williams, R. (1991). Reinforcement
learning is direct adaptive optimal control,
Proceedings of the American Control Conference,
pages 2143-2146. Also published in IEEE Control
Systems Magazine 12, No. 2, 19-22.
Miller, W. T.,
Sutton, R. S., Werbos, P. J. (Eds.), Neural Networks for
Control. MIT Press, 1991.
Sutton, R.S. (1990). Integrated architectures for
learning, planning, and reacting based on approximating
dynamic programming. Proceedings of the Seventh
International Conference on Machine Learning, pp.
216-224, Morgan Kaufmann. Also appeared as "Artificial
intelligence by dynamic programming," in Proceedings
of the Sixth Yale Workshop on Adaptive and Learning
Systems, pp. 89-95. smaller gzipped postscript
version.
Sutton, R.S. (1990). First results
with Dyna, an integrated architecture for learning,
planning, and reacting. In Neural Networks for
Control, Miller, T., Sutton, R.S., & Werbos,
P., Eds., MIT Press.
Sutton,
R.S.,
Barto, A.G. (1990).
Time-derivative models of pavlovian reinforcement.
In Learning and Computational Neuroscience:
Foundations of Adaptive Networks, M. Gabriel and J.
Moore, Eds., pp. 497-537. MIT Press.
Barto, A.G., Sutton, R.S., Watkins,
C.J.C.H. (1990). Learning
and sequential decision making. In Learning and
Computational Neuroscience: Foundations of Adaptive
Networks, M. Gabriel and J.W. Moore, Eds., pp.
539-602, MIT Press. The following abstract is from the
technical report version of this paper.
Barto, A.G., Sutton, R.S., &
Watkins, C. (1990). Sequential
decision problems and neural networks. In D. S.
Touretzky (ed.), Advances in Neural Information
Processing Systems 2, pp. 686-693.
Whitehead, S., Sutton, R.S., & Ballard, D. (1990). Advances in reinforcement
learning and their implications for intelligent control,
Proceedings of the Fifth IEEE International Symposium
on Intelligent Control 1990, pp. 1289-1297.
Franklin, J., Sutton, R.S., Anderson, C., Selfridge, O.,
& Schwartz, D. (1990). Connectionist
learning control at GTE laboratories,
Intelligent Control and Adaptive Systems, G.
Rodriguez, Ed., Proc. SPIE 1196, pp. 242-253.
Anderson, C., Franklin, J., & Sutton, R.S. (1990). Learning a nonlinear model of a
manufacturing process using multilayer connectionist
networks, Proceedings of the Fifth IEEE
International Symposium on Intelligent Control 1990,
pp. 404-409.
Sutton, R.S. (1989). Artificial
intelligence as a control problem: Comments on the
relationship between machine learning and intelligent
control. Appeared in Machine Learning in a
dynamic world. Proceedings of the IEEE
International Symposium on Intelligent Control 1988,
pp. 500-507.
Sutton, R.S. (1989). Implementation details of
the TD(lambda) procedure for the case of vector
predictions and backpropagation. GTE Laboratories
Technical Report TR87-509.1, as corrected August 1989. GTE
Laboratories, 40 Sylvan Road, Waltham, MA 02254.
Franklin, J., Sutton, R.S., & Anderson, C. (1989). Application of connectionist
learning methods to manufacturing process monitoring,
Proceedings of the IEEE International Symposium on
Intelligent Control 1988, pp. 709-712.
Sutton, R.S. (1988). Learning to
predict by the methods of temporal differences. Machine
Learning 3: 9-44, erratum p. 377. Scan of paper
as published, with erratum added. Digitally remastered with
missing figure in place.
Sutton, R.S. (1988). Convergence theory for
a new kind of prediction learning, Proceedings
of the 1988 Workshop on Computational Learning Theory,
pp. 421-42.
Sutton, R.S. (1988). NADALINE: A
normalized adaptive linear element that learns
efficiently. GTE Laboratories Technical Report
TR88-509.4. GTE Laboratories, 40 Sylvan Road, Waltham, MA
02254.
Selfridge, O., Sutton, R.S., & Anderson, C. (1988). Selected bibliography on
connectionism, In: Evolution, Learning, and
Cognition, Y.C. Lee (Ed.), pp. 391-403, World
Scientific Publishing.
Sutton, R.S., & Barto,
A.G. (1987). A
temporal-difference model of classical conditioning,
Proceedings of the Ninth Annual Conference of the
Cognitive Science Society, pp. 355-378.
Sutton, R.S.
(1986). Two problems with
backpropagation and other steepest-descent learning
procedures for networks, Proceedings of the
Eighth Annual Conference of the Cognitive Science
Society, pp. 823-831. Reprinted in Artificial
Neural Networks: Concepts and Theory, edited by P.
Mehra and B. Wah, IEEE Computer Society Press, 1992.
smaller gzipped version.
Moore, J., Desmond, J., Berthier,
N., Blazis, D., Sutton, R.S., & Barto, A.G. (1986). Simulation of the classically
conditioned nictitating membrane response by a neuron-like
adaptive element: Response topography, neuronal firing,
and interstimulus intervals, Behavioural Brain
Research 21: 143-154.
Sutton, R.S. (1985). Learning
distributed, searchable, internal models,
Proceedings of the Distributed Artificial Intelligence
Workshop, pp. 287-289.
Sutton, R.S., & Pinette, B.
(1985). The
learning of world models by connectionist networks,
Proceedings of the Seventh Annual Conference of the
Cognitive Science Society, pp. 54-64.
Barto, A.G. & Sutton, R.S. (1985). Neural problem solving.
In: Synaptic Modification, Neuron Selectivity, and
Nervous System Organization, W.B. Levy & J.A.
Anderson (Eds.), pp. 123-152. Lawrence Erlbaum.
Selfridge, O., Sutton,
R.S., & Barto, A.G. (1985). Training and tracking
in robotics, Proceedings of the Ninth
International Joint Conference on Artificial
Intelligence, pp. 670-672.
Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton,
R.S., & Barto, A.G. (1985). Connectionist
learning in real time: Sutton-Barto adaptive element and
classical conditioning of the nictitating membrane
response, Proceedings of the Seventh Annual
Conference of the Cognitive Science Society, pp.
318-322.
Sutton, R.S. (1984). Temporal credit
assignment in reinforcement learning (106
Mbytes). Ph.D. dissertation, Department of Computer
Science, University of Massachusetts, Amherst, MA 01003.
Published as COINS Technical Report 84-2.
Barto, A.G., Sutton, R.S.,
& Anderson, C. (1983). Neuron-like adaptive elements
that can solve difficult learning control problems,
IEEE Transactions on Systems, Man, and Cybernetics,
SMC-13: 834-846. 12.4 Mb pdf,
3.5 Mb gzipped
Sutton, R.S. (1982 -
unpublished draft). A
theory of salience change dependent on the relationship
between discrepancies on successive trials on which the
stimulus is present. smaller gzipped version.
Barto, A.G. & Sutton, R.S.
(1982). Simulation of
anticipatory responses in classical conditioning by a
neuron-like adaptive element, Behavioral Brain
Research 4:221-235.
Barto, A.G., Anderson,
C., & Sutton, R.S. (1982). Synthesis of nonlinear control
surfaces by a layered associative network,
Biological Cybernetics 43:175-185.
Barto, A.G., Sutton, R.S., & Anderson, C. (1982). Spatial learning simulation
systems, Proceedings of the 10th IMACS World
Congress on Systems Simulation and Scientific
Computation, pp. 204-206.
Sutton, R.S. (1981). Adaptation of learning rate
parameters. In: Goal Seeking Components for Adaptive
Intelligence: An Initial Assessment, by A. G.
Barto and R. S. Sutton. Air Force Wright Aeronautical
Laboratories Technical Report AFWAL-TR-81-1070.
Wright-Patterson Air Force Base, Ohio 45433.
Sutton, R.S., &
Barto, A.G. (1981). Toward a
modern theory of adaptive networks: Expectation and
prediction, Psychological Review 88:135-140.
Translated into Spanish by G. Ruiz to appear in the
journal Estudios de Psicologia.
Sutton, R.S.,
& Barto, A.G. (1981). An adaptive
network that constructs and uses an internal model of
its world, Cognition and Brain Theory 4:217-246.
Barto, A.G., Sutton, R.S.
(1981). Goal seeking
components for adaptive intelligence: An initial
assessment. Air Force Wright Aeronautical
Laboratories Technical Report AFWAL-TR-81-1070.
Wright-Patterson Air Force Base, Ohio 45433. 524
pages. (Appendix C is
available separately.)
Barto, A.G., Sutton,
R.S., & Brouwer, P. (1981). Associative search network: A
reinforcement learning associative memory,
Biological Cybernetics 40:201-211.
Barto, A.G. & Sutton,
R.S. (1981). Landmark
learning: An illustration of associative search,
Biological Cybernetics 42:1-8.
Sutton, R.S. (1978). Single channel theory: A
neuronal theory of learning, Brain Theory
Newsletter 3, No. 3/4, pp. 72-75. (earliest
publication)
Sutton, R.S.
(1978 - unpublished). A unified theory of
expectation in classical and instrumental conditioning.
Bachelors thesis, Stanford University.
Sutton, R.S. (1978 - unpublished). Learning theory
support for a single channel theory of the brain.