First, a quick guide to some of the highlights, roughly in order of the work's popularity or potential current interest:

- The 2nd edition of Reinforcement Learning: An Introduction
- Nature article on Loss of Plasticity and Continual Backprop
- The Alberta Plan for AI Research
- The STOMP progression
- SwiftTD
- Reward centering
- The common model of the intelligent agent
- True online TD(λ)
- The predictive approach to knowledge representation; PEAK; Horde; nexting
- Temporal-difference learning; TD(λ) details
- The TD model of Pavlovian conditioning and its evaluation

- Dyna; with FA 1996, 2008
- The options paper; UAV example; precursor not superseded
- Policy gradient methods; Incremental Natural Actor-Critic Algorithms
- Rich's PhD thesis, which introduced actor-critic architectures and "temporal credit assignment"
- PSRs; the predictive representations hypothesis; TD networks; with options
- IDBD; improved; earliest pub; in classical conditioning; step-size optimization

- Random representations; representation search; feature discovery; more
- Pole-balancing; tracking nonstationarity
- A study in alpha and lambda

- Two problems with backprop

- Chris Watkins's thesis
- Boyan's LSTD(λ), 1999

- Bradtke and Barto LSTD, 1996
- Williams, 1992
- Lin, 1992

- Ross, 1983, Chapter 2
- Minsky, 1960, Steps to AI
- Good, 1965, Speculations concerning the first ultraintelligent machine
- Selfridge, 1958, Pandemonium
- Samuel, 1959
- Dayan, 1992
- Tesauro, 1992, TD-Gammon
- Watkins and Dayan, 1992
- Gurvitz, Lin, and Hanson, 1995
- An, Miller, and Parks (1991)
- Intro to Andreae (2017) and Andreae (2017)

- Doina Precup, 2000
- David Silver, 2009, Reinforcement Learning and Simulation-Based Search in Computer Go
- Hamid Maei, 2011, Gradient Temporal-Difference Learning Algorithms
- Adam White, 2015, Developing a predictive approach to knowledge
- Rupam Mahmood, 2017, Incremental Off-policy Reinforcement Learning Algorithms
- Sina Ghiassian, 2022, Online Off-policy Prediction
- Katya Kudashkina, 2022, Model-based Reinforcement Learning Methods for Developing Intelligent Assistants
- Yi Wan, 2023, Learning and Planning with the Average-Reward Formulation
- Alexandra Kearney, 2023, Letting the Agent Take the Wheel: Principles for Constructive and Predictive Knowledge
- Banafsheh Rafiee, 2024, State Construction in Reinforcement Learning
- Kenny Young, 2024, Leveraging Generic Problem Structure for Efficient Reinforcement Learning
- Abhishek Naik, 2024, Reinforcement Learning for Continuing Problems using Average Reward
- Kris De Asis, 2024, Explorations in the Foundations of Value-based Reinforcement Learning

- Brian Tanner, 2005, Temporal-Difference Networks
- Adam White, 2006, A Standard Benchmarking System for Reinforcement Learning
- Eddie Rafols, 2006, Temporal Abstraction in Temporal-difference Networks
- Cosmin Paduraru, 2008, Planning with Approximate and Learned Models of Markov Decision Processes
- Anna Koop, 2008, Investigating Experience: Temporal Coherence and Empirical Knowledge Representation
- Masoud Shahamiri, 2008, Reinforcement Learning in Environments with Independent Delayed-Sense Dynamics
- Rupam Mahmood, 2010, Automatic Step-size Adaptation in Incremental Supervised Learning
- Mike Delp, 2011, Experiments in Off-Policy Reinforcement Learning with the GQ(λ) Algorithm
- MahdiehSadat Mirian HosseinAbadi, 2012, A Computational Model of Learning from Replayed Experience in Spatial Navigation
- Leah Hackman, 2012, Faster Gradient-TD Algorithms
- Kavosh Asadi, 2015, Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning
- Travis Dick, 2015, Policy Gradient Reinforcement Learning Without Regret
- Vivek Veeriah, 2017, Beyond Clever Hans: Learning From People Without Their Really Trying
- Shangtong Zhang, 2018, Learning with Artificial Neural Networks
- Banafsheh Rafiee, 2018, Predictive Knowledge in Robots: An Empirical Comparison of Learning Algorithms
- Kris De Asis, 2018, A Unified View of Multi-step Temporal Difference Learning
- Tian Tian, 2018, Extending the Sliding-step Technique of Stochastic Gradient Descent to Temporal Difference Learning
- Kenny Young, 2019, Learning What to Remember: Strategies for Selective External Memory in Online Reinforcement Learning Agents
- Juan Fernando Hernandez Garcia, 2019, Unifying n-Step Temporal-Difference Action-Value Methods
- Chen Ma, 2020, An Exploration of Predictive Representations of State
- Shibhansh Dohare, 2020, The Interplay of Search and Gradient Descent in Semi-stationary Learning Problems
- Dylan Ashley, 2020, Understanding Forgetting in Artificial Neural Networks
- Brendan Bennett, 2020, Estimating Variance of Returns using Temporal Difference Methods
- Parash Rahman, 2020, Toward Generate-and-Test Algorithms for Continual Feature Discovery
- Jingjiao Ni, 2020, Toward Emphatic Reinforcement Learning
- Cam Linke, 2020, Adapting Behavior via Intrinsic Reward
- Valliappa Chockalingam, 2020, Interrelating Prediction and Control Objectives in Episodic Actor-Critic Reinforcement Learning
- Amir Samani, 2022, Learning Agent State Online with Recurrent Generate-and-Test

For any broken links, please send email to rich@richsutton.com.

Degris, T., Javed, K., Sharifnassab, A., Liu, Y., Sutton, R.S. (2024). Step-size Optimization for Continual Learning. ArXiv 2401.17401.

ABSTRACT: In continual learning, a learner has to keep learning from the data over its whole life time. A key issue is to decide what knowledge to keep and what knowledge to let go. In a neural network, this can be implemented by using a step-size vector to scale how much gradient samples change network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. On the other hand, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD is able to consistently improve step-size vectors, where RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction to improve the performance of neural networks in continual learning.
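
For concreteness, here is a minimal sketch of the IDBD update for linear regression, following Sutton (1992); the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def idbd_update(w, x, y, beta, h, meta_rate=0.01):
    """One IDBD step for linear regression (a sketch after Sutton, 1992)."""
    delta = y - np.dot(w, x)            # prediction error
    beta += meta_rate * delta * x * h   # meta-gradient step on log step-sizes
    alpha = np.exp(beta)                # per-weight step-sizes
    w += alpha * delta * x              # LMS step with per-weight rates
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, beta, h
```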

Dohare, S., Hernandez-Garcia, J.F., Lan, Q., Rahman, P., Sutton, R. S., Mahmood, A.R. (2024). Loss of Plasticity in Deep Continual Learning. Nature 632:768-774.

ABSTRACT: Artificial neural networks, deep-learning methods and the backpropagation algorithm form the foundation of modern machine learning and artificial intelligence. These methods are almost always used in two phases, one in which the weights of the network are updated and one in which the weights are held constant while the network is used or evaluated. This contrasts with natural learning and many applications, which require continual learning. It has been unclear whether or not deep learning methods work in continual learning settings. Here we show that they do not—that standard deep-learning methods gradually lose plasticity in continual-learning settings until they learn no better than a shallow network. We show such loss of plasticity using the classic ImageNet dataset and reinforcement-learning problems across a wide range of variations in the network and the learning algorithm. Plasticity is maintained indefinitely only by algorithms that continually inject diversity into the network, such as our continual backpropagation algorithm, a variation of backpropagation in which a small fraction of less-used units are continually and randomly reinitialized. Our results indicate that methods based on gradient descent are not enough—that sustained deep learning requires a random, non-gradient component to maintain variability and plasticity.

Javed, K., Sharifnassab, A., Sutton, R. S. (2024). SwiftTD: A Fast and Robust Algorithm for Temporal Difference Learning. Reinforcement Learning Journal 1(1). Presented at the Reinforcement Learning Conference (RLC), Amherst Massachusetts, August 9–12, 2024.

ABSTRACT: Learning to make temporal predictions is a key component of reinforcement learning algorithms. The dominant paradigm for learning predictions from an online stream of data is Temporal Difference (TD) learning. In this work we introduce a new TD algorithm---SwiftTD---that learns more accurate predictions than existing algorithms. SwiftTD combines True Online TD(λ) with per-feature step-size parameters, step-size optimization, a bound on the rate of learning, and step-size decay. Per-feature step-size parameters and step-size optimization improve credit assignment by increasing step-size parameters of important signals and reducing them for irrelevant signals. The bound on the rate of learning prevents overcorrections. Step-size decay reduces step-size parameters if they are too large. We benchmark SwiftTD on the Atari Prediction Benchmark and show that even with linear function approximation it can learn accurate predictions. We further show that SwiftTD can be combined with neural networks to improve their performance. Finally, we show that all three ideas---step-size optimization, the bound on the rate of learning, and step-size decay---contribute to the strong performance of SwiftTD.
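
SwiftTD's full update rules are given in the paper; the sketch below illustrates only one of its ingredients, the bound on the rate of learning, grafted onto plain linear TD(0). The parameter names (eta in particular) are our own illustrative choices.

```python
import numpy as np

def bounded_td0_update(w, alpha, x, x_next, r, gamma=0.99, eta=1.0):
    """Linear TD(0) with per-feature step-sizes and a bound on the
    rate of learning (one ingredient of SwiftTD; illustrative only)."""
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)
    rate = np.dot(alpha * x, x)    # how far this step would move the prediction
    if rate > eta:                 # would overcorrect: scale step-sizes down
        alpha = alpha * (eta / rate)
    w += alpha * delta * x
    return w, alpha
```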

Naik, A., Wan, Y., Tomar, M., Sutton, R. S. (2024). Reward centering. Reinforcement Learning Journal 1(1). Presented at the Reinforcement Learning Conference (RLC), Amherst Massachusetts, August 9–12, 2024.

ABSTRACT: We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering.
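
A minimal sketch of the simple on-policy variant, assuming linear TD(0); the running-average update and all names are illustrative.

```python
import numpy as np

def centered_td0_update(w, r_bar, x, x_next, r, alpha=0.1, eta=0.01, gamma=0.99):
    """Linear TD(0) with simple reward centering: subtract a running
    estimate of the average reward before discounting (sketch)."""
    delta = (r - r_bar) + gamma * np.dot(w, x_next) - np.dot(w, x)
    w += alpha * delta * x
    r_bar += eta * (r - r_bar)    # empirical average of the rewards
    return w, r_bar
```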

De Asis, K., Sutton, R. S. (2024). An Idiosyncrasy of Time-discretization in Reinforcement Learning. Reinforcement Learning Journal 1(1). Presented at the Reinforcement Learning Conference (RLC), Amherst Massachusetts, August 9–12, 2024.

ABSTRACT: Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
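
A small sketch of the alignment issue, assuming rewards are rates and steps can have variable durations; this reflects the general idea rather than the paper's exact construction.

```python
def discretized_return(rewards, durations, gamma_per_sec=0.99):
    """Discrete-time return that respects a continuous-time discount rate.
    rewards[i] is the reward *rate* over step i; durations[i] is its length
    in seconds. Discounting by gamma**dt (rather than by a fixed per-step
    gamma) keeps the return aligned with its continuous-time counterpart.
    Illustrative sketch only."""
    g, discount = 0.0, 1.0
    for r, dt in zip(rewards, durations):
        g += discount * r * dt           # reward integrated over the step
        discount *= gamma_per_sec ** dt  # discount compounds with elapsed time
    return g
```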

Dohare, S., Hernandez-Garcia, J.F., Rahman, P., Sutton, R. S., Mahmood, A.R. (2023). Maintaining Plasticity in Deep Continual Learning. arXiv:2210.14361.

ABSTRACT: Modern deep-learning systems are specialized to problem settings in which training occurs once and then never again, as opposed to continual-learning settings in which training occurs continually. If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples. More fundamental, but less well known, is that they may also lose their ability to learn on new examples, a phenomenon called loss of plasticity. We provide direct demonstrations of loss of plasticity using the MNIST and ImageNet datasets repurposed for continual learning as sequences of tasks. In ImageNet, binary classification performance dropped from 89% accuracy on an early task down to 77%, about the level of a linear network, on the 2000th task. Loss of plasticity occurred with a wide range of deep network architectures, optimizers, activation functions, batch normalization, dropout, but was substantially eased by L2-regularization, particularly when combined with weight perturbation. Further, we introduce a new algorithm -- continual backpropagation -- which slightly modifies conventional backpropagation to reinitialize a small fraction of less-used units after each example and appears to maintain plasticity indefinitely.
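
A simplified sketch of the maintenance step that continual backpropagation adds after each ordinary backprop update; the paper's utility measure, maturity threshold, and replacement schedule are more refined than shown here.

```python
import numpy as np

def reinit_least_used(W_in, W_out, utility, replace_fraction=0.001,
                      init_scale=0.1, rng=np.random.default_rng()):
    """Reinitialize the least-used hidden units (sketch of continual
    backprop's generate-and-test step). utility is a running estimate of
    each hidden unit's usefulness; W_in is hidden-by-input, W_out is
    output-by-hidden."""
    n_replace = max(1, int(replace_fraction * W_in.shape[0]))
    worst = np.argsort(utility)[:n_replace]   # lowest-utility units
    W_in[worst] = init_scale * rng.standard_normal((n_replace, W_in.shape[1]))
    W_out[:, worst] = 0.0                     # new units start with no effect
    utility[worst] = np.median(utility)       # give fresh units a grace period
    return W_in, W_out, utility
```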

Rafiee, B., Ghiassian, S., Jin, J., Sutton, R., Luo, J., & White, A. (2023). Auxiliary task discovery through generate-and-test. In

ABSTRACT: In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning. Auxiliary tasks tend to improve data efficiency by forcing the agent to learn auxiliary prediction and control objectives in addition to the main task of maximizing reward, and thus producing better representations. Typically these tasks are designed by people. Meta-learning offers a promising avenue for automatic task discovery; however, these methods are computationally expensive and challenging to tune in practice. In this paper, we explore a complementary approach to auxiliary task discovery: continually generating new auxiliary tasks and preserving only those with high utility. We also introduce a new measure of auxiliary tasks' usefulness based on how useful the features induced by them are for the main task. Our discovery algorithm significantly outperforms random tasks, hand-designed tasks, and learning without auxiliary tasks across a suite of environments.

De Asis, K., Graves, E., & Sutton, R. S. (2023). Value-aware Importance Weighting for Off-policy Reinforcement Learning. In

ABSTRACT: Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance weights to correct samples in off-policy learning. We propose the use of value-aware importance weights, which take into account the sample space to provide lower variance, but still unbiased, estimates under a target distribution. We derive how such weights can be computed, and detail key properties of the resulting importance weights. We then extend several reinforcement learning prediction algorithms to the off-policy setting with these weights, and evaluate them empirically.

Sharifnassab, A., & Sutton, R. S. (2023). Toward Efficient Gradient-Based Value Estimation. In

ABSTRACT: Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has a large condition number. To resolve the adverse effect of poor conditioning of MSBE on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.

Mathewson, K. W., Parker, A. S., Sherstan, C., Edwards, A. L., Sutton, R. S., & Pilarski, P. M. (2023). Communicative capital: A key resource for human-machine shared agency and collaborative capacity.

ABSTRACT: In this work, we present a perspective on the role machine intelligence can play in supporting human abilities. In particular, we consider research in rehabilitation technologies such as prosthetic devices, as this domain requires tight coupling between human and machine. Taking an agent-based view of such devices, we propose that human–machine collaborations have a capacity to perform tasks which is a result of the combined agency of the human and the machine. We introduce communicative capital as a resource developed by a human and a machine working together in ongoing interactions. Development of this resource enables the partnership to eventually perform tasks at a capacity greater than either individual could achieve alone. We then examine the benefits and challenges of increasing the agency of prostheses by surveying literature which demonstrates that building communicative resources enables more complex, task-directed interactions. The viewpoint developed in this article extends current thinking on how best to support the functional use of increasingly complex prostheses, and establishes insight toward creating more fruitful interactions between humans and supportive, assistive, and augmentative technologies.

Javed, K., Shah, H., Sutton, R. S., & White, M. (2023). Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks.

ABSTRACT: Constructing states from sequences of observations is an important component of reinforcement learning agents. One solution for state construction is to use recurrent neural networks. Back-propagation through time (BPTT), and real-time recurrent learning (RTRL) are two popular gradient-based methods for recurrent learning. BPTT requires complete trajectories of observations before it can compute the gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the network for computationally efficient learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.

Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers, F., Tanner, B., & White, A. (2023). Reward-respecting subtasks for model-based reinforcement learning.

ABSTRACT: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option- critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.

Wan, Y., Sutton, R. S. (2022). Toward discovering options that achieve faster planning. ArXiv:2205.12515.

ABSTRACT: We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. For a given set of episodic tasks and a given number of options, the objective prefers options that can be used to achieve a high return by composing few options. By composing few options, fast planning can be achieved. When faced with new tasks similar to the given ones, the discovered options are also expected to accelerate planning. Our objective extends the objective proposed by Harb et al. (2018) for the single-task setting to the multi-task setting. A closer look at Harb et al.'s objective shows that the best options discovered given one task are not likely to be useful for future unseen tasks and that the multi-task setting is indeed necessary for this purpose. In the same paper, Harb et al. also proposed an algorithm to optimize their objective, and the algorithm can be naturally extended to the multi-task setting. We empirically show that in the four-room domain the extension does not achieve a high objective value and propose a new algorithm that better optimizes the proposed objective. In the same four-room domain, we show that 1) a higher objective value is typically associated with options with which fewer planning iterations are needed to achieve near-optimal performance, 2) our new algorithm achieves a high objective value, which is close to the value achieved by a set of human-designed options, 3) the best number of planning iterations given the discovered options is much smaller and matches that obtained with human-designed options, and 4) the options produced by our algorithm also make intuitive sense because they move to and terminate at cells near hallways connecting two neighboring rooms.

Tian, T., Young, K., & Sutton, R. S. (2022). Doubly-asynchronous value iteration: Making value iteration asynchronous in actions. In

ABSTRACT: Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in-place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action space. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains similarly appealing theoretical properties to VI without the need to wait for a full sweep through the entire action space in each update. In this paper, we show DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1-delta, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.
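
A sketch of the core idea, assuming a known tabular model; the paper's exact update and the conditions of its convergence results include details omitted here.

```python
import numpy as np

def davi_update(V, s, P, R, gamma=0.95, m=5, rng=np.random.default_rng()):
    """One doubly-asynchronous value-iteration backup (sketch): update a
    single state in place, maximizing over a random subset of m actions
    instead of the whole action set. P[s][a] is a list of
    (probability, next_state) pairs and R[s][a] is the expected reward."""
    n_actions = len(P[s])
    sampled = rng.choice(n_actions, size=min(m, n_actions), replace=False)
    V[s] = max(R[s][a] + gamma * sum(p * V[sn] for p, sn in P[s][a])
               for a in sampled)
```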

Sutton, R. S. (2022). A history of meta-gradient: Gradient Methods for meta-learning. ArXiv:2202.09701.

ABSTRACT: Systems for learning parameters often have meta-parameters such as step sizes, initial weights, or dimensional weightings. With a given setting of the meta-parameters, the learning system is complete and capable of finding parameters that are suited to the task, but its efficiency typically depends on the particular choice of meta-parameters. This has led to interest in learning processes that can find good choices for meta-parameters automatically from experience. These higher-level learning methods are often characterized as "learning to learn" or, as we shall call them here, meta-learning. In this article we briefly review the history of meta-learning.

Sutton, R. S., Bowling, M., Pilarski, P. M. (2022). The Alberta plan for AI research. ArXiv:2208.11173.

ABSTRACT: Herein we describe our approach to artificial intelligence (AI) research, which we call the Alberta Plan. The Alberta Plan is pursued within our research groups in Alberta and by others who are like minded throughout the world. We welcome all who would join us in this pursuit.

Sutton, R. S. (2022). The quest for a common model of the intelligent agent. In

ABSTRACT: The premise of the Multi-disciplinary Conference on Reinforcement Learning and Decision Making is that multiple disciplines share an interest in goal-directed decision making over time. The idea of this paper is to sharpen and deepen this premise by proposing a perspective on the decision maker that is substantive and widely held across psychology, artificial intelligence, economics, control theory, and neuroscience, which I call the common model of the intelligent agent. The common model does not include anything specific to any organism, world, or application domain. The common model does include aspects of the decision maker’s interaction with its world (there must be input and output, and a goal) and internal components of the decision maker (for perception, decision-making, internal evaluation, and a world model). I identify these aspects and components, note that they are given different names in different disciplines but refer essentially to the same ideas, and discuss the challenges and benefits of devising a neutral terminology that can be used across disciplines. It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.

Javed, K., Shah, H., Sutton, R. S., “Scalable Online State Construction using Recurrent Networks.” In

ABSTRACT: State construction from sensory observations is an important component of a reinforcement learning agent. One solution for state construction is to use recurrent neural networks. Two popular gradient-based methods for recurrent learning are back-propagation through time (BPTT), and real-time recurrent learning (RTRL). BPTT looks at the complete sequence of observations before computing gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning a recurrent network incrementally, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add bias or noise to the gradient estimate. Instead, they trade off the functional capacity of the recurrent network to achieve scalable learning. We demonstrate the effectiveness of our approach on a prediction learning benchmark inspired by animal learning.

Wan, Y., Sutton, R. S., “Discovering Options by Minimizing the Number of Composed Options to Solve Multiple Tasks.” In

ABSTRACT: We propose a new metric for discovering options in the tabular setting: a set of options is better than another set of the same number of options if, for a given set of episodic tasks, on average, optimal solutions to the tasks can be obtained by composing fewer options. By composing fewer options to obtain solutions, planning requires less computation and can be faster. We propose a new objective function and prove that the set of options that minimizes the proposed objective results in the least number of composed options on average in the tabular setting. In the function approximation setting, where generally the optimal solutions are not achievable, this objective trades off higher returns against fewer composed options. We further propose a tabular algorithm that optimizes the proposed objective, based on the option-critic algorithm (Bacon et al., 2017). In a four-room domain with 16 tasks, we demonstrate that the options produced by our algorithm are consistent with human intuition because they seem to lead to cells near hallways connecting two rooms.

De Asis, K., Chan, A., Sutton, R. S., “Inverse Policy Evaluation for Value-based Decision Making.” In

ABSTRACT: Value-based methods for control often involve approximate value iteration (e.g., Q-learning), and behaving greedily with respect to the resulting value estimates with some degree of entropy to ensure the state-space is sufficiently explored. Such a greedy policy is an improvement over the previous policy, under which the current values were estimated. However, value-iteration may produce value functions that do not correspond with any policy. This is especially relevant with function approximation, when the true value function cannot be perfectly represented. In this work, we explore the use of Inverse Policy Evaluation, the process of solving for a likely policy given a value function. We derive a simple, incremental algorithm for the procedure, analyze how inaccuracies in the value function manifest in the corresponding policy, and provide empirical results emphasizing key properties of the mapping from value functions to policies.

Dohare, S., Mahmood, R., Sutton, R. S., “Continual Backprop: Stochastic Gradient Descent with Persistent Randomness.” In

ABSTRACT: The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop’s ability to learn. To address this degradation in Backprop’s plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the Continual Backprop algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.

Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers, F., Tanner, B., White, A. “Reward-Respecting Subtasks for Model-Based Reinforcement Learning.” In

ABSTRACT: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.

Naik, A., Sutton, R. S. (2022). Multi-Step Average-Reward Prediction via Differential TD(λ). In

ABSTRACT: We present Differential TD(λ), an average-reward algorithm that extends Wan et al.’s (2021) Differential TD(0) from the one-step tabular case to that of multi-step linear function approximation using eligibility traces. We characterize the algorithm’s fixed point and present a convergence theorem. Our analysis builds on prior work by Tsitsiklis and Van Roy (1999), who proposed a trace-based prediction algorithm that is restricted to the on-policy setting. Preliminary experiments show that Differential TD(λ) converges to the fixed point predicted by the theory and that an intermediate value of the bootstrapping parameter λ yields the best sample efficiency.
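
A minimal linear sketch of the update: the differential TD error follows Wan et al.'s Differential TD(0), while the trace handling here is schematic rather than the paper's exact algorithm.

```python
import numpy as np

def differential_td_lambda(w, r_bar, z, x, x_next, r,
                           alpha=0.1, eta=1.0, lam=0.8):
    """Average-reward (differential) TD(λ) step with linear features
    (schematic sketch). No discounting: the average-reward estimate r_bar
    plays the role that gamma plays in the discounted setting."""
    delta = r - r_bar + np.dot(w, x_next) - np.dot(w, x)  # differential TD error
    z = lam * z + x                                       # accumulating trace
    w += alpha * delta * z
    r_bar += eta * alpha * delta                          # same error updates r_bar
    return w, r_bar, z
```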

Rafiee, B., Abbas, Z., Ghiassian, S., Kumaraswamy, R., Sutton, R., Ludvig, E., White, A. (2022). From Eye-blinks to State Construction: Diagnostic Benchmarks for Online Representation Learning.

ABSTRACT: Experiments in classical conditioning show that animals such as rabbits, pigeons, and dogs can make long temporal associations that enable multi-step prediction. To replicate this remarkable ability, an agent must construct an internal state representation that summarizes its interaction history. Recurrent neural networks can automatically construct state and learn temporal associations. But the current training methods are prohibitively expensive for online prediction -- continual learning on every time step -- which is the focus of this paper. To facilitate research in online prediction, we present three new diagnostic prediction problems inspired by classical-conditioning experiments. The proposed problems test the learning capabilities that animals readily exhibit and highlight the current recurrent learning methods' limitations. While the proposed problems are nontrivial, they are still amenable to extensive testing and analysis in the small-compute regime, thereby enabling researchers to study issues in isolation carefully, ultimately accelerating progress towards scalable online representation learning methods.

Wan, Y., Naik, A., Sutton, R. (2021). Average-Reward Learning and Planning with Options.

ABSTRACT: We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.

Samani, A., Sutton, R. S. (2021). Learning agent state online with recurrent generate-and-test. ArXiv:2112.15236.

ABSTRACT: Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them ill-suited to online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features get replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.

Ashley, D. R., Ghiassian, S., Sutton, R. S. (2021) Does the Adam Optimizer Exacerbate Catastrophic Forgetting? ArXiv:2102.07686.

ABSTRACT: Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), however, it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified, and, moreover, to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of which modern gradient-based optimization algorithm is used to train an ANN has a significant impact on the amount of catastrophic forgetting and show that, surprisingly, in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than the more modern algorithms such as Adam. We empirically compare four different existing metrics for quantifying catastrophic forgetting and (2) show that the degree to which the learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when looking at catastrophic forgetting. Based on our results, we recommend inter-task forgetting in supervised learning must be measured with both retention and relearning metrics concurrently, and intra-task forgetting in reinforcement learning must---at the very least---be measured with pairwise interference.

Kudashkina, K., Wan, Y., Naik, A., Sutton, R. S. (2021) Planning with Expectation Models for Control, ArXiv:2104.08543.

ABSTRACT: In model-based reinforcement learning (MBRL), Wan et al. (2019) showed conditions under which the environment model could produce the expectation of the next feature vector rather than the full distribution, or a sample thereof, with no loss in planning performance. Such expectation models are of interest when the environment is stochastic and non-stationary, and the model is approximate, such as when it is learned using function approximation. In these cases a full distribution model may be impractical and a sample model may be either more expensive computationally or of high variance. Wan et al. considered only planning for prediction to evaluate a fixed policy. In this paper, we treat the control case - planning to improve and find a good approximate policy. We prove that planning with an expectation model must update a state-value function, not an action-value function as previously suggested (e.g., Sorg & Singh, 2010). This opens the question of how planning influences action selections. We consider three strategies for this and present general MBRL algorithms for each. We identify the strengths and weaknesses of these algorithms in computational experiments. Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.
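
A sketch of one planning backup with an expectation model and a linear state-value function; with linear values, the value of the expected next feature vector equals the expected next value, which is why expectation models suffice here. The model interface is an assumption for illustration.

```python
import numpy as np

def expectation_model_backup(w, x, model, alpha=0.1, gamma=0.99):
    """One planning backup using an expectation model (sketch).
    model(x) returns (r_hat, x_next_hat): the expected reward and the
    expected next feature vector. v(x) = w.x is linear, so
    v(E[x']) = E[v(x')] and backing up through the expectation is sound."""
    r_hat, x_next_hat = model(x)
    delta = r_hat + gamma * np.dot(w, x_next_hat) - np.dot(w, x)
    w += alpha * delta * x
    return w
```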

Silver, D., Singh, S., Precup, D., Sutton, R. S. (2021) Reward Is Enough.

ABSTRACT: In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.

Ghiassian, S., Sutton, R. S. (2021) An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task. ArXiv:2106.00922.

ABSTRACT: Off-policy prediction---learning the value function for one policy from data generated while following another policy---is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD(λ), Vtrace, and versions of Tree Backup and ABQ modified to apply to a prediction setting. Our experiments used the Collision task, a small idealized off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. We assessed the performance of the algorithms according to their learning rate, asymptotic error level, and sensitivity to step-size and bootstrapping parameters. By these measures, the eleven algorithms can be partially ordered on the Collision task. In the top tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. In the middle tier, the five Gradient-TD algorithms and Off-policy TD(λ) were more sensitive to the bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and ABQ; these algorithms were no faster and had higher asymptotic error than the others. Our results are definitive for this task, though of course experiments with more tasks are needed before an overall assessment of the algorithms' merits can be made.

Zhang, S., Wan, Y., Naik, A., Sutton, R. S., Whiteson, S. (2021). Average-Reward Off-Policy Policy Evaluation with Function Approximation. In

ABSTRACT: We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

Wan, Y., Naik, A., Sutton, R. S. (2021). Learning and Planning in Average-Reward Markov Decision Processes. In

ABSTRACT: We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
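
A tabular sketch of the control algorithm's central update as we read it: the same temporal-difference error drives both the action values and the average-reward estimate, with no reference state.

```python
import numpy as np

def differential_q_learning(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One Differential Q-learning step (sketch after Wan, Naik & Sutton).
    Q is a states-by-actions array; r_bar estimates the reward rate."""
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]   # differential TD error
    Q[s, a] += alpha * delta
    r_bar += eta * alpha * delta                      # no reference state needed
    return Q, r_bar
```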

Lee, J. Y., Sutton, R. S. (2021). Policy iterations for reinforcement learning problems in continuous time and space — Fundamental theory and methods. Automatica

ABSTRACT: Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as a foundation for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, to solve a general RL problem in continuous time and space (CTS), with the environment modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies including 1) discounted RL and 2) optimal control tasks. Fundamental mathematical properties — admissibility, uniqueness of the solution to the Bellman equation, monotone improvement, convergence, and optimality of the solution to the Hamilton-Jacobi-Bellman equation (HJBE) — are all investigated in depth and improved from the existing theory, along with the general and case studies. Finally, the proposed methods are simulated with an inverted-pendulum model, in model-based and partially model-free implementations, to support the theory and investigate them further.

Barto, A. G., Sutton, R. S., Anderson, C. W. (2021) Looking Back on the Actor-Critic Architecture.

ABSTRACT: This retrospective describes the overall research project that gave rise to the authors' paper “Neuronlike adaptive elements that can solve difficult learning control problems” that was published in the 1983 Neural and Sensory Information Processing special issue of the IEEE Transactions on Systems, Man, and Cybernetics. This look back explains how this project came about, presents the ideas and previous publications that influenced it, and describes our most closely related subsequent research. It concludes by pointing out some noteworthy aspects of this article that have been eclipsed by its main contributions, followed by commenting on some of the directions and cautions that should inform future research.

Chan, A., De Asis, K., & Sutton, R. S. (2020). Inverse policy evaluation for value-based sequential decision-making. ArXiv:2008.11329.

ABSTRACT: Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., Q-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of some policy, over which the greedy policy will be an improvement. However, value-iteration can produce value functions that do not correspond to any policy. This is especially relevant in the function-approximation regime, when the true value function can't be perfectly represented. In this work, we explore the use of inverse policy evaluation, the process of solving for a likely policy given a value function, for deriving behavior from a value function. We provide theoretical and empirical results to show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.

Dohare, S., Sutton, R. S., Mahmood, A. R. (2021). Continual backprop: Stochastic gradient descent with persistent randomness. ArXiv:2108.06325.

ABSTRACT: The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this degradation in Backprop's plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the Continual Backprop algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.

Ghiassian, S., Sutton, R. S. (2021). An empirical comparison of off-policy prediction learning algorithms in the four rooms environment. ArXiv:2109.05110.

ABSTRACT: Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. We empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks: the Rooms task, and the High Variance Rooms task. The tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as 2^{14} and can sometimes be two. To control the high variance caused by the product of the importance sampling ratios, step size should be set small, which in turn slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as 2^{14} × 25. This paper builds upon the empirical study of off-policy prediction learning algorithms by Ghiassian and Sutton (2021). We consider the same set of algorithms as theirs and employ the same experimental methodology. The algorithms considered are: Off-policy TD(λ), five Gradient-TD algorithms, two Emphatic-TD algorithms, Tree Backup(λ), Vtrace(λ), and ABTD(ζ). We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. The data shows that Tree Backup(λ), Vtrace(λ), and ABTD(ζ) are not affected by the high variance as much as other algorithms but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD(λ) tends to have lower asymptotic error than other algorithms, but might learn more slowly in some cases. We suggest algorithms for practitioners based on their problem of interest, and suggest approaches that can be applied to specific algorithms that might result in substantially improved algorithms.

De Asis, K., Chan, A., Pitis, S., Sutton, R., & Graves, D. (2020). Fixed-horizon temporal difference methods for stable reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 3741-3748).

ABSTRACT: We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
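
A tabular sketch of the idea, assuming horizons 0 through H with the horizon-0 values fixed at zero; each horizon bootstraps only from the horizon below it, so no estimate ever bootstraps from itself.

```python
import numpy as np

def fixed_horizon_td0(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One fixed-horizon TD(0) update (sketch). V is a list of arrays where
    V[h][s] estimates the discounted sum of the next h rewards from s and
    V[0] is identically zero."""
    H = len(V) - 1
    for h in range(1, H + 1):
        target = r + gamma * V[h - 1][s_next]   # bootstrap from horizon h-1 only
        V[h][s] += alpha * (target - V[h][s])
    return V
```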

Dalrymple, A. N., Roszko, D. A., Sutton, R. S., & Mushahwar, V. K. (2020). Pavlovian control of intraspinal microstimulation to produce over-ground walking.

ABSTRACT: Objective. Neuromodulation technologies are increasingly used for improving function after neural injury. To achieve a symbiotic relationship between device and user, the device must augment remaining function, and independently adapt to day-to-day changes in function. The goal of this study was to develop predictive control strategies to produce over-ground walking in a model of hemisection spinal cord injury (SCI) using intraspinal microstimulation (ISMS). Approach. Eight cats were anaesthetized and placed in a sling over a walkway. The residual function of a hemisection SCI was mimicked by manually moving one hind-limb through the walking cycle. ISMS targeted motor networks in the lumbosacral enlargement to activate muscles in the other, presumably ‘paralyzed’ limb, using low levels of current (<130 μA). Four people took turns to move the ‘intact’ limb, generating four different walking styles. Two control strategies, which used ground reaction force and angular velocity information about the manually moved ‘intact’ limb to control the timing of the transitions of the ‘paralyzed’ limb through the step cycle, were compared. The first strategy used thresholds on the raw sensor values to initiate transitions. The second strategy used reinforcement learning and Pavlovian control to learn predictions about the sensor values. Thresholds on the predictions were then used to initiate transitions. Main results. Both control strategies were able to produce alternating, over-ground walking. Transitions based on raw sensor values required manual tuning of thresholds for each person to produce walking, whereas Pavlovian control did not. Learning occurred quickly during walking: predictions of the sensor signals were learned rapidly, initiating correct transitions after ≤4 steps. Pavlovian control was resilient to different walking styles and different cats, and recovered from induced mistakes during walking. Significance. This work demonstrates, for the first time, that Pavlovian control can augment remaining function and facilitate personalized walking with minimal tuning requirements.

Kudashkina, K., Pilarski, P. M., & Sutton, R. S. (2020). Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI. ArXiv:2008.12095.

ABSTRACT: Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to researching and realizing this promise? In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning. The primary advantages of voice document editing are that the domain is tightly scoped and that it provides something for the conversation to be about (the document) that is delimited and fully accessible to the intelligent assistant. The advantages of reinforcement learning in general are that its methods are designed to learn from interaction without explicit instruction and that it formalizes the purposes of the assistant. Model-based reinforcement learning is needed in order to genuinely understand the domain of discourse and thereby work efficiently with the user to achieve their goals. Together, voice document editing and model-based reinforcement learning comprise a promising research direction for achieving conversational AI.

Young, K., & Sutton, R. S. (2020). Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning. ArXiv:2010.15268.

ABSTRACT: Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood, however, is the quality of the policies in the region of oscillation. In this paper we present simple examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribution. This means the values used for greedification are unreliable and can steer the policy in undesirable directions. Our observation that this can lead to the worst possible policy shows that in a general sense such algorithms are unreliable. The existence of such examples helps to narrow the kind of theoretical guarantees that are possible and the kind of algorithmic ideas that are likely to be helpful. We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.

Tian, T., Sutton, R. S. (2020). “Extending sliding-step importance weighting from supervised learning to reinforcement learning.” Best of IJCAI Workshops 2019, Springer.

ABSTRACT: Stochastic
gradient descent (SGD) has been at the center of many advances
in modern machine learning. SGD processes examples sequentially,
updating a weight vector in the direction that would most reduce
the loss for that example. In many applications, some examples
are more important than others and, to capture this, each
example is given a non-negative weight that modulates its
impact. Unfortunately, if the importance weights are highly
variable they can greatly exacerbate the difficulty of setting
the step-size parameter of SGD. To ease this difficulty,
Karampatziakis and Langford [6] developed a class of elegant
algorithms that are much more robust in the face of highly
variable importance weights in supervised learning. In this
paper we extend their idea, which we call “sliding step,” to
reinforcement learning, where importance weighting can be
particularly variable due to the importance sampling involved in
off-policy learning algorithms. We compare two alternative ways
of doing the extension in the linear function approximation
setting, then introduce specific sliding-step versions of the
TD(0) and Emphatic TD(0) learning algorithms. We prove the
convergence of our algorithms and demonstrate their
effectiveness on both on-policy and off-policy problems.
Overall, our new algorithms appear to be effective in bringing
the robustness of the sliding-step technique from supervised
learning to reinforcement learning.
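As a concrete point of comparison, below is a minimal sketch of the supervised-learning starting point: the naive importance-weighted SGD update beside a Karampatziakis-and-Langford-style closed-form update for squared loss. Function names are ours, and the reinforcement-learning extensions in the paper build on this idea rather than appearing here.

```python
import numpy as np

def naive_update(w, x, y, h, alpha):
    # Ordinary SGD on squared loss with the importance weight h folded into
    # the step; a large h can overshoot the target y arbitrarily badly.
    return w + alpha * h * (y - w @ x) * x

def sliding_step_update(w, x, y, h, alpha):
    # Closed-form limit of taking many infinitesimal steps of total weight h:
    # the prediction w @ x decays toward y and can never overshoot it,
    # however large h is.
    xx = x @ x
    if xx == 0.0:
        return w
    scale = (1.0 - np.exp(-alpha * h * xx)) / xx
    return w + scale * (y - w @ x) * x
```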

Sutton, R. S. (2020). John McCarthy's Definition of Intelligence. Journal of Artificial General Intelligence 11(2), 66-67. doi:10.2478/jagi-2020-0003, 2 pages.

Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvari, C., Singh, S., Van Roy, B., Sutton, R., Silver, D., van Hasselt, H. (2020). “Behaviour suite for reinforcement learning.” Proceedings of the International Conference on Learning Representations, 2020.

ABSTRACT: This
paper introduces the Behaviour
Suite for Reinforcement Learning, or bsuite for
short. bsuite is a collection of carefully-designed
experiments that investigate core capabilities of reinforcement
learning (RL) agents with two objectives. First, to collect
clear, informative and scalable problems that capture key issues
in the design of general and efficient learning algorithms.
Second, to study agent behaviour through their performance on
these shared benchmarks. To complement this effort, we open
source github.com/deepmind/bsuite, which
automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and
accessible research on the core issues in RL, and ultimately the
design of superior learning algorithms. Our code is Python, and
easy to use within existing projects. We include examples with
OpenAI Baselines, Dopamine as well as new reference
implementations. Going forward, we hope to incorporate more
excellent experiments from the research community, and commit to
a periodic review of bsuite from a committee of prominent researchers.

Hernandez-Garcia, J. F., & Sutton, R. S. (2019). Learning Sparse Representations Incrementally in Deep Reinforcement Learning. ArXiv:1912.04002.

ABSTRACT: Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward. Previous results were based on a two-step process where the representation was learned offline and the action-value function was learned online afterwards. In this paper, we investigate if it is possible to learn a sparse representation and the action-value function simultaneously and incrementally. We investigate this question by employing several regularization techniques and observing how they affect sparsity of the representation learned by a DQN agent in two different benchmark domains. Our results show that with appropriate regularization it is possible to increase the sparsity of the representations learned by DQN agents. Moreover, we found that learning sparse representations also resulted in improved performance in terms of cumulative reward. Finally, we found that the performance of the agents that learned a sparse representation was more robust to the size of the experience replay buffer. This last finding supports the long-standing hypothesis that the overlap in representations learned by deep neural networks is the leading cause of catastrophic interference.

Naik, A., Shariff, R., Yasui, N., Yao, H., & Sutton, R. S. (2019). Discounted reinforcement learning is not an optimization problem. arXiv preprint arXiv:1910.02140.

ABSTRACT: Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks. It is not an optimization problem in its usual formulation, so when using function approximation there is no optimal policy. We substantiate these claims, then go on to address some misconceptions about discounting and its connection to the average reward formulation. We encourage researchers to adopt rigorous optimization approaches, such as maximizing average reward, for reinforcement learning in continuing tasks.

Wan, Y., Zaheer, M., White, A., White, M., Sutton, R.S. (2019). “Planning with expectation models.” In Proceedings of the International Joint Conference on Artificial Intelligence.

ABSTRACT:
Distribution and sample models are two popular model choices in
model-based reinforcement learning (MBRL). However, learning
these models can be intractable, particularly when the state and
action spaces are large. Expectation models, on the other hand,
are relatively easier to learn due to their compactness and have
also been widely used for deterministic environments. For
stochastic environments, it is not obvious how expectation
models can be used for planning as they only partially
characterize a distribution. In this paper, we propose a sound
way of using expectation models for MBRL. In particular, we 1)
show that planning with an expectation model is equivalent to
planning with a distribution model if the state value function
is linear in state-feature vector, 2) analyze two common
parametrization choices for approximating the expectation:
linear and non-linear expectation models, 3) propose a sound
model-based policy evaluation algorithm and present its
convergence results, and 4) empirically demonstrate the
effectiveness of the proposed planning algorithm.
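A one-line illustration of claim 1) above: with a value function linear in the feature vector, the expected TD target depends on the next state only through the expected next feature vector, so an expectation model suffices for planning even in a stochastic environment. A minimal sketch follows; the model outputs are assumed to come from some learned expectation model, and the names are illustrative.

```python
import numpy as np

def planning_backup(w, x, expected_next_x, expected_reward, alpha, gamma):
    """One Dyna-style planning backup from an expectation model. Because
    v(s) = w @ x(s) is linear, E[r + gamma * w @ x'] equals
    expected_reward + gamma * w @ expected_next_x, matching what a full
    distribution model would give in expectation."""
    delta = expected_reward + gamma * (w @ expected_next_x) - w @ x
    return w + alpha * delta * x
```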

Sutton, R. (2019). The bitter lesson. Incomplete Ideas (blog). One and a half pages.

Hernandez-Garcia, J. F., & Sutton, R. S. (2019). Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. arXiv preprint arXiv:1901.07510.

ABSTRACT: Multi-step methods such as Retrace(λ) and n-step Q-learning have become a crucial component of modern deep reinforcement learning agents. These methods are often evaluated as a part of bigger architectures and their evaluations rarely include enough samples to draw statistically significant conclusions about their performance. This type of methodology makes it difficult to understand how particular algorithmic details of multi-step methods influence learning. In this paper we combine the n-step action-value algorithms Retrace, Q-learning, Tree Backup, Sarsa, and Q(σ) with an architecture analogous to DQN. We test the performance of all these algorithms in the mountain car environment; this choice of environment allows for faster training times and larger sample sizes. We present statistical analyses on the effects of the off-policy correction, the backup length parameter n, and the update frequency of the target network on the performance of these algorithms. Our results show that (1) using off-policy correction can have an adverse effect on the performance of Sarsa and Q(σ); (2) increasing the backup length n consistently improved performance across all the different algorithms; and (3) the performance of Sarsa and Q-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, Q(σ), and Retrace in this particular task.

Gu, X., Ghiassian, S., & Sutton, R. S. (2019). Should All Temporal Difference Learning Use Emphasis?. ArXiv:1903.00194.

ABSTRACT: Emphatic Temporal
Difference (ETD) learning has recently been proposed as a
convergent off-policy learning method. ETD was proposed mainly
to address convergence issues of conventional Temporal
Difference (TD) learning under off-policy training but it is
different from conventional TD learning even under on-policy
training. A simple counterexample provided back in 2017 pointed
to a potential class of problems where ETD converges but TD
diverges. In this paper, we empirically show that ETD converges
on a few other well-known on-policy experiments whereas TD
either diverges or performs poorly. We also show that ETD
outperforms TD on the mountain car prediction problem. Our
results, together with a similar pattern observed under
off-policy training in prior works, suggest that ETD might be a
good substitute for conventional TD.

Kearney, A., Veeriah, V., Travnik, J., Pilarski, P. M., & Sutton, R. S. (2019). Learning feature relevance through step size adaptation in temporal-difference learning. ArXiv:1903.03252.

ABSTRACT: There is a long history of using meta learning as
representation learning, specifically for determining the
relevance of inputs. In this paper, we examine an instance of
meta-learning in which feature relevance is learned by adapting
step size parameters of stochastic gradient descent—building on
a variety of prior work in stochastic approximation, machine
learning, and artificial neural networks. In particular, we
focus on stochastic meta-descent introduced in the Incremental
Delta-Bar-Delta (IDBD) algorithm for setting individual step
sizes for each feature of a linear function approximator. Using
IDBD, a feature with large or small step sizes will have a large
or small impact on generalization from training examples. As a
main contribution of this work, we extend IDBD to
temporal-difference (TD) learning—a form of learning which is
effective in sequential, non i.i.d. problems. We derive a
variety of IDBD generalizations for TD learning, demonstrating
that they are able to distinguish which features are relevant
and which are not. We demonstrate that TD IDBD is effective at
learning feature relevance in both an idealized gridworld and a
real-world robotic prediction task.

Rafiee, B., Ghiassian, S., White, A., Sutton, R.S. (2019). “Prediction in intelligence: An empirical comparison of off-policy algorithms on robots.” In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pp. 332-340.

ABSTRACT: The
ability to continually make predictions about the world may be
central to intelligence. Off-policy learning and general value
functions (GVFs) are well-established algorithmic techniques for
learning about many signals while interacting with the world. In
the past couple of years, many ambitious works have used
off-policy GVF learning to improve control performance in both
simulation and robotic control tasks. Many of these works use
semi-gradient temporal-difference (TD) learning algorithms, like
Q-learning, which are potentially divergent. In the last decade,
several TD learning algorithms have been proposed that are
convergent and computationally efficient, but not much is known
about how they perform in practice, especially on robots. In
this work, we perform an empirical comparison of modern
off-policy GVF learning algorithms on three different robot
platforms, providing insights into their strengths and
weaknesses. We also discuss the challenges of conducting fair
comparative studies of off-policy learning on robots and develop
a new evaluation methodology that is successful and applicable
to a relatively complicated robot domain.

Ghiassian, S., Patterson, A., White, M., Sutton, R. S., & White, A. (2018). Online off-policy prediction. arXiv preprint arXiv:1811.02597.

ABSTRACT: This paper investigates the problem of online
prediction learning, where learning proceeds continuously as the
agent interacts with an environment. The predictions made by the
agent are contingent on a particular way of behaving,
represented as a value function. However, the behavior used to
select actions and generate the behavior data might be different
from the one used to define the predictions, and thus the
samples are generated off-policy. The ability to learn
behavior-contingent predictions online and off-policy has long
been advocated as a key capability of predictive-knowledge
learning systems but remained an open algorithmic challenge for
decades. The issue lies with the temporal difference (TD)
learning update at the heart of most prediction algorithms:
combining bootstrapping, off-policy sampling and function
approximation may cause the value estimate to diverge. A
breakthrough came with the development of a new objective
function that admitted stochastic gradient descent variants of
TD. Since then, many sound online off-policy prediction
algorithms have been developed, but there has been limited
empirical work investigating the relative merits of all the
variants. This paper aims to fill these empirical gaps and
provide clarity on the key ideas behind each method. We
summarize the large body of literature on off-policy learning,
focusing on 1) methods that use computation linear in the number
of features and are convergent under off-policy sampling, and 2)
other methods which have proven useful with non-fixed, nonlinear
function approximation. We provide an empirical study of
off-policy prediction methods in two challenging microworlds. We
report each method's parameter sensitivity, empirical
convergence rate, and final performance, providing new insights
that should enable practitioners to successfully extend these
new methods to large-scale applications.

Travnik, J. B., Mathewson, K. W., Sutton, R. S., Pilarski, P. M. (2018). Reactive Reinforcement Learning in Asynchronous Environments. Frontiers in Robotics and AI.

ABSTRACT: The
relationship between a reinforcement learning (RL) agent and an
asynchronous environment is often ignored. Frequently used
models of the interaction between an agent and its environment,
such as Markov Decision Processes (MDP) or Semi-Markov Decision
Processes (SMDP), do not capture the fact that, in an
asynchronous environment, the state of the environment may
change during computation performed by the agent. In an
asynchronous environment, minimizing reaction time—the time it
takes for an agent to react to an observation—also minimizes the
time in which the state of the environment may change following
observation. In many environments, the reaction time of an agent
directly impacts task performance by permitting the environment
to transition into either an undesirable terminal state or a
state where performing the chosen action is inappropriate. We
propose a class of reactive
reinforcement learning algorithms that address this
problem of asynchronous environments by immediately acting after
observing new state information. We compare a reactive SARSA
learning algorithm with the conventional SARSA learning
algorithm on two asynchronous robotic tasks (emergency stopping
and impact prevention), and show that the reactive RL algorithm
reduces the reaction time of the agent by approximately the
duration of the algorithm's learning update. This new class of
reactive algorithms may facilitate safer control and faster
decision making without any change to standard learning
guarantees.

Kearney, A., Veeriah, V., Travnik, J. B., Sutton, R. S., Pilarski, P. M. (2018). TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent. ArXiv:1804.03334.

ABSTRACT: In
this paper, we introduce a method for adapting the step-sizes of
temporal difference (TD) learning. The performance of TD methods
often depends on well chosen step-sizes, yet few algorithms have
been developed for setting the step-size automatically for TD
learning. An important limitation of current methods is that
they adapt a single step-size shared by all the weights of the
learning system. A vector step-size enables greater optimization
by specifying parameters on a per-feature basis. Furthermore,
adapting parameters at different rates has the added benefit of
being a simple form of representation learning. We generalize
Incremental Delta Bar Delta (IDBD)—a vectorized adaptive
step-size method for supervised learning—to TD learning, which
we name TIDBD. We demonstrate that TIDBD is able to find
appropriate step-sizes in both stationary and non-stationary
prediction tasks, outperforming ordinary TD methods and TD
methods with scalar step-size adaptation; we demonstrate that it
can differentiate between features which are relevant and
irrelevant for a given task, performing representation learning;
and we show on a real-world robot prediction task that TIDBD is
able to outperform ordinary TD methods and TD methods augmented
with AlphaBound and RMSprop.
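For orientation, a minimal sketch of a semi-gradient TIDBD(0)-style update is below. It follows the IDBD template with the TD error in place of the supervised error; it is an illustration of the idea under simple assumptions, not the paper's exact pseudocode, and the variable names are ours.

```python
import numpy as np

def tidbd0_update(w, beta, h, x, r, x_next, gamma=0.99, theta=0.01):
    """Semi-gradient TIDBD(0) sketch. beta holds per-feature log step-sizes
    (initialize to the log of an ordinary step-size); h is a per-feature
    memory of recent correlated updates, as in IDBD."""
    delta = r + gamma * (w @ x_next) - w @ x      # TD error
    beta = beta + theta * delta * x * h           # meta-step on log step-sizes
    alpha = np.exp(beta)                          # per-feature step-sizes
    w = w + alpha * delta * x                     # TD(0) update, vector step-size
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, beta, h
```

Features whose step-sizes grow come to dominate generalization, while irrelevant features' step-sizes shrink toward zero, which is the simple form of representation learning described above.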

Ghiassian, S., Yu, H., Rafiee, B., Sutton, R. S. (2018). Two geometric input transformation methods for fast online reinforcement learning with neural nets. ArXiv:1805.07476.

ABSTRACT: We
apply neural nets with ReLU gates in online reinforcement
learning. Our goal is to train these networks in an incremental
manner, without the computationally expensive experience replay.
By studying how individual neural nodes behave in online
training, we recognize that the global nature of ReLU gates can
cause undesirable learning interference in each node’s learning
behavior. We propose reducing such interferences with two
efficient input transformation methods that are geometric in
nature and match well the geometric property of ReLU gates. The
first one is tile coding, a classic binary encoding scheme
originally designed for local generalization based on the
topological structure of the input space. The second one (EmECS)
is a new method we introduce; it is based on geometric
properties of convex sets and topological embedding of the input
space into the boundary of a convex set. We discuss the behavior
of the network when it operates on the transformed inputs. We
also compare it experimentally with some neural nets that do not
use the same input transformations, and with the classic
algorithm of tile coding plus a linear function approximator,
and on several online reinforcement learning tasks, we show that
the neural net with tile coding or EmECS can achieve not only
faster learning but also more accurate approximations. Our
results strongly suggest that geometric input transformation of
this type can be effective for interference reduction and takes
us a step closer to fully incremental reinforcement learning
with neural nets.
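Of the two transformations, tile coding is the classic one and easy to sketch; a minimal 1-D coder is shown below for illustration only (EmECS is new in this paper and not shown).

```python
import numpy as np

def tile_code(s, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    """Minimal 1-D tile coding: n_tilings grids over [low, high], each offset
    by a fraction of a tile, giving a sparse binary vector with exactly
    n_tilings active features and purely local generalization."""
    features = np.zeros(n_tilings * tiles_per_dim)
    scaled = (s - low) / (high - low) * tiles_per_dim
    for t in range(n_tilings):
        offset = t / n_tilings                        # shift each tiling slightly
        idx = int(np.clip(scaled + offset, 0, tiles_per_dim - 1))
        features[t * tiles_per_dim + idx] = 1.0
    return features
```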

De Asis, K., Sutton, R. S. (2018). Per-decision Multi-step Temporal Difference Learning with Control Variates. Proceedings of the 2018 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT:
Multi-step temporal difference (TD) learning is an important
approach in reinforcement learning, as it unifies one-step TD
learning with Monte Carlo methods in a way where intermediate
algorithms can outperform either extreme. They address a
bias-variance trade off between reliance on current estimates,
which could be poor, and incorporating longer sampled reward
sequences into the updates. Especially in the off-policy
setting, where the agent aims to learn about a policy different
from the one generating its behaviour, the variance in the
updates can cause learning to diverge as the number of sampled
rewards used in the estimates increases. In this paper, we
introduce per-decision control variates for multi-step TD
algorithms, and compare them to existing methods. Our results
show that including the control variates can greatly improve
performance on both on and off-policy multi-step temporal
difference learning tasks.
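The per-decision control-variate idea admits a compact recursive form, G_t = ρ_t(R_{t+1} + γG_{t+1}) + (1 − ρ_t)V(S_t), so a step with ρ_t = 0 falls back to the current estimate rather than zeroing the whole return. A sketch under simple assumptions (tabular values, fixed γ; the names are ours):

```python
def cv_return(states, rewards, rhos, v, gamma):
    """n-step off-policy return with per-decision control variates.
    states has n+1 entries, rewards and rhos have n; v maps states to
    current value estimates; bootstraps from v[states[-1]]."""
    G = v[states[-1]]
    for t in reversed(range(len(rewards))):
        G = rhos[t] * (rewards[t] + gamma * G) + (1.0 - rhos[t]) * v[states[t]]
    return G
```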

Wan, Y., Zaheer, M., White, M., Sutton, R. S. (2018). Model-based Reinforcement Learning with Non-linear Expectation Models and Stochastic Environments. Workshop of the Federated Artificial Intelligence Meeting, Stockholm, Sweden.

ABSTRACT: In model-based
reinforcement learning (MBRL), the model of a stochastic
environment provides, for each state and action, either 1) the
complete distribution of possible next states, 2) a sample next
state, or 3) the expectation of the next state’s feature vector.
The third case, that of an expectation
model, is particularly appealing because the
expectation is compact and deterministic; this is the case most
commonly used, but often in a way that is not sound for
non-linear models such as those obtained with deep learning. In
this paper we introduce the first MBRL algorithm that is sound
for non-linear expectation models and stochastic environments.
Key to our algorithm, based on the Dyna architecture, is that
the model is never iterated to produce a trajectory, but only
used to generate single expected transitions to which a Bellman
backup with a linear
approximate value function is applied. In our results, we also
consider the extension of the Dyna architecture to partial
observability. We show the effectiveness of our algorithm by
comparing it with model-free methods on partially-observable
navigation tasks.

De Asis, K., Bennett, B., Sutton, R. S. (2018). Predicting Periodicity with Temporal Difference Learning. arXiv:1809.07435.

ABSTRACT:
Temporal difference (TD) learning is an important approach in
reinforcement learning, as it combines ideas from dynamic
programming and Monte Carlo methods in a way that allows for
online and incremental model-free learning. A key idea of TD
learning is that it is learning predictive knowledge about the
environment in the form of value functions, from which it can
derive its behavior to address long-term sequential decision
making problems. The agent’s horizon of interest, that is, how
immediate or long-term a TD learning agent predicts into the
future, is adjusted through a discount rate parameter. In this
paper, we introduce an alternative view on the discount rate,
with insight from digital signal processing, to include
complex-valued discounting. Our results show that setting the
discount rate to appropriately chosen complex numbers allows for
online and incremental estimation of the Discrete Fourier
Transform (DFT) of a signal of interest with TD learning. We
thereby extend the types of knowledge representable by value
functions, which we show are particularly useful for identifying
periodic effects in the reward sequence.
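A small self-contained illustration of the idea (not the paper's experiments): ordinary tabular TD(0) run with a complex-valued discount on a deterministic ring of states with a periodic reward. The learned complex values approximate discounted sums of future rewards, which for γ = ρe^{iω} are discounted DFT components of the reward sequence.

```python
import numpy as np

n = 8                                               # ring of n states
rewards = np.cos(2 * np.pi * np.arange(n) / n)      # periodic reward, period n
gamma = 0.9 * np.exp(1j * 2 * np.pi / n)            # complex discount rate
v = np.zeros(n, dtype=complex)                      # complex value estimates
alpha, s = 0.1, 0

for _ in range(20000):
    s_next = (s + 1) % n
    r = rewards[s_next]
    v[s] += alpha * (r + gamma * v[s_next] - v[s])  # ordinary TD(0), complex arithmetic
    s = s_next
# v[s] now approximates sum_{k>=1} gamma**(k-1) * rewards[(s+k) % n]
```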

Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M., Sutton, R. S. (2018). Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return. Proceedings of the 2018 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT: Temporal-difference (TD) learning methods are
widely used in reinforcement learning to estimate the expected
return for each state, without a model, because of their
significant advantages in computational and data efficiency. For
many applications involving risk mitigation, it would also be
useful to estimate the variance
of the return by TD methods. In this paper, we describe a way of
doing this that is substantially simpler than those proposed by
Tamar, Di Castro, and Mannor in 2012, or those proposed by White
and White in 2016. We show that two TD learners operating in
series can learn expectation and variance estimates. The trick
is to use the square of the TD error of the expectation learner
as the reward of the variance learner, and the square of the
expectation learner’s discount rate as the discount rate of the
variance learner. With these two modifications, the variance
learning problem becomes a conventional TD learning problem to
which standard theoretical results can be applied. Our formal
results are limited to the table lookup case, for which our
method is still novel, but the extension to function
approximation is immediate, and we provide some empirical
results for the linear function approximation case. Our
experimental results show that our direct method behaves just as
well as a comparable indirect method, but is generally more
robust.
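The abstract fully specifies the trick, so a tabular sketch is short: the variance learner is an ordinary TD learner whose reward is the squared TD error of the value learner and whose discount rate is γ². The function and variable names are ours.

```python
def direct_variance_update(v, m, s, r, s_next, alpha, gamma):
    """One step of the direct method, tabular case. v[s] estimates the
    expected return from s; m[s] estimates the variance of that return."""
    delta = r + gamma * v[s_next] - v[s]                # value learner's TD error
    v[s] += alpha * delta
    delta_bar = delta**2 + gamma**2 * m[s_next] - m[s]  # variance learner's TD error
    m[s] += alpha * delta_bar
```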

Young, K. J., Sutton, R. S., Yang, S. (2018). Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling. ArXiv:1806.00540.

ABSTRACT:
Episodic memory is a psychology term which refers to the ability
to recall specific events from the past. We suggest one
advantage of this particular type of memory is the ability to
easily assign credit to a specific state when remembered
information is found to be useful. Inspired by this idea, and
the increasing popularity of external memory mechanisms to
handle long-term dependencies in deep learning systems, we
propose a novel algorithm which uses a reservoir sampling
procedure to maintain an external memory consisting of a fixed
number of past states. The algorithm allows a deep reinforcement
learning agent to learn online to preferentially remember those
states which are found to be useful to recall later on.
Critically this method allows for efficient online computation
of gradient estimates with respect to the write process of the
external memory. Thus unlike most prior mechanisms for external
memory it is feasible to use in an online reinforcement learning
setting.
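For reference, the reservoir-sampling component alone looks like the classic algorithm sketched below. This shows only the memory maintenance; the paper's contribution is additionally learning the write process by estimating gradients through it.

```python
import random

class ReservoirMemory:
    """Maintain a uniform random sample of capacity items over a stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.seen = 0

    def add(self, state):
        self.seen += 1
        if len(self.memory) < self.capacity:
            self.memory.append(state)
        else:
            j = random.randrange(self.seen)   # each seen item kept w.p. capacity/seen
            if j < self.capacity:
                self.memory[j] = state
```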

De Asis, K., Hernandez-Garcia, J. F., Holland, G. Z., Sutton, R. S. (2018). Multi-step reinforcement learning: A unifying algorithm. Proceedings of the Conference of the Association for the Advancement of Artificial Intelligence. Also ArXiv:1703.01327.

ABSTRACT:
Unifying seemingly disparate algorithmic ideas to produce better
performing algorithms has been a longstanding goal in
reinforcement learning. As a primary example, TD(λ) elegantly
unifies one-step TD prediction with Monte Carlo methods through
the use of eligibility traces and the trace-decay parameter λ.
Currently, there are a multitude of algorithms that can be used
to perform TD control, including Sarsa, Q-learning, and Expected
Sarsa. These methods are often studied in the one-step case, but
they can be extended across multiple time steps to achieve
better performance. Each of these algorithms is seemingly
distinct, and no one dominates the others for all problems. In
this paper, we study a new multi-step action-value algorithm
called Q(σ) that unifies and generalizes these existing
algorithms, while subsuming them as special cases. A new
parameter, σ, is introduced to allow the degree of sampling
performed by the algorithm at each step during its backup to be
continuously varied, with Sarsa existing at one extreme (full
sampling), and Expected Sarsa existing at the other (pure
expectation). Q(σ) is generally applicable to both on- and
off-policy learning, but in this work we focus on experiments in
the on-policy case. Our results show that an intermediate value
of σ, which results in a mixture of the existing algorithms,
performs better than either extreme. The mixture can also be
varied dynamically which can result in even greater performance.
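The one-step backup target makes the unification concrete: σ = 1 gives the Sarsa (full-sampling) target and σ = 0 the Expected Sarsa (pure-expectation) target. A sketch with names of our choosing:

```python
import numpy as np

def q_sigma_target(r, q_next, pi_next, a_next, sigma, gamma):
    """One-step Q(sigma) target. q_next: action values at the next state;
    pi_next: target policy probabilities there; a_next: the sampled next action."""
    sampled = q_next[a_next]               # Sarsa component (full sampling)
    expected = pi_next @ q_next            # Expected Sarsa component (expectation)
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```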

Yu, H., Mahmood, A. R., Sutton, R. S. (2018). On Generalized Bellman Equations and Temporal-Difference Learning. Journal of Machine Learning Research 19.

ABSTRACT: We
consider off-policy temporal-difference (TD) learning in
discounted Markov decision processes, where the goal is to
evaluate a policy in a model-free way by using observations of a
state process generated without executing the policy. To curb
the high variance issue in off-policy TD learning, we propose a
new scheme of setting the λ-parameters of TD, based on
generalized Bellman equations. Our scheme is to set λ according
to the eligibility trace iterates calculated in TD, thereby
easily keeping these traces in a desired bounded range. Compared
with prior work, this scheme is more direct and flexible, and
allows much larger λ values for off-policy TD learning with
bounded traces. As to its soundness, using Markov chain theory,
we prove the ergodicity of the joint state-trace process under
nonrestrictive conditions, and we show that associated with our
scheme is a generalized Bellman equation (for the policy to be
evaluated) that depends on both the evolution of λ and the
unique invariant probability measure of the state-trace process.
These results not only lead immediately to a characterization of
the convergence behavior of least-squares based implementation
of our scheme, but also prepare the ground for further analysis
of gradient-based implementations.

Lee, J. Y., Sutton, R. S. (2017). Integral Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space. ArXiv:1705.03520.

ABSTRACT:
Policy iteration (PI) is a recursive process of policy
evaluation and improvement for solving an optimal decision-making
problem, e.g., a reinforcement learning (RL) or optimal control
problem, and has served as a foundation for developing RL methods. Motivated
by integral PI (IPI) schemes in optimal control and RL methods
in continuous time and space (CTS), this paper proposes
on-policy IPI to solve the general RL problem in CTS, with its
environment modelled by an ordinary differential equation (ODE).
In such continuous domain, we also propose four off-policy IPI
methods—two are the ideal PI forms that use advantage and
Q-functions, respectively, and the other two are natural
extensions of the existing off-policy IPI schemes to our general
RL framework. Compared to the IPI methods in optimal control,
the proposed IPI schemes can be applied to more general
situations and do not require an initial stabilizing policy to
run; they are also strongly relevant to the RL algorithms in CTS
such as advantage updating, Q-learning, and value-gradient based
(VGB) greedy policy improvement. Our on-policy IPI is basically
model-based but can be made partially model-free; each
off-policy method is also either partially or completely
model-free. The mathematical properties of the IPI
methods—admissibility, monotone improvement, and convergence
towards the optimal solution—are all rigorously proven, together
with the equivalence of on- and off-policy IPI. Finally, the IPI
methods are simulated with an inverted-pendulum model to support
the theory and verify the performance.

Ghiassian, S., Rafiee, B., & Sutton, R. S. (2017). A first empirical study of emphatic temporal difference learning. arXiv preprint arXiv:1705.04185.

ABSTRACT: In this paper we present the first empirical
study of the emphatic temporal-difference learning algorithm
(ETD), comparing it with conventional temporal-difference
learning, in particular, with linear TD(0), on on-policy and
off-policy variations of the Mountain Car problem. The initial
motivation for developing ETD was that it has good convergence
properties under off-policy training (Sutton, Mahmood &
White 2016), but it is also a new algorithm for the on-policy
case. In both our on-policy and off-policy experiments, we found
that each method converged to a characteristic asymptotic level
of error, with ETD better than TD(0). TD(0) achieved a still
lower error level temporarily before falling back to its higher
asymptote, whereas ETD never showed this kind of “bounce”. In
the off-policy case (in which TD(0) is not guaranteed to
converge), ETD was significantly slower.

Mahmood, A. R., Yu, H., Sutton, R. S. (2017). Multi-step off-policy learning without importance sampling ratios. ArXiv:1702.03006.

ABSTRACT: To
estimate the value functions of policies from exploratory data,
most model-free off-policy algorithms rely on importance
sampling, where the use of importance sampling ratios often
leads to estimates with severe variance. It is thus desirable to
learn off-policy without using the ratios. However, such an
algorithm does not exist for multi-step learning with function
approximation. In this paper, we introduce the first such
algorithm based on temporal-difference (TD) learning updates. We
show that an explicit use of importance sampling ratios can be
eliminated by varying the amount of bootstrapping in TD updates
in an action-dependent manner. Our new algorithm achieves
stability using a two-timescale gradient-based TD update. A
prior algorithm based on lookup table representation called Tree
Backup can also be retrieved using action-dependent
bootstrapping, becoming a special case of our algorithm. In two
challenging off-policy tasks, we demonstrate that our algorithm
is stable, effectively avoids the large variance issue, and can
perform substantially better than its state-of-the-art
counterpart.

Barto, A. G., Thomas, P. S., Sutton, R. S. (2017). Some Recent Applications of Reinforcement Learning. In Proceedings of the Eighteenth Yale Workshop on Adaptive and Learning Systems.

ABSTRACT: Five
relatively recent applications of reinforcement learning methods
are described. These examples were chosen to illustrate a
diversity of application types, the engineering needed to build
applications, and most importantly, the impressive results that
these methods are able to achieve. This paper is based on a
case-study chapter of the forthcoming second edition of Sutton
and Barto’s 1998 book “Reinforcement Learning: An Introduction.”

Pilarski, P. M., Sutton, R. S., Mathewson, K. W., Sherstan, C., Parker, A. S., Edwards, A. L. (2017). Communicative Capital for Prosthetic Agents. ArXiv:1711.03676.

ABSTRACT: This
work presents an overarching perspective on the role that
machine intelligence can play in enhancing human abilities,
especially those that have been diminished due to injury or
illness. As a primary contribution, we develop the hypothesis
that assistive devices, and specifically artificial arms and
hands, can and should be viewed as agents in order for us to
most effectively improve their collaboration with their human
users. We believe that increased agency will enable more
powerful interactions between human users and next generation
prosthetic devices, especially when the sensorimotor space of
the prosthetic technology greatly exceeds the conventional
control and communication channels available to a prosthetic
user. To more concretely examine an agency-based view on
prosthetic devices, we propose a new schema for interpreting the
capacity of a human-machine collaboration as a function of both
the human’s and machine’s degrees of agency. We then introduce
the idea of communicative
capital as a way of thinking about the communication
resources developed by a human and a machine during their
ongoing interaction. Using this schema of agency and capacity,
we examine the benefits and disadvantages of increasing the
agency of a prosthetic limb. To do so, we present an analysis of
examples from the literature where building communicative
capital has enabled a progression of fruitful, task-directed
interactions between prostheses and their human users. We then
describe further work that is needed to concretely evaluate the
hypothesis that prostheses are best thought of as agents. The
agent-based viewpoint developed in this article significantly
extends current thinking on how best to support the natural,
functional use of increasingly complex prosthetic enhancements,
and opens the door for more powerful interactions between humans
and their assistive technologies.

Zhang, S., Sutton, R. S. (2017). A Deeper Look at Experience Replay. Symposium on Deep Reinforcement Learning at the 31st Conference on Neural Information Processing Systems. ArXiv:1712.01275.

ABSTRACT:
Experience replay has recently been widely used in various deep
reinforcement learning (RL) algorithms; in this paper we rethink
its utility. Experience replay introduces a new hyper-parameter,
the memory buffer size, which needs careful tuning, yet the
importance of this hyper-parameter has long been underestimated
in the community. We conducted a systematic empirical study of
experience replay under various function representations and
show that a large replay buffer can significantly hurt
performance. Moreover, we propose a simple O(1) method to remedy
the negative influence of a large replay buffer, and we showcase
its utility in both a simple grid world and challenging domains
like Atari games.
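The paper's O(1) remedy, combined experience replay, amounts to always including the most recent transition in the sampled batch. A sketch (the function name is ours):

```python
import random

def combined_replay_batch(buffer, latest_transition, batch_size):
    """Sample a minibatch uniformly from the replay buffer but always
    include the newest transition, so new experience is used immediately
    no matter how large the buffer is."""
    batch = random.sample(buffer, batch_size - 1)
    batch.append(latest_transition)
    return batch
```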

Veeriah, V., van Seijen, H., Sutton, R. S. (2017). Forward actor-critic for nonlinear function approximation in reinforcement learning. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems (pp. 556-564), Sao Paulo, Brazil.

ABSTRACT:
Multi-step methods are important in reinforcement learning (RL).
Eligibility traces, the usual way of handling them, works well
with linear function approximators. Recently, van Seijen (2016)
introduced a delayed learning approach, without eligibility
traces, for handling the multi-step λ-return with nonlinear
function approximators. However, this was limited to
action-value methods. In this paper, we extend this approach to
handle n-step returns, generalize this approach to policy
gradient methods and empirically study the effect of such
delayed updates in control tasks. Specifically, we introduce two
novel forward actor-critic methods and empirically investigate
our proposed methods with the conventional actor-critic method
on mountain car and pole-balancing tasks. From our experiments,
we observe that forward actor-critic dramatically outperforms
the conventional actor-critic in these standard control tasks.
Notably, this forward actor-critic method has produced a new
class of multi-step RL algorithms without eligibility traces.

Yu, H., Mahmood, A. R., Sutton, R. S. (2017). On Generalized Bellman Equations and Temporal-Difference Learning. Proceedings of the Canadian Artificial Intelligence Conference.

ABSTRACT: We
consider off-policy temporal-difference (TD) learning in
discounted Markov decision processes, where the goal is to
evaluate a policy in a model-free way by using observations of a
state process generated without executing the policy. To curb
the high variance issue in off-policy TD learning, we propose a
new scheme of setting the λ-parameters of TD, based on
generalized Bellman equations. Our scheme is to set λ according
to the eligibility trace iterates calculated in TD, thereby
easily keeping these traces in a desired bounded range. Compared
with prior work, this scheme is more direct and flexible, and
allows much larger λ values for off-policy TD learning with
bounded traces. As to its soundness, using Markov chain theory,
we prove the ergodicity of the joint state-trace process under
nonrestrictive conditions, and we show that associated with our
scheme is a generalized Bellman equation (for the policy to be
evaluated) that depends on both the evolution of λ and the
unique invariant probability measure of the state-trace process.
These results not only lead immediately to a characterization of
the convergence behavior of least-squares based implementation
of our scheme, but also prepare the ground for further analysis
of gradient-based implementations.

Veeriah, V., Zhang, S., Sutton, R. S. (2017). Crossprop: Learning Representations by Stochastic Meta-Gradient Descent in Neural Networks. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

ABSTRACT:
Representations are fundamental to artificial intelligence. The
performance of a learning system depends on the type of
representation used for representing the data. Typically, these
representations are hand-engineered using domain knowledge. More
recently, the trend is to learn these representations through
stochastic gradient descent in multi-layer neural networks,
which is called backprop. Learning the representations directly
from the incoming data stream reduces the human labour involved
in designing a learning system. More importantly, it allows a
learning system to scale to difficult tasks. In this paper,
we introduce a new incremental learning algorithm called
crossprop, which learns incoming weights of hidden units based
on the meta-gradient descent approach that was previously
introduced by Sutton (1992) and Schraudolph (1999) for learning
step-sizes. The final update equation introduces an additional
memory parameter for each of these weights and generalizes the
backprop update equation. From our experiments, we show that
crossprop learns and reuses its feature representation while
tackling new and unseen tasks, whereas backprop relearns a new
feature representation.

Ludvig, E. A., Mirian, M. S., Kehoe, E. J., Sutton, R. S. (2017). Associative learning from replayed experience. http://biorxiv.org/content/early/2017/01/16/100800.

ABSTRACT: We
develop an extension of the Rescorla-Wagner model of associative
learning. In addition to learning from the current trial, the
new model supposes that animals store and replay previous
trials, learning from the replayed trials using the same
learning rule. This simple idea provides a unified explanation
for diverse phenomena that have proved challenging to earlier
associative models, including spontaneous recovery, latent
inhibition, retrospective revaluation, and trial spacing
effects. For example, spontaneous recovery is explained by
supposing that the animal replays its previous trials during the
interval between extinction and test. These include earlier
acquisition trials as well as recent extinction trials, and thus
there is a gradual re-acquisition of the conditioned response.
We present simulation results for the simplest version of this
replay idea, where the trial memory is assumed empty at the
beginning of an experiment, all experienced trials are stored
and none removed, and sampling from the memory is performed at
random. Even this minimal replay model is able to explain the
challenging phenomena, illustrating the explanatory power of an
associative model enhanced by learning from remembered as well
as real experiences.
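The minimal replay model described above is simple enough to state in a few lines; a sketch, where V is a dict-like map from stimuli to associative strengths:

```python
import random
from collections import defaultdict

def rw_update(V, trial, alpha=0.1):
    """Rescorla-Wagner: every present stimulus moves toward the outcome lam
    in proportion to the aggregate prediction error."""
    stimuli, lam = trial
    error = lam - sum(V[s] for s in stimuli)
    for s in stimuli:
        V[s] += alpha * error

def trial_with_replay(V, memory, trial, n_replays=5):
    """Learn from the real trial, store it, then apply the same rule to
    randomly sampled remembered trials (the minimal replay model)."""
    rw_update(V, trial)
    memory.append(trial)
    for _ in range(n_replays):
        rw_update(V, random.choice(memory))

V, memory = defaultdict(float), []
trial_with_replay(V, memory, (frozenset({"tone", "light"}), 1.0))  # a reinforced compound trial
```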

Murphy, S. A., Deng, Y., Laber, E. B., Maei, H. R., Sutton, R. S., Witkiewitz, K. (2016). A batch, off-policy, actor-critic algorithm for optimizing the average reward. arXiv preprint arXiv:1607.05047.

ABSTRACT: We develop an off-policy actor–critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals. This algorithm is developed with a view toward its use in mobile health.

Veeriah, V., Pilarski, P. M., Sutton, R. S. (2016). Face valuing: Training user interfaces with facial expressions and reinforcement learning. Int. J. Conference on Artificial Intelligence, Workshop on Interactive Machine Learning. Also arXiv 1606.02807.

ABSTRACT: An
important application of interactive machine learning is
extending or amplifying the cognitive and physical capabilities
of a human. To accomplish this, machines need to learn about
their human users' intentions and adapt to their preferences. In
most current research, a user has conveyed preferences to a
machine using explicit corrective or instructive feedback;
explicit feedback imposes a cognitive load on the user and is
expensive in terms of human effort. The primary objective of the
current work is to demonstrate that a learning agent can reduce
the amount of explicit feedback required for adapting to the
user's preferences pertaining to a task by learning to perceive
a value of its behavior from the human user, particularly from
the user's facial expressions---we call this face valuing. We
empirically evaluate face valuing on a grip selection task. Our
preliminary results suggest that an agent can quickly adapt to a
user's changing preferences with minimal explicit feedback by
learning a value function that maps facial features extracted
from a camera image to expected future reward. We believe that
an agent learning to perceive a value from the body language of
its human user is complementary to existing interactive machine
learning approaches and will help in creating successful
human-machine interactive applications.

Sutton, R. S., Mahmood, A. R., White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research 17(73):1-29. Also arXiv:1503.04269.

ABSTRACT: In
this paper we introduce the idea of improving the performance of
parametric temporal-difference (TD) learning algorithms by
selectively emphasizing or de-emphasizing their updates on
different time steps. In particular, we show that varying the
emphasis of linear TD(λ)’s updates in a particular way causes
its expected update to become stable under off-policy training.
The only prior model-free TD methods to achieve this with
per-step computation linear in the number of function
approximation parameters are the gradient-TD family of methods
including TDC, GTD(λ), and GQ(λ). Compared to these methods, our
emphatic TD(λ) is
simpler and easier to use; it has only one learned parameter
vector and one step-size parameter. Our treatment includes
general state-dependent discounting and bootstrapping functions,
and a way of specifying varying degrees of interest in
accurately valuing different states.
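For reference, with constant discounting and bootstrapping the emphatic TD(λ) update reduces to a few lines. A sketch: F and e start at zero, and rho_prev is the previous step's importance sampling ratio.

```python
import numpy as np

def etd_update(w, e, F, rho_prev, x, r, x_next, rho,
               alpha, gamma, lam, interest=1.0):
    """One step of emphatic TD(lambda) with constant gamma and lambda."""
    F = rho_prev * gamma * F + interest            # followon trace
    M = lam * interest + (1.0 - lam) * F           # emphasis for this step
    e = rho * (gamma * lam * e + M * x)            # emphatic eligibility trace
    delta = r + gamma * (w @ x_next) - w @ x       # TD error
    w = w + alpha * delta * e
    return w, e, F                                 # caller sets rho_prev = rho
```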

Sutton, R. S. (2015). True online Emphatic TD(λ): Quick Reference and Implementation Guide. ArXiv. Code is available in Python and C++ by downloading the source files of this arXiv paper as a zip archive.

Edwards, A. L., Dawson, M. R., Hebert, J. S., Sherstan, C., Sutton, R. S., Chan, K. M., Pilarski, P. M. (2015). Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching. Prosthetics and Orthotics International, first published September 15, 2015, doi:0309364615605373.

ABSTRACT:

Background: Myoelectric prostheses currently used by amputees can be difficult to control. Machine learning, and in particular learned predictions about user intent, could help to reduce the time and cognitive load required by amputees while operating their prosthetic device.

Objectives: The goal of this study was to compare two switching-based methods of controlling a myoelectric arm: non-adaptive (or conventional) control and adaptive control (involving real-time prediction learning).

Study Design: Case series study.

Methods: We compared non-adaptive and adaptive control in two different experiments. In the first, one amputee and one non-amputee subject controlled a robotic arm to perform a simple task; in the second, three able-bodied subjects controlled a robotic arm to perform a more complex task. For both tasks, we calculated the mean time and total number of switches between robotic arm functions over three trials.

Results: Adaptive control significantly decreased the number of switches and total switching time for both tasks compared with the conventional control method.

Conclusion: Real-time prediction learning was successfully used to improve the control interface of a myoelectric robotic arm during uninterrupted use by an amputee subject and able-bodied subjects.

Clinical Relevance: Adaptive control using real-time prediction learning has the potential to help decrease both the time and the cognitive load required by amputees in real-world functional situations when using myoelectric prostheses.

van Hasselt, H., Sutton, R. S. (2015) Learning to Predict Independent of Span. ArXiv:1508.04582.

ABSTRACT: We
consider how to learn multi-step predictions efficiently.
Conventional algorithms wait until observing actual outcomes
before performing the computations to update their
predictions. If predictions are made at a high rate or span
over a large amount of time, substantial computation can be
required to store all relevant observations and to update all
predictions when the outcome is finally observed. We show that
the exact same predictions can be learned in a much more
computationally congenial way, with uniform per-step
computation that does not depend on the span of the
predictions. We apply this idea to various settings of
increasing generality, repeatedly adding desired properties
and each time deriving an equivalent span-independent
algorithm for the conventional algorithm that satisfies these
desiderata. Interestingly, along the way several known
algorithmic constructs emerge spontaneously from our
derivations, including dutch eligibility traces, temporal
difference errors, and averaging. This allows us to link these
constructs one-to-one to the corresponding desiderata,
unambiguously connecting the ‘how’ to the ‘why’. At each step, we
make sure that the derived algorithm subsumes the previous
algorithms, thereby retaining their properties. Ultimately we
arrive at a single general temporal-difference algorithm that
is applicable to the full setting of reinforcement learning.

Pilarski, P. M., Sutton, R. S., Mathewson, K. W. (2015). Prosthetic Devices as Goal-Seeking Agents. Workshop on Present and Future of Non-invasive Peripheral Nervous System Machine Interfaces at the International Conference on Rehabilitation Robotics, Singapore.

ABSTRACT: In
this article we develop the perspective that assistive
devices, and specifically artificial arms and hands, may be
beneficially viewed as goal-seeking agents. We further suggest
that taking this perspective enables more powerful
interactions between human users and next generation
prosthetic devices, especially when the sensorimotor space of
the prosthetic technology greatly exceeds the conventional
myoelectric control and communication channels available to a
prosthetic user. As a principal contribution, we propose a
schema for thinking about the capacity of a human-machine
collaboration as a function of both the human’s and machine’s
degrees of agency. Using this schema, we present a brief
analysis of three examples from the literature where agency or
goal-seeking behaviour by a prosthesis has enabled a
progression of fruitful, task-directed interactions between a
prosthetic assistant and a human director. While preliminary,
the agent-based viewpoint developed in this article extends
current thinking on how best to support the natural,
functional use of increasingly complex prosthetic
enhancements.

van Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C., Sutton, R. S. (2016). True online temporal-difference learning. Journal of Machine Learning Research 17(145):1-40. Also arXiv:1502.04087.

ABSTRACT: The
temporal-difference methods TD(λ) and Sarsa(λ) form a core
part of modern reinforcement learning. Their appeal comes from
their good performance, low computational cost, and their
simple interpretation, given by their forward view. Recently,
new versions of these methods were introduced, called true
online TD(λ) and true online Sarsa(λ), respectively (van
Seijen & Sutton, 2014). Algorithmically, these true online
methods only make two small changes to the update rules of the
regular methods, and the extra computational cost is
negligible in most cases. However, they follow the ideas
underlying the forward view much more closely. In particular,
they maintain an exact equivalence with the forward view at
all times, whereas the traditional versions only approximate
it for small step-sizes. We hypothesize that these true online
methods not only have better theoretical properties, but also
dominate the regular methods empirically. In this article, we
put this hypothesis to the test by performing an extensive
empirical comparison. Specifically, we compare the performance
of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on
random MRPs, a real-world myoelectric prosthetic arm, and a
domain from the Arcade Learning Environment. We use linear
function approximation with tabular, binary, and non-binary
features. Our results suggest that the true online methods
indeed dominate the regular methods. Across all
domains/representations the learning speed of the true online
methods is often better, but never worse, than that of the
regular methods. An additional advantage is that no choice
between traces has to be made for the true online methods.
Besides the empirical results, we provide an in-depth analysis
of the theory behind true online temporal-difference learning.
In addition, we show that new true online temporal-difference
methods can be derived by making changes to the online forward
view and then rewriting the update equations.
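
For readers wanting the algorithmic core, here is a minimal sketch of true online TD(λ) for linear value prediction, following the published update equations; the episode/feature interface and parameter values are illustrative assumptions, not code from the paper:

    import numpy as np

    def true_online_td_lambda(episodes, n_features, alpha=0.01,
                              gamma=0.99, lam=0.9):
        # Sketch of true online TD(lambda) with linear function
        # approximation. `episodes` is assumed to yield sequences of
        # (phi, reward, phi_next) transitions with feature vectors phi.
        theta = np.zeros(n_features)
        for episode in episodes:
            e = np.zeros(n_features)  # dutch-style eligibility trace
            v_old = 0.0
            for phi, reward, phi_next in episode:
                v = theta @ phi
                v_next = theta @ phi_next
                delta = reward + gamma * v_next - v
                # Dutch trace: the third term distinguishes it from
                # accumulating and replacing traces.
                e = gamma * lam * e + phi \
                    - alpha * gamma * lam * (e @ phi) * phi
                # The (v - v_old) correction terms maintain the exact
                # equivalence with the online forward view.
                theta += alpha * (delta + v - v_old) * e \
                         - alpha * (v - v_old) * phi
                v_old = v_next
        return theta

The only extra state relative to conventional TD(λ) is the scalar v_old, which is why the added computational cost is negligible.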

van Seijen, H., Mahmood, A. R., Pilarski, P. M., Sutton, R. S. (2015). An empirical evaluation of true online TD(λ). In Proceedings of the 2015 European Workshop on Reinforcement Learning.

ABSTRACT: The
true online TD(λ) algorithm has recently been proposed (van
Seijen and Sutton, 2014) as a universal replacement for the
popular TD(λ) algorithm, in temporal-difference learning and
reinforcement learning. True online TD(λ) has better
theoretical properties than conventional TD(λ), and the
expectation is that it also results in faster learning. In
this paper, we put this hypothesis to the test by empirically
comparing true online TD(λ) with TD(λ) on a variety of
domains. Our domains include a challenging one-state and
two-state example, random Markov reward processes, and a
real-world myoelectric prosthetic arm. We use linear function
approximation with tabular, binary, and non-binary features.
We assess the algorithms along three dimensions: computational
cost, learning speed, and ease of use. Our results confirm the
strength of true online TD(λ): 1) for sparse feature vectors,
the computational overhead with respect to TD(λ) is
negligible; for non-sparse features the computation time is at
most twice that of TD(λ), 2) across all
domains/representations the learning speed of true online
TD(λ) is often better, but never worse than that of TD(λ), and
3) true online TD(λ) is easier to use, because it does not
require choosing between trace types, and it is generally more
stable with respect to the step-size. Overall, our results
suggest that true online TD(λ) should be the first choice when
looking for an efficient, general-purpose TD method.

Mahmood, A. R., Yu, H., White, M., Sutton, R. S. (2015). Emphatic temporal-difference learning. In Proceedings of the 2015 European Workshop on Reinforcement Learning.

ABSTRACT:
Emphatic algorithms are temporal-difference learning
algorithms that change their effective state distribution by
selectively emphasizing and de-emphasizing their updates on
different time steps. Recent works by Sutton, Mahmood and
White (2015) and Yu (2015) show that by varying the emphasis
in a particular way these algorithms become stable and
convergent under off-policy training with linear function
approximation. This paper serves as a unified summary of the
available results from both works. Additionally, we
empirically demonstrate the benefits of emphatic algorithms,
which stem from their flexible use of state-dependent
discounting, bootstrapping, and a user-specified allocation of
function approximation resources.
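
As a rough illustration of the emphasis mechanism, here is a one-step sketch of emphatic TD(λ) for the simplified case of constant discounting, bootstrapping, and interest; variable names and the calling convention are assumptions, and the papers cited above give the general state-dependent form:

    import numpy as np

    def emphatic_td_step(theta, e, F, phi, reward, phi_next, rho,
                         alpha=0.01, gamma=0.99, lam=0.9, interest=1.0):
        # One emphatic TD(lambda) step; rho is the importance-sampling
        # ratio pi(a|s)/b(a|s) for the action just taken. Start with
        # e = 0 and F = 0; F is returned pre-scaled by rho so that the
        # followon trace uses the previous step's ratio.
        delta = reward + gamma * (theta @ phi_next) - theta @ phi
        F = gamma * F + interest               # followon trace
        M = lam * interest + (1.0 - lam) * F   # emphasis for this step
        e = rho * (gamma * lam * e + M * phi)  # emphatic trace
        theta = theta + alpha * delta * e
        return theta, e, rho * F

The emphasis M selectively up- and down-weights updates on different time steps, which is what restores stability under off-policy training.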

Mahmood, A. R., Sutton, R. S. (2015). Off-policy learning based on weighted importance sampling with linear computational complexity. In Proceedings of the 2015 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT:
Importance sampling is an essential component of model-free
off-policy learning algorithms. Weighted importance sampling
(WIS) is generally considered superior to ordinary importance
sampling but, when combined with function approximation, it
has hitherto required computational complexity that is O(n^2)
or more in the number of features. In this paper we introduce
new off-policy learning algorithms that obtain the benefits of
WIS with O(n) computational complexity. Our algorithms
maintain for each component of the parameter vector a measure
of the extent to which that component has been used in
previous examples. This measure is used to determine
component-wise step sizes, merging the ideas of stochastic
gradient descent and sample averages. We present our main
WIS-based algorithm first in an intuitive acausal form (the
forward view) and then derive a causal algorithm using
eligibility traces that is equivalent but more efficient (the
backward view). In three small experiments, our algorithms
performed significantly better than prior O(n) algorithms for
off-policy policy evaluation. We also show that our adaptive
step-size technique alone can improve the performance of
on-policy algorithms such as TD(λ) and true online TD(λ).
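
To ground the distinction the abstract draws, here is a tabular illustration of ordinary versus weighted importance sampling for off-policy policy evaluation; the paper's actual contribution, which this sketch does not reproduce, is carrying the WIS idea over to linear function approximation at O(n) cost:

    import numpy as np

    def importance_sampling_estimates(returns, rhos):
        # `returns` are returns observed under the behavior policy;
        # `rhos` are the corresponding products of per-step likelihood
        # ratios pi/b for the target and behavior policies.
        returns = np.asarray(returns, dtype=float)
        rhos = np.asarray(rhos, dtype=float)
        ois = np.mean(rhos * returns)                # unbiased, high variance
        wis = np.sum(rhos * returns) / np.sum(rhos)  # biased, lower variance
        return ois, wis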

van Seijen, H., Sutton, R. S. (2015). A deeper look at planning as learning from replay. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France.

ABSTRACT: In
reinforcement learning, the notions of experience replay, and
of planning as learning from replayed experience, have long
been used to find good policies with minimal training data.
Replay can be seen either as model-based reinforcement
learning, where the store of past experiences serves as the
model, or as a way to avoid a conventional model of the
environment altogether. In this paper, we look more deeply at
how replay blurs the line between model-based and model-free
methods. First, we show for the first time an exact
equivalence between the sequence of value functions found by a
model-based policy-evaluation method and by a model-free
method with replay. Second, we present a general replay method
that can mimic a spectrum of methods ranging from the
explicitly model-free (TD(0)) to the explicitly model-based
(linear Dyna). Finally, we use insights gained from these
relationships to design a new model-based reinforcement
learning algorithm for linear function approximation. This
method, which we call forgetful LSTD(λ), improves upon regular
LSTD(λ) because it extends more naturally to online control,
and improves upon linear Dyna because it is a multi-step
method, enabling it to perform well even in non-Markov
problems or, equivalently, in problems with significant
function approximation.
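
The equivalence claimed above can be seen in miniature with the simplest case: linear TD(0) applied repeatedly to a stored buffer of transitions, with the buffer playing the role of the model. This is an illustrative sketch of replay-as-planning under assumed interfaces, not the paper's forgetful LSTD(λ):

    import numpy as np

    def td0_with_replay(buffer, n_features, alpha=0.1, gamma=0.99,
                        sweeps=50):
        # `buffer` is assumed to hold (phi, reward, phi_next) transitions.
        # Repeated sweeps of TD(0) over the buffer drive the weights
        # toward the values a model-based evaluation method would
        # compute from the same experience.
        theta = np.zeros(n_features)
        for _ in range(sweeps):
            for phi, reward, phi_next in buffer:
                delta = reward + gamma * (theta @ phi_next) - theta @ phi
                theta += alpha * delta * phi
        return theta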

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2014). Time course of the rabbit's nictitating membrane during acquisition, extinction, and reacquisition. Learning and Memory 21:585-590. Cold Spring Harbor Press.

ABSTRACT: The
present experiment tested whether or not the time course of a
conditioned eyeblink response, particularly its duration,
would expand and contract, as the magnitude of the conditioned
response (CR) changed massively during acquisition,
extinction, and reacquisition. The CR duration remained
largely constant throughout the experiment, while CR onset and
peak time occurred slightly later during extinction. The
results suggest that computational models can account for
these results by using two layers of plasticity conforming to
the sequence of synapses in the cerebellar pathways that
mediate eyeblink conditioning.

Mahmood, A. R., van Hasselt, H., Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In: Advances in Neural Information Processing Systems 27.

ABSTRACT: Importance
sampling
is an essential component of off-policy model-free
reinforcement learning algorithms. However, its most
effective variant, weighted
importance sampling, does not carry over easily to
function approximation and, because of this, it is not
utilized in existing off-policy learning algorithms. In
this paper, we take two steps toward bridging this gap.
First, we show that weighted importance sampling can be
viewed as a special case of weighting the error of
individual training samples, and that this weighting has
theoretical and empirical benefits similar to those of
weighted importance sampling. Second, we show that these
benefits extend to a new weighted-importance-sampling
version of off-policy LSTD(λ). We
show empirically that our new WIS-LSTD(λ)
algorithm can result in much more rapid and reliable
convergence than conventional off-policy LSTD(λ) (Yu
2010, Bertsekas & Yu 2009).

Yao, H., Szepesvari, Cs., Sutton, R. S., Modayil, J., Bhatnagar, S. (2014). Universal Option Models. In: Advances in Neural Information Processing Systems 27.

ABSTRACT: We
consider the problem of learning models of options for
real-time abstract planning,
in the setting where reward functions can be specified at
any time and their expected
returns
must be efficiently computed. We introduce a new model for
an option that is independent of any reward function,
called the universal
option model (UOM). We prove that the UOM of an
option can construct a traditional option model given a
reward function, and also supports efficient computation
of the option-conditional return. We extend the UOM to
linear function approximation, and we show it gives the TD
solution of option returns and value functions of policies
over options. We provide a stochastic approximation
algorithm for incrementally learning UOMs from data and
prove its consistency. We demonstrate our method in two
domains. The first domain is a real-time strategy game,
where the
controller must select the best game unit to accomplish
dynamically-specified
tasks. The second domain is article recommendation, where
each user query defines
a new reward function and an article’s relevance is the
expected return from
following a policy that follows the citations between
articles. Our experiments show that UOMs are substantially
more efficient than previously known methods in evaluating
option returns and policies over options.

Edwards, A. L., Dawson, M. R., Hebert, J. S., Sutton, R. S., Chan, K. M., Pilarski, P.M. (2014). Adaptive switching in practice: Improving myoelectric prosthesis performance through reinforcement learning. In: Proceedings of MEC14: Myoelectric Controls Symposium, Fredericton, New Brunswick, Canada.

van Hasselt, H., Mahmood, A. R., Sutton, R. S. (2014). Off-policy TD(λ) with a true online equivalence. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, Canada.

ABSTRACT: Van
Seijen and Sutton (2014) recently proposed a new version of
the linear TD(λ) learning algorithm that is exactly equivalent
to an online forward view and that empirically performed
better than its classical counterpart in both prediction and
control problems. However, their algorithm is restricted to
on-policy learning. In the more general case of off-policy
learning, in which the policy whose outcome is predicted and
the policy used to generate data may be different, their
algorithm cannot be applied. One reason for this is that the
algorithm bootstraps and thus is subject to instability
problems when function approximation is used. A second reason
true online TD(λ) cannot be used for off-policy learning is
that the off-policy case requires sophisticated importance
sampling in its eligibility traces. To address these
limitations, we generalize their equivalence result and use
this generalization to construct the first online algorithm to
be exactly equivalent to an off-policy forward view. We show
this algorithm, named true online GTD(λ), empirically
outperforms GTD(λ) (Maei, 2011) which was derived from the same
objective as our forward view but lacks the exact online
equivalence. In the general theorem that allows us to derive
this new algorithm, we encounter a new general
eligibility-trace update.

Sutton, R. S., Mahmood, A. R., Precup, D., van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China.

ABSTRACT:
Q-learning, the most popular of reinforcement learning
algorithms, has always included an extension to eligibility
traces to enable more rapid learning and improved asymptotic
performance on non-Markov problems. The λ parameter smoothly
shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a
pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ
= 1). In off-policy algorithms, including Q(λ), GQ(λ), and
off-policy LSTD(λ), the λ parameter is intended to play the
same role, but does not; on every exploratory action these
algorithms bootstrap regardless of the value of λ, and as a
result they fail to approximate Monte Carlo learning when λ =
1. It may seem that this is inevitable for any online
off-policy algorithm; if updates are made on each step on
which the target policy is followed, then how could just the
right updates be ‘un-made’ upon deviation from the target
policy? In this paper, we introduce a new version of Q(λ) that
does exactly that, without significantly increased algorithmic
complexity. En route to our new Q(λ), we introduce a new
derivation technique based on the forward-view/backward-view
analysis familiar from TD(λ) but extended to apply at every
time step rather than only at the end of episodes. We apply
this technique to derive first a new off-policy version of
TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).

van Seijen, H., Sutton, R. S. (2014). True online TD(λ). In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China. As published. With two minor errors corrected.

ABSTRACT: TD(λ)
is a core algorithm of modern reinforcement learning. Its
appeal comes from its equivalence to a clear and conceptually
simple forward view, and the fact that it can be implemented
online in an inexpensive manner. However, the equivalence
between TD(λ) and the forward view is exact only for the
off-line version of the algorithm (in which updates are made
only at the end of each episode). In the online version of
TD(λ) (in which updates are made at each step, which generally
performs better and is always used in applications) the match
to the forward view is only approximate. In a sense this is
unavoidable for the conventional forward view, as it itself
presumes that the estimates are unchanging during an episode.
In this paper we introduce a new forward view that takes into
account the possibility of changing estimates and a new
variant of TD(λ) that exactly achieves it. Our algorithm uses
a new form of eligibility trace similar to but different from
conventional accumulating and replacing traces. The overall
computational complexity is the same as TD(λ), even when using
function approximation. In our empirical comparisons, our
algorithm outperformed TD(λ) in all of its variations. It
seems, by adhering more truly to the original goal of
TD(λ)—matching an intuitively clear forward view even in the
online case—that we have found a new algorithm that simply
improves on classical TD(λ).

Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale Nexting in a Reinforcement Learning Robot. Adaptive Behavior 22(2):146-160.

ABSTRACT: The
term ‘nexting’ has been used by psychologists to refer to the
propensity of people and many other animals to continually
predict what will happen next in an immediate, local, and
personal sense. The ability to ‘next’ constitutes a basic kind
of awareness and knowledge of one’s environment. In this paper
we present results with a robot that learns to next in real
time, making thousands of predictions about sensory input
signals at timescales from 0.1 to 8 seconds. Our predictions
are formulated as a generalization of the value functions
commonly used in reinforcement learning, where now an
arbitrary function of the sensory input signals is used as a
pseudo reward, and the discount rate determines the timescale.
We show that six thousand predictions, each computed as a
function of six thousand features of the state, can be learned
and updated online ten times per second on a laptop computer,
using the standard TD(λ) algorithm with linear function approximation.
This approach is sufficiently computationally efficient to be
used for real-time learning on the robot and sufficiently data
efficient to achieve substantial accuracy within 30 minutes.
Moreover, a single tile-coded feature representation suffices
to accurately predict many different signals over a
significant range of timescales. We also extend nexting beyond
simple timescales by letting the discount rate be a function
of the state and show that nexting predictions of this more
general form can also be learned with substantial accuracy.
General nexting provides a simple yet powerful mechanism for a
robot to acquire predictive knowledge of the dynamics of its
environment.
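
Because all predictions share one feature vector, the thousands of nexting predictions can be updated with a few matrix operations. A minimal sketch, with assumed shapes and parameter values (the paper additionally normalizes the step size by the number of active features):

    import numpy as np

    def nexting_update(W, E, phi, phi_next, pseudo_rewards, gammas,
                       alpha=0.1, lam=0.9):
        # W and E are (n_predictions, n_features) weight and trace
        # matrices; each prediction has its own pseudo reward and
        # discount rate (timescale), but all share the feature vector.
        deltas = pseudo_rewards + gammas * (W @ phi_next) - W @ phi
        E = (gammas * lam)[:, None] * E + phi[None, :]  # accumulating traces
        W = W + alpha * deltas[:, None] * E
        return W, E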

Modayil, J., Sutton, R. S. (2014). Prediction Driven Behavior: Learning Predictions that Drive Fixed Responses. The AAAI-14 Workshop on Artificial Intelligence and Robotics, Quebec City, Quebec, Canada.

ABSTRACT: We
introduce a new method for robot control that combines
prediction learning with a fixed, crafted response—the robot
learns to make a temporally-extended prediction during its
normal operation, and the prediction is used to select actions
as part of a fixed behavioral response. Our method is inspired
by Pavlovian conditioning experiments in which an animal’s
behavior adapts as it learns to predict an event. Surprisingly
the animal’s behavior changes even in the absence of any
benefit to the animal (i.e.
the animal is not modifying its behavior to maximize reward).
Our method for robot control combines a fixed response with
online prediction learning, thereby producing an adaptive
behavior. This method is different from standard non-adaptive
control methods and also from adaptive reward-maximizing
control methods. We show that this method improves upon the
performance of two reactive controls, with visible benefits
within 2.5 minutes of real-time learning on the robot. In the
first experiment, the robot turns off its motors when it
predicts a future over-current condition, which reduces the
time spent in unsafe over-current conditions and improves
efficiency. In the second experiment, the robot starts to move
when it predicts a human-issued request, which reduces the
apparent latency of the human-robot interface.

White, A., Modayil, J., Sutton, R. S. (2014). Surprise and curiosity for big data robotics. In Workshop on Sequential Decision-Making with Big Data at the Twenty-Eighth AAAI Conference on Artificial Intelligence.

ABSTRACT: This
paper introduces a new perspective on curiosity and
intrinsic motivation, viewed as the problem of generating
behavior data for parallel off-policy learning. We provide
1) the first measure of surprise based on off-policy
general value function learning progress, 2) the first
investigation of reactive behavior control with parallel
gradient temporal difference learning and function
approximation, and 3) the first demonstration of using
curiosity driven control to react to a non-stationary
learning task—all on a mobile robot. Our approach improves
scalability over previous off-policy, robot learning
systems, essential for making progress on the ultimate
big-data decision making problem—life-long robot learning.

Edwards, A. L., Kearney, A., Dawson, M. R., Sutton, R. S., Pilarski, P. M. (2013). Temporal-difference learning to assist human decision making during the control of an artificial limb. In: Proceedings of the First Interdisciplinary Conference on Reinforcement Learning and Decision Making, and http://arxiv.org/abs/1309.4714.

ABSTRACT: In
this work we explore the use of reinforcement learning (RL) to
help with human decision making, combining state-of-the-art RL
algorithms with an application to prosthetics. Managing
human-machine interaction is a problem of considerable scope,
and the simplification of human-robot interfaces is especially
important in the domains of biomedical technology and
rehabilitation medicine. For example, amputees who control
artificial limbs are often required to quickly switch between
a number of control actions or modes of operation in order to
operate their devices. We suggest that by learning to
anticipate (predict) a user's behaviour, artificial limbs
could take on an active role in a human's control decisions so
as to reduce the burden on their users. Recently, we showed
that RL in the form of general value functions (GVFs) could be
used to accurately detect a user's control intent prior to
their explicit control choices. In the present work, we
explore the use of temporal-difference learning and GVFs to
predict when users will switch their control influence between
the different motor functions of a robot arm. Experiments were
performed using a multi-function robot arm that was controlled
by muscle signals from a user's body (similar to conventional
artificial limb control). Our approach was able to acquire and
maintain forecasts about a user's switching decisions in real
time. It also provides an intuitive and reward-free way for
users to correct or reinforce the decisions made by the
machine learning system. We expect that when a system is
certain enough about its predictions, it can begin to take
over switching decisions from the user to streamline control
and potentially decrease the time and effort needed to
complete tasks. This preliminary study therefore suggests a
way to naturally integrate human- and machine-based decision
making systems.

Silver, D., Sutton, R. S., Mueller, M. (2013). Temporal-difference search in computer Go. In: Proceedings of the ICAPS-13 Workshop on Planning and Learning.

ABSTRACT:
Temporal-difference (TD) learning is one of the most
successful and broadly applied solutions to the reinforcement
learning problem; it has been used to achieve master-level
play in chess, checkers and backgammon. Monte-Carlo tree
search is a recent algorithm for simulation-based search,
which has been used to achieve master-level play in Go. We
have introduced a new approach to high-performance planning
(Silver et al., 2012). Our method, TD search, combines TD
learning with simulation-based search. Like Monte-Carlo tree
search, value estimates are updated by learning online from
simulated experience. Like TD learning, it uses value function
approximation and bootstrapping to efficiently generalise
between related states. We applied TD search to the game of 9
× 9 Go, using a million binary features matching simple
patterns of stones. Without any explicit search tree, our
approach outperformed a vanilla Monte-Carlo tree search with
the same number of simulations. When combined with a simple
alpha-beta search, our program also outperformed all
traditional (pre-Monte-Carlo) search and machine learning
programs on the 9 × 9 Computer Go Server.

Mahmood, A. R., Sutton, R. S. (2013). Representation Search through Generate and Test. In: Proceedings of the AAAI Workshop on Learning Rich Representations from Low-Level Sensors, Bellevue, Washington, USA.

ABSTRACT:
Learning representations from data is one of the fundamental
problems of artificial intelligence and machine learning. Many
different approaches exist for learning representations, but
what constitutes a good representation is not yet well
understood. Much less is known about how to learn
representations from an unending stream of data. In this work,
we view the problem of representation learning as one of
learning features (e.g., hidden units of neural networks) such
that performance of the underlying base system continually
improves. We study an important case where learning is done
fully online (i.e., on an example-by-example basis). In the
presence of an unending stream of data, the computational cost
of the learning element should not grow with time and cannot
be much more than that of the performance element. Few methods
can be used effectively in this case. We show that a search
approach to representation learning can naturally fit into
this setting. In this approach good representations are
searched by generating different features and then testing
them for utility. We show that a fully online method can be
developed, which utilizes this generate and test approach,
learns representations continually, and adds only a small
fraction to the overall computation. We believe online
representation search constitutes an important step toward
effective and inexpensive solutions to representation learning
problems.
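
A minimal sketch of the generate-and-test loop, with an assumed base learner (LMS over random linear-threshold features) and a deliberately crude tester (weight magnitude as utility); the paper's testers, generators, and replacement schedule differ:

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_and_test_step(V, w, x, target, alpha=0.01,
                               replace_fraction=0.02):
        # V: (n_features, n_inputs) feature weights; w: output weights.
        phi = (V @ x > 0).astype(float)   # binary features
        delta = target - w @ phi          # base LMS learner
        w += alpha * delta * phi
        # Test: rank features by a utility estimate (here |w|), then
        # generate fresh random features to replace the worst few.
        n_replace = max(1, int(replace_fraction * len(w)))
        worst = np.argsort(np.abs(w))[:n_replace]
        V[worst] = rng.standard_normal(V[worst].shape)
        w[worst] = 0.0
        return V, w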

van Seijen, H., Sutton, R. S. (2013). Efficient planning in MDPs by small backups. In: Proceedings of the 30th International Conference on Machine Learning, pp. 361-369, Atlanta, USA.

ABSTRACT:
Efficient planning plays a crucial role in model-based
reinforcement learning. This efficiency is largely determined
by the order in which backups are performed. A particularly
effective ordering strategy is the strategy employed by
prioritized sweeping. Prioritized sweeping orders backups
according to a heuristic, such that backups that cause a large
value change are selected first. The Bellman error predicts
the value change caused by a full backup exactly, but its
computation is expensive. Hence, methods do not use the
Bellman error as a heuristic, but try to approximate the
ordering induced by it with a heuristic that is cheaper to
compute. We introduce the first efficient prioritized sweeping
method that uses exact Bellman error ordering. The core of
this method is a novel backup that allows for efficient
computation of the Bellman error, while its effect as well as
its computational complexity is equal to that of a full
backup. We demonstrate empirically that the obtained method
achieves substantially higher convergence rates than other
prioritized sweeping implementations.
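
For context, here is a sketch of conventional prioritized sweeping for evaluating a fixed policy in a deterministic model, where the Bellman error is cheap and can serve directly as the priority; van Seijen and Sutton's contribution is achieving exact Bellman-error ordering efficiently in the general case via small backups, which this sketch does not implement:

    import heapq

    def bellman_error(model, V, s, gamma):
        reward, s_next = model[s]          # deterministic model
        return reward + gamma * V[s_next] - V[s]

    def prioritized_sweeping(model, predecessors, V, gamma=0.99,
                             theta=1e-4, max_backups=100000):
        # model[s] = (reward, next_state); predecessors[s] = states
        # with transitions into s; V maps states to value estimates.
        queue = [(-abs(bellman_error(model, V, s, gamma)), s)
                 for s in model]
        heapq.heapify(queue)
        for _ in range(max_backups):
            if not queue:
                break
            neg_priority, s = heapq.heappop(queue)
            if -neg_priority < theta:      # remaining errors are small
                break
            V[s] += bellman_error(model, V, s, gamma)  # full backup
            for p in predecessors[s]:      # re-prioritize predecessors
                err = abs(bellman_error(model, V, p, gamma))
                if err > theta:
                    heapq.heappush(queue, (-err, p))
        return V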

Pilarski, P. M., Dick, T. B., Sutton, R. S. (2013). Real-time prediction learning for the simultaneous actuation of multiple prosthetic joints. In: Proceedings of the 2013 IEEE International Conference on Rehabilitation Robotics, Seattle, USA, June 24–26. 8 pages.

ABSTRACT:
Integrating learned predictions into a prosthetic control
system promises to enhance multi-joint prosthesis use by
amputees. In this article, we present a preliminary study of
different cases where it may be beneficial to use a set of
temporally extended predictions—learned and maintained in real
time—within an engineered or learned prosthesis controller.
Our study demonstrates the first successful combination of
actor-critic reinforcement learning with real-time prediction
learning. We evaluate this new approach to control learning
during the myoelectric operation of a robot limb. Our results
suggest that the integration of real-time prediction and
control learning may speed control policy acquisition, allow
unsupervised adaptation in myoelectric controllers, and
facilitate synergies in highly actuated limbs. These
experiments also show that temporally extended prediction
learning enables anticipatory actuation, opening the way for
coordinated motion in assistive robotic devices. Our work
therefore provides initial evidence that real-time prediction
learning is a practical way to support intuitive joint control
in increasingly complex prosthetic systems.

Pilarski, P. M., Dawson, M. R., Degris, T., Chan, K. M., Hebert, J. S., Carey, J. P., Sutton, R. S. (2013). Adaptive Artificial Limbs: A Real-time Approach to Prediction and Anticipation, IEEE Robotics & Automation Magazine 20(1):53-64.

ABSTRACT:
Predicting the future has long been regarded as a powerful
means to improvement and success. The ability to make accurate
and timely predictions enhances our ability to control our
situation and our environment. Assistive robotics is one
prominent area where foresight of this kind can bring improved
quality of life. In this article, we present a new approach to
acquiring and maintaining predictive knowledge during the
online, ongoing operation of an assistive robot. The ability
to learn accurate, temporally abstracted predictions is shown
through two case studies—able-bodied subjects engaging in the
myoelectric control of a robot arm, and an amputee
participant’s interactions with a myoelectric training robot.
To our knowledge, this work is the first demonstration of a
practical method for real-time prediction learning during
myoelectric control. Our approach therefore represents a
fundamental tool for addressing one major unsolved problem:
amputee-specific adaptation during the ongoing operation of a
prosthetic device. The findings in this article also
contribute a first explicit look at prediction learning in
prosthetics as an important goal in its own right, independent
of its intended use within a specific controller or system.
Our results suggest that real-time learning of predictions and
anticipations is a significant step towards more intuitive
myoelectric prostheses and other assistive robotic devices.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2013). Timing and cue competition in conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus), Learning and Memory 20:97-102. Cold Spring Harbor Press.

ABSTRACT: Rabbits were classically conditioned using compounds of tone and light
conditioned stimuli (CSs) presented with either simultaneous
onsets (Experiment 1) or serial onsets (Experiment 2) in a
delay conditioning paradigm. Training with the simultaneous
compound reduced the likelihood of a conditioned response (CR)
to the individual CSs (“mutual overshadowing”) but left CR
timing unaltered. CR peaks were consistently clustered around
the time of unconditioned stimulus (US) delivery. Training
with the serial compound (CSA->CSB->US) reduced
responding to CSB (“temporal primacy/information effect”) but
this effect was prevented by prior CSB->US pairings. In
both cases, serial compound training altered CR timing. On
CSA->CSB test trials, the CRs were accelerated; the CR
peaks occurred after CSB onset but well before the time of US
delivery. Conversely, CRs on CSB– trials were decelerated; the
distribution of CR peaks was variable but centered well after
the US. Timing on CSB– trials was at most only slightly
accelerated. The results are discussed with respect to
processes of generalization and spectral timing applicable to
the cerebellar and forebrain pathways in eyeblink
preparations.

White, A., Modayil, J., Sutton, R. S. (2012). Scaling Life-long Off-policy Learning. In Proceedings of the Second Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-Epirob), San Diego, USA.

ABSTRACT: In
this paper, we pursue an approach to scaling life-long
learning using parallel, off-policy reinforcement-learning
algorithms. In life-long learning, a robot continually learns
from a life-time of experience, slowly acquiring and applying
skills and knowledge to new situations. Many of the benefits
of life-long learning are a result of scaling the amount of
training data, processed by the robot, to long sensorimotor
streams. Another dimension of scaling can be added by allowing
off-policy sampling from the unending stream of sensorimotor
data generated by a long-lived robot. Recent algorithmic
developments have made it possible to apply off-policy
algorithms to life-long learning, in a sound way, for the
first time. We assess the scalability of these off-policy
algorithms on a physical robot. We show that hundreds of
accurate multi-step predictions can be learned about several
policies in parallel and in real time. We present the first
online measures of off-policy learning progress. Finally we
demonstrate that our robot, using the new off-policy measures,
can learn 8000 predictions about 300 distinct policies, a
substantial increase in scale compared to previous simulated
and robotic life-long learning systems.

Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring a Broad Range of Empirical Knowledge in Real Time by Temporal-Difference Learning. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Seoul, Korea.

ABSTRACT:
Several robot capabilities rely on predictions about the
temporally extended consequences of a robot’s behaviour. We
describe how a robot can both learn and make many such
predictions in real time using a standard algorithm. Our
experiments show that a mobile robot can learn and make
thousands of accurate predictions at 10 Hz. The predictions
are about the future of all of the robot’s sensors and many
internal state variables at multiple time-scales. All the
predictions share a single set of features and learning
parameters. We demonstrate the generality of this method with
an application to a different platform, a robot arm operating
at 50 Hz. Here, learned predictions can be used to measurably
improve the user interface. The temporally extended
predictions learned in real time by this method constitute a
basic form of knowledge about the dynamics of the robot’s
interaction with the environment. We also show how this method
can be extended to express more general forms of knowledge.

Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring Diverse Predictive Knowledge in Real Time by Temporal-difference Learning. International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems, Montpellier, France.

ABSTRACT:
Existing robot algorithms demonstrate several capabilities
that are enabled by a robot’s knowledge of the temporally
extended consequences of its behaviour. This knowledge
consists of real-time predictions—predictions that are
conventionally computed by iterating a small one-timestep
model of the robot’s dynamics. Given the utility of such
predictions, alternatives are desirable when this conventional
approach is not applicable, for example when an adequate model
of the one-timestep dynamics is either not available or not
computationally tractable. We describe how a robot can both
learn and make many such predictions in real-time using a
standard reinforcement learning algorithm. Our experiments
show that a mobile robot can learn and make thousands of
accurate predictions at 10 Hz about the future of all of its
sensors and many internal state variables at multiple
time-scales. The method uses a single set of features and
learning parameters that are shared across all the
predictions. We demonstrate the generality of these
predictions with an application to a different platform, a
robot arm operating at 50 Hz. Here, the predictions are about
which arm joint the user wants to move next, a difficult
situation to model analytically, and we show how the learned
predictions enable measurable improvements to the user
interface. The predictions learned in real-time by this method
constitute a basic form of knowledge about the robot’s
interaction with the environment, and extensions of this
method can express more general forms of knowledge.

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2012). Evaluating the TD model of classical conditioning. Learning and Behavior 40(3):305-319. Springer.

ABSTRACT: The
temporal-difference (TD) algorithm from reinforcement learning
provides a simple method for incrementally learning
predictions of upcoming events. Applied to classical
conditioning, TD models suppose that animals learn a real-time
prediction of the unconditioned stimulus (US) on the basis of
all available conditioned stimuli (CSs). In the TD model,
similar to other error-correction models, learning is driven
by prediction errors—the difference between the change in US
prediction and the actual US. With the TD model, however,
learning occurs continuously from moment to moment and is not
artificially constrained to occur in trials. Accordingly, a
key feature of any TD model is the assumption about the
representation of a CS on a moment-to-moment basis. Here, we
evaluate the performance of the TD model with a heretofore
unexplored range of classical conditioning tasks. To do so, we
consider three stimulus representations that vary in their
degree of temporal generalization and evaluate how the
representation influences the performance of the TD model on
these conditioning tasks.
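
For the simplest of these representations, the complete serial compound (CSC), one trial of the TD model reduces to TD(λ) with the US intensity playing the role of reward. A sketch under assumed names and parameter values:

    import numpy as np

    def td_conditioning_trial(w, cs_features, us, alpha=0.05,
                              gamma=0.97, lam=0.9):
        # cs_features: (T, n) binary matrix; in a CSC representation
        # each CS activates one unit per time step since its onset.
        # us: length-T vector of US intensities (the 'reward').
        T, n = cs_features.shape
        e = np.zeros(n)
        for t in range(T - 1):
            x, x_next = cs_features[t], cs_features[t + 1]
            delta = us[t + 1] + gamma * (w @ x_next) - w @ x
            e = gamma * lam * e + x
            w += alpha * delta * e   # learning driven by prediction error
        return w

Swapping in a representation with more temporal generalization (e.g., microstimuli) changes only how cs_features is constructed, which is exactly the comparison the paper performs.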

Pilarski, P. M., Sutton, R. S. (2012). Between instruction and reward: Human-prompted switching. Proceedings of the AAAI Fall Symposium on Robots Learning Interactively from Human Teachers. AAAI Technical Report FS-12-07, pp. 46–52.

ABSTRACT: Intelligent
systems promise to amplify, augment, and extend innate
human abilities. A principal example is that of assistive
rehabilitation robots—artificial intelligence and machine
learning enable new electromechanical systems that restore
biological functions lost through injury or illness. In
order for an intelligent machine to assist a human user,
it must be possible for a human to communicate their
intentions and preferences to their non-human counterpart.
While there are a number of techniques that a human can
use to direct a machine learning system, most research to
date has focused on the contrasting strategies of
instruction and reward. The primary contribution of our
work is to demonstrate that the middle ground between
instruction and reward is a fertile space for research and
immediate technological progress. To support this idea, we
introduce the setting of human-prompted switching, and
illustrate the successful combination of switching with
interactive learning using a concrete real-world example:
human control of a multi-joint robot arm. We believe
techniques that fall between the domains of instruction
and reward are complementary to existing approaches, and
will open up new lines of rapid progress for interactive
human training of machine learning systems.

Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Chan, K. M., Hebert, J. S., Sutton, R. S. (2012). Towards Prediction-Based Prosthetic Control. In Proceedings of the 2012 Conference of the International Functional Electrical Stimulation Society, September 9-12, Banff, Canada, 4 pages.

ABSTRACT: Predictions
and anticipations form a basis for
pattern-recognition-based control systems, including those
used in next-generation prostheses and assistive devices.
In this work, we outline how new methods in real-time
prediction learning provide an approach to one of the
principal open problems in multi-function myoelectric
control—robust, ongoing, amputee-specific adaptation.
Techniques from reinforcement learning and general value
functions are applied to learn and update predictions
during continuing human interactions with multi-function
robotic appendages. Tests were performed with a targeted
motor and sensory reinnervation amputee subject and
non-amputee participants. Results show that this online
prediction learning approach is able to accurately
anticipate a diverse set of human and robot signals by
more than two seconds, and at different levels of temporal
abstraction. These methods allow predictions to be adapted
during real-time use, which is an important step toward
the intuitive control of powered prostheses and other
assistive devices.

Degris, T., White, M., Sutton, R. S. (2012). Off-policy Actor-Critic. In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland.

ABSTRACT: This paper
presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and
incremental, and its per-time-step complexity scales
linearly with the number of learned weights. Previous work
on actor-critic algorithms is limited to the on-policy
setting and does not take advantage of the recent advances
in off-policy gradient temporal-difference learning.
Off-policy techniques, such as Greedy-GQ, enable a target
policy to be learned while following and obtaining data
from another (behavior) policy. For many problems,
however, actor-critic methods are more practical than
action value methods (like Greedy-GQ) because they
explicitly represent the policy; consequently, the policy
can be stochastic and utilize a large action space. In
this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning
with the flexibility in action selection given by
actor-critic methods. We derive an incremental, linear
time and space complexity algorithm that includes
eligibility traces, prove convergence under assumptions
similar to previous off-policy algorithms, and empirically
show better or comparable performance to existing
algorithms on standard reinforcement-learning benchmark
problems.
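
A compressed sketch of one actor-critic step in this off-policy setting; for brevity it substitutes a plain importance-weighted TD(λ) critic for the paper's gradient-TD critic, so treat the details as assumptions and see the paper for the exact updates and convergence conditions:

    import numpy as np

    def offpac_like_step(u, v, e_u, e_v, phi, phi_next, psi, rho, reward,
                         alpha_u=0.001, alpha_v=0.01, gamma=0.99, lam=0.9):
        # u: policy weights; v: critic weights; psi = grad log pi(a|s)
        # for the action taken; rho = pi(a|s)/b(a|s).
        delta = reward + gamma * (v @ phi_next) - v @ phi
        e_v = rho * (gamma * lam * e_v + phi)   # critic trace
        v += alpha_v * delta * e_v              # simplified TD critic
        e_u = rho * (lam * e_u + psi)           # actor trace
        u += alpha_u * delta * e_u              # policy-gradient step
        return u, v, e_u, e_v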

Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Sutton, R. S. (2012). Dynamic Switching and Real-time Machine Learning for Improved Human Control of Assistive Biomedical Robots. In Proceedings of the Fourth IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob), June 24-27, Roma, Italy, 296-302.

ABSTRACT: A general
problem for human-machine interaction occurs when a
machine’s controllable dimensions outnumber the control
channels available to its human user. In this work, we
examine one prominent example of this problem: amputee
switching between the multiple functions of a powered
artificial limb. We propose a dynamic switching approach
that learns during ongoing interaction to anticipate user
behaviour, thereby presenting the most effective control
option for a given context or task. Switching predictions
are learned in real time using temporal difference methods
and reinforcement learning, and demonstrated within the
context of a robotic arm and a multi-function myoelectric
controller. We find that a learned, dynamic switching
order is able to out-perform the best fixed (non-adaptive)
switching regime on a standard prosthetic proficiency
task, increasing the number of optimal switching
suggestions by 23%, and decreasing the expected transition
time between degrees of freedom by more than 14%. These
preliminary results indicate that real-time machine
learning, specifically online prediction and anticipation,
may be an important tool for developing more robust and
intuitive controllers for assistive biomedical robots. We
expect these techniques will transfer well to near-term
use by patients. Future work will describe clinical
testing of this approach with a population of amputee
patients.

Sutton, R. S. (2012). Beyond reward: The problem of knowledge and data (extended abstract). In: Proceedings of the 21st International Conference on Inductive Logic Programming, S.H. Muggleton, A. Tamaddoni-Nezhad, F.A. Lisi (Eds.): ILP 2011, LNAI 7207, pp. 2--6. Springer, Heidelberg.

Degris, T., Modayil, J. (2012). Scaling-up Knowledge for a Cognizant Robot. In notes of the AAAI Spring Symposium on Designing Intelligent Robots: Reintegrating AI.

ABSTRACT: This paper
takes a new approach to the old adage that knowledge is
the key for artificial intelligence. A cognizant robot is
a robot with a deep and immediately accessible
understanding of its interaction with the environment—an
understanding the robot can use to flexibly adapt to novel
situations. Such a robot will need a vast amount of
situated, revisable, and expressive knowledge to display
flexible intelligent behaviors. Instead of relying on
human-provided knowledge, we propose that an arbitrary
robot can autonomously acquire pertinent knowledge
directly from everyday interaction with the environment.
We show how existing ideas in reinforcement learning can
enable a robot to maintain and improve its knowledge. The
robot performs a continual learning process that scales-up
knowledge acquisition to cover a large number of facts,
skills and predictions. This knowledge has semantics that
are grounded in sensorimotor experience. We see the
approach of developing more cognizant robots as a
necessary key step towards broadly competent robots.

Degris, T., Pilarski, P. M., Sutton, R. S. (2012). Model-Free Reinforcement Learning with Continuous Action in Practice. In Proceedings of the 2012 American Control Conference.

ABSTRACT: Reinforcement
learning methods are often considered a potential
solution to enable a robot to adapt in real time to
changes in an unpredictable environment. However, with
continuous action, only a few existing algorithms are
practical for real-time learning. In such a setting, most
effective methods have used a parameterized policy
structure, often with a separate parameterized value
function. The goal of this paper is to assess such
actor-critic methods to form a fully specified practical
algorithm. Our specific contributions include 1)
developing the extension of existing incremental
policy-gradient algorithms to use eligibility traces, 2)
an empirical comparison of the resulting algorithms using
continuous actions, 3) the evaluation of a
gradient-scaling technique that can significantly improve
performance. Finally, we apply our actor-critic algorithm
to learn on a robotic platform with a fast sensorimotor
cycle (10ms). Overall, these results constitute an
important step towards practical real-time learning
control with continuous action.

Mahmood, A. R., Sutton, R. S., Degris, T., Pilarski, P. M. (2012). Tuning-free step-size adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan.

ABSTRACT: Incremental
learning algorithms based on gradient descent are
effective and popular in online supervised learning,
reinforcement learning, signal processing, and many other
application areas. An oft-noted drawback of these
algorithms is that they include a step-size parameter that
needs to be tuned for best performance, which might
require manual intervention and significant domain
knowledge or additional data. In many cases, an entire
vector of step-size parameters (e.g., one for each input
feature) needs to be tuned in order to attain the best
performance of the algorithm. To address this, several
methods have been proposed for adapting step sizes online.
For example, Sutton’s IDBD method can find the best vector
step size for the LMS algorithm, and Schraudolph’s ELK1
method, an extension of IDBD to neural networks, has
proven effective on large applications, such as 3D hand
tracking. However, to date all such step-size adaptation
methods have included a tunable step-size parameter of
their own, which we call the meta-step-size parameter. In this paper
we show that the performance of existing step-size
adaptation methods is strongly dependent on the choice of
their meta-step-size parameter and that their
meta-step-size parameter cannot be set reliably in a
problem-independent way. We introduce a series of
modifications and normalizations to the IDBD method that
together eliminate the need to tune the meta-step-size
parameter to the particular problem. We show that the
resulting overall algorithm, called Autostep, performs as
well or better than the existing step-size adaptation
methods on a number of idealized and robot prediction
problems and does not require any tuning of its
meta-step-size parameter. The ideas behind Autostep are
not restricted to the IDBD method and the same principles
are potentially applicable to other incremental learning
settings, such as reinforcement learning.
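
For reference, the unnormalized IDBD rule that Autostep builds on can be stated in a few lines for the LMS setting (Sutton's 1992 method); the meta_step parameter here is exactly the meta-step-size that Autostep's normalizations aim to make tuning-free:

    import numpy as np

    def idbd_update(w, h, beta, x, y, meta_step=0.01):
        # Each weight w[i] has its own log step-size beta[i] and an
        # auxiliary trace h[i] of recent correlated weight changes.
        delta = y - w @ x                  # prediction error
        beta += meta_step * delta * x * h  # meta-gradient step
        alpha = np.exp(beta)               # per-weight step sizes
        w += alpha * delta * x
        h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
        return w, h, beta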

Silver, D., Sutton, R. S., Müller, M. (2012). Temporal-difference search in computer Go, Machine Learning 87(2):183-219. Pdf copy for personal research purposes only.

ABSTRACT: Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.

Modayil, J., White, A., Sutton, R. S. (2012). Multi-timescale Nexting in a Reinforcement Learning Robot. Presented at the 2012 International Conference on Adaptive Behaviour, Odense, Denmark. To appear in: SAB 12, LNAI 7426, pp. 299-309, T. Ziemke, C. Balkenius, and J. Hallam, Eds., Springer Heidelberg. Also ArXiv preprint 1112.1133.

Pilarski, P. M., Dawson, M. R., Degris, T., Fahimi, F., Carey, J. P., Sutton, R. S. (2011). Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning, Proceedings of the 2011 IEEE International Conference on Rehabilitation Robotics (ICORR), Zurich, Switzerland, pp. 134-140.

ABSTRACT: As
a contribution toward the goal of adaptable, intelligent
artificial limbs, this work introduces a continuous
actor-critic reinforcement learning method for optimizing the
control of multi-function myoelectric devices. Using a
simulated upper-arm robotic prosthesis, we demonstrate how it
is possible to derive successful limb controllers from
myoelectric data using only a sparse human-delivered training
signal, without requiring detailed knowledge about the task
domain. This reinforcement-based machine learning framework is
well suited for use by both patients and clinical staff, and
may be easily adapted to different application domains and the
needs of individual amputees. To our knowledge, this is the
first myoelectric control approach that facilitates the online
learning of new amputee-specific motions based only on a
one-dimensional (scalar) feedback signal provided by the user
of the prosthesis.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems, Taipei, Taiwan.

ABSTRACT: Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems. Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.
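
Concretely, each demon can be pictured as running a gradient-TD update on its own question while sharing the feature vector. A sketch of one demon's GTD(λ)-style step, with assumed names; the surrounding architecture (the parallel demons and the behavior policy) is not shown:

    import numpy as np

    def demon_gtd_step(theta, w, e, phi, phi_next, pseudo_reward,
                       gamma, lam, rho, alpha=0.01, alpha_w=0.001):
        # Each demon supplies its own pseudo reward, termination
        # (gamma), and target policy via rho = pi(a|s)/b(a|s).
        delta = pseudo_reward + gamma * (theta @ phi_next) - theta @ phi
        e = rho * (phi + gamma * lam * e)   # off-policy eligibility trace
        theta = theta + alpha * (delta * e
                                 - gamma * (1 - lam) * (w @ e) * phi_next)
        w = w + alpha_w * (delta * e - (w @ phi) * phi)  # secondary weights
        return theta, w, e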

Modayil, J., Pilarski, P. M., White, A., Degris, T., Sutton, R. S. (2010). Off-policy knowledge maintenance for robots, Robotics Science and Systems Workshop (Towards Closing the Loop: Active Learning for Robotics). Extended abstract and poster.

ABSTRACT: A fundamental difficulty in
robotics arises from changes in the experienced
environment—periods when the robot’s current situation
differs from past experience. We present an architecture
whereby many independent reinforcement learning agents (or
demons) observe the
behaviour of a single robot. Each demon learns one piece of
world knowledge represented with a generalized value
function. This architecture allows the demons to update
their knowledge online and off-policy from the robot’s
behaviour. We present one approach to active exploration
using curiosity—an internal measure of learning progress—and
conclude with a preliminary result showing how a robot can
adapt its prediction of the time needed to come to a full
stop.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2010). Timing in trace conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus): Scalar, nonscalar, and adaptive features. Learning and Memory 17:600-604. Cold Spring Harbor Press.

ABSTRACT: Using interstimulus intervals (ISIs) of 125, 250, and 500 msec in
trace conditioning of the rabbit nictitating membrane
response, the offset times and durations of conditioned
responses (CRs) were collected along with onset and peak
latencies. All measures were proportional to the ISI, but only
onset and peak latencies conformed to the criterion for scalar
timing. Regarding the CR’s possible protective overlap of the
unconditioned stimulus (US), CR duration increased with ISI,
while the peak’s alignment with the US declined. Implications
for models of timing and CR adaptiveness are discussed.

Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In: Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel.

ABSTRACT: We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
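
Read as equations, Greedy-GQ with λ = 0 comes down to a TD update plus a gradient-correction term using a second weight vector. A sketch under assumed interfaces (state-action features, with the target policy greedy with respect to theta):

    import numpy as np

    def greedy_gq_step(theta, w, phi_sa, reward, next_action_phis,
                       alpha=0.01, alpha_w=0.1, gamma=0.99):
        # phi_sa: features of the state-action just taken;
        # next_action_phis: feature vectors for each action available
        # in the next state.
        q_next = [theta @ p for p in next_action_phis]
        phi_greedy = next_action_phis[int(np.argmax(q_next))]
        delta = reward + gamma * max(q_next) - theta @ phi_sa
        theta = theta + alpha * (delta * phi_sa
                                 - gamma * (w @ phi_sa) * phi_greedy)
        w = w + alpha_w * (delta - w @ phi_sa) * phi_sa  # correction weights
        return theta, w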

Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In: Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland.

ABSTRACT: A new family of gradient temporal-difference
learning algorithms has recently been introduced by Sutton,
Maei and others in which function approximation is much more
straightforward. In this paper, we introduce the GQ(λ)
algorithm which can be seen as extension of that work to a
more general setting including eligibility traces and
off-policy learning of temporally abstract predictions. These
extensions bring us closer to the ultimate goal of this
work—the development of a universal prediction learning
algorithm suitable for learning experientially grounded
knowledge of the world. Eligibility traces are essential to
this goal because they bridge the temporal gaps in cause and
effect when experience is processed at a temporally fine
resolution. Temporally abstract predictions are also essential
as the means for representing abstract, higher-level knowledge
about courses of action, or options. GQ(λ) can be thought of
as an extension of Q-learning. We extend existing convergence
results for policy evaluation to this setting and carry out a
forward-view/backward-view analysis to derive and prove the
validity of the new algorithm.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2009). Magnitude and timing of conditioned responses in delay and trace classical conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus).

ABSTRACT:
The present experiment characterized conditioned nictitating
membrane (NM) movements as a function of CS duration, using
the full range of discernible movements (>.06 mm) rather
than movements exceeding a conventional criterion (>.50
mm). The CS–US interval was fixed at 500 ms, while across
groups, the duration of the CS was 50 ms (trace), 550 ms
(delay), or 1050 ms (extended delay). The delay group showed
the highest level of acquisition. When tested with the
different CS durations, the delay and extended delay groups
showed large reductions in their responses when their CS was
shortened to 50 ms, but the trace group maintained its
response at all durations. Timing of the conditioned
movements appeared similar across all manipulations. The
results suggest that the CS has both a fine timing function
tied to CS onset and a general predictive function tied to
CS duration, both of which may be mediated by cerebellar
pathways.

Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010). Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation.

ABSTRACT: We introduce the first temporal-difference
learning algorithms that converge with smooth value function
approximators, such as neural networks. Conventional
temporal-difference (TD) methods, such as TD(λ), Q-learning
and Sarsa, have been used successfully with function
approximation in many applications. However, it is well
known that off-policy sampling, as well as nonlinear
function approximation, can cause these algorithms to become
unstable (i.e., the parameters of the approximator may
diverge). Sutton et al. (2009a, 2009b) solved the problem of
off-policy learning with linear TD algorithms by introducing
a new objective function, related to the Bellman error, and
algorithms that perform stochastic gradient-descent on this
function. These methods can be viewed as natural
generalizations of previous TD methods, as they converge to
the same limit points when used with linear function
approximation methods. We generalize this work to nonlinear
function approximation. We present a Bellman error objective
function and two gradient-descent TD algorithms that
optimize it. We prove the asymptotic almost-sure convergence
of both algorithms, for any finite Markov decision process
and any smooth value function approximator, to a locally
optimal solution. The algorithms are incremental and the
computational complexity per time step scales linearly with
the number of parameters of the approximator. Empirical
results obtained in the game of Go demonstrate the
algorithms’ effectiveness.

Interview with Richard S. Sutton, by Verena Heidrich-Meisner of Kuenstliche Intelligenz (2009), pp. 41-43.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2009). Natural Actor-Critic Algorithms.

ABSTRACT: We present four new
reinforcement learning algorithms based on actor-critic,
natural-gradient and function-approximation ideas, and we
provide their convergence proofs. Actor-critic reinforcement
learning methods are online approximations to policy
iteration in which the value-function parameters are
estimated using temporal difference learning and the policy
parameters are updated by stochastic gradient descent.
Methods based on policy gradient in this way are of special
interest because of their compatibility with function
approximation methods, which are needed to handle large or
infinite state spaces. The use of temporal difference
learning in this way is of interest because in many
applications it dramatically reduces the variance of the
gradient estimates. The use of the natural gradient is of
interest because it can produce better conditioned
parameterizations and has been shown to further reduce
variance in some cases. Our results extend prior
two-timescale convergence results for actor-critic methods
by Konda and Tsitsiklis by using temporal difference
learning in the actor and by incorporating natural
gradients. Our results extend prior empirical studies of
natural actor-critic methods by Peters, Vijayakumar and
Schaal by providing the first convergence proofs and the
first fully incremental algorithms.

Sutton, R. S. (2009).
The grand challenge of
predictive empirical abstract knowledge. Working Notes
of the IJCAI-09 Workshop on Grand Challenges for Reasoning
from Experiences.

ABSTRACT: We survey ongoing
work at the University of Alberta on an
experience-oriented approach to artificial intelligence
(AI) based on reinforcement learning ideas. We seek to
ground world knowledge in a minimal ontology of just the
signals passing back and forth between the AI agent and
its environment at a fine temporal scale. The challenge is
to connect these low-level signals to higher-level
representations in such a way that the knowledge remains
grounded and autonomously verifiable. The
mathematical ideas of temporally abstract options, option
models, and temporal-difference networks can be applied to
begin to address this challenge. This has been illustrated
in several simple computational worlds, and we seek now to
extend the approach to a physically realized robot. A
recent theoretical development is the extension of simple
temporal-difference methods to off-policy forms using
function approximation; this should enable, for the first
time, efficient and reliable intra-option learning,
bringing the goal of predictive empirical abstract
knowledge closer to achievement.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation.

Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
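
As a concrete reading of the TDC update: the main weights follow the conventional linear TD update plus a correction term, while a second weight vector estimates the projection of the TD error onto the features. Since the second vector starts at zero, early behavior matches TD(0). A minimal on-policy sketch with illustrative names:

```python
import numpy as np

def tdc_step(theta, w, phi, reward, phi_next, alpha, beta, gamma):
    """One step of linear TD with gradient correction (TDC)."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    # Conventional TD update plus the correction term, initially zero.
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    # Second weight vector estimates the projected TD error.
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```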

Kehoe, E. J., Olsen, K. N., Ludvig, E. A., Sutton, R. S. (2009). Scalar timing varies with response magnitude in classical conditioning of the rabbit nictitating membrane response (Oryctolagus cuniculus). Behavioral Neuroscience 123, 212–217.

The present experiment was aimed at characterizing the timing of conditioned nictitating membrane (NM) movements as a function of the interstimulus interval (ISI) in delay conditioning for rabbits (Oryctolagus cuniculus). Onset latency and peak latency were approximately, but not strictly, scalar for all but the smallest movements (<.10 mm). That is, both the mean and standard deviation of the timing measures increased in proportion to the ISI, but their coefficients of variation (standard deviation/mean) tended to be larger for shorter ISIs. For all ISIs, the absolute timing of the NM movements covaried with magnitude. The smaller movements (approximately .11-.50 mm) were highly variable, and their peaks tended to occur well after the time of US delivery. The larger movements (>.50 mm) were less variable, and their peaks were better aligned with the time of US delivery. These results are discussed with respect to their implications for current models of timing in eyeblink conditioning.

Sutton, R. S., Szepesvari, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation.

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
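
A minimal sketch of the GTD update as described: an auxiliary vector tracks the expected TD(0) update, and the main weights do stochastic gradient descent on the squared L2 norm of that vector. Names are illustrative:

```python
import numpy as np

def gtd0_step(theta, u, phi, reward, phi_next, alpha, beta, gamma):
    """One GTD(0) step."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    u = u + beta * (delta * phi - u)                  # track E[delta * phi]
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
    return theta, u
```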

Ludvig, E., Sutton, R. S., Verbeek, E., Kehoe, E. J. (2009). A computational model of hippocampal function in trace conditioning.

We present a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses. For delay conditioning, in contrast, the continued presence of the stimulus supports conditioned responding, and the short-latency elements suppress responding early in the stimulus. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended stimuli or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions.

Cutumisu, M., Szafron, D., Bowling, M., Sutton R. S. (2008). Agent learning using action-dependent learning rates in computer role-playing games. Proceedings of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference, pp. 22-29.

We
introduce the ALeRT (Action-dependent Learning Rates with Trends)
algorithm that makes two modifications to the learning
rate and one change to the exploration rate of traditional
reinforcement learning techniques. Our learning rates are
action-dependent and increase or decrease based on trends
in reward sequences. Our exploration rate decreases when
the agent is learning successfully and increases
otherwise. These improvements result in faster learning.
We implemented this algorithm in NWScript, a scripting
language used by BioWare Corp.’s Neverwinter Nights game, with the goal
of improving the behaviours of game agents so that they
react more intelligently to game events. Our goal is to
provide an agent with the ability to (1) discover favourable
policies in a multi-agent computer role-playing game
situation and (2) adapt
to sudden changes in the environment.

Silver, D., Sutton, R. S., and Mueller, M. (2008). Sample-based learning and search with permanent and transient memories.

We study a reinforcement learning architecture that encompasses both learning and planning, and apply it to high performance Computer Go. In this domain, the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are individuated, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of our architecture, Dyna-2, is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. This architecture subsumes Sarsa and UCT as special cases. We apply Dyna-2 to 9x9 Computer Go, using a million binary features in the function approximator, based on templates matching small fragments of the board. We test our approach against GnuGo, a standard benchmark opponent, and on the Computer Go Online Server (CGOS). We show that the combination of permanent and transient memories performs better than either memory alone, and significantly outperforms a traditional approach using alpha-beta search. Our program based on Dyna-2 achieves a higher rating on CGOS than any other handcrafted or traditional search based program. This is the first exploration of the Dyna concept in a large-scale, high-performance application.

Sutton, R. S., Szepesvari, Cs., Geramifard, A., Bowling, M. (2008). Dyna-style planning with linear function approximation and prioritized sweeping, Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528-536.

We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized-sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
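
A sketch of linear Dyna for the policy-evaluation case, under illustrative assumptions: the model is a matrix F predicting the expected next feature vector and a vector b predicting the reward, both learned by LMS; planning applies TD(0)-style backups to imagined feature vectors. The paper's prioritized-sweeping versions additionally order these backups by the size of the predicted change, which this sketch omits:

```python
import numpy as np

def linear_dyna_eval_step(theta, F, b, phi, reward, phi_next,
                          alpha, model_lr, gamma, plan_phis):
    """One linear-Dyna policy-evaluation step (sketch)."""
    # Model learning: F predicts the expected next features, b the reward.
    F = F + model_lr * np.outer(phi_next - F @ phi, phi)
    b = b + model_lr * (reward - b @ phi) * phi
    # Direct reinforcement learning on the real transition.
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    theta = theta + alpha * delta * phi
    # Planning: TD(0)-style backups on imagined transitions from the model.
    for x in plan_phis:
        delta_p = (b @ x) + gamma * (theta @ (F @ x)) - theta @ x
        theta = theta + alpha * delta_p * x
    return theta, F, b
```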

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system.

The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and enhances correspondence between model and data in several experiments, including those when rewards are omitted or received early. This improved fit mostly derives from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.
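
A rough sketch of the microstimulus representation as described: each stimulus spawns a memory trace that decays exponentially after onset, and each microstimulus is a Gaussian receptive field over the height of that trace, scaled by the trace itself, so later microstimuli are weaker and more diffuse in time. The decay rate, width, and centers below are illustrative assumptions, not the paper's fitted values:

```python
import numpy as np

def microstimuli(t_since_onset, n=10, decay=0.985, sigma=0.08):
    """Activations of n microstimuli at a given time after stimulus onset."""
    trace = decay ** t_since_onset                 # exponentially decaying memory trace
    centers = np.linspace(1.0, 1.0 / n, n)         # receptive-field centers on trace height
    return trace * np.exp(-((trace - centers) ** 2) / (2 * sigma ** 2))
```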

Kehoe, E. J., Ludvig, E. A., Dudeney, J. E., Neufeld, J., Sutton, R. S. (2008). Magnitude and timing of nictitating membrane movements during classical conditioning of the rabbit (Oryctolagus cuniculus).

A trial-by-trial, subject-by-subject analysis was conducted to determine whether generation of the conditioned response (CR) occurs on a continuous or all-or-none basis. Three groups of rabbits were trained on different partial reinforcement schedules with the conditioned stimulus presented alone on 10%, 30%, or 50%, respectively, of all trials. Plots of each rabbit’s nictitating membrane movements revealed that their magnitude rose in a continuous fashion. Response growth during acquisition followed a sigmoidal curve, and the timing of CR-sized movements was largely stable throughout the experiment. The results are discussed with respect to alternative models of CR generation.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2008). Incremental Natural Actor-Critic Algorithms.

ABSTRACT: We present four
new reinforcement learning algorithms based on
actor-critic and natural-gradient ideas, and provide their
convergence proofs. Actor-critic reinforcement learning
methods are online approximations to policy iteration in
which the value-function parameters are estimated using
temporal difference learning and the policy parameters are
updated by stochastic gradient descent. Methods based on
policy gradient in this way are of special interest
because of their compatibility with function approximation
methods, which are needed to handle large or infinite
state spaces, and the use of temporal difference learning
in this way is of interest because in many applications it
dramatically reduces the variance of the policy gradient
estimates. The use of the natural gradient is of interest
because it can produce better conditioned
parameterizations and has been shown to further reduce
variance in some cases. Our results extend prior
two-timescale convergence results for actor-critic methods
by Konda et al. by using temporal difference learning in
the actor and by incorporating natural gradients, and
extend prior empirical studies of natural-gradient
actor-critic methods by Peters et al. by providing the
first convergence proofs and the first fully incremental
algorithms.

Sutton, R. S., Koop, A., Silver, D. (2007). On the Role of Tracking in Stationary Environments.

ABSTRACT: It is often
thought that learning algorithms that track the best
solution, as opposed to converging to it, are important
only on nonstationary problems. We present three results
suggesting that this is not so. First we illustrate in a
simple concrete example, the Black and White problem, that
tracking can perform better than any converging algorithm
on a stationary problem. Second, we show the same point on
a larger, more realistic problem, an application of
temporal-difference learning to computer Go. Our third
result suggests that tracking in stationary problems could
be important for meta-learning research (e.g., learning to
learn, feature selection, transfer). We apply a
meta-learning algorithm for step-size adaptation, IDBD,
to the Black and White problem, showing that meta-learning
has a dramatic long-term effect on performance whereas, on
an analogous converging problem, meta-learning has only a
small second-order effect. This small result suggests a
way of eventually overcoming a major obstacle to
meta-learning research: the lack of an independent
methodology for task selection.
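
IDBD, the step-size meta-learning algorithm applied above, keeps one log step size per input and adapts it in proportion to the correlation between the current error and a decaying trace of past weight updates. A minimal sketch for the linear supervised case, with illustrative names:

```python
import numpy as np

def idbd_step(w, beta, h, x, y, meta_rate=0.01):
    """One IDBD update for a linear predictor w on input x with target y."""
    delta = y - w @ x                                    # prediction error
    beta = beta + meta_rate * delta * x * h              # adapt log step sizes
    alpha = np.exp(beta)                                 # per-input step sizes
    w = w + alpha * delta * x                            # LMS update with per-input gains
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w, beta, h
```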

Silver, D., Sutton, R. S., and Mueller, M. (2007) Reinforcement Learning of Local Shape in the Game of Go, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07).

ABSTRACT: We explore an
application to the game of Go of a reinforcement learning
approach based on a linear evaluation function and large
numbers of binary features. This strategy has proved
effective in game playing programs and other reinforcement
learning applications. We apply this strategy to Go by
creating over a million features based on templates for
small fragments of the board, and then use temporal
difference learning and self-play. This method identifies
hundreds of low level shapes with recognisable
significance to expert Go players, and provides quantitative
estimates of their values. We analyse the relative
contributions to performance of templates of different
types and sizes. Our results show that small,
translation-invariant templates are surprisingly
effective. We assess the performance of our program by
playing against the Average Liberty Player and a variety
of computer opponents on the 9x9 Computer Go Server. Our
linear evaluation function appears to outperform all other
static evaluation functions that do not incorporate
substantial domain knowledge.

Geramifard, A., Bowling, M., Zinkevich, M., and Sutton, R. S. (2007). iLSTD: Eligibility Traces and Convergence Analysis, Advances in Neural Information Processing Systems 19 (NIPS*06).

ABSTRACT: We present new
theoretical and empirical results with the iLSTD algorithm
for policy evaluation in reinforcement learning with
linear function approximation. iLSTD is an
incremental method for achieving results similar to LSTD,
the data-efficient, least-squares version of temporal
difference learning, without incurring the full cost of
the LSTD computation. LSTD is O(n^{2}), where n is
the number of parameters in the linear function
approximator, while iLSTD is O(n).
In this paper, we generalize the previous iLSTD algorithm
and present three new results: (1) the first convergence
proof for an iLSTD algorithm; (2) an extension to
incorporate eligibility traces without changing the
asymptotic computational complexity; and (3) the first
empirical results with an iLSTD algorithm for a problem
(mountain car) with feature vectors large enough (n = 10,000) to show
substantial computational advantages over LSTD.

Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental Least-Squares Temporal Difference Learning.

ABSTRACT: Approximate policy
evaluation with linear function approximation is a
commonly arising problem in reinforcement learning,
usually solved using temporal difference (TD) algorithms.
In this paper we introduce a new variant of linear TD
learning, called incremental least-squares TD learning, or
iLSTD. This method is more data efficient than
conventional TD algorithms such as TD(0) and is more
computationally efficient than non-incremental
least-squares TD methods such as LSTD (Bradtke & Barto
1996; Boyan 1999). In particular, we show that the
per-time-step complexities of iLSTD and TD(0) are O(n), where n is the number of
features, whereas that of LSTD is O(n^{2}). This
difference can be decisive in modern applications of
reinforcement learning where the use of a large number of
features has proven to be an effective solution strategy.
We present empirical comparisons, using the test problem
introduced by Boyan (1999), in which iLSTD converges
faster than TD(0) and almost as fast as LSTD.
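
A sketch of the iLSTD idea: accumulate the same statistics LSTD uses, keep the residual mu = b - A theta up to date incrementally, and spend each step's computation descending only along the dimensions where the residual is largest. This dense version is for clarity; the O(n) per-step cost in the paper relies on sparse feature vectors. Names are illustrative:

```python
import numpy as np

def ilstd_step(theta, A, b, mu, phi, reward, phi_next, gamma, alpha, n_sweeps=1):
    """One iLSTD step (sketch), updating statistics A, b and residual mu in place."""
    dA = np.outer(phi, phi - gamma * phi_next)
    A += dA
    b += reward * phi
    mu += reward * phi - dA @ theta        # keep mu = b - A @ theta current
    for _ in range(n_sweeps):
        j = int(np.argmax(np.abs(mu)))     # most urgent dimension
        theta[j] += alpha * mu[j]
        mu -= alpha * mu[j] * A[:, j]      # residual after the change to theta[j]
    return theta, A, b, mu
```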

Precup, D., Sutton, R. S.,
Paduraru, C., Koop, A., Singh, S. (2006). Off-policy
Learning with Recognizers. Advances in Neural Information
Processing Systems 18 (NIPS*05).

ABSTRACT: We introduce a new
algorithm for off-policy temporal-difference learning with
function approximation that has much lower variance and
requires less knowledge of the behavior policy than prior
methods. We develop the notion of a recognizer, a filter
on actions that distorts the behavior policy to produce a
related target policy with low-variance
importance-sampling corrections. We also consider target
policies that are deviations from the state distribution
of the behavior policy, such as potential temporally
abstract options, which further reduces variance. This
paper introduces recognizers and their potential
advantages, then develops a full algorithm for MDPs and
proves that its updates are in the same direction as
on-policy TD updates, which implies asymptotic
convergence. Our algorithm achieves this without knowledge
of the behavior policy or even requiring that there exists
a behavior policy.
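
A sketch of the recognizer idea under illustrative assumptions: a binary filter c(s, a) restricts the behavior policy to recognized actions, so the importance weight is c(s, a) / mu(s), where mu(s) is the probability that the behavior takes a recognized action in s. Crucially, mu can be estimated from data, so the behavior policy itself never needs to be known:

```python
class Recognizer:
    """Recognizer-based importance weights (sketch)."""

    def __init__(self, c, step=0.05):
        self.c = c        # function: (state, action) -> 0.0 or 1.0
        self.mu = {}      # running estimates of mu(s) per state
        self.step = step

    def rho(self, s, a):
        m = self.mu.get(s, 0.5)
        m += self.step * (self.c(s, a) - m)     # learn mu(s) from observed actions
        self.mu[s] = m
        return self.c(s, a) / max(m, 1e-8)      # importance-sampling correction
```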

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18 (NIPS*05).

ABSTRACT:
Temporal-difference (TD) networks have been proposed as a
way of representing and learning a wide variety of
predictions about the interaction between an agent and its
environment (Sutton & Tanner, 2005). These
predictions are compositional
in that their targets are defined in terms of other
predictions, and subjunctive
in that they are about what would happen if an action or
sequence of actions were taken. In conventional TD
networks, the inter-related predictions are at successive
time steps and contingent on a single action; here we
generalize them to accommodate extended time intervals and
contingency on whole ways of behaving. Our
generalization is based on the options framework for
temporal abstraction (Sutton, Precup & Singh,
1999). The primary contribution of this paper is to
introduce a new algorithm for intra-option learning in TD
networks with function approximation and eligibility
traces. We present empirical examples of our
algorithm's effectiveness and of the greater
representational expressiveness of temporally-abstract TD
networks.

Tanner, B., Sutton, R. S. (2005) TD(lambda) Networks: Temporal-Difference Networks with Eligibility Traces.

ABSTRACT:
Temporal-difference (TD) networks have been introduced as
a formalism for expressing and learning grounded world
knowledge in a predictive form (Sutton & Tanner,
2005). Like conventional TD(0) methods, the learning
algorithm for TD networks uses 1-step backups to train
prediction units about future events. In conventional TD
learning, the TD(lambda) algorithm is often used to do
more general multi-step backups of future predictions. In
our work, we introduce a generalization of the 1-step TD
network specification that is based on the TD(lambda)
learning algorithm, creating TD(lambda) networks. We
present experimental results that show TD(lambda) networks
can learn solutions in more complex environments than TD
networks. We also show that in problems that can be solved
by TD networks, TD(lambda) networks generally learn
solutions much faster than their 1-step counterparts.
Finally, we present an analysis of our algorithm
that shows that the computational cost of TD(lambda)
networks is only slightly more than that of TD networks.

Rafols, E. J., Ring, M. B., Sutton, R. S., Tanner, B. (2005). Using Predictive Representations to Improve Generalization in Reinforcement Learning.

ABSTRACT: The predictive
representations hypothesis holds that particularly good
generalization will result from representing the state of
the world in terms of predictions about possible future
experience. This hypothesis has been a central motivation
behind recent research in, for example, PSRs and TD
networks. In this paper we present the first explicit
investigation of this hypothesis. We show in a
reinforcement-learning example (a grid-world navigation
task) that a predictive representation in tabular form can
learn much faster than both the tabular explicit-state
representation and a tabular history-based method.

Tanner, B., Sutton, R. S. (2005) Temporal-Difference Networks with History.

ABSTRACT:
Temporal-difference (TD) networks are a formalism for
expressing and learning grounded world knowledge in a
predictive form [Sutton and Tanner, 2005]. However, not
all partially observable Markov decision processes can be
efficiently learned with TD networks. In this paper, we
extend TD networks by allowing the network-update process
(answer network) to depend on the recent history of
previous actions and observations rather than only on the
most recent action and observation. We show that this
extension enables the solution of a larger class of
problems than can be solved by the original TD networks or
by history-based methods alone. In addition, we apply TD
networks to a problem that, while still simple, is
significantly larger than has previously been considered.
We show that history-extended TD networks can learn much
of the common-sense knowledge of an egocentric gridworld
domain with a single bit of perception.

Stone, P., Sutton, R. S., Kuhlmann, G. (2005). Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior. Simulations of keepaway referenced in the paper, and the keepaway software, are available.

ABSTRACT: RoboCup simulated
soccer presents many challenges to reinforcement learning
methods, including a large state space, hidden and
uncertain state, multiple independent agents learning
simultaneously, and long and variable delays in the
effects of actions. We describe our application of
episodic SMDP Sarsa(lambda) with linear tile-coding
function approximation and variable lambda to learning
higher-level decisions in a keepaway subtask of RoboCup
soccer. In keepaway, one team, "the keepers," tries to
keep control of the ball for as long as possible despite
the efforts of "the takers." The keepers learn
individually when to hold the ball and when to pass to a
teammate. Our agents learned policies that significantly
outperform a range of benchmark policies. We demonstrate
the generality of our approach by applying it to a number
of task variations including different field sizes and
different numbers of players on each team.
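
The core learner here is linear Sarsa(lambda). A minimal sketch with accumulating traces and dense feature vectors; the keepaway agents additionally used sparse tile-coded features, episodic SMDP timing, and variable lambda, which this sketch omits:

```python
import numpy as np

def sarsa_lambda_step(theta, z, phi, reward, phi_next, alpha, gamma, lam):
    """One step of linear Sarsa(lambda) with accumulating eligibility traces.
    phi and phi_next are feature vectors of successive state-action pairs."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    z = gamma * lam * z + phi              # decay traces, then mark current features
    theta = theta + alpha * delta * z      # credit recently active features
    return theta, z
```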

Sutton, R. S., Tanner, B. (2005) Temporal-Difference
Networks. Advances
in Neural Information Processing Systems 17,
pages 1377-1384. Associated
talk from 12/14/05. Small-size version
of same talk.

ABSTRACT: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.

Littman, M.L., Sutton, R.S., Singh, S. (2002). Predictive Representations of State.

Stone, P., Sutton, R. S. (2002). Keepaway soccer: A machine learning testbed. In: Lecture Notes in Computer Science 2377, pp. 207-237, Springer.

RoboCup simulated soccer
presents many challenges to machine learning (ML) methods,
including a large state space, hidden and uncertain state,
multiple agents, and long and variable delays in the effects
of actions. While there have been many successful ML
applications to portions of the robotic soccer task, it
appears to be still beyond the capabilities of modern machine
learning techniques to enable a team of 11 agents to
successfully learn the full robotic soccer task from sensors
to actuators. Because the successful applications to portions
of the task have been embedded in different teams and have
often addressed different sub-tasks, they have been difficult
to compare. We put forth keepaway soccer as a domain suitable
for directly comparing different machine learning approaches
to robotic soccer. It is complex enough that it can’t be
solved trivially, yet simple enough that complete machine
learning approaches are feasible. In keepaway, one team, “the
keepers,” tries to keep control of the ball for as long as
possible despite the efforts of “the takers.” The keepers
learn individually when to hold the ball and when to pass to a
teammate, while the takers learn when to charge the
ball-holder and when to cover possible passing lanes. We fully
specify the domain and summarize some initial, successful
learning results.

Stone, P., Sutton, R.S. (2001). Scaling reinforcement learning toward RoboCup soccer.

Precup, D., Sutton, R.S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Digitally remastered version. Scanned version.

Stone, P., Sutton, R.S.,
Singh, S. (2001).
Reinforcement Learning for 3 vs. 2 Keepaway. In: *RoboCup-2000:
Robot
Soccer World Cup IV,* P. Stone, T. Balch, and G.
Kraetszchmar, Eds., Springer Verlag. An earlier version appeared in
the *Proceedings of the RoboCup-2000 Workshop,*
Melbourne, Australia.

Sutton, R.S. (2001). Method and apparatus for machine learning. U.S. Patent 6,249,781.

ABSTRACT: A method and apparatus is disclosed for machine learning of a pattern sequence which is derived from a plurality of inputs. The pattern sequence is predicted from learning rate parameters that are exponentially related to an incrementally calculated gain parameter for each input. The gain parameters are increased or decreased in real time in correlation with the accuracy of the learning process. The disclosed method and apparatus are advantageously utilized in signal processing, adaptive control systems, and pattern recognition.

Sutton, R.S., McAllester, D., Singh, S., Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation.

Precup, D., Sutton, R.S., Singh, S. (2000). Eligibility Traces for Off-Policy Policy Evaluation.

Sutton, R.S. (1999). Open
theoretical questions in reinforcement learning. In
*Proceedings of the Fourth European Conference on
Computational Learning Theory* (Proceedings
EuroCOLT'99), pp. 11-17, Fischer, P., Simon, H.U., Eds.
Springer-Verlag.

Sutton, R.S. (1999). Reinforcement
Learning. In Rob Wilson and Frank Keil (Eds.) *The MIT
Encyclopedia of the Cognitive Sciences,* MIT
Press.

Sutton, R.S., Precup, D., Singh, S.
(1999). Between MDPs and
semi-MDPs: A Framework for Temporal Abstraction in
Reinforcement Learning. *Artificial Intelligence
112*:181-211. An earlier version appeared as Technical Report 98-74,
Department of Computer Science, University of
Massachusetts, Amherst, MA 01003. April, 1998. Associated talk from
3/5/98. Humorous excerpts
from the JAIR reviews recommending rejection of this
paper.

Sutton, R.S., Singh, S., Precup, D., Ravindran, B. (1999). Improved switching among temporally abstract actions. Djvu version.

Moll, R., Barto, A.G., Perkins, T. J., Sutton, R.S.
(1999). Learning
instance-independent value functions to enhance local
search. *Advances in Neural Information
Processing Systems 11* (Proceedings of the 1998
conference), pp. 1017-1023. MIT Press.

Sutton, R. S., Reinforcement
learning: Past, present, and future. Extended
abstract in Simulated
Evolution and Learning, McKay, B., Yao, X.,
Newton, C. S., Kim, J.-H., Furuhashi, T., Eds. Published
as Lecture Notes in
Computer Science 1585, pp. 195–197, Springer,
1999.

Sutton, R.S., Barto, A.G. (1998). *Reinforcement Learning: An
Introduction*. MIT Press.

Sutton, R.S., Precup, D., Singh,
S. (1998). Intra-option
learning about temporally abstract actions. *Proceedings
of the 15th International Conference on Machine
Learning,* pp. 556-564. Morgan Kaufmann.

Precup, D., Sutton, R.S. (1998) Multi-time models for temporally abstract planning. Djvu version.

McGovern, A., and Sutton, R.S. (1998). Macro-actions in reinforcement learning: An empirical analysis. Technical Report 98-70, University of Massachusetts, Department of Computer Science.

Precup, D., Sutton, R.S., Singh,
S. (1998). Theoretical
results on reinforcement learning with temporally
abstract options. In *Machine Learning: ECML-98,
Proceedings of the 10th European Conference on Machine
Learning, Chemnitz, Germany,* pp. 382-393. Springer
Verlag.

Santamaria, J.C., Sutton, R.S., Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces.

McGovern, A., Precup, D.,
Ravindran, B., Singh, S., Sutton, R.S. (1998). Hierarchical Optimal Control
of MDPs. *Proceedings of the Tenth Yale Workshop
on Adaptive and Learning Systems,* pp. 186-191.

Barto, A. G., Sutton, R. S., Reinforcement learning in artificial intelligence. In Neural Network Models of Cognition, Donahoe, J. W., Packard Dorsel, V., Eds., pp. 358–386, Elsevier, 1997.

This chapter provides an overview of an approach to the study of learning that, in broad terms, has developed as a part of the field of Artificial Intelligence (AI), where it is called reinforcement learning due to its roots in reinforcement theories of animal learning. We introduce the field from the perspective of AI and engineering, describing some of its key features, providing a formal model of the reinforcement-learning problem, and defining basic concepts that are exploited by solution methods. Detailed discussion of solution methods themselves and their history are very broad topics that we do not attempt to cover here.

Sutton, R.S. (1997). On the significance of Markov decision processes. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud (Eds.).

Precup, D., Sutton, R.S. (1997). Multi-time models for reinforcement learning.

Precup, D., Sutton, R.S. (1997). Exponentiated gradient methods for reinforcement learning.

McGovern, A., Sutton, R.S., Fagg, A.H. (1997). Roles of macro-actions in accelerating reinforcement learning.

Precup, D., Sutton, R.S., Singh, S.P. (1997). Planning with closed-loop macro actions.

Mehra, R.K., Ravichandran, B., Cabrera, J.B.D., Greve, D.N., Sutton, R.S. (1997). Towards self-learning adaptive scheduling for ATM networks, Proceedings of the 36th Conference on Decision and Control, pp. 2393-2398, San Diego, California USA.

Sutton, R.S. (1996).
Generalization in reinforcement learning: Successful
examples using sparse coarse coding. Camera ready postscript.
Digitally remastered pdf.
Djvu
version. *Advances in Neural Information
Processing Systems 8* (Proceedings of the 1995
conference), pp. 1038-1044, MIT Press. Errata: there is a
small error in the equations of motion of the acrobot; for
the correct equations, see the
RL textbook page on the acrobot (thanks to Barry
Nichols for spotting this).

Singh, S.P., Sutton, R.S.
(1996). Reinforcement
learning with replacing eligibility traces. *Machine
Learning 22*: 123-158.

Precup, D., Sutton, R.S. (1996). Empirical comparison of gradient descent and exponentiated gradient descent in supervised and reinforcement learning. Technical Report UM-CS-1996-070, Department of Computer Science, University of Massachusetts, Amherst, MA 01003.

Kuvayev, L., Sutton, R.S. (1996).
Model-based
reinforcement learning with an approximate, learned
model. *Proceedings of the Ninth Yale Workshop
on Adaptive and Learning Systems,* pp. 101-105, Yale
University, New Haven, CT. Some additional results are in
this earlier version
of the same paper.

Mehra, R.K., Ravichandran, B., Sutton, R.S. (1996). Adaptive intelligent scheduling
for ATM networks. *Proceedings of the Ninth Yale
Workshop on Adaptive and Learning Systems,* pp.
106-111, Yale University, New Haven, CT.

Sutton, R.S. (1995). On the Virtues of Linear Learning and Trajectory Distributions. Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference.

Sutton, R.S. (1995). TD models: Modeling the
world at a mixture of time scales. *Proceedings
of the Twelfth International Conference on Machine
Learning,* pp. 531-539, Morgan Kaufmann.

Sutton, R.S., Singh,
S.P. (1994). On bias
and step size in temporal-difference learning. *Proceedings
of the Eighth Yale Workshop on Adaptive and Learning
Systems,* pp. 91-96, Yale University, New Haven, CT.

Sutton, R.S., Whitehead, S.D.
(1993). Online
learning with random representations. *Proceedings
of the Tenth International Conference on Machine
Learning,* pp. 314-321, Morgan Kaufmann.

Sutton, R.S. (1992). Adapting bias by gradient
descent: An incremental version of delta-bar-delta.
*Proceedings of the Tenth National Conference on
Artificial Intelligence,* pp. 171-176, MIT Press. Associated talk from 2/2/04.

Sutton, R.S. (1992). Gain adaptation
beats least squares? *Proceedings of the Seventh
Yale Workshop on Adaptive and Learning Systems,* pp.
161-166, Yale University, New Haven, CT.

Sutton, R.S. (1992). Machines that
Learn and Mimic the Brain. In * ACCESS, GTE's
Journal of Science and Technology,* 1992. Reprinted
in * Stethoscope Quarterly,* Spring.

Sutton, R.S. (1992). Reinforcement
learning architectures. *Proceedings ISKIT'92
International Symposium on Neural Information Processing*,
Fukuoka, Japan.

Sutton, R.S. (Ed.) (1992).
*Reinforcement Learning,* book version of a special
issue of *Machine Learning* on reinforcement
learning (Volume 8, Numbers 3/4). Kluwer. Introduction: The
challenge of reinforcement learning.

Sanger, T.D., Sutton, R.S., Matheus,
C.J. (1992). Iterative
construction of sparse polynomial approximations. *Advances
in Neural Information Processing Systems 4,* pp.
1064-1071, Morgan Kaufmann.

Gluck, M., Glauthier, P., Sutton, R.S.
(1992). Adaptation of
cue-specific learning rates in network models of human
category learning, * Proceedings of the
Fourteenth Annual Conference of the Cognitive Science
Society,* pp. 540-545, Erlbaum.

Sutton, R.S. (1991).
Planning by incremental dynamic programming. *Proceedings
of the Eighth International Workshop on Machine
Learning,* pp. 353-357, Morgan Kaufmann.

Sutton, R.S. (1991). Dyna, an integrated
architecture for learning, planning and reacting. *Working
Notes of the 1991 AAAI Spring Symposium on Integrated
Intelligent Architectures* and *SIGART Bulletin
2,* pp. 160-163.

Sutton, R.S. (1991). Integrated modeling and
control based on reinforcement learning and dynamic
programming. In D. S. Touretzky (ed.), *
Advances in Neural Information Processing Systems 3,*
pages 471-478.

Sutton, R.S. (1991). Reinforcement learning
architectures for animats, * Proceedings of the
First International Conference on Simulation of Adaptive
Behavior: From Animals to Animats,* pp.
288-296. smaller
gzipped version.

Sutton, R.S., Matheus, C.J. (1991). Learning polynomial functions by feature construction.

Sutton,
R.S., Barto, A.G., Williams, R. (1991). Reinforcement
learning is direct adaptive optimal control, *
Proceedings of the American Control Conference,*
pages 2143-2146. Also published in * IEEE Control
Systems Magazine 12,* No. 2, 19-22.

Miller, W. T., Sutton, R. S., Werbos, P. J. (Eds.), Neural Networks for Control. MIT Press, 1991.

Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.

Sutton, R.S. (1990). First results
with Dyna, an integrated architecture for learning,
planning, and reacting. In * Neural Networks for
Control,* Miller, T., Sutton, R.S., & Werbos,
P., Eds., MIT Press.

Sutton,
R.S.,
Barto, A.G. (1990).
Time-derivative models of pavlovian reinforcement.
In *Learning and Computational Neuroscience:
Foundations of Adaptive Networks,* M. Gabriel and J.
Moore, Eds., pp. 497-537. MIT Press.

Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990). Learning and sequential decision making.

Barto, A.G., Sutton, R.S., &
Watkins, C. (1990). Sequential
decision problems and neural networks. In D. S.
Touretzky (ed.), * Advances in Neural Information
Processing Systems 2,* pp. 686-693.

Whitehead, S., Sutton, R.S., & Ballard, D. (1990). Advances in reinforcement
learning and their implications for intelligent control,
* Proceedings of the Fifth IEEE International Symposium
on Intelligent Control 1990,* pp. 1289-1297.

Franklin, J., Sutton, R.S., Anderson, C., Selfridge, O.,
& Schwartz, D. (1990). Connectionist
learning control at GTE laboratories, *
Intelligent Control and Adaptive Systems,* G.
Rodriguez, Ed., Proc. SPIE 1196, pp. 242-253.

Anderson, C., Franklin, J., & Sutton, R.S. (1990). Learning a nonlinear model of a
manufacturing process using multilayer connectionist
networks, *Proceedings of the Fifth IEEE
International Symposium on Intelligent Control 1990,*
pp. 404-409.

Sutton, R.S. (1989). Artificial
intelligence as a control problem: Comments on the
relationship between machine learning and intelligent
control. Appeared in *Machine learning in a
dynamic world*, *Proceedings of the IEEE
International Symposium on Intelligent Control 1988,*
pp. 500-507.

Sutton, R.S. (1989). Implementation details of
the TD(lambda) procedure for the case of vector
predictions and backpropagation. GTE Laboratories
Technical Report TR87-509.1, as corrected August 1989. GTE
Laboratories, 40 Sylvan Road, Waltham, MA 02254.

Franklin, J., Sutton, R.S., & Anderson, C. (1989). Application of connectionist
learning methods to manufacturing process monitoring,
* Proceedings of the IEEE International Symposium on
Intelligent Control 1988,* pp. 709-712.

Sutton, R.S. (1988). Learning to
predict by the methods of temporal differences. *Machine
Learning 3*: 9-44, erratum p. 377. Scan of paper
as published, with erratum added. Digitally remastered with
missing figure in place.

Sutton, R.S. (1988). Convergence theory for
a new kind of prediction learning, * Proceedings
of the 1988 Workshop on Computational Learning Theory,*
pp. 421-42.

Sutton, R.S. (1988). NADALINE: A
normalized adaptive linear element that learns
efficiently. GTE Laboratories Technical Report
TR88-509.4. GTE Laboratories, 40 Sylvan Road, Waltham, MA
02254.

Selfridge, O., Sutton, R.S., & Anderson, C. (1988). Selected bibliography on
connectionism, In: * Evolution, Learning, and
Cognition,* Y.C. Lee (Ed.), pp. 391-403, World
Scientific Publishing.

Sutton, R.S., & Barto,
A.G. (1987). A
temporal-difference model of classical conditioning,
* Proceedings of the Ninth Annual Conference of the
Cognitive Science Society,* pp. 355-378.

Sutton, R.S.
(1986). Two problems with
backpropagation and other steepest-descent learning
procedures for networks, * Proceedings of the
Eighth Annual Conference of the Cognitive Science
Society,* pp. 823-831. Reprinted in * Artificial
Neural Networks: Concepts and Theory,* edited by P.
Mehra and B. Wah, IEEE Computer Society Press, 1992.
smaller gzipped version.

Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton, R.S., & Barto, A.G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing, and interstimulus intervals.

Sutton, R.S. (1985). Learning
distributed, searchable, internal models, *
Proceedings of the Distributed Artificial Intelligence
Workshop,* pp. 287-289.

Sutton, R.S., & Pinette, B.
(1985). The
learning of world models by connectionist networks,
* Proceedings of the Seventh Annual Conference of the
Cognitive Science Society,* pp. 54-64.

Barto, A.G. & Sutton, R.S. (1985). Neural problem solving.
In: * Synaptic Modification, Neuron Selectivity, and
Nervous System Organization,* W.B. Levy & J.A.
Anderson (Eds.), pp. 123-152. Lawrence Erlbaum.

Selfridge, O., Sutton,
R.S., & Barto, A.G. (1985). Training and tracking
in robotics, * Proceedings of the Ninth
International Joint Conference on Artificial
Intelligence,* pp. 670-672.

Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton,
R.S., & Barto, A.G. (1985). Connectionist
learning in real time: Sutton-Barto adaptive element and
classical conditioning of the nictitating membrane
response, * Proceedings of the Seventh Annual
Conference of the Cognitive Science Society,* pp.
318-322.

Sutton, R.S. (1984). Temporal credit
assignment in reinforcement learning (106
Mbytes). Ph.D. dissertation, Department of Computer
Science, University of Massachusetts, Amherst, MA 01003.
Published as COINS Technical Report 84-2.

Barto, A.G., Sutton, R.S.,
& Anderson, C. (1983). Neuron-like adaptive elements
that can solve difficult learning control problems, *
IEEE Transactions on Systems, Man, and Cybernetics,
SMC-13*: 834-846. 12.4 Mb pdf,
3.5 Mb gzipped

Sutton, R.S. (1982 -
unpublished draft). A
theory of salience change dependent on the relationship
between discrepancies on successive trials on which the
stimulus is present. smaller gzipped version.

Barto, A.G. & Sutton, R.S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element.

Barto, A.G., Anderson,
C., & Sutton, R.S. (1982). Synthesis of nonlinear control
surfaces by a layered associative network, *
Biological Cybernetics 43*:175-185.

Barto, A.G., Sutton, R.S., & Anderson, C. (1982). Spatial learning simulation
systems, * Proceedings of the 10th IMACS World
Congress on Systems Simulation and Scientific
Computation,* pp. 204-206.

Sutton, R.S. (1981). Adaptation of learning rate
parameters. In: Goal Seeking Components for Adaptive
Intelligence: An Initial Assessment, by A. G.
Barto and R. S. Sutton. Air Force Wright Aeronautical
Laboratories Technical Report AFWAL-TR-81-1070.
Wright-Patterson Air Force Base, Ohio 45433.

Sutton, R.S., &
Barto, A.G. (1981). Toward a
modern theory of adaptive networks: Expectation and
prediction, * Psychological Review 88*:135-140.
Translated into Spanish by G. Ruiz to appear in the
journal * Estudios de Psicologia.*

Sutton, R.S.,
& Barto, A.G. (1981). An adaptive
network that constructs and uses an internal model of
its world, * Cognition and Brain Theory 4*:217-246.

Barto, A.G., Sutton, R.S.
(1981). Goal seeking
components for adaptive intelligence: An initial
assessment. Air Force Wright Aeronautical
Laboratories Technical Report AFWAL-TR-81-1070.
Wright-Patterson Air Force Base, Ohio 45433. 524
pages. (Appendix C is
available separately.)

Barto, A.G., Sutton,
R.S., & Brouwer, P. (1981). Associative search network: A
reinforcement learning associative memory, *
Biological Cybernetics 40*:201-211.

Barto, A.G. & Sutton,
R.S. (1981). Landmark
learning: An illustration of associative search, *
Biological Cybernetics 42*:1-8.

Sutton, R.S. (1978). Single channel theory: A
neuronal theory of learning, * Brain Theory
Newsletter 3,* No. 3/4, pp. 72-75. (earliest
publication)

Sutton, R.S.
(1978 - unpublished). A unified theory of
expectation in classical and instrumental conditioning.
Bachelors thesis, Stanford University.

Sutton, R.S. (1978 - unpublished). Learning theory
support for a single channel theory of the brain.