Rich Sutton's Publications

Last updated June 2021.

First, a quick guide to some of the highlights, roughly in order of the work's popularity or potential current interest:
Also, some RL pubs that aren't mine, available for researchers:
PhD theses
MSc theses

For any broken links, please send email to rich@richsutton.com.


Dohare, S., Hernandez-Garcia, J. F., Rahman, P., Sutton, R. S., Mahmood, A. R. (2023). Maintaining Plasticity in Deep Continual Learning. arXiv:2210.14361.
ABSTRACT: Modern deep-learning systems are specialized to problem settings in which training occurs once and then never again, as opposed to continual-learning settings in which training occurs continually. If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples. More fundamental, but less well known, is that they may also lose their ability to learn on new examples, a phenomenon called loss of plasticity. We provide direct demonstrations of loss of plasticity using the MNIST and ImageNet datasets repurposed for continual learning as sequences of tasks. In ImageNet, binary classification performance dropped from 89% accuracy on an early task down to 77%, about the level of a linear network, on the 2000th task. Loss of plasticity occurred with a wide range of deep network architectures, optimizers, activation functions, batch normalization, dropout, but was substantially eased by L2-regularization, particularly when combined with weight perturbation. Further, we introduce a new algorithm -- continual backpropagation -- which slightly modifies conventional backpropagation to reinitialize a small fraction of less-used units after each example and appears to maintain plasticity indefinitely.
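
A minimal sketch of the continual-backpropagation idea described in the abstract, not the authors' code: a toy regression network trained by ordinary SGD in which, every so often, the hidden unit with the lowest running utility is reinitialized. The synthetic data stream and the simple utility measure are stand-ins chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden = 10, 50
    W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # input-to-hidden weights
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(0.0, 0.1, n_hidden)           # hidden-to-output weights
    utility = np.zeros(n_hidden)                  # running utility of each hidden unit
    alpha, decay, replace_every = 0.01, 0.99, 100

    for t in range(1, 20001):
        x = rng.normal(size=n_in)
        y = np.sin(x[0]) + 0.5 * x[1]             # stand-in target stream
        h = np.tanh(W1 @ x + b1)
        y_hat = w2 @ h
        err = y - y_hat

        # ordinary backprop / SGD step
        grad_h = err * w2 * (1.0 - h * h)
        w2 += alpha * err * h
        W1 += alpha * np.outer(grad_h, x)
        b1 += alpha * grad_h

        # utility: running average of each hidden unit's |contribution| to the output
        utility = decay * utility + (1 - decay) * np.abs(w2 * h)

        # continual-backprop-style step: occasionally reinitialize the least-used unit
        if t % replace_every == 0:
            i = int(np.argmin(utility))
            W1[i] = rng.normal(0.0, 0.1, n_in)
            b1[i] = 0.0
            w2[i] = 0.0                           # new unit starts with no influence
            utility[i] = np.median(utility)       # give it time before it is re-tested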

Rafiee, B., Ghiassian, S., Jin, J., Sutton, R., Luo, J., & White, A. (2023). Auxiliary task discovery through generate-and-test. In Proceedings of the Second Conference on Lifelong Learning Agents. Also arXiv:2210.14361.
ABSTRACT: In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning. Auxiliary tasks tend to improve data efficiency by forcing the agent to learn auxiliary prediction and control objectives in addition to the main task of maximizing reward, and thus producing better representations. Typically these tasks are designed by people. Meta-learning offers a promising avenue for automatic task discovery; however, these methods are computationally expensive and challenging to tune in practice. In this paper, we explore a complementary approach to auxiliary task discovery: continually generating new auxiliary tasks and preserving only those with high utility. We also introduce a new measure of auxiliary task usefulness based on how useful the features induced by them are for the main task. Our discovery algorithm significantly outperforms random tasks, hand-designed tasks, and learning without auxiliary tasks across a suite of environments.

De Asis, K., Graves, E., & Sutton, R. S. (2023). Value-aware Importance Weighting for Off-policy Reinforcement Learning. In Proceedings of the Second Conference on Lifelong Learning Agents. Appeared August 22, 2023. Also arXiv:2306.15625.
ABSTRACT: Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance weights to correct samples in off-policy learning. We propose the use of value-aware importance weights which take into account the sample space to provide lower variance, but still unbiased, estimates under a target distribution. We derive how such weights can be computed, and detail key properties of the resulting importance weights. We then extend several reinforcement learning prediction algorithms to the off-policy setting with these weights, and evaluate them empirically.
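
For context, the baseline being improved on is ordinary per-decision importance sampling in off-policy TD(0), sketched below on a toy chain; the paper's value-aware weights are not reproduced here, and the policies, dynamics, and step size are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2
    V = np.zeros(n_states)
    alpha, gamma = 0.1, 0.9

    # behavior and target policies over two actions (assumed, for illustration)
    b = np.full((n_states, n_actions), 0.5)                 # behavior: uniform random
    pi = np.tile(np.array([0.9, 0.1]), (n_states, 1))       # target: mostly action 0

    def step(s, a):                                         # toy chain: action 0 right, action 1 left
        s2 = min(s + 1, n_states - 1) if a == 0 else max(s - 1, 0)
        return (1.0 if s2 == n_states - 1 else 0.0), s2

    s = 0
    for t in range(100_000):
        a = rng.choice(n_actions, p=b[s])
        r, s2 = step(s, a)
        rho = pi[s, a] / b[s, a]                            # ordinary importance-sampling ratio
        V[s] += alpha * rho * (r + gamma * V[s2] - V[s])    # off-policy TD(0) with per-step IS
        s = s2
    print(np.round(V, 2))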

Sharifnassab, A., & Sutton, R. S. (2023). Toward Efficient Gradient-Based Value Estimation. In International Conference on Machine Learning (pp. 30827-30849). Also Proceedings of Machine Learning Research 202:30827-30849.
ABSTRACT: Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has large condition-number. To resolve the adverse effect of poor conditioning of MSBE on gradient based methods, we propose a low complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.

Mathewson, K. W., Parker, A. S., Sherstan, C., Edwards, A. L., Sutton, R. S., & Pilarski, P. M. (2023). Communicative capital: A key resource for human-machine shared agency and collaborative capacity. Neural Computing and Applications 35(23):16805-16819.
ABSTRACT: In this work, we present a perspective on the role machine intelligence can play in supporting human abilities. In particular, we consider research in rehabilitation technologies such as prosthetic devices, as this domain requires tight coupling between human and machine. Taking an agent-based view of such devices, we propose that human–machine collaborations have a capacity to perform tasks which is a result of the combined agency of the human and the machine. We introduce communicative capital as a resource developed by a human and a machine working together in ongoing interactions. Development of this resource enables the partnership to eventually perform tasks at a capacity greater than either individual could achieve alone. We then examine the benefits and challenges of increasing the agency of prostheses by surveying literature which demonstrates that building communicative resources enables more complex, task-directed interactions. The viewpoint developed in this article extends current thinking on how best to support the functional use of increasingly complex prostheses, and establishes insight toward creating more fruitful interactions between humans and supportive, assistive, and augmentative technologies.

Javed, K., Shah, H., Sutton, R. S., & White, M. (2023). Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks. Journal of Machine Learning Research 24(256):1-34.
ABSTRACT: Constructing states from sequences of observations is an important component of reinforcement learning agents. One solution for state construction is to use recurrent neural networks. Back-propagation through time (BPTT), and real-time recurrent learning (RTRL) are two popular gradient-based methods for recurrent learning. BPTT requires complete trajectories of observations before it can compute the gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the network for computationally efficient learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.

Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers, F., Tanner, B., & White, A. (2023). Reward-respecting subtasks for model-based reinforcement learning. Artificial Intelligence 324. ArXiv:2202.03466.
ABSTRACT: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.

Wan, Y., Sutton, R. S. (2022). Toward discovering options that achieve faster planning. ArXiv:2205.12515.
ABSTRACT: We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. For a given set of episodic tasks and a given number of options, the objective prefers options that can be used to achieve a high return by composing few options. By composing few options, fast planning can be achieved. When faced with new tasks similar to the given ones, the discovered options are also expected to accelerate planning. Our objective extends the objective proposed by Harb et al. (2018) for the single-task setting to the multi-task setting. A closer look at Harb et al.'s objective shows that the best options discovered given one task are not likely to be useful for future unseen tasks and that the multi-task setting is indeed necessary for this purpose. In the same paper, Harb et al. also proposed an algorithm to optimize their objective, and the algorithm can be naturally extended to the multi-task setting. We empirically show that in the four-room domain the extension does not achieve a high objective value and propose a new algorithm that better optimizes the proposed objective. In the same four-room domain, we show that 1) a higher objective value is typically associated with options with which fewer planning iterations are needed to achieve near-optimal performance, 2) our new algorithm achieves a high objective value, which is close to the value achieved by a set of human-designed options, 3) the best number of planning iterations given the discovered options is much smaller and matches that obtained with human-designed options, and 4) the options produced by our algorithm also make intuitive sense because they move to and terminate at cells near hallways connecting two neighboring rooms.

Tian, T., Young, K., & Sutton, R. S. (2022). Doubly-asynchronous value iteration: Making value iteration asynchronous in actions. In Advances in Neural Information Processing Systems 35. Also  ArXiv:2207.01613.
ABSTRACT: Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in-place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action space. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains similarly appealing theoretical properties to VI without the need to wait for a full sweep through the entire action space in each update. In this paper, we show DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1-delta, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.
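
The flavor of asynchrony in both states and actions can be conveyed with a toy sketch, which is not the paper's DAVI algorithm: on each update one state is chosen and only a small sampled subset of actions is backed up, and the state's value is never decreased so that earlier backups are not thrown away. The random MDP and the max-with-current-value rule below are assumptions made only for this illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, k = 30, 100, 5          # k = size of the sampled action subset
    gamma = 0.9
    # a random MDP, purely for illustration
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    V = np.zeros(n_states)                       # starts below the optimal values (rewards are nonnegative)

    for t in range(200_000):
        s = rng.integers(n_states)                          # asynchronous in states: one state at a time
        A = rng.choice(n_actions, size=k, replace=False)    # asynchronous in actions: a sampled subset
        backup = R[s, A] + gamma * P[s, A] @ V              # backups for the sampled actions only
        V[s] = max(V[s], backup.max())                      # never discard a better earlier estimate
    print(np.round(V[:5], 2))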

Sutton, R. S. (2022). A history of meta-gradient: Gradient Methods for meta-learning. ArXiv:2202.09701.
ABSTRACT: Systems for learning parameters often have meta-parameters such as step sizes, initial weights, or dimensional weightings. With a given setting of the meta-parameters, the learning system is complete and capable of finding parameters that are suited to the task, but its efficiency typically depends on the particular choice of meta-parameters. This has led to interest in learning processes that can find good choices for meta-parameters automatically from experience. These higher-level learning methods are often characterized as "learning to learn" or, as we shall call them here, meta-learning. In this article we briefly review the history of meta-learning.

Sutton, R. S., Bowling, M., Pilarski, P. M. (2022). The Alberta plan for AI research. ArXiv:2208.11173.
ABSTRACT: Herein we describe our approach to artificial intelligence (AI) research, which we call the Alberta Plan. The Alberta Plan is pursued within our research groups in Alberta and by others who are like minded throughout the world. We welcome all who would join us in this pursuit.

Sutton, R. S. (2022). The quest for a common model of the intelligent agent. In 5th Multi-disciplinary Conference on Reinforcement Learning and Decision Making. ArXiv:2202.13252
ABSTRACT: The premise of the Multi-disciplinary Conference on Reinforcement Learning and Decision Making is that multiple disciplines share an interest in goal-directed decision making over time. The idea of this paper is to sharpen and deepen this premise by proposing a perspective on the decision maker that is substantive and widely held across psychology, artificial intelligence, economics, control theory, and neuroscience, which I call the common model of the intelligent agent. The common model does not include anything specific to any organism, world, or application domain. The common model does include aspects of the decision maker’s interaction with its world (there must be input and output, and a goal) and internal components of the decision maker (for perception, decision-making, internal evaluation, and a world model). I identify these aspects and components, note that they are given different names in different disciplines but refer essentially to the same ideas, and discuss the challenges and benefits of devising a neutral terminology that can be used across disciplines. It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.

Javed, K., Shah, H., Sutton, R. S., “Scalable Online State Construction using Recurrent Networks.” In Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2022.
ABSTRACT:State construction from sensory observations is an important component of a reinforcement learning agent. One solution for state construction is to use recurrent neural networks. Two popular gradient-based methods for recurrent learning are back-propagation through time (BPTT), and real-time recurrent learning (RTRL). BPTT looks at the complete sequence of observations before computing gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning a recurrent network incrementally, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add bias or noise to the gradient estimate. Instead, they trade off the functional capacity of the recurrent network to achieve scalable learning. We demonstrate the effectiveness of our approach on a prediction learning benchmark inspired by animal learning.
Wan, Y., Sutton, R. S., “Discovering Options by Minimizing the Number of Composed Options to Solve Multiple Tasks.” In Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2022.
ABSTRACT: We propose a new metric for discovering options in the tabular setting: a set of options is better than another set of the same number of options if, for a given set of episodic tasks, on average, optimal solutions to the tasks can be obtained by composing fewer options. By composing fewer options to obtain solutions, planning requires less computation and can be faster. We propose a new objective function and prove that the set of options that minimizes the proposed objective results in the least number of composed options on average in the tabular setting. In the function approximation setting, where generally the optimal solutions are not achievable, this objective trades off higher returns against fewer composed options. We further propose a tabular algorithm that optimizes the proposed objective, based on the option-critic algorithm (Bacon et al., 2017). In a four-room domain with 16 tasks, we demonstrate that the options produced by our algorithm are consistent with human intuition because they seem to lead to cells near hallways connecting two rooms.

De Asis, K., Chan, A., Sutton, R. S., “Inverse Policy Evaluation for Value-based Decision Making.” In Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2022.
ABSTRACT: Value-based methods for control often involve approximate value iteration (e.g., Q-learning), and behaving greedily with respect to the resulting value estimates with some degree of entropy to ensure the state-space is sufficiently explored. Such a greedy policy is an improvement over the previous policy, under which the current values were estimated. However, value-iteration may produce value functions that do not correspond with any policy. This is especially relevant with function approximation, when the true value function cannot be perfectly represented. In this work, we explore the use of Inverse Policy Evaluation, the process of solving for a likely policy given a value function. We derive a simple, incremental algorithm for the procedure, analyze how inaccuracies in the value function manifest in the corresponding policy, and provide empirical results emphasizing key properties of the mapping from value functions to policies.

Dohare, S., Mahmood, R., Sutton, R. S., “Continual Backprop: Stochastic Gradient Descent with Persistent Randomness.” In Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2022.
ABSTRACT:The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop’s ability to learn. To address this degradation in Backprop’s plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the Continual Backprop algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.

Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers, F., Tanner, B., White, A. “Reward-Respecting Subtasks for Model-Based Reinforcement Learning.” In Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2022.
ABSTRACT: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.

Naik, A., Sutton, R. S. (2022). Multi-Step Average-Reward Prediction via Differential TD(λ). In Multi-disciplinary Conference on Reinforcement Learning and Decision Making.
ABSTRACT: We present Differential TD(λ), an average-reward algorithm that extends Wan et al.’s (2021) Differential TD(0) from the one-step tabular case to that of multi-step linear function approximation using eligibility traces. We characterize the algorithm’s fixed point and present a convergence theorem. Our analysis builds on prior work by Tsitsiklis and Van Roy (1999), who proposed a trace-based prediction algorithm that is restricted to the on-policy setting. Preliminary experiments show that Differential TD(λ) converges to the fixed point predicted by the theory and that an intermediate value of the bootstrapping parameter λ results in the best sample efficiency.

Rafiee, B., Abbas, Z., Ghiassian, S., Kumaraswamy, R., Sutton, R., Ludvig, E., White, A. (2022). From Eye-blinks to State Construction: Diagnostic Benchmarks for Online Representation Learning. Adaptive Behavior. Also arXiv:2011.04590.
ABSTRACT: Experiments in classical conditioning show that animals such as rabbits, pigeons, and dogs can make long temporal associations that enable multi-step prediction. To replicate this remarkable ability, an agent must construct an internal state representation that summarizes its interaction history. Recurrent neural networks can automatically construct state and learn temporal associations. But the current training methods are prohibitively expensive for online prediction -- continual learning on every time step -- which is the focus of this paper. To facilitate research in online prediction, we present three new diagnostic prediction problems inspired by classical-conditioning experiments. The proposed problems test the learning capabilities that animals readily exhibit and highlight the current recurrent learning methods' limitations. While the proposed problems are nontrivial, they are still amenable to extensive testing and analysis in the small-compute regime, thereby enabling researchers to study issues in isolation carefully, ultimately accelerating progress towards scalable online representation learning methods.

Wan, Y., Naik, A., Sutton, R. (2021). Average-Reward Learning and Planning with Options. Advances in Neural Information Processing Systems, 34, 22758-22769.
ABSTRACT: We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.

Samani, A., Sutton, R. S. (2021). Learning agent state online with recurrent generate-and-test. ArXiv:2112.15236.
ABSTRACT: Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them unideal for online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features get replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
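
A schematic of the generate-and-test idea, not the paper's method: trace-like recurrent features with randomly generated input weights and decays are tested with a simple outgoing-weight utility, and the least useful ones are periodically replaced. The observation stream, target, and utility measure below are stand-ins for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_obs, n_feat = 4, 100
    Win = rng.normal(size=(n_feat, n_obs))       # random input weights per feature (the generator)
    lam = rng.uniform(0.0, 0.99, n_feat)         # random per-feature decay (gives memory)
    phi = np.zeros(n_feat)                       # recurrent, trace-like feature vector
    w = np.zeros(n_feat)
    util = np.zeros(n_feat)
    alpha, decay, replace_every, n_replace = 0.01, 0.999, 500, 5

    obs = np.zeros(n_obs)
    for t in range(1, 100_001):
        obs = 0.9 * obs + 0.1 * rng.normal(size=n_obs)   # a toy observation stream
        target = obs[0] + 0.5 * obs[1]                   # stand-in prediction target
        phi = lam * phi + np.tanh(Win @ obs)             # each feature keeps its own fading memory
        err = target - w @ phi
        w += alpha * err * phi / max(1.0, phi @ phi)     # normalized LMS for stability
        util = decay * util + (1 - decay) * np.abs(w * phi)

        if t % replace_every == 0:                       # the tester: replace the least useful features
            worst = np.argsort(util)[:n_replace]
            Win[worst] = rng.normal(size=(n_replace, n_obs))
            lam[worst] = rng.uniform(0.0, 0.99, n_replace)
            phi[worst] = 0.0
            w[worst] = 0.0
            util[worst] = np.median(util)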

Ashley, D. R., Ghiassian, S., Sutton, R. S. (2021) Does the Adam Optimizer Exacerbate Catastrophic Forgetting? ArXiv:2102.07686.
ABSTRACT: Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), however, it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified, and, moreover, to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of which modern gradient-based optimization algorithm is used to train an ANN has a significant impact on the amount of catastrophic forgetting and show that, surprisingly, in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than the more modern algorithms such as Adam. We empirically compare four different existing metrics for quantifying catastrophic forgetting and (2) show that the degree to which the learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when looking at catastrophic forgetting. Based on our results, we recommend inter-task forgetting in supervised learning must be measured with both retention and relearning metrics concurrently, and intra-task forgetting in reinforcement learning must---at the very least---be measured with pairwise interference.

Kudashkina, K., Wan, Y., Naik, A., Sutton, R. S. (2021) Planning with Expectation Models for Control, ArXiv:2104.08543.
ABSTRACT: In model-based reinforcement learning (MBRL), Wan et al. (2019) showed conditions under which the environment model could produce the expectation of the next feature vector rather than the full distribution, or a sample thereof, with no loss in planning performance. Such expectation models are of interest when the environment is stochastic and non-stationary, and the model is approximate, such as when it is learned using function approximation. In these cases a full distribution model may be impractical and a sample model may be either more expensive computationally or of high variance. Wan et al. considered only planning for prediction to evaluate a fixed policy. In this paper, we treat the control case - planning to improve and find a good approximate policy. We prove that planning with an expectation model must update a state-value function, not an action-value function as previously suggested (e.g., Sorg & Singh, 2010). This opens the question of how planning influences action selections. We consider three strategies for this and present general MBRL algorithms for each. We identify the strengths and weaknesses of these algorithms in computational experiments. Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.

Silver, D., Singh, S., Precup, D., Sutton, R. S. (2021) Reward Is Enough. Artificial Intelligence 299, 103535.
ABSTRACT: In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.

Ghiassian, S., Sutton, R. S. (2021) An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task. ArXiv:2106.00922.
ABSTRACT: Off-policy prediction---learning the value function for one policy from data generated while following another policy---is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD(λ), Vtrace, and versions of Tree Backup and ABQ modified to apply to a prediction setting. Our experiments used the Collision task, a small idealized off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. We assessed the performance of the algorithms according to their learning rate, asymptotic error level, and sensitivity to step-size and bootstrapping parameters. By these measures, the eleven algorithms can be partially ordered on the Collision task. In the top tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. In the middle tier, the five Gradient-TD algorithms and Off-policy TD(λ) were more sensitive to the bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and ABQ; these algorithms were no faster and had higher asymptotic error than the others. Our results are definitive for this task, though of course experiments with more tasks are needed before an overall assessment of the algorithms' merits can be made.
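
As a reference point for the top-tier algorithms mentioned above, here is a sketch of one-step Emphatic TD (λ = 0, unit interest) with linear features, in the form given by Sutton, Mahmood, and White (2016); the ring world, random features, and policies are toy assumptions, and the paper's experiments use the λ variants with more careful parameter studies.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_feat = 8, 4
    X = rng.normal(size=(n_states, n_feat))     # fixed random features for each state (assumed)
    w = np.zeros(n_feat)
    alpha, gamma = 0.005, 0.9
    b_right, pi_right = 0.5, 0.6                # prob. of moving right under behavior / target

    def step(s, right):                         # toy ring world, reward on reaching state 0
        s2 = (s + 1) % n_states if right else (s - 1) % n_states
        return (1.0 if s2 == 0 else 0.0), s2

    s, F, rho_prev = 0, 0.0, 1.0
    for t in range(200_000):
        right = rng.random() < b_right
        rho = pi_right / b_right if right else (1 - pi_right) / (1 - b_right)
        r, s2 = step(s, right)
        F = gamma * rho_prev * F + 1.0          # followon trace (interest taken to be 1)
        delta = r + gamma * X[s2] @ w - X[s] @ w
        w += alpha * F * rho * delta * X[s]     # Emphatic TD(0) update
        s, rho_prev = s2, rho

    print(np.round(X @ w, 2))                   # approximate values of the eight states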

Zhang, S., Wan, Y., Naik, A., Sutton, R. S., Whiteson, S. (2021). Average-Reward Off-Policy Policy Evaluation with Function Approximation. In Proceedings of the International Conference on Machine Learning.
ABSTRACT: We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

Wan, Y., Naik, A., Sutton, R. S. (2021). Learning and Planning in Average-Reward Markov Decision Processes. In Proceedings of the International Conference on Machine Learning.
ABSTRACT: We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
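
The abstract's key idea, updating the reward-rate estimate with the TD error, can be sketched in the simplest on-policy tabular prediction case; the Markov chain and step sizes below are toy assumptions, and the paper's algorithms also cover off-policy learning, control, and planning.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states = 6
    V = np.zeros(n_states)
    r_bar = 0.0                                  # estimate of the average reward (reward rate)
    alpha, eta = 0.05, 0.1                       # eta scales the reward-rate step size

    def step(s):
        # a toy continuing chain: usually move right, earn 1 on wrapping around to state 0
        s2 = (s + 1) % n_states if rng.random() < 0.9 else s
        return (1.0 if s2 == 0 and s != 0 else 0.0), s2

    s = 0
    for t in range(200_000):
        r, s2 = step(s)
        delta = r - r_bar + V[s2] - V[s]         # differential TD error
        V[s] += alpha * delta
        r_bar += eta * alpha * delta             # reward rate updated with the TD error itself
        s = s2

    print(round(r_bar, 3), np.round(V - V.mean(), 2))   # differential values are defined up to a constant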

Lee, J. Y., Sutton, R. S. (2021). Policy iterations for reinforcement learning problems in continuous time and space — Fundamental theory and methods. Automatica 126, 109421.

ABSTRACT: Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as the fundamental to develop RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants to solve a general RL problem in continuous time and space (CTS), with the environment modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies including 1) discounted RL and 2) optimal control tasks. Fundamental mathematical properties — admissibility, uniqueness of the solution to the Bellman equation, monotone improvement, convergence, and optimality of the solution to the Hamilton-Jacobi-Bellman equation (HJBE) — are all investigated in-depth and improved from the existing theory, along with the general and case studies. Finally, the proposed ones are simulated with an inverted-pendulum model and their model-based and partially model-free implementations to support the theory and further investigate them.

Barto, A. G., Sutton, R. S., Anderson, C. W. (2021) Looking Back on the Actor-Critic Architecture. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 51(1).
ABSTRACT: This retrospective describes the overall research project that gave rise to the authors' paper “Neuronlike adaptive elements that can solve difficult learning control problems” that was published in the 1983 Neural and Sensory Information Processing special issue of the IEEE Transactions on Systems, Man, and Cybernetics. This look back explains how this project came about, presents the ideas and previous publications that influenced it, and describes our most closely related subsequent research. It concludes by pointing out some noteworthy aspects of this article that have been eclipsed by its main contributions, followed by commenting on some of the directions and cautions that should inform future research.

Chan, A., De Asis, K., & Sutton, R. S. (2020). Inverse policy evaluation for value-based sequential decision-making. ArXiv:2008.11329.
ABSTRACT: Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., Q-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of some policy, over which the greedy policy will be an improvement. However, value-iteration can produce value functions that do not correspond to any policy. This is especially relevant in the function-approximation regime, when the true value function can't be perfectly represented. In this work, we explore the use of inverse policy evaluation, the process of solving for a likely policy given a value function, for deriving behavior from a value function. We provide theoretical and empirical results to show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.

Dohare, S., Sutton, R. S., Mahmood, A. R. (2021). Continual backprop: Stochastic gradient descent with persistent randomness. ArXiv:2108.06325.
ABSTRACT: The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this degradation in Backprop's plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the Continual Backprop algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.

Ghiassian, S., Sutton, R. S. (2021). An empirical comparison of off-policy prediction learning algorithms in the four rooms environment. ArXiv:2109.05110.
ABSTRACT: Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. We empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks: the Rooms task, and the High Variance Rooms task. The tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as 2^14 and can sometimes be two. To control the high variance caused by the product of the importance sampling ratios, step size should be set small, which in turn slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as 2^14 × 2^5. This paper builds upon the empirical study of off-policy prediction learning algorithms by Ghiassian and Sutton (2021). We consider the same set of algorithms as theirs and employ the same experimental methodology. The algorithms considered are: Off-policy TD(λ), five Gradient-TD algorithms, two Emphatic-TD algorithms, Tree Backup(λ), Vtrace(λ), and ABTD(ζ). We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. The data shows that Tree Backup(λ), Vtrace(λ), and ABTD(ζ) are not affected by the high variance as much as other algorithms but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD(λ) tends to have lower asymptotic error than other algorithms, but might learn more slowly in some cases. We suggest algorithms for practitioners based on their problem of interest, and suggest approaches that can be applied to specific algorithms that might result in substantially improved algorithms.

De Asis, K., Chan, A., Pitis, S., Sutton, R., & Graves, D. (2020). Fixed-horizon temporal difference methods for stable reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 3741-3748).

ABSTRACT: We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
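
The bootstrapping structure is easy to sketch in the tabular, on-policy prediction case: the horizon-h value is updated toward the reward plus the horizon-(h-1) value of the next state, so no value function bootstraps from itself. The deterministic cycle below is a toy assumption, and discounting is omitted so that V[H] is simply the predicted sum of the next H rewards.

    import numpy as np

    n_states, H = 6, 10                      # H = longest horizon of interest
    V = np.zeros((H + 1, n_states))          # V[0] is identically zero by definition
    alpha = 0.1

    def step(s):                             # toy deterministic cycle with reward on wrap-around
        s2 = (s + 1) % n_states
        return (1.0 if s2 == 0 else 0.0), s2

    s = 0
    for t in range(100_000):
        r, s2 = step(s)
        for h in range(1, H + 1):            # each horizon bootstraps only from the one below it
            V[h, s] += alpha * (r + V[h - 1, s2] - V[h, s])
        s = s2

    print(np.round(V[H], 2))                 # predicted sum of rewards over the next H steps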

Dalrymple, A. N., Roszko, D. A., Sutton, R. S., & Mushahwar, V. K. (2020). Pavlovian control of intraspinal microstimulation to produce over-ground walking. Journal of Neural Engineering 17(3), 036002.

ABSTRACT: Objective. Neuromodulation technologies are increasingly used for improving function after neural injury. To achieve a symbiotic relationship between device and user, the device must augment remaining function, and independently adapt to day-to-day changes in function. The goal of this study was to develop predictive control strategies to produce over-ground walking in a model of hemisection spinal cord injury (SCI) using intraspinal microstimulation (ISMS). Approach. Eight cats were anaesthetized and placed in a sling over a walkway. The residual function of a hemisection SCI was mimicked by manually moving one hind-limb through the walking cycle. ISMS targeted motor networks in the lumbosacral enlargement to activate muscles in the other, presumably ‘paralyzed’ limb, using low levels of current (<130 μA). Four people took turns to move the ‘intact’ limb, generating four different walking styles. Two control strategies, which used ground reaction force and angular velocity information about the manually moved ‘intact’ limb to control the timing of the transitions of the ‘paralyzed’ limb through the step cycle, were compared. The first strategy used thresholds on the raw sensor values to initiate transitions. The second strategy used reinforcement learning and Pavlovian control to learn predictions about the sensor values. Thresholds on the predictions were then used to initiate transitions. Main results. Both control strategies were able to produce alternating, over-ground walking. Transitions based on raw sensor values required manual tuning of thresholds for each person to produce walking, whereas Pavlovian control did not. Learning occurred quickly during walking: predictions of the sensor signals were learned rapidly, initiating correct transitions after ≤4 steps. Pavlovian control was resilient to different walking styles and different cats, and recovered from induced mistakes during walking. Significance. This work demonstrates, for the first time, that Pavlovian control can augment remaining function and facilitate personalized walking with minimal tuning requirements.
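
The control strategy in the abstract, learning a temporally extended prediction of a sensor signal and triggering a transition when that prediction crosses a threshold, can be sketched with a toy signal; the one-hot gait-phase features, the synthetic force signal, and the threshold below are assumptions for illustration only and do not reproduce the study's setup.

    import numpy as np

    n_phases = 20                            # one-hot "gait phase" features (an assumption)
    w = np.zeros(n_phases)                   # prediction of the discounted future sensor signal
    alpha, gamma, threshold = 0.1, 0.9, 2.0

    def sensor(phase):                       # stand-in for a ground-reaction-force signal
        return 1.0 if phase >= 15 else 0.0   # force rises late in the step cycle

    s = 0
    for t in range(50_000):                  # TD(0) "nexting": learn to predict the signal
        nxt = (s + 1) % n_phases
        cumulant = sensor(nxt)
        w[s] += alpha * (cumulant + gamma * w[nxt] - w[s])
        s = nxt

    # Pavlovian control: initiate the step-cycle transition when the prediction crosses the threshold
    above = w > threshold
    trigger_phase = int(np.argmax(above)) if above.any() else None
    print(np.round(w, 2), trigger_phase)     # the trigger fires several phases before the force itself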

Kudashkina, K., Pilarski, P. M., & Sutton, R. S. (2020). Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI. ArXiv:2008.12095.
ABSTRACT: Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to researching and realizing this promise? In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning. The primary advantages of voice document editing are that the domain is tightly scoped and that it provides something for the conversation to be about (the document) that is delimited and fully accessible to the intelligent assistant. The advantages of reinforcement learning in general are that its methods are designed to learn from interaction without explicit instruction and that it formalizes the purposes of the assistant. Model-based reinforcement learning is needed in order to genuinely understand the domain of discourse and thereby work efficiently with the user to achieve their goals. Together, voice document editing and model-based reinforcement learning comprise a promising research direction for achieving conversational AI.

Young, K., & Sutton, R. S. (2020). Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning. ArXiv:2010.15268.
ABSTRACT: Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood however is the quality of the policies in the region of oscillation. In this paper we present simple examples illustrating that in addition to policy oscillation and multiple fixed points -- the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribution. This means the values used for greedification are unreliable and can steer the policy in undesirable directions. Our observation that this can lead to the worst possible policy shows that in a general sense such algorithms are unreliable. The existence of such examples helps to narrow the kind of theoretical guarantees that are possible and the kind of algorithmic ideas that are likely to be helpful. We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.

Tian, T., Sutton, R. S. (2020). “Extending sliding-step importance weighting from supervised learning to reinforcement learning.” Best of IJCAI Workshops 2019, Springer.

ABSTRACT:Stochastic gradient descent (SGD) has been in the center of many advances in modern machine learning. SGD processes examples sequentially, updating a weight vector in the direction that would most reduce the loss for that example. In many applications, some examples are more important than others and, to capture this, each example is given a non-negative weight that modulates its impact. Unfortunately, if the importance weights are highly variable they can greatly exacerbate the difficulty of setting the step-size parameter of SGD. To ease this difficulty, Karampatziakis and Langford [6] developed a class of elegant algorithms that are much more robust in the face of highly variable importance weights in supervised learning. In this paper we extend their idea, which we call “sliding step,” to reinforcement learning, where importance weighting can be particularly variable due to the importance sampling involved in off-policy learning algorithms. We compare two alternative ways of doing the extension in the linear function approximation setting, then introduce specific sliding-step versions of the TD(0) and Emphatic TD(0) learning algorithms. We prove the convergence of our algorithms and demonstrate their effectiveness on both on-policy and off-policy problems. Overall, our new algorithms appear to be effective in bringing the robustness of the sliding-step technique from supervised learning to reinforcement learning.

Sutton, R. S. (2020) John McCarthy's Definition of Intelligence. Journal of Artificial General Intelligence 11(2), 66-67. doi:10.2478/jagi-2020-0003, 2 pages.

Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepezvari, C., Singh, S., Van Roy, B., Sutton, R., Silver, D., van Hasselt, H. (2020). “Behaviour suite for reinforcement learning.” Proceedings of the International Conference on Learning Representations, 2020.

ABSTRACT: This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms. Our code is Python, and easy to use within existing projects. We include examples with OpenAI Baselines, Dopamine as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a periodic review of bsuite from a committee of prominent researchers.

Hernandez-Garcia, J. F., & Sutton, R. S. (2019). Learning Sparse Representations Incrementally in Deep Reinforcement Learning. ArXiv:1912.04002.
ABSTRACT: Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward. Previous results were based on a two-step process where the representation was learned offline and the action-value function was learned online afterwards. In this paper, we investigate if it is possible to learn a sparse representation and the action-value function simultaneously and incrementally. We investigate this question by employing several regularization techniques and observing how they affect sparsity of the representation learned by a DQN agent in two different benchmark domains. Our results show that with appropriate regularization it is possible to increase the sparsity of the representations learned by DQN agents. Moreover, we found that learning sparse representations also resulted in improved performance in terms of cumulative reward. Finally, we found that the performance of the agents that learned a sparse representation was more robust to the size of the experience replay buffer. This last finding supports the long-standing hypothesis that the overlap in representations learned by deep neural networks is the leading cause of catastrophic interference.

Naik, A., Shariff, R., Yasui, N., Yao, H., & Sutton, R. S. (2019). Discounted reinforcement learning is not an optimization problem. arXiv preprint arXiv:1910.02140.

ABSTRACT: Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks. It is not an optimization problem in its usual formulation, so when using function approximation there is no optimal policy. We substantiate these claims, then go on to address some misconceptions about discounting and its connection to the average reward formulation. We encourage researchers to adopt rigorous optimization approaches, such as maximizing average reward, for reinforcement learning in continuing tasks.

Wan, Y., Zaheer, M., White, A., White, M., Sutton, R.S. (2019). “Planning with expectation models.” In Proceedings of the International Joint Conference on Artificial Intelligence.

ABSTRACT: Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state-feature vector, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
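
The linear case mentioned in point 1) can be sketched directly: with a linear state-value function, a planning step needs only the expected next feature vector. The linear expectation model (F, b), the sampled planning states, and the step size below are assumptions made for illustration, not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)
    n_feat = 8
    gamma, alpha = 0.9, 0.05

    # an assumed (already-learned) linear expectation model of a fixed policy:
    #   expected next feature vector ~ F @ x,   expected reward ~ b @ x
    F = 0.9 * np.eye(n_feat) + 0.02 * rng.normal(size=(n_feat, n_feat))
    b = rng.normal(size=n_feat)

    w = np.zeros(n_feat)                         # state-value weights updated by planning
    for t in range(50_000):
        x = rng.normal(size=n_feat)              # planning from a sampled (imagined) feature vector
        x_next = F @ x                           # the model returns only the *expected* next features
        delta = b @ x + gamma * w @ x_next - w @ x
        w += alpha * delta * x                   # linear-Dyna-style planning update
    print(np.round(w, 2))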

Sutton, R. (2019). The bitter lesson. Incomplete Ideas (blog). One and a half pages.

Hernandez-Garcia, J. F., & Sutton, R. S. (2019). Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. arXiv preprint arXiv:1901.07510.
ABSTRACT: Multi-step methods such as Retrace(λ) and n-step Q-learning have become a crucial component of modern deep reinforcement learning agents. These methods are often evaluated as a part of bigger architectures and their evaluations rarely include enough samples to draw statistically significant conclusions about their performance. This type of methodology makes it difficult to understand how particular algorithmic details of multi-step methods influence learning. In this paper we combine the n-step action-value algorithms Retrace, Q-learning, Tree Backup, Sarsa, and Q(σ) with an architecture analogous to DQN. We test the performance of all these algorithms in the mountain car environment; this choice of environment allows for faster training times and larger sample sizes. We present statistical analyses on the effects of the off-policy correction, the backup length parameter n, and the update frequency of the target network on the performance of these algorithms. Our results show that (1) using off-policy correction can have an adverse effect on the performance of Sarsa and Q(σ); (2) increasing the backup length n consistently improved performance across all the different algorithms; and (3) the performance of Sarsa and Q-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, Q(σ), and Retrace in this particular task.
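For orientation, the generic n-step Q-learning target that studies of this kind vary (written here with a target network \theta^-; this is the standard form, not necessarily the exact variant used in the paper) is

    G_{t:t+n} \;=\; \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} \;+\; \gamma^{n} \max_{a} Q(S_{t+n}, a; \theta^{-}),

with n = 1 recovering the usual DQN target and larger n relying less on bootstrapping.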

Gu, X., Ghiassian, S., & Sutton, R. S. (2019). Should All Temporal Difference Learning Use Emphasis? ArXiv:1903.00194.

ABSTRACT: Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training, but it is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute for conventional TD.


Kearney, A., Veeriah, V., Travnik, J., Pilarski, P. M., & Sutton, R. S. (2019). Learning feature relevance through step size adaptation in temporal-difference learning. ArXiv:1903.03252.

ABSTRACT: There is a long history of using meta learning as representation learning, specifically for determining the relevance of inputs. In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting step size parameters of stochastic gradient descent—building on a variety of prior work in stochastic approximation, machine learning, and artificial neural networks. In particular, we focus on stochastic meta-descent introduced in the Incremental Delta-Bar-Delta (IDBD) algorithm for setting individual step sizes for each feature of a linear function approximator. Using IDBD, a feature with large or small step sizes will have a large or small impact on generalization from training examples. As a main contribution of this work, we extend IDBD to temporal-difference (TD) learning—a form of learning which is effective in sequential, non i.i.d. problems. We derive a variety of IDBD generalizations for TD learning, demonstrating that they are able to distinguish which features are relevant and which are not. We demonstrate that TD IDBD is effective at learning feature relevance in both an idealized gridworld and a real-world robotic prediction task.


Rafiee, B., Ghiassian, S., White, A., Sutton, R.S. (2019). “Prediction in intelligence: An empirical comparison of off-policy algorithms on robots.” In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pp. 332-340.

ABSTRACT: The ability to continually make predictions about the world may be central to intelligence. Off-policy learning and general value functions (GVFs) are well-established algorithmic techniques for learning about many signals while interacting with the world. In the past couple of years, many ambitious works have used off-policy GVF learning to improve control performance in both simulation and robotic control tasks. Many of these works use semi-gradient temporal-difference (TD) learning algorithms, like Q-learning, which are potentially divergent. In the last decade, several TD learning algorithms have been proposed that are convergent and computationally efficient, but not much is known about how they perform in practice, especially on robots. In this work, we perform an empirical comparison of modern off-policy GVF learning algorithms on three different robot platforms, providing insights into their strengths and weaknesses. We also discuss the challenges of conducting fair comparative studies of off-policy learning on robots and develop a new evaluation methodology that is successful and applicable to a relatively complicated robot domain.


Ghiassian, S., Patterson, A., White, M., Sutton, R. S., & White, A. (2018). Online off-policy prediction. arXiv preprint arXiv:1811.02597.

ABSTRACT: This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving, represented as a value function. However, the behavior used to select actions and generate the behavior data might be different from the one used to define the predictions, and thus the samples are generated off-policy. The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems but remained an open algorithmic challenge for decades. The issue lies with the temporal difference (TD) learning update at the heart of most prediction algorithms: combining bootstrapping, off-policy sampling and function approximation may cause the value estimate to diverge. A breakthrough came with the development of a new objective function that admitted stochastic gradient descent variants of TD. Since then, many sound online off-policy prediction algorithms have been developed, but there has been limited empirical work investigating the relative merits of all the variants. This paper aims to fill these empirical gaps and provide clarity on the key ideas behind each method. We summarize the large body of literature on off-policy learning, focusing on (1) methods that use computation linear in the number of features and are convergent under off-policy sampling, and (2) other methods which have proven useful with non-fixed, nonlinear function approximation. We provide an empirical study of off-policy prediction methods in two challenging microworlds. We report each method's parameter sensitivity, empirical convergence rate, and final performance, providing new insights that should enable practitioners to successfully extend these new methods to large-scale applications.


Travnik, J. B., Mathewson, K. W., Sutton, R. S., Pilarski, P. M. (2018). Reactive Reinforcement Learning in Asynchronous Environments. Frontiers in Robotics and AI 5.

ABSTRACT: The relationship between a reinforcement learning (RL) agent and an asynchronous environment is often ignored. Frequently used models of the interaction between an agent and its environment, such as Markov Decision Processes (MDP) or Semi-Markov Decision Processes (SMDP), do not capture the fact that, in an asynchronous environment, the state of the environment may change during computation performed by the agent. In an asynchronous environment, minimizing reaction time—the time it takes for an agent to react to an observation—also minimizes the time in which the state of the environment may change following observation. In many environments, the reaction time of an agent directly impacts task performance by permitting the environment to transition into either an undesirable terminal state or a state where performing the chosen action is inappropriate. We propose a class of reactive reinforcement learning algorithms that address this problem of asynchronous environments by immediately acting after observing new state information. We compare a reactive SARSA learning algorithm with the conventional SARSA learning algorithm on two asynchronous robotic tasks (emergency stopping and impact prevention), and show that the reactive RL algorithm reduces the reaction time of the agent by approximately the duration of the algorithm's learning update. This new class of reactive algorithms may facilitate safer control and faster decision making without any change to standard learning guarantees.


Kearney, A., Veeriah, V., Travnik, J. B., Sutton, R. S., Pilarski, P. M. (2018). TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent. ArXiv:1804.03334.

ABSTRACT: In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning. The performance of TD methods often depends on well-chosen step-sizes, yet few algorithms have been developed for setting the step-size automatically for TD learning. An important limitation of current methods is that they adapt a single step-size shared by all the weights of the learning system. A vector step-size enables greater optimization by specifying parameters on a per-feature basis. Furthermore, adapting parameters at different rates has the added benefit of being a simple form of representation learning. We generalize Incremental Delta Bar Delta (IDBD)—a vectorized adaptive step-size method for supervised learning—to TD learning, which we name TIDBD. We demonstrate that TIDBD is able to find appropriate step-sizes in both stationary and non-stationary prediction tasks, outperforming ordinary TD methods and TD methods with scalar step-size adaptation; we demonstrate that it can differentiate between features which are relevant and irrelevant for a given task, performing representation learning; and we show on a real-world robot prediction task that TIDBD is able to outperform ordinary TD methods and TD methods augmented with AlphaBound and RMSprop.
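To make the idea concrete, here is a minimal sketch of an IDBD-style step-size adaptation applied to linear TD(0) with one step size per feature. It is a simplified illustration under stated assumptions (λ = 0, the TD error standing in for the supervised error, function and variable names are ours), not the exact TIDBD algorithm of the paper.

    import numpy as np

    def tidbd0_update(w, beta, h, x, r, x_next, gamma=0.99, theta=0.01):
        # One simplified TIDBD(0)-style update for linear TD(0).
        # w: weights; beta: log step-sizes (one per feature); h: auxiliary memory;
        # x, x_next: feature vectors; theta: meta step-size.  All names are assumptions.
        delta = r + gamma * w.dot(x_next) - w.dot(x)          # TD error
        beta = beta + theta * delta * x * h                   # meta-gradient step on log step-sizes
        alpha = np.exp(beta)                                  # per-feature step-sizes
        w = w + alpha * delta * x                             # TD(0) weight update
        # decay h where the feature was active, then accumulate the new update
        h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
        return w, beta, h

Features whose step-sizes grow under this rule are, in effect, judged relevant; irrelevant or noisy features see their step-sizes shrink.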


Ghiassian, S., Yu, H., Rafiee, B., Sutton, R. S. (2018). Two geometric input transformation methods for fast online reinforcement learning with neural nets. ArXiv:1805.07476.

ABSTRACT: We apply neural nets with ReLU gates in online reinforcement learning. Our goal is to train these networks in an incremental manner, without the computationally expensive experience replay. By studying how individual neural nodes behave in online training, we recognize that the global nature of ReLU gates can cause undesirable learning interference in each node’s learning behavior. We propose reducing such interferences with two efficient input transformation methods that are geometric in nature and match well the geometric property of ReLU gates. The first one is tile coding, a classic binary encoding scheme originally designed for local generalization based on the topological structure of the input space. The second one (EmECS) is a new method we introduce; it is based on geometric properties of convex sets and topological embedding of the input space into the boundary of a convex set. We discuss the behavior of the network when it operates on the transformed inputs. We also compare it experimentally with some neural nets that do not use the same input transformations, and with the classic algorithm of tile coding plus a linear function approximator, and on several online reinforcement learning tasks, we show that the neural net with tile coding or EmECS can achieve not only faster learning but also more accurate approximations. Our results strongly suggest that geometric input transformation of this type can be effective for interference reduction and takes us a step closer to fully incremental reinforcement learning with neural nets.
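For readers unfamiliar with the first transformation, here is a minimal, generic one-dimensional tile coder of the kind referred to above (an illustration of the classic idea with made-up parameter names, not the paper's implementation):

    import numpy as np

    def tile_code(s, low, high, n_tilings=8, tiles_per_tiling=10):
        # Binary feature vector for a scalar s in [low, high], using offset tilings.
        x = np.zeros(n_tilings * tiles_per_tiling)
        width = (high - low) / (tiles_per_tiling - 1)          # tile width
        for t in range(n_tilings):
            offset = t * width / n_tilings                     # each tiling is shifted slightly
            idx = min(int((s - low + offset) / width), tiles_per_tiling - 1)
            x[t * tiles_per_tiling + idx] = 1.0                # exactly one active tile per tiling
        return x

    # e.g., tile_code(0.3, 0.0, 1.0) is an 80-dimensional binary vector with 8 ones

Each tiling contributes one active feature, so generalization is local: nearby inputs share most of their active tiles, while distant inputs share none.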


De Asis, K., Sutton, R. S. (2018). Per-decision Multi-step Temporal Difference Learning with Control Variates.
Proceedings of the 2018 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT: Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. They address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases. In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on-policy and off-policy multi-step temporal difference learning tasks.
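The per-decision control variate can be summarized by the recursive off-policy return it produces (a standard form, with ρ_t the importance-sampling ratio and \hat{v} the current value estimate):

    G_{t:h} \;=\; \rho_t \big( R_{t+1} + \gamma\, G_{t+1:h} \big) \;+\; (1 - \rho_t)\, \hat{v}(S_t).

Because the behavior-policy expectation of ρ_t is 1, the added term has zero mean and leaves the expected update unchanged, but when ρ_t is small the target falls back toward the current estimate rather than toward zero, reducing variance.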


Wan, Y., Zaheer, M., White, M., Sutton, R. S. (2018). Model-based Reinforcement Learning with Non-linear Expectation Models and Stochastic Environments. Workshop of the Federated Artificial Intelligence Meeting, Stockholm, Sweden.

ABSTRACT: In model-based reinforcement learning (MBRL), the model of a stochastic environment provides, for each state and action, either 1) the complete distribution of possible next states, 2) a sample next state, or 3) the expectation of the next state’s feature vector. The third case, that of an expectation model, is particularly appealing because the expectation is compact and deterministic; this is the case most commonly used, but often in a way that is not sound for non-linear models such as those obtained with deep learning. In this paper we introduce the first MBRL algorithm that is sound for non-linear expectation models and stochastic environments. Key to our algorithm, based on the Dyna architecture, is that the model is never iterated to produce a trajectory, but only used to generate single expected transitions to which a Bellman backup with a linear approximate value function is applied. In our results, we also consider the extension of the Dyna architecture to partial observability. We show the effectiveness of our algorithm by comparing it with model-free methods on partially-observable navigation tasks.


De Asis, K., Bennett, B., Sutton, R. S. (2018). Predicting Periodicity with Temporal Difference Learning. arXiv:1809.07435.

ABSTRACT: Temporal difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning. A key idea of TD learning is that it is learning predictive knowledge about the environment in the form of value functions, from which it can derive its behavior to address long-term sequential decision making problems. The agent’s horizon of interest, that is, how immediate or long-term a TD learning agent predicts into the future, is adjusted through a discount rate parameter. In this paper, we introduce an alternative view on the discount rate, with insight from digital signal processing, to include complex-valued discounting. Our results show that setting the discount rate to appropriately chosen complex numbers allows for online and incremental estimation of the Discrete Fourier Transform (DFT) of a signal of interest with TD learning. We thereby extend the types of knowledge representable by value functions, which we show are particularly useful for identifying periodic effects in the reward sequence.
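Roughly, and glossing over the windowing details in the paper, the idea can be seen by writing the return with a complex discount γ = β e^{-i 2π k / N} (0 ≤ β < 1):

    G_t \;=\; \sum_{j=0}^{\infty} \gamma^{\,j} R_{t+j+1} \;=\; \sum_{j=0}^{\infty} \beta^{\,j} e^{-i 2\pi k j / N} R_{t+j+1},

an exponentially windowed estimate of the k-th DFT coefficient of the upcoming reward sequence, which a TD learner with complex-valued weights can estimate online in the usual way.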


Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M., Sutton, R. S. (2018). Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return. Proceedings of the 2018 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT: Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner’s discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.
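A minimal tabular sketch of the direct method described above: a second TD learner whose reward is the squared TD error of the first and whose discount is the square of the first learner's discount. Names and constants are ours, for illustration only.

    import numpy as np

    def direct_variance_td(transitions, n_states, gamma=0.9, alpha=0.1, alpha_var=0.1):
        # transitions: iterable of (s, r, s_next) tuples generated by a fixed policy
        v = np.zeros(n_states)        # expected-return estimates
        var = np.zeros(n_states)      # variance-of-return estimates
        for s, r, s_next in transitions:
            delta = r + gamma * v[s_next] - v[s]                   # TD error of the value learner
            v[s] += alpha * delta
            # variance learner: reward = delta**2, discount = gamma**2
            delta_var = delta**2 + gamma**2 * var[s_next] - var[s]
            var[s] += alpha_var * delta_var
        return v, var

Both learners are ordinary TD(0) updates; only the reward and discount of the second are changed, which is the point of the construction.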


Young, K. J., Sutton, R. S., Yang, S. (2018). Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling. ArXiv:1806.00540.

ABSTRACT: Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. Critically, this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. Thus, unlike most prior mechanisms for external memory, it is feasible to use in an online reinforcement learning setting.


De Asis, K., Hernandez-Garcia, J. F., Holland, G. Z., Sutton, R. S. (2018). Multi-step reinforcement learning: A unifying algorithm. Proceedings of the Conference of the Association for the Advancement of Artificial Intelligence. Also ArXiv:1703.01327.

ABSTRACT: Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) that unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically which can result in even greater performance.
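The one-step form of the Q(σ) target makes the interpolation explicit (σ = 1 recovers Sarsa's sampled backup, σ = 0 recovers Expected Sarsa's expected backup):

    G_t \;=\; R_{t+1} + \gamma \Big[ \sigma\, Q(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) \Big].

Intermediate values of σ mix sampling and expectation, and σ can also be varied from step to step or over the course of learning.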


Yu, H., Mahmood, A. R., Sutton, R. S. (2018). On Generalized Bellman Equations and Temporal-Difference Learning. Journal of Machine Learning Research 19.

ABSTRACT: We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the λ-parameters of TD, based on generalized Bellman equations. Our scheme is to set λ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger λ values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of λ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations.


Lee, J. Y., Sutton, R. S. (2017). Integral Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space. ArXiv:1705.03520.

ABSTRACT: Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making problem, e.g., a reinforcement learning (RL) or optimal control problem, and has served as a foundation for the development of RL methods. Motivated by integral PI (IPI) schemes in optimal control and RL methods in continuous time and space (CTS), this paper proposes on-policy IPI to solve the general RL problem in CTS, with its environment modelled by an ordinary differential equation (ODE). In such a continuous domain, we also propose four off-policy IPI methods—two are the ideal PI forms that use advantage and Q-functions, respectively, and the other two are natural extensions of the existing off-policy IPI schemes to our general RL framework. Compared to the IPI methods in optimal control, the proposed IPI schemes can be applied to more general situations and do not require an initial stabilizing policy to run; they are also strongly relevant to the RL algorithms in CTS such as advantage updating, Q-learning, and value-gradient based (VGB) greedy policy improvement. Our on-policy IPI is basically model-based but can be made partially model-free; each off-policy method is also either partially or completely model-free. The mathematical properties of the IPI methods—admissibility, monotone improvement, and convergence towards the optimal solution—are all rigorously proven, together with the equivalence of on- and off-policy IPI. Finally, the IPI methods are simulated with an inverted-pendulum model to support the theory and verify the performance.


Ghiassian, S., Rafiee, B., & Sutton, R. S. (2017). A first empirical study of emphatic temporal difference learning. arXiv preprint arXiv:1705.04185.

ABSTRACT: In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular, with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem. The initial motivation for developing ETD was that it has good convergence properties under off-policy training (Sutton, Mahmood & White 2016), but it is also a new algorithm for the on-policy case. In both our on-policy and off-policy experiments, we found that each method converged to a characteristic asymptotic level of error, with ETD better than TD(0). TD(0) achieved a still lower error level temporarily before falling back to its higher asymptote, whereas ETD never showed this kind of “bounce”. In the off-policy case (in which TD(0) is not guaranteed to converge), ETD was significantly slower.


Mahmood, A. R., Yu, H., Sutton, R. S. (2017). Multi-step off-policy learning without importance sampling ratios. ArXiv:1702.03006.

ABSTRACT: To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios. However, such an algorithm does not exist for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior algorithm based on lookup table representation called Tree Backup can also be retrieved using action-dependent bootstrapping, becoming a special case of our algorithm. In two challenging off-policy tasks, we demonstrate that our algorithm is stable, effectively avoids the large variance issue, and can perform substantially better than its state-of-the-art counterpart.


Barto, A. G., Thomas, P. S., Sutton, R. S. (2017). Some Recent Applications of Reinforcement Learning. In Proceedings of the Eighteenth Yale Workshop on Adaptive and Learning Systems.

ABSTRACT: Five relatively recent applications of reinforcement learning methods are described. These examples were chosen to illustrate a diversity of application types, the engineering needed to build applications, and most importantly, the impressive results that these methods are able to achieve. This paper is based on a case-study chapter of the forthcoming second edition of Sutton and Barto’s 1998 book “Reinforcement Learning: An Introduction.”


Pilarski, P. M., Sutton, R. S., Mathewson, K. W., Sherstan, C., Parker, A. S., Edwards, A. L. (2017). Communicative Capital for Prosthetic Agents. ArXiv:1711.03676.

ABSTRACT: This work presents an overarching perspective on the role that machine intelligence can play in enhancing human abilities, especially those that have been diminished due to injury or illness. As a primary contribution, we develop the hypothesis that assistive devices, and specifically artificial arms and hands, can and should be viewed as agents in order for us to most effectively improve their collaboration with their human users. We believe that increased agency will enable more powerful interactions between human users and next generation prosthetic devices, especially when the sensorimotor space of the prosthetic technology greatly exceeds the conventional control and communication channels available to a prosthetic user. To more concretely examine an agency-based view on prosthetic devices, we propose a new schema for interpreting the capacity of a human-machine collaboration as a function of both the human’s and machine’s degrees of agency. We then introduce the idea of communicative capital as a way of thinking about the communication resources developed by a human and a machine during their ongoing interaction. Using this schema of agency and capacity, we examine the benefits and disadvantages of increasing the agency of a prosthetic limb. To do so, we present an analysis of examples from the literature where building communicative capital has enabled a progression of fruitful, task-directed interactions between prostheses and their human users. We then describe further work that is needed to concretely evaluate the hypothesis that prostheses are best thought of as agents. The agent-based viewpoint developed in this article significantly extends current thinking on how best to support the natural, functional use of increasingly complex prosthetic enhancements, and opens the door for more powerful interactions between humans and their assistive technologies.


Zhang, S., Sutton, R. S. (2017). A Deeper Look at Experience Replay. Symposium on Deep Reinforcement Learning at the 31st Conference on Neural Information Processing Systems. ArXiv:1712.01275.

ABSTRACT: Experience replay is widely used in deep reinforcement learning (RL) algorithms; in this paper we rethink its utility. Experience replay introduces a new hyper-parameter, the memory buffer size, which needs careful tuning, yet the importance of this hyper-parameter has long been underestimated in the community. We present a systematic empirical study of experience replay under various function representations. We show that a large replay buffer can significantly hurt performance. Moreover, we propose a simple O(1) method to remedy the negative influence of a large replay buffer, and we demonstrate its utility in both a simple grid world and challenging domains such as Atari games.


Veeriah, V., van Seijen, H., Sutton, R. S. (2017). Forward actor-critic for nonlinear function approximation in reinforcement learning. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems (pp. 556-564), Sao Paulo, Brazil.

ABSTRACT: Multi-step methods are important in reinforcement learning (RL). Eligibility traces, the usual way of handling them, work well with linear function approximators. Recently, van Seijen (2016) introduced a delayed learning approach, without eligibility traces, for handling the multi-step λ-return with nonlinear function approximators. However, this approach was limited to action-value methods. In this paper, we extend this approach to handle n-step returns, generalize it to policy gradient methods, and empirically study the effect of such delayed updates in control tasks. Specifically, we introduce two novel forward actor-critic methods and empirically compare our proposed methods with the conventional actor-critic method on mountain car and pole-balancing tasks. From our experiments, we observe that the forward actor-critic dramatically outperforms the conventional actor-critic in these standard control tasks. Notably, this forward actor-critic method has produced a new class of multi-step RL algorithms without eligibility traces.


Yu, H., Mahmood, A. R., Sutton, R. S. (2017). On Generalized Bellman Equations and Temporal-Difference Learning. Proceedings of the Canadian Artificial Intelligence Conference.

ABSTRACT: We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the λ-parameters of TD, based on generalized Bellman equations. Our scheme is to set λ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger λ values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of λ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations.


Veeriah, V., Zhang, S., Sutton, R. S. (2017). Crossprop: Learning Representations by Stochastic Meta-Gradient Descent in Neural Networks. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

ABSTRACT: Representations are fundamental to artificial intelligence. The performance of a learning system depends on the type of representation used for representing the data. Typically, these representations are hand-engineered using domain knowledge. More recently, the trend is to learn these representations through stochastic gradient descent in multi-layer neural networks, which is called backprop. Learning the representations directly from the incoming data stream reduces the human labour involved in designing a learning system. More importantly, this allows a learning system to scale to difficult tasks. In this paper, we introduce a new incremental learning algorithm called crossprop, which learns incoming weights of hidden units based on the meta-gradient descent approach that was previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes. The final update equation introduces an additional memory parameter for each of these weights and generalizes the backprop update equation. From our experiments, we show that crossprop learns and reuses its feature representation while tackling new and unseen tasks, whereas backprop relearns a new feature representation.


Ludvig, E. A., Mirian, M. S., Kehoe, E. J., Sutton, R. S. (2017). Associative learning from replayed experience. http://biorxiv.org/content/early/2017/01/16/100800.

ABSTRACT: We develop an extension of the Rescorla-Wagner model of associative learning. In addition to learning from the current trial, the new model supposes that animals store and replay previous trials, learning from the replayed trials using the same learning rule. This simple idea provides a unified explanation for diverse phenomena that have proved challenging to earlier associative models, including spontaneous recovery, latent inhibition, retrospective revaluation, and trial spacing effects. For example, spontaneous recovery is explained by supposing that the animal replays its previous trials during the interval between extinction and test. These include earlier acquisition trials as well as recent extinction trials, and thus there is a gradual re-acquisition of the conditioned response. We present simulation results for the simplest version of this replay idea, where the trial memory is assumed empty at the beginning of an experiment, all experienced trials are stored and none removed, and sampling from the memory is performed at random. Even this minimal replay model is able to explain the challenging phenomena, illustrating the explanatory power of an associative model enhanced by learning from remembered as well as real experiences.
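A minimal sketch of the replay idea as described: standard Rescorla-Wagner updates applied both to the current trial and to trials sampled uniformly from a growing trial memory. Function and parameter names are ours, for illustration.

    import numpy as np

    def rw_with_replay(trials, n_cues, alpha=0.1, lam_reward=1.0, n_replays=5, seed=0):
        # trials: list of (cue_vector, outcome_present) pairs in the order experienced
        rng = np.random.default_rng(seed)
        V = np.zeros(n_cues)                  # associative strengths
        memory = []                           # every trial experienced so far

        def rw_update(cues, outcome):
            lam = lam_reward if outcome else 0.0
            prediction = V @ cues                              # summed strength of present cues
            V[:] = V + alpha * (lam - prediction) * cues       # Rescorla-Wagner rule

        for cues, outcome in trials:
            cues = np.asarray(cues, dtype=float)
            rw_update(cues, outcome)                           # learn from the real trial
            memory.append((cues, outcome))
            for _ in range(n_replays):                         # learn from replayed trials
                c, o = memory[rng.integers(len(memory))]
                rw_update(c, o)
        return V

Spontaneous recovery, for example, follows because replays drawn during a retention interval include old acquisition trials as well as recent extinction trials.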


Murphy, S. A., Deng, Y., Laber, E. B., Maei, H. R., Sutton, R. S., Witkiewitz, K. (2016). A batch, off-policy, actor-critic algorithm for optimizing the average reward. arXiv preprint arXiv:1607.05047.
ABSTRACT: We develop an off-policy actor-critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals. This algorithm is developed with a view toward its use in mobile health.

Veeriah, V., Pilarski, P. M., Sutton, R. S. (2016). Face valuing: Training user interfaces with facial expressions and reinforcement learning. International Joint Conference on Artificial Intelligence, Workshop on Interactive Machine Learning. Also arXiv:1606.02807.

ABSTRACT: An important application of interactive machine learning is extending or amplifying the cognitive and physical capabilities of a human. To accomplish this, machines need to learn about their human users' intentions and adapt to their preferences. In most current research, a user has conveyed preferences to a machine using explicit corrective or instructive feedback; explicit feedback imposes a cognitive load on the user and is expensive in terms of human effort. The primary objective of the current work is to demonstrate that a learning agent can reduce the amount of explicit feedback required for adapting to the user's preferences pertaining to a task by learning to perceive a value of its behavior from the human user, particularly from the user's facial expressions---we call this face valuing. We empirically evaluate face valuing on a grip selection task. Our preliminary results suggest that an agent can quickly adapt to a user's changing preferences with minimal explicit feedback by learning a value function that maps facial features extracted from a camera image to expected future reward. We believe that an agent learning to perceive a value from the body language of its human user is complementary to existing interactive machine learning approaches and will help in creating successful human-machine interactive applications.

Sutton, R. S., Mahmood, A. R., White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research 17(73):1-29.  Also arXiv:1503.04269.

ABSTRACT: In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)’s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
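For reference, the core update of emphatic TD(λ), stated here in a simplified form with constant discount γ, constant bootstrapping parameter λ, interest i(s), and importance-sampling ratio ρ_t (the paper's treatment is more general):

    F_t = \rho_{t-1} \gamma F_{t-1} + i(S_t), \qquad M_t = \lambda\, i(S_t) + (1 - \lambda) F_t,
    \mathbf{e}_t = \rho_t \big( \gamma \lambda\, \mathbf{e}_{t-1} + M_t\, \mathbf{x}_t \big), \qquad
    \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{e}_t,

with δ_t the usual TD error. The emphasis M_t re-weights updates by the followon trace F_t, which is what makes the expected update stable under off-policy training.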


Sutton, R. S. (2015). True online Emphatic TD(λ): Quick Reference and Implementation Guide. ArXiv. Code is available in Python and C++ by downloading the source files of this arXiv paper as a zip archive.



Edwards, A. L., Dawson, M. R., Hebert, J. S., Sherstan, C., Sutton, R. S., Chan, K. M., Pilarski, P. M. (2015). Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching. Prosthetics and Orthotics International, first published September 15, 2015, doi:0309364615605373.

ABSTRACT:
Background: Myoelectric prostheses currently used by amputees can be difficult to control. Machine learning, and in particular learned predictions about user intent, could help to reduce the time and cognitive load required by amputees while operating their prosthetic device.
Objectives: The goal of this study was to compare two switching-based methods of controlling a myoelectric arm: non-adaptive (or conventional) control and adaptive control (involving real-time prediction learning).
Study Design: Case series study.
Methods: We compared non-adaptive and adaptive control in two different experiments. In the first, one amputee and one non-amputee subject controlled a robotic arm to perform a simple task; in the second, three able-bodied subjects controlled a robotic arm to perform a more complex task. For both tasks, we calculated the mean time and total number of switches between robotic arm functions over three trials.
Results: Adaptive control significantly decreased the number of switches and total switching time for both tasks compared with the conventional control method.
Conclusion: Real-time prediction learning was successfully used to improve the control interface of a myoelectric robotic arm during uninterrupted use by an amputee subject and able-bodied subjects.
Clinical Relevance: Adaptive control using real-time prediction learning has the potential to help decrease both the time and the cognitive load required by amputees in real-world functional situations when using myoelectric prostheses.



van Hasselt, H., Sutton, R. S. (2015). Learning to Predict Independent of Span. ArXiv:1508.04582.

ABSTRACT: We consider how to learn multi-step predictions efficiently. Conventional algorithms wait until observing actual outcomes before performing the computations to update their predictions. If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed. We show that the exact same predictions can be learned in a much more computationally congenial way, with uniform per-step computation that does not depend on the span of the predictions. We apply this idea to various settings of increasing generality, repeatedly adding desired properties and each time deriving an equivalent span-independent algorithm for the conventional algorithm that satisfies these desiderata. Interestingly, along the way several known algorithmic constructs emerge spontaneously from our derivations, including dutch eligibility traces, temporal difference errors, and averaging. This allows us to link these constructs one-to-one to the corresponding desiderata, unambiguously connecting the ‘how’ to the ‘why’. At each step, we make sure that the derived algorithm subsumes the previous algorithms, thereby retaining their properties. Ultimately we arrive at a single general temporal-difference algorithm that is applicable to the full setting of reinforcement learning.


Pilarski, P. M., Sutton, R. S., Mathewson, K. W. (2015).
Prosthetic Devices as Goal-Seeking Agents. Workshop on Present and Future of Non-invasive Peripheral Nervous System Machine Interfaces at the International Conference on Rehabilitation Robotics, Singapore.

ABSTRACT: In this article we develop the perspective that assistive devices, and specifically artificial arms and hands, may be beneficially viewed as goal-seeking agents. We further suggest that taking this perspective enables more powerful interactions between human users and next generation prosthetic devices, especially when the sensorimotor space of the prosthetic technology greatly exceeds the conventional myoelectric control and communication channels available to a prosthetic user. As a principal contribution, we propose a schema for thinking about the capacity of a human-machine collaboration as a function of both the human’s and machine’s degrees of agency. Using this schema, we present a brief analysis of three examples from the literature where agency or goal-seeking behaviour by a prosthesis has enabled a progression of fruitful, task-directed interactions between a prosthetic assistant and a human director. While preliminary, the agent-based viewpoint developed in this article extends current thinking on how best to support the natural, functional use of increasingly complex prosthetic enhancements.


van Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C., Sutton, R. S. (2016). True online temporal-difference learning. Journal of Machine Learning Research 17(145):1-40. Also arXiv:1502.04087.

ABSTRACT: The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, but never worse, than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.


van Seijen, H., Mahmood, A. R., Pilarski, P. M., Sutton, R. S. (2015). An empirical evaluation of true online TD(λ). In Proceedings of the 2015 European Workshop on Reinforcement Learning.

ABSTRACT: The true online TD(λ) algorithm has recently been proposed (van Seijen and Sutton, 2014) as a universal replacement for the popular TD(λ) algorithm, in temporal-difference learning and reinforcement learning. True online TD(λ) has better theoretical properties than conventional TD(λ), and the expectation is that it also results in faster learning. In this paper, we put this hypothesis to the test by empirically comparing true online TD(λ) with TD(λ) on a variety of domains. Our domains include a challenging one-state and two-state example, random Markov reward processes, and a real-world myoelectric prosthetic arm. We use linear function approximation with tabular, binary, and non-binary features. We assess the algorithms along three dimensions: computational cost, learning speed, and ease of use. Our results confirm the strength of true online TD(λ): 1) for sparse feature vectors, the computational overhead with respect to TD(λ) is negligible; for non-sparse features the computation time is at most twice that of TD(λ), 2) across all domains/representations the learning speed of true online TD(λ) is often better, but never worse than that of TD(λ), and 3) true online TD(λ) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size. Overall, our results suggest that true online TD(λ) should be the first choice when looking for an efficient, general-purpose TD method.


Mahmood, A. R., Yu, H., White, M., Sutton, R. S. (2015). Emphatic temporal-difference learning. In Proceedings of the 2015 European Workshop on Reinforcement Learning.

ABSTRACT: Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015) and Yu (2015) show that by varying the emphasis in a particular way these algorithms become stable and convergent under off-policy training with linear function approximation. This paper serves as a unified summary of the available results from both works. Additionally, we empirically demonstrate the benefits of emphatic algorithms, due to the flexible learning using state-dependent discounting, bootstrapping and a user-specified allocation of function approximation resources.


Mahmood, A. R., Sutton, R. S. (2015). Off-policy learning based on weighted importance sampling with linear computational complexity. In Proceedings of the 2015 Conference on Uncertainty in Artificial Intelligence.

ABSTRACT: Importance sampling is an essential component of model-free off-policy learning algorithms. Weighted importance sampling (WIS) is generally considered superior to ordinary importance sampling but, when combined with function approximation, it has hitherto required computational complexity that is O(n²) or more in the number of features. In this paper we introduce new off-policy learning algorithms that obtain the benefits of WIS with O(n) computational complexity. Our algorithms maintain for each component of the parameter vector a measure of the extent to which that component has been used in previous examples. This measure is used to determine component-wise step sizes, merging the ideas of stochastic gradient descent and sample averages. We present our main WIS-based algorithm first in an intuitive acausal form (the forward view) and then derive a causal algorithm using eligibility traces that is equivalent but more efficient (the backward view). In three small experiments, our algorithms performed significantly better than prior O(n) algorithms for off-policy policy evaluation. We also show that our adaptive step-size technique alone can improve the performance of on-policy algorithms such as TD(λ) and true online TD(λ).


van Seijen, H., Sutton, R. S. (2015). A deeper look at planning as learning from replay. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France.

ABSTRACT: In reinforcement learning, the notions of experience replay, and of planning as learning from replayed experience, have long been used to find good policies with minimal training data. Replay can be seen either as model-based reinforcement learning, where the store of past experiences serves as the model, or as a way to avoid a conventional model of the environment altogether. In this paper, we look more deeply at how replay blurs the line between model-based and model-free methods. First, we show for the first time an exact equivalence between the sequence of value functions found by a model-based policy-evaluation method and by a model-free method with replay. Second, we present a general replay method that can mimic a spectrum of methods ranging from the explicitly model-free (TD(0)) to the explicitly model-based (linear Dyna). Finally, we use insights gained from these relationships to design a new model-based reinforcement learning algorithm for linear function approximation. This method, which we call forgetful LSTD(λ), improves upon regular LSTD(λ) because it extends more naturally to online control, and improves upon linear Dyna because it is a multi-step method, enabling it to perform well even in non-Markov problems or, equivalently, in problems with significant function approximation.


Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2014). Time course of the rabbit's nictitating membrane during acquisition, extinction, and reacquisition. Learning and Memory 21:585-590. Cold Spring Harbor Press.

ABSTRACT: The present experiment tested whether or not the time course of a conditioned eyeblink response, particularly its duration, would expand and contract, as the magnitude of the conditioned response (CR) changed massively during acquisition, extinction, and reacquisition. The CR duration remained largely constant throughout the experiment, while CR onset and peak time occurred slightly later during extinction. The results suggest that computational models can account for these results by using two layers of plasticity conforming to the sequence of synapses in the cerebellar pathways that mediate eyeblink conditioning.


Mahmood, A. R., van Hasselt, H, Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation.
Advances in Neural Information Processing Systems 27, Montreal, Canada.

ABSTRACT: Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(λ). We show empirically that our new WIS-LSTD(λ) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu 2009).


Yao, H., Szepesvari, Cs., Sutton, R., Modayil, J., Bhatnagar, S. (2014). Universal Option Models. Advances in Neural Information Processing Systems 27, Montreal, Canada.

ABSTRACT: We consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed. We introduce a new model for an option that is independent of any reward function, called the universal option model (UOM). We prove that the UOM of an option can construct a traditional option model given a reward function, and also supports efficient computation of the option-conditional return. We extend the UOM to linear function approximation, and we show it gives the TD solution of option returns and value functions of policies over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is a real-time strategy game, where the controller must select the best game unit to accomplish dynamically-specified tasks. The second domain is article recommendation, where each user query defines a new reward function and an article’s relevance is the expected return from following a policy that follows the citations between articles. Our experiments show that UOMs are substantially more efficient than previously known methods in evaluating option returns and policies over options.


Edwards, A. L., Dawson, M. R., Hebert, J. S., Sutton, R. S., Chan, K. M., Pilarski, P.M. (2014). Adaptive switching in practice: Improving myoelectric prosthesis performance through reinforcement learning. In: Proceedings of MEC14: Myoelectric Controls Symposium, Fredericton, New Brunswick, Canada.


van Hasselt, H., Mahmood, A. R., Sutton, R. S. (2014). Off-policy TD(λ) with a true online equivalence. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, Canada.

ABSTRACT: Van Seijen and Sutton (2014) recently proposed a new version of the linear TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performed better than its classical counterpart in both prediction and control problems. However, their algorithm is restricted to on-policy learning. In the more general case of off-policy learning, in which the policy whose outcome is predicted and the policy used to generate data may be different, their algorithm cannot be applied. One reason for this is that the algorithm bootstraps and thus is subject to instability problems when function approximation is used. A second reason true online TD(λ) cannot be used for off-policy learning is that the off-policy case requires sophisticated importance sampling in its eligibility traces. To address these limitations, we generalize their equivalence result and use this generalization to construct the first online algorithm to be exactly equivalent to an off-policy forward view. We show this algorithm, named true online GTD(λ), empirically outperforms GTD(λ) (Maei, 2011) which was derived from the same objective as our forward view but lacks the exact online equivalence. In the general theorem that allows us to derive this new algorithm, we encounter a new general eligibility-trace update.


Sutton, R. S., Mahmood, A. R., Precup, D., van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China.

ABSTRACT: Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and off-policy LSTD(λ), the λ parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of λ, and as a result they fail to approximate Monte Carlo learning when λ = 1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(λ) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(λ), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).


van Seijen, H., Sutton, R. S. (2014). True online TD(λ). In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China. As published. With two minor errors corrected.

ABSTRACT: TD(λ) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(λ) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(λ), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(λ).
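
For readers who want the update itself, here is a minimal sketch of true online TD(λ) with linear function approximation and the dutch-style trace, following the commonly cited pseudocode; the data format and names below are assumptions, not the paper's code.

```python
import numpy as np

def true_online_td_lambda(episodes, n_features, alpha=0.05, gamma=0.99, lam=0.9):
    """Sketch of true online TD(lambda) with linear function approximation.

    `episodes` is assumed to be an iterable of lists of (x, reward, x_next, done)
    tuples, where x and x_next are feature vectors (np.ndarray).
    """
    w = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)       # dutch-style eligibility trace
        v_old = 0.0
        for x, reward, x_next, done in episode:
            v = w @ x
            v_next = 0.0 if done else w @ x_next
            delta = reward + gamma * v_next - v
            z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
            w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
            v_old = v_next
    return w
```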



Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale Nexting in a Reinforcement Learning Robot. Adaptive Behavior 22(2):146-160.

ABSTRACT: The term ‘nexting’ has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense. The ability to ‘next’ constitutes a basic kind of awareness and knowledge of one’s environment. In this paper we present results with a robot that learns to next in real time, making thousands of predictions about sensory input signals at timescales from 0.1 to 8 seconds. Our predictions are formulated as a generalization of the value functions commonly used in reinforcement learning, where now an arbitrary function of the sensory input signals is used as a pseudo reward, and the discount rate determines the timescale. We show that six thousand predictions, each computed as a function of six thousand features of the state, can be learned and updated online ten times per second on a laptop computer, using the standard TD(λ) algorithm with linear function approximation. This approach is sufficiently computationally efficient to be used for real-time learning on the robot and sufficiently data efficient to achieve substantial accuracy within 30 minutes. Moreover, a single tile-coded feature representation suffices to accurately predict many different signals over a significant range of timescales. We also extend nexting beyond simple timescales by letting the discount rate be a function of the state and show that nexting predictions of this more general form can also be learned with substantial accuracy. General nexting provides a simple yet powerful mechanism for a robot to acquire predictive knowledge of the dynamics of its environment.
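
A rough illustration of the idea, not the paper's implementation: the sketch below converts a timescale in seconds into a discount rate and learns a single nexting prediction with linear TD(0), whereas the paper learns thousands of such predictions in parallel with TD(λ). The data layout and names are assumptions.

```python
import numpy as np

def discount_for_timescale(seconds, steps_per_second=10.0):
    # A timescale of T seconds corresponds roughly to gamma = 1 - 1/(T * rate).
    return 1.0 - 1.0 / (seconds * steps_per_second)

def nexting_td0(features, signal, seconds, alpha=0.1):
    """Learn one nexting prediction of `signal` (used as the pseudo reward).

    `features` is a (T, n) array of feature vectors; `signal` has length T.
    The prediction approximates the discounted sum of the future signal.
    """
    gamma = discount_for_timescale(seconds)
    w = np.zeros(features.shape[1])
    for t in range(len(signal) - 1):
        x, x_next = features[t], features[t + 1]
        pseudo_reward = signal[t + 1]
        delta = pseudo_reward + gamma * (w @ x_next) - w @ x
        w = w + alpha * delta * x
    return w
```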


Modayil, J., Sutton, R. S. (2014). Prediction Driven Behavior: Learning Predictions that Drive Fixed Responses. The AAAI-14 Workshop on Artificial Intelligence and Robotics, Quebec City, Quebec, Canada.

ABSTRACT: We introduce a new method for robot control that combines prediction learning with a fixed, crafted response—the robot learns to make a temporally-extended prediction during its normal operation, and the prediction is used to select actions as part of a fixed behavioral response. Our method is inspired by Pavlovian conditioning experiments in which an animal’s behavior adapts as it learns to predict an event. Surprisingly the animal’s behavior changes even in the absence of any benefit to the animal (i.e. the animal is not modifying its behavior to maximize reward). Our method for robot control combines a fixed response with online prediction learning, thereby producing an adaptive behavior. This method is different from standard non-adaptive control methods and also from adaptive reward-maximizing control methods. We show that this method improves upon the performance of two reactive controls, with visible benefits within 2.5 minutes of real-time learning on the robot. In the first experiment, the robot turns off its motors when it predicts a future over-current condition, which reduces the time spent in unsafe over-current conditions and improves efficiency. In the second experiment, the robot starts to move when it predicts a human-issued request, which reduces the apparent latency of the human-robot interface.


White, A., Modayil, J., Sutton, R. S. (2014). Surprise and curiosity for big data robotics. In Workshop on Sequential Decision-Making with Big Data at the Twenty-Eighth AAAI Conference on Artificial Intelligence.


ABSTRACT: This paper introduces a new perspective on curiosity and intrinsic motivation, viewed as the problem of generating behavior data for parallel off-policy learning. We provide 1) the first measure of surprise based on off-policy general value function learning progress, 2) the first investigation of reactive behavior control with parallel gradient temporal difference learning and function approximation, and 3) the first demonstration of using curiosity driven control to react to a non-stationary learning task—all on a mobile robot. Our approach improves scalability over previous off-policy, robot learning systems, essential for making progress on the ultimate big-data decision making problem—life-long robot learning.


Edwards, A. L., Kearney, A., Dawson, M. R., Sutton, R. S., Pilarski, P. M. (2013). Temporal-difference learning to assist human decision making during the control of an artificial limb. In: Proceedings of the First Interdisciplinary Conference on Reinforcement Learning and Decision Making, and http://arxiv.org/abs/1309.4714.

ABSTRACT: In this work we explore the use of reinforcement learning (RL) to help with human decision making, combining state-of-the-art RL algorithms with an application to prosthetics. Managing human-machine interaction is a problem of considerable scope, and the simplification of human-robot interfaces is especially important in the domains of biomedical technology and rehabilitation medicine. For example, amputees who control artificial limbs are often required to quickly switch between a number of control actions or modes of operation in order to operate their devices. We suggest that by learning to anticipate (predict) a user's behaviour, artificial limbs could take on an active role in a human's control decisions so as to reduce the burden on their users. Recently, we showed that RL in the form of general value functions (GVFs) could be used to accurately detect a user's control intent prior to their explicit control choices. In the present work, we explore the use of temporal-difference learning and GVFs to predict when users will switch their control influence between the different motor functions of a robot arm. Experiments were performed using a multi-function robot arm that was controlled by muscle signals from a user's body (similar to conventional artificial limb control). Our approach was able to acquire and maintain forecasts about a user's switching decisions in real time. It also provides an intuitive and reward-free way for users to correct or reinforce the decisions made by the machine learning system. We expect that when a system is certain enough about its predictions, it can begin to take over switching decisions from the user to streamline control and potentially decrease the time and effort needed to complete tasks. This preliminary study therefore suggests a way to naturally integrate human- and machine-based decision making systems.


Silver, D., Sutton, R. S., Mueller, M. (2013). Temporal-difference search in computer go. In: Proceedings of the ICAPS-13 Workshop on Planning and Learning.

ABSTRACT: Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning (Silver et al., 2012). Our method, TD search, combines TD learning with simulation-based search. Like Monte-Carlo tree search, value estimates are updated by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We applied TD search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.


Mahmood, A. R., Sutton, R. S. (2013). Representation Search through Generate and Test. In: Proceedings of the AAAI Workshop on Learning Rich Representations from Low-Level Sensors, Bellevue, Washington, USA.

ABSTRACT: Learning representations from data is one of the fundamental problems of artificial intelligence and machine learning. Many different approaches exist for learning representations, but what constitutes a good representation is not yet well understood. Much less is known about how to learn representations from an unending stream of data. In this work, we view the problem of representation learning as one of learning features (e.g., hidden units of neural networks) such that performance of the underlying base system continually improves. We study an important case where learning is done fully online (i.e., on an example-by-example basis). In the presence of an unending stream of data, the computational cost of the learning element should not grow with time and cannot be much more than that of the performance element. Few methods can be used effectively in this case. We show that a search approach to representation learning can naturally fit into this setting. In this approach good representations are searched by generating different features and then testing them for utility. We show that a fully online method can be developed, which utilizes this generate and test approach, learns representations continually, and adds only a small fraction to the overall computation. We believe online representation search constitutes an important step toward effective and inexpensive solutions to representation learning problems.
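
One illustrative way to realize the generate-and-test loop described above is sketched below: random candidate features feed an online LMS learner, and the features whose outgoing weights stay smallest are periodically regenerated. The particular generator, utility test, and schedule are simplifying assumptions rather than the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_feature(n_inputs):
    # A candidate feature: a random linear threshold unit over the inputs.
    v = rng.standard_normal(n_inputs)
    return lambda x, v=v: float(v @ x > 0.0)

def generate_and_test(stream, n_inputs, n_features=50, alpha=0.01,
                      replace_every=500, replace_fraction=0.05):
    """Online supervised learning with continual feature generation and testing.

    `stream` is assumed to yield (x, y) pairs; x is an array of length n_inputs.
    """
    features = [generate_feature(n_inputs) for _ in range(n_features)]
    w = np.zeros(n_features)
    for t, (x, y) in enumerate(stream, start=1):
        phi = np.array([f(x) for f in features])
        w = w + alpha * (y - w @ phi) * phi          # LMS update of the base learner
        if t % replace_every == 0:
            # Test: treat a small outgoing weight magnitude as low utility.
            k = max(1, int(replace_fraction * n_features))
            for i in np.argsort(np.abs(w))[:k]:
                features[i] = generate_feature(n_inputs)
                w[i] = 0.0
    return features, w
```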


van Seijen, H., Sutton, R. S. (2013). Efficient planning in MDPs by small backups. In: Proceedings of the 30th International Conference on Machine Learning, pp. 361-369, Atlanta, USA.

ABSTRACT: Efficient planning plays a crucial role in model-based reinforcement learning. This efficiency is largely determined by the order in which backups are performed. A particularly effective ordering strategy is the strategy employed by prioritized sweeping. Prioritized sweeping orders backups according to a heuristic, such that backups that cause a large value change are selected first. The Bellman error predicts the value change caused by a full backup exactly, but its computation is expensive. Hence, methods do not use the Bellman error as a heuristic, but try to approximate the ordering induced by it with a heuristic that is cheaper to compute. We introduce the first efficient prioritized sweeping method that uses exact Bellman error ordering. The core of this method is a novel backup that allows for efficient computation of the Bellman error, while its effect as well as its computational complexity is equal to that of a full backup. We demonstrate empirically that the obtained method achieves substantially higher convergence rates than other prioritized sweeping implementations.


Pilarski, P. M., Dick, T. B., Sutton, R. S. (2013). Real-time prediction learning for the simultaneous actuation of multiple prosthetic joints. In: Proceedings of the 2013 IEEE International Conference on Rehabilitation Robotics, Seattle, USA, June 24–26. 8 pages.

ABSTRACT: Integrating learned predictions into a prosthetic control system promises to enhance multi-joint prosthesis use by amputees. In this article, we present a preliminary study of different cases where it may be beneficial to use a set of temporally extended predictions—learned and maintained in real time—within an engineered or learned prosthesis controller. Our study demonstrates the first successful combination of actor-critic reinforcement learning with real-time prediction learning. We evaluate this new approach to control learning during the myoelectric operation of a robot limb. Our results suggest that the integration of real-time prediction and control learning may speed control policy acquisition, allow unsupervised adaptation in myoelectric controllers, and facilitate synergies in highly actuated limbs. These experiments also show that temporally extended prediction learning enables anticipatory actuation, opening the way for coordinated motion in assistive robotic devices. Our work therefore provides initial evidence that real-time prediction learning is a practical way to support intuitive joint control in increasingly complex prosthetic systems.


Pilarski, P. M., Dawson, M. R., Degris, T., Chan, K. M., Hebert, J. S., Carey, J. P., Sutton, R. S. (2013). Adaptive Artificial Limbs: A Real-time Approach to Prediction and Anticipation, IEEE Robotics & Automation Magazine 20(1):53-64.

ABSTRACT: Predicting the future has long been regarded as a powerful means to improvement and success. The ability to make accurate and timely predictions enhances our ability to control our situation and our environment. Assistive robotics is one prominent area where foresight of this kind can bring improved quality of life. In this article, we present a new approach to acquiring and maintaining predictive knowledge during the online, ongoing operation of an assistive robot. The ability to learn accurate, temporally abstracted predictions is shown through two case studies—able-bodied subjects engaging in the myoelectric control of a robot arm, and an amputee participant’s interactions with a myoelectric training robot. To our knowledge, this work is the first demonstration of a practical method for real-time prediction learning during myoelectric control. Our approach therefore represents a fundamental tool for addressing one major unsolved problem: amputee-specific adaptation during the ongoing operation of a prosthetic device. The findings in this article also contribute a first explicit look at prediction learning in prosthetics as an important goal in its own right, independent of its intended use within a specific controller or system. Our results suggest that real-time learning of predictions and anticipations is a significant step towards more intuitive myoelectric prostheses and other assistive robotic devices.


Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2013). Timing and cue competition in conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus), Learning and Memory 20:97-102. Cold Spring Harbor Press.

ABSTRACT: Rabbits were classically conditioned using compounds of tone and light conditioned stimuli (CSs) presented with either simultaneous onsets (Experiment 1) or serial onsets (Experiment 2) in a delay conditioning paradigm. Training with the simultaneous compound reduced the likelihood of a conditioned response (CR) to the individual CSs (“mutual overshadowing”) but left CR timing unaltered. CR peaks were consistently clustered around the time of unconditioned stimulus (US) delivery. Training with the serial compound (CSA->CSB->US) reduced responding to CSB (“temporal primacy/information effect”) but this effect was prevented by prior CSB->US pairings. In both cases, serial compound training altered CR timing. On CSA->CSB test trials, the CRs were accelerated; the CR peaks occurred after CSB onset but well before the time of US delivery. Conversely, CRs on CSB– trials were decelerated; the distribution of CR peaks was variable but centered well after the US. Timing on CSB– trials was at most only slightly accelerated. The results are discussed with respect to processes of generalization and spectral timing applicable to the cerebellar and forebrain pathways in eyeblink preparations.


White, A., Modayil, J., Sutton, R. S. (2012). Scaling Life-long Off-policy Learning. In Proceedings of the Second Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-Epirob), San Diego, USA.

ABSTRACT: In this paper, we pursue an approach to scaling life-long learning using parallel, off-policy reinforcement-learning algorithms. In life-long learning, a robot continually learns from a life-time of experience, slowly acquiring and applying skills and knowledge to new situations. Many of the benefits of life-long learning are a result of scaling the amount of training data, processed by the robot, to long sensorimotor streams. Another dimension of scaling can be added by allowing off-policy sampling from the unending stream of sensorimotor data generated by a long-lived robot. Recent algorithmic developments have made it possible to apply off-policy algorithms to life-long learning, in a sound way, for the first time. We assess the scalability of these off-policy algorithms on a physical robot. We show that hundreds of accurate multi-step predictions can be learned about several policies in parallel and in realtime. We present the first online measures of off-policy learning progress. Finally we demonstrate that our robot, using the new off-policy measures, can learn 8000 predictions about 300 distinct policies, a substantial increase in scale compared to previous simulated and robotic life-long learning systems.


Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring a Broad Range of Empirical Knowledge in Real Time by Temporal-Difference Learning. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Seoul, Korea.

ABSTRACT: Several robot capabilities rely on predictions about the temporally extended consequences of a robot’s behaviour. We describe how a robot can both learn and make many such predictions in real time using a standard algorithm. Our experiments show that a mobile robot can learn and make thousands of accurate predictions at 10 Hz. The predictions are about the future of all of the robot’s sensors and many internal state variables at multiple time-scales. All the predictions share a single set of features and learning parameters. We demonstrate the generality of this method with an application to a different platform, a robot arm operating at 50 Hz. Here, learned predictions can be used to measurably improve the user interface. The temporally extended predictions learned in real time by this method constitute a basic form of knowledge about the dynamics of the robot’s interaction with the environment. We also show how this method can be extended to express more general forms of knowledge.


Modayil, J., White, A., Pilarski, P. M., Sutton, R. S. (2012). Acquiring Diverse Predictive Knowledge in Real Time by Temporal-difference Learning. International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems, Montpellier, France.

ABSTRACT: Existing robot algorithms demonstrate several capabilities that are enabled by a robot’s knowledge of the temporally-extended consequences of its behaviour. This knowledge consists of real-time predictions—predictions that are conventionally computed by iterating a small one-timestep model of the robot’s dynamics. Given the utility of such predictions, alternatives are desirable when this conventional approach is not applicable, for example when an adequate model of the one-timestep dynamics is either not available or not computationally tractable. We describe how a robot can both learn and make many such predictions in real-time using a standard reinforcement learning algorithm. Our experiments show that a mobile robot can learn and make thousands of accurate predictions at 10 Hz about the future of all of its sensors and many internal state variables at multiple time-scales. The method uses a single set of features and learning parameters that are shared across all the predictions. We demonstrate the generality of these predictions with an application to a different platform, a robot arm operating at 50 Hz. Here, the predictions are about which arm joint the user wants to move next, a difficult situation to model analytically, and we show how the learned predictions enable measurable improvements to the user interface. The predictions learned in real-time by this method constitute a basic form of knowledge about the robot’s interaction with the environment, and extensions of this method can express more general forms of knowledge.


Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2012). Evaluating the TD model of classical conditioning. Learning and Behavior 40(3):305-319. Springer.

ABSTRACT: The temporal-difference (TD) algorithm from reinforcement learning provides a simple method for incrementally learning predictions of upcoming events. Applied to classical conditioning, TD models suppose that animals learn a real-time prediction of the unconditioned stimulus (US) on the basis of all available conditioned stimuli (CSs). In the TD model, similar to other error-correction models, learning is driven by prediction errors—the difference between the change in US prediction and the actual US. With the TD model, however, learning occurs continuously from moment to moment and is not artificially constrained to occur in trials. Accordingly, a key feature of any TD model is the assumption about the representation of a CS on a moment-to-moment basis. Here, we evaluate the performance of the TD model with a heretofore unexplored range of classical conditioning tasks. To do so, we consider three stimulus representations that vary in their degree of temporal generalization and evaluate how the representation influences the performance of the TD model on these conditioning tasks.


Pilarski, P. M., Sutton, R. S. (2012). Between instruction and reward: Human-prompted switching. Proceedings of the AAAI Fall Symposium on Robots Learning Interactively from Human Teachers. AAAI Technical Report FS-12-07, pp. 46–52.

ABSTRACT: Intelligent systems promise to amplify, augment, and extend innate human abilities. A principal example is that of assistive rehabilitation robots—artificial intelligence and machine learning enable new electromechanical systems that restore biological functions lost through injury or illness. In order for an intelligent machine to assist a human user, it must be possible for a human to communicate their intentions and preferences to their non-human counterpart. While there are a number of techniques that a human can use to direct a machine learning system, most research to date has focused on the contrasting strategies of instruction and reward. The primary contribution of our work is to demonstrate that the middle ground between instruction and reward is a fertile space for research and immediate technological progress. To support this idea, we introduce the setting of human-prompted switching, and illustrate the successful combination of switching with interactive learning using a concrete real-world example: human control of a multi-joint robot arm. We believe techniques that fall between the domains of instruction and reward are complementary to existing approaches, and will open up new lines of rapid progress for interactive human training of machine learning systems.


Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Chan, K. M., Hebert, J. S., Sutton, R. S. (2012). Towards Prediction-Based Prosthetic Control. In Proceedings of the 2012 Conference of the International Functional Electrical Stimulation Society, September 9-12, Banff, Canada, 4 pages.

ABSTRACT: Predictions and anticipations form a basis for pattern-recognition-based control systems, including those used in next-generation prostheses and assistive devices. In this work, we outline how new methods in real-time prediction learning provide an approach to one of the principal open problems in multi-function myoelectric control—robust, ongoing, amputee-specific adaptation. Techniques from reinforcement learning and general value functions are applied to learn and update predictions during continuing human interactions with multi-function robotic appendages. Tests were performed with a targeted motor and sensory reinnervation amputee subject and non-amputee participants. Results show that this online prediction learning approach is able to accurately anticipate a diverse set of human and robot signals by more than two seconds, and at different levels of temporal abstraction. These methods allow predictions to be adapted during real-time use, which is an important step toward the intuitive control of powered prostheses and other assistive devices.


Degris, T., White, M., Sutton, R. S. (2012). Off-policy Actor-Critic. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Great Britain. Demo.

ABSTRACT: This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
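
A deliberately simplified sketch of the off-policy actor-critic idea follows: a one-step critic and no eligibility traces or gradient-TD correction, so it omits parts of what makes the paper's Off-PAC algorithm convergent. The linear-softmax policy and all names are assumptions for illustration.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def off_policy_actor_critic_step(theta, v, x, a, b_prob, reward, x_next,
                                 alpha_v=0.1, alpha_theta=0.01, gamma=0.99):
    """One transition of a simplified off-policy actor-critic update.

    theta: (n_actions, n_features) weights of a linear-softmax target policy.
    v: linear critic weights. x, x_next: feature vectors. a: the action the
    behaviour policy took, with probability b_prob under that policy.
    """
    pi = softmax(theta @ x)
    rho = pi[a] / b_prob                          # importance-sampling ratio
    delta = reward + gamma * (v @ x_next) - v @ x
    v = v + alpha_v * rho * delta * x             # one-step critic update
    grad_log_pi = -np.outer(pi, x)                # gradient of log pi(a|x)
    grad_log_pi[a] += x
    theta = theta + alpha_theta * rho * delta * grad_log_pi
    return theta, v
```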


Pilarski, P. M., Dawson, M. R., Degris, T., Carey, J. P., Sutton, R. S. (2012). Dynamic Switching and Real-time Machine Learning for Improved Human Control of Assistive Biomedical Robots. In Proceedings of the Fourth IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob), June 24-27, Roma, Italy, 296-302.

ABSTRACT: A general problem for human-machine interaction occurs when a machine’s controllable dimensions outnumber the control channels available to its human user. In this work, we examine one prominent example of this problem: amputee switching between the multiple functions of a powered artificial limb. We propose a dynamic switching approach that learns during ongoing interaction to anticipate user behaviour, thereby presenting the most effective control option for a given context or task. Switching predictions are learned in real time using temporal difference methods and reinforcement learning, and demonstrated within the context of a robotic arm and a multi-function myoelectric controller. We find that a learned, dynamic switching order is able to out-perform the best fixed (non-adaptive) switching regime on a standard prosthetic proficiency task, increasing the number of optimal switching suggestions by 23%, and decreasing the expected transition time between degrees of freedom by more than 14%. These preliminary results indicate that real-time machine learning, specifically online prediction and anticipation, may be an important tool for developing more robust and intuitive controllers for assistive biomedical robots. We expect these techniques will transfer well to near-term use by patients. Future work will describe clinical testing of this approach with a population of amputee patients.


Sutton, R. S. (2012). Beyond reward: The problem of knowledge and data (extended abstract). In: Proceedings of the 21st International Conference on Inductive Logic Programming, S.H. Muggleton, A. Tamaddoni-Nezhad, F.A. Lisi (Eds.): ILP 2011, LNAI 7207, pp. 2--6. Springer, Heidelberg.


Degris, T., Modayil, J. (2012). Scaling-up Knowledge for a Cognizant Robot. In notes of the AAAI Spring Symposium on Designing Intelligent Robots: Reintegrating AI.

ABSTRACT: This paper takes a new approach to the old adage that knowledge is the key for artificial intelligence. A cognizant robot is a robot with a deep and immediately accessible understanding of its interaction with the environment—an understanding the robot can use to flexibly adapt to novel situations. Such a robot will need a vast amount of situated, revisable, and expressive knowledge to display flexible intelligent behaviors. Instead of relying on human-provided knowledge, we propose that an arbitrary robot can autonomously acquire pertinent knowledge directly from everyday interaction with the environment. We show how existing ideas in reinforcement learning can enable a robot to maintain and improve its knowledge. The robot performs a continual learning process that scales-up knowledge acquisition to cover a large number of facts, skills and predictions. This knowledge has semantics that are grounded in sensorimotor experience. We see the approach of developing more cognizant robots as a necessary key step towards broadly competent robots.


Degris, T., Pilarski, P. M., Sutton, R. S. (2012). Model-Free Reinforcement Learning with Continuous Action in Practice. In Proceedings of the 2012 American Control Conference.

ABSTRACT: Reinforcement learning methods are often considered as a potential solution to enable a robot to adapt to changes in real time to an unpredictable environment. However, with continuous action, only a few existing algorithms are practical for real-time learning. In such a setting, most effective methods have used a parameterized policy structure, often with a separate parameterized value function. The goal of this paper is to assess such actor-critic methods to form a fully specified practical algorithm. Our specific contributions include 1) developing the extension of existing incremental policy-gradient algorithms to use eligibility traces, 2) an empirical comparison of the resulting algorithms using continuous actions, 3) the evaluation of a gradient-scaling technique that can significantly improve performance. Finally, we apply our actor--critic algorithm to learn on a robotic platform with a fast sensorimotor cycle (10ms). Overall, these results constitute an important step towards practical real-time learning control with continuous action.
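
The sketch below follows the general recipe the abstract describes, an actor-critic with a linear Gaussian policy for continuous actions and eligibility traces in both the actor and the critic; it is not the paper's exact algorithm, and the environment interface and names are assumptions.

```python
import numpy as np

def gaussian_actor_critic(env_step, x0, n_features, steps=10000,
                          alpha_v=0.1, alpha_u=0.01, gamma=0.99, lam=0.7):
    """Actor-critic with a linear Gaussian policy and eligibility traces.

    `env_step(a)` is assumed to return (reward, x_next, done); `x0` is the
    feature vector used to restart after termination.
    """
    rng = np.random.default_rng(0)
    v = np.zeros(n_features)          # critic weights
    w_mu = np.zeros(n_features)       # policy mean weights
    w_sigma = np.zeros(n_features)    # policy log-std weights
    e_v, e_mu, e_sigma = (np.zeros(n_features) for _ in range(3))
    x = x0
    for _ in range(steps):
        mu, sigma = w_mu @ x, np.exp(w_sigma @ x)
        a = rng.normal(mu, sigma)     # sample a continuous action
        reward, x_next, done = env_step(a)
        delta = reward + (0.0 if done else gamma * (v @ x_next)) - v @ x
        # Accumulating traces for the critic and for both policy parameters.
        e_v = gamma * lam * e_v + x
        e_mu = gamma * lam * e_mu + ((a - mu) / sigma**2) * x
        e_sigma = gamma * lam * e_sigma + (((a - mu)**2 / sigma**2) - 1.0) * x
        v = v + alpha_v * delta * e_v
        w_mu = w_mu + alpha_u * delta * e_mu
        w_sigma = w_sigma + alpha_u * delta * e_sigma
        if done:
            x = x0
            e_v, e_mu, e_sigma = (np.zeros(n_features) for _ in range(3))
        else:
            x = x_next
    return v, w_mu, w_sigma
```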



Mahmood, A. R., Sutton, R. S., Degris, T., Pilarski, P. M. (2012). Tuning-free step-size adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan.


ABSTRACT: Incremental learning algorithms based on gradient descent are effective and popular in online supervised learning, reinforcement learning, signal processing, and many other application areas. An oft-noted drawback of these algorithms is that they include a step-size parameter that needs to be tuned for best performance, which might require manual intervention and significant domain knowledge or additional data. In many cases, an entire vector of step-size parameters (e.g., one for each input feature) needs to be tuned in order to attain the best performance of the algorithm. To address this, several methods have been proposed for adapting step sizes online. For example, Sutton’s IDBD method can find the best vector step size for the LMS algorithm, and Schraudolph’s ELK1 method, an extension of IDBD to neural networks, has proven effective on large applications, such as 3D hand tracking. However, to date all such step-size adaptation methods have included a tunable step-size parameter of their own, which we call the meta-step-size parameter. In this paper we show that the performance of existing step-size adaptation methods is strongly dependent on the choice of their meta-step-size parameter and that their meta-step-size parameter cannot be set reliably in a problem-independent way. We introduce a series of modifications and normalizations to the IDBD method that together eliminate the need to tune the meta-step-size parameter to the particular problem. We show that the resulting overall algorithm, called Autostep, performs as well or better than the existing step-size adaptation methods on a number of idealized and robot prediction problems and does not require any tuning of its meta-step-size parameter. The ideas behind Autostep are not restricted to the IDBD method and the same principles are potentially applicable to other incremental learning settings, such as reinforcement learning.
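
For context, here is a minimal sketch of the IDBD rule that Autostep builds on; Autostep's additional normalizations are not shown, so this version still has the tunable meta step-size the paper removes. The supervised data stream and names are assumptions.

```python
import numpy as np

def idbd(stream, n_inputs, meta_step=0.01, init_log_alpha=np.log(0.05)):
    """Sketch of Sutton's IDBD step-size adaptation for the LMS rule.

    `stream` is assumed to yield (x, y) supervised examples with x an array
    of length n_inputs. Each weight has its own step-size alpha_i = exp(beta_i).
    """
    w = np.zeros(n_inputs)
    beta = np.full(n_inputs, init_log_alpha)     # per-weight log step-sizes
    h = np.zeros(n_inputs)                       # trace of recent weight changes
    for x, y in stream:
        delta = y - w @ x                        # prediction error
        beta = beta + meta_step * delta * x * h
        alpha = np.exp(beta)
        w = w + alpha * delta * x
        h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w
```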



Silver, D., Sutton, R. S., Müller, M. (2012). Temporal-difference search in computer Go, Machine Learning 87(2):183-219. Pdf copy for personal research purposes only.
ABSTRACT: Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.
 
Modayil, J., White, A., Sutton, R. S. (2012). Multi-timescale Nexting in a Reinforcement Learning Robot. Presented at the 2012 International Conference on Adaptive Behaviour, Odense, Denmark. To appear in: SAB 12, LNAI 7426, pp. 299-309, T. Ziemke, C. Balkenius, and J. Hallam, Eds., Springer Heidelberg. Also ArXiv preprint 1112.1133.
 
Pilarski, P. M., Dawson, M. R., Degris, T., Fahimi, F., Carey, J. P., Sutton, R. S. (2011). Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning, Proceedings of the 2011 IEEE International Conference on Rehabilitation Robotics (ICORR), Zurich, Switzerland, pp. 134-140.

ABSTRACT: As a contribution toward the goal of adaptable, intelligent artificial limbs, this work introduces a continuous actor-critic reinforcement learning method for optimizing the control of multi-function myoelectric devices. Using a simulated upper-arm robotic prosthesis, we demonstrate how it is possible to derive successful limb controllers from myoelectric data using only a sparse human-delivered training signal, without requiring detailed knowledge about the task domain. This reinforcement-based machine learning framework is well suited for use by both patients and clinical staff, and may be easily adapted to different application domains and the needs of individual amputees. To our knowledge, this is the first myoelectric control approach that facilitates the online learning of new amputee-specific motions based only on a one-dimensional (scalar) feedback signal provided by the user of the prosthesis.


Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems, Taipei, Taiwan.
ABSTRACT: Maintaining accurate world knowledge in a complex and changing environment is a perennial problem for robots and other artificial intelligence systems.  Our architecture for addressing this problem, called Horde, consists of a large number of independent reinforcement learning sub-agents, or demons. Each demon is responsible for answering a single predictive or goal-oriented question about the world, thereby contributing in a factored, modular way to the system's overall knowledge. The questions are in the form of a value function, but each demon has its own policy, reward function, termination function, and terminal-reward function unrelated to those of the base problem. Learning proceeds in parallel by all demons simultaneously so as to extract the maximal training information from whatever actions are taken by the system as a whole. Gradient-based temporal-difference learning methods are used to learn efficiently and reliably with function approximation in this off-policy setting. Horde runs in constant time and memory per time step, and is thus suitable for learning online in real-time applications such as robotics. We present results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience. Horde is a significant incremental step towards a real-time architecture for efficient learning of general knowledge from unsupervised sensorimotor interaction.
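
A rough sketch of what one Horde demon might look like in code: a GVF question (target policy, state-dependent termination, cumulant) plus a linear answer updated off-policy from the shared behaviour stream. For brevity it uses importance-weighted TD(0) rather than the gradient-TD methods the paper relies on, and every name is an illustrative assumption.

```python
import numpy as np

class Demon:
    """One demon: a general value function question and its learned answer."""

    def __init__(self, n_features, pi, gamma_fn, cumulant_fn, alpha=0.1):
        self.w = np.zeros(n_features)
        self.pi = pi                    # target policy: pi(a, x) -> probability
        self.gamma_fn = gamma_fn        # state-dependent termination/discount
        self.cumulant_fn = cumulant_fn  # pseudo reward as a function of state
        self.alpha = alpha

    def update(self, x, a, b_prob, x_next):
        rho = self.pi(a, x) / b_prob                  # importance-sampling ratio
        z = self.cumulant_fn(x_next)
        gamma = self.gamma_fn(x_next)
        delta = z + gamma * (self.w @ x_next) - self.w @ x
        self.w = self.w + self.alpha * rho * delta * x

def horde_step(demons, x, a, b_prob, x_next):
    # All demons learn in parallel from the same behaviour transition.
    for d in demons:
        d.update(x, a, b_prob, x_next)
```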
 
Modayil, J., Pilarski, P. M., White, A., Degris, T., Sutton, R. S. (2010). Off-policy knowledge maintenance for robots, Robotics Science and Systems Workshop (Towards Closing the Loop: Active Learning for Robotics). Extended abstract and poster.

ABSTRACT: A fundamental difficulty in robotics arises from changes in the experienced environment—periods when the robot’s current situation differs from past experience. We present an architecture whereby many independent reinforcement learning agents (or demons) observe the behaviour of a single robot. Each demon learns one piece of world knowledge represented with a generalized value function. This architecture allows the demons to update their knowledge online and off-policy from the robot’s behaviour. We present one approach to active exploration using curiosity—an internal measure of learning progress—and conclude with a preliminary result showing how a robot can adapt its prediction of the time needed to come to a full stop.

Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2010). Timing in trace conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus): Scalar, nonscalar, and adaptive features. Learning and Memory 17:600-604. Cold Spring Harbor Press.

ABSTRACT: Using interstimulus intervals (ISIs) of 125, 250, and 500 msec in trace conditioning of the rabbit nictitating membrane response, the offset times and durations of conditioned responses (CRs) were collected along with onset and peak latencies. All measures were proportional to the ISI, but only onset and peak latencies conformed to the criterion for scalar timing. Regarding the CR’s possible protective overlap of the unconditioned stimulus (US), CR duration increased with ISI, while the peak’s alignment with the US declined. Implications for models of timing and CR adaptiveness are discussed.


Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel.
ABSTRACT: We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.

Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland. GQ(λ) quick reference guide. GQ(λ) reference implementation in java, C++, and C with header file.

ABSTRACT: A new family of gradient temporal-difference learning algorithms has recently been introduced by Sutton, Maei and others in which function approximation is much more straightforward. In this paper, we introduce the GQ(λ) algorithm, which can be seen as an extension of that work to a more general setting including eligibility traces and off-policy learning of temporally abstract predictions. These extensions bring us closer to the ultimate goal of this work—the development of a universal prediction learning algorithm suitable for learning experientially grounded knowledge of the world. Eligibility traces are essential to this goal because they bridge the temporal gaps in cause and effect when experience is processed at a temporally fine resolution. Temporally abstract predictions are also essential as the means for representing abstract, higher-level knowledge about courses of action, or options. GQ(λ) can be thought of as an extension of Q-learning. We extend existing convergence results for policy evaluation to this setting and carry out a forward-view/backward-view analysis to derive and prove the validity of the new algorithm.


Kehoe, E. J., Ludvig, E. A., Sutton, R. S. (2009). Magnitude and timing of conditioned responses in delay and trace classical conditioning of the nictitating membrane response of the rabbit (Oryctolagus cuniculus), Behavioral Neuroscience 123(5):1095-1101.

ABSTRACT: The present experiment characterized conditioned nictitating membrane (NM) movements as a function of CS duration, using the full range of discernible movements (>.06 mm) rather than movements exceeding a conventional criterion (>.50 mm). The CS–US interval was fixed at 500 ms, while across groups, the duration of the CS was 50 ms (trace), 550 ms (delay), or 1050 ms (extended delay). The delay group showed the highest level of acquisition. When tested with the different CS durations, the delay and extended delay groups showed large reductions in their responses when their CS was shortened to 50 ms, but the trace group maintained its response at all durations. Timing of the conditioned movements appeared similar across all manipulations. The results suggest that the CS has both a fine timing function tied to CS onset and a general predictive function tied to CS duration, both of which may be mediated by cerebellar pathways.


Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010). Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Advances in Neural Information Processing Systems 22, Vancouver, BC, pp. 1204-1212. December 2009. MIT Press.

ABSTRACT: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness.


Interview with Richard S. Sutton, by Verena Heidrich-Meisner of Kuenstliche Intelligenz. (2009). pp. 41-43.


Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2009). Natural Actor-Critic Algorithms. Automatica. Unfortunately, due to a disagreement with the publisher, and to the mistake of using Bibtex, all the citations in this paper are in alphabetical order rather than the order in which credit is due. We regret this. Conference version. TR version with experiments.

ABSTRACT: We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradient in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.


Sutton, R. S. (2009). The grand challenge of predictive empirical abstract knowledge. Working Notes of the IJCAI-09 Workshop on Grand Challenges for Reasoning from Experiences.

ABSTRACT: We survey ongoing work at the University of Alberta on an experience-oriented approach to artificial intelligence (AI) based on reinforcement learning ideas. We seek to ground world knowledge in a minimal ontology of just the signals passing back and forth between the AI agent and its environment at a fine temporal scale. The challenge is to connect these low-level signals to higher-level representations in such a way that the knowledge remains grounded and autonomously verifiable. The mathematical ideas of temporally abstract options, option models, and temporal-difference networks can be applied to begin to address this challenge. This has been illustrated in several simple computational worlds, and we seek now to extend the approach to a physically realized robot. A recent theoretical development is the extension of simple temporal-difference methods to off-policy forms using function approximation; this should enable, for the first time, efficient and reliable intra-option learning, bringing the goal of predictive empirical abstract knowledge closer to achievement.


Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada. Presentation at ICML-09.
ABSTRACT: Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
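
For reference, a minimal sketch of the two per-transition updates as they are commonly presented, in the linear prediction setting with importance-sampling weights omitted; parameter names are illustrative.

```python
import numpy as np

def tdc_update(theta, w, x, reward, x_next, alpha=0.05, beta=0.1, gamma=0.99):
    """One TDC (linear TD with gradient correction) update on a transition."""
    delta = reward + gamma * (theta @ x_next) - theta @ x
    theta = theta + alpha * (delta * x - gamma * (w @ x) * x_next)  # TD + correction
    w = w + beta * (delta - w @ x) * x          # secondary weights track delta
    return theta, w

def gtd2_update(theta, w, x, reward, x_next, alpha=0.05, beta=0.1, gamma=0.99):
    """One GTD2 update on the same kind of transition."""
    delta = reward + gamma * (theta @ x_next) - theta @ x
    theta = theta + alpha * (w @ x) * (x - gamma * x_next)
    w = w + beta * (delta - w @ x) * x
    return theta, w
```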

Kehoe, E. J., Olsen, K. N., Ludvig, E. A., Sutton, R. S. (2009). Scalar timing varies with response magnitude in classical conditioning of the rabbit nictitating membrane response (Oryctolagus cuniculus). Behavioral Neuroscience 123, 212–217.
ABSTRACT: The present experiment was aimed at characterizing the timing of conditioned nictitating membrane (NM) movements as a function of the interstimulus interval (ISI) in delay conditioning for rabbits (Oryctolagus cuniculus). Onset latency and peak latency were approximately, but not strictly, scalar for all but the smallest movements (<.10 mm). That is, both the mean and standard deviation of the timing measures increased in proportion to the ISI, but their coefficients of variation (standard deviation/mean) tended to be larger for shorter ISIs. For all ISIs, the absolute timing of the NM movements covaried with magnitude. The smaller movements (approximately, .11-.50 mm) were highly variable, and their peaks tended to occur well after the time of US delivery. The larger movements (>.50 mm) were less variable, and their peaks were better aligned with the time of US delivery. These results are discussed with respect to their implications for current models of timing in eyeblink conditioning.

Sutton, R. S., Szepesvari, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, Vancouver, BC. December 2008. MIT Press.
ABSTRACT: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

Ludvig, E.,  Sutton, R. S., Verbeek, E., Kehoe, E. J. (2009). A computational model of hippocampal function in trace conditioning. In Advances in Neural Information Processing Systems 21, pp. 993-1000, Vancouver, BC. December 2008. MIT Press.
We present a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses. For delay conditioning, in contrast, the continued presence of the stimulus supports conditioned responding, and the short-latency elements suppress responding early in the stimulus. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended stimuli or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions.

Cutumisu, M., Szafron, D., Bowling, M., Sutton R. S. (2008). Agent learning using action-dependent learning rates in computer role-playing games. Proceedings of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference, pp. 22-29.

We introduce the ALeRT (Action-dependent Learning Rates with Trends) algorithm that makes two modifications to the learning rate and one change to the exploration rate of traditional reinforcement learning techniques. Our learning rates are action-dependent and increase or decrease based on trends in reward sequences. Our exploration rate decreases when the agent is learning successfully and increases otherwise. These improvements result in faster learning. We implemented this algorithm in NWScript, a scripting language used by BioWare Corp.’s Neverwinter Nights game, with the goal of improving the behaviours of game agents so that they react more intelligently to game events. Our goal is to provide an agent with the ability to (1) discover favourable policies in a multi-agent computer role-playing game situation and (2) adapt to sudden changes in the environment.


Silver, D., Sutton, R. S., and Mueller, M. (2008). Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pp. 968-975, Helsinki, Finland. July 2008. Omnipress.
We study a reinforcement learning architecture that encompasses both learning and planning, and apply it to high performance Computer Go. In this domain, the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are individuated, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of our architecture, Dyna-2, is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. This architecture subsumes Sarsa and UCT as special cases. We apply Dyna-2 to 9x9 Computer Go, using a million binary features in the function approximator, based on templates matching small fragments of the board. We test our approach against GnuGo, a standard benchmark opponent, and on the Computer Go Online Server (CGOS). We show that the combination of permanent and transient memories performs better than either memory alone, and significantly outperforms a traditional approach using alpha-beta search. Our program based on Dyna-2 achieves a higher rating on CGOS than any other handcrafted or traditional search based program. This is the first exploration of the Dyna concept in a large-scale, high-performance application.

Sutton, R. S., Szepesvari, Cs., Geramifard, A., Bowling, M. (2008). Dyna-style planning with linear function approximation and prioritized sweeping, Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528-536.
We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions.  Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions.  In the policy evaluation setting, we prove that the limit point is the  least-squares (LSTD) solution.  An implication of our results is that prioritized-sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
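
To make the linear Dyna-style planning step concrete, here is a small sketch of one imagined backup, assuming a linear model (a matrix F for expected next features and a vector b for expected reward) learned elsewhere from real transitions; it is an illustration only, not the paper's prioritized-sweeping algorithm.

    import numpy as np

    def linear_dyna_planning_backup(theta, F, b, phi, alpha=0.1, gamma=0.99):
        # Imagine a transition from feature vector phi using the linear model,
        # then apply an ordinary linear TD(0) backup to the imagined experience.
        r_hat = b @ phi                # model-predicted expected reward
        phi_next_hat = F @ phi         # model-predicted expected next features
        delta = r_hat + gamma * theta @ phi_next_hat - theta @ phi
        return theta + alpha * delta * phi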

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2008) Stimulus representation and the timing of reward-prediction errors in models of the dopamine system, Neural Computation 20:3034-3054.
The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and enhances correspondence between model and data in several experiments, including those when rewards are omitted or received early. This improved fit mostly derives from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.
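
One loose way to picture the microstimulus representation described above is sketched below; the exponential trace, the Gaussian bumps, and all constants are assumptions chosen for illustration rather than the paper's fitted values.

    import numpy as np

    def microstimuli(steps_since_onset, n=10, decay=0.985, sigma=0.08):
        # A stimulus onset starts an exponentially decaying memory trace; each
        # microstimulus is a Gaussian bump over the trace height, scaled by the trace,
        # so later microstimuli are weaker and more spread out in time.
        trace = decay ** steps_since_onset
        centers = np.linspace(1.0, 0.0, n)
        return trace * np.exp(-(trace - centers) ** 2 / (2 * sigma ** 2))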

Kehoe, E. J., Ludvig, E. A., Dudeney, J. E., Neufeld, J., Sutton, R. S. (2008). Magnitude and timing of nictitating membrane movements during classical conditioning of the rabbit (Oryctolagus cuniculus), Behavioral Neuroscience 122(2):471-476.
A trial-by-trial, subject-by-subject analysis was conducted to determine whether generation of the conditioned response (CR) occurs on a continuous or all-or-none basis. Three groups of rabbits were trained on different partial reinforcement schedules with the conditioned stimulus presented alone on 10%, 30%, or 50%, respectively, of all trials. Plots of each rabbit’s nictitating membrane movements revealed that their magnitude rose in a continuous fashion. Response growth during acquisition followed a sigmoidal curve, and the timing of CR-sized movements was largely stable throughout the experiment. The results are discussed with respect to alternative models of CR generation.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., Lee, M. (2008). Incremental Natural Actor-Critic Algorithms. Twenty First Annual Conference on Neural Information Processing Systems, pp. 105-112.  Conference held in Vancouver, Canada, December 2007. Journal version. TR version with experiments.

ABSTRACT: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradient in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces, and the use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the policy gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda et al. by using temporal difference learning in the actor and by incorporating natural gradients, and extend prior empirical studies of natural-gradient actor-critic methods by Peters et al. by providing the first convergence proofs and the first fully incremental algorithms.


Sutton, R. S., Koop, A., Silver, D. (2007). On the Role of Tracking in Stationary Environments. In Proceedings of the 2007 International Conference on Machine Learning. Associated talk from 6/21/07.

ABSTRACT: It is often thought that learning algorithms that track the best solution, as opposed to converging to it, are important only on nonstationary problems. We present three results suggesting that this is not so. First we illustrate in a simple concrete example, the Black and White problem, that tracking can perform better than any converging algorithm on a stationary problem. Second, we show the same point on a larger, more realistic problem, an application of temporal-difference learning to computer Go. Our third result suggests that tracking in stationary problems could be important for meta-learning research (e.g., learning to learn, feature selection, transfer). We apply a meta-learning algorithm for step-size adaptation, IDBD, to the Black and White problem, showing that meta-learning has a dramatic long-term effect on performance whereas, on an analogous converging problem, meta-learning has only a small second-order effect. This small result suggests a way of eventually overcoming a major obstacle to meta-learning research: the lack of an independent methodology for task selection.
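
Since the abstract leans on IDBD, here is a brief sketch of an IDBD-style step-size adaptation rule for linear supervised prediction (Sutton, 1992); the meta step size and initialization details are assumptions.

    import numpy as np

    def idbd_update(w, log_alpha, h, x, y, meta_rate=0.01):
        # Each input weight has its own step size exp(log_alpha[i]), adapted by the
        # correlation between the current error-driven update and a trace of past updates h.
        delta = y - w @ x
        log_alpha += meta_rate * delta * x * h
        alpha = np.exp(log_alpha)
        w += alpha * delta * x
        h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
        return w, log_alpha, h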


Silver, D., Sutton, R. S., and Mueller, M. (2007) Reinforcement Learning of Local Shape in the Game of Go, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07).

ABSTRACT: We explore an application to the game of Go of a reinforcement learning approach based on a linear evaluation function and large numbers of binary features. This strategy has proved effective in game playing programs and other reinforcement learning applications. We apply this strategy to Go by creating over a million features based on templates for small fragments of the board, and then use temporal difference learning and self-play. This method identifies hundreds of low level shapes with recognisable significance to expert Go players, and provides quantitative estimates of their values. We analyse the relative contributions to performance of templates of different types and sizes. Our results show that small, translation-invariant templates are surprisingly effective. We assess the performance of our program by playing against the Average Liberty Player and a variety of computer opponents on the 9x9 Computer Go Server. Our linear evaluation function appears to outperform all other static evaluation functions that do not incorporate substantial domain knowledge.


Geramifard, A., Bowling, M., Zinkevich, M., and Sutton, R. S. (2007). iLSTD: Eligibility Traces and Convergence Analysis, Advances in Neural Information Processing Systems 19 (NIPS*06).

ABSTRACT: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation.  iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n²), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10,000) to show substantial computational advantages over LSTD.


Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental Least-Squares Temporal Difference Learning.   In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), pages 356-361.

ABSTRACT: Approximate policy evaluation with linear function approximation is a commonly arising problem in reinforcement learning, usually solved using temporal difference (TD) algorithms. In this paper we introduce a new variant of linear TD learning, called incremental least-squares TD learning, or iLSTD. This method is more data efficient than conventional TD algorithms such as TD(0) and is more computationally efficient than non-incremental least-squares TD methods such as LSTD (Bradtke & Barto 1996; Boyan 1999). In particular, we show that the per-time-step complexities of iLSTD and TD(0) are O(n), where n is the number of features, whereas that of LSTD is O(n²). This difference can be decisive in modern applications of reinforcement learning where the use of a large number of features has proven to be an effective solution strategy. We present empirical comparisons, using the test problem introduced by Boyan (1999), in which iLSTD converges faster than TD(0) and almost as fast as LSTD.
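
A compact sketch of the iLSTD idea described in these two abstracts: accumulate the LSTD statistics A and b, but on each step update only a few parameters along the largest components of the residual b - A theta. The dense numpy arrays and the fixed number m of updated dimensions below are simplifications for illustration, not the paper's exact algorithm.

    import numpy as np

    def ilstd_step(A, b, theta, mu, phi, reward, phi_next, alpha=0.1, gamma=0.99, m=1):
        # Accumulate LSTD statistics; with sparse features these updates touch few entries.
        d = phi - gamma * phi_next
        A += np.outer(phi, d)
        b += reward * phi
        mu += phi * (reward - d @ theta)      # keep mu = b - A @ theta up to date incrementally
        # Descend along the m dimensions with the largest residual.
        for i in np.argsort(-np.abs(mu))[:m]:
            step = alpha * mu[i]
            theta[i] += step
            mu -= step * A[:, i]
        return A, b, theta, mu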


Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18 (NIPS*05).

ABSTRACT: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has much lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for MDPs and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Our algorithm achieves this without knowledge of the behavior policy or even requiring that there exists a behavior policy.


Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18 (NIPS*05).

ABSTRACT: Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment (Sutton & Tanner, 2005).  These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that they are about what would happen if an action or sequence of actions were taken.  In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving.  Our generalization is based on the options framework for temporal abstraction (Sutton, Precup & Singh, 1999).  The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces.  We present empirical examples of our algorithm's effectiveness and of the greater representational expressiveness of temporally-abstract TD networks.


Tanner, B., Sutton, R. S. (2005) TD(lambda) Networks: Temporal-Difference Networks with Eligibility Traces.   Proceedings of the 2005 International Conference on Machine Learning, pp 889-896.

ABSTRACT: Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(lambda) algorithm is often used to do more general multi-step backups of future predictions. In our work, we introduce a generalization of the 1-step TD network specification that is based on the TD(lambda) learning algorithm, creating TD(lambda) networks. We present experimental results that show TD(lambda) networks can learn solutions in more complex environments than TD networks. We also show that in problems that can be solved by TD networks, TD(lambda) networks generally learn solutions much faster than their 1-step counterparts. Finally, we present an analysis of our algorithm that shows that the computational cost of TD(lambda) networks is only slightly more than that of TD networks.


Rafols, E. J., Ring, M. B., Sutton, R. S., Tanner, B. (2005) Using Predictive Representations to Improve Generalization in Reinforcement Learning.   Proceedings of the 2005 International Joint Conference on Artificial Intelligence, pp 835-840.

ABSTRACT: The predictive representations hypothesis holds that particularly good generalization will result from representing the state of the world in terms of predictions about possible future experience. This hypothesis has been a central motivation behind recent research in, for example, PSRs and TD networks. In this paper we present the first explicit investigation of this hypothesis. We show in a reinforcement-learning example (a grid-world navigation task) that a predictive representation in tabular form can learn much faster than both the tabular explicit-state representation and a tabular history-based method.


Tanner, B., Sutton, R. S. (2005) Temporal-Difference Networks with History.   Proceedings of the 2005 International Joint Conference on Artificial Intelligence, pp 865-870.

ABSTRACT: Temporal-difference (TD) networks are a formalism for expressing and learning grounded world knowledge in a predictive form [Sutton and Tanner, 2005]. However, not all partially observable Markov decision processes can be efficiently learned with TD networks. In this paper, we extend TD networks by allowing the network-update process (answer network) to depend on the recent history of previous actions and observations rather than only on the most recent action and observation. We show that this extension enables the solution of a larger class of problems than can be solved by the original TD networks or by history-based methods alone. In addition, we apply TD networks to a problem that, while still simple, is significantly larger than has previously been considered. We show that history-extended TD networks can learn much of the common-sense knowledge of an egocentric gridworld domain with a single bit of perception.


Stone, P., Sutton, R. S., Kuhlmann, G. (2005) Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior.  Some simulations of keepaway referenced in the paper and keepaway software.

ABSTRACT: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(lambda) with linear tile-coding function approximation and variable lambda to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, "the keepers," tries to keep control of the ball for as long as possible despite the efforts of "the takers." The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
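
The core learning update behind this kind of application is linear Sarsa(lambda) over binary (e.g., tile-coded) features; a minimal sketch follows, assuming replacing traces and feature indices supplied by some tile coder not shown here. Pass an empty index list for a terminal next state.

    import numpy as np

    def sarsa_lambda_step(theta, z, active, reward, next_active,
                          alpha=0.1, gamma=1.0, lam=0.9):
        # active / next_active are the indices of binary features active for (s, a) and (s', a').
        q = theta[active].sum()
        q_next = theta[next_active].sum()     # 0.0 when next_active is empty (terminal)
        delta = reward + gamma * q_next - q   # TD error
        z *= gamma * lam                      # decay all eligibility traces
        z[active] = 1.0                       # replacing traces on the active features
        theta += alpha * delta * z
        return theta, z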


Sutton, R. S., Tanner, B. (2005) Temporal-Difference Networks. Advances in Neural Information Processing Systems 17, pages 1377-1384.  Associated talk from 12/14/05. Small-size version of same talk.

ABSTRACT: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.

Littman, M.L., Sutton, R.S., Singh, S. (2002). Predictive Representations of State. In: Advances in Neural Information Processing Systems 14:1555-1561 (Proceedings of the 2001 conference). MIT Press. 
ABSTRACT: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model.

Stone, P., Sutton, R. S. (2002). Keepaway soccer: A machine learning testbed. In: Lecture Notes in Computer Science 2377, pp. 207-237, Springer.

RoboCup simulated soccer presents many challenges to machine learning (ML) methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the effects of actions. While there have been many successful ML applications to portions of the robotic soccer task, it appears to be still beyond the capabilities of modern machine learning techniques to enable a team of 11 agents to successfully learn the full robotic soccer task from sensors to actuators. Because the successful applications to portions of the task have been embedded in different teams and have often addressed different sub-tasks, they have been difficult to compare. We put forth keepaway soccer as a domain suitable for directly comparing different machine learning approaches to robotic soccer. It is complex enough that it can’t be solved trivially, yet simple enough that complete machine learning approaches are feasible. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate, while the takers learn when to charge the ball-holder and when to cover possible passing lanes. We fully specify the domain and summarize some initial, successful learning results.


Stone, P., Sutton, R.S. (2001). Scaling reinforcement learning toward RoboCup soccer. Proceedings of the 18th International Conference on Machine Learning, pp. 537-544. ABSTRACT: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(lambda) with linear tile-coding function approximation and variable lambda to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, "the keepers," tries to keep control of the ball for as long as possible despite the efforts of "the takers." The keepers learn individually when to hold the ball and when to pass to a teammate, while the takers learn when to charge the ball-holder and when to cover possible passing lanes. Our agents learned policies that significantly out-performed a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
Precup, D., Sutton, R.S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. digitally remastered version. scanned version. Proceedings of the 18th International Conference on Machine Learning. Associated talk from 7/1/01. ABSTRACT: We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal, learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(lambda) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any epsilon-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.
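
A much-simplified sketch of the flavor of algorithm described in the Precup, Sutton, and Dasgupta abstract (per-decision importance sampling folded into the eligibility trace of linear TD(lambda)); this is illustrative only and omits the paper's state-action formulation and its variance-reduction variants.

    import numpy as np

    def off_policy_td_lambda_step(theta, z, phi, reward, phi_next, rho,
                                  alpha=0.05, gamma=0.99, lam=0.8):
        # rho is the likelihood ratio pi(a|s) / b(a|s) of the target policy over the
        # behavior policy for the action actually taken.
        delta = reward + gamma * theta @ phi_next - theta @ phi
        z = rho * (gamma * lam * z + phi)     # importance-weighted eligibility trace
        theta = theta + alpha * delta * z
        return theta, z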

Stone, P., Sutton, R.S., Singh, S. (2001). Reinforcement Learning for 3 vs. 2 Keepaway. In: RoboCup-2000: Robot Soccer World Cup IV, P. Stone, T. Balch, and G. Kraetszchmar, Eds., Springer Verlag. An earlier version appeared in the Proceedings of the RoboCup-2000 Workshop, Melbourne, Australia.

ABSTRACT: As a sequential decision problem, robotic soccer can benefit from research in reinforcement learning. We introduce the 3 vs. 2 keepaway domain, a subproblem of robotic soccer implemented in the RoboCup soccer server. We then explore reinforcement learning methods for policy evaluation and action selection in this distributed, real-time, partially observable, noisy domain. We present empirical results demonstrating that a learned policy can dramatically outperform hand-coded policies.
Sutton, R.S. (2001). Method and apparatus for machine learning. U.S. Patent 6,249,781.
ABSTRACT: A method and apparatus is disclosed for machine learning of a pattern sequence which is derived from a plurality of inputs. The pattern sequence is predicted from learning rate parameters that are exponentially related to an incrementally calculated gain parameter for each input. The gain parameters are increased or decreased in real time in correlation with the accuracy of the learning process. The disclosed method and apparatus are advantageously utilized in signal processing, adaptive control systems, and pattern recognition.

Sutton, R.S., McAllester, D., Singh, S., Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems 12 (Proceedings of the 1999 conference), pp. 1057-1063. MIT Press. An earlier version, as submitted May 1999, appeared as an AT&T Labs Technical Report. Later we began but never finished another paper, Comparing Policy-Gradient Algorithms.
ABSTRACT: Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
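
For orientation, here is a minimal sketch of the kind of update the policy gradient theorem licenses, assuming a linear-softmax policy and an externally supplied advantage estimate; the parameterization and the advantage source are illustrative assumptions, not the paper's specific construction.

    import numpy as np

    def softmax_policy(theta, x):
        # theta has shape (n_actions, n_features); x is a state-feature vector.
        prefs = theta @ x
        prefs = prefs - prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    def policy_gradient_update(theta, x, action, advantage, alpha=0.01):
        # Move the policy parameters along grad log pi(action | x), scaled by an
        # (approximate) advantage, as in REINFORCE / actor-critic methods.
        pi = softmax_policy(theta, x)
        grad_log = -np.outer(pi, x)
        grad_log[action] += x
        return theta + alpha * advantage * grad_log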

Precup, D., Sutton, R.S., Singh, S. (2000). Eligibility Traces for Off-Policy Policy Evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759-766. Morgan Kaufmann. ABSTRACT: Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode) although several of the algorithms could be applied more generally.

Sutton, R.S. (1999). Open theoretical questions in reinforcement learning. In Proceedings of the Fourth European Conference on Computational Learning Theory (Proceedings EuroCOLT'99), pp. 11-17, Fischer, P., Simon, H.U., Eds. Springer-Verlag.


Sutton, R.S. (1999). Reinforcement Learning. In Rob Wilson and Frank Keil (Eds.) The MIT Encyclopedia of the Cognitive Sciences, MIT Press.

ABSTRACT: Reinforcement learning is an approach to artificial intelligence that emphasizes learning by the individual from its interaction with its environment. This contrasts with classical approaches to artificial intelligence and machine learning, which have downplayed learning from interaction, focusing instead on learning from a knowledgeable teacher, or on reasoning from a complete model of the environment. Modern reinforcement learning research is highly interdisciplinary; it includes researchers specializing in operations research, genetic algorithms, neural networks, psychology, and control engineering.


Sutton, R.S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112:181-211. An earlier version appeared as Technical Report 98-74, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. April, 1998. Associated talk from 3/5/98.  Humorous excerpts from the JAIR reviews recommending rejection of this paper.

ABSTRACT: Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options---closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning. Formally, a set of options defined over an MDP constitutes a semi-Markov decision process (SMDP), and the theory of SMDPs provides the foundation for the theory of options. However, the most interesting issues concern the interplay between the underlying MDP and the SMDP and are thus beyond SMDP theory. We present results for three such cases: 1) we show that the results of planning with options can be used during execution to interrupt options and thereby perform even better than planned, 2) we introduce new intra-option methods that are able to learn about an option from fragments of its execution, and 3) we propose a notion of subgoal that can be used to improve the options themselves. All of these results have precursors in the existing literature; the contribution of this paper is to establish them in a simpler and more general setting with fewer changes to the existing reinforcement learning framework. In particular, we show that these results can be obtained without committing to (or ruling out) any particular approach to state abstraction, hierarchy, function approximation, or the macro-utility problem.
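
As a concrete, if simplified, rendering of the framework sketched in this abstract: an option bundles an initiation condition, a closed-loop policy, and a termination condition, and value learning over options can proceed with an SMDP-style Q-learning backup. The tabular representation and integer states below are assumptions for illustration.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Option:
        can_start: Callable[[int], bool]    # initiation set, as a predicate on (integer) states
        policy: Callable[[int], int]        # closed-loop policy while the option runs
        terminates: Callable[[int], float]  # probability of terminating in a given state

    def smdp_q_backup(Q, s, o, cum_discounted_reward, k, s_next, alpha=0.1, gamma=0.99):
        # Backup after option o ran for k steps from state s, accumulated the given
        # discounted reward along the way, and ended in s_next. Q is (n_states, n_options).
        target = cum_discounted_reward + gamma ** k * Q[s_next].max()
        Q[s, o] += alpha * (target - Q[s, o])
        return Q
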
Sutton, R.S., Singh, S., Precup, D., Ravindran, B. (1999). Improved switching among temporally abstract actions. Djvu version. Advances in Neural Information Processing Systems 11 (Proceedings of the 1998 conference), MIT Press. Associated talk from 12/2/98. ABSTRACT: In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.

Moll, R., Barto, A.G., Perkins, T. J., Sutton, R.S. (1999). Learning instance-independent value functions to enhance local search. Advances in Neural Information Processing Systems 11 (Proceedings of the 1998 conference), pp. 1017-1023. MIT Press.

ABSTRACT: Reinforcement learning methods can be used to improve the performance of local search algorithms for combinatorial optimization by learning an evaluation function that predicts the outcome of search. The evaluation function is therefore able to guide search more effectively to low-cost solutions than can the original cost function. We describe a reinforcement learning method for enhancing local search that combines aspects of previous work by Zhang and Dietterich (1995) and Boyan and Moore (1997, 1998). In an off-line learning phase, a value function is learned that is useful for guiding search for multiple problem sizes and instances. We illustrate our technique by developing several such functions for the Dial-A-Ride Problem, a variant of the well-known Traveling Salesman's Problem. Overall, our learned hillclimbing results exhibit an improvement of more than 30% over the standard Dial-A-Ride local search algorithm.


Sutton, R. S., Reinforcement learning: Past, present, and future. Extended abstract in Simulated Evolution and Learning, McKay, B., Yao, X., Newton, C. S., Kim, J.-H., Furuhashi, T., Eds. Published as Lecture Notes in Computer Science 1585, pp. 195–197, Springer, 1999.


Sutton, R.S., Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press.

This introductory textbook on reinforcement learning is targeted toward engineers and scientists in artificial intelligence, operations research, neural networks, and control systems, and we hope it will also be of interest to psychologists and neuroscientists. It also contains several new results. An html version is available.


Sutton, R.S., Precup, D., Singh, S. (1998). Intra-option learning about temporally abstract actions. Proceedings of the 15th International Conference on Machine Learning, pp. 556-564. Morgan Kaufmann.

ABSTRACT: Several researchers have proposed modeling temporally abstract actions in reinforcement learning by the combination of a policy and a termination condition, which we refer to as an option. Value functions over options and models of options can be learned using methods designed for semi-Markov decision processes (SMDPs). However, these methods all require an option to be executed to termination. In this paper we explore methods that learn about an option from small fragments of experience consistent with that option, even if the option itself is not executed. We call these methods intra-option learning methods because they learn from experience within an option. Intra-option methods are sometimes much more efficient than SMDP methods because they can use off-policy temporal-difference mechanisms to learn simultaneously about all the options consistent with an experience, not just the few that were actually executed. In this paper we present intra-option learning methods for learning value functions over options and for learning multi-step models of the consequences of options. We present computational examples in which these new methods learn much faster than SMDP methods and learn effectively when SMDP methods cannot learn at all. We also sketch a convergence proof for intra-option value learning.
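
A small sketch of the intra-option value-learning idea in this abstract: every option whose policy agrees with the action actually taken gets updated from the same transition. Here option policies and termination probabilities are passed as plain functions and Q is a tabular (states x options) array; these are illustrative assumptions, not the paper's notation.

    def intra_option_q_update(Q, option_policies, option_betas, s, a, reward, s_next,
                              alpha=0.1, gamma=0.99):
        # option_policies[o](state) -> action the option would take;
        # option_betas[o](state) -> its probability of terminating there.
        for o, (pi_o, beta_o) in enumerate(zip(option_policies, option_betas)):
            if pi_o(s) != a:
                continue                     # only options consistent with the observed action
            beta = beta_o(s_next)
            u = (1 - beta) * Q[s_next, o] + beta * Q[s_next].max()  # continue vs. terminate
            Q[s, o] += alpha * (reward + gamma * u - Q[s, o])
        return Q
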
Precup, D., Sutton, R.S. (1998) Multi-time models for temporally abstract planning. Djvu version. Advances in Neural Information Processing Systems 10, MIT Press. ABSTRACT: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that can not represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models (Sutton, 1995) and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a gridworld planning task.

McGovern, A., and Sutton, R.S. (1998). Macro-actions in reinforcement learning: An empirical analysis. Technical Report 98-70, University of Massachusetts, Department of Computer Science.

ABSTRACT: Several researchers have proposed reinforcement learning methods that obtain advantages in learning by using temporally extended actions, or macro-actions, but none has carefully analyzed what these advantages are. In this paper, we separate and analyze two advantages of using macro-actions in reinforcement learning: the effect on exploratory behavior, independent of learning, and the effect on the speed with which the learning process propagates accurate value information. We empirically measure the separate contributions of these two effects in gridworld and simulated robotic environments. In these environments, both effects were significant, but the effect of value propagation was larger. We also compare the accelerations of value propagation due to macro-actions and eligibility traces in the gridworld environment. Although eligibility traces increased the rate of convergence to the optimal value function compared to learning with macro-actions but without eligibility traces, eligibility traces did not permit the optimal policy to be learned as quickly as it was using macro-actions.

Precup, D., Sutton, R.S., Singh, S. (1998). Theoretical results on reinforcement learning with temporally abstract options. In Machine Learning: ECML-98, Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, pp. 382-393. Springer Verlag.

ABSTRACT: We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among "options", whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
Santamaria, J.C., Sutton, R.S., Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces, Adaptive Behavior 6(2): 163-218. postscript. An earlier version appeared as Technical Report UM-CS-1996-088, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. The source code for all the experiments is also available here. ABSTRACT: A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous states and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded function approximators to represent value functions and describe in detail three implementations: CMAC, instance-based, and case-based. Additionally, we discuss how function approximators having different degrees of resolution in different regions of the state and action spaces may influence the performance and learning efficiency of the agent. We propose a simple and modular technique that can be used to implement function approximators with non-uniform degrees of resolution so that it can represent the value function with higher accuracy in important regions of the state and action spaces. We performed extensive experiments in the double-integrator and pendulum swing up systems to demonstrate the proposed ideas.

McGovern, A., Precup, D., Ravindran, B., Singh, S., Sutton, R.S. (1998). Hierarchical Optimal Control of MDPs. Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, pp. 186-191.

ABSTRACT: Fundamental to reinforcement learning, as well as to the theory of systems and control, is the problem of representing knowledge about the environment and about possible courses of action hierarchically, at a multiplicity of interrelated temporal scales. For example, a human traveler must decide which cities to go to, whether to fly, drive, or walk, and the individual muscle contractions involved in each step. In this paper we survey a new approach to reinforcement learning in which each of these decisions is treated uniformly. Each low-level action and high-level course of action is represented as an option, a (sub)controller and a termination condition. The theory of options is based on the theories of Markov and semi-Markov decision processes, but extends these in significant ways. Options can be used in place of actions in all the planning and learning methods conventionally used in reinforcement learning. Options and models of options can be learned for a wide variety of different subtasks, and then rapidly combined to solve new tasks. Options enable planning and learning simultaneously at a wide variety of times scales, and toward a wide variety of subtasks, substantially increasing the efficiency and abilities of reinforcement learning systems.
Barto, A. G., Sutton, R. S., Reinforcement learning in artificial intelligence. In Neural Network Models of Cognition, Donahoe, J. W., Packard Dorsel, V., Eds., pp. 358–386, Elsevier, 1997.
This chapter provides an overview of an approach to the study of learning that, in broad terms, has developed as a part of the field of Artificial Intelligence (AI), where it is called reinforcement learning due to its roots in reinforcement theories of animal learning.  We introduce the field from the perspective of AI and engineering, describing some of its key features, providing a formal model of the reinforcement-learning problem, and defining basic concepts that are exploited by solution methods. Detailed discussion of solution methods themselves and their history are very broad topics that we do not attempt to cover here.

Sutton, R.S. (1997). On the significance of Markov decision processes. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud (Eds.) Artificial Neural Networks -- ICANN'97, pp. 273-282. Springer. ABSTRACT: Formulating the problem facing an intelligent agent as a Markov decision process (MDP) is increasingly common in artificial intelligence, reinforcement learning, artificial life, and artificial neural networks. In this short paper we examine some of the reasons for the appeal of this framework. Foremost among these are its generality, simplicity, and emphasis on goal-directed interaction between the agent and its environment. MDPs may be becoming a common focal point for different approaches to understanding the mind. Finally, we speculate that this focus may be an enduring one insofar as many of the efforts to extend the MDP framework end up bringing a wider class of problems back within it.
Precup, D., Sutton, R.S. (1997). Multi-time models for reinforcement learning. Proceedings of the ICML'97 Workshop on Modelling in Reinforcement Learning. ABSTRACT: Reinforcement learning can be used not only to predict rewards, but also to predict states, i.e. to learn a model of the world's dynamics. Models can be defined at different levels of temporal abstraction. Multi-time models are models that focus on predicting what will happen, rather than when a certain event will take place. Based on multi-time models, we can define abstract actions, which enable planning (presumably in a more efficient way) at various levels of abstraction.
Precup, D., Sutton, R.S. (1997). Exponentiated gradient methods for reinforcement learning. Proceedings of the 14th International Conference on Machine Learning, pp. 272-277, Morgan Kaufmann. ABSTRACT: This paper introduces and evaluates a natural extension of linear exponentiated gradient methods that makes them applicable to reinforcement learning problems. Just as these methods speed up supervised learning, we find that they can also increase the efficiency of reinforcement learning. Comparisons are made with conventional reinforcement learning methods on two test problems using CMAC function approximators and replacing traces. On a small prediction task, exponentiated gradient methods showed no improvement, but on a larger control task (Mountain Car) they improved the learning speed by approximately 25%. A more detailed analysis suggests that the difference may be due to the distribution of irrelevant features.
McGovern, A., Sutton, R.S., Fagg, A.H. (1997). Roles of macro-actions in accelerating reinforcement learning. Proceedings of the 1997 Grace Hopper Celebration of Women in Computing, pp. 13-17. ABSTRACT: We analyze the use of built-in policies, or macro-actions, as a form of domain knowledge that can improve the speed and scaling of reinforcement learning algorithms. Such macro-actions are often used in robotics, and macro-operators are also well-known as an aid to state-space search in AI systems. The macro-actions we consider are closed-loop policies with termination conditions. The macro-actions can be chosen at the same level as primitive actions. Macro-actions commit the learning agent to act in a particular, purposeful way for a sustained period of time. Overall, macro-actions may either accelerate or retard learning, depending on the appropriateness of the macro-actions to the particular task. We analyze their effect in a simple example, breaking the acceleration effect into two parts: 1) the effect of the macro-action in changing exploratory behavior, independent of learning, and 2) the effect of the macro-action on learning, independent of its effect on behavior. In our example, both effects are significant, but the latter appears to be larger. Finally, we provide a more complex gridworld illustration of how appropriately chosen macro-actions can accelerate overall learning.
Precup, D., Sutton, R.S., Singh, S.P. (1997). Planning with closed-loop macro actions. Working notes of the 1997 AAAI Fall Symposium on Model-directed Autonomous Systems, pp. 70-76. ABSTRACT: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Conventional model-based reinforcement learning uses primitive actions that last one time step and that can be modeled independently of the learning agent. These can be generalized to macro actions, multi-step actions specified by an arbitrary policy and a way of terminating. Macro actions generalize the classical notion of a macro operator in that they are closed loop, uncertain, and of variable duration. Macro actions are needed to represent common-sense higher-level actions such as going to lunch, grasping an object, or traveling to a distant city. This paper generalizes prior work on temporally abstract models (Sutton, 1995) and extends it from the prediction setting to include actions, control, and planning. We define a semantics of models of macro actions that guarantees the validity of planning using such models. This paper presents new results in the theory of planning with macro actions and illustrates its potential advantages in a gridworld task.

Mehra, R.K., Ravichandran, B., Cabrera, J.B.D., Greve, D.N., Sutton, R.S. (1997). Towards self-learning adaptive scheduling for ATM networks, Proceedings of the 36th Conference on Decision and Control, pp. 2393-2398, San Diego, California USA.

Sutton, R.S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Camera ready postscript. Digitally remastered pdf. Djvu version. Advances in Neural Information Processing Systems 8 (Proceedings of the 1995 conference), pp. 1038-1044, MIT Press. Errata: there is a small error in the equations of motion of the acrobot; for the correct equations, see the RL textbook page on the acrobot (thanks to Barry Nichols for spotting this).

ABSTRACT: On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases, there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (rollouts), as in classical Monte Carlo methods, and as in the TD(lambda) algorithm when lambda=1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general lambda.


Singh, S.P., Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning 22: 123-158.

ABSTRACT: The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the Mountain-Car task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
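
The distinction at the heart of this paper is easy to state in code; below is a sketch of the two trace updates for binary features (or tabular states, with a single active index), with the usual gamma-lambda decay. Step sizes and the rest of the TD update are omitted.

    import numpy as np

    def accumulating_trace(z, active, gamma=1.0, lam=0.9):
        # Conventional trace: repeated visits to the same feature add up.
        z = gamma * lam * z
        z[active] += 1.0
        return z

    def replacing_trace(z, active, gamma=1.0, lam=0.9):
        # Replacing trace: a revisit resets the trace to 1 instead of adding to it.
        z = gamma * lam * z
        z[active] = 1.0
        return z
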
Precup, D., Sutton, R.S. (1996). Empirical comparison of gradient descent and exponentiated gradient descent in supervised and reinforcement learning. Technical Report UM-CS-1996-070, Department of Computer Science, University of Massachusetts, Amherst, MA 01003.

ABSTRACT: This report describes a series of results using the exponentiated gradient descent (EG) method recently proposed by Kivinen and Warmuth. Prior work is extended by comparing speed of learning on a nonstationary problem and on an extension to backpropagation networks. Most significantly, we present an extension of the EG method to temporal-difference and reinforcement learning. This extension is compared to conventional reinforcement learning methods on two test problems using CMAC function approximators and replace traces. On the larger of the two problems, the average loss was approximately 25% smaller for the EG method. The relative computational complexity and parameter sensitivity of the two methods are also discussed.
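
A minimal sketch of the two update rules being compared, for a single linear unit with squared error. The EG update follows the general EG± form of Kivinen and Warmuth (two positive weight vectors updated multiplicatively); the report's actual experimental setup, including the TD and CMAC extensions, is not reproduced here, and the step sizes are illustrative.

```python
import numpy as np

def gd_step(w, x, y, eta=0.1):
    """Ordinary gradient descent (LMS) on squared error."""
    delta = w @ x - y
    return w - eta * delta * x

def eg_step(w_pos, w_neg, x, y, eta=0.1):
    """Exponentiated-gradient (EG+-) step. The prediction uses the weight
    vector w_pos - w_neg; both parts stay positive and are updated
    multiplicatively, which tends to favor sparse solutions."""
    delta = (w_pos - w_neg) @ x - y
    w_pos = w_pos * np.exp(-eta * delta * x)
    w_neg = w_neg * np.exp(+eta * delta * x)
    return w_pos, w_neg
```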

Kuvayev, L., Sutton, R.S. (1996). Model-based reinforcement learning with an approximate, learned model. Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, pp. 101-105, Yale University, New Haven, CT. Some additional results are in this earlier version of the same paper.

ABSTRACT: Model-based reinforcement learning, in which a model of the environment's dynamics is learned and used to supplement direct learning from experience, has been proposed as a general approach to learning and planning. We present the first experiments with this idea in which the model of the environment's dynamics is both approximate and learned online. These experiments involve the Mountain Car task, which requires approximation of both value function and model because it has continuous state variables. We used models of the simplest possible form, state-aggregation or grid models, and CMACs to represent the value function. We find that model-based methods do indeed perform better than model-free reinforcement learning on this task, but only slightly.

Mehra, R.K., Ravichandran, B., Sutton, R.S. (1996). Adaptive intelligent scheduling for ATM networks. Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, pp. 106-111, Yale University, New Haven, CT.

ABSTRACT: Asynchronous Transfer Mode (ATM) is a fast emerging information technology that promises to provide interoperable multi-media services for commercial and defense applications. Unlike the commercial broadband networks for which ATM was originally designed, defense or tactical ATM networks must be able to traverse low-rate transmission links. To allow more flexible and efficient use of this limited bandwidth resource, optimal traffic management in ATM networks is critical. This paper demonstrates the feasibility of developing Self-Learning Adaptive (SLA) scheduling algorithms using Reinforcement Learning (RL). This technique was applied to simulated data and proved to be more efficient than fixed static scheduling methods.

Sutton, R.S. (1995). On the Virtues of Linear Learning and Trajectory Distributions. Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference.


Sutton, R.S. (1995). TD models: Modeling the world at a mixture of time scales. Proceedings of the Twelfth International Conference on Machine Learning, pp. 531-539, Morgan Kaufmann.

ABSTRACT: Temporal-difference (TD) learning can be used not just to predict rewards, as is commonly done in reinforcement learning, but also to predict states, i.e., to learn a model of the world's dynamics. We present theory and algorithms for intermixing TD models of the world at different levels of temporal abstraction within a single structure. Such multi-scale TD models can be used in model-based reinforcement-learning architectures and dynamic programming methods in place of conventional Markov models. This enables planning at higher and varied levels of abstraction, and, as such, may prove useful in formulating methods for hierarchical or multi-level planning and reinforcement learning. In this paper we treat only the prediction problem---that of learning a model and value function for the case of fixed agent behavior. Within this context, we establish the theoretical foundations of multi-scale models and derive TD algorithms for learning them. Two small computational experiments are presented to test and illustrate the theory. This work is an extension and generalization of the work of Singh (1992), Dayan (1993), and Sutton & Pinette (1985).

Sutton, R.S., Singh, S.P. (1994). On bias and step size in temporal-difference learning. Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, Yale University, New Haven, CT.

ABSTRACT: We present results for three new algorithms for setting the step-size parameters, alpha and lambda, of temporal-difference learning methods such as TD(lambda). The overall task is that of learning to predict the outcome of an unknown Markov chain based on repeated observations of its state trajectories. The new algorithms select step-size parameters online in such a way as to eliminate the bias normally inherent in temporal-difference methods. We compare our algorithms with conventional Monte Carlo methods. Monte Carlo methods have a natural way of setting the step size: for each state s they use a step size of 1/n_s, where n_s is the number of times state s has been visited. We seek and come close to achieving comparable step-size algorithms for TD(lambda). One new algorithm uses a lambda=1/n_s schedule to achieve the same effect as processing a state backwards with TD(0), but remains completely incremental. Another algorithm uses a lambda at each time equal to the estimated transition probability of the current transition. We present empirical results showing improvement in convergence rate over Monte Carlo methods and conventional TD(lambda). A limitation of our results at present is that they apply only to tasks whose state trajectories do not contain cycles.

Sutton, R.S., Whitehead, S.D. (1993). Online learning with random representations. Proceedings of the Tenth International Conference on Machine Learning, pp. 314-321, Morgan Kaufmann.

ABSTRACT: We consider the requirements of online learning---learning which must be done incrementally and in realtime, with the results of learning available soon after each new example is acquired. Despite the abundance of methods for learning from examples, there are few that can be used effectively for online learning, e.g., as components of reinforcement learning systems. Most of these few, including radial basis functions, CMACs, Kohonen's self-organizing maps, and those developed in this paper, share the same structure. All expand the original input representation into a higher dimensional representation in an unsupervised way, and then map that representation to the final answer using a relatively simple supervised learner, such as a perceptron or LMS rule. Such structures learn very rapidly and reliably, but have been thought either to scale poorly or to require extensive domain knowledge. To the contrary, some researchers (Rosenblatt, 1962; Gallant & Smith, 1987; Kanerva, 1988; Prager & Fallside, 1988) have argued that the expanded representation can be chosen largely at random with good results. The main contribution of this paper is to develop and test this hypothesis. We show that simple random-representation methods can perform as well as nearest-neighbor methods (while being more suited to online learning), and significantly better than backpropagation. We find that the size of the random representation does increase with the dimensionality of the problem, but not unreasonably so, and that the required size can be reduced substantially using unsupervised-learning techniques. Our results suggest that randomness has a useful role to play in online supervised learning and constructive induction.
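
A minimal sketch of the shared structure the abstract describes: a fixed, randomly chosen expansion layer followed by a simple supervised learner (here an LMS output unit). The particular random feature type, layer sizes, and step-size normalization are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomRepLearner:
    """Fixed random expansion followed by an LMS output unit."""
    def __init__(self, n_in, n_features=500, alpha=0.1):
        self.W = rng.normal(size=(n_features, n_in))    # random, never trained
        self.b = rng.normal(size=n_features)
        self.v = np.zeros(n_features)                   # trained output weights
        self.alpha = alpha

    def features(self, x):
        return (self.W @ x + self.b > 0).astype(float)  # random binary features

    def predict(self, x):
        return self.v @ self.features(x)

    def update(self, x, y):
        phi = self.features(x)
        err = y - self.v @ phi
        n_active = max(phi.sum(), 1.0)
        self.v += (self.alpha / n_active) * err * phi   # LMS on the expansion
        return err
```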


Sutton, R.S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 171-176, MIT Press. Associated talk from 2/2/04.

ABSTRACT: Appropriate bias is widely viewed as the key to efficient learning and generalization. I present a new algorithm, the Incremental Delta-Bar-Delta (IDBD) algorithm, for the learning of appropriate biases based on previous learning experience. The IDBD algorithm is developed for the case of a simple, linear learning system---the LMS or delta rule with a separate learning-rate parameter for each input. The IDBD algorithm adjusts the learning-rate parameters, which are an important form of bias for this system. Because bias in this approach is adapted based on previous learning experience, the appropriate testbeds are drifting or non-stationary learning tasks. For particular tasks of this type, I show that the IDBD algorithm performs better than ordinary LMS and in fact finds the optimal learning rates. The IDBD algorithm extends and improves over prior work by Jacobs and by me in that it is fully incremental and has only a single free parameter. This paper also extends previous work by presenting a derivation of the IDBD algorithm as gradient descent in the space of learning-rate parameters. Finally, I offer a novel interpretation of the IDBD algorithm as an incremental form of hold-one-out cross validation.
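
For reference, here is a compact Python rendering of the IDBD update for a linear unit, with per-input learning rates alpha_i = exp(beta_i) and a single meta-step-size theta. Treat it as a sketch of the update as usually presented rather than a drop-in implementation; initialization and parameter values are left to the reader.

```python
import numpy as np

def idbd_update(w, beta, h, x, y, theta=0.01):
    """One IDBD step for a linear (LMS) unit. beta holds the log learning
    rates, h is the auxiliary memory trace; theta is the single free
    (meta-step-size) parameter."""
    delta = y - w @ x                          # prediction error
    beta += theta * delta * x * h              # meta-level gradient step
    alpha = np.exp(beta)                       # per-weight learning rates
    w += alpha * delta * x                     # base-level LMS step
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, beta, h
```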

Sutton, R.S. (1992). Gain adaptation beats least squares? Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, pp. 161-166, Yale University, New Haven, CT.

ABSTRACT: I present computational results suggesting that gain-adaptation algorithms based in part on connectionist learning methods may improve over least squares and other classical parameter-estimation methods for stochastic time-varying linear systems. The new algorithms are evaluated with respect to classical methods along three dimensions: asymptotic error, computational complexity, and required prior knowledge about the system. The new algorithms are all of the same order of complexity as LMS methods, O(n), where n is the dimensionality of the system, whereas least-squares methods and the Kalman filter are O(n^2). The new methods also improve over the Kalman filter in that they do not require a complete statistical model of how the system varies over time. In a simple computational experiment, the new methods are shown to produce asymptotic error levels near that of the optimal Kalman filter and significantly below those of least-squares and LMS methods. The new methods may perform better even than the Kalman filter if there is any error in the filter's model of how the system varies over time.


Sutton, R.S. (1992). Machines that Learn and Mimic the Brain. In ACCESS, GTE's Journal of Science and Technology, 1992. Reprinted in Stethoscope Quarterly, Spring.


Sutton, R.S. (1992). Reinforcement learning architectures. Proceedings ISKIT'92 International Symposium on Neural Information Processing, Fukuoka, Japan.

ABSTRACT: Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics---trial-and-error search and delayed reward---are the two most important distinguishing features of reinforcement learning. In this paper I present a brief overview of the development of reinforcement learning architectures over the past decade, including reinforcement-comparison, actor-critic, and Q-learning architectures. Finally, I present Dyna, a class of architectures based on reinforcement learning but which go beyond trial-and-error learning to include a learned internal model of the world. By intermixing conventional trial and error with hypothetical trial and error using the world model, Dyna systems can plan and learn optimal behavior very rapidly.


Sutton, R.S. (Ed.) (1992). Reinforcement Learning, book version of a special issue of Machine Learning on reinforcement learning (Volume 8, Numbers 3/4). Kluwer.  Introduction: The challenge of reinforcement learning.


Sanger, T.D., Sutton, R.S., Matheus, C.J. (1992). Iterative construction of sparse polynomial approximations. Advances in Neural Information Processing Systems 4, pp. 1064-1071, Morgan Kaufmann.

ABSTRACT: We present an iterative algorithm for nonlinear regression based on construction of sparse polynomials. Polynomials are built sequentially from lower to higher order. Selection of new terms is accomplished using a novel look-ahead approach that predicts whether a variable contributes to the remaining error. The algorithm is based on the tree-growing heuristic in LMS Trees which we have extended to approximation of arbitrary polynomials of the input features. In addition, we provide a new theoretical justification for this heuristic approach. The algorithm is shown to discover a known polynomial from samples, and to make accurate estimates of pixel values in an image processing task.


Gluck, M., Glauthier, P., Sutton, R.S. (1992). Adaptation of cue-specific learning rates in network models of human category learning, Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pp. 540-545, Erlbaum.

ABSTRACT: Recent engineering considerations have prompted an improvement to the least mean squares (LMS) learning rule for training one-layer adaptive networks: incorporating a dynamically modifiable learning rate for each associative weight accelerates overall learning and provides a mechanism for adjusting the salience of individual cues (Sutton, 1992a,b). Prior research has established that the standard LMS rule can characterize aspects of animal learning (Rescorla & Wagner, 1972) and human category learning (Gluck & Bower, 1988a,b). We illustrate here how this enhanced LMS rule is analogous to adding a cue-salience or attentional component to the psychological model, giving the network model a means for discriminating between relevant and irrelevant cues. We then demonstrate the effectiveness of this enhanced LMS rule for modeling human performance in two non-stationary learning tasks for which the standard LMS network model fails to adequately account for the data (Hurwitz, 1990; Gluck, Glauthier, & Sutton, in preparation).

Sutton, R.S. (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann.

ABSTRACT: This paper presents the basic results and ideas of dynamic programming as they relate most directly to the concerns of planning in AI. These form the theoretical basis for the incremental planning methods used in the integrated architecture Dyna. These incremental planning methods are based on continually updating an evaluation function and the situation-action mapping of a reactive system. Actions are generated by the reactive system and thus involve minimal delay, while the incremental planning process guarantees that the actions and evaluation function will eventually be optimal---no matter how extensive a search is required. These methods are well suited to stochastic tasks and to tasks in which a complete and accurate model is not available. For tasks too large to implement the situation-action mapping as a table, supervised-learning methods must be used, and their capabilities remain a significant limitation of the approach.
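
As a concrete reminder of the dynamic-programming backup underlying these incremental planning methods, here is a single asynchronous value-iteration sweep in Python. The model interface (P for transition probabilities, R for expected rewards) is an assumed, illustrative one, not anything specified in the paper.

```python
def value_iteration_sweep(V, states, actions, P, R, gamma=0.95):
    """One asynchronous sweep of value-iteration backups.
    P[s][a] is a list of (probability, next_state) pairs and R[s][a] the
    expected immediate reward -- an assumed model interface for illustration."""
    for s in states:
        V[s] = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                   for a in actions)
    return V
```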

Sutton, R.S. (1991). Dyna, an integrated architecture for learning, planning and reacting. Working Notes of the 1991 AAAI Spring Symposium on Integrated Intelligent Architectures and SIGART Bulletin 2, pp. 160-163.

ABSTRACT: Dyna is an AI architecture that integrates learning, planning, and reactive execution. Learning methods are used in Dyna both for compiling planning results and for updating a model of the effects of the agent's actions on the world. Planning is incremental and can use the probabilistic and ofttimes incorrect world models generated by learning processes. Execution is fully reactive in the sense that no planning intervenes between perception and action. Dyna relies on machine learning methods for learning from examples---these are among the basic building blocks making up the architecture---yet is not tied to any particular method. This paper briefly introduces Dyna and discusses its strengths and weaknesses with respect to other architectures.


Sutton, R.S. (1991). Integrated modeling and control based on reinforcement learning and dynamic programming. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 3, pages 471-478.

ABSTRACT: This is a summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures. The Dyna-PI architecture is based on dynamic programming's policy iteration method and can be related to existing AI ideas such as evaluation functions and universal plans (reactive systems). Using a navigation task, results are shown for a simple Dyna-PI system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. The Dyna-Q architecture is based on Watkins's Q-learning, a new kind of reinforcement learning. Dyna-Q uses a less familiar set of data structures than does Dyna-PI, but is arguably simpler to implement and use. We show that Dyna-Q architectures are easy to adapt for use in changing environments.
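
A minimal tabular Dyna-Q sketch in Python, interleaving one real Q-learning update per step with several simulated updates drawn from a learned deterministic model. The environment interface (reset, step, actions) and all parameter values are assumptions for illustration, not the architecture's specification.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q. `env` is assumed to expose reset() -> state,
    step(a) -> (next_state, reward, done), and a discrete list env.actions."""
    Q = defaultdict(float)
    model = {}                                            # (s, a) -> (r, s', done)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(env.actions) if random.random() < eps
                 else max(env.actions, key=lambda a_: Q[(s, a_)]))
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # direct RL update
            model[(s, a)] = (r, s2, done)                 # model learning
            for _ in range(n_planning):                   # planning with the model
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * max(Q[(ps2, b)] for b in env.actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```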


Sutton, R.S. (1991). Reinforcement learning architectures for animats, Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 288-296.  smaller gzipped version.

ABSTRACT: In the first part of this paper I argue that the learning problem facing animats is essentially that which has been studied as the reinforcement learning problem---the learning of behavior by trial and error without an explicit teacher. A brief overview is presented of the development of reinforcement learning architectures over the past decade, with references to the literature.

The second part of the paper presents Dyna, a class of architectures based on reinforcement learning but which go beyond trial-and-error learning. Dyna architectures include a learned internal model of the world. By intermixing conventional trial and error with hypothetical trial and error using the world model, Dyna systems can plan and learn optimal behavior very rapidly. Results are shown for simple Dyna systems that learn from trial and error while they simultaneously learn a world model and use it to plan optimal action sequences. We also show that Dyna architectures are easy to adapt for use in changing environments.


Sutton, R.S., Matheus, C.J. (1991). Learning polynomial functions by feature construction. Proceedings of the Eighth International Workshop on Machine Learning, pp. 208-212, Morgan Kaufmann. ABSTRACT: We present a method for learning higher-order polynomial functions from examples using linear regression and feature construction. Regression is used on a set of training instances to produce a weight vector for a linear function over the feature set. If this hypothesis is imperfect, a new feature is constructed by forming the product of the two features that most effectively predict the squared error of the current hypothesis. This algorithm is then repeated. In an extension to this method, the specific pair of features to combine is selected by measuring their joint ability to predict the hypothesis' error.
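
A rough sketch of the loop described in the abstract: fit a linear model, find the two features that best predict its squared error, and add their product as a new feature. Using absolute correlation as the measure of "predicting the squared error" is an assumption, and X is assumed to have at least two columns (e.g., include a bias column).

```python
import numpy as np

def construct_polynomial_features(X, y, n_new_features=5):
    """Iteratively grow a feature matrix F by adding products of existing
    features, guided by the squared error of the current linear hypothesis."""
    F = X.copy()
    for _ in range(n_new_features):
        w, *_ = np.linalg.lstsq(F, y, rcond=None)        # current linear hypothesis
        sq_err = (F @ w - y) ** 2
        if sq_err.std() == 0:                            # already a perfect fit
            break
        # score each feature by how well it (linearly) predicts the squared error
        scores = [abs(np.corrcoef(F[:, i], sq_err)[0, 1]) if F[:, i].std() > 0 else 0.0
                  for i in range(F.shape[1])]
        i, j = np.argsort(scores)[-2:]                   # the two best error predictors
        F = np.column_stack([F, F[:, i] * F[:, j]])      # their product is the new feature
    return F
```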


Sutton, R.S., Barto, A.G., Williams, R. (1991). Reinforcement learning is direct adaptive optimal control, Proceedings of the American Control Conference, pages 2143-2146. Also published in IEEE Control Systems Magazine 12, No. 2, 19-22.

ABSTRACT: In this paper we present a control-systems perspective on one of the major neural-network approaches to learning control, reinforcement learning. Control problems can be divided into two classes: 1) regulation and tracking problems, in which the objective is to follow a reference trajectory, and 2) optimal control problems, in which the objective is to extremize a functional of the controlled system's behavior that is not necessarily defined in terms of a reference trajectory. Adaptive methods for problems of the first kind are well known, and include self-tuning regulators and model-reference methods, whereas adaptive methods for optimal-control problems have received relatively little attention. Moreover, the adaptive optimal-control methods that have been studied are almost all indirect methods, in which controls are recomputed from an estimated system model at each step. This computation is inherently complex, making adaptive methods in which the optimal controls are estimated directly more attractive. We view reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems. For concreteness, we focus on one reinforcement learning method (Q-learning) and on its analytically proven capabilities for one class of adaptive optimal control problems (Markov decision problems with unknown transition probabilities).
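
For concreteness, here is the Q-learning backup written as a single function (the same backup used inside the Dyna-Q sketch above). Q is assumed to be a table, e.g. a defaultdict(float) keyed by (state, action), and the parameter values are illustrative; the point is that the optimal action values are estimated directly, without first identifying a model of the controlled system.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One Q-learning backup toward r + gamma * max_b Q(s', b)."""
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```
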
Miller, W.T., Sutton, R.S., Werbos, P.J. (Eds.) (1991). Neural Networks for Control. MIT Press.


Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning, pp. 216-224, Morgan Kaufmann. Also appeared as "Artificial intelligence by dynamic programming," in Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, pp. 89-95.  smaller gzipped postscript version. ABSTRACT: This paper extends previous work with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned model of the world. In this paper, I present and show results for two Dyna architectures. The Dyna-PI architecture is based on dynamic programming's policy iteration method and can be related to existing AI ideas such as evaluation functions and universal plans (reactive systems). Using a navigation task, results are shown for a simple Dyna-PI system that simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. The Dyna-Q architecture is based on Watkins's Q-learning, a new kind of reinforcement learning. Dyna-Q uses a less familiar set of data structures than does Dyna-PI, but is arguably simpler to implement and use. We show that Dyna-Q architectures are easy to adapt for use in changing environments.


Sutton, R.S. (1990). First results with Dyna, an integrated architecture for learning, planning, and reacting. In Neural Networks for Control, Miller, T., Sutton, R.S., & Werbos, P., Eds., MIT Press.


Sutton, R.S., Barto, A.G. (1990). Time-derivative models of Pavlovian reinforcement. In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J. Moore, Eds., pp. 497-537. MIT Press.

Presentation and analysis of models of classical conditioning based on temporal-difference learning. Most comprehensive treatment of the TD model of classical conditioning.
Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990). Learning and sequential decision making. In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J.W. Moore, Eds., pp. 539-602, MIT Press. The following abstract is from the technical report version of this paper. ABSTRACT: In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.


Barto, A.G., Sutton, R.S., & Watkins, C. (1990). Sequential decision problems and neural networks. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, pp. 686-693.

ABSTRACT: Many problems require the formation of sequences of decisions whose consequences emerge over time periods of variable and uncertain duration. Decision-making strategies must be formed that take into account both the expected short- and long-term consequences of decisions. Problems such as this can be formulated as stochastic sequential decision problems, and can be solved by stochastic dynamic programming methods. In this paper, we describe the stochastic sequential decision framework and dynamic programming, and we show how the Adaptive Heuristic Critic (AHC) algorithm, a variant of which was used in the adaptive pole-balancer of Barto, Sutton, and Anderson, fits into this framework. We relate the AHC algorithm to its antecedents and discuss the utility of neural-network methods within this context. We present several examples illustrating the utility of these methods, and conclude by discussing implications for adaptive optimal control and the modeling of animal learning.


Whitehead, S., Sutton, R.S., & Ballard, D. (1990). Advances in reinforcement learning and their implications for intelligent control, Proceedings of the Fifth IEEE International Symposium on Intelligent Control 1990, pp. 1289-1297.


Franklin, J., Sutton, R.S., Anderson, C., Selfridge, O., & Schwartz, D. (1990). Connectionist learning control at GTE laboratories, Intelligent Control and Adaptive Systems, G. Rodriguez, Ed., Proc. SPIE 1196, pp. 242-253.


Anderson, C., Franklin, J., & Sutton, R.S. (1990). Learning a nonlinear model of a manufacturing process using multilayer connectionist networks, Proceedings of the Fifth IEEE International Symposium on Intelligent Control 1990, pp. 404-409.


Sutton, R.S. (1989). Artificial intelligence as a control problem: Comments on the relationship between machine learning and intelligent control. Appeared in Machine Learning in a dynamic world. Proceedings of the IEEE International Symposium on Intelligent Control 1988, pp. 500-507.


Sutton, R.S. (1989). Implementation details of the TD(lambda) procedure for the case of vector predictions and backpropagation. GTE Laboratories Technical Report TR87-509.1, as corrected August 1989. GTE Laboratories, 40 Sylvan Road, Waltham, MA 02254.


Franklin, J., Sutton, R.S., & Anderson, C. (1989). Application of connectionist learning methods to manufacturing process monitoring, Proceedings of the IEEE International Symposium on Intelligent Control 1988, pp. 709-712.


Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44, erratum p. 377. Scan of paper as published, with erratum added. Digitally remastered with missing figure in place.

ABSTRACT: This article introduces a class of incremental learning procedures specialized for prediction---that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods; and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
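
A minimal linear TD(lambda) prediction sketch, learning weights w so that w . phi(s) predicts the outcome from state s. The trajectory format and parameter values are illustrative assumptions rather than anything specified in the article.

```python
import numpy as np

def td_lambda_predict(trajectory, n_features, alpha=0.05, gamma=1.0, lam=0.8):
    """Linear TD(lambda) prediction. `trajectory` is a list of
    (phi, reward, phi_next) tuples, with phi_next=None at the end."""
    w = np.zeros(n_features)
    e = np.zeros(n_features)                  # eligibility trace over features
    for phi, r, phi_next in trajectory:
        v_next = w @ phi_next if phi_next is not None else 0.0
        delta = r + gamma * v_next - w @ phi  # difference between successive predictions
        e = gamma * lam * e + phi             # decay, then mark current features
        w += alpha * delta * e                # credit recently active features
    return w
```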


Sutton, R.S. (1988). Convergence theory for a new kind of prediction learning, Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 421-42.


Sutton, R.S. (1988). NADALINE: A normalized adaptive linear element that learns efficiently. GTE Laboratories Technical Report TR88-509.4. GTE Laboratories, 40 Sylvan Road, Waltham, MA 02254.

ABSTRACT: This paper introduces a variant of the ADALINE in which the input signals are normalized to have zero mean and unit variance, and in which the bias or "threshold weight" is learned slightly differently. These changes result in a linear learning element that learns much more efficiently and rapidly, and that is much less dependent on the choice of the step-size parameter. Using simulation experiments, learning-time improvements ranging from 30% to hundreds of times are shown. The memory and computational complexity of the new element remains O(N), where N is the number of input signals, and the added computations are entirely local. Theoretical analysis indicates that the new element learns optimally fast in a certain sense and to the extent that the input signals are statistically independent.
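
A sketch in the spirit of the abstract: inputs are normalized online to zero mean and unit variance before an ordinary LMS update, with the bias handled separately. The exact NADALINE update rules in the report are not reproduced here; the exponential running-statistics scheme and parameter values are assumptions.

```python
import numpy as np

class NormalizedLMS:
    """LMS unit operating on inputs normalized with running statistics."""
    def __init__(self, n, alpha=0.1, decay=0.99):
        self.w = np.zeros(n)
        self.bias = 0.0
        self.mean = np.zeros(n)
        self.var = np.ones(n)
        self.alpha, self.decay = alpha, decay

    def update(self, x, y):
        # track input statistics online
        self.mean = self.decay * self.mean + (1 - self.decay) * x
        self.var = self.decay * self.var + (1 - self.decay) * (x - self.mean) ** 2
        z = (x - self.mean) / np.sqrt(self.var + 1e-8)   # normalized inputs
        err = y - (self.w @ z + self.bias)
        self.w += self.alpha * err * z                   # LMS on normalized inputs
        self.bias += self.alpha * err                    # bias learned separately
        return err
```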


Selfridge, O., Sutton, R.S., & Anderson, C. (1988). Selected bibliography on connectionism, In: Evolution, Learning, and Cognition, Y.C. Lee (Ed.), pp. 391-403, World Scientific Publishing.


Sutton, R.S., & Barto, A.G. (1987). A temporal-difference model of classical conditioning, Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pp. 355-378.

ABSTRACT: Rescorla and Wagner's model of classical conditioning has been one of the most influential and successful theories of this fundamental learning process. The learning rule of their theory was first described as a learning procedure for connectionist networks by Widrow and Hoff. In this paper we propose a similar confluence of psychological and engineering constraints. Sutton has recently argued that adaptive prediction methods called temporal-difference methods have advantages over other prediction methods for certain types of problems. Here we argue that temporal-difference methods can provide detailed accounts of aspects of classical conditioning behavior. We present a model of classical conditioning behavior that takes the form of a temporal-difference prediction method. We argue that it is an improvement over the Rescorla-Wagner model in its handling of within-trial temporal effects such as the ISI dependency, primacy effects, and the facilitation of remote associations in serial-compound conditioning. The new model is closely related to the model of classical conditioning that we proposed in 1981, but avoids some of the problems with that model recently identified by Moore et al. We suggest that the theory of adaptive prediction on which our model is based provides insight into the functionality of classical conditioning behavior.


Sutton, R.S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks, Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 823-831. Reprinted in Artificial Neural Networks: Concepts and Theory, edited by P. Mehra and B. Wah, IEEE Computer Society Press, 1992.  smaller gzipped version.

ABSTRACT: This article contributes to the theory of network learning procedures by identifying and analyzing two problems with the backpropagation procedure of Rumelhart, Hinton, and Williams (1985) that may slow its learning. Both problems are due to backpropagation's being a gradient- or steepest-descent method in the weight space of the network. The first problem is that steepest descent is a particularly poor descent procedure for surfaces containing ravines---places which curve more sharply in some directions than others---and such ravines are common and pronounced in performance surfaces arising from networks. The second problem is that steepest descent results in a high level of interference between learning with different patterns, because those units that have so far been found most useful are also those most likely to be changed to handle new patterns. The same problems probably also arise with the Boltzmann machine learning procedure (Ackley, Hinton, and Sejnowski, 1985) and with reinforcement learning procedures (Barto and Anderson, 1985), as these are also steepest-descent procedures. Finally, some directions in which to look for improvements to backpropagation based on alternative descent procedures are briefly considered.
Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton, R.S., & Barto, A.G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing, and interstimulus intervals, Behavioural Brain Research 21: 143-154. ABSTRACT: A neuron-like adaptive element with computational features suitable for classical conditioning, the Sutton-Barto (S-B) model, was extended to simulate real-time aspects of the conditioned nictitating membrane (NM) response. The aspects of concern were response topography, CR-related neuronal firing, and interstimulus interval (ISI) effects for forward-delay and trace conditioning paradigms. The topography of the NM CR has the following features: response latency after CS onset decreases over trials; response amplitude increases gradually within the ISI and attains its maximum coincidentally with the UR. A similar pattern characterizes the firing of some (but not all) neurons in brain regions demonstrated experimentally to be important for NM conditioning. The variant of the S-B model described in this paper consists of a set of parameters and implementation rules based on 10-ms computational time steps. It differs from the original S-B model in a number of ways. The main difference is the assumption that CS inputs to the adaptive element are not instantaneous but are instead shaped by unspecified coding processes so as to produce outputs that conform with the real-time properties of NM conditioning. The model successfully simulates the aforementioned features of NM response topography. It is also capable of simulating appropriate ISI functions, i.e. with maximum conditioning strength with ISIs of 250 ms, for forward-delay and trace paradigms. The original model's successful treatment of multiple-CS phenomena, such as blocking, conditioned inhibition, and higher-order conditioning, are retained by the present model.


Sutton, R.S. (1985). Learning distributed, searchable, internal models, Proceedings of the Distributed Artificial Intelligence Workshop, pp. 287-289.


Sutton, R.S., & Pinette, B. (1985). The learning of world models by connectionist networks, Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp. 54-64.

ABSTRACT: Connectionist learning schemes have hitherto seemed far removed from cognitive skills such as reasoning, planning, and the formation of internal models. In this article we investigate what sort of world models a connectionist system might learn and how it might do so. A learning scheme is presented that forms such models based on observed stimulus-stimulus relationships. The basis of the scheme is a recurrently connected network of simple, neuron-like processing elements. The net produces a set of predictions of future stimuli based on the current stimuli, where these predictions are based on a model and involve multiple-step chains of predictions. Results are presented from computer simulations of the scheme connected to a simple world consisting of a stochastic maze (Markov process). By wandering around the maze the network learns its construction. When reinforcement is subsequently introduced, the solution to the maze is learned much more quickly than it is without the "exploration" period. The form and logic of the experiment is the same as that of the latent learning experiments of animal learning research.


Barto, A.G. & Sutton, R.S. (1985). Neural problem solving. In: Synaptic Modification, Neuron Selectivity, and Nervous System Organization, W.B. Levy & J.A. Anderson (Eds.), pp. 123-152. Lawrence Erlbaum.


Selfridge, O., Sutton, R.S., & Barto, A.G. (1985). Training and tracking in robotics, Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pp. 670-672.

ABSTRACT: We explore the use of learning schemes in training and adapting performance on simple coordination tasks. The task is 1-D pole balancing. Several programs incorporating learning have already achieved this [1,3,8]: the problem is to move a cart along a short piece of track so as to keep a pole balanced on its end; the pole is hinged to the cart at its bottom, and the cart is moved either to the left or to the right by a force of constant magnitude. The form of the task considered here, after [3], involves a genuinely difficult credit-assignment problem. We use a learning scheme previously developed and analyzed [1,7] to achieve performance through reinforcement, and extend it to include changing and new requirements. For example, the length or mass of the pole can change, the bias of the force, its strength, and so on; and the system can be tasked to avoid certain regions altogether. In this way we explore the learning system's ability to adapt to changes and to profit from a selected training sequence, both of which are of obvious utility in practical robotics applications.

The results here were obtained using a computer simulation of the pole-balancing problem. A movie will be shown of the performance of the system under the various requirements and tasks.

[1] Barto, Sutton, & Anderson, 1983
[3] Michie & Chambers, 1968
[7] Sutton, 1984
[8] Widrow & Smith, 1964


Moore, J., Desmond, J., Berthier, N., Blazis, D., Sutton, R.S., & Barto, A.G. (1985). Connectionist learning in real time: Sutton-Barto adaptive element and classical conditioning of the nictitating membrane response, Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp. 318-322.


Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning (106 Mbytes).  Ph.D. dissertation, Department of Computer Science, University of Massachusetts, Amherst, MA 01003. Published as COINS Technical Report 84-2.

ABSTRACT: This dissertation describes computational experiments comparing the performance of a range of reinforcement learning algorithms. The experiments are designed to focus on aspects of the credit-assignment problem having to do with determining when the behavior that deserves credit occurred. The issues of knowledge representation involved in developing new features or refining existing ones are not addressed.

The algorithms considered include some from learning automata theory, mathematical learning theory, early "cybernetic" approaches to learning, Samuel's checker-playing program, Michie and Chambers's "Boxes" system, and a number of new algorithms. The tasks were selected to involve, first in isolation and then in combination, the issues of misleading generalizations, delayed reinforcement, unbalanced reinforcement, and secondary reinforcement. The tasks range from simple, abstract "two-armed bandit" tasks to a physically realistic pole-balancing task.

The results indicate several areas where the algorithms presented here perform substantially better than those previously studied. An unbalanced distribution of reinforcement, misleading generalizations, and delayed reinforcement can greatly retard learning and in some cases even make it counterproductive. Performance can be substantially improved in the presence of these common problems through the use of mechanisms of reinforcement comparison and secondary reinforcement. We present a new algorithm similar to the "learning-by-generalization" algorithm used for altering the static evaluation function in Samuel's checker-playing program. Simulation experiments indicate that the new algorithm performs better than a version of Samuel's algorithm suitably modified for reinforcement learning tasks. Theoretical analysis in terms of an "ideal reinforcement signal" sheds light on the relationship between these two algorithms and other temporal credit-assignment algorithms.
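
As a small illustration of the reinforcement-comparison idea mentioned above, here is a sketch of one update for a simple action-selection task: the chosen action's preference moves by the difference between the received reward and a learned reference level, rather than by the raw reward. This is a generic rendering of the idea, not the dissertation's algorithms; the parameter values and softmax policy are assumptions.

```python
import numpy as np

def reinforcement_comparison_step(prefs, r_bar, a, r, beta=0.1, alpha=0.1):
    """Update action preferences and the reward reference level after taking
    action `a` and receiving reward `r`."""
    prefs[a] += beta * (r - r_bar)     # compare reward to the reference level
    r_bar += alpha * (r - r_bar)       # track the average reward
    return prefs, r_bar

def softmax_policy(prefs):
    """Action probabilities from the preferences."""
    p = np.exp(prefs - prefs.max())
    return p / p.sum()
```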


Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. 12.4 Mb pdf. 3.5 Mb gzipped version.

ABSTRACT: It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is assumed that the equations of motion of the cart-pole system are not known and that the only feedback evaluating performance is a failure signal that occurs when the pole falls past a certain angle from the vertical, or the cart reaches an end of a track. This evaluative feedback is of much lower quality than is required by standard adaptive control techniques. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neuronlike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.
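
A loose Python sketch of the ASE/ACE structure: the adaptive critic element turns the sparse failure signal into a temporal-difference internal reinforcement, which then trains both elements through eligibility traces. The parameter values, noise model, and trace form are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def ase_action(w_actor, x, noise=0.01):
    """ASE: noisy thresholded action (+1 or -1) on the current feature vector."""
    return 1.0 if w_actor @ x + np.random.normal(0.0, noise) > 0 else -1.0

def ase_ace_update(w_actor, w_critic, e_actor, e_critic, x, a, r, x_next, failed,
                   alpha=0.5, beta=0.5, gamma=0.95, lam=0.9):
    """One learning step after taking action a in state features x and
    observing reward r (e.g., a failure signal) and next features x_next."""
    # ACE: internal reinforcement is the TD error of the learned evaluation
    p_next = 0.0 if failed else w_critic @ x_next
    r_hat = r + gamma * p_next - w_critic @ x
    # eligibility traces mark recently active inputs (and, for the ASE, the action taken)
    e_critic = lam * e_critic + (1 - lam) * x
    e_actor = lam * e_actor + (1 - lam) * a * x
    w_critic = w_critic + beta * r_hat * e_critic
    w_actor = w_actor + alpha * r_hat * e_actor
    return w_actor, w_critic, e_actor, e_critic
```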


Sutton, R.S. (1982 - unpublished draft). A theory of salience change dependent on the relationship between discrepancies on successive trials on which the stimulus is present. smaller gzipped version.

ABSTRACT: All current theories of salience change (Frey and Sears, 1978; Kirk, 1974; Mackintosh, 1975; Dickinson and Mackintosh, 1979; Moore and Stickney, 1980; Pearce and Hall, 1980) determine change in salience based solely on events occurring on a single trial. However, a theoretical view of classical conditioning as prediction suggests that the relationship between the discrepancies on different trials on which a stimulus occurs should be used to change its salience. This paper presents a theory of salience change based on this idea and discusses its relationship to known experimental results and other theories of salience change. Several experiments are proposed where the theory makes novel predictions subject to experimental test. A mathematical derivation of the main equation of the theory from high-level assumptions is presented in the appendix.
Barto, A.G. & Sutton, R.S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element, Behavioural Brain Research 4:221-235. SUMMARY: A neuron-like adaptive element is described that produces an important feature of the anticipatory nature of classical conditioning. The response that occurs after training (conditioned response) usually begins earlier than the reinforcing stimulus (unconditioned stimulus). The conditioned response therefore usually anticipates the unconditioned stimulus. This aspect of classical conditioning has been largely neglected by hypotheses that neurons provide single unit analogs of classical conditioning. This paper briefly presents the model and extends earlier results by computer simulation of conditioned inhibition and chaining of associations.


Barto, A.G., Anderson, C., & Sutton, R.S. (1982). Synthesis of nonlinear control surfaces by a layered associative network, Biological Cybernetics 43:175-185.

ABSTRACT: An approach to solving nonlinear control problems is illustrated by means of a layered associative network composed of adaptive elements capable of reinforcement learning. The first layer adaptively develops a representation in terms of which the second layer can solve the problem linearly. The adaptive elements comprising the network employ a novel type of learning rule whose properties, we argue, are essential to the adaptive behavior of the layered network. The behavior of the network is illustrated by means of a spatial learning problem that requires the formation of nonlinear associations. We argue that this approach to nonlinearity can be extended to a large class of nonlinear control problems.


Barto, A.G., Sutton, R.S., & Anderson, C. (1982). Spatial learning simulation systems, Proceedings of the 10th IMACS World Congress on Systems Simulation and Scientific Computation, pp. 204-206.


Sutton, R.S. (1981). Adaptation of learning rate parameters.  In: Goal Seeking Components for Adaptive Intelligence: An Initial Assessment, by A. G. Barto and R. S. Sutton. Air Force Wright Aeronautical Laboratories Technical Report AFWAL-TR-81-1070. Wright-Patterson Air Force Base, Ohio 45433.


Sutton, R.S., & Barto, A.G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction, Psychological Review 88:135-170. Translated into Spanish by G. Ruiz to appear in the journal Estudios de Psicologia.

ABSTRACT: Many adaptive neural theories are based on neuronlike adaptive elements that can behave as single unit analogs of associative conditioning. In this article we develop a similar adaptive element, but one which is more closely in accord with the facts of animal learning theory than elements commonly studied in adaptive network research. We suggest that an essential feature of classical conditioning that has been largely overlooked by adaptive network theorists is its predictive nature. The adaptive element we present learns to increase its response rate in anticipation of increased stimulation, producing a conditioned response before the occurrence of the unconditioned stimulus. The element also is in strong agreement with the behavioral data regarding the effects of stimulus context, since it is a temporally refined extension of the Rescorla-Wagner model. We show by computer simulation that the element becomes sensitive to the most reliable, nonredundant, and earliest predictors of reinforcement. We also point out that the model solves many of the stability and saturation problems encountered in network simulations. Finally, we discuss our model in light of recent advances in the physiology and biochemistry of synaptic mechanisms.


Sutton, R.S., & Barto, A.G. (1981). An adaptive network that constructs and uses an internal model of its world, Cognition and Brain Theory 4:217-246.

ABSTRACT: Many theorists have emphasized the role of an "internal model of the world" in directing intelligent adaptive behavior. An internal model can be used to internally simulate the consequences of possible actions in order to choose among them without the necessity of overtly performing them. Animal learning theorists have taken latent learning experiments as demonstrations that animals can learn and use such internal models. In this paper, we describe an adaptive network of neuronlike components that constructs and uses an internal model, and we demonstrate this ability by describing a computer simulation of its behavior in a simplified analog of a latent learning task. The task has been made as simple as possible while still retaining those features that make behavior in latent learning tasks difficult to account for by connectionist models. The network illustrates a principle by which connectionist-like learning rules can give rise to behavior apparently requiring the formation and use of internal models. As such, it may help form a bridge between brain theory and connectionist models on the one hand, and cognitive and information processing models on the other.


Barto, A.G., Sutton, R.S. (1981). Goal seeking components for adaptive intelligence: An initial assessment. Air Force Wright Aeronautical Laboratories Technical Report AFWAL-TR-81-1070. Wright-Patterson Air Force Base, Ohio 45433. 524 pages.  (Appendix C is available separately.)

ABSTRACT: This report assesses the promise of a network approach to adaptive problem solving in which the network components themselves possess considerable adaptive power. We show that components designed with attention to the temporal aspects of reinforcement learning can acquire knowledge about feedback pathways in which they are embedded and can use this knowledge to seek their preferred inputs, thus combining pattern recognition, search, and control functions. A review of adaptive network research shows that networks of components having these capabilities have not been studied previously. We demonstrate that simple networks of these elements can solve types of problems that are beyond the capabilities of networks studied in the past. An associative memory is presented that retains the generalization capabilities and noise resistance of associative memories previously studied but does not require a "teacher" to provide the desired associations. It conducts active, closed-loop searches for the most rewarding associations. We provide an example in which these searches are conducted through the system's external environment and an example in which they are conducted through an internal predictive model of that environment. The latter system is capable of a simple form of latent learning. We argue that components capable of making progress toward their goals when embedded in environments that are indifferent, or even hostile, with respect to these goals can form the substrate for a decentralized, parallel, and pervasively adaptive problem solver. We discuss the hypothesis that neural information processing can be understood in these terms, and we relate our results to animal learning data.


Barto, A.G., Sutton, R.S., & Brouwer, P. (1981). Associative search network: A reinforcement learning associative memory, Biological Cybernetics 40:201-211.

ABSTRACT: An associative memory system is presented which does not require a "teacher" to provide the desired associations. For each input key it conducts a search for the output pattern which optimizes an external payoff or reinforcement signal. The associative search network (ASN) combines pattern recognition and function optimization capabilities in a simple and effective way. We define the associative search problem, discuss conditions under which the associative search network is capable of solving it, and present results from computer simulations. The synthesis of sensori-motor control surfaces is discussed as an example of the associative search problem.


Barto, A.G. & Sutton, R.S. (1981). Landmark learning: An illustration of associative search, Biological Cybernetics 42:1-8.

ABSTRACT: In a previous paper we defined the associative search problem and presented a system capable of solving it under certain conditions. In this paper we interpret a spatial learning problem as an associative search task and describe the behavior of an adaptive network capable of solving it. This example shows how naturally the associative search problem can arise and permits the search, association, and generalization properties of the adaptive network to be clearly illustrated.


Sutton, R.S. (1978). Single channel theory: A neuronal theory of learning, Brain Theory Newsletter 3, No. 3/4, pp. 72-75. (earliest publication)


Sutton, R.S. (1978 - unpublished). A unified theory of expectation in classical and instrumental conditioning. Bachelors thesis, Stanford University.


Sutton, R.S. (1978 - unpublished). Learning theory support for a single channel theory of the brain.