Reinforcement Learning FAQ:
Frequently Asked Questions about Reinforcement Learning

Edited by Rich Sutton

Initiated 8/13/01
Last updated 2/4/04

I get many questions about reinforcement learning -- how to think about it and how do it successfully in practice. This FAQ is an attempt to pull together in one place some of the more common questions and answers. I have been free with my opinions, but I would also welcome the opinions of others for inclusion here. In you have anything to add to my comments, including whole new questions or totally different answers, please send me a note at rich@richsutton.com.

General Questions
Questions about studying and teaching RL
Nuts and Bolts of RL
Advice and Opinions
Miscellaneous Questions
- I am working on a project due soon. Can you help me?
- I am having trouble with your code. Can you help me get it to work?

General Questions

What is Reinforcement Learning?

Reinforcement learning (RL) is learning from interaction with an environment, from the consequences of action, rather than from explicit teaching. RL become popular in the 1990s within machine learning and artificial intelligence, but also within operations research and with offshoots in psychology and neuroscience.

Most RL research is conducted within the mathematical framework of Markov decision processes (MDPs). MDPs involve a decision-making agent interacting with its environment so as to maximize the cumulative reward it receives over time. The agent perceives aspects of the environment's state and selects actions. The agent may estimate a value function and use it to construct better and better decision-making policies over time.

RL algorithms are methods for solving this kind of problem, that is, problems involving sequences of decisions in which each decision affects what opportunities are available later, in which the effects need not be deterministic, and in which there are long-term goals. RL methods are intended to address the kind of learning and decision making problems that people and animals face in their normal, everyday lives.

For more information, see the sources recommended for an introduction to RL.

Is RL just trial-and-error learning, or does it include planning?

Modern reinforcement learning concerns both trial-and-error learning without a model of the environment, and deliberative planning with a model. By "a model" here we mean a model of the dynamics of the environment. In the simplest case, this means just an estimate of the state-transition probabilities and expected immediate rewards of the environment. In general it means any predictions about the environment's future behavior conditional on the agent's behavior.

How does RL relate to Neuro-Dynamic Programming?

To a first approximation, Reinforcement Learning and Neuro-Dynamic Programming are synonomous. The name "reinforcement learning" came from psychology (although psychologists rarely use exactly this term) and dates back to the eary days of cybernetics. For example, Marvin Minsky used this term in his 1954 thesis, and Barto and Sutton revived it in the early 1980's. The name "neuro-dynamic programming" was coined by Bertsekas and Tsitsiklis in 1996 to capture the idea of the field as a combination of neural networks and dynamic programming.

In fact, neither name is very descriptive of the subject, and I recommend you use neither when you want to be technically precise. Names such as this are useful when referring to a general body of research, but not for carefully distinguishing ideas from one another. In that sense, there is no point in trying to draw a careful distinction between the referents of these two names.

The problem with "reinforcement learning" is that it is dated. Much of the field does not concern learning at all, but just planning from complete knowledge of the problem (a model of the environment). Almost the same methods are used for planning and learning, and it seems off-point to emphasize learning in the name of the field. "Reinforcement" does not seem particularly pertinent either.

The problem with "neuro-dynamic programming" is similar in that neither neural networks nor dynamic programming are critical to the field. Many of the methods, such as "Monte Carlo", or "rollout" methods, are completely unrelated to dynamic programming, and neural networks are just one choice among many for method of function approximation. Moreover, one could argue that the component names, "neural networks" and "dynamic programming", are each not very descriptive of their respective methods.

What advantages does RL offer in Operations Research problems?

Using function approximation, RL can apply to much larger state spaces than classical sequential optimization techniques such as dynamic programming. In addition, using simulations (sampling), RL can apply to systems that are too large or complicated to explicitly enumerate the next-state transition probabilities.

How does RL relate to Neuroscience?

Ideally, the ideas of reinforcement learning could constitute part of a computational theory of what the brain is doing and why. A number of links have been drawn between reinforcement learning and neuroscience, beginning with early models of classical conditioning based on temporal-difference learning (see Barto and Sutton, 1982; Sutton and Barto, 1981, 1990; Moore et al., 1986), and continuing through work on foraging and prediction learning (see Montague et al., 1995, 1996), and on dopamine neurons in monkeys as a temporal-difference-error distribution system. A good survey paper is available. See also Suri, submitted. A book collects a number of relevant papers. Doya has extensively developed RL models of the basal ganglia. Many of these areas are very active at present and changing rapidly.

How does RL relate to the psychology of animal behavior?

Broadly speaking, RL works as a pretty good model of instrumental learning, though a detailed argument for this has never been publically made (the closest to this is probably Barto, Sutton and Watkins, 1990). On the other hand, the links between classical (or Pavlovian) conditioning and temporal-difference (TD) learning (one of the central elements of RL) are close and widely acknowledged (see Sutton and Barto, 1990).

Ron Sun has developed hybrid models combining high-level and low-level skill learning, based in part on RL, which make contact with psychological data (see Sun, Merrill, and Peterson, 2001).

How does RL relate to behaviorism?

Formally, RL is unrelated to behaviorism, or at least to the aspects of behaviorism that are widely viewed as undesireable. Behaviorism has been disparaged for focusing exclusively on behavior, refusing to consider what was going on inside the head of the subject. RL of course is all about the algorithms and processes going on inside the agent. For example, we often consider the construction of internal models of the environment within the agent, which is far outside the scope of behaviorism.

Nevertheless, RL shares with behaviorism its origins in animal learning theory, and in its focus on the interface with the environment. RL's states and actions are essentially animal learning theory's stimuli and responses. Part of RL's point is that these are the essential common final path for all that goes on in the agent's head. In the end it all comes down to the actions taken and the states perceived.

How does RL relate to genetic algorithms?

Most work with genetic algorithms simulates evolution, not learning during an individual's life, and because of this is very different from work in RL. That having been said, there are two provisos. First, there is a large body of work on classifier systems that uses or is closely related to genetic algorithms. This work is concerned with learning during a single agent's lifetime (using GAs to organize the components of the agent's mind) and is in fact RL research. The second proviso is that GA work is often related to RL by virtue of being applied to the same problems. For example, GA methods can be applied to evolve a backgammon player and that player can be compared with a player learned by RL methods. In fact, a large portion of systems evolved by GAs are controllers that could alternatively be learned by RL methods. It is tempting here to make a blanket statement about which class of methods is more appropriate or performs better. A crucial distinction is that between problems in which state information is available and those in which it is not. In backgammon, for example, the state of the board is available. In problems like these RL methods seem to have a distinct advantage over evolutionary methods such as GAs. On other problems there is little state information. Suppose you wish to learn a golf swing, or some other ballistic motion that is generally over before useful sensory feedback is available. Then the system is far from Markov and there is little or no advantage provided by addressing the state information during a single swing. In these cases RL methods would not be expected to have an advantage over evolutionary methods. In fact, if they insisted on trying to treat sensory information as state, as in using temporal-difference methods, they may well do worse.

Who invented RL?

There is plenty of blame here to go around. Minsky, Bellman, Howard, Andreae, Werbos, Barto, Sutton, Watkins... All played important or early roles. See the history from the Sutton and Barto text.

Questions about studying and teaching RL

What sources do you recommend for an introduction to reinforcement learning?

For a general introduction, I recommend my book with Prof. Barto:

Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto. MIT Press 1998. Online version. There is also a Japanese translation available. For a more formal treatment, including rigorous proofs, I recommend the text by Bertsekas and Tsitsiklis: Neuro-dynamic Programming, by Dimitri P. Bersekas and John N. Tsitsiklis. Athena Press, 1996. If you don't have time for a textbook-length treatment, your best bet is one or both of these two papers: Reinforcement learning: A survey, by Kaelbling, L.P., Littman, M.L., and Moore, A.W., in the Journal of Artificial Intelligence Research, 4:237--285, 1996. Learning and sequential decision making, by Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H., in Learning and Computational Neuroscience, M. Gabriel and J.W. Moore (Eds.), pp. 539--602, 1990, MIT Press.

Where can I find an online version of the Sutton and Barto book?

http://richsutton.com/book/the-book.html

There is also a lot of other material there: code, figures, errata, some teaching materials.

Where can I find an electronic version of the book suitable for printing?

Pdfs of pre-publication versions of Chapters 1, 3, and 9 (only) are available at http://richsutton.com/book/the-book.html

Where can I find the answers to the exercises?

An instructor's manual containing answers to all the non-programming exercises is available to qualified teachers. Readers using the book for self study can obtain answers on a chapter-by-chapter basis after working on the exercises themselves. For details see here.

I have found an apparent error in the book. What should I do about it?

First, please check the list of errata. If you don't find your error there, then please send me a note describing it at rich@richsutton.com. If I agree that it is an error, then I will add it the the errata list. For your troubles you will be credited in the list for the discovery of the error.

How can I obtain an evaluation copy of the book?

Qualified instructor can generally obtain an examination copy from MIT press. Please see the MIT Press home page for the book.

Nuts and Bolts of RL

My state and/or action space is huge! Can I still apply RL?

Yes, but you can't get by with simple tables; you will need some kind of function approximation.

"Function approximation" refers to the use of a parameterized functional form to represent the value function (and/or the policy), as opposed to a simple table. A table is able to represent the value of each state separately, without confusion, interaction, or generalization with the value of any other state. In typical problems, however, there are far too many states to learn or represent their values individually; instead we have to generalize from observed to states to new, unobserved ones. In principle, this need not be a problem. There are a host of supervised learning methods that can used to approximate functions. However, there are both theoretical and practical pitfalls, and some care is needed. See Chapter 8 of the Sutton and Barto text.

For the most part, the theoretical foundation that RL adopts from dynamic programming is no longer valid in the case of function approximation. For example, Q-learning with linear function approximation is known to be unsound. The strongest positive result is for on-policy prediction with linear function approximators ( Tsitsiklis and Van Roy, 1997; Tadic, 2001). This is an area of active current research (e.g., see Gordon, 2001; Precup, Sutton & Dasgupta, 2001).

Most RL work assumes the action space is discrete; what about continuous actions?

It is true that most RL work has considered discrete action spaces, but this was usually done for convenience, not as an essential limitation of the ideas; and there are exceptions. Nevertheless, it is often not obvious how to extend RL methods to continuous, or even large discrete, action spaces. The key problem is that RL methods typically involve a max or sum over elements of the action space, which is not feasible if the space is large or infinite. The natural approach is to replace the enumeration of actions with a sample of them, and average (just as we replace the enumeration of possible next states with a sample of the, and average). This requires either a very special structure for the action-value function, or else a stored representation of the best known policy. Actor-critic methods are one approach.

With no attempt to be exhaustive, some of the earlier RL research with continuous actions includes:

Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229--256.
Millington, P.J. (1991). Associative Reinforcement Learning for Optimal Control, M.S. Thesis, Massachusetts Institute of Technology, Technical Report CSDL-T-1070.
Baird, L. C., & Klopf, A. H. (1993). Reinforcement learning with high-dimensional, continuous actions. Technical Report WL-TR-93-1147. Wright-Patterson Air Force Base Ohio: Wright Laboratory.
Santamaria, J.C., Sutton, R.S., Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces, Adaptive Behavior 6(2):163-218.
Yamada, S., Nakashima, M., and Shiono, S. (1998). Reinforcement learning to train a cooperative network with both discrete and continuous output neurons, IEEE Trans. on Neural Networks 9(6):1502-1508.

What about continuous time?

Although the ideas of RL extend naturally, in principle, to continuous time, there has been relatively little work here. The best work on this so far is probably that by Kenji Doya.

What is the curse of dimensionality?

The curse of dimensionality refers to the tendency of a state space to grow exponentially in its dimension, that is, in the number of state variables. This is of course a problem for methods such as dynamic programming and other table-based RL methods whose complexity scales linearly with the number of states. Many RL methods are able to partially escape the curse by sampling and by function approximation.

Curiously, the key step in the tile-coding approach to function approximation is expanding the original state representation into a vastly higher dimensional space. This makes many complex nonlinear relationships in the original representation simpler and linear in the expanded representation. The more dimensions in the expanded representation, the more functions can be learned easily and quickly. I call this the blessing of dimensionality. It is one of the ideas behind modern support vector machines, but in fact it goes back at least as far as the perceptron.

Isn't tile-coding just grids, and thus subject to the curse of dimensionality?

Basically, no. Tile coding is a quite general idea and many of the ways it can be used avoid the curse of dimensionality. There are at least three common tricks: 1) you can consider subsets of the state variables in separate tilings, 2) you can use overlapping tilings to get vastly higher resolution in high dimensional spaces than would be possible with a simple grid using the same amount of memory, and 3) you can hash down to a smaller space so as only to devote memory to the portions of state space that actually occur.

Why do you call it "tile coding" instead of the conventional name, "CMACs"?

The idea of tile coding is essentially the same as that underlying CMACs. Without in any way intending to detract from the credit due the pioneers of CMACs (e.g, Albus, Marr, Miller...), sometimes it is better to switch to a new name. The name CMAC stands for, among other things, "Cerebellar Model Articulatory Controller," which seems pretty inappropriate for the current usage. The original CMAC also used a slightly different algorithm -- a correlation rule rather than an error correction rule. The use in reinforcement learning steps it even farther away from its original use. What remains is not so much a learning system (much less a cerebellar model), but a way of coding states for use by a learning system. The key features of the coding is that it is based on multiple exhaustive partitions (tilings!) of the state space, and that it is particularly suited for implementation on serial, digital computers.

Are RL methods stable with function approximation?

The situation is a bit complicated and in flux at present. Stability guarantees depend on the specific algorithm and function approximator, and on the way it is used. This is what we knew as of August 2001:

For some nonlinear parameterized function approximators (FA), any temporal-difference (TD) learning method (including Q-learning and Sarsa) can become unstable (parameters and estimates going to infinity). [Tsitsiklis & Van Roy 1996]
TD(lambda) with linear FA converges near the best linear solution when trained on-policy... [ Tsitsiklis & Van Roy 1997]
...but may become unstable when trained off-policy (updating states with a different distribution than that seen when following the policy). [ Baird 1995]
From which it follows that Q-learning with linear FA can also be unstable. [ Baird 1995]
Sarsa(lambda), on the other hand, is guaranteed stable, although only the weakest of error bounds has been shown. [Gordon 2001]
New linear TD algorithms for the off-policy case have been shown convergent near the best solution. [Precup, Sutton & Dasgupta 2001]

Since then, the Perkins and Precup result from NIPS 2002 has appeared, which may at last have resolved the question positively by proving the convergence of Sarsa(0) with linear function approximation and an appropriate exploration regime.

Advice and Opinions

I am doing RL with a backpropagation neural network and it doesn't work; what should I do?

It is a common error to use a backpropagation neural network as the function approximator in one's first experiments with reinforcement learning, which almost always leads to an unsatisfying failure. The primary reason for the failure is that backpropation is fairly tricky to use effectively, doubly so in an online application like reinforcement learning. It is true that Tesauro used this approach in his strikingly successful backgammon application, but note that at the time of his work with TDgammon, Tesauro was already an expert in applying backprop networks to backgammon. He had already built the world's best computer player of backgammon using backprop networks. He had already learned all the tricks and tweaks and parameter settings to make backprop networks learn well. Unless you have a similarly extensive background of experience, you are likely to be very frustrated using a backprop network as your function approximator in reinforcement learning.

The solution is to step back and accept you can only innovate in one area at a time. First make sure you understand RL ideas in the tabular case, and the general principles of function approximation in RL. Then proceed to a better understood function approximator, such as a linear one. In my own work I never go beyond linear function approximators. I just augment them with larger and larger feature vectors, using coarse tile-coding (see e.g., Chapter 8 of the book, Sutton, 1996; Stone and Sutton, 2001). It may be that this approach would always be superior to backpropagation nets, but that remains to be seen. But someone new to learning and RL algorithms certainly should not start with backprop nets.

Also see here for some of the details for using TD methods with backprop nets.

What is the best algorithm to use?

The true answer, of course, is that we don't know, and that it probably hasn't been invented yet. Each algorithm has strengths and weaknesses, and the current favorite changes every few years. In the 1980s actor-critic methods were very popular, but in the 1990s they were largely superceded by value-function methods such as Q-learning and Sarsa. Q-learning is probably still the most widely used, but its instability with function approximation, discovered in 1995, probably rules it out for the long run. Recently policy-based methods such as actor-critic and value-function-less methods, including some of those from the 1980s, have become popular again.

So, it seems we must keep our minds and options open as RL moves forward.

Why do you disparage the term Q-value?

RL researchers often talk about Q-values, Q-tables, and Q-functions, but to me such usage almost always seems like jargon---that is, like apparently precise technical terms that are off-putting to newcomers but really add nothing. In this case the Q- prefix means only that the values, tables, or functions pertain to state-action pairs. There are several pertinent functions that take state-action pairs, for example, that returning the optimal values, that returning the true values for some policy, and those returning the approximation of one of these two values. In most cases, we give these functions different notations, all of which involve the letter Q. That is why I consider the name Q-function to be apparently technical, but in fact imprecise, and thus undesireable, off-putting jargon. If you are just talking in general about the value of an action at a state, then the term I prefer is simply "action value".

Miscellaneous Questions

I am working on a project due soon. Can you help me?

Reinforcement learning is a big topic, with many large branches and under-explored side areas. And there are no cookbook (easy, ready, immediate) solutions. I can sometimes offer advice that may point you in the right direction, but it will almost always take time, lots of your time, to study and come to understand the issues before you will see the appropriate solution. If you are running up against a deadline, there is little that I can do.

I am having trouble with your code. Can you help me get it to work?

This code is made available on an as-is basis, and in many cases I can provide little or no support. In many cases, I have no personal experience with the code, and in others it will not even run outside of my special computer environments. I make the code available nevertheless, because reading it may clarify some of the implementation points for you. In a few cases, like the tile coding implementations in C++, I have written the code myself to run universally, and I can provide some support.

Maybe you could contribute some of your own code, perhaps a class project, that is better documented and works on a wider range of computational platforms.

Reinforcement Learning FAQ: Frequently Asked Questions about Reinforcement Learning