As usual, we begin with the prediction problem, that of estimating the state-value function $V^\pi$ from experience generated using policy $\pi$. The novelty in this chapter is that the approximate value function at time $t$, $V_t$, is represented not as a table, but as a parameterized functional form with parameter vector $\vec{\theta}_t$. This means that the value function depends totally on $\vec{\theta}_t$, varying from time step to time step only as $\vec{\theta}_t$ varies. For example, $V_t$ might be the function computed by an artificial neural network, with $\vec{\theta}_t$ the vector of connection weights. By adjusting the weights, any of a wide range of different functions can be implemented by the network. Or $V_t$ might be the function computed by a decision tree, where $\vec{\theta}_t$ is all the parameters defining the split points and leaf values of the tree. Typically, the number of parameters (the number of components of $\vec{\theta}_t$) is much less than the number of states, and changing one parameter changes the estimated value of many states. Consequently, when a single state is backed up, the change generalizes from that state to affect the values of many other states.
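As a concrete illustration, the following is a minimal sketch of a linearly parameterized value function, in which $V_t(s)$ is the inner product of a feature vector for state $s$ with the parameter vector $\vec{\theta}_t$. The feature mapping `features` and the dimension `n` are placeholders invented for this example, not part of the text.

```python
import numpy as np

def features(s, n=8):
    """Placeholder feature mapping: encode state s as an n-dimensional vector.
    A simple hash-based one-hot encoding is used just to make the sketch runnable."""
    x = np.zeros(n)
    x[hash(s) % n] = 1.0
    return x

class LinearValueFunction:
    """V_t(s) = theta . features(s): one parameter per feature, far fewer
    parameters than states, so changing one component of theta changes the
    estimated values of many states at once."""
    def __init__(self, n=8):
        self.theta = np.zeros(n)

    def value(self, s):
        return float(np.dot(self.theta, features(s, len(self.theta))))
```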
All of the prediction methods covered in this book have been described as backups, that is, as updates to an estimated value function that shift its value at particular states toward a ``backed-up value'' for that state. Let us refer to an individual backup by the notation $s \mapsto v$, where $s$ is the state backed up and $v$ is the backed-up value that $s$'s estimated value is shifted toward. For example, the DP backup for value prediction is $s \mapsto E_\pi\{r_{t+1} + \gamma V_t(s_{t+1}) \mid s_t = s\}$, the Monte Carlo backup is $s_t \mapsto R_t$, the TD(0) backup is $s_t \mapsto r_{t+1} + \gamma V_t(s_{t+1})$, and the general TD($\lambda$) backup is $s_t \mapsto R_t^\lambda$. In the DP case, an arbitrary state $s$ is backed up, whereas in the other cases the state, $s_t$, encountered in (possibly simulated) experience is backed up.
It is natural to interpret each backup as specifying an example of the desired input-output behavior of the estimated value function. In a sense, the backup $s \mapsto v$ means that the estimated value for state $s$ should be more like $v$. Up to now, the actual update implementing the backup has been trivial: the table entry for $s$'s estimated value has simply been shifted a fraction of the way toward $v$. Now we permit arbitrarily complex and sophisticated function approximation methods to implement the backup. The normal inputs to these methods are examples of the desired input-output behavior of the function they are trying to approximate. We use these methods for value prediction simply by passing to them the $s \mapsto v$ of each backup as a training example. The approximate function they produce we then interpret as an estimated value function.
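To make the training-example view concrete, here is a minimal sketch, building on the hypothetical `LinearValueFunction` and `features` above, in which each backup $s \mapsto v$ is passed to the approximator as an input-output pair and the parameters are nudged a fraction of the way toward reproducing it. The constant step size `alpha` is an illustrative choice, not something specified in the text.

```python
def train_on_backup(vf, s, v, alpha=0.1):
    """Treat the backup s -> v as a supervised training example:
    move the prediction for s a fraction alpha of the way toward v.
    For a linear approximator this is a gradient step on the squared error."""
    x = features(s, len(vf.theta))
    error = v - vf.value(s)           # how far the current estimate is from the backed-up value
    vf.theta += alpha * error * x     # adjust parameters; the change generalizes to similar states

# Example: a TD(0)-style backup s_t -> r + gamma * V(s_next)
vf = LinearValueFunction()
s_t, r, s_next, gamma = "A", 1.0, "B", 0.9
train_on_backup(vf, s_t, r + gamma * vf.value(s_next))
```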
Viewing each backup as a conventional training example in this way enables us to use any of a wide range of existing function approximation methods for value prediction. In principle, we can use any method for supervised learning from examples, including artificial neural networks, decision trees, and various kinds of multivariate regression. However, not all function-approximation methods are equally well suited for use in reinforcement learning. The most sophisticated neural-network and statistical methods all assume a static training set over which multiple passes are made. In reinforcement learning, however, it is important that learning be able to occur online, while interacting with the environment or with a model of the environment. To do this requires methods that are able to learn efficiently from incrementally acquired data. In addition, reinforcement learning generally requires function approximators able to handle nonstationary target functions (functions that change over time). For example, in GPI (generalized policy iteration) control methods we often seek to learn $Q^\pi$ while $\pi$ changes. Even if the policy remains the same, the target values of training examples are nonstationary if they are generated by bootstrapping methods (DP and TD). Methods that cannot easily handle such nonstationarity are less suitable for reinforcement learning.
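As a small illustration of why nonstationarity matters, the sketch below contrasts a sample-average estimate, which weights all past targets equally, with a constant-step-size update, which keeps tracking a drifting target. The numbers are made up purely for the example.

```python
# Illustrative only: tracking a nonstationary target value.
# A sample average gives equal weight to old, stale targets;
# a constant step size lets recent targets dominate.
targets = [1.0] * 50 + [3.0] * 50    # target shifts partway through

avg, n = 0.0, 0
tracked, alpha = 0.0, 0.1
for v in targets:
    n += 1
    avg += (v - avg) / n               # sample average: step size 1/n shrinks over time
    tracked += alpha * (v - tracked)   # constant alpha: adapts to the shifted target

print(round(avg, 2), round(tracked, 2))  # avg lags near 2.0; tracked ends close to 3.0
```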
What performance measures are appropriate for evaluating function-approximation methods? Most supervised-learning methods seek to minimize the mean squared error over some distribution, $P$, of the inputs. In our value-prediction problem, the inputs are states and the target function is the true value function $V^\pi$, so the mean squared error (MSE) for an approximation $V_t$, using parameter $\vec{\theta}_t$, is

$$\mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s) \Big[ V^\pi(s) - V_t(s) \Big]^2,$$
where $P$ is a distribution weighting the errors of different states. This distribution is important because it is usually not possible to reduce the error to zero at all states. After all, there are generally far more states than there are components to $\vec{\theta}_t$. The flexibility of the function approximator is thus a scarce resource. Better approximation at some states can be gained generally only at the expense of worse approximation at other states. The distribution, $P$, specifies how these tradeoffs should be made.
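The following is a small sketch of this weighted error, assuming a finite list of `states`, a weighting `P` given as a dictionary of probabilities, a hypothetical `true_value` standing in for $V^\pi$, and the `LinearValueFunction` from the earlier sketch.

```python
def mse(vf, states, P, true_value):
    """Mean squared error of the approximation vf, weighted by the distribution P:
    sum over states of P(s) * (V_pi(s) - V_t(s))^2."""
    return sum(P[s] * (true_value(s) - vf.value(s)) ** 2 for s in states)

# Hypothetical usage with a two-state problem and known true values.
states = ["A", "B"]
P = {"A": 0.7, "B": 0.3}                # how much we care about errors at each state
true_value = {"A": 1.0, "B": 2.0}.get   # stand-in for the (normally unknown) V_pi
print(mse(LinearValueFunction(), states, P, true_value))
```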
The distribution $P$ is also usually the distribution from which the states in the training examples are drawn, and thus the distribution of states at which backups are done. If we wish to minimize error over a certain distribution of states, then it makes sense to train the function approximator with examples from that same distribution. For example, if you want a uniform level of error over the entire state set, then it makes sense to train with backups distributed uniformly over the entire state set, such as in the exhaustive sweeps of some DP methods. Henceforth, let us assume that the distribution of states at which backups are done and the distribution that weights errors, $P$, are the same.
A distribution of particular interest is the one describing the frequency with which states are encountered while the agent is interacting with the environment and selecting actions according to $\pi$, the policy whose value function we are approximating. We call this the on-policy distribution, in part because it is the distribution of backups in on-policy control methods. Minimizing error over the on-policy distribution focuses function-approximation resources on the states that actually occur while following the policy, ignoring those that never occur. The on-policy distribution is also the one for which it is easiest to get training examples using Monte Carlo or TD methods. These methods generate backups from sample experience using the policy $\pi$. Because a backup is generated for each state encountered in the experience, the training examples available are naturally distributed according to the on-policy distribution. Stronger convergence results are available for the on-policy distribution than for other distributions, as we discuss later.
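As an illustration, the sketch below generates training examples distributed according to the on-policy distribution simply by following the policy and recording a Monte Carlo backup $s_t \mapsto R_t$ for each state visited. The `env.reset`/`env.step` interface and the `policy` function are assumptions made for the example, not given in the text.

```python
def on_policy_mc_examples(env, policy, gamma=0.9):
    """Run one episode following `policy` and return (state, return) training pairs.
    States appear exactly as often as they are visited, so the examples are
    naturally distributed according to the on-policy distribution."""
    states, rewards = [], []
    s, done = env.reset(), False
    while not done:
        states.append(s)
        s, r, done = env.step(policy(s))   # assumed interface: next state, reward, terminal flag
        rewards.append(r)

    examples, G = [], 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + gamma * G                  # Monte Carlo return R_t, computed backward
        examples.append((s, G))            # backup s_t -> R_t as a training example
    examples.reverse()
    return examples
```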
It is not completely clear that we should care about minimizing the MSE. Our goal in value prediction is potentially different because our ultimate purpose is to use the predictions to aid in finding a better policy. The best predictions for that purpose are not necessarily the best for minimizing MSE. However, it is not yet clear what a more useful alternative goal for value prediction might be. For now, we continue to focus on MSE.
An ideal goal in terms of MSE would be to find a global optimum, a parameter vector $\vec{\theta}^*$ for which $\mathrm{MSE}(\vec{\theta}^*) \le \mathrm{MSE}(\vec{\theta})$ for all possible $\vec{\theta}$. Reaching this goal is sometimes possible for simple function approximators such as linear ones, but is rarely possible for complex function approximators such as artificial neural networks and decision trees. Short of this, complex function approximators may seek to converge instead to a local optimum, a parameter vector $\vec{\theta}^*$ for which $\mathrm{MSE}(\vec{\theta}^*) \le \mathrm{MSE}(\vec{\theta})$ for all $\vec{\theta}$ in some neighborhood of $\vec{\theta}^*$. Although this guarantee is only slightly reassuring, it is typically the best that can be said for nonlinear function approximators. For many cases of interest in reinforcement learning, convergence to an optimum, or even true convergence, does not occur. Nevertheless, an MSE that is within a small bound of an optimum may still be achieved with some methods. Others may in fact diverge, with their MSE approaching infinity in the limit.
In this section we have outlined a framework for combining a wide range of reinforcement learning methods for value prediction with a wide range of function-approximation methods, using the backups of the former to generate training examples for the latter. We have also outlined a range of MSE performance measures to which these methods may aspire. The range of possible methods is far too large to cover in full, and in any case too little is known about most of them to make a reliable evaluation or recommendation. Of necessity, we consider only a few possibilities. In the rest of this chapter we focus on function-approximation methods based on gradient principles, and on linear gradient-descent methods in particular. We focus on these methods in part because we consider them to be particularly promising and because they reveal key theoretical issues, but also just because they are simple and our space is limited. If we had another chapter devoted to function approximation, we would also cover at least memory-based methods and decision-tree methods.