As usual, we begin with the prediction problem of estimating the state-value function $V^\pi$ from experience generated using policy $\pi$. The novelty in this chapter is that the approximate value function at time $t$, $V_t$, is represented not as a table but as a parameterized functional form with parameter vector $\vec\theta_t$. This means that the value function $V_t$ depends totally on $\vec\theta_t$, varying from time step to time step only as $\vec\theta_t$ varies. For example, $V_t$ might be the function computed by an artificial neural network, with $\vec\theta_t$ the vector of connection weights. By adjusting the weights, any of a wide range of different functions $V_t$ can be implemented by the network. Or $V_t$ might be the function computed by a decision tree, where $\vec\theta_t$ is all the parameters defining the split points and leaf values of the tree. Typically, the number of parameters (the number of components of $\vec\theta_t$) is much less than the number of states, and changing one parameter changes the estimated value of many states. Consequently, when a single state is backed up, the change generalizes from that state to affect the values of many other states.
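To make the idea concrete, here is a minimal sketch (in Python, not from the book) of a linear parameterized value function with a small parameter vector; the overlapping feature windows are an assumption chosen purely to illustrate how adjusting a single parameter changes the estimated values of several states.

```python
import numpy as np

# A minimal sketch of a parameterized value function: V(s) is a linear
# function of hand-chosen state features, and theta is the parameter vector.
# The feature construction here is purely illustrative.

def features(s, n_states=10):
    """Overlapping binary features: each feature covers a window of states,
    so one parameter influences the estimated value of several states."""
    x = np.zeros(4)
    for i in range(4):
        if i * 3 <= s < i * 3 + 4:   # overlapping windows of states
            x[i] = 1.0
    return x

def value(s, theta):
    return float(np.dot(theta, features(s)))

theta = np.zeros(4)
theta[1] += 0.5   # adjusting a single parameter...
print([round(value(s, theta), 2) for s in range(10)])
# ...changes the estimated values of every state whose feature window it covers.
```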
All of the prediction methods covered in this book have been described as backups, that is, as updates to an estimated value function that shift its value at particular states toward a "backed-up value" for that state. Let us refer to an individual backup by the notation $s \mapsto v$, where $s$ is the state backed up and $v$ is the backed-up value, or target, that $s$'s estimated value is shifted toward. For example, the DP backup for value prediction is $s \mapsto E_\pi\{r_{t+1} + \gamma V_t(s_{t+1}) \mid s_t = s\}$, the Monte Carlo backup is $s_t \mapsto R_t$, the TD(0) backup is $s_t \mapsto r_{t+1} + \gamma V_t(s_{t+1})$, and the general TD($\lambda$) backup is $s_t \mapsto R^\lambda_t$. In the DP case, an arbitrary state $s$ is backed up, whereas in the other cases the state, $s_t$, encountered in (possibly simulated) experience is backed up.
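The following sketch shows how the Monte Carlo and TD(0) targets $v$ in the backups $s_t \mapsto v$ might be computed; the episode, discount rate, and value table are all made up for illustration and are not part of the text.

```python
import numpy as np

# A minimal sketch of the targets v in the backups s -> v, for a made-up
# episode of states and rewards. V is the current value estimate and gamma
# the discount rate; both are assumed here purely for illustration.

gamma = 0.9
V = {0: 0.0, 1: 0.5, 2: 1.0, 3: 0.0}   # current value estimates
states  = [0, 1, 2, 3]                 # s_0, s_1, s_2, s_3 (terminal)
rewards = [0.0, 1.0, 2.0]              # r_1, r_2, r_3

def mc_target(t):
    """Monte Carlo target for s_t: the full return R_t."""
    return sum(gamma**k * rewards[t + k] for k in range(len(rewards) - t))

def td0_target(t):
    """TD(0) target for s_t: r_{t+1} + gamma * V(s_{t+1})."""
    return rewards[t] + gamma * V[states[t + 1]]

for t in range(3):
    print(f"s_{t}: MC target {mc_target(t):.2f}, TD(0) target {td0_target(t):.2f}")
```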
It is natural to interpret each backup as specifying an example of the desired input-output behavior of the estimated value function. In a sense, the backup $s \mapsto v$ means that the estimated value for state $s$ should be more like $v$. Up to now, the actual update implementing the backup has been trivial: the table entry for $s$'s estimated value has simply been shifted a fraction of the way toward $v$. Now we permit arbitrarily complex and sophisticated function approximation methods to implement the backup. The normal inputs to these methods are examples of the desired input-output behavior of the function they are trying to approximate. We use these methods for value prediction simply by passing to them the $s \mapsto v$ of each backup as a training example. We then interpret the approximate function they produce as an estimated value function.
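As a hedged illustration of this idea, the sketch below treats each backup $s \mapsto v$ as a supervised training example for a simple incremental linear learner; the features, step size, and stream of backups are all assumptions for the example, not anything prescribed by the text.

```python
import numpy as np

# A minimal sketch, not the book's algorithm: each backup s -> v is treated
# as a supervised training example for an incremental learner. Here the
# learner is a simple linear approximator nudged a small step toward v.

def features(s):
    return np.array([1.0, s, s * s])   # illustrative polynomial features

theta = np.zeros(3)
alpha = 0.01                           # step-size parameter (assumed)

def train_on_backup(s, v):
    """Use the training example (s, v) to shift the estimate for s toward v."""
    global theta
    x = features(s)
    estimate = np.dot(theta, x)
    theta += alpha * (v - estimate) * x

# Feed a stream of (state, target) backups as training examples.
for s, v in [(1, 0.8), (2, 1.5), (1, 0.9), (3, 2.1)]:
    train_on_backup(s, v)
print(theta)   # the resulting parameters define the estimated value function
```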
Viewing each backup as a conventional training example in this way enables us to use any of a wide range of existing function approximation methods for value prediction. In principle, we can use any method for supervised learning from examples, including artificial neural networks, decision trees, and various kinds of multivariate regression. However, not all function approximation methods are equally well suited for use in reinforcement learning. The most sophisticated neural network and statistical methods all assume a static training set over which multiple passes are made. In reinforcement learning, however, it is important that learning be able to occur on-line, while interacting with the environment or with a model of the environment. To do this requires methods that are able to learn efficiently from incrementally acquired data. In addition, reinforcement learning generally requires function approximation methods able to handle nonstationary target functions (target functions that change over time). For example, in GPI control methods we often seek to learn $Q^\pi$ while $\pi$ changes. Even if the policy remains the same, the target values of training examples are nonstationary if they are generated by bootstrapping methods (DP and TD). Methods that cannot easily handle such nonstationarity are less suitable for reinforcement learning.
What performance measures are appropriate for evaluating function approximation methods? Most supervised learning methods seek to minimize the mean-squared error (MSE) over some distribution, $P$, of the inputs. In our value prediction problem, the inputs are states and the target function is the true value function $V^\pi$, so the MSE for an approximation $V_t$, using parameter $\vec\theta_t$, is
$$\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s) \Big[ V^\pi(s) - V_t(s) \Big]^2 ,$$
where $P$ is a distribution weighting the errors of different states.
The distribution $P$ is also usually the distribution from which the states in the training examples are drawn, and thus the distribution of states at which backups are done. If we wish to minimize error over a certain distribution of states, then it makes sense to train the function approximator with examples from that same distribution. For example, if you want a uniform level of error over the entire state set, then it makes sense to train with backups distributed uniformly over the entire state set, such as in the exhaustive sweeps of some DP methods. Henceforth, let us assume that the distribution of states at which backups are done and the distribution that weights errors, $P$, are the same.
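A small numerical sketch of the MSE objective above, with an assumed three-state distribution $P$, true values, and approximate values (all numbers invented for illustration):

```python
import numpy as np

# MSE(theta) = sum_s P(s) * (V_pi(s) - V_t(s))**2, for made-up numbers.
P        = np.array([0.5, 0.3, 0.2])   # weighting distribution over 3 states
V_pi     = np.array([1.0, 2.0, 3.0])   # true values (assumed known here)
V_approx = np.array([0.8, 2.5, 2.9])   # current approximation V_t

mse = np.sum(P * (V_pi - V_approx) ** 2)
print(mse)   # 0.5*0.04 + 0.3*0.25 + 0.2*0.01 = 0.097
```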
A distribution of particular interest is the one describing the frequency with which states are encountered while the agent is interacting with the environment and selecting actions according to $\pi$, the policy whose value function we are approximating. We call this the on-policy distribution, in part because it is the distribution of backups in on-policy control methods. Minimizing error over the on-policy distribution focuses function approximation resources on the states that actually occur while following the policy, ignoring those that never occur. The on-policy distribution is also the one for which it is easiest to get training examples using Monte Carlo or TD methods. These methods generate backups from sample experience using the policy $\pi$. Because a backup is generated for each state encountered in the experience, the training examples available are naturally distributed according to the on-policy distribution. Stronger convergence results are available for the on-policy distribution than for other distributions, as we discuss later.
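As an illustration only, the following sketch estimates an on-policy distribution by counting state visits in a toy random-walk environment; the environment, policy, and number of episodes are assumptions, not anything specified in the text.

```python
import numpy as np
from collections import Counter

# A minimal sketch: estimate the on-policy distribution by counting state
# visits while following a fixed (random) policy in a small random walk.
rng = np.random.default_rng(0)
counts = Counter()

def run_episode(n_states=5):
    s = n_states // 2
    while 0 < s < n_states - 1:        # terminal states at both ends
        counts[s] += 1
        s += rng.choice([-1, 1])       # the policy's effect on the state

for _ in range(10_000):
    run_episode()

total = sum(counts.values())
on_policy_dist = {s: counts[s] / total for s in sorted(counts)}
print(on_policy_dist)   # states visited more often get more weight in P
```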
It is not completely clear that we should care about minimizing the MSE. Our goal in value prediction is potentially different because our ultimate purpose is to use the predictions to aid in finding a better policy. The best predictions for that purpose are not necessarily the best for minimizing MSE. However, it is not yet clear what a more useful alternative goal for value prediction might be. For now, we continue to focus on MSE.
An ideal goal in terms of MSE would be to find a global optimum, a parameter vector $\vec\theta^*$ for which $\mathrm{MSE}(\vec\theta^*) \le \mathrm{MSE}(\vec\theta)$ for all possible $\vec\theta$. Reaching this goal is sometimes possible for simple function approximators such as linear ones, but is rarely possible for complex function approximators such as artificial neural networks and decision trees. Short of this, complex function approximators may seek to converge instead to a local optimum, a parameter vector $\vec\theta^*$ for which $\mathrm{MSE}(\vec\theta^*) \le \mathrm{MSE}(\vec\theta)$ for all $\vec\theta$ in some neighborhood of $\vec\theta^*$. Although this guarantee is only slightly reassuring, it is typically the best that can be said for nonlinear function approximators. For many cases of interest in reinforcement learning, convergence to an optimum, or even true convergence, does not occur. Nevertheless, an MSE that is within a small bound of an optimum may still be achieved with some methods. Other methods may in fact diverge, with their MSE approaching infinity in the limit.
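The distinction between a global and a local optimum can be seen in a toy example. The sketch below runs gradient descent on an assumed one-parameter, nonconvex error surface (not an actual MSE from this chapter); depending on the starting point, the parameter settles at either the global optimum or only a local one.

```python
# A toy illustration of global vs. local optima: gradient descent on an
# assumed nonconvex, nonnegative error surface with two minima.

def error(theta):
    return (theta**2 - 1.0)**2 + 0.3 * theta + 0.5   # two minima, one lower

def grad(theta):
    return 4.0 * theta * (theta**2 - 1.0) + 0.3

for start in (-2.0, 2.0):
    theta = start
    for _ in range(1000):
        theta -= 0.01 * grad(theta)   # small gradient steps
    print(f"start {start:+.1f} -> theta {theta:+.3f}, error {error(theta):.3f}")
# One starting point reaches the global optimum; the other only a local one.
```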
In this section we have outlined a framework for combining a wide range of reinforcement learning methods for value prediction with a wide range of function approximation methods, using the backups of the former to generate training examples for the latter. We have also outlined a range of MSE performance measures to which these methods may aspire. The range of possible methods is far too large to cover all, and anyway too little is known about most of them to make a reliable evaluation or recommendation. Of necessity, we consider only a few possibilities. In the rest of this chapter we focus on function approximation methods based on gradient principles, and on linear gradient-descent methods in particular. We focus on these methods in part because we consider them to be particularly promising and because they reveal key theoretical issues, but also because they are simple and our space is limited. If we had another chapter devoted to function approximation, we would also cover at least memory-based and decision-tree methods.