We now develop in detail one class of learning methods for function approximation in value prediction, those based on gradient descent. Gradient-descent methods are among the most widely used of all function approximation methods and are particularly well suited to reinforcement learning.

In gradient-descent methods, the parameter vector is a column vector with
a fixed number of real-valued components,
$\vec{\theta}_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$ (the $T$ here denotes
transpose), and $V_t(s)$ is a smooth differentiable function of
$\vec{\theta}_t$ for all $s \in \mathcal{S}$. For now, let us assume that on each step $t$, we
observe a new example $s_t \mapsto V^\pi(s_t)$. These states might be
successive states from an interaction with the environment, but for now we do
*not* assume so. Even though we are given the exact,
correct values $V^\pi(s_t)$ for each $s_t$, there is still a difficult problem
because our function approximator has limited resources and thus limited
resolution. In particular, there is generally no $\vec{\theta}$ that gets all the states,
or even all the examples, exactly correct. In addition, we must generalize to
all the other states that have not appeared in examples.

We assume that states appear in examples with the same distribution, $P$, over
which we are trying to minimize the MSE as given by (8.1). A good
strategy in this case is to try to minimize error on the observed examples.
Gradient-descent methods do this by adjusting the parameter vector after
each example by a small amount in the direction that would most reduce the
error on that example:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\vec{\theta}_t} V_t(s_t), \tag{8.2}$$

where $\alpha$ is a positive step-size parameter, and $\nabla_{\vec{\theta}} f(\vec{\theta})$, for any function $f$, denotes the vector of partial derivatives, $\left( \frac{\partial f(\vec{\theta})}{\partial \theta(1)}, \frac{\partial f(\vec{\theta})}{\partial \theta(2)}, \ldots, \frac{\partial f(\vec{\theta})}{\partial \theta(n)} \right)^T$. This derivative vector is the *gradient* of $f$ with respect to $\vec{\theta}$. This kind of method is called *gradient descent* because the overall step in $\vec{\theta}_t$ is proportional to the negative gradient of the example's squared error, which is the direction in which the error falls most rapidly.

It may not be immediately apparent why only a small step is taken in the direction of the gradient. Could we not move all the way in this direction and completely eliminate the error on the example? In many cases this could be done, but usually it is not desirable. Remember that we do not seek or expect to find a value function that has zero error on all states, but only an approximation that balances the errors in different states. If we completely corrected each example in one step, then we would not find such a balance. In fact, the convergence results for gradient methods assume that the step-size parameter decreases over time. If it decreases in such a way as to satisfy the standard stochastic approximation conditions (2.7), then the gradient-descent method (8.2) is guaranteed to converge to a local optimum.
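As a concrete illustration, update (8.2) can be sketched in Python for the simplest case of a linear approximator $V_t(s) = \vec{\theta}_t^T \phi(s)$, whose gradient with respect to $\vec{\theta}$ is just the feature vector $\phi(s)$. The feature vectors and true values below are hypothetical, chosen so that no parameter vector fits all three states exactly:

```python
import numpy as np

def gradient_descent_step(theta, phi_s, true_value, alpha):
    """One application of update (8.2) for a linear approximator.

    For V_t(s) = theta . phi(s), the gradient with respect to theta
    is simply the feature vector phi(s).
    """
    estimate = theta @ phi_s
    return theta + alpha * (true_value - estimate) * phi_s

# Illustrative setup (hypothetical): 3 states, 2 features each.
features = {0: np.array([1.0, 0.0]),
            1: np.array([1.0, 1.0]),
            2: np.array([0.0, 1.0])}
true_values = {0: 1.0, 1: 2.0, 2: 2.0}   # no exact 2-parameter solution

theta = np.zeros(2)
rng = np.random.default_rng(0)
for t in range(2000):
    s = int(rng.integers(3))   # states drawn with a fixed distribution P
    theta = gradient_descent_step(theta, features[s], true_values[s],
                                  alpha=0.05)

# theta settles near the least-squares compromise (2/3, 5/3),
# balancing the irreducible errors across the three states.
print(theta)
```

Because the step size here is held constant, the parameters hover near the minimum-MSE solution rather than converging to it exactly, which is why the convergence results mentioned below require a decreasing step size.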

We turn now to the case in which the target output, $v_t$, of the $t$th
training example, $s_t \mapsto v_t$, is not the true value,
$V^\pi(s_t)$, but some approximation of it. For example, $v_t$ might be a
noise-corrupted version of $V^\pi(s_t)$, or it might be one of the backed-up
values mentioned in the previous section. In such cases we cannot perform the
exact update (8.2) because $V^\pi(s_t)$ is unknown, but we can
approximate it by substituting $v_t$ in place of $V^\pi(s_t)$. This yields
the general gradient-descent method for state-value prediction:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - V_t(s_t) \right] \nabla_{\vec{\theta}_t} V_t(s_t). \tag{8.3}$$

If $v_t$ is an unbiased estimate of $V^\pi(s_t)$, that is, if $E\{v_t\} = V^\pi(s_t)$ for each $t$, then $\vec{\theta}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions (2.7) for decreasing the step-size parameter $\alpha$.

For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy $\pi$. Let $R_t$ denote the return following each state, $s_t$. Because the true value of a state is the expected value of the return following it, the Monte Carlo target $v_t = R_t$ is by definition an unbiased estimate of $V^\pi(s_t)$. With this choice, the general gradient-descent method (8.3) converges to a locally optimal approximation to $V^\pi(s)$. Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution.
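The Monte Carlo instance of (8.3) can be sketched as follows, again assuming a linear approximator. The episode generator, feature map, and toy problem are hypothetical stand-ins for interaction under the policy:

```python
import numpy as np

def gradient_mc_prediction(generate_episode, phi, n_features,
                           alpha=0.01, gamma=1.0, n_episodes=100):
    """Gradient-descent Monte Carlo prediction: update (8.3) with v_t = R_t.

    generate_episode() should return a list of (state, reward) pairs,
    where reward is the reward received on leaving that state.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        episode = generate_episode()
        G = 0.0
        # Work backwards so G accumulates the return following each state.
        for s, r in reversed(episode):
            G = r + gamma * G
            phi_s = phi(s)
            theta += alpha * (G - theta @ phi_s) * phi_s
    return theta

# Toy example (hypothetical): a single state with reward 1, then
# termination, so the true value is 1.
theta = gradient_mc_prediction(
    lambda: [(0, 1.0)],
    phi=lambda s: np.array([1.0]),
    n_features=1, alpha=0.1, n_episodes=200)
print(theta)   # approaches [1.0]
```

Because the return $G$ here is computed only after the episode ends, updates are necessarily made episode by episode rather than on-line, in contrast to the TD methods below.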

Similarly, we can use $n$-step TD returns and their averages for $v_t$. For
example, the gradient-descent form of TD($\lambda$) uses the $\lambda$-return,
$v_t = R_t^\lambda$, as its approximation to $V^\pi(s_t)$, yielding the
forward-view update:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ R_t^\lambda - V_t(s_t) \right] \nabla_{\vec{\theta}_t} V_t(s_t). \tag{8.4}$$

Unfortunately, for $\lambda < 1$, $R_t^\lambda$ is not an unbiased estimate of $V^\pi(s_t)$, and thus this method does not converge to a local optimum. The situation is the same when DP targets are used, such as $v_t = E_\pi\left\{ r_{t+1} + \gamma V_t(s_{t+1}) \mid s_t \right\}$. Nevertheless, such bootstrapping methods can be quite effective, and other performance guarantees are available for important special cases, as we discuss later in this chapter. For now we emphasize the relationship of these methods to the general gradient-descent form (8.3). Although increments as in (8.4) are not themselves gradients, it is useful to view this method as a gradient-descent method (8.3) with a bootstrapping approximation in place of the desired output, $V^\pi(s_t)$.
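A full DP backup target of this kind can be computed exactly when a model of the environment is available. A minimal sketch, assuming a hypothetical model `P` that lists `(probability, reward, next_state)` triples for each state, and a linear approximator:

```python
import numpy as np

def dp_target(s, theta, phi, P, gamma=0.9):
    """Full-backup target v_t = E[r_{t+1} + gamma * V_t(s_{t+1}) | s_t = s].

    P[s] is a list of (prob, reward, next_state) triples for state s
    (a known model, as in DP), so the expectation is computed exactly.
    """
    return sum(p * (r + gamma * (theta @ phi(s2))) for p, r, s2 in P[s])

# Tiny illustration (hypothetical model and parameters):
theta = np.array([0.0, 2.0])
phi = lambda s: np.eye(2)[s]
P = {0: [(0.5, 1.0, 1), (0.5, 0.0, 1)]}
target = dp_target(0, theta, phi, P)
print(target)   # 0.5*(1.0 + 0.9*2.0) + 0.5*(0.0 + 0.9*2.0) = 2.3

# The target then drops straight into update (8.3):
# theta += alpha * (target - theta @ phi(0)) * phi(0)
```

Note that the target itself depends on the current parameters through $V_t(s_{t+1})$, which is exactly why it is a biased, bootstrapping estimate rather than an unbiased one.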

As (8.4) provides the forward view of
gradient-descent TD($\lambda$), so the backward view is provided by

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t, \tag{8.5}$$

where $\delta_t$ is the usual TD error,

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \tag{8.6}$$

and $\vec{e}_t$ is a column vector of eligibility traces, one for each component of $\vec{\theta}_t$, updated by

$$\vec{e}_t = \gamma \lambda \vec{e}_{t-1} + \nabla_{\vec{\theta}_t} V_t(s_t), \tag{8.7}$$

with $\vec{e}_0 = \vec{0}$. A complete algorithm for on-line gradient-descent TD($\lambda$) is given in Figure 8.1.
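The backward-view updates above can be assembled into a sketch of the on-line algorithm, in the spirit of Figure 8.1. It assumes a linear approximator, so that $\nabla_{\vec{\theta}_t} V_t(s_t) = \phi(s_t)$; the environment interface and the toy chain below are hypothetical:

```python
import numpy as np

def td_lambda_prediction(env_reset, env_step, phi, n_features,
                         alpha=0.05, gamma=0.9, lam=0.8, n_episodes=100):
    """On-line gradient-descent TD(lambda) with a linear approximator.

    env_reset() -> initial state; env_step(s) -> (reward, next_state, done).
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        e = np.zeros(n_features)            # eligibility traces, e_0 = 0
        s = env_reset()
        done = False
        while not done:
            r, s_next, done = env_step(s)
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v          # TD error (8.6)
            e = gamma * lam * e + phi(s)            # trace update (8.7)
            theta += alpha * delta * e              # parameter update (8.5)
            s = s_next
    return theta

# Toy 2-state chain (hypothetical): 0 -> 1 -> terminal,
# rewards 0 then 1, so with gamma = 1 both states have true value 1.
theta = td_lambda_prediction(
    env_reset=lambda: 0,
    env_step=lambda s: (0.0, 1, False) if s == 0 else (1.0, None, True),
    phi=lambda s: np.eye(2)[s],
    n_features=2, alpha=0.1, gamma=1.0, lam=0.8, n_episodes=300)
print(theta)   # approaches [1.0, 1.0]
```

Unlike the Monte Carlo version, each transition produces an immediate update, with the traces spreading each TD error back over the components of $\vec{\theta}$ responsible for recently visited states.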

Two methods for gradient-based function approximation have been used widely in reinforcement learning. One is multilayer artificial neural networks using the error backpropagation algorithm. This maps immediately onto the equations and algorithms just given, where the backpropagation process is the way of computing the gradients. The second popular form is the linear form, which we discuss extensively in the next section.