We now develop in detail one class of learning methods for function approximation in value prediction, those based on gradient descent. Gradient-descent methods are among the most widely used of all function approximation methods and are particularly well suited to reinforcement learning.
In gradient-descent methods, the parameter vector $\vec\theta_t$ is a column vector with a fixed number of real-valued components, $\vec\theta_t = \big(\theta_t(1), \theta_t(2), \ldots, \theta_t(n)\big)^\top$ (the $\top$ here denotes transpose), and $V_t(s)$ is a smooth differentiable function of $\vec\theta_t$ for all $s \in \mathcal{S}$. For now, let us assume that on each step $t$, we observe a new example $s_t \mapsto V^\pi(s_t)$. These states might be successive states from an interaction with the environment, but for now we do not assume so. Even though we are given the exact, correct value $V^\pi(s_t)$ for each $s_t$, there is still a difficult problem because our function approximator has limited resources and thus limited resolution. In particular, there is generally no $\vec\theta$ that gets all the states, or even all the examples, exactly correct. In addition, we must generalize to all the other states that have not appeared in examples.
We assume that states appear in examples with the same distribution, $P$, over which we are trying to minimize the MSE as given by (8.1). A good strategy in this case is to try to minimize error on the observed examples. Gradient-descent methods do this by adjusting the parameter vector after each example by a small amount in the direction that would most reduce the error on that example:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha \Big[ V^\pi(s_t) - V_t(s_t) \Big] \nabla_{\vec\theta_t} V_t(s_t), \tag{8.2}$$

where $\alpha$ is a positive step-size parameter and $\nabla_{\vec\theta_t} V_t(s_t)$ denotes the vector of partial derivatives of $V_t(s_t)$ with respect to the components of $\vec\theta_t$, that is, the gradient of $V_t(s_t)$ with respect to $\vec\theta_t$.
It may not be immediately apparent why only a small step is taken in the direction of the gradient. Could we not move all the way in this direction and completely eliminate the error on the example? In many cases this could be done, but usually it is not desirable. Remember that we do not seek or expect to find a value function that has zero error on all states, but only an approximation that balances the errors in different states. If we completely corrected each example in one step, then we would not find such a balance. In fact, the convergence results for gradient methods assume that the step-size parameter decreases over time. If it decreases in such a way as to satisfy the standard stochastic approximation conditions (2.7), then the gradient-descent method (8.2) is guaranteed to converge to a local optimum.
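As a concrete illustration, here is a minimal sketch of update (8.2) for a linear value function $V_t(s) = \vec\theta_t^\top \phi(s)$, for which the gradient is simply the feature vector $\phi(s)$. The feature vector, step size, and target value below are invented for illustration and are not part of the text.

```python
import numpy as np

def gradient_update(theta, phi_s, target, alpha):
    """One application of update (8.2): move theta a small step so as to
    reduce the squared error on this example.

    For a linear approximator V(s) = theta . phi(s), the gradient of V(s)
    with respect to theta is just the feature vector phi(s).
    """
    error = target - theta @ phi_s          # target minus current estimate V_t(s_t)
    return theta + alpha * error * phi_s

# A few small steps on one example shrink, but need not eliminate, its error.
theta = np.zeros(3)
phi_s = np.array([1.0, 0.5, -0.5])          # hypothetical feature vector for s_t
for _ in range(20):
    theta = gradient_update(theta, phi_s, target=2.0, alpha=0.1)
print(theta @ phi_s)                        # close to, but not exactly, 2.0
```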
We turn now to the case in which the target output, $v_t$, of the $t$th training example, $s_t \mapsto v_t$, is not the true value, $V^\pi(s_t)$, but some approximation of it. For example, $v_t$ might be a noise-corrupted version of $V^\pi(s_t)$, or it might be one of the backed-up values mentioned in the previous section. In such cases we cannot perform the exact update (8.2) because $V^\pi(s_t)$ is unknown, but we can approximate it by substituting $v_t$ in place of $V^\pi(s_t)$. This yields the general gradient-descent method for state-value prediction:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha \Big[ v_t - V_t(s_t) \Big] \nabla_{\vec\theta_t} V_t(s_t). \tag{8.3}$$

If $v_t$ is an unbiased estimate of the true value, that is, if $E\{v_t\} = V^\pi(s_t)$ for each $t$, then $\vec\theta_t$ is guaranteed to converge to a local optimum under the usual stochastic-approximation conditions on the decreasing step-size parameter.
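The following sketch illustrates why an unbiased but noisy target still works in (8.3): with zero-mean noise the updates average out and the estimate settles near the true value. The linear features, noise model, and constants are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)
phi_s = np.array([1.0, 0.5, -0.5])          # hypothetical feature vector
true_value = 2.0
alpha = 0.05

# Update (8.3) with noisy but unbiased targets v_t = V_pi(s_t) + zero-mean noise.
for _ in range(2000):
    v_t = true_value + rng.normal(scale=0.5)
    theta += alpha * (v_t - theta @ phi_s) * phi_s

print(theta @ phi_s)   # hovers near 2.0; a decreasing step size would converge
```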
For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy $\pi$. Let $R_t$ denote the return following each state, $s_t$. Because the true value of a state is the expected value of the return following it, the Monte Carlo target $v_t = R_t$ is by definition an unbiased estimate of $V^\pi(s_t)$. With this choice, the general gradient-descent method (8.3) converges to a locally optimal approximation to $V^\pi$. Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution.
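As a sketch of this Monte Carlo case, the code below updates every visited state toward the return that followed it, using (8.3) with $v_t = R_t$. The toy random-walk environment, one-hot features, and constants are invented for illustration.

```python
import numpy as np

def features(state, n_states=5):
    """One-hot features for a small chain of states (illustrative only)."""
    phi = np.zeros(n_states)
    phi[state] = 1.0
    return phi

def run_episode(rng, n_states=5):
    """A toy random walk: returns a list of (state, reward) pairs (illustrative)."""
    state, trajectory = n_states // 2, []
    while 0 <= state < n_states:
        next_state = state + rng.choice([-1, 1])
        reward = 1.0 if next_state == n_states else 0.0   # +1 only on the right exit
        trajectory.append((state, reward))
        state = next_state
    return trajectory

rng = np.random.default_rng(0)
theta = np.zeros(5)
alpha, gamma = 0.05, 1.0

for _ in range(2000):
    trajectory = run_episode(rng)
    # Compute the return R_t following each visited state, then apply (8.3)
    # with the Monte Carlo target v_t = R_t.
    G = 0.0
    for state, reward in reversed(trajectory):
        G = reward + gamma * G
        phi_s = features(state)
        theta += alpha * (G - theta @ phi_s) * phi_s

print(theta)   # approximates the true state values of the random walk
```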
Similarly, we can use $n$-step TD returns and their averages for $v_t$. For example, the gradient-descent form of TD($\lambda$) uses the $\lambda$-return, $v_t = R_t^\lambda$, as its approximation to $V^\pi(s_t)$, yielding the forward-view update:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha \Big[ R_t^\lambda - V_t(s_t) \Big] \nabla_{\vec\theta_t} V_t(s_t). \tag{8.4}$$

As (8.4) provides the forward view of gradient-descent TD($\lambda$), so the backward view is provided by

$$\vec\theta_{t+1} = \vec\theta_t + \alpha \, \delta_t \, \vec{e}_t, \tag{8.5}$$

where $\delta_t$ is the usual TD error,

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \tag{8.6}$$

and $\vec{e}_t$ is a column vector of eligibility traces, one for each component of $\vec\theta_t$, updated by

$$\vec{e}_t = \gamma \lambda \, \vec{e}_{t-1} + \nabla_{\vec\theta_t} V_t(s_t), \tag{8.7}$$

with $\vec{e}_0 = \vec{0}$.
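Here is a sketch of the backward view, equations (8.5)-(8.7), with accumulating eligibility traces on the same invented random-walk setup as above; the environment, one-hot features, and constants are again assumptions for illustration, with the linear form making the gradient equal to the feature vector.

```python
import numpy as np

def features(state, n_states=5):
    """One-hot features; grad_theta V(s) = phi(s) for the linear form."""
    phi = np.zeros(n_states)
    phi[state] = 1.0
    return phi

def value(theta, state, n_states=5):
    # V_t(s) inside the chain; terminal (absorbing) states have value 0
    return theta @ features(state) if 0 <= state < n_states else 0.0

rng = np.random.default_rng(0)
n_states = 5
theta = np.zeros(n_states)
alpha, gamma, lam = 0.05, 1.0, 0.8

for _ in range(2000):
    e = np.zeros(n_states)                  # eligibility trace, e_0 = 0      (8.7)
    state = n_states // 2
    while 0 <= state < n_states:
        next_state = state + rng.choice([-1, 1])
        reward = 1.0 if next_state == n_states else 0.0
        # TD error                                                            (8.6)
        delta = reward + gamma * value(theta, next_state) - value(theta, state)
        # decay the trace, then add the gradient of V at s_t                  (8.7)
        e = gamma * lam * e + features(state)
        # parameter update                                                    (8.5)
        theta += alpha * delta * e
        state = next_state

print(theta)   # approximates the same state values as the Monte Carlo sketch
```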
Two methods for gradient-based function approximation have been used widely in reinforcement learning. One is multilayer artificial neural networks using the error backpropagation algorithm. This maps immediately onto the equations and algorithms just given, where the backpropagation process is the way of computing the gradients. The second popular form is the linear form, which we discuss extensively in the next section.
Exercise 8.1 Show that table-lookup TD($\lambda$) is a special case of general TD($\lambda$) as given by equations (8.5)-(8.7).
Exercise 8.2 State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one table entry (value estimate) used for each group. Whenever a state in a group is encountered, the group's entry is used to determine the state's value, and when the state is updated, the group's entry is updated. Show that this kind of state aggregation is a special case of a gradient method such as (8.4).
Exercise 8.3 The equations given in this section are for the on-line version of gradient-descent TD($\lambda$). What are the equations for the off-line version? Give a complete description specifying the new approximate value function at the end of an episode in terms of the approximate value function used during the episode. Start by modifying a forward-view equation for TD($\lambda$), such as (8.4).
Exercise 8.4 For off-line updating, show that equations (8.5)-(8.7) produce updates identical to (8.4).