We now develop in detail one class of learning methods for function approximation in value prediction, those based on gradient descent. Gradient-descent methods are among the most widely used of all function approximation methods and are particularly well-suited to reinforcement learning.
In gradient-descent methods, the parameter vector is a column vector with a fixed number of real-valued components, $\vec{\theta}_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$ (the $T$ here denotes transpose), and $V_t(s)$ is a smooth differentiable function of $\vec{\theta}_t$ for all $s \in \mathcal{S}$. For now, let us assume that on each step, $t$, we observe a new example $s_t \mapsto V^\pi(s_t)$. These states might be successive states from an interaction with the environment, but for now we do not assume so. Even though we are given the exact, correct values, $V^\pi(s_t)$, for each $s_t$, there is still a difficult problem because our function approximator has limited resources and thus limited resolution. In particular, there is generally no $\vec{\theta}$ that gets all the states, or even all the examples, exactly correct. In addition, we must generalize to all the other states that have not appeared in examples.
We assume that states appear in examples with the same distribution, P, over which we are trying to minimize the MSE as given by (8.1). A good strategy in this case is to try to minimize error on the observed examples. Gradient-descent methods do this by adjusting the parameter vector after each example by a small amount in the direction that would most reduce the error on that example:
$$\vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}_t} \bigl[ V^\pi(s_t) - V_t(s_t) \bigr]^2 = \vec{\theta}_t + \alpha \bigl[ V^\pi(s_t) - V_t(s_t) \bigr] \nabla_{\vec{\theta}_t} V_t(s_t), \tag{8.2}$$

where $\alpha$ is a positive step-size parameter, and $\nabla_{\vec{\theta}_t} f(\vec{\theta}_t)$, for any function $f$, denotes the vector of partial derivatives, $\bigl( \tfrac{\partial f(\vec{\theta}_t)}{\partial \theta_t(1)}, \tfrac{\partial f(\vec{\theta}_t)}{\partial \theta_t(2)}, \ldots, \tfrac{\partial f(\vec{\theta}_t)}{\partial \theta_t(n)} \bigr)^T$. This derivative vector is the gradient of $f$ with respect to $\vec{\theta}_t$.
This kind of method is called gradient descent because the overall step in $\vec{\theta}_t$ is proportional to the negative gradient of the example's squared error. This is the direction in which the error falls most rapidly.
It may not be immediately apparent why only a small step is taken in the direction of the gradient. Could we not move all the way in this direction and completely eliminate the error on the example? In many cases this could be done, but usually it is not desirable. Remember that we do not seek or expect to find a value function that has zero error on all states, but only an approximation that balances the errors in different states. If we completely corrected each example in one step, then we would not find such a balance. In fact, the convergence results for gradient methods assume that the step-size parameter decreases over time. If it decreases in such a way as to satisfy the standard stochastic-approximation conditions (2.7), then the gradient-descent method (8.2) is guaranteed to converge to a local optimum.
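For reference, the stochastic-approximation conditions referred to as (2.7) constrain the sequence of step sizes $\{\alpha_t\}$; assuming the usual form from Chapter 2, they require

$$\sum_{t=1}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty .$$

The first condition ensures the steps are large enough to eventually overcome any initial conditions or random fluctuations; the second ensures they eventually become small enough to assure convergence. A step size such as $\alpha_t = 1/t$ satisfies both.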
We turn now to the case in which the output, $v_t$, of the $t$th training example, $s_t \mapsto v_t$, is not the true value, $V^\pi(s_t)$, but some approximation of it. For example, $v_t$ might be a noise-corrupted version of $V^\pi(s_t)$, or it might be one of the backed-up values mentioned in the previous section. In such cases we cannot perform the exact update (8.2) because $V^\pi(s_t)$ is unknown, but we can approximate it by substituting $v_t$ in place of $V^\pi(s_t)$. This yields the general gradient-descent method for state-value prediction:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ v_t - V_t(s_t) \bigr] \nabla_{\vec{\theta}_t} V_t(s_t). \tag{8.3}$$
If $v_t$ is an unbiased estimate, i.e., if $E\{v_t\} = V^\pi(s_t)$ for each $t$, then $\vec{\theta}_t$ is guaranteed to converge to a local optimum under the usual stochastic-approximation conditions (2.7) for decreasing the step-size parameter, $\alpha$.
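As a concrete illustration (not taken from the text), here is a minimal sketch of update (8.3) for a value function that is linear in a feature vector, so that $\nabla_{\vec{\theta}} V(s) = \vec{\phi}(s)$. The names `theta`, `phi_s`, `target`, and `alpha` are illustrative choices, not notation from the book:

```python
import numpy as np

def gradient_update(theta, phi_s, target, alpha):
    """One step of the general gradient-descent update (8.3).

    theta  : parameter vector (numpy array of shape (n,))
    phi_s  : feature vector of the observed state s_t; for a linear
             approximator V(s) = theta . phi(s), this is also the gradient
    target : the training target v_t (e.g., a return or a backed-up value)
    alpha  : positive step-size parameter
    """
    v_estimate = np.dot(theta, phi_s)        # V_t(s_t)
    error = target - v_estimate              # v_t - V_t(s_t)
    return theta + alpha * error * phi_s     # theta + alpha * error * grad V
```

With `target` set to the true value $V^\pi(s_t)$ this is exactly (8.2); with any other estimate it is the general form (8.3).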
For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy $\pi$. Let $R_t$ denote the return following each state, $s_t$. Because the true value of a state is the expected value of the return following it, the Monte Carlo target $v_t = R_t$ is by definition an unbiased estimate of $V^\pi(s_t)$. With this choice, the general gradient-descent method (8.3) converges to a locally optimal approximation to $V^\pi(s_t)$. Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution.
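A sketch of how this Monte Carlo version might look with a linear approximator, assuming an episode has been recorded as lists of feature vectors and rewards (the function name and the episode format are assumptions made for illustration):

```python
import numpy as np

def mc_gradient_prediction(theta, episode_features, episode_rewards,
                           alpha=0.01, gamma=1.0):
    """Gradient-descent Monte Carlo prediction: apply (8.3) with v_t = R_t.

    episode_features : list of feature vectors phi(s_0), ..., phi(s_{T-1})
    episode_rewards  : list of rewards r_1, ..., r_T (reward following each state)
    """
    T = len(episode_rewards)
    G = 0.0
    returns = [0.0] * T
    # Compute the return R_t following each state, working backward from the end.
    for t in reversed(range(T)):
        G = episode_rewards[t] + gamma * G
        returns[t] = G
    # Apply the gradient-descent update (8.3) with the Monte Carlo target.
    for t in range(T):
        phi = episode_features[t]
        error = returns[t] - np.dot(theta, phi)   # R_t - V(s_t)
        theta = theta + alpha * error * phi
    return theta
```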
Similarly, we can use $n$-step TD returns and their averages for $v_t$. For example, the gradient-descent form of TD($\lambda$) uses the $\lambda$-return, $v_t = R_t^\lambda$, as its approximation to $V^\pi(s_t)$, yielding the forward-view update:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ R_t^\lambda - V_t(s_t) \bigr] \nabla_{\vec{\theta}_t} V_t(s_t). \tag{8.4}$$
Unfortunately, for $\lambda < 1$, $R_t^\lambda$ is not an unbiased estimate of $V^\pi(s_t)$, and thus this method does not converge to a local optimum. The situation is the same when DP targets are used, such as $v_t = E_\pi\bigl\{ r_{t+1} + \gamma V_t(s_{t+1}) \mid s_t \bigr\}$. Nevertheless, such bootstrapping methods can be quite effective, and other performance guarantees are available for important special cases, as we discuss later in this chapter. For now we emphasize the relationship of these methods to the general gradient-descent form (8.3). Although increments as in (8.4) are not themselves gradients, it is useful to view this method as a gradient-descent method (8.3) with a bootstrapping approximation in place of the desired output, $v_t$.
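Here is a sketch of the forward view, again assuming a linear approximator and a recorded episode; the recursive form of the $\lambda$-return used below is one standard way to compute it for an episodic task, and the parameter names are illustrative:

```python
import numpy as np

def forward_view_td_lambda(theta, episode_features, episode_rewards,
                           alpha=0.01, gamma=1.0, lam=0.9):
    """Forward-view TD(lambda): update (8.4) with the lambda-return as target.

    Uses the recursion
        R_t^lambda = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * R_{t+1}^lambda),
    with V(terminal) = 0 and the lambda-return beyond the end of the episode taken as 0.
    """
    T = len(episode_rewards)
    values = [np.dot(theta, phi) for phi in episode_features]   # V(s_t) during the episode
    lambda_returns = [0.0] * T
    next_value = 0.0           # V(s_{t+1}); zero at the terminal state
    next_lambda_return = 0.0   # R_{t+1}^lambda; zero beyond the end of the episode
    for t in reversed(range(T)):
        lambda_returns[t] = episode_rewards[t] + gamma * (
            (1.0 - lam) * next_value + lam * next_lambda_return)
        next_value = values[t]
        next_lambda_return = lambda_returns[t]
    # Apply (8.4) for each state visited in the episode (an offline-style sweep).
    for t in range(T):
        phi = episode_features[t]
        error = lambda_returns[t] - np.dot(theta, phi)   # R_t^lambda - V(s_t)
        theta = theta + alpha * error * phi
    return theta
```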
Just as (8.4) provides the forward view of gradient-descent TD($\lambda$), the backward view is provided by

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t, \tag{8.5}$$

where $\delta_t$ is the usual TD error,

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \tag{8.6}$$

and $\vec{e}_t$ is a column vector of eligibility traces, one for each component of $\vec{\theta}_t$, updated by

$$\vec{e}_t = \gamma \lambda \vec{e}_{t-1} + \nabla_{\vec{\theta}_t} V_t(s_t), \tag{8.7}$$

with $\vec{e}_0 = \vec{0}$. A complete algorithm for online gradient-descent TD($\lambda$) is given in Figure 8.1.
Initialize $\vec{\theta}$ arbitrarily and $\vec{e} = \vec{0}$
Repeat (for each episode):
    $s \leftarrow$ initial state of episode
    Repeat (for each step of episode):
        $a \leftarrow$ action given by $\pi$ for $s$
        Take action $a$, observe reward, $r$, and next state, $s'$
        $\delta \leftarrow r + \gamma V(s') - V(s)$
        $\vec{e} \leftarrow \gamma \lambda \vec{e} + \nabla_{\vec{\theta}} V(s)$
        $\vec{\theta} \leftarrow \vec{\theta} + \alpha \delta \vec{e}$
        $s \leftarrow s'$
    until $s$ is terminal

Figure 8.1: Online gradient-descent TD($\lambda$) for estimating $V^\pi$.
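A runnable sketch of this algorithm for the linear case, where $\nabla_{\vec{\theta}} V(s) = \vec{\phi}(s)$, is given below. The `env` and `policy` interfaces (a reset/step environment and a state-to-action function) are assumptions for illustration, not part of the text:

```python
import numpy as np

def online_td_lambda(env, policy, feature_fn, num_features,
                     num_episodes=100, alpha=0.01, gamma=1.0, lam=0.9):
    """Online gradient-descent TD(lambda) (Figure 8.1) with linear features.

    Assumed interfaces (illustrative):
      env.reset() -> initial state;  env.step(a) -> (reward, next_state, done)
      policy(state) -> action
      feature_fn(state) -> numpy feature vector phi(s) of length num_features
    """
    theta = np.zeros(num_features)        # initialize theta arbitrarily (here: zeros)
    for _ in range(num_episodes):
        e = np.zeros(num_features)        # eligibility traces, e = 0
        s = env.reset()
        done = False
        while not done:                   # repeat for each step of the episode
            a = policy(s)
            r, s_next, done = env.step(a)
            phi = feature_fn(s)
            v = np.dot(theta, phi)                                       # V(s)
            v_next = 0.0 if done else np.dot(theta, feature_fn(s_next))  # V(s') = 0 if terminal
            delta = r + gamma * v_next - v                               # TD error (8.6)
            e = gamma * lam * e + phi                                    # trace update (8.7)
            theta = theta + alpha * delta * e                            # parameter update (8.5)
            s = s_next
    return theta
```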
Two forms of gradient-based function approximators have been used widely in reinforcement learning. One is multi-layer artificial neural networks using the error backpropagation algorithm. This maps immediately onto the equations and algorithms just given, where the backpropagation process is the way of computing the gradients. The second popular form is the linear form, which we discuss extensively in the next section.
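To make the connection concrete, here is a minimal sketch (an illustration, not the book's algorithm) of a one-hidden-layer network whose gradient with respect to its parameters is computed by backpropagation and then used in the general update (8.3); all class and variable names are assumptions made for this example:

```python
import numpy as np

class MLPValue:
    """Tiny one-hidden-layer value approximator: V(s) = w2 . tanh(W1 s + b1) + b2."""

    def __init__(self, state_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(hidden_dim, state_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.1, size=hidden_dim)
        self.b2 = 0.0

    def value_and_grads(self, s):
        """Forward pass plus backpropagation of dV/d(parameters)."""
        h = np.tanh(self.W1 @ s + self.b1)     # hidden activations
        v = self.w2 @ h + self.b2              # scalar value estimate V(s)
        # Backpropagate the gradient of V itself (not of a loss) through the network.
        dh = self.w2 * (1.0 - h ** 2)          # dV/d(hidden pre-activations)
        grads = {"W1": np.outer(dh, s), "b1": dh, "w2": h, "b2": 1.0}
        return v, grads

    def gradient_step(self, s, target, alpha):
        """General gradient-descent update (8.3) applied to every parameter."""
        v, grads = self.value_and_grads(s)
        error = target - v                      # v_t - V_t(s_t)
        self.W1 += alpha * error * grads["W1"]
        self.b1 += alpha * error * grads["b1"]
        self.w2 += alpha * error * grads["w2"]
        self.b2 += alpha * error * grads["b2"]
```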
Exercise .
Show that table-lookup TD($\lambda$) is a special case of general TD($\lambda$) as given by equations (8.5)-(8.7).
Exercise .
State aggregation is a simple form of generalizing function approximator in which states are grouped together, with one table entry (value estimate) used for each group. Whenever a state in a group is encountered, the group's entry is used to determine the state's value, and when the state is updated, the group's entry is updated. Show that this kind of state aggregation is a special case of a gradient method such as (8.4).
Exercise .
The equations given in this section are for the online version of gradient-descent TD($\lambda$). What are the equations for the offline version? Give a complete description specifying the new approximate value function at the end of an episode, $V'$, in terms of the approximate value function used during the episode, $V$. Start by modifying a forward-view equation for TD($\lambda$), such as (8.4).
Exercise .
For offline updating, show that equations (8.5)-(8.7) produce updates identical to (8.4).