We now extend value prediction methods using function approximation to control methods, following the pattern of GPI. First we extend the state-value prediction methods to action-value prediction methods, then we combine them with policy improvement and action selection techniques. As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.

The extension to action-value prediction is straightforward. In this case it is
the action-value function, , that is represented as a
parameterized functional form with parameter vector . Whereas before we
considered training examples of the form , now we consider
examples of the form . The target output,
, can be any approximation of , including the usual
backed-up values such as the full Monte Carlo return,
, or the one-step Sarsa-style return, .
The general gradient-descent update for action-value prediction is

For example, the backward view of the action-value method analogous to TD() is

where

and

with . We call this method

To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Suitable techniques applicable to continuous actions, or to actions from large discrete sets, are a topic of ongoing research with as yet no clear resolution. On the other hand, if the action set is discrete and not too large, then we can use the techniques already developed in previous chapters. That is, for each possible action, , available in the current state, , we can compute and then find the greedy action . Policy improvement is done by changing the estimation policy to the greedy policy (in off-policy methods) or to a soft approximation of the greedy policy such as the -greedy policy (in on-policy methods). Actions are selected according to this same policy in on-policy methods, or by an arbitrary policy in off-policy methods.

Figures 8.8 and 8.9 show examples of on-policy (Sarsa()) and off-policy (Watkins's Q()) control methods using function approximation. Both methods use linear, gradient-descent function approximation with binary features, such as in tile coding and Kanerva coding. Both methods use an -greedy policy for action selection, and the Sarsa method uses it for GPI as well. Both compute the sets of present features, , corresponding to the current state and all possible actions, . If the value function for each action is a separate linear function of the same features (a common case), then the indices of the for each action are essentially the same, simplifying the computation significantly.

All the methods we have discussed above have used *accumulating*
eligibility traces. Although replacing traces (Section 7.8) are known to have
advantages in tabular methods, replacing traces do not directly extend to the use of
function approximation. Recall that the idea of replacing traces is to reset a
state's trace to each time it is visited instead of incrementing it by .
But with function approximation there is no single trace corresponding to
a state, just a trace for each component of , which corresponds
to many states. One approach that seems to work well for linear,
gradient-descent function approximation methods with binary features is to treat the
features as if they were states for the purposes of replacing traces. That is,
each time a state is encountered that has feature , the trace for
feature is set to rather than being incremented by , as it would be with
accumulating traces.

When working with state-action traces, it may also be useful to clear (set to zero) the traces of all nonselected actions in the states encountered (see Section 7.8). This idea can also be extended to the case of linear function approximation with binary features. For each state encountered, we first clear the traces of all features for the state and the actions not selected, then we set to the traces of the features for the state and the action that was selected. As we noted for the tabular case, this may or may not be the best way to proceed when using replacing traces. A procedural specification of both kinds of traces, including the optional clearing for nonselected actions, is given for the Sarsa algorithm in Figure 8.8.

The reward in this problem is on all time steps until the car moves
past its goal position at the top of the mountain, which ends the episode.
There are three possible actions: full throttle forward (), full
throttle reverse (), and zero throttle (). The car moves according to
a simplified physics. Its position,
, and velocity, , are updated by

where the operation enforces and . When reached the left bound, was reset to zero. When it reached the right bound, the goal was reached and the episode was terminated. Each episode started from a random position and velocity uniformly chosen from these ranges. To convert the two continuous state variables to binary features, we used gridtilings as in Figure 8.5. We used ten tilings, each offset by a random fraction of a tile width.

The Sarsa algorithm in Figure
8.8 (using replace
traces and the optional clearing) readily solved this task, learning
a near optimal policy within 100 episodes. Figure
8.10
shows the negative of the value function (the *cost-to-go* function)
learned on one run, using the parameters
, , and (). The initial action values were
all zero, which was optimistic (all true values are negative in this task),
causing extensive exploration to occur even though the exploration parameter, ,
was . This can be seen in the middle-top panel of the figure, labeled "Step
428." At this time not even one episode had been completed, but the car has
oscillated back and forth in the valley, following circular trajectories in state
space. All the states visited frequently are valued worse than
unexplored states, because the actual rewards have been worse than what was
(unrealistically) expected. This continually drives the agent away from wherever
it has been, to explore new states, until a solution is found.
Figure
8.11 shows the results of a detailed study of the effect
of the parameters and , and of the kind of traces, on the rate of learning
on this task.