In this section we describe how to extend the actor-critic methods introduced in Section 6.6 to use eligibility traces. This is fairly straightforward. The critic part of an actor-critic method is simply on-policy learning of $V^\pi$. The TD($\lambda$) algorithm can be used for that, with one eligibility trace for each state. The actor part needs to use an eligibility trace for each state-action pair. Thus, an actor-critic method needs two sets of traces, one for each state and one for each state-action pair.
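As a concrete illustration of these two sets of traces, here is a minimal tabular sketch in Python; the array layout, the variable names, and the use of NumPy are our own assumptions for illustration, not part of the text:

```python
import numpy as np

n_states, n_actions = 10, 4        # assumed small tabular problem

# Critic: state values plus one eligibility trace per state, as in TD(lambda)
V = np.zeros(n_states)
e_state = np.zeros(n_states)

# Actor: action preferences plus one eligibility trace per state-action pair
p = np.zeros((n_states, n_actions))
e_state_action = np.zeros((n_states, n_actions))

def softmax_policy(prefs):
    """Policy probabilities derived from action preferences via softmax (Section 2.3)."""
    z = np.exp(prefs - prefs.max())
    return z / z.sum()
```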
Recall that the one-step actor-critic method updates the actor by
$$p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + \beta \delta_t ,$$
where $\delta_t$ is the TD error (7.6), and $p_t(s, a)$ is the preference for taking action $a$ at time $t$ if in state $s$. The preferences determine the policy via, for example, a softmax method (Section 2.3). We generalize the above equation to use eligibility traces as follows:
$$p_{t+1}(s, a) = p_t(s, a) + \beta \delta_t e_t(s, a), \qquad \text{for all } s, a, \tag{7.11}$$
where $e_t(s, a)$ denotes the trace at time $t$ for state-action pair $s, a$. For the simplest case mentioned above, the trace can be updated just as in Sarsa($\lambda$).
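Putting the pieces together, a minimal per-step update for this simple case might look as follows, continuing the tabular sketch above. The function name, the step-size values, the use of accumulating traces, and the omission of terminal-state handling are all our own simplifying assumptions:

```python
def actor_critic_lambda_step(s, a, r, s_next,
                             alpha=0.1, beta=0.1, gamma=1.0, lam=0.9):
    """One time step of tabular actor-critic with eligibility traces."""
    # TD error from the critic (7.6); terminal-state handling omitted
    delta = r + gamma * V[s_next] - V[s]

    # Critic: TD(lambda) with one accumulating trace per state
    e_state[s] += 1.0
    V[:] += alpha * delta * e_state
    e_state[:] *= gamma * lam

    # Actor: preference update (7.11), with the state-action trace
    # handled just as in Sarsa(lambda)
    e_state_action[s, a] += 1.0
    p[:, :] += beta * delta * e_state_action
    e_state_action[:, :] *= gamma * lam
```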
In Section 6.6 we also discussed a more sophisticated actor-critic method that uses the update
$$p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + \beta \delta_t \big[ 1 - \pi_t(s_t, a_t) \big].$$
To generalize this equation to eligibility traces we can use the same update (7.11) with a slightly different trace. Rather than incrementing the trace by 1 each time a state-action pair occurs, it is updated by $1 - \pi_t(s_t, a_t)$:
$$e_t(s, a) = \begin{cases} \gamma \lambda e_{t-1}(s, a) + 1 - \pi_t(s_t, a_t) & \text{if } s = s_t \text{ and } a = a_t; \\ \gamma \lambda e_{t-1}(s, a) & \text{otherwise,} \end{cases} \qquad \text{for all } s, a.$$
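In code, the only change from the sketch above is the increment applied to the actor's trace for the current state-action pair; the rest of the update is unchanged. Again, the function name and parameter values are assumptions for illustration:

```python
def actor_step_modified_trace(s, a, delta, beta=0.1, gamma=1.0, lam=0.9):
    """Actor update for the more sophisticated variant: the current pair's
    trace is incremented by 1 - pi_t(s_t, a_t) rather than by 1."""
    pi = softmax_policy(p[s])                  # current policy in state s
    e_state_action[s, a] += 1.0 - pi[a]        # modified trace increment
    p[:, :] += beta * delta * e_state_action   # same preference update (7.11)
    e_state_action[:, :] *= gamma * lam        # decay all traces
```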