In this section we describe how to extend the actor-critic methods
introduced in Section 6.6
to use eligibility traces.
This is fairly straightforward.
The critic part of an actor-critic method is simply on-policy
learning of $V^\pi$. The TD($\lambda$) algorithm can be used for that, with one
eligibility trace for each state. The actor part needs to use an eligibility trace
for each state-action pair. Thus, an actor-critic method needs two sets of traces,
one for each state and one for each state-action pair.
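As a concrete illustration (our own sketch, not from the text), the critic's side can be implemented in tabular form with one accumulating trace per state; the names V, e_state, alpha, gamma, and lam below are hypothetical choices.

    import numpy as np

    n_states = 10                        # hypothetical problem size
    alpha, gamma, lam = 0.1, 0.99, 0.9   # critic step size, discount, trace decay

    V = np.zeros(n_states)               # critic's estimate of V(s)
    e_state = np.zeros(n_states)         # one eligibility trace per state

    def critic_update(s, r, s_next):
        """One backward-view TD(lambda) step for the critic; returns the TD error."""
        delta = r + gamma * V[s_next] - V[s]   # TD error, as in (7.6)
        e_state[:] *= gamma * lam              # decay every state's trace
        e_state[s] += 1.0                      # accumulate trace for the current state
        V[:] += alpha * delta * e_state        # credit all recently visited states
        return delta                           # the actor reuses this TD error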
Recall that the one-step actor-critic method updates the actor by
$$p_{t+1}(s_t,a_t) = p_t(s_t,a_t) + \beta\delta_t,$$
where $\delta_t$ is the TD error (7.6), $\beta$ is the actor's positive step-size parameter, and
$p_t(s,a)$ is the preference for taking action $a$ at time $t$
if in state $s$. The preferences determine the policy via, for example, a
softmax method (Section 2.3). We generalize the
above equation to use eligibility traces as follows:
$$p_{t+1}(s,a) = p_t(s,a) + \beta\delta_t e_t(s,a), \qquad \text{for all } s,a, \tag{7.11}$$
where $e_t(s,a)$ denotes the trace at time $t$ for state-action pair
$s,a$. For the simplest case mentioned above, the trace can be
updated just as in Sarsa($\lambda$):
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s=s_t \text{ and } a=a_t; \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise.} \end{cases}$$
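Continuing the sketch above (again our own code, with hypothetical names p, e_sa, and beta for the actor's step size), update (7.11) with the Sarsa($\lambda$)-style trace and a softmax policy over the preferences might look like this:

    n_actions = 4                            # hypothetical number of actions
    beta = 0.1                               # actor step-size parameter

    p = np.zeros((n_states, n_actions))      # preferences p(s, a)
    e_sa = np.zeros((n_states, n_actions))   # one trace per state-action pair

    def policy(s):
        """Softmax policy over the preferences for state s (Section 2.3)."""
        prefs = p[s] - p[s].max()            # shift for numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    def actor_update(s, a, delta):
        """Apply update (7.11) with Sarsa(lambda)-style state-action traces."""
        e_sa[:] *= gamma * lam               # decay every state-action trace
        e_sa[s, a] += 1.0                    # increment the trace of the pair just taken
        p[:] += beta * delta * e_sa          # nudge all credited preferences by beta*delta

A full step of the combined method would then choose $a$ with probability policy(s)[a], observe $r$ and $s'$, call critic_update(s, r, s_next) to obtain delta, and pass that delta to actor_update(s, a, delta).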
In Section 6.6 we also discussed a more sophisticated actor-critic method that uses the update
$$p_{t+1}(s_t,a_t) = p_t(s_t,a_t) + \beta\delta_t\bigl[1 - \pi_t(s_t,a_t)\bigr].$$
To generalize this equation to eligibility traces we can use the
same update (7.11) with a slightly different
trace. Rather than incrementing the trace by 1 each time a
state-action pair occurs, it is updated by $1 - \pi_t(s_t,a_t)$:
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s=s_t \text{ and } a=a_t; \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise,} \end{cases}$$
for all $s,a$.
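For this more sophisticated variant, the only change in the sketch is the trace increment, which becomes $1 - \pi_t(s_t,a_t)$ rather than 1 (assuming the same hypothetical names as above):

    def actor_update_soft(s, a, delta):
        """Update (7.11) with the trace incremented by 1 - pi_t(s_t, a_t)."""
        pi_sa = policy(s)[a]                 # probability of the action actually taken
        e_sa[:] *= gamma * lam               # decay every state-action trace
        e_sa[s, a] += 1.0 - pi_sa            # smaller increment for already-likely actions
        p[:] += beta * delta * e_sa          # same preference update as before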