In this section we describe how to extend the actor-critic methods
introduced in Section 6.6
to use eligibility traces.
This is fairly straightforward.
The critic part of an actor-critic method is simply on-policy
learning of $V^\pi$. The TD($\lambda$) algorithm can be used for that, with one
eligibility trace for each state. The actor part needs to use an eligibility trace
for each state-action pair. Thus, an actor-critic method needs two sets of traces,
one for each state and one for each state-action pair.
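As a concrete illustration (our own sketch, not from the text), the critic's side can be implemented in tabular form with one accumulating trace per state; the names V, e_state, alpha, gamma, and lam below are hypothetical choices.

    import numpy as np

    n_states = 10                        # hypothetical problem size
    alpha, gamma, lam = 0.1, 0.99, 0.9   # critic step size, discount, trace decay

    V = np.zeros(n_states)               # critic's estimate of V(s)
    e_state = np.zeros(n_states)         # one eligibility trace per state

    def critic_update(s, r, s_next):
        """One backward-view TD(lambda) step for the critic; returns the TD error."""
        delta = r + gamma * V[s_next] - V[s]   # TD error, as in (7.6)
        e_state[:] *= gamma * lam              # decay every state's trace
        e_state[s] += 1.0                      # accumulate trace for the current state
        V[:] += alpha * delta * e_state        # credit all recently visited states
        return delta                           # the actor reuses this TD error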
Recall that the one-step actor-critic method updates the actor by
$$p_{t+1}(s_t,a_t) = p_t(s_t,a_t) + \beta\delta_t,$$
where $\delta_t$ is the TD error (7.6), $\beta$ is the actor's positive step-size parameter, and
$p_t(s,a)$ is the preference for taking action $a$ at time $t$
if in state $s$. The preferences determine the policy via, for example, a
softmax method (Section 2.3). We generalize the
above equation to use eligibility traces as follows:
$$p_{t+1}(s,a) = p_t(s,a) + \beta\delta_t e_t(s,a), \qquad \text{for all } s,a, \tag{7.11}$$
where $e_t(s,a)$ denotes the trace at time $t$ for state-action pair
$s,a$. For the simplest case mentioned above, the trace can be
updated just as in Sarsa($\lambda$):
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s=s_t \text{ and } a=a_t; \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise.} \end{cases}$$
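Continuing the sketch above (again our own code, with hypothetical names p, e_sa, and beta for the actor's step size), update (7.11) with the Sarsa($\lambda$)-style trace and a softmax policy over the preferences might look like this:

    n_actions = 4                            # hypothetical number of actions
    beta = 0.1                               # actor step-size parameter

    p = np.zeros((n_states, n_actions))      # preferences p(s, a)
    e_sa = np.zeros((n_states, n_actions))   # one trace per state-action pair

    def policy(s):
        """Softmax policy over the preferences for state s (Section 2.3)."""
        prefs = p[s] - p[s].max()            # shift for numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    def actor_update(s, a, delta):
        """Apply update (7.11) with Sarsa(lambda)-style state-action traces."""
        e_sa[:] *= gamma * lam               # decay every state-action trace
        e_sa[s, a] += 1.0                    # increment the trace of the pair just taken
        p[:] += beta * delta * e_sa          # nudge all credited preferences by beta*delta

A full step of the combined method would then choose $a$ with probability policy(s)[a], observe $r$ and $s'$, call critic_update(s, r, s_next) to obtain delta, and pass that delta to actor_update(s, a, delta).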
In Section 6.6 we also discussed a more sophisticated actor-critic method that uses the update
$$p_{t+1}(s_t,a_t) = p_t(s_t,a_t) + \beta\delta_t\bigl[1 - \pi_t(s_t,a_t)\bigr].$$
To generalize this equation to eligibility traces we can use the
same update (7.11) with a slightly different
trace. Rather than incrementing the trace by 1 each time a
state-action pair occurs, it is updated by $1 - \pi_t(s_t,a_t)$:
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s=s_t \text{ and } a=a_t; \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise,} \end{cases}$$
for all $s,a$.
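For this more sophisticated variant, the only change in the sketch is the trace increment, which becomes $1 - \pi_t(s_t,a_t)$ rather than 1 (assuming the same hypothetical names as above):

    def actor_update_soft(s, a, delta):
        """Update (7.11) with the trace incremented by 1 - pi_t(s_t, a_t)."""
        pi_sa = policy(s)[a]                 # probability of the action actually taken
        e_sa[:] *= gamma * lam               # decay every state-action trace
        e_sa[s, a] += 1.0 - pi_sa            # smaller increment for already-likely actions
        p[:] += beta * delta * e_sa          # same preference update as before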