In this section we describe how to extend the actor-critic methods
introduced in Section 6.6 to use eligibility traces.
This is fairly straightforward.
The critic part of an actor-critic method is simply on-policy
learning of $V^\pi$. The TD($\lambda$) algorithm can be used for that, with one
eligibility trace for each state. The actor part needs to use an eligibility trace
for each state-action pair. Thus, an actor-critic method needs two sets of traces,
one for each state and one for each state-action pair.
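For concreteness, here is a minimal tabular sketch of the critic's side, using the backward view of TD($\lambda$) with one accumulating trace per state. The names (V, e_v, critic_step) and the constants are illustrative assumptions, not from the text; the function returns the TD error $\delta_t$ that the actor will consume.

    # Minimal sketch of the critic: backward-view TD(lambda) over a tabular
    # state space with integer-indexed states.  All names are illustrative.
    import numpy as np

    n_states = 10
    V = np.zeros(n_states)        # state-value estimates
    e_v = np.zeros(n_states)      # one eligibility trace per state
    alpha, gamma, lam = 0.1, 0.9, 0.8

    def critic_step(s, r, s_next):
        """One TD(lambda) update of the critic; returns the TD error delta_t."""
        delta = r + gamma * V[s_next] - V[s]
        e_v[s] += 1.0                  # accumulating trace for the visited state
        V[:] += alpha * delta * e_v    # update every state in proportion to its trace
        e_v[:] *= gamma * lam          # decay all traces
        return delta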
Recall that the one-step actor-critic method updates the
actor by

$$ p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\delta_t & \text{if } s=s_t \text{ and } a=a_t; \\ p_t(s,a) & \text{otherwise,} \end{cases} $$

where $\delta_t$ is the TD error of (7.6), $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$, and
$p_t(s,a)$ is the preference for taking action $a$ at time $t$
if in state $s$. The preferences determine the policy via, for example, a
softmax method (Section 2.3). We generalize the
above equation to use eligibility traces as follows:
$$ p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t e_t(s,a), \qquad \text{for all } s, a, \tag{7.14} $$

where $e_t(s,a)$ denotes the trace at time $t$ for state-action pair
$(s,a)$. For the simplest case mentioned above, the trace can be
updated as in Sarsa($\lambda$).
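As a sketch of this simplest case, built on the same assumed tabular setup as the critic sketch above (the names p, e_p, softmax_policy, and actor_step are illustrative), the trace of the visited state-action pair is incremented by 1 exactly as in Sarsa($\lambda$), and every preference then moves along its own trace by $\alpha\delta_t$:

    # Sketch of update (7.14) with Sarsa(lambda)-style accumulating traces
    # over state-action pairs.  All names are illustrative assumptions.
    import numpy as np

    n_states, n_actions = 10, 4
    p = np.zeros((n_states, n_actions))    # action preferences
    e_p = np.zeros((n_states, n_actions))  # one trace per state-action pair
    alpha, gamma, lam = 0.1, 0.9, 0.8

    def softmax_policy(s):
        """Action probabilities from the preferences, via a softmax."""
        z = np.exp(p[s] - p[s].max())
        return z / z.sum()

    def actor_step(s, a, delta):
        """Apply (7.14): bump the visited pair's trace by 1, move every
        preference along its trace, then decay all traces."""
        e_p[s, a] += 1.0
        p[:] += alpha * delta * e_p
        e_p[:] *= gamma * lam

On each step the agent would draw an action with, for example, np.random.choice(n_actions, p=softmax_policy(s)), call the critic to obtain $\delta_t$, and then call actor_step.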
In Section 6.6 we also
discussed a more sophisticated actor-critic method that uses the
update

$$ p_{t+1}(s_t,a_t) = p_t(s_t,a_t) + \alpha\delta_t \bigl[ 1 - \pi_t(s_t,a_t) \bigr]. $$

To generalize this equation to eligibility traces we can use the
same update (7.14) with a slightly different
trace. Rather than incrementing the trace by 1 each time a
state-action pair occurs, it is updated by $1 - \pi_t(s_t,a_t)$:

$$ e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s=s_t \text{ and } a=a_t; \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise,} \end{cases} \tag{7.15} $$

for all $s$, $a$.
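The corresponding change to the actor sketch above is confined to the size of the trace increment. Here pi_probs is assumed to hold the current softmax probabilities $\pi_t(s_t,\cdot)$, and actor_step_1_minus_pi is an illustrative name:

    # Sketch of the actor step using the trace of (7.15): the visited pair's
    # trace is incremented by 1 - pi_t(s_t, a_t) rather than by 1.
    import numpy as np

    n_states, n_actions = 10, 4
    p = np.zeros((n_states, n_actions))    # action preferences
    e_p = np.zeros((n_states, n_actions))  # traces for state-action pairs
    alpha, gamma, lam = 0.1, 0.9, 0.8

    def actor_step_1_minus_pi(s, a, delta, pi_probs):
        """Update (7.14) with the (7.15) trace; pi_probs are the current
        action probabilities in state s (e.g. from a softmax)."""
        e_p[s, a] += 1.0 - pi_probs[a]   # less additional credit for already-likely actions
        p[:] += alpha * delta * e_p
        e_p[:] *= gamma * lam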