In this section we describe how to extend the actor-critic methods
introduced in Section 6.6 to use eligibility traces.
This is fairly straightforward.
The critic part of an actor-critic method is simply on-policy
learning of $V^\pi$. The TD($\lambda$) algorithm can be used for that, with one
eligibility trace for each state. The actor part needs to use an eligibility trace
for each state-action pair. Thus, an actor-critic method needs two sets of traces,
one for each state and one for each state-action pair.
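For concreteness, here is a minimal tabular sketch of the critic's side, using the backward view of TD($\lambda$) with one accumulating trace per state. The names (V, e_v, critic_step) and the constants are illustrative assumptions, not from the text; the function returns the TD error $\delta_t$ that the actor will consume.

    # Minimal sketch of the critic: backward-view TD(lambda) over a tabular
    # state space with integer-indexed states.  All names are illustrative.
    import numpy as np

    n_states = 10
    V = np.zeros(n_states)        # state-value estimates
    e_v = np.zeros(n_states)      # one eligibility trace per state
    alpha, gamma, lam = 0.1, 0.9, 0.8

    def critic_step(s, r, s_next):
        """One TD(lambda) update of the critic; returns the TD error delta_t."""
        delta = r + gamma * V[s_next] - V[s]
        e_v[s] += 1.0                  # accumulating trace for the visited state
        V[:] += alpha * delta * e_v    # update every state in proportion to its trace
        e_v[:] *= gamma * lam          # decay all traces
        return delta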
Recall that the one-step actor-critic method updates the
actor by

$$ p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\delta_t & \text{if } s=s_t \text{ and } a=a_t; \\ p_t(s,a) & \text{otherwise,} \end{cases} $$

where $\delta_t$ is the TD error of (7.6), $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$, and
$p_t(s,a)$ is the preference for taking action $a$ at time $t$
if in state $s$. The preferences determine the policy via, for example, a
softmax method (Section 2.3). We generalize the
above equation to use eligibility traces as follows:
$$ p_{t+1}(s,a) = p_t(s,a) + \alpha\delta_t e_t(s,a), \qquad \text{for all } s, a, \tag{7.14} $$

where $e_t(s,a)$ denotes the trace at time $t$ for state-action pair
$(s,a)$. For the simplest case mentioned above, the trace can be
updated as in Sarsa($\lambda$).
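As a sketch of this simplest case, built on the same assumed tabular setup as the critic sketch above (the names p, e_p, softmax_policy, and actor_step are illustrative), the trace of the visited state-action pair is incremented by 1 exactly as in Sarsa($\lambda$), and every preference then moves along its own trace by $\alpha\delta_t$:

    # Sketch of update (7.14) with Sarsa(lambda)-style accumulating traces
    # over state-action pairs.  All names are illustrative assumptions.
    import numpy as np

    n_states, n_actions = 10, 4
    p = np.zeros((n_states, n_actions))    # action preferences
    e_p = np.zeros((n_states, n_actions))  # one trace per state-action pair
    alpha, gamma, lam = 0.1, 0.9, 0.8

    def softmax_policy(s):
        """Action probabilities from the preferences, via a softmax."""
        z = np.exp(p[s] - p[s].max())
        return z / z.sum()

    def actor_step(s, a, delta):
        """Apply (7.14): bump the visited pair's trace by 1, move every
        preference along its trace, then decay all traces."""
        e_p[s, a] += 1.0
        p[:] += alpha * delta * e_p
        e_p[:] *= gamma * lam

On each step the agent would draw an action with, for example, np.random.choice(n_actions, p=softmax_policy(s)), call the critic to obtain $\delta_t$, and then call actor_step.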
In Section 6.6 we also
discussed a more sophisticated actor-critic method that uses the
update

$$ p_{t+1}(s_t,a_t) = p_t(s_t,a_t) + \alpha\delta_t \bigl[ 1 - \pi_t(s_t,a_t) \bigr]. $$

To generalize this equation to eligibility traces we can use the
same update (7.14) with a slightly different
trace. Rather than incrementing the trace by 1 each time a
state-action pair occurs, it is updated by $1 - \pi_t(s_t,a_t)$:

$$ e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s=s_t \text{ and } a=a_t; \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise,} \end{cases} \tag{7.15} $$

for all $s$, $a$.
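The corresponding change to the actor sketch above is confined to the size of the trace increment. Here pi_probs is assumed to hold the current softmax probabilities $\pi_t(s_t,\cdot)$, and actor_step_1_minus_pi is an illustrative name:

    # Sketch of the actor step using the trace of (7.15): the visited pair's
    # trace is incremented by 1 - pi_t(s_t, a_t) rather than by 1.
    import numpy as np

    n_states, n_actions = 10, 4
    p = np.zeros((n_states, n_actions))    # action preferences
    e_p = np.zeros((n_states, n_actions))  # traces for state-action pairs
    alpha, gamma, lam = 0.1, 0.9, 0.8

    def actor_step_1_minus_pi(s, a, delta, pi_probs):
        """Update (7.14) with the (7.15) trace; pi_probs are the current
        action probabilities in state s (e.g. from a softmax)."""
        e_p[s, a] += 1.0 - pi_probs[a]   # less additional credit for already-likely actions
        p[:] += alpha * delta * e_p
        e_p[:] *= gamma * lam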