In some cases significantly better performance can be obtained by using a
slightly modified kind of trace known as a replacing trace. Suppose a state
is visited and then revisited before the trace due to the first visit has fully
decayed to zero. With accumulating traces
(7.5), the revisit causes a further increment in the
trace, driving it greater than 1, whereas with
replacing traces, the trace is reset to 1.
Figure
7.16 contrasts these two kinds of traces.
Formally, a replacing trace for a discrete state $s$
is defined by

$$
e_t(s) = \begin{cases}
1 & \text{if } s = s_t \\
\gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t.
\end{cases}
$$
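As a concrete illustration of the two update rules (a sketch, not part of the book's algorithms; the array layout, parameter values, and function name are assumptions), here they are side by side in Python for a tabular state space:

```python
import numpy as np

def update_traces(e, state, gamma, lam, kind="replacing"):
    """Decay every trace, then update the trace of the visited state.

    e     -- 1-D array with one eligibility trace per state
    kind  -- "accumulating" increments the trace as in (7.5);
             "replacing" resets it to 1
    """
    e *= gamma * lam                  # gamma*lambda decay of all traces
    if kind == "accumulating":
        e[state] += 1.0               # repeated visits push the trace above 1
    else:
        e[state] = 1.0                # repeated visits just reset it to 1
    return e

# Visiting state 3 on three consecutive steps (gamma = 1, lambda = 0.9):
acc, rep = np.zeros(5), np.zeros(5)
for _ in range(3):
    update_traces(acc, 3, 1.0, 0.9, kind="accumulating")
    update_traces(rep, 3, 1.0, 0.9, kind="replacing")
print(acc[3], rep[3])   # 2.71 vs 1.0
```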
Prediction or control algorithms using replacing traces are often called
replace-trace methods. Although
replacing traces are only slightly different from accumulating traces, they can
produce a significant improvement in learning rate. Figure
7.17
compares the performance of conventional and replace-trace versions of TD($\lambda$) on the
19-state random walk prediction task.
Other examples for a slightly more general
case are given in Figure 8.10 in the next chapter.
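Readers who want to reproduce a comparison of this kind can use a tabular TD($\lambda$) sketch such as the one below. It is not the experiment behind Figure 7.17; it assumes the 19-state walk starts in the center and terminates with reward $-1$ on the left and $+1$ on the right with $\gamma = 1$, and the function names and parameter settings are illustrative.

```python
import numpy as np

def random_walk_episode(n_states=19, rng=None):
    """One episode of the random walk as (state, reward, next_state) steps.
    States are 1..n_states; 0 and n_states+1 are terminal."""
    rng = rng or np.random.default_rng()
    s = (n_states + 1) // 2                      # start in the middle
    steps = []
    while True:
        s_next = s + rng.choice([-1, 1])
        if s_next == 0:
            steps.append((s, -1.0, s_next)); break
        if s_next == n_states + 1:
            steps.append((s, 1.0, s_next)); break
        steps.append((s, 0.0, s_next))
        s = s_next
    return steps

def td_lambda(episodes, lam, alpha, kind="replacing", n_states=19, gamma=1.0):
    """Tabular TD(lambda) prediction with accumulating or replacing traces."""
    V = np.zeros(n_states + 2)                   # terminal states keep value 0
    rng = np.random.default_rng(0)               # same episodes for both trace kinds
    for _ in range(episodes):
        e = np.zeros_like(V)
        for s, r, s_next in random_walk_episode(n_states, rng):
            e *= gamma * lam                     # decay all traces
            if kind == "accumulating":
                e[s] += 1.0
            else:
                e[s] = 1.0
            delta = r + gamma * V[s_next] - V[s]
            V += alpha * delta * e
    return V[1:n_states + 1]

true_v = np.arange(-9, 10) / 10.0                # true values for the -1/+1 walk
for kind in ("accumulating", "replacing"):
    v = td_lambda(episodes=10, lam=0.9, alpha=0.1, kind=kind)
    print(kind, "RMS error:", np.sqrt(np.mean((v - true_v) ** 2)))
```

Because both runs share the same random seed, any difference in RMS error is due only to the type of trace used.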
Example 7.5 Figure 7.18 shows an example of the kind of task that is difficult for control methods using accumulating eligibility traces. All rewards are zero except on entering the terminal state, which produces a reward of +1. From each state, selecting the right action brings the agent one step closer to the terminal reward, whereas the wrong (upper) action leaves it in the same state to try again. The full sequence of states is long enough that one would like to use long traces to get the fastest learning. However, problems occur if long accumulating traces are used. Suppose, on the first episode, at some state, the wrong action is taken several times before the right action is finally taken. Because the wrong action was selected more often, its accumulating trace can end up larger than that of the right action, even though the right action was taken more recently. When the +1 reward eventually arrives, the wrong action may therefore receive the larger share of the credit, making it more likely to be selected again on later episodes and slowing learning. Replacing traces avoid this buildup: however many times the wrong action is taken, its trace never exceeds 1, and with the state-action form given below the right action always ends the episode with the larger trace.
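The buildup described in the example is easy to verify numerically. The short sketch below (illustrative only; the value $\gamma\lambda = 0.9$ and the number of wrong selections are assumptions, not from the text) tracks accumulating traces for the wrong and right actions in a single state.

```python
gamma, lam = 1.0, 0.9
e_wrong, e_right = 0.0, 0.0

# The wrong action is taken three times, then the right action once.
for action in ["wrong", "wrong", "wrong", "right"]:
    e_wrong *= gamma * lam          # decay both traces
    e_right *= gamma * lam
    if action == "wrong":
        e_wrong += 1.0              # accumulating trace keeps growing
    else:
        e_right += 1.0

print(e_wrong, e_right)   # about 2.44 vs 1.0: the wrong action has the larger
                          # trace and gets more credit when the +1 reward arrives
```

With the replacing, state-action rule introduced below, the final step would instead zero the wrong action's trace, leaving the right action with the larger trace.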
There is an interesting relationship between replace-trace methods and Monte Carlo methods in the undiscounted case. Just as conventional TD(1) is related to the every-visit MC algorithm, so replace-trace TD(1) is related to the first-visit MC algorithm. In particular, the off-line version of replace-trace TD(1) is formally identical to first-visit MC (Singh and Sutton, 1996). How, or even whether, these methods and results extend to the discounted case is unknown.
There are several possible ways to generalize replacing eligibility
traces for use in control methods. Obviously, when a state is
revisited and a new action is selected, the trace for that action should
be reset to 1. But what of the traces for the other actions for that
state? The approach recommended by Singh and Sutton (1996) is to set
the traces of all the other actions from the revisited state to 0.
In this case, the state-action traces are updated by the following instead of
(7.13):

$$
e_t(s,a) = \begin{cases}
1 & \text{if } s = s_t \text{ and } a = a_t \\
0 & \text{if } s = s_t \text{ and } a \neq a_t \\
\gamma \lambda e_{t-1}(s,a) & \text{if } s \neq s_t.
\end{cases}
$$
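A minimal Python sketch of this state-action update (the array layout and function name are assumptions chosen for illustration):

```python
import numpy as np

def update_sa_traces(e, s, a, gamma, lam):
    """Replacing-trace update for state-action traces e[s, a]."""
    e *= gamma * lam      # decay all traces (the s != s_t case)
    e[s, :] = 0.0         # zero the other actions' traces at the revisited state
    e[s, a] = 1.0         # reset the taken action's trace to 1
    return e
```

In a Sarsa($\lambda$) step, this update would be applied before incrementing every $Q(s,a)$ by $\alpha\,\delta_t\,e_t(s,a)$.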
Exercise 7.6 In Example 7.5, suppose from state $s$ the wrong action is taken twice before the right action is taken. If accumulating traces are used, how large must the trace parameter $\lambda$ be in order for the wrong action to end up with a larger eligibility trace than the right action?
Exercise 7.7 (programming) Program Example 7.5 and compare accumulate-trace and replace-trace versions of Sarsa($\lambda$).