It might at first appear that methods using eligibility traces are much more complex
than one-step methods. A naive implementation would require every state
(or state-action pair) to update both its value estimate and its eligibility trace
on every time step. This would not be a problem for implementations on
single-instruction, multiple-data parallel computers or in plausible neural
implementations, but it is a problem for implementations on conventional serial
computers. Fortunately, for typical values of and
the eligibility
traces of almost all states are almost always nearly zero; only those that have
recently been visited will have traces significantly greater than zero. Only
these few states really need to be updated because the updates at the others
will have essentially no effect.
In practice,
then, implementations on conventional computers keep track of and update only the
few states with nonzero traces. Using this trick, the computational expense of
using traces is typically a few times that of a one-step method. The exact multiple
of course depends on and
and on the expense of the other computations.
Cichosz (1995) has demonstrated a further implementation
technique that further reduces complexity to a constant independent of
and
.
Finally, it should be noted that the tabular case is in some sense a worst
case for the computational complexity of traces. When function approximation
is used (Chapter 8), the computational advantages of not using traces
generally decrease. For example, if artificial neural networks and
backpropagation are used, then traces generally cause only a doubling of the
required memory and computation per step.
Exercise 7.9 Write pseudocode for an implementation of TD(