Next: 8. Generalization and Function
Up: 7. Eligibility Traces
Previous: 7.11 Conclusions
Contents
The forward view of eligibility traces in terms of n-step returns and the
λ-return is due to Watkins (1989), who also first discussed the error reduction
property of n-step returns. Our presentation is based on the slightly modified
treatment by Jaakkola, Jordan, and Singh (1994). The results in the random walk
examples were generated for this text, based on the work of Sutton (1988) and
Singh and Sutton (1996). The use of backup diagrams to describe these and other
algorithms in this chapter is new, as are the terms "forward view" and "backward
view."
TD(λ) was proved to converge in the mean by Dayan (1992), and with
probability 1 by many researchers, including Peng (1993), Dayan and
Sejnowski (1994), and Tsitsiklis (1994).
Jaakkola, Jordan, and Singh (1994), in addition,
first proved convergence of
TD(λ) under on-line updating. Gurvits, Lin, and Hanson
(1994) proved convergence of a more general class
of eligibility trace methods.
The idea that stimuli produce aftereffects in the nervous system that are
important for learning is very old. Animal learning psychologists at least as
far back as Pavlov (1927) and Hull (1943, 1952) included such ideas in their
theories. However, stimulus traces in these
theories are more like transient state representations than what we are calling
eligibility traces: they could be associated with actions, whereas an eligibility
trace is used only for credit assignment. The idea of a stimulus trace serving
exclusively for credit assignment is apparently due to Klopf
(1972), who hypothesized that under certain conditions a neuron's
synapses would become "eligible" for subsequent modification should reinforcement
later arrive at the neuron. Our use of eligibility traces was based on
Klopf's work (Sutton, 1978a, 1978b, 1978c; Barto and Sutton, 1981a, 1981b; Sutton and
Barto, 1981a; Barto, Sutton, and Anderson, 1983; Sutton, 1984). The
TD(λ) algorithm is due to Sutton (1988).
The equivalence of forward and backward views, and the relationships to Monte
Carlo methods, were proved by Sutton (1988) for undiscounted episodic tasks,
then extended by Watkins (1989) to the general case. The idea in exercise 7.5 is new.
Sarsa(λ) was first explored as a control method by Rummery and Niranjan
(1994) and Rummery (1995).
Watkins's Q(λ) is due to Watkins (1989). Peng's Q(λ) is due to Peng
and Williams (Peng, 1993; Peng and Williams, 1994, 1996). Rummery
(1995) made extensive comparative studies of these algorithms.
Convergence has not been proved for any control method for λ > 0.
Actor-critic methods were among the first methods to use eligibility traces (Barto,
Sutton, and Anderson, 1983; Sutton, 1984). The
specific algorithm discussed in this chapter has never been tried before.
Replacing traces are due to Singh and Sutton (1996). The results in
Figure
7.17 are from their paper. The task in Figure 7.18 was
used to show the weakness of accumulating traces by Sutton (1984). The
relationship of both kinds of traces to specific Monte Carlo methods was
developed by Singh and Sutton (1996).
The ideas in these two sections were generally known for many years, but beyond
what is in the sources cited in the sections themselves, this text may be the first
place they have been described. Perhaps the first published discussion of
variable λ was by Watkins (1989), who pointed out that the cutting
off of the backup sequence (Figure 7.13) in his Q(λ) when
a nongreedy action was selected could be implemented by temporarily setting
λ to 0.
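Watkins's cutoff can be viewed as a variable λ that takes its usual value after greedy actions and drops to 0 after nongreedy ones, which zeroes every eligibility trace and so truncates the backup sequence. A minimal sketch of that view, assuming a tabular setting with traces stored in a dictionary (the names `lam`, `update_traces`, and the sample state-action pairs are illustrative, not from the text):

```python
# Hedged sketch: Watkins's cutoff as a variable lambda.
# lambda_t = lam after a greedy action, 0 after a nongreedy one;
# decaying by gamma * lambda_t then cuts all traces to zero on
# nongreedy actions, truncating the backup sequence.

def effective_lambda(action_was_greedy, lam=0.9):
    """Variable lambda: the usual constant, or 0 on a nongreedy action."""
    return lam if action_was_greedy else 0.0

def update_traces(traces, gamma, action_was_greedy, lam=0.9):
    """Decay every trace by gamma * lambda_t before incrementing the
    trace of the current state-action pair (increment not shown)."""
    lt = effective_lambda(action_was_greedy, lam)
    return {sa: gamma * lt * e for sa, e in traces.items()}

# Illustrative state-action traces (hypothetical values).
traces = {("s0", "a0"): 1.0, ("s1", "a1"): 0.5}
print(update_traces(traces, 0.9, True))   # traces merely decay
print(update_traces(traces, 0.9, False))  # all traces cut to zero
```

The same mechanism generalizes: any rule for choosing λ per step fits this shape, with Watkins's Q(λ) as the special case that keys λ to greediness of the selected action.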
Mark Lee
2005-01-04