7.1 and 7.2
The forward view of eligibility traces in terms of n-step returns and the λ-return is due to Watkins (1989), who also first discussed the error-correction property of n-step returns. Our presentation is based on the slightly modified treatment by Jaakkola, Jordan, and Singh (1994). The results in the random-walk examples were made for this text based on the prior work by Sutton (1988) and Singh and Sutton (1996). The use of backup diagrams to describe these and other algorithms in this chapter is new, as are the terms "forward view" and "backward view."
TD(λ) was proved to converge in the mean by Dayan (1992), and with probability one by many researchers, including Peng (1993), Dayan and Sejnowski (1994), and Tsitsiklis (1994). Jaakkola, Jordan, and Singh (1994) were in addition the first to prove convergence of TD(λ) under online updating. Gurvits, Lin, and Hanson (1994) proved convergence of a more general class of eligibility-trace methods.
7.3
The idea that stimuli produce aftereffects in the nervous system that are
important for learning is very old. Animal learning psychologists at least as
far back as Pavlov (1927) and Hull (1943, 1952) included such ideas in their
theories. However, stimulus traces in these
theories are more like transient state representations than what we are calling
eligibility traces: they could be associated with actions, whereas an eligibility
trace is used only for credit assignment. The idea of a stimulus trace serving
exclusively for credit assignment is apparently due to Klopf
(1972), who proposed that a neuron's synapses become
"eligible" under certain conditions for modification if, and when,
reinforcement arrives later at the neuron. Our subsequent use of eligibility
traces was based on Klopf's work (Sutton, 1978; Barto and Sutton, 1981;
Sutton and Barto, 1981; Barto, Sutton, and Anderson, 1983; Sutton, 1984). The
TD(λ) algorithm itself is due to Sutton (1988).
7.4
The equivalence of the forward and backward views, and the relationships to Monte Carlo methods, were developed by Sutton (1988) for undiscounted episodic tasks, then extended to the general case by Watkins (1989). The idea in Exercise 7.5 is new.
7.5
Sarsa(λ) was first explored as a control method by Rummery and Niranjan (1994) and Rummery (1995).
7.6
Watkins's Q(λ) is due to Watkins (1989). Peng's Q(λ) is due to Peng and Williams (Peng, 1993; Peng and Williams, 1994, 1996). Rummery (1995) made extensive comparative studies of these algorithms.
Convergence has not been proved for any control method for λ > 0.
7.7
Actor-critic methods were among the first to use eligibility traces (Barto, Sutton, and Anderson, 1983; Sutton, 1984). The specific algorithm discussed in this chapter has never been tried before.
7.8
Replacing traces are due to Singh and Sutton (1996). The results in Figure 7.17 are from their paper. The task in the example of this section was first used to show the weakness of accumulating traces by Sutton (1984). The relationship of both kinds of traces to specific Monte Carlo methods was developed by Singh and Sutton (1996).
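The distinction between the two kinds of traces can be sketched in a few lines. The following is an illustrative fragment, not code from the text: for tabular methods, an accumulating trace is incremented on each visit to a state, while a replacing trace is reset to 1, so a quickly revisited state cannot build up a trace larger than 1 (the function and variable names here, such as decay_traces and lam, are our own).

```python
def decay_traces(e, gamma, lam):
    """Decay every trace by gamma * lambda (done once per step)."""
    return {s: gamma * lam * v for s, v in e.items()}

def accumulate(e, s):
    """Accumulating trace: add 1 to the visited state's trace."""
    e[s] = e.get(s, 0.0) + 1.0
    return e

def replace(e, s):
    """Replacing trace: reset the visited state's trace to 1."""
    e[s] = 1.0
    return e
```

Revisiting a state on consecutive steps with gamma = lam = 0.9 leaves an accumulating trace of 1 + 0.81 = 1.81, whereas the replacing trace is back at exactly 1; this overshoot is the weakness of accumulating traces referred to above.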
7.9 and 7.10
The ideas in these two sections were generally known for many years, but, beyond
what is in the sources cited in the sections themselves, this text may be the first
place they have been described. Perhaps the first published discussion of
variable λ was by Watkins (1989), who pointed out that the cutting off of the backup sequence (Figure 7.13) in his Q(λ) when a non-greedy action was selected could be implemented by temporarily setting λ to 0.
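The trace-cutting idea above can be sketched as a single trace-maintenance step for tabular Watkins's Q(λ). This is a simplified illustration under our own naming (update_traces, sa for a state-action pair), not an implementation from the text:

```python
def update_traces(e, sa, action_was_greedy, gamma, lam):
    """One trace-maintenance step of Watkins's Q(lambda) (sketch).

    If the last action was greedy, all traces decay by gamma * lambda.
    If it was exploratory, all traces are zeroed, which cuts off the
    backup sequence, exactly as if lambda were temporarily set to 0.
    The current state-action pair's trace is then incremented.
    """
    if action_was_greedy:
        e = {k: gamma * lam * v for k, v in e.items()}
    else:
        e = {k: 0.0 for k in e}  # cut the backup at the exploratory action
    e[sa] = e.get(sa, 0.0) + 1.0
    return e
```

On a greedy step the older traces shrink smoothly; on a non-greedy step they vanish, so no credit from the current TD error flows back past the exploratory action.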