Next: 8. Generalization and Function
Up: 7. Eligibility Traces
Previous: 7.11 Conclusions
Contents
The forward view of eligibility traces in terms of n-step returns and the
λ-return is due to Watkins (1989), who also first discussed the error reduction
property of n-step returns. Our presentation is based on the slightly modified
treatment by Jaakkola, Jordan, and Singh (1994). The results in the random walk
examples were generated for this text, based on the work of Sutton (1988) and
Singh and Sutton (1996). The use of backup diagrams to describe these and other
algorithms in this chapter is new, as are the terms "forward view" and "backward
view."
TD(λ) was proved to converge in the mean by Dayan (1992), and with
probability 1 by many researchers, including Peng (1993), Dayan and
Sejnowski (1994), and Tsitsiklis (1994).
Jaakkola, Jordan, and Singh (1994), in addition,
first proved convergence of
TD(λ) under on-line updating. Gurvits, Lin, and Hanson
(1994) proved convergence of a more general class
of eligibility trace methods.
The idea that stimuli produce aftereffects in the nervous system that are
important for learning is very old. Animal learning psychologists at least as
far back as Pavlov (1927) and Hull (1943, 1952) included such ideas in their
theories. However, stimulus traces in these
theories are more like transient state representations than what we are calling
eligibility traces: they could be associated with actions, whereas an eligibility
trace is used only for credit assignment. The idea of a stimulus trace serving
exclusively for credit assignment is apparently due to Klopf
(1972), who hypothesized that under certain conditions a neuron's
synapses would become "eligible" for subsequent modification should reinforcement
later arrive at the neuron. Our use of eligibility traces was based on
Klopf's work (Sutton, 1978a, 1978b, 1978c; Barto and Sutton, 1981a, 1981b; Sutton and
Barto, 1981a; Barto, Sutton, and Anderson, 1983; Sutton, 1984). The
TD(λ) algorithm is due to Sutton (1988).
The equivalence of forward and backward views, and the relationships to Monte
Carlo methods, were proved by Sutton (1988) for undiscounted episodic tasks,
then extended by Watkins (1989) to the general case. The idea in exercise 7.5 is new.
Sarsa(λ) was first explored as a control method by Rummery and Niranjan
(1994) and Rummery (1995).
Watkins's Q(λ) is due to Watkins (1989). Peng's Q(λ) is due to Peng
and Williams (Peng, 1993; Peng and Williams, 1994, 1996). Rummery
(1995) made extensive comparative studies of these algorithms.
Convergence has not been proved for any control method for λ > 0.
Actor-critic methods were among the first methods to use eligibility traces (Barto,
Sutton, and Anderson, 1983; Sutton, 1984). The
specific algorithm discussed in this chapter has never been tried before.
Replacing traces are due to Singh and Sutton (1996). The results in
Figure
7.17 are from their paper. The task in Figure 7.18 was
used to show the weakness of accumulating traces by Sutton (1984). The
relationship of both kinds of traces to specific Monte Carlo methods was
developed by Singh and Sutton (1996).
The ideas in these two sections were generally known for many years, but beyond
what is in the sources cited in the sections themselves, this text may be the first
place they have been described. Perhaps the first published discussion of
variable λ was by Watkins (1989), who pointed out that the cutting
off of the backup sequence (Figure 7.13) in his Q(λ) when
a nongreedy action was selected could be implemented by temporarily setting
λ to 0.
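Watkins's cutoff can be viewed as a variable λ that takes its usual value after greedy actions and drops to 0 after nongreedy ones, which zeroes every eligibility trace and so truncates the backup sequence. A minimal sketch of that view, assuming a tabular setting with traces stored in a dictionary (the names `lam`, `update_traces`, and the sample state-action pairs are illustrative, not from the text):

```python
# Hedged sketch: Watkins's cutoff as a variable lambda.
# lambda_t = lam after a greedy action, 0 after a nongreedy one;
# decaying by gamma * lambda_t then cuts all traces to zero on
# nongreedy actions, truncating the backup sequence.

def effective_lambda(action_was_greedy, lam=0.9):
    """Variable lambda: the usual constant, or 0 on a nongreedy action."""
    return lam if action_was_greedy else 0.0

def update_traces(traces, gamma, action_was_greedy, lam=0.9):
    """Decay every trace by gamma * lambda_t before incrementing the
    trace of the current state-action pair (increment not shown)."""
    lt = effective_lambda(action_was_greedy, lam)
    return {sa: gamma * lt * e for sa, e in traces.items()}

# Illustrative state-action traces (hypothetical values).
traces = {("s0", "a0"): 1.0, ("s1", "a1"): 0.5}
print(update_traces(traces, 0.9, True))   # traces merely decay
print(update_traces(traces, 0.9, False))  # all traces cut to zero
```

The same mechanism generalizes: any rule for choosing λ per step fits this shape, with Watkins's Q(λ) as the special case that keys λ to greediness of the selected action.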
Mark Lee
2005-01-04