As we outlined in Chapter 1, the idea of TD learning has its early roots in animal learning psychology and artificial intelligence, most notably the work of Samuel (1959) and Klopf (1972). Samuel's work is described as a case study in Section 11.2. Also related to TD learning are Holland's (1975, 1976) early ideas about consistency among value predictions. These influenced one of the authors (Barto), who was a graduate student from 1970 to 1975 at the University of Michigan, where Holland was teaching. Holland's ideas led to a number of TD-related systems, including the work of Booker (1982) and the bucket brigade of Holland (1986), which is related to Sarsa as discussed below.
6.1 -- 6.2
Most of the specific material from these sections is from Sutton (1988), including the TD(0) algorithm, the random walk example, and the term ``temporal-difference learning." The characterization of the relationship to dynamic programming and Monte Carlo methods was influenced by Watkins (1989), Werbos (1987), and others. The use of backup diagrams here and in other chapters is new to this book. Example 6.4 is due to Sutton, but has not been published before.
Tabular TD(0) was proved to converge in the mean by Sutton (1988) and with probability one by Dayan (1992), based on the work of Watkins and Dayan (1992). These results were then extended and strengthened by Jaakkola, Jordan, and Singh (1994) and Tsitsiklis (1994) by using extensions of the powerful existing theory of stochastic approximation. Other extensions and generalizations are covered in the next two chapters.
6.3
The optimality of the TD algorithm under batch training was established by Sutton (1988). The term certainty equivalence is from the adaptive control literature (e.g., Goodwin and Sin, 1984). Illuminating this result is Barnard's (1993) derivation of the TD algorithm as a combination of one step of an incremental method for learning a model of the Markov chain and one step of a method for computing predictions from the model.
6.4
The Sarsa algorithm was first explored by Rummery and Niranjan (1994), who called it modified Q-learning. The name ``Sarsa" was introduced by Sutton (1996). The convergence of 1-step tabular Sarsa (the form treated in this chapter) has been proven by Satinder Singh (personal communication). The ``windy gridworld" example was suggested by Tom Kalt.
Holland's (1986) bucket-brigade idea evolved into an algorithm very close to Sarsa. The original idea of the bucket brigade involved chains of rules triggering each other; it focused on passing credit back from the current rule to the rules that triggered it. Over time, the bucket brigade came to be more like TD learning in passing credit back to any temporally preceding rule, not just to the ones that triggered the current rule. The modern form of the bucket brigade, when simplified in various natural ways, is very similar to 1-step Sarsa, as detailed by Wilson (1994).
6.5
Q-learning was introduced by Watkins (1989), whose outline of a convergence proof was later made rigorous by Watkins and Dayan (1992). More general convergence results were proved by Jaakkola, Jordan, and Singh (1994) and Tsitsiklis (1994).
6.6
Actor-critic architectures using TD learning were first studied by Witten (1977) and then by Barto, Sutton, and Anderson (1983), who introduced this use of the terms ``actor" and ``critic." Sutton (1984) and Williams (1992) developed the eligibility terms mentioned in this section. Barto (1995) and Houk, Adams, and Barto (1995) presented a model of how an actor-critic architecture might be implemented in the brain.
6.7
R-learning is due to Schwartz (1993). Mahadevan (1996), Tadepalli and Ok (1994), and Bertsekas and Tsitsiklis (1996) have also studied reinforcement learning for undiscounted continuing tasks. In the literature, the undiscounted continuing case is often called the case of maximizing ``average reward per time step" or the ``average-reward case." The name R-learning probably was meant just to be the alphabetic successor to Q-learning, but we prefer to think of it as a reference to the learning of relative values. The access-control queuing example was suggested by the work of Carlström and Nordström (1997).