** Next:** III. A Unified View
** Up:** 6. Temporal-Difference Learning
** Previous:** 6.9 Summary
** Contents**

**Subsections**

As we outlined in Chapter 1, the idea of TD learning has its
early roots in animal learning psychology and artificial intelligence, most notably
the work of Samuel (1959) and Klopf (1972). Samuel's work is described as a case
study in Section 11.2. Also related to TD learning are Holland's
(1975, 1976) early ideas about consistency among
value predictions. These influenced one of the authors (Barto), who was a
graduate student from 1970 to 1975 at the University of Michigan, where Holland
was teaching. Holland's ideas led to a number of TD-related systems, including
the work of Booker (1982) and the bucket brigade of Holland
(1986), which is related to Sarsa as discussed below.

Most of the specific material from these sections is from Sutton (1988), including
the TD(0) algorithm, the random walk example, and the term
"temporal-difference learning." The characterization of
the relationship to dynamic programming and Monte Carlo methods was influenced by
Watkins (1989), Werbos (1987), and others. The use of backup diagrams here and in
other chapters is new to this book. Example 6.4 is due to Sutton, but has not
been published before.
Tabular TD(0) was proved to converge in the mean by Sutton (1988)
and with probability 1 by Dayan (1992), based on the work of
Watkins and Dayan (1992). These results were
extended and strengthened by Jaakkola, Jordan, and Singh
(1994) and Tsitsiklis (1994) by
using extensions of the powerful existing theory of stochastic approximation.
Other extensions and generalizations are covered in the next two chapters.

The optimality of the TD algorithm under batch training was established by
Sutton (1988). The term *certainty equivalence* is from the adaptive
control literature (e.g., Goodwin and Sin, 1984).
Illuminating this result is Barnard's (1993) derivation of
the TD algorithm as a combination of one step of an incremental method for
learning a model of the Markov chain and one step of a method for computing
predictions from the model.

The Sarsa algorithm was first explored by Rummery and Niranjan
(1994), who called it *modified Q-learning*. The name
"Sarsa" was introduced by Sutton (1996). The convergence of
one-step tabular Sarsa (the form treated in this chapter) has been proved by
Satinder Singh (personal communication). The "windy gridworld" example was
suggested by Tom Kalt.
Holland's (1986) bucket brigade idea evolved into an algorithm closely
related to Sarsa. The original idea of the bucket brigade involved chains of rules
triggering each other; it focused on passing credit back from the current rule to the
rules that triggered it. Over time, the bucket brigade came to be more like TD
learning in passing credit back to any temporally preceding rule, not just to the ones
that triggered the current rule. The modern form of the bucket brigade, when simplified
in various natural ways, is nearly identical to one-step Sarsa, as detailed by
Wilson (1994).

Q-learning was introduced by Watkins (1989), whose outline
of a convergence proof was later made rigorous by Watkins and Dayan
(1992). More general convergence results were proved by
Jaakkola, Jordan, and Singh (1994) and
Tsitsiklis (1994).

Actor-critic architectures using TD learning were first studied by Witten
(1977) and then by Barto, Sutton, and Anderson
(1983; Sutton, 1984), who introduced this use of the terms "actor" and
"critic." Sutton (1984) and Williams
(1992) developed the eligibility terms
mentioned in this section. Barto (1995a) and Houk,
Adams, and Barto (1995)
presented a model of how
an actor-critic architecture might be implemented in the brain.

R-learning is due to Schwartz (1993). Mahadevan
(1996), Tadepalli and Ok (1994), and
Bertsekas and Tsitsiklis (1996) have studied reinforcement learning for
undiscounted continuing tasks. In the literature, the undiscounted continuing case is
often called the case of maximizing "average reward
per time step" or the "average-reward case." The name R-learning was probably
meant to be the alphabetic successor to Q-learning, but we prefer to think of it
as a reference to the learning of *relative* values. The access-control
queuing example was suggested by the work of Carlström and Nordström
(1997).

** Next:** III. A Unified View
** Up:** 6. Temporal-Difference Learning
** Previous:** 6.9 Summary
** Contents**
Mark Lee
2005-01-04