** Next:** 3. The Reinforcement Learning
** Up:** 2. Evaluative Feedback
** Previous:** 2.11 Conclusions
** Contents**

**Subsections**

Bandit problems have been studied in statistics, engineering, and psychology. In
statistics, bandit problems fall under the heading "sequential design of
experiments," introduced by Thompson (1933,
1934) and Robbins (1952), and studied
by Bellman (1956). Berry and Fristedt (1985)
provide an extensive treatment of bandit problems from the perspective of statistics.
Narendra and Thathachar (1989) treat bandit problems from the
engineering perspective, providing a good discussion of the various theoretical
traditions that have focused on them. In psychology, bandit problems have played
roles in statistical learning theory (e.g., Bush and Mosteller, 1955; Estes,
1950).
The term *greedy* is often used in the heuristic search literature (e.g.,
Pearl, 1984). The conflict between exploration and exploitation is
known in control engineering as the conflict between identification (or estimation)
and control (e.g., Witten, 1976). Feldbaum
(1965) called it the *dual control* problem, referring to
the need to solve the two problems of identification and control simultaneously when
trying to control a system under uncertainty. In discussing aspects of genetic
algorithms, Holland (1975) emphasized the importance of this
conflict, referring to it as the conflict between the need to exploit and the
need for new information.

Action-value methods for our -armed bandit problem were first proposed by
Thathachar and Sastry (1985). These are often called
*estimator algorithms* in the learning automata literature. The term *action value* is due to Watkins (1989). The first to use
-greedy methods may also have been Watkins (1989, p. 187), but the idea is so
simple that some earlier use seems likely.

The term *softmax* for the action selection rule (2.2) is due to Bridle
(1990). This rule appears to have been first proposed by Luce
(1959). The parameter is called temperature in simulated
annealing algorithms (Kirkpatrick, Gelatt, and Vecchi,
1983).

The main argument and results in this section were first presented by Sutton
(1984). Further analysis of the relationship between evaluation
and instruction has been presented by Barto (1985, 1991,
1992), and Barto and Anandan (1985).
The unit-square representation of a binary bandit task used in
Figure
2.2 has been called a contingency space in
experimental psychology (e.g., Staddon, 1983).
Narendra and Thathachar (1989) provide a comprehensive
treatment of modern learning automata theory and its applications. They also discuss
similar algorithms from the statistical learning theory of psychology. Other methods
based on converting reinforcement learning experience into target actions were
developed by Widrow, Gupta, and Maitra (1973) and by Gällmo and
Asplund (1995).

This material falls under the general heading of stochastic iterative
algorithms, which is well covered by Bertsekas and Tsitsiklis (1996).

Reinforcement comparison methods were extensively
developed by Sutton (1984) and further refined by
Williams (1986, 1992), Kaelbling
(1993a), and Dayan (1991). These authors
analyzed many variations of the idea, including other eligibility terms
that may significantly improve performance. Perhaps the earliest use of
reinforcement comparison was by Barto, Sutton, and Brouwer
(1981).

The pursuit algorithm is due to Thathachar and
Sastry (1985).

The term *associative search* and the corresponding problem were introduced by
Barto, Sutton, and Brouwer (1981).
The term *associative reinforcement learning* has also been used for associative
search (Barto and Anandan, 1985), but we prefer to reserve that term as a
synonym for the full reinforcement learning problem (as in Sutton,
1984).
We note that Thorndike's Law of Effect (quoted in
Chapter 1) describes associative search by referring to the
formation of associative links between situations (states) and actions. According
to the terminology of operant, or instrumental, conditioning (e.g., Skinner,
1938), a discriminative stimulus is a stimulus that signals the
presence of a particular reinforcement contingency. In our terms, different
discriminative stimuli correspond to different states.

Interval estimation methods are due to Lai (1987) and Kaelbling
(1993a). Bellman (1956) was the first to
show how dynamic programming could be used to compute the optimal balance between
exploration and exploitation within a Bayesian formulation of the problem. The survey
by Kumar (1985) provides a good discussion of Bayesian and
non-Bayesian approaches to these problems. The term *information state* comes
from the literature on partially observable MDPs; see, e.g., Lovejoy
(1991). The Gittins index approach is due to Gittins and Jones
(1974). Duff (1995) showed how it is
possible to learn Gittins indices for bandit problems through reinforcement learning.

** Next:** 3. The Reinforcement Learning
** Up:** 2. Evaluative Feedback
** Previous:** 2.11 Conclusions
** Contents**
Mark Lee
2005-01-04