

Interval estimation methods are due to Lai (1987) and Kaelbling (1993). Bellman (1956) was the first to show how dynamic programming could be used to compute the optimal balance between exploration and exploitation within a Bayesian formulation of the problem. The survey by Kumar (1985) provides a good discussion of Bayesian and non-Bayesian approaches to these problems. The term information state comes from the literature on partially observable MDPs; see, e.g., Lovejoy (1991). The Gittins index approach is due to Gittins and Jones (1974). Duff (1996) showed that Gittins indices for bandit problems can be learned through reinforcement learning.

Richard Sutton
Fri May 30 10:02:27 EDT 1997