The overall view of planning and learning presented here has developed gradually
over a number of years, in part by the authors (Sutton, 1990, 1991a, 1991b;
Barto, Bradtke, and Singh, 1991, 1995; Sutton and Pinette, 1985; Sutton and Barto,
1981b); it has been strongly influenced by Agre and Chapman (1990; Agre, 1988),
Bertsekas and Tsitsiklis (1989), Singh (1993), and others. The authors were
also strongly influenced by psychological studies of latent learning (Tolman,
1932) and by psychological views of the nature of thought (e.g.,
Galanter and Gerstenhaber, 1956; Craik, 1943; Campbell, 1960; Dennett, 1978).
The terms direct and indirect, which we use to describe different kinds of
reinforcement learning, are from the adaptive control literature
(e.g., Goodwin and Sin, 1984), where they are used to make the same kind of
distinction. The term system identification is used in adaptive control for
what we call model-learning (e.g., Goodwin and Sin, 1984; Ljung and Söderström,
1983; Young, 1984).

The Dyna architecture is due to Sutton (1990), and the results in the Dyna
sections are based on results reported there.
Prioritized sweeping was developed simultaneously and independently by Moore
and Atkeson (1993) and Peng and Williams (1993). The results in Figure 9.10
are due to Peng and Williams (1993). The results in Figure 9.11 are due to
Moore and Atkeson.

This section was strongly influenced by the experiments of Singh (1993).

For further reading on heuristic search, the reader is encouraged to consult
texts and surveys such as those by Russell and Norvig (1995) and Korf (1988).
Peng and Williams (1993) explored a forward focusing of backups much as is
suggested in this section.