In this chapter we develop a unified view of methods that require a
model of the environment, such as dynamic programming and heuristic
search, and methods that can be used without a model, such as Monte Carlo and
temporal-difference methods. We think of the former as planning
methods and of the latter as learning methods. Although there are real
differences between these two kinds of methods, there are also great
similarities. In particular, the heart of both kinds of methods is the computation
of value functions. Moreover, all the methods are based on looking ahead to future
events, computing a backed-up value, and then using it to update an approximate
value function. Earlier in this book we presented Monte Carlo and
temporal-difference methods as distinct alternatives, then showed how they can be
seamlessly integrated using eligibility traces in methods such as TD(λ). Our goal
in this chapter is a similar integration of planning and learning methods.
Having established these as distinct in earlier chapters, we now explore the extent
to which they can be blended together.
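To make the shared structure concrete, here is a minimal Python sketch of the common backup update; the function names and the toy model are ours, for illustration only, and are not from the text:

```python
def learning_backup(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update V[s] from a sampled transition (s, r, s_next) in real experience."""
    target = r + gamma * V[s_next]   # backed-up value computed from experience
    V[s] += alpha * (target - V[s])  # move the estimate toward the target

def planning_backup(V, s, model, alpha=0.1, gamma=0.9):
    """Same update form, but the transition comes from a model of the environment."""
    r, s_next = model(s)             # simulated experience from the model
    target = r + gamma * V[s_next]   # backed-up value computed from the model
    V[s] += alpha * (target - V[s])

# Toy usage: both updates have identical form; only the source of the
# transition differs (real experience vs. a model query).
V = {"A": 0.0, "B": 1.0}
learning_backup(V, "A", r=0.5, s_next="B")
planning_backup(V, "A", model=lambda s: (0.5, "B"))
```

The point of the sketch is that the learning and planning updates differ only in where the transition comes from, which is exactly the commonality this chapter builds on.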