If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning. This chapter is the beginning of our exploration of it. Before we are done, we will see that these ideas and methods blend into each other and can be combined in many ways. In particular, in Chapter 7 we introduce the TD(λ) algorithm, which seamlessly integrates TD and Monte Carlo methods.
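To preview the contrast made precise later in the chapter: a constant-$\alpha$ Monte Carlo update must wait until the return $G_t$ following time $t$ is known, whereas the simplest TD method, TD(0), updates immediately from the observed reward and the current estimate at the next state (here $\alpha$ is a step-size parameter and $\gamma$ the discount rate):

$$V(S_t) \leftarrow V(S_t) + \alpha\bigl[G_t - V(S_t)\bigr] \qquad \text{(constant-$\alpha$ MC)}$$

$$V(S_t) \leftarrow V(S_t) + \alpha\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr] \qquad \text{(TD(0))}$$

The second update is what is meant by bootstrapping: the target $R_{t+1} + \gamma V(S_{t+1})$ is itself built from a learned estimate.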
As usual, we start by focusing on the policy-evaluation or prediction problem, that of estimating the value function $v_\pi$ for a given policy $\pi$. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem.
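As a concrete preview of TD prediction, below is a minimal tabular TD(0) sketch in Python. The env_reset/env_step interface and the toy random-walk environment are assumptions made for illustration; the update itself is the TD(0) rule developed in the next section.

```python
import random
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy,
                   num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (a sketch).

    env_reset() -> state; env_step(state, action) -> (next_state, reward, done).
    These interface names are assumptions for this sketch, not a fixed API.
    """
    V = defaultdict(float)  # value estimates, initialized to 0
    for _ in range(num_episodes):
        state = env_reset()
        done = False
        while not done:
            next_state, reward, done = env_step(state, policy(state))
            # TD(0): move V(state) toward the bootstrapped target
            # R + gamma * V(next_state), without waiting for the episode's return.
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V

# Toy usage: a five-state random walk (states 1..5, starting in the middle,
# terminating at 0 or 6, reward +1 only on exiting to the right), of the kind
# used as an example later in this chapter.
def reset():
    return 3

def step(state, action):
    next_state = state + action
    if next_state == 6:
        return next_state, 1.0, True
    if next_state == 0:
        return next_state, 0.0, True
    return next_state, 0.0, False

V = td0_prediction(reset, step, policy=lambda s: random.choice([-1, 1]))
print({s: round(V[s], 2) for s in range(1, 6)})  # roughly 1/6, 2/6, ..., 5/6
```

Note that each update uses only the transition just observed, so learning proceeds online, during the episode, rather than after it ends as in Monte Carlo prediction.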