7.1 $n$-Step TD Prediction
What is the space of methods lying between Monte Carlo and TD methods? Consider estimating $v_\pi$ from sample episodes generated using $\pi$. Monte Carlo methods perform a backup for each state based on the entire sequence of observed rewards from that state until the end of the episode. The backup of simple TD methods, on the other hand, is based on just the one next reward, using the value of the state one step later as a proxy for the remaining rewards. One kind of intermediate method, then, would perform a backup based on an intermediate number of rewards: more than one, but less than all of them until termination. For example, a two-step backup would be based on the first two rewards and the estimated value of the state two steps later. Similarly, we could have three-step backups, four-step backups, and so on. Figure 7.1 diagrams the spectrum of $n$-step backups for $v_\pi$, with one-step, simple TD backups on the left and up-until-termination Monte Carlo backups on the right.
The methods that use $n$-step backups are still TD methods because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but $n$ steps later. Methods in which the temporal difference extends over $n$ steps are called $n$-step TD methods. The TD methods introduced in the previous chapter all use one-step backups, and henceforth we call them one-step TD methods.
More formally, consider the backup applied to state $S_t$ as a result of the state-reward sequence, $S_t, R_{t+1}, S_{t+1}, R_{t+2}, S_{t+2}, \ldots, R_T, S_T$ (omitting the actions for simplicity). We know that in Monte Carlo backups the estimate of $v_\pi(S_t)$ is updated in the direction of the complete return:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T,
$$
where $T$ is the last time step of the episode.
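To make the truncation condition in the next paragraph precise, it helps to restate the $n$-step return explicitly (this restatement follows the standard definition, writing $V$ for the current estimate of $v_\pi$): the first $n$ rewards are used as observed, and the remaining rewards are replaced by the discounted estimated value of the state reached $n$ steps later,
$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad n \ge 1,\ 0 \le t < T - n.
$$
For $n = 1$ this is the familiar one-step TD target $R_{t+1} + \gamma V(S_{t+1})$, and as $n$ grows it approaches the complete return above.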
Of course, if the episode ends in less than $n$ steps, then the truncation in an $n$-step return occurs at the episode's end, resulting in the conventional complete return. In other words, if $t + n \ge T$, then $G_t^{(n)} = G_t$. Thus, the last $n$ $n$-step returns of any episode are always complete returns, and an infinite-step return is always a complete return. This definition enables us to treat Monte Carlo methods as the special case of infinite-step returns. All of this is consistent with the tricks for treating episodic and continuing tasks equivalently that we introduced in Section 3.4. There we chose to treat the terminal state as a state that always transitions to itself with zero reward. Under this trick, all $n$-step returns that last up to or past termination have the same value as the complete return.
An $n$-step backup is defined to be a backup toward the $n$-step return. In the tabular, state-value case, the increment to $V_t(S_t)$ (the estimated value of $S_t$ at time $t$), due to an $n$-step backup of $S_t$, is defined by
$$
\Delta V_t(S_t) = \alpha \Big[ G_t^{(n)} - V_t(S_t) \Big],
$$
where $\alpha$ is a positive step-size parameter.
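As a concrete illustration (not from the original text), here is a minimal sketch of tabular $n$-step TD prediction in Python. It assumes a hypothetical episodic environment with integer states, exposed through `env.reset() -> state` and `env.step(state) -> (reward, next_state, done)` under the policy being evaluated. Each increment is applied as soon as the corresponding $n$-step return can be formed (on-line updating); the off-line variant referred to in Example 7.1 would instead accumulate the increments and apply them only at the episode's end.

```python
import numpy as np

def n_step_td_prediction(env, n, alpha, gamma, num_episodes, num_states):
    """Tabular n-step TD prediction with on-line updating -- a sketch.

    Assumes a hypothetical episodic `env` with `reset() -> state` and
    `step(state) -> (reward, next_state, done)`, states indexed 0..num_states-1.
    """
    V = np.zeros(num_states)              # value estimates, one per state

    for _ in range(num_episodes):
        state = env.reset()
        states, rewards = [state], [0.0]  # rewards[i] holds R_i for i >= 1
        T = float('inf')                  # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                reward, next_state, done = env.step(states[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1               # the time whose state is backed up now
            if tau >= 0:
                # n-step return: n observed rewards, then bootstrap from V
                # unless the episode terminated within those n steps.
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])  # the n-step backup
            if tau == T - 1:
                break
            t += 1
    return V
```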
The expected value of all $n$-step returns is guaranteed to improve in a certain way over the current value function as an approximation to the true value function. For any $V$, the expected value of the $n$-step return using $V$ is guaranteed to be a better estimate of $v_\pi$ than $V$ is, in a worst-state sense. That is, the worst error under the new estimate is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V$:
$$
\max_s \Big| \mathbb{E}_\pi\big[ G_t^{(n)} \,\big|\, S_t = s \big] - v_\pi(s) \Big|
\;\le\; \gamma^n \max_s \big| V(s) - v_\pi(s) \big|.
$$
This is called the error reduction property of $n$-step returns.
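To see where the $\gamma^n$ factor comes from (a brief derivation sketch, not part of the original argument): by the $n$-step Bellman expansion, the first $n$ reward terms of $\mathbb{E}_\pi[G_t^{(n)} \mid S_t = s]$ and of $v_\pi(s)$ are identical in expectation, so they cancel, leaving only the discounted bootstrapping error:
$$
\mathbb{E}_\pi\big[ G_t^{(n)} \mid S_t = s \big] - v_\pi(s)
= \gamma^n\, \mathbb{E}_\pi\big[ V(S_{t+n}) - v_\pi(S_{t+n}) \mid S_t = s \big],
$$
and the magnitude of this expectation is bounded by $\gamma^n \max_{s'} |V(s') - v_\pi(s')|$, which gives the error reduction property.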
Nevertheless, $n$-step TD methods are rarely used because they are inconvenient to implement. Computing $n$-step returns requires waiting $n$ steps to observe the resultant rewards and states. For large $n$, this can become problematic, particularly in control applications. The significance of $n$-step TD methods is primarily for theory and for understanding related methods that are more conveniently implemented. In the next few sections we use the idea of $n$-step TD methods to explain and justify eligibility trace methods.
Example 7.1: $n$-step TD Methods on the Random Walk  Consider using $n$-step TD methods on the random walk task described in Example 6.2 and shown in Figure 6.5. Suppose the first episode progressed directly from the center state, C, to the right, through D and E, and then terminated on the right with a return of 1. Recall that the estimated values of all the states started at an intermediate value, $V(s) = 0.5$. As a result of this experience, a one-step method would change only the estimate for the last state, $V(\mathrm{E})$, which would be incremented toward 1, the observed return. A two-step method, on the other hand, would increment the values of the two states preceding termination: $V(\mathrm{D})$ and $V(\mathrm{E})$ would both be incremented toward 1. A three-step method, or any $n$-step method for $n > 2$, would increment the values of all three of the visited states toward 1, all by the same amount. Which is better? Figure 7.2 shows the results of a simple empirical assessment for a larger random walk process, with 19 states (and with a $-1$ outcome on the left, all values initialized to 0). Shown is the root mean-squared error in the predictions at the end of an episode, averaged over states, the first 10 episodes, and 100 repetitions of the whole experiment (the same sets of walks were used for all methods). Results are shown for on-line and off-line $n$-step TD methods with a range of values for $n$ and $\alpha$. Empirically, on-line methods with an intermediate value of $n$ seem to work best on this task. This illustrates how the generalization of TD and Monte Carlo methods to $n$-step methods can potentially perform better than either of the two extreme methods.
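The comparison in Figure 7.2 can be reproduced in outline with a short script. The sketch below (an illustration, not the authors' code) builds the 19-state random walk with $-1$/$+1$ terminal outcomes, runs the on-line $n$-step TD function defined above for a few values of $n$ at a single, arbitrarily chosen step size $\alpha = 0.2$, and reports the root mean-squared error against the true values, which for this chain are $-0.9, -0.8, \ldots, +0.9$. It measures error only after the tenth episode rather than averaging over the first 10 episodes as in the figure, so the numbers will differ in detail.

```python
import numpy as np

class RandomWalk:
    """19-state random walk: start in the middle, move left or right with equal
    probability; terminating on the left yields -1, on the right +1.
    Matches the hypothetical env interface assumed by n_step_td_prediction."""

    def __init__(self, n_states=19):
        self.n_states = n_states

    def reset(self):
        return self.n_states // 2          # center state

    def step(self, state):
        next_state = state + np.random.choice([-1, 1])
        if next_state < 0:
            return -1.0, 0, True           # left termination, reward -1
        if next_state >= self.n_states:
            return 1.0, 0, True            # right termination, reward +1
        return 0.0, next_state, False

# True values are evenly spaced between -1 and +1 for this chain.
true_values = np.arange(1, 20) / 10.0 - 1.0

env = RandomWalk()
for n in (1, 2, 4, 8, 16):
    errors = []
    for _ in range(100):                   # 100 independent runs
        V = n_step_td_prediction(env, n=n, alpha=0.2, gamma=1.0,
                                 num_episodes=10, num_states=19)
        errors.append(np.sqrt(np.mean((V - true_values) ** 2)))
    print(f"n = {n:2d}: RMS error after 10 episodes = {np.mean(errors):.3f}")
```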
Exercise 7.1 Why do you think a larger random walk task (19 states instead of 5) was used in the examples of this chapter? Would a smaller walk have shifted the advantage to a different value of $n$? How about the change in left-side outcome from 0 to $-1$? Would that have made any difference in the best value of $n$?
Exercise 7.2 Why do you think on-line methods worked better than off-line methods on the example task?
Exercise 7.3 In the lower part of Figure 7.2, notice that the plot for $n = 3$ is different from the others, dropping to low performance at a much lower value of $\alpha$ than similar methods. In fact, the same was observed for $n = 5$, $n = 7$, and $n = 9$. Can you explain why this might have been so? In fact, we are not sure ourselves.