What is the space of methods lying between Monte Carlo and TD methods? Consider
estimating $V^\pi$ from sample episodes generated using $\pi$. Monte Carlo
methods perform a backup for each state based on the entire sequence of observed
rewards from that state until the end of the episode. The backup of simple TD
methods, on the other hand, is based on just the one next reward, using the value
of the state one step later as a proxy for the remaining rewards. One kind of
intermediate method, then, would perform a backup based on an intermediate number
of rewards: more than one, but less than all of them until termination. For
example, a two-step backup would be based on the first two rewards and the estimated
value of the state two steps later. Similarly, we could have three-step backups,
four-step backups, and so on. Figure 7.1 diagrams the spectrum of
$n$-step backups for $V^\pi$, with one-step, simple TD backups on the left and
up-until-termination Monte Carlo backups on the right.
Figure 7.1: The spectrum ranging from the one-step backups of simple TD methods to the up-until-termination backups of Monte Carlo methods. In between are the $n$-step backups, based on $n$ steps of real rewards and the estimated value of the $n$th next state, all appropriately discounted.
The methods that use $n$-step backups are still TD methods because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but $n$ steps later. Methods in which the temporal difference extends over $n$ steps are called $n$-step TD methods. The TD methods introduced in the previous chapter all use one-step backups, and henceforth we call them one-step TD methods.
More formally, consider the backup applied to state $s_t$ as a result of the state–reward sequence, $s_t, r_{t+1}, s_{t+1}, r_{t+2}, \ldots, r_T, s_T$ (omitting the actions for simplicity). We know that in Monte Carlo backups the estimate $V_t(s_t)$ is updated in the direction of the complete return:
\[
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T ,
\]
where $T$ is the last time step of the episode. Let us call this quantity the target of the backup. Whereas in Monte Carlo backups the target is the return, in one-step backups the target is the first reward plus the discounted estimated value of the next state:
\[
R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1}) .
\]
This makes sense because $\gamma V_t(s_{t+1})$ takes the place of the remaining terms $\gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$, as we discussed in the previous chapter. Our point now is that this idea makes just as much sense after two steps as it does after one. The two-step target is
\[
R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}) ,
\]
where now $\gamma^2 V_t(s_{t+2})$ takes the place of the terms $\gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots + \gamma^{T-t-1} r_T$. In general, the $n$-step target is
\[
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n}) . \tag{7.1}
\]
This quantity is sometimes called the "corrected $n$-step truncated return" because it is a return truncated after $n$ steps and then approximately corrected for the truncation by adding the estimated value of the $n$th next state. That terminology is descriptive but a bit long. We instead refer to $R_t^{(n)}$ simply as the $n$-step return at time $t$.
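As a concrete illustration of Equation (7.1), here is a minimal Python sketch. The function name `n_step_return` and the representation of an episode as parallel lists of rewards and state-value estimates are our own choices for illustration, not part of the text:

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute the n-step return R_t^(n) of Equation (7.1).

    rewards[k] holds r_{k+1}, the reward received on the transition out of
    step k; values[k] holds the current estimate of the value of the state
    visited at step k.  If the episode ends within n steps of t, the result
    is simply the complete return, as discussed in the next paragraph.
    """
    T = len(rewards)                  # the episode ends at time step T
    steps = min(n, T - t)             # truncate at the episode's end
    G = 0.0
    for k in range(steps):            # sum of n (or fewer) discounted rewards
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                     # correct for the truncation with the
        G += (gamma ** n) * values[t + n]   # estimated value of s_{t+n}
    return G
```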
Of course, if the episode ends in fewer than $n$ steps, then the truncation in an $n$-step return occurs at the episode's end, resulting in the conventional complete return. In other words, if $T - t < n$, then $R_t^{(n)} = R_t^{(T-t)} = R_t$. Thus, the last $n$ $n$-step returns of any episode are always complete returns, and an infinite-step return is always a complete return. This definition enables us to treat Monte Carlo methods as the special case of infinite-step returns. All of this is consistent with the tricks for treating episodic and continuing tasks equivalently that we introduced in Section 3.4. There we chose to treat the terminal state as a state that always transitions to itself with zero reward. Under this trick, all $n$-step returns that last up to or past termination have the same value as the complete return.
An $n$-step backup is defined to be a backup toward the $n$-step return. In the tabular, state-value case, the increment to $V_t(s_t)$ (the estimated value of $s_t$ at time $t$), due to an $n$-step backup of $s_t$, is defined by
\[
\Delta V_t(s_t) = \alpha \Big[ R_t^{(n)} - V_t(s_t) \Big] ,
\]
where $\alpha$ is a positive step-size parameter, as usual. Of course, the increments to the estimated values of the other states are $\Delta V_t(s) = 0$, for all $s \neq s_t$. We define the $n$-step backup in terms of an increment, rather than as a direct update rule as we did in the previous chapter, in order to distinguish two different ways of making the updates. In on-line updating, the updates are done during the episode, as soon as the increment is computed. In this case we have
\[
V_{t+1}(s) = V_t(s) + \Delta V_t(s) ,
\]
for all $s \in \mathcal{S}$. This is the case considered in the previous chapter. In off-line updating, on the other hand, the increments are accumulated "on the side" and are not used to change value estimates until the end of the episode. In this case, $V_t(s)$ is constant within an episode, for all $s$. If its value in this episode is $V(s)$, then its new value in the next episode will be $V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$.
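The increment and the two updating schemes can be put together in a short tabular sketch. This is our own illustrative code, not an algorithm specified in the text; it reuses the `n_step_return` helper sketched above, and `V` is assumed to be a dictionary of value estimates:

```python
def n_step_td_episode(states, rewards, V, n, alpha, gamma, online=True):
    """Apply n-step TD prediction updates for one recorded episode.

    states[k] is the state s_k visited at step k (terminal state excluded),
    rewards[k] is r_{k+1}, and V maps states to current value estimates.
    With online=True each increment is applied as soon as it can be computed;
    with online=False the increments are accumulated "on the side" and
    applied only at the end of the episode.
    """
    T = len(rewards)
    V_start = dict(V)                       # snapshot used for off-line updating
    increments = {s: 0.0 for s in V}        # accumulated increments (off-line)

    for t in range(T):
        values = V if online else V_start   # which estimates form the target
        vals = [values[s] for s in states]
        G = n_step_return(rewards, vals, t, n, gamma)     # Eq. (7.1)
        delta = alpha * (G - values[states[t]])           # the increment
        if online:
            V[states[t]] += delta           # update during the episode
        else:
            increments[states[t]] += delta

    if not online:                          # apply all increments at episode end
        for s, d in increments.items():
            V[s] += d
    return V
```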
The expected value of all $n$-step returns is guaranteed to improve in a certain way over the current value function as an approximation to the true value function. For any $V$, the expected value of the $n$-step return using $V$ is guaranteed to be a better estimate of $V^\pi$ than $V$ is, in a worst-state sense. That is, the worst error under the new estimate is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V$:
\[
\max_s \Big| E_\pi\big\{ R_t^{(n)} \,\big|\, s_t = s \big\} - V^\pi(s) \Big| \;\le\; \gamma^n \max_s \big| V(s) - V^\pi(s) \big| . \tag{7.2}
\]
This is called the error reduction property of $n$-step returns.
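Equation (7.2) is not derived here, but a brief sketch (our own, under the definitions above) shows where the $\gamma^n$ factor comes from. Writing $V^\pi$ with its own $n$-step Bellman equation, $V^\pi(s) = E_\pi\{ r_{t+1} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V^\pi(s_{t+n}) \mid s_t = s \}$, and subtracting it from the expected $n$-step return, the first $n$ reward terms cancel in expectation:
\[
E_\pi\big\{ R_t^{(n)} \,\big|\, s_t = s \big\} - V^\pi(s)
\;=\; \gamma^n \, E_\pi\big\{ V(s_{t+n}) - V^\pi(s_{t+n}) \,\big|\, s_t = s \big\} .
\]
Taking absolute values and bounding the right-hand side by its worst state then gives (7.2).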
Because of the error reduction property, one can show formally that on-line and off-line TD prediction methods using $n$-step backups converge to the correct predictions under appropriate technical conditions. The $n$-step TD methods thus form a family of valid methods, with one-step TD methods and Monte Carlo methods as extreme members.
Nevertheless, $n$-step TD methods are rarely used because they are inconvenient to implement. Computing $n$-step returns requires waiting $n$ steps to observe the resultant rewards and states. For large $n$, this can become problematic, particularly in control applications. The significance of $n$-step TD methods is primarily for theory and for understanding related methods that are more conveniently implemented. In the next few sections we use the idea of $n$-step TD methods to explain and justify eligibility trace methods.
Example 7.1: $n$-step TD Methods on the Random Walk
Consider using $n$-step TD methods on the random walk task described in Example 6.2 and shown in Figure 6.5. Suppose the first episode progressed directly from the center state, C, to the right, through D and E, and then terminated on the right with a return of 1. Recall that the estimated values of all the states started at an intermediate value, $V(s) = 0.5$. As a result of this experience, a one-step method would change only the estimate for the last state, $V(E)$, which would be incremented toward 1, the observed return. A two-step method, on the other hand, would increment the values of the two states preceding termination: $V(D)$ and $V(E)$ would both be incremented toward 1. A three-step method, or any $n$-step method for $n > 2$, would increment the values of all three of the visited states toward 1, all by the same amount. Which $n$ is better? Figure 7.2 shows the results of a simple empirical assessment for a larger random walk process, with 19 states (and with a $-1$ outcome on the left, all values initialized to 0). Shown is the root mean-squared error in the predictions at the end of an episode, averaged over states, the first 10 episodes, and 100 repetitions of the whole experiment (the same sets of walks were used for all methods). Results are shown for on-line and off-line $n$-step TD methods with a range of values for $n$ and $\alpha$. Empirically, on-line methods with an intermediate value of $n$ seem to work best on this task. This illustrates how the generalization of TD and Monte Carlo methods to $n$-step methods can potentially perform better than either of the two extreme methods.
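A rough sketch of an experiment in this spirit, reusing the `n_step_td_episode` helper above, is given below. The walk generator, the true values used for the RMS error, and the default parameters are our own assumptions for illustration and are not meant to reproduce Figure 7.2 exactly:

```python
import random

def generate_walk(n_states=19):
    """One episode of the 19-state random walk: start in the middle and step
    left or right with equal probability; terminating off the left end gives
    reward -1, off the right end reward +1, and all other rewards are 0."""
    states, rewards = [], []
    s = n_states // 2
    while True:
        states.append(s)
        s += random.choice([-1, +1])
        if s < 0:
            rewards.append(-1.0); break
        if s >= n_states:
            rewards.append(+1.0); break
        rewards.append(0.0)
    return states, rewards

# True values of the 19 nonterminal states are -0.9, -0.8, ..., +0.9.
TRUE_VALUES = {s: (s + 1) / 10.0 - 1.0 for s in range(19)}

def rms_error(V):
    return (sum((V[s] - TRUE_VALUES[s]) ** 2 for s in V) / len(V)) ** 0.5

def run(n, alpha, episodes=10, online=True, gamma=1.0):
    """RMS error averaged over the first `episodes` episodes, one setting."""
    V = {s: 0.0 for s in range(19)}         # all values initialized to 0
    errors = []
    for _ in range(episodes):
        states, rewards = generate_walk()
        n_step_td_episode(states, rewards, V, n, alpha, gamma, online)
        errors.append(rms_error(V))
    return sum(errors) / len(errors)
```

Averaging `run(n, alpha)` over many independent repetitions, for a grid of $n$ and $\alpha$ values, yields curves of the kind plotted in Figure 7.2.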
Figure 7.2: Performance of $n$-step TD methods as a function of $\alpha$, for various values of $n$, on a 19-state random walk task. The performance measure shown is the root mean-squared (RMS) error between the true values of states and the values found by the learning methods, averaged over the 19 states, the first 10 trials, and 100 different sequences of walks.
Exercise 7.1
Why do you think a larger random walk task (19 states instead of 5) was used in
the examples of this chapter? Would a smaller walk have shifted the advantage to a
different value of $n$? How about the change in left-side outcome from 0 to $-1$? Would that have made any difference in the best value of $n$?
Exercise 7.2
Why do you think on-line methods worked better than off-line methods on the example
task?
Exercise 7.3
In the lower part of Figure 7.2, notice that the plot for $n = 3$ is different from the others, dropping to low performance at a much lower value of $\alpha$ than similar methods. In fact, the same was observed for $n = 5$, $n = 7$, and $n = 9$. Can you explain why this might have been so? In fact,
we are not sure ourselves.