What is the space of methods lying between Monte Carlo and TD methods? Consider
estimating $V^\pi$ from sample episodes generated using $\pi$. Monte Carlo
methods perform a backup for each state based on the entire sequence of observed
rewards from that state until the end of the episode. The backup of simple TD
methods, on the other hand, is based on just the one next reward, using the value
of the state one step later as a proxy for the remaining rewards. One kind of
intermediate method, then, would perform a backup based on an intermediate number
of rewards: more than one, but less than all of them until termination. For
example, a two-step backup would be based on the first two rewards and the estimated
value of the state two steps later. Similarly, we could have three-step backups,
four-step backups, and so on. Figure 7.1 diagrams the spectrum of
$n$-step backups for $V^\pi$, with one-step, simple TD backups on the left and
up-until-termination Monte Carlo backups on the right.
Figure 7.1: The spectrum ranging from the one-step backups of simple TD methods to the up-until-termination backups of Monte Carlo methods. In between are the $n$-step backups, based on $n$ steps of real rewards and the estimated value of the $n$th next state, all appropriately discounted.
The methods that use $n$-step backups are still TD methods because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but $n$ steps later. Methods in which the temporal difference extends over $n$ steps are called $n$-step TD methods. The TD methods introduced in the previous chapter all use one-step backups, and henceforth we call them one-step TD methods.
More formally, consider the backup applied to state $s_t$ as a result of the state–reward sequence, $s_t, r_{t+1}, s_{t+1}, r_{t+2}, \ldots, r_T, s_T$ (omitting the actions for simplicity). We know that in Monte Carlo backups the estimate $V_t(s_t)$ is updated in the direction of the complete return:
\[
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T ,
\]
where $T$ is the last time step of the episode. Let us call this quantity the target of the backup. Whereas in Monte Carlo backups the target is the return, in one-step backups the target is the first reward plus the discounted estimated value of the next state:
\[
R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1}) .
\]
This makes sense because $\gamma V_t(s_{t+1})$ takes the place of the remaining terms $\gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$, as we discussed in the previous chapter. Our point now is that this idea makes just as much sense after two steps as it does after one. The two-step target is
\[
R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}) ,
\]
where now $\gamma^2 V_t(s_{t+2})$ takes the place of the terms $\gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots + \gamma^{T-t-1} r_T$. In general, the $n$-step target is
\[
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n}) . \tag{7.1}
\]
This quantity is sometimes called the "corrected $n$-step truncated return" because it is a return truncated after $n$ steps and then approximately corrected for the truncation by adding the estimated value of the $n$th next state. That terminology is descriptive but a bit long. We instead refer to $R_t^{(n)}$ simply as the $n$-step return at time $t$.
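As a concrete illustration of Equation (7.1), here is a minimal Python sketch. The function name `n_step_return` and the representation of an episode as parallel lists of rewards and state-value estimates are our own choices for illustration, not part of the text:

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute the n-step return R_t^(n) of Equation (7.1).

    rewards[k] holds r_{k+1}, the reward received on the transition out of
    step k; values[k] holds the current estimate of the value of the state
    visited at step k.  If the episode ends within n steps of t, the result
    is simply the complete return, as discussed in the next paragraph.
    """
    T = len(rewards)                  # the episode ends at time step T
    steps = min(n, T - t)             # truncate at the episode's end
    G = 0.0
    for k in range(steps):            # sum of n (or fewer) discounted rewards
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                     # correct for the truncation with the
        G += (gamma ** n) * values[t + n]   # estimated value of s_{t+n}
    return G
```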
Of course, if the episode ends in fewer than $n$ steps, then the truncation in an $n$-step return occurs at the episode's end, resulting in the conventional complete return. In other words, if $T - t < n$, then $R_t^{(n)} = R_t^{(T-t)} = R_t$. Thus, the last $n$ $n$-step returns of any episode are always complete returns, and an infinite-step return is always a complete return. This definition enables us to treat Monte Carlo methods as the special case of infinite-step returns. All of this is consistent with the tricks for treating episodic and continuing tasks equivalently that we introduced in Section 3.4. There we chose to treat the terminal state as a state that always transitions to itself with zero reward. Under this trick, all $n$-step returns that last up to or past termination have the same value as the complete return.
An $n$-step backup is defined to be a backup toward the $n$-step return. In the tabular, state-value case, the increment to $V_t(s_t)$ (the estimated value of $s_t$ at time $t$), due to an $n$-step backup of $s_t$, is defined by
\[
\Delta V_t(s_t) = \alpha \Big[ R_t^{(n)} - V_t(s_t) \Big] ,
\]
where $\alpha$ is a positive step-size parameter, as usual. Of course, the increments to the estimated values of the other states are $\Delta V_t(s) = 0$, for all $s \neq s_t$. We define the $n$-step backup in terms of an increment, rather than as a direct update rule as we did in the previous chapter, in order to distinguish two different ways of making the updates. In on-line updating, the updates are done during the episode, as soon as the increment is computed. In this case we have
\[
V_{t+1}(s) = V_t(s) + \Delta V_t(s) ,
\]
for all $s \in \mathcal{S}$. This is the case considered in the previous chapter. In off-line updating, on the other hand, the increments are accumulated "on the side" and are not used to change value estimates until the end of the episode. In this case, $V_t(s)$ is constant within an episode, for all $s$. If its value in this episode is $V(s)$, then its new value in the next episode will be $V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$.
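The increment and the two updating schemes can be put together in a short tabular sketch. This is our own illustrative code, not an algorithm specified in the text; it reuses the `n_step_return` helper sketched above, and `V` is assumed to be a dictionary of value estimates:

```python
def n_step_td_episode(states, rewards, V, n, alpha, gamma, online=True):
    """Apply n-step TD prediction updates for one recorded episode.

    states[k] is the state s_k visited at step k (terminal state excluded),
    rewards[k] is r_{k+1}, and V maps states to current value estimates.
    With online=True each increment is applied as soon as it can be computed;
    with online=False the increments are accumulated "on the side" and
    applied only at the end of the episode.
    """
    T = len(rewards)
    V_start = dict(V)                       # snapshot used for off-line updating
    increments = {s: 0.0 for s in V}        # accumulated increments (off-line)

    for t in range(T):
        values = V if online else V_start   # which estimates form the target
        vals = [values[s] for s in states]
        G = n_step_return(rewards, vals, t, n, gamma)     # Eq. (7.1)
        delta = alpha * (G - values[states[t]])           # the increment
        if online:
            V[states[t]] += delta           # update during the episode
        else:
            increments[states[t]] += delta

    if not online:                          # apply all increments at episode end
        for s, d in increments.items():
            V[s] += d
    return V
```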
The expected value of all $n$-step returns is guaranteed to improve in a certain way over the current value function as an approximation to the true value function. For any $V$, the expected value of the $n$-step return using $V$ is guaranteed to be a better estimate of $V^\pi$ than $V$ is, in a worst-state sense. That is, the worst error under the new estimate is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V$:
\[
\max_s \Big| E_\pi\big\{ R_t^{(n)} \,\big|\, s_t = s \big\} - V^\pi(s) \Big| \;\le\; \gamma^n \max_s \big| V(s) - V^\pi(s) \big| . \tag{7.2}
\]
This is called the error reduction property of $n$-step returns.
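Equation (7.2) is not derived here, but a brief sketch (our own, under the definitions above) shows where the $\gamma^n$ factor comes from. Writing $V^\pi$ with its own $n$-step Bellman equation, $V^\pi(s) = E_\pi\{ r_{t+1} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V^\pi(s_{t+n}) \mid s_t = s \}$, and subtracting it from the expected $n$-step return, the first $n$ reward terms cancel in expectation:
\[
E_\pi\big\{ R_t^{(n)} \,\big|\, s_t = s \big\} - V^\pi(s)
\;=\; \gamma^n \, E_\pi\big\{ V(s_{t+n}) - V^\pi(s_{t+n}) \,\big|\, s_t = s \big\} .
\]
Taking absolute values and bounding the right-hand side by its worst state then gives (7.2).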
Because of the error reduction property, one can show formally that on-line and off-line TD prediction methods using $n$-step backups converge to the correct predictions under appropriate technical conditions. The $n$-step TD methods thus form a family of valid methods, with one-step TD methods and Monte Carlo methods as extreme members.
Nevertheless, $n$-step TD methods are rarely used because they are inconvenient to implement. Computing $n$-step returns requires waiting $n$ steps to observe the resultant rewards and states. For large $n$, this can become problematic, particularly in control applications. The significance of $n$-step TD methods is primarily for theory and for understanding related methods that are more conveniently implemented. In the next few sections we use the idea of $n$-step TD methods to explain and justify eligibility trace methods.
Example 7.1: $n$-step TD Methods on the Random Walk
Consider using $n$-step TD methods on the random walk task described in Example 6.2 and shown in Figure 6.5. Suppose the first episode progressed directly from the center state, C, to the right, through D and E, and then terminated on the right with a return of 1. Recall that the estimated values of all the states started at an intermediate value, $V(s) = 0.5$. As a result of this experience, a one-step method would change only the estimate for the last state, $V(E)$, which would be incremented toward 1, the observed return. A two-step method, on the other hand, would increment the values of the two states preceding termination: $V(D)$ and $V(E)$ would both be incremented toward 1. A three-step method, or any $n$-step method for $n > 2$, would increment the values of all three of the visited states toward 1, all by the same amount. Which $n$ is better? Figure 7.2 shows the results of a simple empirical assessment for a larger random walk process, with 19 states (and with a $-1$ outcome on the left, all values initialized to 0). Shown is the root mean-squared error in the predictions at the end of an episode, averaged over states, the first 10 episodes, and 100 repetitions of the whole experiment (the same sets of walks were used for all methods). Results are shown for on-line and off-line $n$-step TD methods with a range of values for $n$ and $\alpha$. Empirically, on-line methods with an intermediate value of $n$ seem to work best on this task. This illustrates how the generalization of TD and Monte Carlo methods to $n$-step methods can potentially perform better than either of the two extreme methods.
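A rough sketch of an experiment in this spirit, reusing the `n_step_td_episode` helper above, is given below. The walk generator, the true values used for the RMS error, and the default parameters are our own assumptions for illustration and are not meant to reproduce Figure 7.2 exactly:

```python
import random

def generate_walk(n_states=19):
    """One episode of the 19-state random walk: start in the middle and step
    left or right with equal probability; terminating off the left end gives
    reward -1, off the right end reward +1, and all other rewards are 0."""
    states, rewards = [], []
    s = n_states // 2
    while True:
        states.append(s)
        s += random.choice([-1, +1])
        if s < 0:
            rewards.append(-1.0); break
        if s >= n_states:
            rewards.append(+1.0); break
        rewards.append(0.0)
    return states, rewards

# True values of the 19 nonterminal states are -0.9, -0.8, ..., +0.9.
TRUE_VALUES = {s: (s + 1) / 10.0 - 1.0 for s in range(19)}

def rms_error(V):
    return (sum((V[s] - TRUE_VALUES[s]) ** 2 for s in V) / len(V)) ** 0.5

def run(n, alpha, episodes=10, online=True, gamma=1.0):
    """RMS error averaged over the first `episodes` episodes, one setting."""
    V = {s: 0.0 for s in range(19)}         # all values initialized to 0
    errors = []
    for _ in range(episodes):
        states, rewards = generate_walk()
        n_step_td_episode(states, rewards, V, n, alpha, gamma, online)
        errors.append(rms_error(V))
    return sum(errors) / len(errors)
```

Averaging `run(n, alpha)` over many independent repetitions, for a grid of $n$ and $\alpha$ values, yields curves of the kind plotted in Figure 7.2.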
Figure 7.2: Performance of $n$-step TD methods as a function of $\alpha$, for various values of $n$, on a 19-state random walk task. The performance measure shown is the root mean-squared (RMS) error between the true values of states and the values found by the learning methods, averaged over the 19 states, the first 10 trials, and 100 different sequences of walks.
Exercise 7.1
Why do you think a larger random walk task (19 states instead of 5) was used in
the examples of this chapter? Would a smaller walk have shifted the advantage to a
different value of $n$? How about the change in left-side outcome from 0 to $-1$? Would that have made any difference in the best value of $n$?
Exercise 7.2
Why do you think on-line methods worked better than off-line methods on the example
task?
Exercise 7.3
In the lower part of Figure 7.2, notice that the plot for $n = 3$ is different from the others, dropping to low performance at a much lower value of $\alpha$ than similar methods. In fact, the same was observed for $n = 5$, $n = 7$, and $n = 9$. Can you explain why this might have been so? In fact,
we are not sure ourselves.