The $\lambda$-return can be significantly generalized beyond what we have described so
far by allowing $\lambda$ to vary from step to step, that is, by redefining the trace
update as
$$
e_t(s) =
\begin{cases}
\gamma \lambda_t \, e_{t-1}(s) & \text{if } s \neq s_t; \\
\gamma \lambda_t \, e_{t-1}(s) + 1 & \text{if } s = s_t,
\end{cases}
$$
where $\lambda_t$ denotes the value of $\lambda$ at time $t$. This is an advanced topic
because the added generality has never been used in practical applications,
but it is interesting theoretically and may yet prove useful. For example, one
idea is to vary $\lambda$ as a function of state: $\lambda_t = \lambda(s_t)$. If a state's
value estimate is believed to be known with high certainty, then it makes sense to
use that estimate fully, ignoring whatever states and rewards are received after
it. This corresponds to cutting off all the traces once this state has been
reached, that is, to choosing the $\lambda$ for the certain state to be zero or very
small. Similarly, states whose value estimates are highly uncertain, perhaps
because even the state estimate is unreliable, can be given $\lambda$s near 1.
This causes their estimated values to have little effect on any updates. They
are "skipped over" until a state that is known better is encountered. Some of
these ideas were explored formally by Sutton and Singh
(1994).
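To make the backward view concrete, here is a minimal Python sketch, not from the text: the function name, the tabular arrays `V` and `e`, and the per-state array `lam` are illustrative assumptions. It performs one accumulating-trace TD update in which the trace decay at time $t$ uses $\lambda_t = \lambda(s_t)$:

```python
import numpy as np

def td_update_variable_lambda(V, e, s, r, s_next, lam, alpha, gamma):
    """One tabular TD update with accumulating traces and state-dependent lambda.

    V, e   : numpy arrays of value estimates and eligibility traces (one entry per state)
    s      : current state s_t (integer index); r : reward r_{t+1}; s_next : state s_{t+1}
    lam    : numpy array with lam[x] = lambda(x) for every state x (an assumed design choice)
    """
    delta = r + gamma * V[s_next] - V[s]   # TD error delta_t

    # Backward view: all traces decay by gamma * lambda_t, with lambda_t = lambda(s_t),
    # and the trace of the state just visited is incremented by 1.
    e *= gamma * lam[s]
    e[s] += 1.0

    V += alpha * delta * e                 # every state moves in proportion to its trace
    return V, e
```

Compared with the constant-$\lambda$ tabular algorithm, the decay here is applied at the start of the step so that it can use $\lambda_t = \lambda(s_t)$, the $\lambda$ of the state just entered; with a constant $\lambda$ the two orderings coincide.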
The eligibility trace equation above is the backward view of
variable $\lambda$s. The corresponding forward view is a
more general definition of the $\lambda$-return:
$$
R_t^{\lambda} = \sum_{k=t+1}^{T-1} R_t^{(k-t)} (1 - \lambda_k) \prod_{i=t+1}^{k-1} \lambda_i
\;+\; R_t \prod_{i=t+1}^{T-1} \lambda_i ,
$$
where $R_t^{(n)}$ is the $n$-step return, $R_t$ is the complete return, and $T$ is the time of termination.
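As a check on this definition, here is a small sketch, again illustrative rather than from the text: the episode is assumed to be given as lists `rewards` and `states`, with `lam[x]` supplying $\lambda(x)$. It computes the variable-$\lambda$ return by accumulating exactly the weights above:

```python
def variable_lambda_return(t, rewards, states, V, lam, gamma):
    """Forward-view lambda-return from time t with per-state lambda values.

    states[k]  is s_k; rewards[k] is r_{k+1}, the reward following s_k.
    The episode terminates at time T = len(rewards); V holds value estimates.
    """
    T = len(rewards)
    lam_ret = 0.0
    weight = 1.0          # running product lambda_{t+1} * ... * lambda_{k-1}
    n_step = 0.0          # n-step return R_t^{(n)}, built up reward by reward
    for k in range(t + 1, T):
        n = k - t
        n_step += gamma ** (n - 1) * rewards[k - 1]      # add r_{t+n}
        corrected = n_step + gamma ** n * V[states[k]]   # R_t^{(n)}
        lam_k = lam[states[k]]
        lam_ret += weight * (1.0 - lam_k) * corrected    # one term of the sum
        weight *= lam_k
    # the remaining weight multiplies the complete return R_t
    full_return = n_step + gamma ** (T - 1 - t) * rewards[T - 1]
    lam_ret += weight * full_return
    return lam_ret
```

With a constant $\lambda$ the weight on the $n$-step return reduces to $(1-\lambda)\lambda^{n-1}$, recovering the ordinary $\lambda$-return.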
Exercise 7.10
Prove that the forward and backward views of off-line TD($\lambda$) remain
equivalent under their new definitions with variable $\lambda$ given in this section.
Follow the example of the proof in Section 7.4.