)">
 
 
 
 
 
 
 
  
 )
     Contents
)
     Contents 
 )
)
Two different methods have been proposed that combine eligibility traces and
Q-learning; we call them Watkins's Q(λ) and Peng's Q(λ), after the researchers
who first proposed them. First we describe Watkins's Q(λ).
Recall that Q-learning is an off-policy method, meaning that the
policy learned about need not be the same as the one used to select actions. In
particular, Q-learning learns about the greedy policy while it typically follows a
policy involving exploratory actions--occasional selections of actions that are
suboptimal according to Q_t. Because of this, special care is required when
introducing eligibility traces.
Suppose we are backing up the state-action pair s_t, a_t at time t. Suppose that
on the next two time steps the agent selects the greedy action, but on the third,
at time t+3, the agent selects an exploratory, nongreedy action. In learning about
the value of the greedy policy at s_t, a_t, we can use subsequent experience only
as long as the greedy policy is being followed. Thus, we can use the one-step and
two-step returns, but not, in this case, the three-step return. The n-step returns
for all n ≥ 3 no longer have any necessary relationship to the greedy policy.
Thus, unlike TD(λ) or Sarsa(λ), Watkins's Q(λ) does not look ahead all the
way to the end of the episode in its backup. It only looks ahead
as far as the next exploratory action. Aside from this difference, however,
Watkins's Q(λ) is much like TD(λ) and Sarsa(λ). Their lookahead
stops at episode's end, whereas Q(λ)'s lookahead stops at the first exploratory
action, or at episode's end if there are no exploratory actions before that.
Actually, to be more precise, one-step Q-learning and Watkins's Q(λ) both look
one action past the first exploration, using their knowledge of the
action values. For example, suppose the first action, a_{t+1}, is
exploratory. Watkins's Q(λ) would still do the one-step update of
Q_t(s_t, a_t) toward r_{t+1} + γ max_a Q_t(s_{t+1}, a). In general, if a_{t+n} is
the first exploratory action, then the longest backup is toward

$$
r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} \max_a Q_t(s_{t+n}, a).
$$

(Figure: the backup diagram for Watkins's Q(λ), showing all the component backups.)
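As a concrete illustration, the following minimal Python sketch computes the target of this longest component backup from a recorded trajectory. The function name and trajectory layout are assumptions for illustration: rewards[k] holds r_{t+1+k} and states[k] holds s_{t+1+k}, for k = 0, ..., n-1, where a_{t+n} is the first exploratory action after time t.

import numpy as np

def longest_backup_target(rewards, states, Q, gamma, n):
    # Target of the longest backup in Watkins's Q(lambda):
    #   r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
    #           + gamma^n * max_a Q(s_{t+n}, a)
    # rewards[k] = r_{t+1+k} and states[k] = s_{t+1+k}, for k = 0..n-1.
    target = sum(gamma**k * rewards[k] for k in range(n))
    target += gamma**n * np.max(Q[states[n - 1]])
    return target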
The mechanistic or backward view of Watkins's Q(λ) is also
very simple. Eligibility traces are used just as in Sarsa(λ), except that they are
set to zero whenever an exploratory (nongreedy) action is taken. The trace
update is best thought of as occurring in two steps. First, the traces for all
state-action pairs are either decayed by γλ or, if an exploratory
action was taken, set to 0. Second, the trace corresponding to the current state
and action is incremented by 1. The overall result is

$$
e_t(s,a) = I_{s s_t} \cdot I_{a a_t} +
\begin{cases}
\gamma \lambda \, e_{t-1}(s,a) & \text{if } Q_{t-1}(s_t, a_t) = \max_a Q_{t-1}(s_t, a); \\
0 & \text{otherwise,}
\end{cases}
$$

where I_{xy} is an identity indicator function, equal to 1 if x = y and 0 otherwise.
The rest of the algorithm is defined by

$$
Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a),
$$

where

$$
\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t).
$$
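The backward view translates almost directly into code. The following minimal tabular sketch assumes an environment whose reset() returns a state index and whose step(action) returns (next_state, reward, done), together with ε-greedy action selection; the names env, num_states, and num_actions are placeholders. It follows the update rules above, but it is only a sketch, not a complete specification of the algorithm.

import numpy as np

def watkins_q_lambda(env, num_states, num_actions, num_episodes=500,
                     alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        e = np.zeros_like(Q)                       # eligibility traces
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, num_actions)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, epsilon, num_actions)
            greedy_a2 = int(np.argmax(Q[s2]))
            # TD error uses the greedy action's value (off-policy target).
            delta = r - Q[s, a] + (0.0 if done else gamma * Q[s2, greedy_a2])
            e[s, a] += 1.0                         # accumulate trace for (s, a)
            Q += alpha * delta * e                 # update all pairs via their traces
            if Q[s2, a2] == Q[s2, greedy_a2]:
                e *= gamma * lam                   # decay traces after a greedy action
            else:
                e[:] = 0.0                         # cut traces at an exploratory action
            s, a = s2, a2
    return Q

def epsilon_greedy(Q, s, epsilon, num_actions):
    # Pick a random action with probability epsilon, else a greedy one.
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(Q[s]))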
Unfortunately, cutting off traces every time an exploratory action
is taken loses much of the advantage of using eligibility traces. If
exploratory actions are frequent, as they often are early in learning, then
only rarely will backups of more than one or two steps be done, and learning may
be little faster than one-step Q-learning. Peng's Q(λ) is an alternate version of
Q(λ) meant to remedy this. Peng's Q(λ) can be thought of as a hybrid of
Sarsa(λ) and Watkins's Q(λ).
 
Conceptually, Peng's Q(λ) uses the mixture of backups shown
in Figure 7.15. Unlike Q-learning,
there is no distinction between exploratory and greedy
actions. Each component backup is over many steps of
actual experiences, and all but the last are capped by a final
maximization over actions. The component backups, then, are
neither on-policy nor off-policy. The earlier
transitions of each are on-policy, whereas the last (fictitious)
transition uses the greedy policy. As a
consequence, for a fixed nongreedy policy, Q_t converges to neither
Q^π nor Q^* under Peng's Q(λ), but to some hybrid of the
two. However, if the policy is gradually made more greedy, then the method may
still converge to Q^*. As of this writing this has
not yet been proved. Nevertheless, the method performs
well empirically. Most studies have shown it
performing significantly better than Watkins's Q(λ) and
almost as well as Sarsa(λ).
On the other hand, Peng's Q(λ) cannot be implemented as simply as Watkins's
Q(λ). For a complete description of the needed implementation, see Peng and
Williams (1994, 1996). One could imagine yet a third version of
Q(λ), let us call it naive Q(λ), that is just like Watkins's Q(λ) except
that the traces are not set to zero on exploratory actions. This method might
have some of the advantages of Peng's Q(λ), but without the
complex implementation. We know of no experience with this method, but
perhaps it is not as naive as one might at first suppose.
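In code, naive Q(λ) would differ from Watkins's Q(λ) only in the trace update. The following sketch of the two-step trace update described above makes the contrast explicit; the naive flag and the function name are hypothetical, for illustration.

def update_traces(e, s, a, took_greedy, gamma, lam, naive=False):
    # e is a numpy array of traces, shape (num_states, num_actions).
    # Step 1: decay all traces by gamma*lam, or cut them to zero if the
    # action just taken was exploratory; naive Q(lambda) never cuts them.
    if took_greedy or naive:
        e *= gamma * lam
    else:
        e[:] = 0.0
    # Step 2: increment the trace for the current state-action pair.
    e[s, a] += 1.0
    return e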
 
 
 
 
 
 
 
  