How can eligibility traces be used not just for prediction, as in TD(), but for control? As usual, the main idea of one popular approach is simply to learn action values, , rather than state values, . In this section we show how eligibility traces can be combined with Sarsa in a straightforward way to produce an on-policy TD control method. The eligibility trace version of Sarsa we call Sarsa(), and the original version presented in the previous chapter we henceforth call one-step Sarsa.
The idea in Sarsa() is to apply the TD() prediction method to state-action
pairs rather than to states.
Obviously, then, we need a trace not
just for each state, but for each state-action pair. Let denote the
trace for state-action pair
. Otherwise the method is just like TD(), substituting state-action
variables for state variables-- for and for :
Figure 7.10 shows the backup diagram for Sarsa(). Notice the similarity to the diagram of the TD() algorithm (Figure 7.3). The first backup looks ahead one full step, to the next state-action pair, the second looks ahead two steps, and so on. A final backup is based on the complete return. The weighting of each backup is just as in TD() and the -return algorithm.
One-step Sarsa and Sarsa() are on-policy algorithms, meaning that they approximate , the action values for the current policy, , then improve the policy gradually based on the approximate values for the current policy. The policy improvement can be done in many different ways, as we have seen throughout this book. For example, the simplest approach is to use the -greedy policy with respect to the current action-value estimates. Figure 7.11 shows the complete Sarsa() algorithm for this case.
Example 7.4: Traces in Gridworld The use of eligibility traces can substantially increase the efficiency of control algorithms. The reason for this is illustrated by the gridworld example in Figure 7.12. The first panel shows the path taken by an agent in a single episode, ending at a location of high reward, marked by the *. In this example the values were all initially 0, and all rewards were zero except for a positive reward at the * location. The arrows in the other two panels show which action values were strengthened as a result of this path by one-step Sarsa and Sarsa() methods. The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the trace method strengthens many actions of the sequence. The degree of strengthening (indicated by the size of the arrows) falls off (according to ) with steps from the reward. In this example, and .