Next: 6.5 Q-Learning: Off-Policy TD Up: 6. Temporal-Difference Learning Previous: 6.3 Optimality of TD(0) Contents

6.4 Sarsa: On-Policy TD Control

We turn now to the use of TD prediction methods for the control problem. As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part. As with Monte Carlo methods, we face the need to trade off exploration and exploitation, and again approaches fall into two main classes: on-policy and off-policy. In this section we present an on-policy TD control method.

The first step is to learn an action-value function rather than a state-value function. In particular, for an on-policy method we must estimate for the current behavior policy and for all states and actions . This can be done using essentially the same TD method described above for learning . Recall that an episode consists of an alternating sequence of states and state-action pairs:

In the previous section we considered transitions from state to state and learned the values of states. Now we consider transitions from state-action pair to state-action pair, and learn the value of state-action pairs. Formally these cases are identical: they are both Markov chains with a reward process. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values:

(6.5)

This update is done after every transition from a nonterminal state

. If

is terminal, then

is defined as zero. This rule uses every element of the quintuple of events,

, that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm.

It is straightforward to design an on-policy control algorithm based on the Sarsa prediction method. As in all on-policy methods, we continually estimate for the behavior policy , and at the same time change toward greediness with respect to . The general form of the Sarsa control algorithm is given in Figure 6.9.

**Figure 6.9:** Sarsa: An on-policy TD control algorithm.

The convergence properties of the Sarsa algorithm depend on the nature of the policy's dependence on . For example, one could use $\varepsilon$ -greedy or $\varepsilon$ -soft policies. According to Satinder Singh (personal communication), Sarsa converges with probability to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\varepsilon$ -greedy policies by setting ), but this result has not yet been published in the literature.

Example 6.5: Windy Gridworld Figure 6.10 shows a standard gridworld, with start and goal states, but with one difference: there is a crosswind upward through the middle of the grid. The actions are the standard four--up, down, right, and left--but in the middle region the resultant next states are shifted upward by a "wind," the strength of which varies from column to column. The strength of the wind is given below each column, in number of cells shifted upward. For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as an undiscounted episodic task, with constant rewards of

until the goal state is reached. Figure 6.11 shows the result of applying $\varepsilon$ -greedy Sarsa to this task, with

, and the initial values

for all

. The increasing slope of the graph shows that the goal is reached more and more quickly over time. By 8000 time steps, the greedy policy (shown inset) was long since optimal; continued $\varepsilon$ -greedy exploration kept the average episode length at about 17 steps, two less than the minimum of 15. Note that Monte Carlo methods cannot easily be used on this task because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. Step-by-step learning methods such as Sarsa do not have this problem because they quickly learn during the episode that such policies are poor, and switch to something else.

**Figure 6.10:** Gridworld in which movement is altered by a location-dependent, upward "wind."

**Figure 6.11:** Results of Sarsa applied to the windy gridworld.

Exercise 6.6: Windy Gridworld with King's Moves Resolve the windy gridworld task assuming eight possible actions, including the diagonal moves, rather than the usual four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

Exercise 6.7: Stochastic Wind Resolve the windy gridworld task with King's moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. That is, a third of the time you move exactly according to these values, as in the previous exercise, but also a third of the time you move one cell above that, and another third of the time you move one cell below that. For example, if you are one cell to the right of the goal and you move left, then one-third of the time you move one cell above the goal, one-third of the time you move two cells above the goal, and one-third of the time you move to the goal.

Exercise 6.8 What is the backup diagram for Sarsa?

Next: 6.5 Q-Learning: Off-Policy TD Up: 6. Temporal-Difference Learning Previous: 6.3 Optimality of TD(0) Contents

Mark Lee 2005-01-04