
6.4 Sarsa: On-Policy TD Control

We turn now to the use of TD prediction methods for the control problem. As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part. As with Monte Carlo methods, we face the need to trade off exploration and exploitation, and again approaches fall into two main classes: on-policy and off-policy. In this section we present an on-policy TD control method.

The first step is to learn an action-value function rather than a state-value function. In particular, for an on-policy method we must estimate Q^π(s,a) for the current behavior policy π and for all states s and actions a. Happily, we can learn Q^π using essentially the same TD method as described above for learning V^π. Recall that an episode consists of an alternating sequence of states and state-action pairs:

    ..., s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, ...

In the previous section we considered transitions from state to state and learned the value of states. But the relationship between states and state-action pairs is symmetrical. Now we consider transitions from state-action pair to state-action pair, and learn the value of state-action pairs. Formally these cases are identical: they are both Markov chains with a reward process. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

This update is done after every transition from a nonterminal state s_t. If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) is defined as zero. This rule uses every element of the quintuple of events, (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm.
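As a concrete illustration, here is a minimal Python sketch of this single backup. The function name sarsa_update and the use of a dictionary for Q are assumptions of this sketch, not part of the text.

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
        # Q is a dictionary mapping (state, action) pairs to value estimates.
        # Pass a_next=None when s_next is terminal, so Q(s', a') is treated as zero.
        q_sa = Q.get((s, a), 0.0)
        q_next = 0.0 if a_next is None else Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)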

It is straightforward to design an on-policy control algorithm based on the Sarsa prediction method. As in all on-policy methods, we continually estimate Q^π for the behavior policy π, and at the same time change π toward greediness with respect to Q^π. The general form of the Sarsa control algorithm is given in Figure 6.9.


    Initialize Q(s,a) arbitrarily
    Repeat (for each episode):
        Initialize s
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Repeat (for each step of episode):
            Take action a, observe r, s'
            Choose a' from s' using policy derived from Q (e.g., ε-greedy)
            Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') − Q(s,a) ]
            s ← s'; a ← a'
        until s is terminal

  
Figure 6.9: Sarsa: An on-policy TD control algorithm.
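For readers who prefer running code, the following Python sketch mirrors the algorithm in Figure 6.9. It assumes a generic env object with reset() and step(a) methods (returning the next state, the reward, and a termination flag) and a list of actions; these names and the parameter defaults are illustrative assumptions, not part of the book.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, epsilon):
        # With probability epsilon explore; otherwise act greedily with respect to Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, episodes=200, alpha=0.5, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)                    # Initialize Q(s,a) to zero for all s, a
        for _ in range(episodes):                 # Repeat (for each episode)
            s = env.reset()                       # Initialize s
            a = epsilon_greedy(Q, s, actions, epsilon)
            done = False
            while not done:                       # Repeat (for each step of episode)
                s_next, r, done = env.step(a)     # Take action a, observe r, s'
                if done:
                    target = r                    # Q of a terminal state is defined as zero
                else:
                    a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                    target = r + gamma * Q[(s_next, a_next)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                if not done:
                    s, a = s_next, a_next         # s <- s'; a <- a'
        return Q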

The convergence properties of the Sarsa algorithm depend on the nature of the policy's dependence on Q. For example, one could use ε-greedy or ε-soft policies. According to Satinder Singh (personal communication), Sarsa converges with probability one to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with ε-greedy policies by setting ε = 1/t), but this result has not yet been published in the literature.
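As a small, hedged illustration of such a schedule, the sketch below decays ε with a hypothetical step or episode counter t; the function name is an assumption of this example.

    def epsilon_schedule(t):
        # epsilon_t = 1/t: every action keeps nonzero probability at any finite t,
        # yet the policy becomes greedy in the limit as t grows.
        return 1.0 / max(t, 1)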

Example: Windy Gridworld. Figure 6.10 shows a standard gridworld, with start and goal states, but with one difference: there is a crosswind upward through the middle of the grid. The actions are the standard four, up, down, right, and left, but in the middle region the resultant next states are shifted upward by a ``wind," the strength of which varies from column to column. The strength of the wind is given below each column, in number of cells shifted upward. For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as an undiscounted episodic task, with constant rewards of -1 until the goal state is reached. Figure 6.11 shows the result of applying ε-greedy Sarsa to this task, with ε = 0.1, α = 0.5, and initial values Q(s,a) = 0 for all s, a. The increasing slope of the graph shows that the goal is reached more and more quickly over time. By 8000 time steps, the greedy policy (shown inset) was long since optimal; continued ε-greedy exploration kept the average episode length at about 17 steps, two more than the minimum of 15. Note that Monte Carlo methods cannot easily be used on this task because termination is not guaranteed for all policies. If a policy were ever found that caused the agent to stay in the same state, then the next episode would never end. Step-by-step learning methods such as Sarsa do not have this problem because they quickly learn during the episode that such policies are poor, and switch to something else.

  
Figure 6.10: In this gridworld, movement is altered by a location-dependent, upward ``wind."

  
Figure 6.11: Results of Sarsa applied to the windy gridworld. The slope shows the rate at which the goal was reached---approximately once every 17 steps at the end. The inset shows the greedy policy, which is optimal.
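To make the example reproducible, here is a minimal Python sketch of the windy gridworld as an environment usable with the sarsa() sketch above. The 7x10 grid, the per-column wind strengths, and the start and goal positions follow the usual description of this task (Figure 6.10); treat the class and method names, and any detail not restated in the text here, as assumptions of this sketch.

    class WindyGridworld:
        # 7 rows x 10 columns; wind strength (cells pushed upward) for each column.
        WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        START, GOAL = (3, 0), (3, 7)
        ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'right': (0, 1), 'left': (0, -1)}

        def reset(self):
            self.pos = self.START
            return self.pos

        def step(self, action):
            row, col = self.pos
            d_row, d_col = self.ACTIONS[action]
            # Apply the chosen move plus the upward wind of the current column,
            # then keep the agent on the grid.
            row = min(max(row + d_row - self.WIND[col], 0), 6)
            col = min(max(col + d_col, 0), 9)
            self.pos = (row, col)
            return self.pos, -1, self.pos == self.GOAL   # constant reward of -1 per step

For example, Q = sarsa(WindyGridworld(), list(WindyGridworld.ACTIONS)) runs ε-greedy Sarsa on this task with the parameter values given above.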

Exercise: Windy Gridworld with King's Moves. Re-solve the windy gridworld task assuming eight possible actions, including the diagonal moves, rather than the usual four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

Exercise: Stochastic Wind. Re-solve the windy gridworld task with King's moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by one from the mean values given for each column. That is, a third of the time you move exactly according to these values, as in the previous exercise, but also a third of the time you move one cell above that, and another third of the time you move one cell below that. For example, if you are one cell to the right of the goal and you move left, then one third of the time you move one cell above the goal, one third of the time you move two cells above the goal, and one third of the time you move to the goal.

Exercise: What is the backup diagram for Sarsa?





