How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods are distinguished in that they attempt to evaluate and improve the same policy that they use to make decisions. In this section we present an on-policy Monte Carlo control method to illustrate the idea.
In on-policy control methods the policy is soft, meaning that $\pi(s,a) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. There are many possible variations on on-policy methods. One possibility is to gradually shift the policy toward a deterministic optimal policy. Many of the methods discussed in Chapter 2 provide mechanisms for this. The on-policy method we present in this section uses $\varepsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\varepsilon$ they instead select an action at random. That is, all non-greedy actions are given the minimal probability of selection, $\frac{\varepsilon}{|\mathcal{A}(s)|}$, and the remaining bulk of the probability, $1-\varepsilon+\frac{\varepsilon}{|\mathcal{A}(s)|}$, is given to the greedy action. The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft policies, defined as policies for which $\pi(s,a) \geq \frac{\varepsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\varepsilon > 0$. Among $\varepsilon$-soft policies, $\varepsilon$-greedy policies are in some sense those that are closest to greedy.
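As a concrete illustration, here is a minimal sketch in Python of the probabilities an $\varepsilon$-greedy policy assigns to the actions in a single state, assuming the estimated action values are held in a NumPy array (the function name `epsilon_greedy_probs` is ours, not the book's):

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy in one state:
    every action gets the minimal probability epsilon/|A(s)|, and the
    greedy action gets the remaining 1 - epsilon on top of that."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# With epsilon = 0.1 and three actions, the greedy action gets
# 1 - 0.1 + 0.1/3 (about 0.93) and each other action gets 0.1/3.
probs = epsilon_greedy_probs(np.array([1.0, 2.5, 0.3]), epsilon=0.1)
action = np.random.choice(len(probs), p=probs)
```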
The overall idea of on-policy Monte Carlo control is still that of
GPI. As in Monte Carlo ES, we use first-visit MC methods to
estimate the action-value function for the current policy. Without the
assumption of exploring starts, however, we cannot simply improve the policy
by making it greedy with respect to the current value function, because that
would prevent further exploration of non-greedy actions. Fortunately, GPI does
not require that the policy be taken all the way to a greedy
policy, only that it be moved toward a greedy policy. In our
on-policy method we will move it only to an $\varepsilon$-greedy policy. For any $\varepsilon$-soft policy, $\pi$, any $\varepsilon$-greedy policy with respect to $Q^{\pi}$ is guaranteed to be better than or equal to $\pi$.
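As a sketch of what this policy improvement step could look like in code (a tabular illustration under the same assumptions as above, with `Q` a hypothetical table of estimated action values), the whole policy can be replaced by the $\varepsilon$-greedy policy with respect to `Q`:

```python
import numpy as np

def epsilon_greedy_policy(Q, epsilon):
    """Return the epsilon-greedy policy with respect to a tabular Q.
    Q has shape (n_states, n_actions); the result is a matrix of
    action probabilities, one epsilon-soft distribution per state."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), epsilon / n_actions)
    pi[np.arange(n_states), np.argmax(Q, axis=1)] += 1.0 - epsilon
    return pi
```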
That any $\varepsilon$-greedy policy with respect to $Q^{\pi}$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem. Let $\pi'$ be the $\varepsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s \in \mathcal{S}$:

$$
\begin{aligned}
Q^{\pi}(s,\pi'(s)) &= \sum_a \pi'(s,a)\, Q^{\pi}(s,a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^{\pi}(s,a) + (1-\varepsilon) \max_a Q^{\pi}(s,a) \\
&\geq \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^{\pi}(s,a) + (1-\varepsilon) \sum_a \frac{\pi(s,a) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1-\varepsilon}\, Q^{\pi}(s,a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^{\pi}(s,a) - \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^{\pi}(s,a) + \sum_a \pi(s,a)\, Q^{\pi}(s,a) \\
&= V^{\pi}(s). \qquad (5.2)
\end{aligned}
$$

(The inequality holds because the sum in the third line is a weighted average with nonnegative weights summing to one, and as such it can be no greater than the largest value averaged, $\max_a Q^{\pi}(s,a)$.) Thus, by the policy improvement theorem, $\pi' \geq \pi$ (i.e., $V^{\pi'}(s) \geq V^{\pi}(s)$ for all $s \in \mathcal{S}$). We now prove that equality can hold only when both $\pi'$ and $\pi$ are optimal among the $\varepsilon$-soft policies, i.e., that they are better than or equal to all other $\varepsilon$-soft policies.
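The inequality above can be checked numerically at a single state. The sketch below (purely illustrative; the arrays `q` and `pi_soft` are made up) verifies that the expected action value under the $\varepsilon$-greedy policy is at least the expected action value under an arbitrary $\varepsilon$-soft policy:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, n_actions = 0.1, 4
q = rng.normal(size=n_actions)                 # Q^pi(s, .) at one state

# An arbitrary epsilon-soft policy at this state: each action receives
# at least epsilon/|A(s)|; the remaining 1 - epsilon is split at random.
pi_soft = epsilon / n_actions + (1.0 - epsilon) * rng.dirichlet(np.ones(n_actions))

# The epsilon-greedy policy with respect to q at this state.
pi_greedy = np.full(n_actions, epsilon / n_actions)
pi_greedy[np.argmax(q)] += 1.0 - epsilon

# Q^pi(s, pi'(s)) >= V^pi(s): the weighted average of the action values
# under pi' is at least that under any epsilon-soft pi.
assert pi_greedy @ q >= pi_soft @ q - 1e-12
```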
Consider a new environment that is just like the original environment, except with the requirement that policies be $\varepsilon$-soft ``moved inside'' the environment. The new environment has the same action and state set as the original and behaves as follows. If in state $s$ and taking action $a$, then with probability $1-\varepsilon$ the new environment behaves exactly like the old environment. With probability $\varepsilon$ it re-picks the action at random, with equal probabilities, and then behaves like the old environment with the new, random action. The best one can do in this new environment with general policies is the same as the best one could do in the original environment with $\varepsilon$-soft policies. Let $\widetilde{V}^*$ and $\widetilde{Q}^*$ denote the optimal value functions for the new environment. Then a policy $\pi$ is optimal among $\varepsilon$-soft policies if and only if $V^{\pi} = \widetilde{V}^*$. From the definition of $\widetilde{V}^*$ we know that it is the unique solution to

$$
\begin{aligned}
\widetilde{V}^*(s) &= (1-\varepsilon) \max_a \widetilde{Q}^*(s,a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \widetilde{Q}^*(s,a) \\
&= (1-\varepsilon) \max_a \sum_{s'} \mathcal{P}^a_{ss'} \big[ \mathcal{R}^a_{ss'} + \gamma \widetilde{V}^*(s') \big] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s'} \mathcal{P}^a_{ss'} \big[ \mathcal{R}^a_{ss'} + \gamma \widetilde{V}^*(s') \big].
\end{aligned}
$$

And from (5.2) we know that, when equality holds and the $\varepsilon$-soft policy $\pi$ is no longer improved, then

$$
\begin{aligned}
V^{\pi}(s) &= (1-\varepsilon) \max_a Q^{\pi}(s,a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^{\pi}(s,a) \\
&= (1-\varepsilon) \max_a \sum_{s'} \mathcal{P}^a_{ss'} \big[ \mathcal{R}^a_{ss'} + \gamma V^{\pi}(s') \big] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s'} \mathcal{P}^a_{ss'} \big[ \mathcal{R}^a_{ss'} + \gamma V^{\pi}(s') \big];
\end{aligned}
$$

but this equation is the same as the previous one, except for the substitution of $V^{\pi}$ for $\widetilde{V}^*$. Since $\widetilde{V}^*$ is the unique solution, it must be that $V^{\pi} = \widetilde{V}^*$.
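The ``moved inside'' construction is easy to make concrete. Here is a minimal sketch of such a wrapper, assuming a hypothetical environment object with a `step(action)` method and a known number of actions:

```python
import random

class EpsilonSoftEnv:
    """Wraps an environment so that epsilon-softness is 'moved inside':
    with probability epsilon the agent's chosen action is replaced by a
    uniformly random one before the underlying environment sees it."""

    def __init__(self, env, epsilon, n_actions):
        self.env, self.epsilon, self.n_actions = env, epsilon, n_actions

    def step(self, action):
        if random.random() < self.epsilon:
            action = random.randrange(self.n_actions)  # re-pick at random
        return self.env.step(action)  # then behave like the old environment
```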
In essence, we have shown in the last few pages that policy iteration works for $\varepsilon$-soft policies. Using the natural notion of greedy policy for $\varepsilon$-soft policies, one is assured of improvement on every step, except when the best policy has been found among the $\varepsilon$-soft policies. This analysis is independent of how the action-value functions are determined at each stage, but it does assume that they are computed exactly.
This brings us to roughly the
same point as in the previous section. Now we achieve the best policy only
among the $\varepsilon$-soft policies, but, on the other hand, we have eliminated the assumption of
exploring starts. The complete algorithm is given in
Figure 5.6.
Initialize, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$:
    $Q(s,a) \leftarrow$ arbitrary
    $\mathit{Returns}(s,a) \leftarrow$ empty list
    $\pi \leftarrow$ an arbitrary $\varepsilon$-soft policy

Repeat forever:
    (a) Generate an episode using $\pi$
    (b) For each pair $s,a$ appearing in the episode:
            $R \leftarrow$ return following the first occurrence of $s,a$
            Append $R$ to $\mathit{Returns}(s,a)$
            $Q(s,a) \leftarrow \mathrm{average}(\mathit{Returns}(s,a))$
    (c) For each $s$ in the episode:
            $a^* \leftarrow \arg\max_a Q(s,a)$
            For all $a \in \mathcal{A}(s)$:
                $\pi(s,a) \leftarrow \begin{cases} 1-\varepsilon+\varepsilon/|\mathcal{A}(s)| & \text{if } a = a^* \\ \varepsilon/|\mathcal{A}(s)| & \text{if } a \neq a^* \end{cases}$

Figure 5.6: An $\varepsilon$-soft on-policy Monte Carlo control algorithm.
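For reference, a compact Python sketch of the algorithm in Figure 5.6 is given below. It assumes a hypothetical episodic environment with integer-indexed states and actions, `env.reset()` returning a start state, and `env.step(a)` returning `(next_state, reward, done)`; it is an illustration, not a definitive implementation.

```python
from collections import defaultdict
import numpy as np

def on_policy_mc_control(env, n_states, n_actions, epsilon, n_episodes, gamma=1.0):
    """Epsilon-soft on-policy first-visit Monte Carlo control (cf. Figure 5.6)."""
    Q = np.zeros((n_states, n_actions))
    returns = defaultdict(list)                            # Returns(s, a)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # equiprobable, hence epsilon-soft

    for _ in range(n_episodes):
        # (a) Generate an episode using pi.
        episode, state, done = [], env.reset(), False
        while not done:
            action = np.random.choice(n_actions, p=pi[state])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # (b) Append the return following the first occurrence of each (s, a)
        #     to Returns(s, a) and set Q(s, a) to the average.
        G, first_return = 0.0, {}
        for s, a, r in reversed(episode):
            G = gamma * G + r
            first_return[(s, a)] = G                       # earliest visit overwrites later ones
        for (s, a), g in first_return.items():
            returns[(s, a)].append(g)
            Q[s, a] = np.mean(returns[(s, a)])

        # (c) Make pi epsilon-greedy with respect to Q at the visited states.
        for s in {s for s, _, _ in episode}:
            a_star = np.argmax(Q[s])
            pi[s] = epsilon / n_actions
            pi[s, a_star] += 1.0 - epsilon

    return Q, pi
```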