Our reason for computing the value function for a policy is to help
find better policies. Suppose we have determined the value function
$V^\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we
would like to know whether or not we should change the policy to
deterministically choose an action $a \neq \pi(s)$. We know how good it is to follow the current
policy from $s$---that is just $V^\pi(s)$---but would it be better or worse to
change to the new policy? One way to answer this question is to consider
selecting $a$ in $s$ and then thereafter following the existing policy,
$\pi$. The value of this way of behaving is
$$
Q^\pi(s,a) = E_\pi\!\left\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s, a_t = a \right\}
           = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]. \tag{4.6}
$$
The key criterion is whether this is greater than or less than
$V^\pi(s)$. If it is greater, that is, if it is better to select $a$ once in
$s$ and thereafter follow $\pi$ than it would be to follow $\pi$
all the time, then one would expect it
to be better still to select $a$ every time $s$ is encountered, and that the
new policy would in fact be a better one overall.
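As a concrete illustration of this one-step lookahead, here is a minimal sketch in Python, assuming a finite MDP whose dynamics are stored as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (expected rewards for each transition); these array names and the function `q_from_v` are our own illustrative choices, not part of the text.

```python
import numpy as np

def q_from_v(P, R, V, s, a, gamma=0.9):
    """One-step lookahead (equation 4.6): the value of taking action `a`
    in state `s` and following the current policy pi thereafter.

    P[s, a, s'] -- probability of moving from s to s' under action a (assumed layout)
    R[s, a, s'] -- expected immediate reward for that transition (assumed layout)
    V[s]        -- value of s under the current policy pi
    """
    return np.sum(P[s, a] * (R[s, a] + gamma * V))

# The question posed in the text: does switching to `a` at `s` look better
# than continuing to follow pi from s?
#   better = q_from_v(P, R, V, s, a) > V[s]
```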
That this is true is a special case of a general result called the
policy improvement theorem. Let $\pi$ and $\pi'$ be any pair of
deterministic policies such that, for all $s \in \mathcal{S}$,
$$
Q^\pi(s, \pi'(s)) \geq V^\pi(s). \tag{4.7}
$$
Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater
or equal expected return from all states $s \in \mathcal{S}$:
$$
V^{\pi'}(s) \geq V^\pi(s). \tag{4.8}
$$
Moreover, if there is strict inequality of (4.7) at any state,
then there must be strict inequality of
(4.8) at at least one state. This result applies in particular
to the two policies that we considered in the previous paragraph, an original
deterministic policy, $\pi$, and a changed policy, $\pi'$, that is identical to
$\pi$ except that $\pi'(s) = a \neq \pi(s)$. Obviously, (4.7)
holds trivially at all states other than $s$. Thus, if $Q^\pi(s,a) > V^\pi(s)$,
then the changed policy is indeed better than $\pi$.
The idea behind the proof of the policy improvement theorem is easy to
understand. Starting from (4.7), we keep expanding the
$Q^\pi$ side and re-applying (4.7) until we get $V^{\pi'}(s)$:
$$
\begin{aligned}
V^\pi(s) &\leq Q^\pi(s, \pi'(s)) \\
&= E_{\pi'}\!\left\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s \right\} \\
&\leq E_{\pi'}\!\left\{ r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi'(s_{t+1})) \;\middle|\; s_t = s \right\} \\
&= E_{\pi'}\!\left\{ r_{t+1} + \gamma E_{\pi'}\!\left\{ r_{t+2} + \gamma V^\pi(s_{t+2}) \right\} \;\middle|\; s_t = s \right\} \\
&= E_{\pi'}\!\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 V^\pi(s_{t+2}) \;\middle|\; s_t = s \right\} \\
&\leq E_{\pi'}\!\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V^\pi(s_{t+3}) \;\middle|\; s_t = s \right\} \\
&\;\;\vdots \\
&\leq E_{\pi'}\!\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots \;\middle|\; s_t = s \right\} \\
&= V^{\pi'}(s).
\end{aligned}
$$
So far we have seen how, given a policy and its value function, we can easily
evaluate a change in the policy at a single state to a particular action.
It is a natural extension to consider changes at
all states and to all possible actions, selecting at each state
the action that appears best according to $Q^\pi(s,a)$. In other words, to
consider the new greedy policy, $\pi'$, given by
$$
\begin{aligned}
\pi'(s) &= \arg\max_a Q^\pi(s,a) \\
&= \arg\max_a E\!\left\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s, a_t = a \right\} \\
&= \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right],
\end{aligned} \tag{4.9}
$$
where the ``$\arg\max_a$" notation denotes the value of $a$
at which the expression that follows is maximized (with ties broken
arbitrarily). The greedy policy takes the action that
looks best in the short term---after one step of lookahead---according to
$V^\pi$.
By construction, the greedy policy meets the conditions of the policy
improvement theorem (4.7), so we know that it is as
good as, or better than, the original policy. The process of making a new policy
that improves over an original policy, by making it greedy with respect to the
value function of the original policy, is called policy improvement.
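When the model is available explicitly, the policy improvement step is short to write down. The following sketch, under the same assumed tabular representation as above (hypothetical arrays `P[s, a, s']` and `R[s, a, s']`), constructs a deterministic greedy policy as in (4.9):

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """Policy improvement step (equation 4.9): for every state, pick the
    action that maximizes the one-step lookahead with respect to V.

    P[s, a, s'] and R[s, a, s'] are the assumed tabular model arrays;
    V is the value function of the policy being improved.
    Returns a deterministic policy as an array of action indices.
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    pi_new = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * V))
                      for a in range(n_actions)])
        # np.argmax breaks ties by taking the first maximizing action,
        # which is one of the arbitrary tie-breaking rules allowed by (4.9)
        pi_new[s] = np.argmax(q)
    return pi_new
```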
Suppose the new greedy policy, $\pi'$, is as good as, but not better than, the old
policy $\pi$. Then $V^{\pi'} = V^\pi$, and from (4.9) it follows that for all
$s \in \mathcal{S}$:
$$
\begin{aligned}
V^{\pi'}(s) &= \max_a E\!\left\{ r_{t+1} + \gamma V^{\pi'}(s_{t+1}) \;\middle|\; s_t = s, a_t = a \right\} \\
&= \max_a \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^{\pi'}(s') \right].
\end{aligned}
$$
But this is the same as the Bellman optimality equation
(4.1), and therefore $V^{\pi'}$ must be $V^*$, and both $\pi$ and $\pi'$
must be optimal policies. Policy improvement thus must give us a strictly
better policy except when the original policy is already optimal.
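This stopping condition can also be checked numerically: if greedy one-step lookahead cannot improve on $V^\pi$ at any state, then $V^\pi$ satisfies the Bellman optimality equation and $\pi$ is already optimal. A sketch, again assuming the tabular model arrays used above:

```python
import numpy as np

def is_value_optimal(P, R, V, gamma=0.9, tol=1e-8):
    """Check whether V already satisfies the Bellman optimality equation,
    i.e. whether greedy improvement can no longer raise the value anywhere.

    If max_a q(s, a) equals V[s] at every state (within `tol`), the policy
    whose value function is V is already optimal.
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    for s in range(n_states):
        q_max = max(np.sum(P[s, a] * (R[s, a] + gamma * V))
                    for a in range(n_actions))
        if abs(q_max - V[s]) > tol:
            return False
    return True
```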
So far in this section we have considered the special case of deterministic
policies. In the general case, a stochastic policy specifies probabilities,
$\pi(s,a)$, for taking each action, $a$, in each state, $s$. We will not go
through the details, but in fact all the ideas of this section extend easily to the
case of stochastic policies. In particular, the policy improvement theorem carries
through as stated for the stochastic case, under the natural definition
$$
Q^\pi(s, \pi'(s)) = \sum_a \pi'(s,a)\, Q^\pi(s,a).
$$
In addition, if there are ties in policy improvement steps such as (4.9), that is, if there are several actions at which the maximum is achieved, then in the stochastic case we need not select a single action from among them. Instead, each maximizing action can be given a portion of the probability of being selected in the new greedy policy. Any apportioning scheme is allowed as long as all sub-maximal actions are given zero probability.
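One simple apportioning scheme, used here only as an example, is to split probability equally among all maximizing actions. The following sketch (same assumed tabular model arrays as above; the equal split is just one of the many schemes the text allows) returns the new policy as a matrix of action probabilities:

```python
import numpy as np

def stochastic_greedy_policy(P, R, V, gamma=0.9, tol=1e-8):
    """Greedy improvement that keeps all maximizing actions: probability is
    split equally among the actions achieving the maximum in (4.9), and
    every sub-maximal action gets probability zero.

    Returns pi_new[s, a] = probability of taking action a in state s.
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    pi_new = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * V))
                      for a in range(n_actions)])
        best = np.isclose(q, q.max(), atol=tol)   # all tied maximizing actions
        pi_new[s, best] = 1.0 / best.sum()        # equal apportionment
    return pi_new
```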
The last row of Figure 4.2 shows an example of policy
improvement for stochastic policies. Here the original policy, $\pi$, is the
equiprobable random policy, and the new policy, $\pi'$, is greedy with respect to
$V^\pi$. The value function $V^\pi$ is shown in the bottom-left diagram and the
set of possible $\pi'$ is shown in the bottom-right diagram. The states with
multiple arrows in the $\pi'$ diagram are those in which several actions
achieve the maximum in (4.9); any apportionment of probability
among these actions is permitted. The value function of any such policy,
$V^{\pi'}(s)$, can be seen by inspection to be either $-1$, $-2$, or
$-3$ at all states, $s \in \mathcal{S}$, whereas $V^\pi(s)$ is at most $-14$. Thus,
$V^{\pi'}(s) \geq V^\pi(s)$, for all $s \in \mathcal{S}$, illustrating policy improvement.
Although in this case the new policy $\pi'$ happens to be optimal, in general only
an improvement is guaranteed.