Our reason for computing the value function for a policy is to help
find better policies. Suppose we have determined the value function
$V^\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we
would like to know whether or not we should change the policy to
deterministically choose an action $a \neq \pi(s)$. We know how good it is to follow the current
policy from $s$---that is just $V^\pi(s)$---but would it be better or worse to
change to the new policy? One way to answer this question is to consider
selecting $a$ in $s$ and then thereafter following the existing policy,
$\pi$. The value of this way of behaving is
$$
Q^\pi(s,a) = E_\pi\!\left\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s, a_t = a \right\}
           = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]. \tag{4.6}
$$
The key criterion is whether this is greater than or less than
$V^\pi(s)$. If it is greater, that is, if it is better to select $a$ once in
$s$ and thereafter follow $\pi$ than it would be to follow $\pi$
all the time, then one would expect it
to be better still to select $a$ every time $s$ is encountered, and that the
new policy would in fact be a better one overall.
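As a concrete illustration of this one-step lookahead, here is a minimal sketch in Python, assuming a finite MDP whose dynamics are stored as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (expected rewards for each transition); these array names and the function `q_from_v` are our own illustrative choices, not part of the text.

```python
import numpy as np

def q_from_v(P, R, V, s, a, gamma=0.9):
    """One-step lookahead (equation 4.6): the value of taking action `a`
    in state `s` and following the current policy pi thereafter.

    P[s, a, s'] -- probability of moving from s to s' under action a (assumed layout)
    R[s, a, s'] -- expected immediate reward for that transition (assumed layout)
    V[s]        -- value of s under the current policy pi
    """
    return np.sum(P[s, a] * (R[s, a] + gamma * V))

# The question posed in the text: does switching to `a` at `s` look better
# than continuing to follow pi from s?
#   better = q_from_v(P, R, V, s, a) > V[s]
```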
That this is true is a special case of a general result called the
policy improvement theorem. Let $\pi$ and $\pi'$ be any pair of
deterministic policies such that, for all $s \in \mathcal{S}$,
$$
Q^\pi(s, \pi'(s)) \geq V^\pi(s). \tag{4.7}
$$
Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater
or equal expected return from all states $s \in \mathcal{S}$:
$$
V^{\pi'}(s) \geq V^\pi(s). \tag{4.8}
$$
Moreover, if there is strict inequality of (4.7) at any state,
then there must be strict inequality of
(4.8) at at least one state. This result applies in particular
to the two policies that we considered in the previous paragraph, an original
deterministic policy, $\pi$, and a changed policy, $\pi'$, that is identical to
$\pi$ except that $\pi'(s) = a \neq \pi(s)$. Obviously, (4.7)
holds trivially at all states other than $s$. Thus, if $Q^\pi(s,a) > V^\pi(s)$,
then the changed policy is indeed better than $\pi$.
The idea behind the proof of the policy improvement theorem is easy to
understand. Starting from (4.7), we keep expanding the
$Q^\pi$ side and re-applying (4.7) until we get $V^{\pi'}(s)$:
$$
\begin{aligned}
V^\pi(s) &\leq Q^\pi(s, \pi'(s)) \\
&= E_{\pi'}\!\left\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s \right\} \\
&\leq E_{\pi'}\!\left\{ r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi'(s_{t+1})) \;\middle|\; s_t = s \right\} \\
&= E_{\pi'}\!\left\{ r_{t+1} + \gamma E_{\pi'}\!\left\{ r_{t+2} + \gamma V^\pi(s_{t+2}) \right\} \;\middle|\; s_t = s \right\} \\
&= E_{\pi'}\!\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 V^\pi(s_{t+2}) \;\middle|\; s_t = s \right\} \\
&\leq E_{\pi'}\!\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V^\pi(s_{t+3}) \;\middle|\; s_t = s \right\} \\
&\;\;\vdots \\
&\leq E_{\pi'}\!\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots \;\middle|\; s_t = s \right\} \\
&= V^{\pi'}(s).
\end{aligned}
$$
So far we have seen how, given a policy and its value function, we can easily
evaluate a change in the policy at a single state to a particular action.
It is a natural extension to consider changes at
all states and to all possible actions, selecting at each state
the action that appears best according to $Q^\pi(s,a)$. In other words, to
consider the new greedy policy, $\pi'$, given by
$$
\begin{aligned}
\pi'(s) &= \arg\max_a Q^\pi(s,a) \\
&= \arg\max_a E\!\left\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s, a_t = a \right\} \\
&= \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right],
\end{aligned} \tag{4.9}
$$
where the ``$\arg\max_a$" notation denotes the value of $a$
at which the expression that follows is maximized (with ties broken
arbitrarily). The greedy policy takes the action that
looks best in the short term---after one step of lookahead---according to
$V^\pi$.
By construction, the greedy policy meets the conditions of the policy
improvement theorem (4.7), so we know that it is as
good as, or better than, the original policy. The process of making a new policy
that improves over an original policy, by making it greedy with respect to the
value function of the original policy, is called policy improvement.
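When the model is available explicitly, the policy improvement step is short to write down. The following sketch, under the same assumed tabular representation as above (hypothetical arrays `P[s, a, s']` and `R[s, a, s']`), constructs a deterministic greedy policy as in (4.9):

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """Policy improvement step (equation 4.9): for every state, pick the
    action that maximizes the one-step lookahead with respect to V.

    P[s, a, s'] and R[s, a, s'] are the assumed tabular model arrays;
    V is the value function of the policy being improved.
    Returns a deterministic policy as an array of action indices.
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    pi_new = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * V))
                      for a in range(n_actions)])
        # np.argmax breaks ties by taking the first maximizing action,
        # which is one of the arbitrary tie-breaking rules allowed by (4.9)
        pi_new[s] = np.argmax(q)
    return pi_new
```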
Suppose the new greedy policy, $\pi'$, is as good as, but not better than, the old
policy $\pi$. Then $V^{\pi'} = V^\pi$, and from (4.9) it follows that for all
$s \in \mathcal{S}$:
$$
\begin{aligned}
V^{\pi'}(s) &= \max_a E\!\left\{ r_{t+1} + \gamma V^{\pi'}(s_{t+1}) \;\middle|\; s_t = s, a_t = a \right\} \\
&= \max_a \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^{\pi'}(s') \right].
\end{aligned}
$$
But this is the same as the Bellman optimality equation
(4.1), and therefore $V^{\pi'}$ must be $V^*$, and both $\pi$ and $\pi'$
must be optimal policies. Policy improvement thus must give us a strictly
better policy except when the original policy is already optimal.
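This stopping condition can also be checked numerically: if greedy one-step lookahead cannot improve on $V^\pi$ at any state, then $V^\pi$ satisfies the Bellman optimality equation and $\pi$ is already optimal. A sketch, again assuming the tabular model arrays used above:

```python
import numpy as np

def is_value_optimal(P, R, V, gamma=0.9, tol=1e-8):
    """Check whether V already satisfies the Bellman optimality equation,
    i.e. whether greedy improvement can no longer raise the value anywhere.

    If max_a q(s, a) equals V[s] at every state (within `tol`), the policy
    whose value function is V is already optimal.
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    for s in range(n_states):
        q_max = max(np.sum(P[s, a] * (R[s, a] + gamma * V))
                    for a in range(n_actions))
        if abs(q_max - V[s]) > tol:
            return False
    return True
```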
So far in this section we have considered the special case of deterministic
policies. In the general case, a stochastic policy specifies probabilities,
$\pi(s,a)$, for taking each action, $a$, in each state, $s$. We will not go
through the details, but in fact all the ideas of this section extend easily to the
case of stochastic policies. In particular, the policy improvement theorem carries
through as stated for the stochastic case, under the natural definition
$$
Q^\pi(s, \pi'(s)) = \sum_a \pi'(s,a)\, Q^\pi(s,a).
$$
In addition, if there are ties in policy improvement steps such as (4.9), that is, if there are several actions at which the maximum is achieved, then in the stochastic case we need not select a single action from among them. Instead, each maximizing action can be given a portion of the probability of being selected in the new greedy policy. Any apportioning scheme is allowed as long as all sub-maximal actions are given zero probability.
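One simple apportioning scheme, used here only as an example, is to split probability equally among all maximizing actions. The following sketch (same assumed tabular model arrays as above; the equal split is just one of the many schemes the text allows) returns the new policy as a matrix of action probabilities:

```python
import numpy as np

def stochastic_greedy_policy(P, R, V, gamma=0.9, tol=1e-8):
    """Greedy improvement that keeps all maximizing actions: probability is
    split equally among the actions achieving the maximum in (4.9), and
    every sub-maximal action gets probability zero.

    Returns pi_new[s, a] = probability of taking action a in state s.
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    pi_new = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * V))
                      for a in range(n_actions)])
        best = np.isclose(q, q.max(), atol=tol)   # all tied maximizing actions
        pi_new[s, best] = 1.0 / best.sum()        # equal apportionment
    return pi_new
```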
The last row of Figure 4.2 shows an example of policy
improvement for stochastic policies. Here the original policy, $\pi$, is the
equiprobable random policy, and the new policy, $\pi'$, is greedy with respect to
$V^\pi$. The value function $V^\pi$ is shown in the bottom-left diagram and the
set of possible $\pi'$ is shown in the bottom-right diagram. The states with
multiple arrows in the $\pi'$ diagram are those in which several actions
achieve the maximum in (4.9); any apportionment of probability
among these actions is permitted. The value function of any such policy,
$V^{\pi'}(s)$, can be seen by inspection to be either $-1$, $-2$, or
$-3$ at all states, $s \in \mathcal{S}$, whereas $V^\pi(s)$ is at most $-14$. Thus,
$V^{\pi'}(s) \geq V^\pi(s)$, for all $s \in \mathcal{S}$, illustrating policy improvement.
Although in this case the new policy $\pi'$ happens to be optimal, in general only
an improvement is guaranteed.