How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In this section we present an on-policy Monte Carlo control method in order to illustrate the idea.
In on-policy control methods the policy is generally soft, meaning that $\pi(s,a) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. There are many possible variations on on-policy methods. One possibility is to gradually shift the policy toward a deterministic optimal policy. Many of the methods discussed in Chapter 2 provide mechanisms for this. The on-policy method we present in this section uses $\varepsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\varepsilon$ they instead select an action at random. That is, all nongreedy actions are given the minimal probability of selection, $\frac{\varepsilon}{|\mathcal{A}(s)|}$, and the remaining bulk of the probability, $1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}(s)|}$, is given to the greedy action. The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft policies, defined as policies for which $\pi(s,a) \geq \frac{\varepsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\varepsilon > 0$. Among $\varepsilon$-soft policies, $\varepsilon$-greedy policies are in some sense those that are closest to greedy.
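To make these action probabilities concrete, here is a minimal sketch in Python (not from the text) of $\varepsilon$-greedy action selection over a table of estimated action values; the function name, the dictionary keyed by (state, action) pairs, and the explicit actions list are illustrative assumptions.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Sample an action epsilon-greedily from a table of action-value estimates Q."""
    if random.random() < epsilon:
        # With probability epsilon, repick uniformly among all actions.
        # Because the greedy action can also be picked here, every action
        # ends up with at least epsilon/|A(s)| probability, and the greedy
        # action with 1 - epsilon + epsilon/|A(s)| in total.
        return random.choice(actions)
    # Otherwise act greedily with respect to the current estimates.
    return max(actions, key=lambda a: Q[(state, a)])
```

Sampling uniformly over all actions with probability $\varepsilon$, rather than only over the nongreedy ones, is what gives every action the $\frac{\varepsilon}{|\mathcal{A}(s)|}$ floor described above.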
The overall idea of on-policy Monte Carlo control is still that of
GPI. As in Monte Carlo ES, we use first-visit MC methods to
estimate the action-value function for the current policy. Without the
assumption of exploring starts, however, we cannot simply improve the policy
by making it greedy with respect to the current value function, because that
would prevent further exploration of nongreedy actions. Fortunately, GPI does
not require that the policy be taken all the way to a greedy
policy, only that it be moved toward a greedy policy. In our
on-policy method we will move it only to an $\varepsilon$-greedy policy. For any $\varepsilon$-soft policy, $\pi$, any $\varepsilon$-greedy policy with respect to $Q^\pi$ is guaranteed to be better than or equal to $\pi$.
That any $\varepsilon$-greedy policy with respect to $Q^\pi$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem. Let $\pi'$ be the $\varepsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s \in \mathcal{S}$:

$$
\begin{aligned}
Q^\pi(s, \pi'(s)) &= \sum_a \pi'(s,a)\, Q^\pi(s,a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^\pi(s,a) + (1-\varepsilon) \max_a Q^\pi(s,a) \\
&\geq \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a Q^\pi(s,a) + (1-\varepsilon) \sum_a \frac{\pi(s,a) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1-\varepsilon}\, Q^\pi(s,a) \\
&= \sum_a \pi(s,a)\, Q^\pi(s,a) \;=\; V^\pi(s),
\end{aligned}
$$

where the inequality holds because the maximum is at least as large as any weighted average with nonnegative weights summing to one. Thus, by the policy improvement theorem, $\pi' \geq \pi$; it remains to argue that equality can hold only when both $\pi'$ and $\pi$ are optimal among the $\varepsilon$-soft policies.
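The inequality can also be checked numerically for a single state: for any action-value estimates and any $\varepsilon$-soft policy, the $\varepsilon$-greedy policy's expected action value is at least $V^\pi(s)$. The sketch below uses made-up numbers purely for illustration.

```python
import numpy as np

epsilon = 0.1
q = np.array([1.0, 3.0, 2.0])      # Q^pi(s, a) for the three actions available in s
pi = np.array([0.2, 0.5, 0.3])     # an epsilon-soft policy: every entry >= epsilon/3

n = len(q)
# epsilon-greedy policy pi' with respect to q
pi_greedy = np.full(n, epsilon / n)
pi_greedy[np.argmax(q)] += 1.0 - epsilon

lhs = pi_greedy @ q                # Q^pi(s, pi'(s)) = sum_a pi'(s,a) Q^pi(s,a)
rhs = pi @ q                       # V^pi(s) = sum_a pi(s,a) Q^pi(s,a)
assert lhs >= rhs - 1e-12          # the policy improvement inequality
print(lhs, rhs)                    # here: 2.9 >= 2.3
```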
Consider a new environment that is just like the original environment, except with the requirement that policies be $\varepsilon$-soft "moved inside" the environment. The new environment has the same action and state set as the original and behaves as follows. If in state $s$ and taking action $a$, then with probability $1-\varepsilon$ the new environment behaves exactly like the old environment. With probability $\varepsilon$ it repicks the action at random, with equal probabilities, and then behaves like the old environment with the new, random action. The best one can do in this new environment with general policies is the same as the best one could do in the original environment with $\varepsilon$-soft policies. Let $\widetilde{V}^*$ and $\widetilde{Q}^*$ denote the optimal value functions for the new environment. Then a policy $\pi$ is optimal among $\varepsilon$-soft policies if and only if $V^\pi = \widetilde{V}^*$. From the definition of $\widetilde{V}^*$ we know that it is the unique solution to
$$
\begin{aligned}
\widetilde{V}^*(s) &= (1-\varepsilon) \max_a \widetilde{Q}^*(s,a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \widetilde{Q}^*(s,a) \\
&= (1-\varepsilon) \max_a \sum_{s'} \mathcal{P}^a_{ss'} \Bigl[ \mathcal{R}^a_{ss'} + \gamma \widetilde{V}^*(s') \Bigr] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s'} \mathcal{P}^a_{ss'} \Bigl[ \mathcal{R}^a_{ss'} + \gamma \widetilde{V}^*(s') \Bigr].
\end{aligned}
$$

When equality holds in the policy improvement step and the $\varepsilon$-soft policy $\pi$ is no longer improved, $V^\pi$ satisfies this same equation with $V^\pi$ in place of $\widetilde{V}^*$; since the solution is unique, it must be that $V^\pi = \widetilde{V}^*$, and $\pi$ is optimal among the $\varepsilon$-soft policies.
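The construction of the new environment can be pictured as a thin wrapper around a step function. The sketch below is only an illustration: the env object with a step(state, action) method and a single shared action list are assumptions, not an interface from the text.

```python
import random

class EpsilonSoftEnv:
    """Wrap an environment so that epsilon-softness is 'moved inside':
    with probability 1 - epsilon the chosen action is executed as-is,
    and with probability epsilon a uniformly random action is executed
    instead."""

    def __init__(self, env, actions, epsilon):
        self.env = env            # assumed to expose step(state, action)
        self.actions = actions    # action list, assumed the same in every state
        self.epsilon = epsilon

    def step(self, state, action):
        if random.random() < self.epsilon:
            action = random.choice(self.actions)   # repick uniformly at random
        return self.env.step(state, action)        # then behave like the old environment
```

A deterministic policy in the wrapped environment induces exactly an $\varepsilon$-greedy policy in the original one, which is why the best one can do here matches the best one can do with $\varepsilon$-soft policies there.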
In essence, we have shown in the last few pages that policy iteration works for $\varepsilon$-soft policies. Using the natural notion of greedy policy for $\varepsilon$-soft policies, one is assured of improvement on every step, except when the best policy has been found among the $\varepsilon$-soft policies. This analysis is independent of how the action-value functions are determined at each stage, but it does assume that they are computed exactly. This brings us to roughly the same point as in the previous section. Now we only achieve the best policy among the $\varepsilon$-soft policies, but on the other hand, we have eliminated the assumption of exploring starts. The complete algorithm is given in Figure 5.6.
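For reference, here is a compact sketch of what an on-policy first-visit MC control algorithm for $\varepsilon$-soft policies might look like in Python, in the spirit of (but not copied from) Figure 5.6; the generate_episode helper, the tuple format of episodes, and the fixed action list are assumptions made for this sketch.

```python
import random
from collections import defaultdict

def on_policy_mc_control(generate_episode, actions, num_episodes,
                         epsilon=0.1, gamma=1.0):
    """Sketch of on-policy first-visit MC control for epsilon-soft policies.

    generate_episode is assumed to take a function mapping state -> action
    and return one episode as a list of (state, action, reward) tuples.
    """
    Q = defaultdict(float)          # action-value estimates Q(s, a)
    returns = defaultdict(list)     # observed returns for each (s, a)
    policy = {}                     # state -> action probabilities (epsilon-greedy w.r.t. Q)

    def act(state):
        probs = policy.get(state)
        if probs is None:
            return random.choice(actions)             # unseen state: act uniformly
        return random.choices(actions, weights=probs)[0]

    for _ in range(num_episodes):
        episode = generate_episode(act)
        G = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only use G if (state, action) does not occur earlier.
            if all((s, a) != (state, action) for s, a, _ in episode[:t]):
                returns[(state, action)].append(G)
                Q[(state, action)] = sum(returns[(state, action)]) / len(returns[(state, action)])
                # Policy improvement: make the policy epsilon-greedy w.r.t. Q at this state.
                greedy = max(actions, key=lambda a: Q[(state, a)])
                policy[state] = [1 - epsilon + epsilon / len(actions) if a == greedy
                                 else epsilon / len(actions)
                                 for a in actions]
    return Q, policy
```

The evaluation and improvement steps are interleaved episode by episode, as in GPI: returns update the action-value estimates, and the policy at each visited state is immediately made $\varepsilon$-greedy with respect to those estimates.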