We are now ready to present an example of the second class of learning
control methods we consider in this book: off-policy methods. Recall that the
distinguishing feature of on-policy methods is that they estimate the value of
a policy while using it for control. In off-policy methods these two
functions are separated. The policy used to generate behavior, called the *behavior* policy, may in fact be unrelated to the policy that is evaluated
and improved, called the *estimation* policy. An advantage of this
separation is that the estimation policy may be deterministic (e.g., greedy),
while the behavior policy can continue to sample all possible actions.

Off-policy Monte Carlo control methods use the technique presented in the preceding section for estimating the value function for one policy while following another. They follow the behavior policy while learning about and improving the estimation policy. This technique requires that the behavior policy assign nonzero probability to every action that the estimation policy might select. To explore all possibilities, we require that the behavior policy be soft.
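To make the coverage requirement concrete, the following minimal sketch computes the importance-sampling weight of a sequence of steps when the estimation policy is deterministic. The names `importance_weight`, `covers`, `greedy_action`, and `uniform_behavior`, and the two-state example, are hypothetical and introduced only for illustration. With a deterministic estimation policy, each step contributes a factor of one over the behavior policy's probability if the action matches the estimation policy's choice, and the whole weight is zero otherwise.

```python
def importance_weight(steps, greedy_action, behavior_prob):
    """Weight of a sequence of (state, action) steps under a deterministic
    estimation policy, relative to the behavior policy that generated them."""
    weight = 1.0
    for state, action in steps:
        if action != greedy_action[state]:
            return 0.0  # the estimation policy would never take this action
        weight /= behavior_prob(state, action)
    return weight


def covers(states, greedy_action, behavior_prob):
    """True if the behavior policy gives nonzero probability to every action
    the estimation policy might select."""
    return all(behavior_prob(s, greedy_action[s]) > 0.0 for s in states)


# Hypothetical example: two states, two actions, uniformly random behavior.
greedy_action = {"s0": "left", "s1": "right"}


def uniform_behavior(state, action):
    return 0.5


matching = [("s0", "left"), ("s1", "right")]
deviating = [("s0", "right"), ("s1", "right")]
```

Here `importance_weight(matching, greedy_action, uniform_behavior)` multiplies two factors of 1/0.5, while the deviating sequence receives weight zero because its first action is one the estimation policy never takes.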

Figure 5.7 shows an off-policy Monte Carlo method, based on GPI, for computing $Q^*$. The behavior policy is maintained as an arbitrary soft policy. The estimation policy $\pi$ is the greedy policy with respect to $Q$, an estimate of $Q^\pi$. The behavior policy chosen in (a) can be anything, but in order to assure convergence of $\pi$ to the optimal policy, an infinite number of returns suitable for use in (c) must be obtained for each pair of state and action. This can be assured by careful choice of the behavior policy. For example, any $\varepsilon$-soft behavior policy will suffice.
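The overall loop can be sketched in Python as follows. This is not a transcription of the figure: it is an incremental, weighted-importance-sampling variant that processes each episode backward and stops at the first nongreedy action it meets. The two-step task in `TRANSITIONS`, the uniformly random behavior policy, and all function names are assumptions made for illustration.

```python
import random

# Hypothetical two-step episodic task (an assumption, not from the text):
# TRANSITIONS[state][action] = (next_state, reward); next_state None ends the episode.
TRANSITIONS = {
    0: {0: (1, 0.0), 1: (2, 0.0)},
    1: {0: (None, 1.0), 1: (None, 0.0)},
    2: {0: (None, 0.0), 1: (None, 5.0)},
}
ACTIONS = (0, 1)


def behavior_prob(state, action):
    """Uniformly random behavior policy: soft, since every action has probability > 0."""
    return 1.0 / len(ACTIONS)


def generate_episode(rng):
    """Follow the behavior policy from state 0; return a list of (state, action, reward)."""
    episode, state = [], 0
    while state is not None:
        action = rng.choice(ACTIONS)
        next_state, reward = TRANSITIONS[state][action]
        episode.append((state, action, reward))
        state = next_state
    return episode


def greedy_action(q_row):
    """Greedy action for one state; ties go to the first action."""
    return max(ACTIONS, key=lambda a: q_row[a])


def off_policy_mc_control(num_episodes=2000, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in TRANSITIONS}  # action-value estimates
    C = {s: {a: 0.0 for a in ACTIONS} for s in TRANSITIONS}  # cumulative weights
    policy = {s: greedy_action(Q[s]) for s in TRANSITIONS}   # estimation policy
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        G, W = 0.0, 1.0  # return so far, importance weight of the later steps
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            policy[state] = greedy_action(Q[state])
            if action != policy[state]:
                break  # earlier steps would get weight zero: only the tail is used
            W /= behavior_prob(state, action)
    return Q, policy


Q, policy = off_policy_mc_control()
```

In this toy task the best choices are action 1 in states 0 and 2 and action 0 in state 1, and the learned `policy` can be compared against that. The early `break` is what restricts learning to the portion of each episode after the last nongreedy action.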

A potential problem is that this method learns only from the *tails* of
episodes, after the last nongreedy action. If nongreedy actions are
frequent, learning will be slow, particularly for states appearing
in the early portions of long episodes. There has been insufficient
experience with off-policy Monte Carlo methods to assess how serious
this problem is.
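A rough calculation suggests how quickly the usable tail can shrink. The sketch below makes several assumptions that are not in the text: the behavior policy is $\varepsilon$-greedy with respect to the estimation policy, the greedy action is unique in every state, action choices are independent across steps, and the episode has a fixed length. Under those assumptions, step `t` contributes to learning only if every later action in the episode is greedy. The function name `usable_probability` is hypothetical.

```python
def usable_probability(t, episode_length, epsilon, num_actions):
    """Probability that step t (0-indexed) lies in the usable tail of an episode,
    assuming epsilon-greedy behavior around the estimation policy."""
    # Chance that a single behavior action coincides with the greedy action.
    p_greedy = 1.0 - epsilon + epsilon / num_actions
    # Every action after step t must be greedy for step t to be used.
    later_steps = episode_length - 1 - t
    return p_greedy ** later_steps


first_step = usable_probability(0, 50, 0.1, 4)
last_step = usable_probability(49, 50, 0.1, 4)
```

With these illustrative numbers (50 steps, $\varepsilon = 0.1$, four actions), the final step is always usable, while the first step is usable in only about 2% of episodes, which is the sense in which early states of long episodes learn slowly.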