The Monte Carlo methods presented in this chapter learn value functions and
optimal policies from experience in the form of *sample episodes*. This
gives them at least three kinds of advantages over DP methods. First, they
can be used to learn optimal behavior directly from interaction with the
environment, with no model of the environment's dynamics. Second, they
can be used with simulation or *sample models*. For surprisingly many
applications it is easy to simulate sample episodes even though it is
difficult to construct the kind of explicit model of transition probabilities
required by DP methods. Third, it is easy and efficient to *focus*
Monte Carlo methods on a small subset of the states. A region of special
interest can be accurately evaluated without going to the expense of
accurately evaluating the rest of the state set (we explore this further
in Chapter 9).

A fourth advantage of Monte Carlo methods, which we discuss later in the book, is that they may be less harmed by violations of the Markov property. This is because they do not update their value estimates on the basis of the value estimates of successor states. In other words, it is because they do not bootstrap.

In designing Monte Carlo control methods we have followed the overall schema
of *generalized policy iteration* (GPI) introduced in Chapter 4. GPI
involves interacting processes of policy evaluation and policy improvement.
Monte Carlo methods provide an alternative policy evaluation process. Rather
than use a model to compute the value of each state, they simply average many
returns that start in the state. Because a state's value is the expected
return, this average can become a good approximation to the value. In
control methods we are particularly interested in approximating action-value
functions, because these can be used to improve the policy without requiring a
model of the environment's transition dynamics. Monte Carlo methods intermix
policy evaluation and policy improvement steps, and can be incrementally
implemented on an episode-by-episode basis.
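As a rough sketch of this episode-by-episode scheme, the following Python fragment averages returns for state–action pairs and makes the policy greedy after each episode. All names here are illustrative rather than taken from the text; in particular, `generate_episode` is assumed to return a list of (state, action, reward) triples for one terminated episode, and exploration is deliberately omitted from this sketch.

```python
from collections import defaultdict

def mc_control(generate_episode, actions, num_episodes=1000, gamma=1.0):
    # Every-visit Monte Carlo control: average the returns observed for
    # each (state, action) pair, then make the policy greedy with respect
    # to the current action-value estimates after every episode.
    returns_sum = defaultdict(float)    # total return per (state, action)
    returns_count = defaultdict(int)    # number of returns per pair
    Q = defaultdict(float)              # action-value estimates
    policy = {}                         # state -> currently greedy action

    for _ in range(num_episodes):
        # Assumed interface: one terminated episode as
        # [(state, action, reward), ...] triples.
        episode = generate_episode(policy)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward      # return following this time step
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
            Q[(state, action)] = (returns_sum[(state, action)]
                                  / returns_count[(state, action)])
            # Policy improvement, interleaved with evaluation:
            policy[state] = max(actions, key=lambda a: Q[(state, a)])
    return Q, policy
```

Note that evaluation (updating `Q`) and improvement (updating `policy`) are interleaved within a single pass over each episode, in the spirit of GPI.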

Maintaining *sufficient exploration* is an issue in Monte
Carlo control methods. It is not enough just to select the actions
currently estimated to be best, because then no returns will be obtained for
alternative actions, and it may never be learned that they are actually
better. One approach is to ignore this problem by assuming that episodes
begin with state-action pairs randomly selected to cover all possibilities.
Such *exploring starts* can sometimes be arranged in applications with
simulated episodes, but are unlikely in learning from real experience.
Instead, one of two general approaches can be used. In *on-policy*
methods, the agent commits to always exploring and tries to find the best
policy that still explores. In *off-policy* methods, the agent
also explores, but learns a deterministic optimal policy that may be unrelated
to the policy followed. More instances of both kinds of methods are presented
in the next chapter.
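One common way for an on-policy method to commit to exploring is an ε-greedy selection rule: with small probability ε the agent picks an action at random, and otherwise it picks the action that currently looks best. A minimal sketch, with illustrative names only:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore by choosing uniformly at random;
    # otherwise exploit the action with the highest current estimate.
    # Unseen (state, action) pairs default to a value of 0.0.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Because every action retains probability at least ε/|actions| of being selected, returns continue to be obtained for all actions, which is exactly the guarantee exploring starts would otherwise provide.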

All Monte Carlo methods for reinforcement learning have been explicitly identified only recently. Their convergence properties are not yet clear, and their effectiveness in practice has been little tested. At present, their primary significance is their simplicity and their relationships to other methods.

Monte Carlo methods differ from DP methods in two ways. First, they operate on sample experience, and thus can be used for direct learning without a model. Second, they do not bootstrap. That is, they do not update their value estimates on the basis of other value estimates. These two differences are not tightly linked and can be separated. In the next chapter we consider methods that learn from experience, like Monte Carlo methods, but also bootstrap, like DP methods.
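The bootstrapping distinction can be made concrete by comparing update targets. In this illustrative sketch (the names are not from the text), the Monte Carlo target is computed entirely from observed rewards, whereas a bootstrapped target, as in DP, folds in an existing estimate of a successor state's value:

```python
def mc_target(rewards, gamma=1.0):
    # Monte Carlo target: the complete observed return following a state,
    # computed from actual rewards only -- no value estimates involved.
    G = 0.0
    for r in reversed(rewards):
        G = gamma * G + r
    return G

def bootstrap_target(reward, next_state_value, gamma=1.0):
    # Bootstrapped target (as in DP): one real reward plus an
    # *estimate* of the successor state's value.
    return reward + gamma * next_state_value
```

The methods of the next chapter combine the two: they use sample experience, like `mc_target`, but form one-step targets from estimated successor values, like `bootstrap_target`.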