Monte Carlo methods can be implemented incrementally, on an episode-by-episode
basis, using extensions of techniques described in Chapter 2. They use averages of
*returns* just as some of the methods for solving -armed bandit tasks
described in Chapter 2 use averages of *rewards*. The techniques in
Sections 2.5 and
2.6 extend immediately to the Monte Carlo case. They enable
Monte Carlo methods to process each new return incrementally with no increase
in computation or memory as the number of episodes increases.

There are two differences between the Monte Carlo and bandit cases. One is that the Monte Carlo case typically involves multiple situations, that is, a different averaging process for each state, whereas bandit problems involve just one state (at least in the simple form treated in Chapter 2). The other difference is that the reward distributions in bandit problems are typically stationary, whereas in Monte Carlo methods the return distributions are typically nonstationary. This is because the returns depend on the policy, and the policy is typically changing and improving over time.

The incremental implementation described in Section 2.5 handles the case
of simple or arithmetic averages, in which each return is weighted equally.
Suppose we instead want to implement a *weighted* average, in which each
return is weighted by , and we want to compute

For example, the method described for estimating one policy while following another in Section 5.5 uses weights of . Weighted averages also have a simple incremental update rule. In addition to keeping track of , we must maintain for each state the cumulative sum of the weights given to the first returns. The update rule for is

and

where .