5.7 Incremental Implementation

Monte Carlo methods can be implemented incrementally, on an episode-by-episode basis, using extensions of the techniques described in Chapter 2. Monte Carlo methods use averages of returns just as some of the methods for solving n-armed bandit tasks described in Chapter 2 use averages of rewards. The techniques in Sections 2.5 and 2.6 extend immediately to the Monte Carlo case, enabling Monte Carlo methods to process each new return incrementally, with no growth in computation or memory as the number of episodes increases.
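As a concrete illustration, the sample-average update of Section 2.5 can be applied directly to returns. The following is a minimal sketch in Python (not from the book); the names values, counts, and update_value are illustrative:

    # Sketch of the incremental sample-average update of Section 2.5,
    # applied to Monte Carlo returns. Only the current estimate and a
    # visit count are stored per state, not the individual returns.
    def update_value(values, counts, state, ret):
        counts[state] = counts.get(state, 0) + 1
        n = counts[state]
        v = values.get(state, 0.0)
        values[state] = v + (ret - v) / n     # V_n = V_{n-1} + (1/n)[R_n - V_{n-1}]

Each state's estimate is updated once per episode (at its first visit, say), so the cost per episode does not grow with the number of episodes already processed.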

There are two differences between the Monte Carlo and bandit cases. One is that the Monte Carlo case typically involves multiple situations, that is, a different averaging process for each state, whereas bandit problems involve just one state (at least in the simple form treated in Chapter 2). The other difference is that the reward distributions in bandit problems are typically stationary, whereas in Monte Carlo methods the return distributions are typically nonstationary. This is because the returns depend on the policy, and the policy is typically changing and improving over time.
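When the returns are nonstationary in this way, the constant-step-size (recency-weighted) average of Section 2.6 is the natural alternative to the sample average. A minimal sketch, with an illustrative step size alpha not prescribed by the text:

    # Sketch of the constant-step-size update of Section 2.6, which weights
    # recent returns more heavily and so can track a changing target.
    def update_value_recency_weighted(values, state, ret, alpha=0.1):
        v = values.get(state, 0.0)
        values[state] = v + alpha * (ret - v)     # V <- V + alpha * [R - V]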

The incremental implementation described in Section 2.5 handles the case of simple or arithmetic averages, in which each return is weighted equally. Suppose we instead want to implement a weighted average, in which each return $R_n$ is weighted by $w_n$, and we want to compute

$$V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k} . \tag{5.4}$$

For example, the method described for estimating one policy while following another in Section 5.5 uses weights of $w_n = p_n(s)/p'_n(s)$, the ratio of the probabilities of the remainder of the episode under the two policies. Weighted averages also have a simple incremental update rule. In addition to keeping track of $V_n$, we must also maintain for each state the cumulative sum $W_n$ of the weights given to the first $n$ returns. The update rule for $V_n$ is

$$V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}} \Big[ R_{n+1} - V_n \Big], \qquad n \ge 0, \tag{5.5}$$

and

$$W_{n+1} = W_n + w_{n+1} ,$$

where $W_0 = 0$.
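A minimal sketch of this weighted incremental rule, again in Python and with illustrative names (V for the estimates, W for the cumulative weights):

    # Sketch of the weighted incremental update (5.5). Per state we keep
    # the current estimate V[state] and the cumulative weight W[state];
    # both dictionaries start empty, corresponding to W_0 = 0.
    def weighted_update(V, W, state, ret, w):
        if w == 0.0:
            return                                 # a zero weight leaves the estimate unchanged
        W[state] = W.get(state, 0.0) + w           # W_{n+1} = W_n + w_{n+1}
        v = V.get(state, 0.0)                      # V_0 is arbitrary because W_0 = 0
        V[state] = v + (w / W[state]) * (ret - v)  # (5.5)

This produces exactly the weighted average (5.4) while storing only two numbers per state.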

Exercise.

Modify the algorithm for first-visit MC policy evaluation (Figure 5.1) to use the incremental implementation for stationary averages described in Section 2.5.

Exercise.

Derive the weighted-average update rule (5.5) from (5.4). Follow the pattern of the derivation of the unweighted rule (2.4) from (2.1).

Exercise.

Modify the off-policy Monte Carlo control algorithm (Figure 5.7) to use the method described above for incrementally computing weighted averages.


