
8.6 Should We Bootstrap?

At this point you may be wondering why we bother with bootstrapping methods at all. Nonbootstrapping methods can be used with function approximation more reliably and over a broader range of conditions than bootstrapping methods. Nonbootstrapping methods achieve a lower asymptotic error than bootstrapping methods, even when backups are done according to the on-policy distribution. By using eligibility traces and $\lambda =1$, it is even possible to implement nonbootstrapping methods on-line, in a step-by-step incremental manner. Despite all this, in practice bootstrapping methods are usually the methods of choice.


Figure 8.15: The effect of $\lambda $ on reinforcement learning performance. In all cases, the better the performance, the lower the curve. The two left panels are applications to simple continuous-state control tasks using the Sarsa($\lambda $) algorithm and tile coding, with either replacing or accumulating traces (Sutton, 1996). The upper-right panel is for policy evaluation on a random walk task using TD($\lambda $) (Singh and Sutton, 1996). The lower-right panel is unpublished data for the pole-balancing task (Example 3.4) from an earlier study (Sutton, 1984).
 

In empirical comparisons, bootstrapping methods usually perform much better than nonbootstrapping methods. A convenient way to make such comparisons is to use a TD method with eligibility traces and vary $\lambda $ from 0 (pure bootstrapping) to 1 (pure nonbootstrapping). Figure 8.15 summarizes a collection of such results. In all cases, performance became much worse as $\lambda $ approached 1, the nonbootstrapping case. The example in the upper right of the figure is particularly significant in this regard. This is a policy evaluation (prediction) task and the performance measure used is root MSE (at the end of each episode, averaged over the first 20 episodes). Asymptotically, the $\lambda =1$ case must be best according to this measure, but here, short of the asymptote, we see it performing much worse.
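To make the kind of comparison just described concrete, here is a minimal sketch of tabular TD($\lambda $) with accumulating eligibility traces on a 19-state random walk, sweeping $\lambda $ and reporting root MSE averaged over the first 20 episodes. The environment size, fixed step size, and number of runs are illustrative assumptions, not the exact experimental setup behind Figure 8.15 (which used the best step size for each $\lambda $):

import numpy as np

# Sketch: tabular TD(lambda) policy evaluation on a 19-state random walk.
# States 1..19 are nonterminal; 0 and 20 are terminal, with reward -1 on the
# left termination and +1 on the right, and 0 elsewhere.

N_STATES = 19
TRUE_V = np.arange(1, N_STATES + 1) / 10.0 - 1.0   # true values: -0.9, ..., +0.9

def td_lambda_episode(V, lam, alpha, gamma=1.0):
    """Run one episode of TD(lambda) with accumulating traces; update V in place."""
    z = np.zeros(N_STATES + 2)          # eligibility trace for every state
    s = (N_STATES + 1) // 2             # start in the middle state (10)
    while True:
        s_next = s + np.random.choice([-1, 1])
        r = 1.0 if s_next == N_STATES + 1 else (-1.0 if s_next == 0 else 0.0)
        delta = r + gamma * V[s_next] - V[s]
        z *= gamma * lam
        z[s] += 1.0                     # accumulating trace
        V += alpha * delta * z
        s = s_next
        if s == 0 or s == N_STATES + 1:
            return

def mean_rmse(lam, alpha, n_episodes=20, n_runs=50):
    """Root MSE against the true values, averaged over the first n_episodes."""
    errs = []
    for _ in range(n_runs):
        V = np.zeros(N_STATES + 2)      # terminal values stay at 0
        run_errs = []
        for _ in range(n_episodes):
            td_lambda_episode(V, lam, alpha)
            run_errs.append(np.sqrt(np.mean((V[1:N_STATES + 1] - TRUE_V) ** 2)))
        errs.append(np.mean(run_errs))
    return np.mean(errs)

if __name__ == "__main__":
    for lam in [0.0, 0.4, 0.8, 0.9, 1.0]:
        print(lam, mean_rmse(lam, alpha=0.1))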

At this time it is unclear why methods that involve some bootstrapping perform so much better than pure nonbootstrapping methods. It could be that bootstrapping methods learn faster, or it could be that they actually learn something better than nonbootstrapping methods. The available results indicate that nonbootstrapping methods are better than bootstrapping methods at reducing MSE with respect to the true value function, but reducing MSE is not necessarily the most important goal. For example, if you add 1000 to the true action-value function at all state-action pairs, then it will have very poor MSE, but you will still get the optimal policy. Nothing quite that simple is going on with bootstrapping methods, but they do seem to do something right. We expect the understanding of these issues to improve as research continues.
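To illustrate that last point, the following sketch (using an arbitrary, made-up action-value table) confirms that adding 1000 everywhere produces an enormous MSE while leaving the greedy action in every state, and hence the greedy policy, unchanged:

import numpy as np

# Shifting all action values by a constant ruins the MSE to the true
# action-value function but does not change which action is greedy.
rng = np.random.default_rng(0)
Q_true = rng.normal(size=(5, 3))        # 5 states, 3 actions (hypothetical values)
Q_shifted = Q_true + 1000.0             # add 1000 to every state-action pair

mse = np.mean((Q_shifted - Q_true) ** 2)
same_policy = np.array_equal(Q_true.argmax(axis=1), Q_shifted.argmax(axis=1))
print(mse)          # 1,000,000: a huge error
print(same_policy)  # True: the greedy policy is identical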

