All the methods we have discussed so far depend to some extent on the initial action-value estimates, $Q_1(a)$. In the language of statistics, these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant $\alpha$, the bias is permanent, though it decreases over time as given by (2.6). In practice, this kind of bias is usually not a problem and can sometimes be very helpful. The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero. The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be expected.
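The decay of this bias can be checked numerically. The sketch below (not from the text; variable names are ours) assumes (2.6) has the standard weighted-average form $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$, and verifies that the incremental constant-$\alpha$ update matches it, with the weight on the initial estimate $Q_1$ shrinking geometrically:

```python
import numpy as np

alpha, Q1 = 0.1, 5.0
rng = np.random.default_rng(1)
rewards = rng.normal(0.0, 1.0, size=50)  # stationary rewards with true mean 0

# Incremental update with constant step size: Q <- Q + alpha * (R - Q).
Q = Q1
for r in rewards:
    Q += alpha * (r - Q)

# Closed form per (2.6): Q_{n+1} = (1-a)^n Q1 + sum_i a (1-a)^{n-i} R_i.
n = len(rewards)
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))
closed_form = (1 - alpha) ** n * Q1 + np.dot(weights, rewards)
assert abs(Q - closed_form) < 1e-9

# The bias from Q1 carries weight (1-alpha)^n, about 0.005 after n = 50:
print((1 - alpha) ** n)
```

The bias never reaches exactly zero, which is what makes it "permanent," but the $(1-\alpha)^n$ factor makes it negligible after a few dozen updates.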
Initial action values can also be used as a simple way of encouraging exploration. Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to +5. Recall that the $q_*(a)$ in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of +5 is thus wildly optimistic. But this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being ``disappointed'' with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.
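The mechanism is easy to see in a minimal sketch (ours, not the book's; the task parameters follow the testbed description above). A purely greedy learner with optimistic estimates still ends up trying every arm, because each pull "disappoints" the chosen arm's estimate below the others':

```python
import numpy as np

rng = np.random.default_rng(0)

# One 10-armed testbed task: true values q*(a) ~ N(0, 1); rewards ~ N(q*(a), 1).
q_star = rng.normal(0.0, 1.0, size=10)

alpha = 0.1           # constant step size
Q = np.full(10, 5.0)  # wildly optimistic initial estimates
counts = np.zeros(10, dtype=int)

for t in range(200):
    a = int(np.argmax(Q))           # purely greedy: no epsilon
    r = rng.normal(q_star[a], 1.0)  # sampled reward, almost surely below +5
    Q[a] += alpha * (r - Q[a])      # update drags Q[a] down toward the reward
    counts[a] += 1

# Optimism drives exploration: every arm ends up being tried.
print(counts)
```

Because each selected arm's estimate drops below +5 while the untried arms stay at +5, the greedy choice cycles through all arms before settling down.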
Figure 2.4: The effect of optimistic initial action-value estimates on the 10-armed testbed. Both methods used a constant step-size parameter, $\alpha = 0.1$.
Figure 2.4 shows the performance on the 10-armed bandit testbed of a greedy method using $Q_1(a) = +5$, for all $a$. For comparison, also shown is an $\varepsilon$-greedy method with $Q_1(a) = 0$ and $\varepsilon = 0.1$. Both methods used a constant step-size parameter, $\alpha = 0.1$. Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values. We regard it as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration. For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses on the initial state in any special way is unlikely to help with the general nonstationary case. The beginning of time occurs only once, and thus we should not focus on it too much. This criticism applies as well to the sample-average methods, which also treat the beginning of time as a special event, averaging all subsequent rewards with equal weights. Nevertheless, all of these methods are very simple, and one of them, or some simple combination of them, is often adequate in practice. In the rest of this book we make frequent use of several of the exploration techniques explored here in the simpler nonassociative setting of the $n$-armed bandit problem.
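The comparison in Figure 2.4 can be reproduced in miniature. The sketch below (our code, with a smaller number of tasks and steps than the figure's 2000 tasks) averages the fraction of plays on which the optimal action was selected, for the optimistic greedy method and the realistic $\varepsilon$-greedy method:

```python
import numpy as np

def run(n_tasks=200, n_steps=500, alpha=0.1, eps=0.0, Q1=0.0, seed=0):
    """Average fraction of optimal-action choices per step over random tasks."""
    rng = np.random.default_rng(seed)
    optimal = np.zeros(n_steps)
    for _ in range(n_tasks):
        q_star = rng.normal(0.0, 1.0, size=10)  # fresh 10-armed task
        best = int(np.argmax(q_star))
        Q = np.full(10, Q1)
        for t in range(n_steps):
            if rng.random() < eps:
                a = int(rng.integers(10))       # explore uniformly
            else:
                a = int(np.argmax(Q))           # exploit current estimates
            r = rng.normal(q_star[a], 1.0)
            Q[a] += alpha * (r - Q[a])          # constant step-size update
            optimal[t] += (a == best)
    return optimal / n_tasks

greedy_opt = run(eps=0.0, Q1=5.0)  # optimistic, greedy
eps_opt    = run(eps=0.1, Q1=0.0)  # realistic, epsilon-greedy

# Compare late-learning performance (cf. the crossover in Figure 2.4).
print(greedy_opt[-50:].mean(), eps_opt[-50:].mean())
```

With enough tasks, both curves rise well above chance (0.1), and the optimistic method's early dip and later advantage mirror the qualitative shape described above.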
The results shown in Figure 2.4 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? What might make this method perform particularly better or worse, on average, on particular early plays?