Next: 2.3 Softmax Action Selection Up: 2 Evaluative Feedback Previous: 2.1 An n-armed Bandit

2.2 Action-Value Methods

We begin by looking more closely at some simple algorithms for estimating the value of actions and for using the estimates to make action-selection decisions. In this chapter, we denote the true (actual) value of action a as , and the estimated value after t plays as . Recall that the true value of an action is the mean reward given that the action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. In other words, if, after t decisions, action a has been chosen times, yielding rewards , then its value is estimated to be

If , then we define instead as some default value, e.g., . As , by the law of large numbers converges to . We call this the sample-average method for estimating action values because each estimate is a simple average of the sample of relevant rewards. Of course this is just one way to estimate action values, not necessarily the best one. Nevertheless, for now let us stay with this simple estimation algorithm and turn to the question of how the estimates might be used to select actions.

The simplest action selection rule is to select the action (or one of the actions) with highest estimated action value, i.e., to select on the tth play one of the greedy actions, , for which . This method always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better. A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability , instead select an action at random, uniformly, independently of the action-value estimates. We call methods using this near-greedy action selection rule -greedy methods. An advantage of these methods is that in the limit as the number of plays increases, every action will be sampled an infinite number of times, guaranteeing that for all a, and thus ensuring that all the converge to . This of course implies that the probability of selecting the optimal action converges to , i.e., to near certainty. These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.

To roughly assess the relative effectiveness of the greedy and -greedy methods, we compared them numerically on a suite of test problems. This is a set of 2000 randomly generated n -armed bandit tasks with n=10. For each action, a, the rewards were selected from a normal (Gaussian) probability distribution with mean and variance 1. The 2000 n -armed bandit tasks were generated by re-selecting the 2000 times, each according to a normal distribution with mean 0 and variance 1. Averaging over tasks, we can plot the performance and behavior of various methods as they improve with experience over 1000 plays, as in Figure 2.1. We call this suite of test tasks the 10-armed testbed.

Figure 2.1: Average performance of -greedy action-value methods on 10-armed testbed. These data are averages over 2000 tasks. All methods used sample averages as their action-value estimates.

Figure 2.1 compares a greedy method with two -greedy methods ( and ), as described above, on the 10-armed testbed. Both methods formed their action-value estimates using the sample-average technique. The upper graph shows the increase in expected reward with experience. The greedy method improves slightly faster than the other methods at the very beginning, but then levels off at a lower level. It achieves a reward-per-step of only about 1 compared to the best possible of about 1.55 on this testbed. The greedy method performs significantly worse in the long run because it often gets stuck performing suboptimal actions. The lower graph shows that the greedy method found the optimal action only in approximately one-third of the tasks. In the other two-thirds, its initial samples of the optimal action were disappointing, and it never returned to it. The -greedy methods eventually perform better because they continue to explore, and continue to improve their chances of recognizing the optimal action. The method explores more, and usually finds the optimal action earlier, but never selects it more than 91% of the time. The method improves more slowly, but eventually performs better than the method by both performance measures. It is also possible to reduce over time to try to get the best of both high and low values.

The advantage of -greedy over greedy methods depends on the problem. For example, suppose the reward variance had been larger, say 10 instead of 1. With noisier rewards it takes more exploration to find the optimal action, and -greedy methods should fare even better relative to the greedy method. On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case the greedy method might actually perform best because it would soon find the optimal action and then never explore. But even in the deterministic case, there is a large advantage to exploring if we weaken some of the other assumptions. For example, suppose the bandit problem were nonstationary, that is, that the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the non-greedy actions has not changed to become better than the greedy one. As we will see in the next few chapters, effective nonstationarity is the case most commonly encountered in reinforcement learning. Even if the underlying problem is stationary and deterministic, the learner faces a set of bandit-like decision problems each of which changes over time due to the learning process itself. Reinforcement learning problems have a strong requirement for some sort of balance between exploration and exploitation.

Exercise .

In the comparison shown in Figure 2.1, which method will perform best in the long run, in terms of cumulative reward and cumulative probability of selecting the best action? How much better will it be?

Next: 2.3 Softmax Action Selection Up: 2 Evaluative Feedback Previous: 2.1 An n-armed Bandit

Richard Sutton
Sat May 31 12:02:11 EDT 1997