A central intuition underlying reinforcement learning is that actions followed by large rewards should be made more likely to recur, whereas actions followed by small rewards should be made less likely to recur. But how is the learner to know what constitutes a large or a small reward? If an action is taken and the environment returns a reward of 5, is that large or small? To make such a judgment one must compare the reward with some standard or reference level, called the reference reward. A natural choice for the reference reward is an average of previously received rewards. In other words, a reward is interpreted as large if it is higher than average, and small if it is lower than average. Learning methods based on this idea are called reinforcement comparison methods. These methods are sometimes more effective than action-value methods. They are also the precursors to actor-critic methods, a class of methods for solving the full reinforcement learning problem that we present later.
Reinforcement comparison methods typically do not maintain estimates of action values, but only of an overall reward level. In order to pick among the actions, they maintain a separate measure of their preference for each action. Let us denote the preference for action $a$ on play $t$ by $p_t(a)$. The preferences might be used to determine action-selection probabilities according to a softmax relationship, such as

$$\pi_t(a) = \frac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}, \tag{2.9}$$

where $\pi_t(a)$ denotes the probability of selecting action $a$ on play $t$.
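As a concrete illustration, the softmax relationship can be sketched in Python. The helper name `softmax_probs` and the max-subtraction trick for numerical stability are choices of this sketch, not part of the text's specification:

```python
import math

def softmax_probs(prefs):
    """Action-selection probabilities from action preferences (softmax).

    Subtracting the maximum preference before exponentiating is a standard
    numerical-stability trick; it leaves the probabilities unchanged.
    """
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# With equal preferences, every action is equally likely (each entry is 1/3).
print(softmax_probs([0.0, 0.0, 0.0]))
```

Note that only the differences among preferences matter: adding a constant to every preference leaves the probabilities unchanged.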
The comparison idea is used to update the action preferences: after each play, the preference for the action selected on that play, $a_t$, is incremented by the difference between the received reward and the reference reward,

$$p_{t+1}(a_t) = p_t(a_t) + \beta \big[ r_t - \bar{r}_t \big], \tag{2.10}$$

where $\bar{r}_t$ is the reference reward and $\beta$ is a positive step-size parameter. The reference reward is an incremental average of all recently received rewards, whichever actions were taken. After the update (2.10), the reference reward is updated:

$$\bar{r}_{t+1} = \bar{r}_t + \alpha \big[ r_t - \bar{r}_t \big], \tag{2.11}$$

where $\alpha$, $0 < \alpha \le 1$, is a step-size parameter.
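One full play of the method just described can be sketched as follows. This is a minimal sketch, assuming the standard incremental form of the reference-reward update; the function names and the default step sizes are placeholders of this illustration, not values prescribed by the text:

```python
import math
import random

def softmax_probs(prefs):
    """Softmax action-selection probabilities from preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforcement_comparison_step(prefs, ref_reward, reward_fn,
                                  alpha=0.1, beta=0.1):
    """One play: select an action by softmax, then update the selected
    action's preference (2.10) and the reference reward.

    prefs is mutated in place; the new reference reward is returned.
    """
    probs = softmax_probs(prefs)
    a = random.choices(range(len(prefs)), weights=probs)[0]
    r = reward_fn(a)
    prefs[a] += beta * (r - ref_reward)        # preference update (2.10)
    ref_reward += alpha * (r - ref_reward)     # reference-reward update
    return prefs, ref_reward
```

Because the preference update is driven by the difference $r_t - \bar{r}_t$, an above-average reward raises the selected action's preference and a below-average reward lowers it, regardless of the absolute reward scale.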
Reinforcement comparison methods can be very effective, sometimes performing even better than action-value methods. Figure 2.5 shows the performance of the above algorithm ($\alpha = 0.1$, $\beta = 0.1$) on the 10-armed testbed. The performances of $\varepsilon$-greedy ($\varepsilon = 0.1$) action-value methods with $\alpha = 1/k$ and $\alpha = 0.1$ are also shown for comparison.
Exercise 2.9 The softmax action-selection rule given for reinforcement comparison methods (2.9) lacks the temperature parameter, $\tau$, used in the earlier softmax equation (2.2). Why do you think this was done? Has any important flexibility been lost here by omitting $\tau$?
Exercise 2.10 The reinforcement comparison methods described here have two step-size parameters, $\alpha$ and $\beta$. Could we, in general, reduce this to one parameter by choosing $\alpha = \beta$? What would be lost by doing this?
Exercise 2.11 (programming) Suppose the initial reference reward, $\bar{r}_0$, is far too low. Whatever action is selected first will then probably increase in its probability of selection. Thus it is likely to be selected again, and increased in probability again. In this way an early action that is no better than any other could crowd out all other actions for a long time. To counteract this effect, it is common to add a factor of $(1 - \pi_t(a_t))$ to the increment in (2.10). Design and implement an experiment to determine whether or not this really improves the performance of the algorithm.
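The modified increment mentioned in this exercise can be sketched as below; the experiment itself is left to the reader. The function name, the `damped` flag, and the default $\beta$ are inventions of this sketch:

```python
def preference_increment(reward, ref_reward, pi_a, beta=0.1, damped=True):
    """Increment for the selected action's preference.

    With damped=True the basic increment beta * (reward - ref_reward) is
    multiplied by (1 - pi_a), where pi_a is the selected action's current
    probability. An action already selected with near-certainty (pi_a near 1)
    then receives almost no further boost, limiting early lock-in.
    """
    base = beta * (reward - ref_reward)
    return base * (1.0 - pi_a) if damped else base
```

An experiment would compare learning curves on the 10-armed testbed with `damped` on and off, starting from a deliberately low initial reference reward.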