A central intuition underlying reinforcement learning is that actions followed by large rewards should be made more likely to recur, whereas actions followed by small rewards should be made less likely to recur. But how is the learner to know what constitutes a large or a small reward? If an action is taken and the environment returns a reward of 5, is that large or small? To make such a judgment one must compare the reward with some standard or reference level, called the reference reward. A natural choice for the reference reward is an average of previously received rewards. In other words, a reward is interpreted as large if it is higher than average, and small if it is lower than average. Learning methods based on this idea are called reinforcement comparison methods. These methods are sometimes more effective than action-value methods. They are also the precursors to actor-critic methods, an important class of methods, which we present later, for solving the full reinforcement learning problem.
Reinforcement-comparison methods typically do not maintain estimates of action values
but only of an overall reward level. In order to pick among the actions, they
maintain a separate measure of their preference for each action. Let us denote the
preference for action $a$ on play $t$ by $p_t(a)$. The preferences might be
used to determine action-selection probabilities according to a
softmax relationship, e.g.:

$$\pi_t(a) = \frac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}, \tag{2.8}$$

where $\pi_t(a)$ denotes the probability of selecting action $a$ on the $t$th play.
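As a concrete illustration, the following is a minimal sketch, in Python with NumPy, of softmax selection from a vector of preferences; the function name `softmax_probs` and the example numbers are our own, not part of the book's algorithm.

```python
import numpy as np

def softmax_probs(preferences):
    """Turn action preferences p_t(a) into selection probabilities pi_t(a), as in (2.8)."""
    # Subtracting the maximum preference improves numerical stability
    # and leaves the resulting probabilities unchanged.
    exp_p = np.exp(preferences - np.max(preferences))
    return exp_p / exp_p.sum()

# Example: three actions with preferences 0, 1, and 2.
probs = softmax_probs(np.array([0.0, 1.0, 2.0]))
action = np.random.choice(len(probs), p=probs)  # sample an action according to pi_t
```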
The reinforcement comparison idea is used in updating the action preferences. After
each play, the preference for the action selected on that play, $a_t$, is incremented by
the difference between the reward, $r_t$, and the reference reward, $\bar{r}_t$:

$$p_{t+1}(a_t) = p_t(a_t) + \beta \big[ r_t - \bar{r}_t \big], \tag{2.9}$$

where $\beta$ is a positive step-size parameter. This equation implements
the idea that high rewards should increase the probability of re-selecting the
action taken and low rewards should decrease its probability.
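In code this update is a single line; in the same Python setting as above (array `preferences`, selected action index `action`, scalars `reward` and `ref_reward`, step size `beta`, all names of our choosing):

```python
# Increment the preference of the action just taken by beta times the
# difference between the received reward and the reference reward, as in (2.9).
preferences[action] += beta * (reward - ref_reward)
```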
The reference reward is an incremental average of all recently received rewards,
whichever actions were taken. After the update (2.9), the reference reward is updated:

$$\bar{r}_{t+1} = \bar{r}_t + \alpha \big[ r_t - \bar{r}_t \big], \tag{2.10}$$

where $\alpha$, $0 < \alpha \leq 1$, is a step-size parameter as
usual. The initial value of the reference reward, $\bar{r}_0$, can be set
either optimistically, to encourage exploration, or according to prior
knowledge. The initial values of the action preferences can simply all be set to
zero. A constant $\alpha$ is a good choice here
because the distribution of rewards is changing over time as action selection
improves. We see here the first case in which the learning problem is effectively
nonstationary even though the underlying problem is stationary.
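Putting the pieces together, here is a minimal sketch of a complete reinforcement-comparison learner in Python with NumPy. The function and parameter names, the default step sizes, and the `bandit(a)` callback (assumed to return a sampled reward for action `a`) are our own choices for illustration, not a reference implementation from the book.

```python
import numpy as np

def reinforcement_comparison(bandit, n_actions, n_plays,
                             alpha=0.1, beta=0.1, initial_ref=0.0):
    """Run a reinforcement-comparison learner for n_plays plays.

    bandit(a) is assumed to return a sampled reward for action a.
    alpha is the step size for the reference-reward update (2.10);
    beta is the step size for the preference update (2.9).
    """
    preferences = np.zeros(n_actions)   # p_t(a), all initialized to zero
    ref_reward = initial_ref            # reference reward
    rewards = np.empty(n_plays)

    for t in range(n_plays):
        # Softmax action selection (2.8)
        exp_p = np.exp(preferences - preferences.max())
        probs = exp_p / exp_p.sum()
        a = np.random.choice(n_actions, p=probs)

        r = bandit(a)
        rewards[t] = r

        # Preference update (2.9) followed by reference-reward update (2.10)
        preferences[a] += beta * (r - ref_reward)
        ref_reward += alpha * (r - ref_reward)

    return rewards
```

For example, passing `bandit = lambda a: np.random.normal(true_values[a], 1.0)` with a length-10 array `true_values` drawn from a standard normal distribution would exercise the learner on a 10-armed testbed of the kind used in Figure 2.5.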
Figure 2.5: Reinforcement comparison methods versus action-value methods on the
10-armed testbed.
Reinforcement-comparison methods can be very effective, sometimes performing even
better than action-value methods. Figure 2.5 shows the
performance of the above algorithm ($\alpha = 0.1$, $\beta = 0.1$) on the 10-armed
testbed. The performances of $\varepsilon$-greedy ($\varepsilon = 0.1$)
action-value methods with $\alpha = 1/k$ and $\alpha = 0.1$ are also shown
for comparison.
Exercise.
The softmax action-selection rule given for reinforcement-comparison methods
(2.8) lacks the temperature parameter, $\tau$, used in the
earlier softmax equation (2.2). Why do you think this was done? Has any
important flexibility been lost here by omitting $\tau$?
Exercise.
The reinforcement-comparison methods described here have two step-size parameters,
$\alpha$ and $\beta$. Could we, in general, reduce this to just one parameter by
choosing $\beta = \alpha$? What would be lost by doing this?
Exercise. (programming)
Suppose the initial reference reward, $\bar{r}_0$, is far too low. Whatever action is
selected first will then probably increase in its probability of selection. Thus it
is likely to be selected again and increased in probability again. In this way an
early action that is no better than any other could crowd out all other actions for
a long time. To counteract this effect, it is common to add a factor of
$\big[1 - \pi_t(a_t)\big]$ to the increment in (2.9). Design and implement an
experiment to determine whether or not this really improves the performance of the
algorithm.
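Written out, the modified update described in this exercise would take the form (our rendering, under the notation above):

$$p_{t+1}(a_t) = p_t(a_t) + \beta \big[ r_t - \bar{r}_t \big] \big[ 1 - \pi_t(a_t) \big].$$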