Another class of effective learning methods for the $n$-armed bandit
problem is pursuit methods. Pursuit methods maintain both
action-value estimates and action preferences, with the preferences
continually ``pursuing'' the action that is greedy according to the
current action-value estimates. In the simplest pursuit method, the action
preferences are the probabilities, $\pi_t(a)$, with which each action, $a$, is selected
on play $t$.
Just before selecting the action, the probabilities are updated so as to
make the greedy action more likely to be selected. Suppose $a^*_{t+1} = \arg\max_a Q_t(a)$ is the greedy action (or a random sample from the
greedy actions if there is more than one) just prior to selecting action
$a_{t+1}$. Then the probability of selecting $a_{t+1} = a^*_{t+1}$
is incremented a fraction, $\beta$, of the way toward one:
\[
\pi_{t+1}(a^*_{t+1}) = \pi_t(a^*_{t+1}) + \beta \Big[ 1 - \pi_t(a^*_{t+1}) \Big] ,
\]
while the probabilities of selecting the other actions are decremented toward zero:
\[
\pi_{t+1}(a) = \pi_t(a) + \beta \Big[ 0 - \pi_t(a) \Big] , \qquad \mbox{for all } a \neq a^*_{t+1} .
\]
After $a_{t+1}$ is taken and the new reward is
observed, the action value $Q_{t+1}(a_{t+1})$ is updated in one of the ways
discussed in the preceding sections, e.g., to be a sample average
of the observed rewards, using (2.1).
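To make the order of these operations concrete, here is a minimal Python sketch of a single play of this pursuit method. The function and variable names (probs, q, counts, reward_fn, beta) and the sample-average value update are illustrative choices for this sketch, not notation from the text.
\begin{verbatim}
import numpy as np

def pursuit_play(probs, q, counts, reward_fn, beta=0.01, rng=np.random):
    """One play of the simplest pursuit method (illustrative sketch).

    probs     -- pi_t(a): current selection probabilities, one entry per action
    q         -- Q_t(a): current sample-average action-value estimates
    counts    -- number of times each action has been taken so far
    reward_fn -- callable: reward_fn(a) returns a sampled reward for action a
    beta      -- fraction moved toward the greedy action on each play
    """
    # Greedy action with respect to the current estimates (ties broken at random).
    greedy = rng.choice(np.flatnonzero(q == q.max()))

    # Decrement every probability a fraction beta of the way toward zero ...
    probs += beta * (0.0 - probs)
    # ... then give the greedy action its increment toward one:
    # new = old + beta*(1 - old).  The probabilities still sum to one.
    probs[greedy] += beta

    # Select the action from the updated probabilities, observe a reward,
    # and update the sample-average estimate for the selected action.
    a = rng.choice(len(probs), p=probs)
    r = reward_fn(a)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]
    return a, r
\end{verbatim}
Because every probability is first scaled by $1-\beta$ and only the greedy action then receives the extra $\beta$, the vector remains a valid probability distribution after each play.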
Figure 2.6: Performance of the pursuit method vis-\`a-vis action-value and
reinforcement-comparison methods on the 10-armed testbed.
Figure 2.6 shows the performance of the pursuit algorithm described
above when the action values were estimated using sample averages (computed
incrementally). In these results, the initial action probabilities
were $\pi_0(a) = 1/n$, for all $a$, and the parameter $\beta$ was $0.01$. For comparison, we
also show the performance of an $\varepsilon$-greedy method ($\varepsilon = 0.1$) with action values also estimated using
sample averages. The performance of the reinforcement-comparison algorithm from the
previous section is also shown. Although the pursuit algorithm performs the best of
these three on this task at these parameter settings, the ordering could well be
different in other cases. All three of these methods appear to have their uses
and advantages.
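A sketch of how the pursuit curve in Figure 2.6 might be reproduced is given below. It assumes the Gaussian 10-armed testbed described earlier in the chapter (true action values drawn from a unit normal distribution, unit-variance reward noise); the number of plays and the number of bandit tasks are arbitrary choices for this sketch, and only the pursuit curve is generated; the $\varepsilon$-greedy and reinforcement-comparison baselines would be set up analogously.
\begin{verbatim}
import numpy as np

def run_pursuit(q_star, plays=1000, beta=0.01, rng=None):
    """Reward sequence of the pursuit method on one bandit task (sketch)."""
    rng = rng or np.random.default_rng()
    n = len(q_star)
    probs = np.full(n, 1.0 / n)          # pi_0(a) = 1/n for all a
    q = np.zeros(n)                      # sample-average value estimates
    counts = np.zeros(n)
    rewards = np.empty(plays)
    for t in range(plays):
        greedy = rng.choice(np.flatnonzero(q == q.max()))
        probs += beta * (0.0 - probs)    # pursue the greedy action
        probs[greedy] += beta
        a = rng.choice(n, p=probs)
        r = rng.normal(q_star[a], 1.0)   # unit-variance reward noise
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]
        rewards[t] = r
    return rewards

# Average over independently generated 10-armed tasks, as in the testbed.
rng = np.random.default_rng(0)
tasks = [rng.normal(0.0, 1.0, 10) for _ in range(200)]   # 200 tasks: arbitrary
curve = np.mean([run_pursuit(qs, rng=rng) for qs in tasks], axis=0)
print(curve[:5], curve[-5:])             # average reward, early vs. late plays
\end{verbatim}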
Exercise.
An $\varepsilon$-greedy method always selects a random action on a fraction, $\varepsilon$, of the
time steps. How about the pursuit algorithm? Will it eventually select the optimal
action with probability approaching certainty?
Exercise.
For many of the problems we will encounter later in this book it is not feasible to directly update action probabilities. To use pursuit methods in these cases it is necessary to modify them to use action preferences that are not probabilities, but which determine action probabilities according to a softmax relationship such as the Gibbs distribution (2.8). How can the pursuit algorithm described above be modified to be used in this way? Specify a complete algorithm, including the equations for action-values, preferences, and probabilities at each play.
Exercise. (programming)
How well does the algorithm you proposed in the previous exercise
perform? Design and run an experiment assessing the performance of your method. Discuss the role of parameter settings in your experiment.
Exercise.
The pursuit algorithm described above is suited only for stationary
environments because the action probabilities converge, albeit slowly, to
certainty. How could you combine the pursuit idea with the $\varepsilon$-greedy idea
to obtain a method with performance close to that of the pursuit
algorithm, but which always continues to explore to some small degree?