We have presented in this chapter some simple ways of balancing exploration and
exploitation. The $\varepsilon$-greedy methods simply choose randomly a small fraction of the
time, the softmax methods grade their action probabilities according to the current
action-value estimates, and the pursuit methods just keep taking steps toward the
current greedy action. Are these simple methods really the best we can do in terms
of practically useful algorithms? So far, the answer appears to be ``yes''. Despite
their simplicity, in our opinion the methods presented in this chapter can fairly be
considered the state-of-the-art. There are more sophisticated methods, but their
complexity and assumptions make them impractical for the full reinforcement learning
problem that is our real focus. Starting in Chapter 5 we present learning
methods for solving the full reinforcement learning problem that use in part the
simple methods explored in this chapter.
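As a concrete reminder of how these three selection rules operate, here is a minimal sketch in Python; the array names, the temperature, and the step-size parameter are illustrative choices rather than notation taken from this chapter.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon choose an action at random; otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax_action(Q, temperature=1.0):
    # Grade action probabilities by the current value estimates (Gibbs/Boltzmann).
    prefs = np.exp((np.asarray(Q) - np.max(Q)) / temperature)  # shift for stability
    probs = prefs / prefs.sum()
    return int(rng.choice(len(Q), p=probs))

def pursuit_update(pi, Q, beta=0.01):
    # Move the selection probabilities a fraction beta toward the current greedy action.
    greedy = int(np.argmax(Q))
    target = np.eye(len(Q))[greedy]     # probability 1 on the greedy action
    return pi + beta * (target - pi)    # increments greedy, decrements the rest
\end{verbatim}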
Although the simple methods explored in this chapter may be the best we can do at present, they are far from a fully satisfactory solution to the problem of balancing exploration and exploitation. We conclude this chapter with a brief look at some of the current ideas that, while not as yet practically useful, may point the way toward better solutions.
One promising idea is to use estimates of the uncertainty of the action-value estimates to direct and encourage exploration. For example, suppose there are two actions estimated to have values slightly less than that of the greedy action, but which differ greatly in their degree of uncertainty. One estimate is very certain; perhaps that action has been tried many times and many rewards have been observed. The uncertainty for this action value is so low that its true value is very unlikely to be higher than the value of the greedy action. The other action is known less well, and the estimate of its value is very uncertain. The true value of this action could easily be better than that of the greedy action. Obviously, it makes more sense to explore the second action than the first.
This line of thought leads to interval estimation methods. These
methods estimate for each action the confidence interval of the action
value. That is, rather than learning that the action value is approximately 10,
they learn that it is between 9 and 11 with 95% confidence. The action selected
is then the action whose confidence interval has the highest upper limit. This
encourages exploration of actions that are uncertain and have a chance of
ultimately being the best action. In some cases one can obtain guarantees that the
optimal action has been found with confidence equal to the confidence factor (e.g.,
the 95%). Interval estimation methods are problematic in practice because of the
complexity of the statistical methods used to estimate the confidence intervals.
Moreover, the underlying statistical assumptions required by these methods are
often not satisfied. Nevertheless, the idea of using confidence intervals, or some
other measure of uncertainty, to encourage exploration of particular actions is sound
and appealing.
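As a rough illustration of the interval estimation idea, the following sketch selects the action whose approximate 95% confidence interval has the highest upper limit. The normal approximation used here is an assumption made for illustration, not the particular statistical machinery such methods actually employ.

\begin{verbatim}
import numpy as np

def interval_estimation_action(rewards_per_action, z=1.96):
    # rewards_per_action: one array of observed rewards per action
    # (each assumed to hold at least two observations).
    # Select the action whose approximate 95% confidence interval
    # has the highest upper limit.
    upper_limits = []
    for rewards in rewards_per_action:
        rewards = np.asarray(rewards, dtype=float)
        mean = rewards.mean()
        stderr = rewards.std(ddof=1) / np.sqrt(len(rewards))  # normal approximation
        upper_limits.append(mean + z * stderr)
    return int(np.argmax(upper_limits))
\end{verbatim}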
There is also a well-known algorithm for computing the Bayes optimal
way to balance exploration and exploitation. This method is
computationally intractable when done exactly, but there may be efficient
ways to approximate it. In this optimal method we assume that we know the
distribution of problem instances, i.e., the probability of each
possible set of true action values. Given any action selection, we can
then compute the probability of each possible immediate reward and the
resultant posterior probability distribution over action values. This
evolving distribution becomes the information state of the
problem. Given a horizon, say 1000 plays, one can consider all possible
actions, all possible resulting rewards, all possible next actions, all
next rewards, and so on for all 1000 plays. Given the assumptions, the
rewards and probabilities of each possible chain of events can be
determined, and one need only pick the best. But the tree of
possibilities grows extremely rapidly; even if there are only two actions
and two rewards the tree will have $2^{2000}$ leaves. This approach effectively turns the bandit problem
into an instance of the full reinforcement learning problem. In the end,
we may be able to use reinforcement learning methods to approximate
this optimal solution. But that is a topic for current research and beyond
the scope of this introductory book.
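For concreteness, here is a minimal sketch of the exact computation for a two-armed Bernoulli bandit with uniform priors, a setting assumed here purely for illustration. The pair of success/failure counts for each arm serves as the information state, and the recursion expands every possible chain of actions and rewards out to the horizon, which is why it is feasible only for tiny horizons.

\begin{verbatim}
def value(state, plays_left):
    # Expected total reward of playing optimally from this information state.
    # state = ((s0, f0), (s1, f1)): success/failure counts for each arm,
    # starting from uniform Beta(1,1) priors.
    if plays_left == 0:
        return 0.0
    best = 0.0
    for arm, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)                   # posterior probability of reward 1
        win = list(state); win[arm] = (s + 1, f)    # information state if reward 1
        lose = list(state); lose[arm] = (s, f + 1)  # information state if reward 0
        q = p * (1.0 + value(tuple(win), plays_left - 1)) \
            + (1.0 - p) * value(tuple(lose), plays_left - 1)
        best = max(best, q)
    return best

print(value(((0, 0), (0, 0)), 8))   # expands roughly 4**8 chains of events
\end{verbatim}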
The classical solution to balancing exploration and exploitation in $n$-armed bandit problems is to compute special functions called Gittins indices. These provide an optimal solution to a certain kind of bandit problem more general than that considered here, but which assumes the prior distribution of possible problems is known. Unfortunately, this method does not appear to generalize either in its theory or computational tractability to the full reinforcement learning problem that we consider in the rest of the book.