-Armed Bandit Problem">
Consider the following learning
problem. You are faced repeatedly with a choice among n different options,
or actions. After each choice you receive a numerical
reward chosen from a stationary probability distribution that depends on the action
you selected. Your objective is to maximize the expected total reward over some
time period, for example, over 1000 action selections. Each action selection is
called a play.
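To make the setup concrete, the following is a minimal sketch of such a bandit in Python. The class name, the Gaussian reward noise, and the specific numbers are illustrative assumptions rather than part of the problem statement, which only requires that each action's reward distribution be stationary.

```python
import random

class NArmedBandit:
    """Minimal sketch of the n-armed bandit setup (illustrative assumptions:
    stationary Gaussian reward distributions with fixed, unknown means)."""

    def __init__(self, n=10, seed=0):
        self._rng = random.Random(seed)
        # One fixed, unknown mean reward per lever (action).
        self.true_values = [self._rng.gauss(0.0, 1.0) for _ in range(n)]

    def play(self, action):
        """A single play: select an action and receive a noisy reward."""
        return self._rng.gauss(self.true_values[action], 1.0)
```

Over, say, 1000 calls to play, the objective is to maximize the total reward received.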
This is the original form of the n-armed bandit problem, so named by analogy
to a slot machine, or "one-armed bandit," except that it has n levers instead of
one. Each action selection is like a play of one of the slot machine's levers,
and the rewards are the payoffs for hitting the jackpot. Through repeated plays
you are to maximize your winnings by concentrating your plays on the best levers.
Another analogy is that of a doctor choosing between experimental treatments for a
series of seriously ill patients. Each play is a treatment selection, and each
reward is the survival or well-being of the patient. Today the term "n-armed
bandit problem" is often used for a generalization of the problem described above,
but in this book we use it to refer just to this simple case.
In our n-armed bandit problem, each action has an expected or mean reward given
that that action is selected; let us call this the value of that action. If you
knew the value of each action, then it would be trivial to solve the n-armed
bandit problem: you would always select
the action with highest value. We assume that you do not know the action values with
certainty, although you may have estimates.
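Written out (in notation that is conventional for this problem rather than taken from this passage), the value of an action a is its expected reward given that a is selected, and knowing these values would reduce the problem to picking their maximizer:

```latex
Q^{*}(a) = \mathbb{E}[\, r_t \mid a_t = a \,], \qquad a^{*} = \arg\max_{a} Q^{*}(a)
```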
If you maintain estimates of the action values, then at any time there is at least one action whose estimated value is greatest. We call this a greedy action. If you select a greedy action, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring because this enables you to improve your estimate of the nongreedy action's value. Exploitation is the right thing to do to maximize the expected reward on the one play, but exploration may produce the greater total reward in the long run. For example, suppose the greedy action's value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don't know which one. If you have many plays yet to make, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the "conflict" between exploration and exploitation.
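As a rough sketch of this distinction (not one of the methods presented later in the chapter; the explore_prob knob and function name are purely illustrative), a learner that maintains action-value estimates might exploit by selecting a greedy action and occasionally explore by selecting an action at random:

```python
import random

def choose_action(estimates, explore_prob=0.1):
    """Exploit the greedy action most of the time; occasionally explore.

    `estimates` is a list of current action-value estimates; `explore_prob`
    is an illustrative knob for how often to explore."""
    if random.random() < explore_prob:
        # Explore: pick an action uniformly at random.
        return random.randrange(len(estimates))
    # Exploit: pick a greedy action (one whose estimated value is greatest).
    best = max(estimates)
    return random.choice([a for a, q in enumerate(estimates) if q == best])
```

How often to explore, and which nongreedy actions to try, is exactly the balancing question taken up in the rest of this chapter.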
In any specific case, whether it is better to explore or exploit depends in a
complex way on the precise values of the estimates, uncertainties, and the number
of remaining plays. There are many
sophisticated methods for balancing
exploration and exploitation for particular mathematical formulations of
the n-armed bandit and related problems. However, most of these methods
make strong assumptions about stationarity and prior knowledge that are
either violated or impossible to verify in applications and in the full
reinforcement learning problem that we consider in subsequent chapters.
The guarantees of optimality or bounded loss for these methods are of little comfort
when the assumptions of their theory do not apply.
In this book we do not worry about balancing exploration and
exploitation in a sophisticated way; we worry only about balancing them
at all. In this chapter we present several simple
balancing methods for the n-armed bandit problem and show that they work
much better than methods that always exploit. In addition, we point out
that supervised learning methods (or rather the methods closest to
supervised learning methods when adapted to this problem) perform poorly
on this problem because they do not balance exploration and exploitation
at all. The need to balance exploration and
exploitation is a distinctive challenge that arises in reinforcement
learning; the simplicity of the
n-armed bandit problem enables us to show
this in a particularly clear form.