The action-value methods we have discussed so far all estimate their action values as the sample averages of observed rewards. The obvious implementation is to maintain, for each action $a$, a record of all the rewards that have followed the selection of that action. Then, whenever the estimate of the action value is needed, it can be computed according to (2.1), which we repeat here:
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}, \qquad (2.1)$$
where $r_1, \ldots, r_{k_a}$ are all the rewards received following all selections of action $a$ up until play $t$. A problem with this straightforward implementation is that its memory and computational requirements grow over time without bound. That is, each additional reward following a selection of action $a$ requires more memory to store it and results in more computation being required to compute $Q_t(a)$.
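For concreteness, here is a minimal Python sketch of this naive approach; the class and method names are illustrative, not from the text.

```python
class NaiveActionValues:
    """Sample-average estimates that store every observed reward."""

    def __init__(self, n_actions):
        # One growing list of rewards per action: memory is unbounded.
        self.rewards = [[] for _ in range(n_actions)]

    def record(self, action, reward):
        self.rewards[action].append(reward)

    def estimate(self, action):
        history = self.rewards[action]
        if not history:
            return 0.0  # arbitrary default before any reward is seen
        # Recomputing the mean costs time proportional to len(history).
        return sum(history) / len(history)
```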
As you might suspect, this is not really necessary. It is easy to devise incremental update formulas for computing averages with small, constant computation required to process each new reward. For some action, let $Q_k$ denote the average of its first $k$ rewards. Given this average and a $(k+1)$st reward, $r_{k+1}$, the average of all $k+1$ rewards can be computed by
$$Q_{k+1} = Q_k + \frac{1}{k+1}\Bigl[ r_{k+1} - Q_k \Bigr], \qquad (2.4)$$
which holds even for $k=0$, obtaining $Q_1 = r_1$ for arbitrary $Q_0$. This implementation requires memory only for $Q_k$ and $k$, and only the small computation (2.4) for each new reward.
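By contrast, a sketch of the incremental implementation (again with illustrative names) needs only the current average and count for each action:

```python
class IncrementalActionValues:
    """Sample-average estimates maintained by the incremental rule (2.4)."""

    def __init__(self, n_actions, initial_value=0.0):
        self.q = [initial_value] * n_actions  # current average for each action
        self.k = [0] * n_actions              # rewards seen for each action

    def record(self, action, reward):
        self.k[action] += 1
        # Equation (2.4), with the count already incremented so the
        # step size is 1 / k.
        self.q[action] += (reward - self.q[action]) / self.k[action]

    def estimate(self, action):
        return self.q[action]  # constant time; memory is just q and k
```

Since (2.4) is an exact rearrangement of the sample average, both sketches yield algebraically identical estimates; only their memory and per-update costs differ.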
The update rule (2.4) is of a form that occurs frequently throughout this book. The general form is
$$\textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize}\,\Bigl[\textit{Target} - \textit{OldEstimate}\Bigr].$$
The expression $[\textit{Target} - \textit{OldEstimate}]$ is an error in the estimate. It is reduced by taking a step toward the ``target.'' The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the $(k+1)$st reward.
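Read as code, the general form is a single line; the function below merely restates it, with `estimate`, `target`, and `step_size` as placeholder names.

```python
def update(estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return estimate + step_size * (target - estimate)
```

With `step_size = 1/(k+1)` and `target = r_{k+1}`, this reduces to (2.4).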
Note that the step size used in the incremental method described above changes from time step to time step. In processing the $k$th reward for action $a$, that method uses a step size of $\frac{1}{k}$. In this book we denote the step size by the symbol $\alpha$, or, more generally, by $\alpha_k(a)$. For example, the above incremental implementation of the sample-average method is described by the equation $\alpha_k(a) = \frac{1}{k}$. Accordingly, we sometimes use the informal shorthand $\alpha = \frac{1}{k}$ to refer to this case, leaving the action dependence implicit.
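As a quick numerical check (a sketch assuming only Python's standard library), applying the general rule with step size $\frac{1}{k}$ to a stream of rewards reproduces their ordinary sample average:

```python
import random

rewards = [random.gauss(0.0, 1.0) for _ in range(1000)]

q, k = 0.0, 0
for r in rewards:
    k += 1
    q += (r - q) / k  # general update rule with alpha = 1/k

# The incremental result matches the plain sample average.
assert abs(q - sum(rewards) / len(rewards)) < 1e-9
```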
Exercise.  Give pseudocode for a complete algorithm for the $n$-armed bandit problem. Use greedy action selection and incremental computation of action values with the $\alpha = \frac{1}{k}$ step size. Assume a function that takes an action and returns a reward. Use arrays and variables; do not subscript anything by the time index $t$. Indicate how the action values are initialized and updated after each reward.
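One possible sketch of such an algorithm in Python follows; it is not the book's solution, and the reward function `bandit`, the number of arms `n`, and the number of plays are placeholders for whatever the problem supplies.

```python
import random

n = 10                # number of arms (illustrative)
Q = [0.0] * n         # action-value estimates, initialized to zero
k = [0] * n           # number of times each action has been taken

def bandit(a):
    """Placeholder for the assumed reward function."""
    return random.gauss(0.1 * a, 1.0)

for _ in range(1000):                            # number of plays (illustrative)
    a = max(range(n), key=lambda i: Q[i])        # greedy action selection
    r = bandit(a)
    k[a] += 1
    Q[a] += (r - Q[a]) / k[a]                    # incremental update, alpha = 1/k
```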