The action-value methods we have discussed so far all estimate action
values as sample averages of observed rewards.
The obvious implementation is to maintain, for each action $a$, a record of all
the rewards that have followed the selection of that action. Then, when
the estimate of the value of action $a$ is needed at time $t$, it can be computed
according to (2.1), which we repeat here:

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a} , \qquad (2.1)$$

where $r_1, \ldots, r_{k_a}$ are all the rewards received following the selections of action $a$ prior to play $t$, and $k_a$ is the number of those selections. A problem with this straightforward implementation is that its memory and computational requirements grow over time without bound: each additional reward must be stored, and all of them must be summed whenever the estimate is recomputed.
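To make the point concrete, here is a minimal Python sketch (not from the book; the class and method names are illustrative) of this record-keeping implementation: every reward is stored, and the sample average is recomputed from the full record on demand.

```python
# Hypothetical sketch of the "obvious" implementation: store every reward
# observed for each action and recompute the sample average when needed.

class RecordKeepingEstimator:
    def __init__(self, num_actions):
        # One growing list of rewards per action -- memory is unbounded.
        self.rewards = [[] for _ in range(num_actions)]

    def record(self, action, reward):
        self.rewards[action].append(reward)

    def value(self, action):
        # Q_t(a): sample average of all rewards that followed action a
        # (taken as 0 if the action has never been selected).
        r = self.rewards[action]
        return sum(r) / len(r) if r else 0.0
```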
As you might suspect, this is not really necessary. It is easy to devise
incremental update formulas for computing averages with small, constant
computation required to process each new reward. For some action, let $Q_k$ denote
the average of its first $k$ rewards (not to be confused with $Q_k(a)$, the
average for action $a$ at the $k$th play). Given this average and a
$(k+1)$st reward, $r_{k+1}$, the average of all $k+1$ rewards can be computed by

$$Q_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} r_i = Q_k + \frac{1}{k+1} \Big[ r_{k+1} - Q_k \Big] , \qquad (2.4)$$

which holds even for $k = 0$, giving $Q_1 = r_1$ for arbitrary $Q_0$. This implementation requires memory only for $Q_k$ and $k$, and only the small computation (2.4) for each new reward.
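As a concrete illustration (a sketch with illustrative names, not from the book), the incremental rule (2.4) can be implemented with a single running estimate and a count:

```python
# Hypothetical sketch of the incremental sample average (2.4):
#   Q_{k+1} = Q_k + (1 / (k + 1)) * (r_{k+1} - Q_k)
# Only the running estimate and the reward count need to be stored.

def update_average(q, k, reward):
    """Fold the (k+1)st reward into the average of the first k rewards."""
    return q + (reward - q) / (k + 1)

# The incremental form reproduces the ordinary sample average:
rewards = [1.0, 0.0, 2.0, 1.0]
q, k = 0.0, 0                 # arbitrary Q_0; Q_1 = r_1 regardless
for r in rewards:
    q = update_average(q, k, r)
    k += 1
print(q)                      # 1.0, the same as sum(rewards) / len(rewards)
```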
The update rule (2.4) is of a form that
occurs frequently throughout this book. The general form is

$$\textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize} \, \big[ \textit{Target} - \textit{OldEstimate} \big] .$$

The expression $[\textit{Target} - \textit{OldEstimate}]$ is an error in the estimate; it is reduced by taking a step toward the target.
Note that the step-size parameter (\textit{StepSize}) used in the incremental method described above changes from time step to time step. In processing the $k$th reward for action $a$, that method uses a step-size parameter of $\frac{1}{k}$. In this book we denote the step-size parameter by the symbol $\alpha$ or, more generally, by $\alpha_k(a)$. For example, the above incremental implementation of the sample-average method is described by the equation $\alpha_k(a) = \frac{1}{k}$. Accordingly, we sometimes use the informal shorthand $\alpha = \frac{1}{k}$ to refer to this case, leaving the action dependence implicit.
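The general rule is a one-liner in code. The following sketch (function and variable names are my own, not the book's) shows the error-correction reading of the update and how the $\alpha = \frac{1}{k}$ choice recovers the sample-average update (2.4):

```python
# Hypothetical sketch of the general update rule:
#   NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]

def incremental_update(old_estimate, target, step_size):
    # (target - old_estimate) is the error in the estimate; the step size
    # sets what fraction of that error is corrected in this update.
    return old_estimate + step_size * (target - old_estimate)

# A single update moves the estimate part of the way toward the target:
print(incremental_update(0.5, 1.0, 0.1))      # 0.55
# Using the kth reward as the target with step size 1/k gives the
# sample-average update (2.4); here Q_2 = 0.5 and the 3rd reward is 2.0:
print(incremental_update(0.5, 2.0, 1.0 / 3))  # 1.0
```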
Exercise 2.5 Give pseudocode for a complete algorithm for the $n$-armed bandit problem. Use greedy action selection and incremental computation of action values with the $\alpha = \frac{1}{k}$ step-size parameter. Assume a function bandit(a) that takes an action and returns a reward. Use arrays and variables; do not subscript anything by the time index $t$. Indicate how the action values are initialized and updated after each reward.