Reinforcement learning systems must be capable of generalization if they are to be applicable to artificial intelligence or to large engineering applications in general. To achieve this, any of a broad range of existing methods for supervised-learning function approximation can be used simply by treating each backup as a training example. Gradient-descent methods, in particular, allow a natural extension to function approximation of all the techniques developed in previous chapters, including eligibility traces. Linear gradient-descent methods are particularly appealing theoretically and work well in practice when provided with appropriate features. Choosing the features is one of the most important ways of adding prior domain knowledge to reinforcement learning systems. Linear methods include radial basis functions, tile coding (CMAC), and Kanerva coding. Backpropagation methods for multi-layer neural networks are instances of nonlinear gradient-descent function approximators.
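To make the linear gradient-descent case concrete, here is a minimal sketch of TD(λ) prediction with a linear value function and accumulating eligibility traces, in which each backup is treated as a supervised training example. The environment interface (reset, step, sample_action) and the features function are hypothetical stand-ins chosen for this sketch, not anything specified in the text.

import numpy as np

def linear_td_lambda(env, features, n_features, alpha=0.01, gamma=0.99,
                     lam=0.9, n_episodes=100):
    # theta parameterizes the approximate value function v(s) = theta . features(s)
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        state = env.reset()
        z = np.zeros(n_features)               # eligibility trace vector
        done = False
        while not done:
            action = env.sample_action(state)  # fixed policy under evaluation
            next_state, reward, done = env.step(action)
            x = features(state)
            x_next = np.zeros(n_features) if done else features(next_state)
            # The TD target reward + gamma * v(next_state) plays the role
            # of the supervised-learning label for the current state.
            td_error = reward + gamma * theta.dot(x_next) - theta.dot(x)
            z = gamma * lam * z + x            # accumulating traces
            theta += alpha * td_error * z      # gradient-descent weight update
            state = next_state
    return theta

With linear features the gradient of the approximate value with respect to the weights is just the feature vector, which is why the trace update simply adds features(state); tile coding or radial basis functions would plug in as the features function.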
For the most part, the extension of reinforcement learning prediction and control
methods to gradient-descent forms is straightforward. However, there is an
interesting interaction between function approximation, bootstrapping, and the
on-policy/off-policy distinction. Bootstrapping methods, such as DP and
TD(λ) for λ < 1, work reliably in conjunction with function approximation over a
narrower range of conditions than non-bootstrapping methods. As the control
case has not yet yielded to theoretical analysis, research has focused on the
value prediction problem. In this case, on-policy bootstrapping methods
converge reliably with linear gradient-descent function approximators to a
solution with mean square error bounded by (1 − γλ)/(1 − γ) times
the minimum possible error. Off-policy methods, on the other hand, may diverge to
infinite error. Several approaches to making off-policy
bootstrapping methods work with function approximation have been explored, but this is still an
open research issue. Bootstrapping methods are of persistent interest in
reinforcement learning, despite their limited theoretical guarantees, because in
practice they usually work significantly better than non-bootstrapping methods.
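As a rough numerical illustration of the quality bound stated above, the factor (1 − γλ)/(1 − γ) can be evaluated directly; the snippet below is illustrative arithmetic only, with γ = 0.99 chosen arbitrarily.

def td_error_bound(gamma, lam):
    # Factor by which the asymptotic error of on-policy linear TD(lambda)
    # can exceed the minimum possible error.
    return (1 - gamma * lam) / (1 - gamma)

for lam in (0.0, 0.5, 0.9, 1.0):
    print(lam, td_error_bound(0.99, lam))   # 100.0, 50.5, about 10.9, 1.0

Pure bootstrapping (λ = 0) allows up to 1/(1 − γ) times the minimum error, and the factor shrinks to 1 as λ approaches 1, yet in practice bootstrapping usually performs significantly better, which is the point of the closing remark above.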