Footnotes

If this were a control problem with the objective of minimizing travel time, then we would of course make each reward the negative of the elapsed time. But since we are concerned here only with prediction (policy evaluation), we can keep things simple by using the elapsed times themselves, which are positive numbers, as the rewards.
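
As a rough illustration of this point, the sketch below computes undiscounted returns for one trip under both sign conventions. The leg times and the helper name `returns_to_go` are hypothetical, chosen only for the illustration. With positive rewards, each return is simply the remaining travel time; negating the rewards negates every return, so a controller that maximizes return would then be minimizing travel time.

```python
def returns_to_go(rewards):
    """Undiscounted return from each step: the sum of rewards from that step onward."""
    out = []
    total = 0
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


# Hypothetical elapsed minutes on each leg of a single trip (illustrative values only).
leg_minutes = [5, 15, 10, 10, 3]

# Prediction convention: reward = elapsed time, so each return is the time still to go.
prediction_values = returns_to_go(leg_minutes)

# Control convention: reward = -elapsed time, so maximizing return minimizes travel time.
control_values = returns_to_go([-m for m in leg_minutes])

print(prediction_values)
print(control_values)
```

The two lists differ only in sign, which is why the choice does not matter for policy evaluation.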

Richard Sutton
Fri May 30 13:53:05 EDT 1997