We use the terms agent, environment, and action instead of the engineers' terms controller, controlled system (or plant), and control signal because they are meaningful to a wider audience.

We restrict attention to discrete time to keep things as simple as possible, even though many of the ideas can be extended to the continuous-time case (e.g., see Bertsekas and Tsitsiklis, 1996; Werbos, 1992; Doya, 1996).

We use r_{t+1} to denote the immediate reward for an action taken at time step t, instead of the more common r_t, because it emphasizes that the next reward and the next state are jointly determined.

Better places for imparting this kind of prior knowledge are the initial policy or value function, or in influences on these. For example, see Lin (1993), Maclin and Shavlik (1994), and Clouse (1996).

Episodes are often called "trials" in the literature.

Ways to formulate tasks that are both continual and undiscounted are subjects of current research (e.g., Mahadevan, 1996; Schwartz, 1993; Tadepalli and Ok, 1994). Some of the ideas are discussed in Section 6.7.

Richard Sutton
Sat May 31 13:56:52 EDT 1997