In the preceding section we described two kinds of reinforcement learning tasks, one in which the agent-environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), and one in which it does not (continual tasks). The former case is mathematically easier because each action affects only the finite number of rewards subsequently received during the episode. In this book we consider sometimes one kind of problem and sometimes the other, but often both. It is therefore useful to establish one notation that enables us to talk precisely about both cases simultaneously.
To be precise about episodic tasks requires some additional notation.
Rather than one long sequence of time steps, we need to consider a series of
episodes, each of which consists of a finite sequence of time steps.
We number the time steps of each episode starting anew from zero.
Therefore, we have to refer
not just to $s_t$, the state representation at time $t$, but to $s_{t,i}$, the
state representation at time $t$ of episode $i$ (and similarly for
$a_{t,i}$, $r_{t,i}$, $\pi_{t,i}$, $T_i$, etc.).
However, it turns out that when we discuss episodic tasks we will almost never
have to distinguish between different episodes. We will almost always be
considering a particular single episode, or stating something that is true for
all episodes. Accordingly, in practice we will almost always abuse notation
slightly by dropping the explicit reference to episode number. That is, we
will simply write $s_t$ to refer to $s_{t,i}$, and so on.
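The same abuse of notation shows up naturally in code: experience is typically stored per episode, but within any single episode the episode index can be dropped. Here is a minimal sketch of that convention; the variable names `episodes` and `states` are illustrative assumptions, not notation from the text.

```python
# episodes[i][t] plays the role of s_{t,i}: the state at time t of episode i.
episodes = [
    ["s0", "s1", "s2"],   # episode 0
    ["s0", "s2"],         # episode 1
]

# When reasoning about one particular episode, the episode index is dropped
# and states[t] plays the role of s_t.
states = episodes[0]
print(states[1])          # the state at time t = 1 of that episode
```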
We need one other convention to obtain a single notation that covers both episodic and continual tasks. We have defined the return as a sum over a finite number of terms in one case (3.1) and as a sum over an infinite number of terms in the other (3.2). These can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and which generates only rewards of zero. For example, consider the state transition diagram:
Here the solid square represents the special absorbing state corresponding to
the end of an episode. Starting from $s_0$, we get the reward sequence
$+1, +1, +1, 0, 0, 0, \ldots$. Summing these up, we get the same return whether we sum
over the first $T$ rewards (here $T = 3$) or over the full infinite
sequence. This remains true even if we introduce discounting. Thus, we can
define the return, in general, as
$$R_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1},$$
including the possibility that either $T = \infty$ or $\gamma = 1$ (but not
both), and using the convention of
omitting episode numbers when they are not needed. We use these conventions
throughout the rest of the book to simplify notation and to express the close
parallels between episodic and continual tasks.
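To make the equivalence concrete, the following is a minimal Python sketch (not from the text; the function names `episodic_return` and `padded_return` and the fixed `horizon` are illustrative assumptions). It computes $R_t$ once over the episode's $T$ rewards and once over the same sequence extended with the zero rewards generated by the absorbing state; the two agree both with and without discounting.

```python
def episodic_return(rewards, t=0, gamma=1.0):
    """R_t = sum_{k=0}^{T-t-1} gamma^k * r_{t+k+1}, with rewards = [r_1, ..., r_T]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def padded_return(rewards, t=0, gamma=1.0, horizon=1000):
    """The same sum, taken over the episode's rewards followed by the zero
    rewards of the absorbing state (truncated at `horizon` terms purely so
    the illustration stays finite)."""
    padded = list(rewards) + [0.0] * (horizon - len(rewards))
    return sum(gamma ** k * r for k, r in enumerate(padded[t:]))

# The reward sequence of the example above: three rewards of +1 (T = 3),
# after which the absorbing state generates only zeros.
rewards = [1.0, 1.0, 1.0]

for gamma in (1.0, 0.9):
    print(gamma, episodic_return(rewards, gamma=gamma),
          padded_return(rewards, gamma=gamma))
    # With gamma = 1 both sums are 3.0; with gamma = 0.9 both are
    # 1 + 0.9 + 0.81 = 2.71 (up to floating point), since the absorbing
    # state contributes only zero terms.
```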