A reinforcement learning task that satisfies the Markov
property is called a *Markov decision process*, or *MDP*. If the
state and action spaces are finite, then it is called a *finite Markov
decision process (finite MDP)*. Finite MDPs are particularly important to the
theory of reinforcement learning. We treat them extensively throughout this
book; they are all you need to understand 90% of modern reinforcement
learning.

A particular finite MDP is defined by its state and action sets and by the one-step
dynamics of the environment. Given any state and action, and
, the probability of each possible next state, , is

These quantities are called

These quantities, and , completely specify the most important aspects of the dynamics of a finite MDP (only information about the distribution of rewards around the expected value is lost). Most of the theory we present in the rest of this book implicitly assumes the environment is a finite MDP.

The agent makes its decisions solely as a function of the energy level of the
battery. It can distinguish two levels, `high` and `low`, so that the state set is . Let us
call the possible decisions--the agent's actions--`wait`, `search`, and
`recharge`. When the energy level is `high`, recharging would always be
foolish, so we do not include it in the action set for this state. The agent's
action sets are

If the energy level is `high`, then a period of active search
can always be completed without risk of depleting the battery. A period of searching
that begins with a `high` energy level leaves the energy level `high`
with probability and reduces it to `low` with probability .
On the other hand, a period of searching undertaken when the energy level is `low` leaves it `low` with probability and depletes the battery with
probability
. In the latter case, the robot must be rescued, and the battery is then
recharged back to `high`. Each can collected by the robot counts as a unit reward,
whereas a reward of results whenever the robot has to be rescued. Let
and
, with , respectively denote the expected
number of cans the robot will collect (and hence the expected reward) while
searching and while waiting. Finally, to keep things simple, suppose that no cans
can be collected during a run home for recharging, and that no cans can be collected
on a step in which the battery is depleted. This system is then a finite MDP, and
we can write down the transition probabilities and the expected rewards, as in
Table 3.1.

A *transition graph* is a useful way to summarize the dynamics of a
finite MDP.
Figure
3.3 shows the transition graph for the recycling robot
example. There are two kinds of nodes: *state nodes* and *action nodes*. There is a state node for each possible state (a large
open circle labeled by the name of the state), and an action node for each
state-action pair (a small solid circle labeled
by the name of the action and connected by a line to the state node). Starting in
state
and taking action moves you along the line from state node to
action node . Then the environment responds with a transition to the
next state's node via one of the arrows leaving action node . Each arrow
corresponds to a triple , where is the next state, and we label
the arrow with the transition probability, , and the expected
reward for that transition,
. Note that the transition probabilities labeling the
arrows leaving an action node always sum to 1.