Almost all reinforcement learning algorithms are based on estimating value functions---functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of ``how good'' here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course, the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies.
Recall that a policy, $\pi$, is a mapping from states, $s \in \mathcal{S}$, and actions, $a \in \mathcal{A}(s)$, to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. Informally, the value of a state $s$ under a policy $\pi$, denoted $V^\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $V^\pi(s)$ formally as
$$
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Biggl\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Bigg|\; s_t = s \Biggr\},
$$
where $E_\pi\{\cdot\}$ denotes the expected value given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $V^\pi$ the state-value function for policy $\pi$.
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $Q^\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:
$$
Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = E_\pi\Biggl\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Bigg|\; s_t = s, a_t = a \Biggr\}.
$$
We call $Q^\pi$ the action-value function for policy $\pi$.
The value functions $V^\pi$ and $Q^\pi$ can be estimated from experience. For example, if an agent follows policy $\pi$ and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state's value, $V^\pi(s)$, as the number of times that state is encountered approaches infinity. If separate averages are kept for each action taken in a state, then these averages will similarly converge to the action values, $Q^\pi(s,a)$. We call estimation methods of this kind Monte Carlo methods because they involve averaging over many random samples of actual returns. These kinds of methods are presented in Chapter 5. Of course, if there are very many states, then it may not be practical to keep separate averages for each state individually. Instead, the agent would have to maintain $V^\pi$ and $Q^\pi$ as parameterized functions and adjust the parameters to better match the observed returns. This can also produce accurate estimates, although much depends on the nature of the parameterized function approximation (Chapter 8).
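As a concrete illustration of this kind of averaging, the following is a minimal sketch of every-visit Monte Carlo estimation of $V^\pi$. The environment interface (`env.reset`, `env.step`) and the `policy` function are assumptions made for the example, not anything defined in this chapter.

```python
from collections import defaultdict

def mc_state_values(env, policy, num_episodes, gamma):
    """Estimate V^pi by averaging the returns observed after each state.

    Assumes a hypothetical episodic `env` with reset() and step(a) methods
    (step returning next_state, reward, done) and a `policy(s)` function
    returning an action; both interfaces are illustrative only.
    """
    returns_sum = defaultdict(float)   # total return observed after each state
    returns_count = defaultdict(int)   # number of returns averaged per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one episode following the policy.
        episode = []                   # list of (state, reward) pairs
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next

        # Work backward, accumulating the discounted return G after each state.
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            returns_sum[s] += G
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]

    return V
```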
A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy particular recursive relationships. For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:
$$
\begin{aligned}
V^\pi(s) &= E_\pi\{ R_t \mid s_t = s \} \\
&= E_\pi\Biggl\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Bigg|\; s_t = s \Biggr\} \\
&= E_\pi\Biggl\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \;\Bigg|\; s_t = s \Biggr\} \\
&= \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'} \Biggl[ \mathcal{R}^a_{ss'} + \gamma\, E_\pi\Biggl\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \;\Bigg|\; s_{t+1} = s' \Biggr\} \Biggr] \\
&= \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'} \Bigl[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \Bigr], \qquad\qquad (3.10)
\end{aligned}
$$
where it is implicit that the actions, $a$, are taken from the set $\mathcal{A}(s)$, and the next states, $s'$, are taken from the set $\mathcal{S}$, or from $\mathcal{S}^+$ in the case of an episodic problem. The last equation is the Bellman equation for $V^\pi$. It expresses a relationship between the value of
a state and the values of its successor states. Think of looking ahead from
one state to its possible successor states, as suggested by
Figure 3.4a. Each open circle represents a state and
each solid circle represents a state-action pair. Starting from state
s, the root node at the top, the agent could take any of some set of
actions---three are shown in Figure 3.4a. From each
of these, the environment could respond with one of several next states, $s'$, along with a reward, r. The Bellman equation (3.10)
simply averages over all the possibilities, weighting each by its probability
of occurring. It states that the value of the start state s must equal the
(discounted) value of the expected next state, plus the reward expected along
the way.
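To make the averaging in (3.10) concrete, here is a minimal sketch of one sweep of the Bellman backup applied to every state. The tabular data structures (`pi[s]` as a dictionary of action probabilities, `P[s][a]` as a list of `(prob, next_state, reward)` triples) are assumptions made for the example.

```python
def bellman_backup_sweep(V, pi, P, gamma):
    """Apply the right-hand side of the Bellman equation (3.10) once to every state.

    V     : dict mapping state -> current value estimate
    pi    : dict mapping state -> {action: probability} under the policy
    P     : dict mapping state -> {action: list of (prob, next_state, reward)}
            (an assumed tabular representation of the MDP dynamics)
    gamma : discount rate
    """
    new_V = {}
    for s in V:
        total = 0.0
        for a, action_prob in pi[s].items():
            for prob, s_next, r in P[s][a]:
                # Weight each possible outcome by its probability of occurring,
                # adding the immediate reward to the discounted value of s_next.
                total += action_prob * prob * (r + gamma * V[s_next])
        new_V[s] = total
    return new_V
```

Repeating such sweeps until the values stop changing is one of the ways of computing $V^\pi$ discussed in subsequent chapters; another is to solve the linear system directly, as in the gridworld example below.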
The value function $V^\pi$ is the unique solution to its Bellman equation. We show in subsequent chapters how this Bellman equation forms the basis of a number of ways to compute, approximate, and learn $V^\pi$. We call diagrams like those shown in
Figure 3.4 backup diagrams because they diagram
relationships that form the basis of the update or backup operations that are
at the heart of all reinforcement learning methods. These operations transfer value
information back to a state (or a state-action pair) from its successor states
(or state-action pairs). We use backup diagrams throughout the book to provide
graphical summaries of the algorithms we discuss. (Note that unlike state-transition
diagrams, the state nodes of backup diagrams do not necessarily represent distinct
states; for example, a state might be its own successor. We also omit explicit
arrowheads because time always flows downward in a backup diagram.)
Figure 3.4: Backup diagrams for (a) $V^\pi$ and (b) $Q^\pi$. The top node is the root node. Below it are all possible actions and states that might occur.
Example: Gridworld.
Figure 3.5a uses a rectangular grid to illustrate value functions for a
simple finite MDP. The cells of the grid correspond to the
states of the environment. At each cell, four actions are
possible: north, south, east, and west, which
deterministically cause the agent to move one cell in the respective direction
in the grid. Actions that would take the agent off the grid leave its
location unchanged, but also result in a reward of -1. Other actions result
in a reward of 0, except those that move the agent out of the special states A
and B. From state A, all four actions yield a reward of +10 and take the
agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.
Figure 3.5: Gridworld example: (a) exceptional reward dynamics; (b) state-value function for the equiprobable random policy.
Suppose the agent selects all four actions with equal probability in all states. Figure 3.5b shows the value function, $V^\pi$, for this policy, for the discounted-reward case with $\gamma = 0.9$. This value function was computed by solving the system of equations (3.10). Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. Notice that A is the best state to be in under this policy, but that its expected return is less than 10, its immediate reward, because from A the agent is taken to A', from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B', which has a positive value. From B' the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.
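Because the example specifies its dynamics completely, the computation can be reproduced directly: the sketch below builds the linear system implied by (3.10) for the equiprobable random policy and solves it with NumPy. The coordinates of A, B, A', and B' are assumptions read off Figure 3.5a (A and B in the top row, A' at the bottom of A's column, B' in the middle of B's column).

```python
import numpy as np

N = 5                      # 5x5 grid
GAMMA = 0.9
# Coordinates (row, col) as assumed from Figure 3.5a.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

def step(state, action):
    """Deterministic gridworld dynamics described in the example."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return state, -1.0          # off-grid moves leave the state unchanged

# Build the linear system V = R + gamma * P V for the equiprobable policy,
# i.e. (I - gamma * P) V = R, then solve it.
idx = {(r, c): r * N + c for r in range(N) for c in range(N)}
P = np.zeros((N * N, N * N))
R = np.zeros(N * N)
for s, i in idx.items():
    for a in ACTIONS:
        s_next, reward = step(s, a)
        P[i, idx[s_next]] += 0.25
        R[i] += 0.25 * reward

V = np.linalg.solve(np.eye(N * N) - GAMMA * P, R)
print(np.round(V.reshape(N, N), 1))   # should reproduce the values in Figure 3.5b
```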
Figure 3.6: A golf example: the state-value function for putting (above) and the
optimal action-value function for using the driver (below).
Example: Golf.
To formulate playing a hole of golf as a reinforcement learning task, we
count a penalty (negative reward) of -1 for each stroke until we hit the ball
into the hole. The state is the location of the ball. The value of a state
is the negative of the number of strokes to the hole from that location. Our
actions are how we aim and swing at the ball, of course, and which club we
select. Let us take the former as given and consider just the choice of club,
which we assume is either a putter or a driver. The upper part of
Figure 3.6 shows a possible state-value function, $V_{\mathrm{putt}}(s)$, for the policy that always uses the putter. The terminal state
in-the-hole has a value of 0. From anywhere on the green we assume we
can make a putt; these states have value -1. Off the green we cannot reach
the hole by putting and the value is greater. If we can reach the green from
a state by putting, then that state must have value one less than the green's
value, i.e., -2. For simplicity, let us assume we can putt very precisely
and deterministically, but with a limited range. This gives us the sharp
contour line labeled -2 in the figure; all locations between that line and
the green require exactly two strokes to complete the hole. Similarly,
any location within putting range of the -2 contour line must have a value
of -3, and so on to get all the contour lines shown in the figure. Putting
doesn't get us out of sand traps, so they have a value of $-\infty$. Overall, it takes us 6 strokes to get from the tee to the hole by putting.
Exercise: What is the Bellman equation for action values, that is, for $Q^\pi$? It must give the action value $Q^\pi(s,a)$ in terms of the action values, $Q^\pi(s',a')$, of possible successors to the state-action pair $(s,a)$. As a hint, the backup diagram corresponding to this equation is given in Figure 3.4b. Show the sequence of equations analogous to (3.10), but for action values.
Exercise: The Bellman equation (3.10) must hold for each state for the value function shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at $+0.7$, with respect to its four neighboring states, valued at $+2.3$, $+0.4$, $-0.4$, and $+0.7$. (These numbers are accurate only to one decimal place.)
Exercise: In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant C to all the rewards simply adds a constant, K, to the values of all states, and thus does not affect the relative value of any policies. What is K in terms of C and $\gamma$?
Exercise: Now consider adding a constant C to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged, as in the infinite-horizon case above? Why or why not? Give an example.
Exercise: The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, $V^\pi(s)$, in terms of the value at the expected leaf node, $Q^\pi(s,a)$, given that $s_t = s$. This expectation depends on the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(s,a)$, such that no expected value notation appears in the equation.
Exercise: The value of an action, $Q^\pi(s,a)$, can be divided into two parts: the expected next reward, which does not depend on the policy $\pi$, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching based on the next state:

Give the equation corresponding to this intuition and diagram for the action value, $Q^\pi(s,a)$, in terms of the expected next reward, $r_{t+1}$, and the expected next state value, $V^\pi(s_{t+1})$, given that $s_t = s$ and $a_t = a$. Then give a second equation, writing out the expected value explicitly in terms of $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_{ss'}$, defined respectively by (3.6) and (3.7), such that no expected value notation appears in the equation.