Almost all reinforcement learning algorithms are based on estimating value functions---functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of ``how good'' here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies.
Recall that a policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. Informally, the value of a state $s$ under a policy $\pi$, denoted $V^\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $V^\pi(s)$ formally as
$$
V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Bigg\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Bigg|\, s_t = s \Bigg\},
$$
where $E_\pi\{\cdot\}$ denotes the expected value given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $V^\pi$ the state-value function for policy $\pi$.
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $Q^\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:
$$
Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\Bigg\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Bigg|\, s_t = s, a_t = a \Bigg\}.
$$
We call $Q^\pi$ the action-value function for policy $\pi$.
The value functions $V^\pi$ and $Q^\pi$ can be estimated from experience. For example, if an agent follows policy $\pi$ and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state's value, $V^\pi(s)$, as the number of times that state is encountered approaches infinity. If separate averages are kept for each action taken in a state, then these averages will similarly converge to the action values, $Q^\pi(s,a)$. We call estimation methods of this kind Monte Carlo methods because they involve averaging over many random samples of actual returns. These kinds of methods are presented in Chapter 5. Of course, if there are very many states, then it may not be practical to keep separate averages for each state individually. Instead, the agent would have to maintain $V^\pi$ and $Q^\pi$ as parameterized functions and adjust the parameters to better match the observed returns. This can also produce accurate estimates, although much depends on the nature of the parameterized function approximation (Chapter 8).
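To make the Monte Carlo idea concrete, here is a minimal Python sketch that averages the sampled returns following every state visited. The run_episode function and the (state, reward) episode format are assumptions made for this example only, standing in for whatever environment and policy the agent actually follows; the proper treatment of such methods is deferred to Chapter 5.
\begin{verbatim}
from collections import defaultdict

def mc_state_values(run_episode, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo estimate of V^pi from sampled episodes.

    run_episode() is assumed to return one episode generated by following
    pi, as a list of (state, reward) pairs [(s_0, r_1), (s_1, r_2), ...].
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = run_episode()
        G = 0.0
        # Work backwards so the return following each state is built up
        # in a single pass: G <- r + gamma * G.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    # Each average of observed returns approximates V^pi(s); it converges
    # as the number of visits to s grows.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
\end{verbatim}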
A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy particular recursive relationships. For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the values of its possible successor states:
$$
\begin{aligned}
V^\pi(s) &= E_\pi\{R_t \mid s_t = s\} \\
&= E_\pi\Bigg\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Bigg|\, s_t = s \Bigg\} \\
&= E_\pi\Bigg\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Bigg|\, s_t = s \Bigg\} \\
&= \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'} \Bigg[ \mathcal{R}^a_{ss'} + \gamma\, E_\pi\Bigg\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Bigg|\, s_{t+1} = s' \Bigg\} \Bigg] \\
&= \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'} \Big[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \Big], \qquad (3.10)
\end{aligned}
$$
where it is implicit that the actions, $a$, are taken from the set $\mathcal{A}(s)$, and the next states, $s'$, are taken from the set $\mathcal{S}$, or from $\mathcal{S}^+$ in the case of an episodic problem. The last equation is the Bellman equation for $V^\pi$. It expresses a relationship between the value of a state and the values of its successor states. Think of looking ahead from one state to its possible successor states, as suggested by Figure 3.4a. Each open circle represents a state and each solid circle represents a state-action pair. Starting from state $s$, the root node at the top, the agent could take any of some set of actions---three are shown in Figure 3.4a. From each of these, the environment could respond with one of several next states, $s'$, along with a reward, $r$. The Bellman equation (3.10) simply averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state $s$ must equal the (discounted) value of the expected next state, plus the reward expected along the way.
The value function $V^\pi$ is the unique solution to its Bellman equation. We show in subsequent chapters how this Bellman equation forms the basis of a number of ways to compute, approximate, and learn $V^\pi$. We call diagrams like those shown in Figure 3.4 backup diagrams because they diagram relationships that form the basis of the update or backup operations that are at the heart of all reinforcement learning methods. These operations transfer value information back to a state (or a state-action pair) from its successor states (or state-action pairs). We use backup diagrams throughout the book to provide graphical summaries of the algorithms we discuss. (Note that unlike state-transition diagrams, the state nodes of backup diagrams do not necessarily represent distinct states; for example, a state might be its own successor. We also omit explicit arrowheads because time always flows downward in a backup diagram.)
Figure 3.4: Backup diagrams for a) $V^\pi$ and b) $Q^\pi$. The top node is the root node. Below it are all possible actions and states that might occur.
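The following Python fragment is one way to read the Bellman equation (3.10) as a computation at a single state, mirroring the backup diagram of Figure 3.4a. The nested dictionaries policy[s][a] and the (probability, next state, reward) triples in P[s][a] are data structures assumed for this sketch, not notation used in this chapter.
\begin{verbatim}
def bellman_backup(s, V, policy, P, gamma):
    """Right-hand side of the Bellman equation (3.10) evaluated at state s.

    policy[s][a] -- pi(s, a), the probability of taking a in s (assumed layout)
    P[s][a]      -- list of (probability, next_state, reward) triples standing
                    in for the dynamics P^a_{ss'} and expected rewards R^a_{ss'}
    V            -- current estimate of V^pi, mapping states to values
    """
    value = 0.0
    for a, pi_sa in policy[s].items():        # average over actions under pi...
        for prob, s_next, reward in P[s][a]:  # ...and over possible next states
            value += pi_sa * prob * (reward + gamma * V[s_next])
    return value
\end{verbatim}
Repeatedly replacing the stored value of each state with this quantity is one form of the backup operation described above.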
Example: Gridworld. Figure 3.5a uses a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction in the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of $-1$. Other actions result in a reward of $0$, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of $+10$ and take the agent to A$'$. From state B, all actions yield a reward of $+5$ and take the agent to B$'$.
Figure 3.5: Grid Example: a) exceptional reward dynamics; b) state-value
function for the equiprobable random policy.
Suppose the agent selects all four actions with equal probability in all states. Figure 3.5b shows the value function, $V^\pi$, for this policy, for the discounted-reward case with $\gamma = 0.9$. This value function was computed by solving the system of equations (3.10). Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. Notice that A is the best state to be in under this policy, but that its expected return is less than 10, its immediate reward, because from A the agent is taken to A$'$, from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B$'$, which has a positive value. From B$'$ the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.
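As a sketch of how that computation could be carried out, the NumPy fragment below builds the transition probabilities and expected rewards of the gridworld under the equiprobable random policy and solves the resulting linear system (3.10) directly. The $5\times5$ layout and the coordinates chosen for A, A$'$, B, and B$'$ are assumptions read off Figure 3.5a rather than specified in the text.
\begin{verbatim}
import numpy as np

N = 5                       # 5x5 grid; states indexed 0..24 in row-major order
GAMMA = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

# Special states; the (row, column) positions are assumed from Figure 3.5a.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)

def step(state, action):
    """Deterministic gridworld dynamics: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state
    d_row, d_col = action
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return (new_row, new_col), 0.0
    return state, -1.0      # off-grid moves leave the location unchanged

def idx(state):
    return state[0] * N + state[1]

# Under the equiprobable policy, (3.10) reads V = R + GAMMA * P V, a linear
# system we can solve as (I - GAMMA * P) V = R.
P = np.zeros((N * N, N * N))
R = np.zeros(N * N)
for row in range(N):
    for col in range(N):
        s = (row, col)
        for a in ACTIONS:
            s_next, reward = step(s, a)
            P[idx(s), idx(s_next)] += 0.25
            R[idx(s)] += 0.25 * reward

V = np.linalg.solve(np.eye(N * N) - GAMMA * P, R)
print(np.round(V.reshape(N, N), 1))  # should reproduce the values of Figure 3.5b
\end{verbatim}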
Figure 3.6: A golf example: the state-value function for putting (above) and the
optimal action-value function for using the driver (below).
Example: Golf. To formulate playing a hole of golf as a reinforcement learning task, we count a penalty (negative reward) of $-1$ for each stroke until we hit the ball into the hole. The state is the location of the ball. The value of a state is the negative of the number of strokes to the hole from that location. Our actions are how we aim and swing at the ball, of course, and which club we select. Let us take the former as given and consider just the choice of club, which we assume is either a putter or a driver. The upper part of Figure 3.6 shows a possible state-value function, $V_{\mathrm{putt}}(s)$, for the policy that always uses the putter. The terminal state in-the-hole has a value of $0$. From anywhere on the green we assume we can make a putt; these states have value $-1$. Off the green we cannot reach the hole by putting and the value is lower. If we can reach the green from a state by putting, then that state must have value one less than the green's value, i.e., $-2$. For simplicity, let us assume we can putt very precisely and deterministically, but with a limited range. This gives us the sharp contour line labeled $-2$ in the figure; all locations between that line and the green require exactly two strokes to complete the hole. Similarly, any location within putting range of the $-2$ contour line must have a value of $-3$, and so on to get all the contour lines shown in the figure. Putting doesn't get us out of sand traps, so they have a value of $-\infty$. Overall, it takes us six strokes to get from the tee to the hole by putting.
Exercise .
What is the Bellman equation for action values, that is, for $Q^\pi$? It must give the action value $Q^\pi(s,a)$ in terms of the action values, $Q^\pi(s',a')$, of possible successors to the state-action pair $(s,a)$. As a hint, the backup diagram corresponding to this equation is given in Figure 3.4b. Show the sequence of equations analogous to (3.10), but for action values.
Exercise .
The Bellman equation (3.10) must hold for each state for the value function $V^\pi$ shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at $+0.7$, with respect to its four neighboring states, valued at $+2.3$, $+0.4$, $-0.4$, and $+0.7$. (These numbers are accurate only to one decimal place.)
Exercise .
In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant $C$ to all the rewards simply adds a constant, $K$, to the values of all states, and thus does not affect the relative values of any policies. What is $K$ in terms of $C$ and $\gamma$?
Exercise .
Now consider adding a constant $C$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the infinite-horizon case above? Why or why not? Give an example.
Exercise .
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
Give the equation corresponding to this intuition and diagram for the value at the root node, $V^\pi(s)$, in terms of the value at the expected leaf node, $Q^\pi(s,a)$, given $s_t = s$. This expectation depends on the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(s,a)$, such that no expected value notation appears in the equation.
Exercise .
The value of an action, $Q^\pi(s,a)$, can be divided into two parts: the expected next reward, which does not depend on the policy $\pi$, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching based on the next state:
Give the equation corresponding to this intuition and diagram for the action value, $Q^\pi(s,a)$, in terms of the expected next reward, $r_{t+1}$, and the expected next state value, $V^\pi(s_{t+1})$, given that $s_t = s$ and $a_t = a$. Then give a second equation, writing out the expected value explicitly in terms of $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_{ss'}$, defined respectively by (3.6) and (3.7), such that no expected value notation appears in the equation.