So far we have been imprecise regarding the objective of learning. We
have said that the agent's goal is to maximize the reward it receives in the long
run. How might this be formally defined? If the sequence of rewards received after
time step $t$ is denoted $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$, then what precise aspect of this
sequence do we wish to maximize? In general, we seek to maximize the expected
return, where the return, $R_t$, is defined as some specific function of the reward
sequence. In the simplest case the return is the sum of the rewards:

$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T, \tag{3.1}$$

where $T$ is a final time step. This formulation makes sense when the
agent-environment interaction breaks naturally into subsequences, which we call
episodes, each of which ends in a special terminal state.
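As a quick numerical illustration (the numbers here are hypothetical, not from the
text), suppose an episode terminates at $T = t + 3$ with rewards $r_{t+1} = 1$,
$r_{t+2} = 0$, and $r_{t+3} = 4$; equation (3.1) then gives the return
$R_t = 1 + 0 + 4 = 5$.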
On the other hand, in many cases the agent-environment interaction
does not break naturally into identifiable episodes, but goes on continually
without limit. For example, this would be the natural way to formulate a
continual process-control task, or an application to a robot with a long
life span. We call these continuing tasks.
The return formulation (3.1) is problematic for continuing tasks
because the final time step would be $T = \infty$, and the return, which is what we are
trying to maximize, could itself easily be infinite. (For example, suppose the agent
receives a reward of $+1$ at each time step.) Thus, in this book we usually
use a definition of return that is slightly more complex conceptually but much
simpler mathematically.
The additional concept that we need is that of discounting.
According to this approach, the agent tries to select actions so that the
sum of the discounted rewards it receives over the future is maximized. In
particular, it chooses $a_t$ to maximize the expected discounted
return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \tag{3.2}$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.
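As a concrete computational sketch of equation (3.2), the Python function below
evaluates a discounted return for a finite list of observed rewards. It is my own
illustration, not from the book: the function name is invented, and it truncates
the infinite sum at the end of the list.

def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k], where rewards[k] plays the role
    of r_{t+k+1}; with gamma = 1 this reduces to the simple sum (3.1)."""
    g = 0.0
    # Accumulate backwards using the recursion R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0] * 10, gamma=0.9))   # 6.5132...
print(discounted_return([1.0] * 10, gamma=1.0))   # 10.0 (undiscounted sum)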
The discount rate determines the present value of future rewards:
a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times
what it would be worth if it were received immediately. If $\gamma < 1$, the infinite
sum has a finite value as long as the reward sequence $\{r_k\}$ is bounded. If
$\gamma = 0$, the agent is "myopic" in being concerned only with maximizing
immediate rewards: its objective in this case is to learn how to choose $a_t$ so
as to maximize only $r_{t+1}$. If each of the agent's actions happened to
influence only the immediate reward, not future rewards as well, then a myopic
agent could maximize (3.2) by separately maximizing each immediate
reward. But in general, acting to maximize immediate reward can reduce access to
future rewards so that the return may actually be reduced. As $\gamma$ approaches
1, the objective takes future rewards into account more strongly: the agent
becomes more farsighted.
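As a worked check (not in the text) of the claim that $\gamma < 1$ keeps the return
finite, suppose the reward is a constant $+1$ at every step, as in the earlier
example. Then

$$R_t = \sum_{k=0}^{\infty} \gamma^k \cdot 1 = \frac{1}{1-\gamma},$$

which is finite for any $\gamma < 1$; with $\gamma = 0.9$, for instance, $R_t = 10$,
whereas the undiscounted sum diverges.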
Example 3.4: Pole-Balancing. Figure 3.2 shows a task that served as an early
illustration of reinforcement learning. The objective here is to apply forces to a
cart moving along a track so as to keep a pole hinged to the cart from falling over.
A failure is said to occur if the pole falls past a given angle from vertical or if
the cart runs off the track. The pole is reset to vertical after each failure. This
task could be treated as episodic, where the natural episodes are the repeated
attempts to balance the pole. The reward in this case could be $+1$ for every time
step on which failure did not occur, so that the return at each time would be the
number of steps until failure; successful balancing forever would mean an infinite
return. Alternatively, we could treat pole-balancing as a continuing task, using
discounting. In this case the reward would be $-1$ on each failure and zero at all
other times. The return at each time would then be related to $-\gamma^K$, where $K$
is the number of time steps before failure. In either case, the return is maximized
by keeping the pole balanced for as long as possible.
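To compare the two reward schemes numerically, here is a small Python sketch (my own
illustration, not from the book; the values of K and gamma are arbitrary) that
computes both returns for an attempt that fails after K steps of successful
balancing:

def pole_balancing_returns(K, gamma):
    """Returns for an attempt that fails after K successful balancing steps."""
    # Episodic formulation: reward +1 on each of the K non-failure steps,
    # so the return is simply the number of steps until failure.
    episodic_return = float(K)
    # Continuing, discounted formulation: reward -1 at the failure step and 0
    # before it, so R_t = sum_k gamma^k * r_{t+k+1} = -gamma^K.
    discounted_return = -(gamma ** K)
    return episodic_return, discounted_return

print(pole_balancing_returns(K=10, gamma=0.9))   # (10.0, -0.3486...)
print(pole_balancing_returns(K=50, gamma=0.9))   # (50.0, -0.0051...)

Both returns grow as K grows, so under either formulation the agent is rewarded for
keeping the pole balanced longer.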
Exercise 3.4 Suppose you treated pole-balancing as an episodic task but also used
discounting, with all rewards zero except for $-1$ upon failure. What then would the
return be at each time? How does this return differ from that in the discounted,
continuing formulation of this task?
Exercise 3.5 Imagine that you are designing a robot to run a maze. You decide to
give it a reward of $+1$ for escaping from the maze and a reward of zero at all
other times. The task seems to break down naturally into episodes (the successive
runs through the maze), so you decide to treat it as an episodic task, where the
goal is to maximize expected total reward (3.1). After running the learning agent
for a while, you find that it is showing no improvement in escaping from the maze.
What is going wrong? Have you effectively communicated to the agent what you want
it to achieve?