So far we have considered methods for estimating the value functions for a
policy given an infinite supply of episodes generated using that policy.
Suppose now that all we have are episodes generated from a different
policy. That is, suppose we wish to estimate $V^\pi$ or $Q^\pi$, but all we have
are episodes following $\pi'$, where $\pi' \neq \pi$. Can we learn the value function
for a policy given only experience "off" the policy?
Happily, in many cases we can. Of course, in order to use episodes from
$\pi'$ to estimate values for $\pi$, we require that every action taken under
$\pi$ is also taken, at least occasionally, under $\pi'$. That is, we require
that $\pi(s,a) > 0$ implies $\pi'(s,a) > 0$.
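As a quick concrete check, the following sketch (ours, not from the text) tests this coverage condition, assuming tabular policies stored as NumPy arrays of action probabilities indexed as `pi[s, a]`:

```python
import numpy as np

def has_coverage(pi, pi_prime):
    """Return True if pi'(s,a) > 0 whenever pi(s,a) > 0, for tabular
    policies given as (n_states, n_actions) arrays of action probabilities."""
    return bool(np.all((pi == 0) | (pi_prime > 0)))
```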
In the episodes generated using $\pi'$, consider the $i$th first visit to state $s$
and the complete sequence of states and actions
following that visit. Let $p_i(s)$ and $p'_i(s)$
denote the probabilities of that complete sequence happening given
policies $\pi$ and $\pi'$ and starting from $s$. Let $R_i(s)$
denote the corresponding observed return from state $s$. To average
these to obtain an unbiased estimate of $V^\pi(s)$, we need only weight each
return by its relative probability of occurring under $\pi$ and $\pi'$, that
is, by $p_i(s)/p'_i(s)$. The desired Monte Carlo estimate after observing $n_s$
returns from state $s$ is then

$$V(s) = \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)} R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}. \tag{5.3}$$
This equation involves the probabilities $p_i(s)$ and $p'_i(s)$,
which are normally considered unknown in applications of Monte Carlo methods.
Fortunately, here we need only their ratio, $p_i(s)/p'_i(s)$, which can be determined with no knowledge of the environment's dynamics. Let $T_i(s)$
be the time of termination of the $i$th episode involving state $s$. Then

$$p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}},$$

where $\mathcal{P}^{a_k}_{s_k s_{k+1}}$ is the environment's probability of transitioning from $s_k$ to $s_{k+1}$ under action $a_k$, and

$$\frac{p_i(s_t)}{p'_i(s_t)} = \frac{\prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}}{\prod_{k=t}^{T_i(s)-1} \pi'(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}.$$

Thus the weight needed on the return, $p_i(s)/p'_i(s)$, depends only on the two policies and not at all on the environment's dynamics.
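To make the procedure concrete, here is a minimal sketch (ours, not from the text) of first-visit Monte Carlo evaluation of $\pi$ from episodes generated by $\pi'$. It assumes tabular policies stored as arrays `pi[s, a]` and episodes given as lists of `(state, action, reward)` triples; each weight is computed as the product of probability ratios above, and the state values are formed as in (5.3).

```python
import numpy as np

def off_policy_mc_value(episodes, pi, pi_prime, gamma=1.0):
    """Weighted first-visit Monte Carlo estimate of V^pi from episodes
    generated by the behavior policy pi', as in equation (5.3).

    episodes : list of episodes, each a list of (state, action, reward)
               triples, where reward is received after taking the action
               and the episode ends at termination.
    pi, pi_prime : tabular policies as (n_states, n_actions) arrays of
                   action probabilities, with pi(s,a) > 0 only where
                   pi_prime(s,a) > 0.
    """
    n_states = pi.shape[0]
    num = np.zeros(n_states)  # sum over i of (p_i(s)/p'_i(s)) * R_i(s)
    den = np.zeros(n_states)  # sum over i of  p_i(s)/p'_i(s)

    for episode in episodes:
        T = len(episode)
        visited = set()
        for t, (s, _, _) in enumerate(episode):
            if s in visited:          # use only the first visit to s
                continue
            visited.add(s)
            # Observed return R_i(s): discounted sum of rewards from t on.
            G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))
            # Importance-sampling ratio: product of pi/pi' over the
            # actions taken from time t to the end of the episode.
            rho = 1.0
            for k in range(t, T):
                s_k, a_k, _ = episode[k]
                rho *= pi[s_k, a_k] / pi_prime[s_k, a_k]
            num[s] += rho * G
            den[s] += rho

    V = np.zeros(n_states)
    np.divide(num, den, out=V, where=den > 0)  # unvisited states stay at 0
    return V
```

For example, `off_policy_mc_value(episodes, pi, pi_prime, gamma=0.9)` returns an array of value estimates, one per state, leaving zeros for states that were never visited.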
Exercise 5.3 What is the Monte Carlo estimate analogous to (5.3) for action values, given returns generated using $\pi'$?