So far we have considered methods for estimating the value functions for a policy given an infinite supply of episodes generated using that policy. Suppose now that all we have are episodes generated from a different policy. That is, suppose we wish to estimate $V^\pi$ or $Q^\pi$, but all we have are episodes following another policy $\pi'$, where $\pi' \neq \pi$. Can we do it?
Happily, in many cases we can. Of course, in order to use episodes from $\pi'$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $\pi'$. That is, we require that $\pi(s,a) > 0$ imply $\pi'(s,a) > 0$. In the episodes generated using $\pi'$, consider the $i$th first visit to state $s$ and the complete sequence of states and actions following that visit. Let $p_i(s)$ and $p_i'(s)$ denote the probabilities of that complete sequence occurring under policies $\pi$ and $\pi'$, respectively, starting from $s$. Let $R_i(s)$ denote the corresponding observed return from state $s$. To average these returns into an unbiased estimate of $V^\pi(s)$, we need only weight each return by its relative probability of occurring under $\pi$ and $\pi'$, that is, by $p_i(s)/p_i'(s)$. The desired Monte Carlo estimate after observing $n_s$ returns from state $s$ is then

$$
V(s) = \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p_i'(s)}\, R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p_i'(s)}}. \tag{5.3}
$$
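To make the estimator concrete, here is a minimal Python sketch of (5.3). The function name and argument layout are illustrative only: it assumes the per-return weights $p_i(s)/p_i'(s)$ for a single state $s$ have already been computed (a way to compute them is sketched after the next paragraph).

```python
def off_policy_value_estimate(weights, returns):
    """Weighted average of returns from a single state s, as in (5.3).

    weights[i] -- the ratio p_i(s)/p'_i(s) for the i-th first visit to s
    returns[i] -- the observed return R_i(s) following that visit
    """
    assert weights and len(weights) == len(returns), "need at least one return"
    numerator = sum(w * r for w, r in zip(weights, returns))
    denominator = sum(weights)
    return numerator / denominator
```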
Equation (5.3) involves the probabilities $p_i(s)$ and $p_i'(s)$, which are normally considered unknown in applications of Monte Carlo methods. Fortunately, here we need only their ratio, $p_i(s)/p_i'(s)$, which can be determined with no knowledge of the environment's dynamics. Let $T_i(s)$ be the time of termination of the $i$th episode involving state $s$, and let $t$ be the time of the first visit to $s$ within that episode. Then, writing $\mathcal{P}^{a}_{ss'}$ for the probability of the transition from $s$ to $s'$ under action $a$,

$$
p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}
$$

and

$$
\frac{p_i(s_t)}{p_i'(s_t)}
= \frac{\prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}}
       {\prod_{k=t}^{T_i(s)-1} \pi'(s_k, a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}}
= \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)},
$$

because the transition probabilities appear identically in numerator and denominator and cancel. Thus the weight needed in (5.3), $p_i(s)/p_i'(s)$, depends only on the two policies and not at all on the environment's dynamics.
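Because the dynamics cancel, the weight for each first visit can be accumulated from the two policies alone. The sketch below assumes each policy is given as a function mapping a state-action pair to its selection probability, and that `episode_tail` holds the $(s_k, a_k)$ pairs from the visit at time $t$ through $T_i(s)-1$; these names are illustrative, not part of any particular library.

```python
def importance_weight(pi, pi_prime, episode_tail):
    """Compute p_i(s)/p'_i(s) as the product of pi(s_k, a_k) / pi'(s_k, a_k).

    pi, pi_prime -- functions mapping (state, action) to a selection probability
    episode_tail -- list of (s_k, a_k) pairs from the first visit to s
                    through time T_i(s) - 1
    """
    weight = 1.0
    for s_k, a_k in episode_tail:
        # The transition probabilities cancel, so only the policies appear.
        weight *= pi(s_k, a_k) / pi_prime(s_k, a_k)
    return weight
```

Feeding these weights, together with the corresponding observed returns, into the estimator sketched above reproduces (5.3).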
Exercise. What is the Monte Carlo estimate analogous to (5.3) for action values, given returns generated using $\pi'$?