So far we have considered methods for estimating the value functions for a policy given an infinite supply of episodes generated using that policy. Suppose now that all we have are episodes generated from a different policy. That is, suppose we wish to estimate $V^\pi$ or $Q^\pi$, but all we have are episodes following $\pi'$, where $\pi' \neq \pi$. Can we learn the value function for a policy given only experience "off" the policy?
Happily, in many cases we can. Of course, in order to use episodes from $\pi'$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $\pi'$. That is, we require that $\pi(s,a) > 0$ implies $\pi'(s,a) > 0$. In the episodes generated using $\pi'$, consider the $i$th first visit to state $s$ and the complete sequence of states and actions following that visit. Let $p_i(s)$ and $p'_i(s)$ denote the probabilities of that complete sequence happening given policies $\pi$ and $\pi'$ and starting from $s$. Let $R_i(s)$ denote the corresponding observed return from state $s$. To average these to obtain an asymptotically unbiased estimate of $V^\pi(s)$, we need only weight each return by its relative probability of occurring under $\pi$ and $\pi'$, that is, by $p_i(s)/p'_i(s)$. The desired Monte Carlo estimate after observing $n_s$ returns from state $s$ is then

$$V(s) = \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)} R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}. \tag{5.3}$$
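The weighted average described above, in which each observed return is weighted by the ratio of its probability under the target policy to its probability under the behavior policy and the result is normalized by the sum of the weights, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `weighted_is_estimate` and the sample numbers are hypothetical, and the ratios are assumed to have been computed already.

```python
from typing import Sequence


def weighted_is_estimate(returns: Sequence[float], ratios: Sequence[float]) -> float:
    """Average the returns, weighting each by its probability ratio.

    returns[i] is the return observed after the i-th first visit to a state;
    ratios[i] is the corresponding ratio of sequence probabilities
    (target policy over behavior policy).
    """
    if len(returns) != len(ratios):
        raise ValueError("returns and ratios must have the same length")
    total_weight = sum(ratios)
    if total_weight == 0.0:
        # Every observed sequence has probability zero under the target
        # policy, so the normalized estimate is undefined.
        raise ValueError("all weights are zero; estimate is undefined")
    weighted_sum = sum(w * r for w, r in zip(ratios, returns))
    return weighted_sum / total_weight


# Hypothetical data: three returns observed from one state under the
# behavior policy, with made-up probability ratios.
returns = [1.0, 0.0, 2.0]
ratios = [0.5, 2.0, 1.5]
estimate = weighted_is_estimate(returns, ratios)
```

Returns from sequences that the target policy would rarely produce get small weights, while a sequence the target policy could never produce gets weight zero and drops out of the average entirely.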
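Equation (5.3) involves the probabilities $p_i(s)$ and $p'_i(s)$, which are normally considered unknown in applications of Monte Carlo methods. Fortunately, here we need only their ratio, $p_i(s)/p'_i(s)$, which can be determined with no knowledge of the environment's dynamics. Let $T_i(s)$ be the time of termination of the $i$th episode involving state $s$. Then

$$p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k,a_k)\,\mathcal{P}^{a_k}_{s_k s_{k+1}}$$

and

$$\frac{p_i(s_t)}{p'_i(s_t)} = \frac{\prod_{k=t}^{T_i(s)-1} \pi(s_k,a_k)\,\mathcal{P}^{a_k}_{s_k s_{k+1}}}{\prod_{k=t}^{T_i(s)-1} \pi'(s_k,a_k)\,\mathcal{P}^{a_k}_{s_k s_{k+1}}} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k,a_k)}{\pi'(s_k,a_k)},$$

where $t$ is the time of the first visit (so that $s_t = s$) and $\mathcal{P}^{a_k}_{s_k s_{k+1}}$ is the probability of the transition from $s_k$ to $s_{k+1}$ under action $a_k$. The transition probabilities cancel, so the weight needed in (5.3) depends only on the two policies and not on the environment's dynamics.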
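Because the environment's transition probabilities appear identically in the numerator and the denominator, the ratio for one observed sequence reduces to a product of per-step policy probability ratios. The sketch below illustrates this under some assumptions of our own: policies are represented as dictionaries mapping (state, action) pairs to probabilities, and the names `Policy` and `probability_ratio`, along with the two-state example, are hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple

# Assumed representation: a policy maps (state, action) to the probability
# of taking that action in that state. Missing pairs count as probability 0.
Policy = Dict[Tuple[Hashable, Hashable], float]


def probability_ratio(
    trajectory: Sequence[Tuple[Hashable, Hashable]],
    target: Policy,
    behavior: Policy,
) -> float:
    """Product over the (state, action) pairs that follow a first visit of
    target probability divided by behavior probability.

    No transition probabilities are needed: they cancel in the ratio.
    """
    ratio = 1.0
    for state, action in trajectory:
        b = behavior.get((state, action), 0.0)
        if b == 0.0:
            # The behavior policy could not have generated this step, so the
            # trajectory is inconsistent with the stated behavior policy.
            raise ValueError(f"behavior policy never takes {action!r} in {state!r}")
        ratio *= target.get((state, action), 0.0) / b
    return ratio


# Hypothetical example: the target policy always moves right; the behavior
# policy chooses left or right with equal probability in states A and B.
target = {("A", "right"): 1.0, ("B", "right"): 1.0}
behavior = {
    ("A", "left"): 0.5, ("A", "right"): 0.5,
    ("B", "left"): 0.5, ("B", "right"): 0.5,
}

ratio_on_target = probability_ratio([("A", "right"), ("B", "right")], target, behavior)
ratio_off_target = probability_ratio([("A", "left")], target, behavior)
```

In this example the first sequence is twice as likely per step under the target policy, giving a ratio of 4, while the second contains an action the target policy never takes, giving a ratio of 0. The coverage requirement (every action the target policy takes must have positive probability under the behavior policy) is what keeps the per-step denominators nonzero on any trajectory the target policy could produce.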
Exercise 5.3 What is the Monte Carlo estimate analogous to (5.3) for action values, given returns generated using $\pi'$?