Errata and Notes for: Reinforcement Learning: An Introduction (first edition), by Richard S. Sutton and Andrew G. Barto
Errata:
- p. xviii, Ben Van Roy should be acknowledged only once in the
list. (Ben Van Roy)
- p. 29 (Figure 2.1). In the upper graph, the third line is
unlabeled, but
should be labeled "epsilon=0 (greedy)".
- p. 29 (Figure 2.1). If you try to reproduce these graphs, you
will find that in order to get the detailed structure at the very
beginning to match, you will have to use a policy that is
epsilon-greedy (as described in the text) and that breaks ties
randomly. The text does not say how ties (multiple actions with the
maximal action value) were handled, but in fact one of the tied actions
was chosen at random, which significantly affects the very beginning of the run.
Random tiebreak is generally a good idea, though of course it only has
an effect at the beginning of time, which happens rarely ;-). You can
get a similar effect, with less programming, just by initializing the
action values not to zero, but to very small random numbers. (Garett
Hunter)
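For concreteness, a minimal Python sketch of epsilon-greedy selection with random tie-breaking (the array q of action-value estimates and the numpy usage are illustrative, not from the book):

    import numpy as np

    def epsilon_greedy(q, epsilon):
        # q: current action-value estimates, one entry per arm (numpy array)
        if np.random.random() < epsilon:
            return np.random.randint(len(q))     # explore: any arm uniformly
        best = np.flatnonzero(q == q.max())      # indices of all maximal actions
        return np.random.choice(best)            # break ties uniformly at random
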
- p. 61. The upper limit in Equation 3.3 should be T-t-1.
- p. 78. In the 2nd max equation for V*(h), at the end of the
first line, "V*(h)" should be "V*(l)". (Christian Schulz)
- p. 98, Figure 4.3. For some MDPs the given policy iteration
algorithm never terminates. The problem is that there may be small
changes in the values computed in step 2 that cause the policy to
forever be changing in step 3. The solution is to terminate step 3 not
when the policy is stable, but as soon as the largest change in state
value due to a policy change is less than some epsilon. (Stuart
Reynolds)
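One way to implement this fix is sketched below in Python (the array layout P[s, a, s'] for transition probabilities and R[s, a] for expected rewards is an assumption for illustration, not from the book): rather than asking whether the greedy action changed, ask whether switching to it improves the state's backed-up value by more than a small epsilon.

    import numpy as np

    def improve_policy(P, R, V, policy, gamma, eps=1e-9):
        # Policy improvement that stops on value change rather than policy change.
        n_states, n_actions = R.shape
        stable = True
        for s in range(n_states):
            q = R[s] + gamma * P[s] @ V          # one-step lookahead for every action
            old = q[policy[s]]                   # backed-up value of the current action
            policy[s] = int(np.argmax(q))
            if q[policy[s]] - old > eps:         # value improved by more than eps?
                stable = False                   # then policy iteration should continue
        return policy, stable
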
- p. 105, Figure 4.7. The caption should end "with each other" (the
word "each" is missing). (Nikos Vlassis)
- p. 121, Figure 5.5. The two 3D graphs, above and below, should
be switched. (Terence Schauenberg)
- p. 127, Figure 5.7. The first two lines of step (c) refer to
pairs s,a and times t at or later than time \tau. In fact, it should
only treat them for times later than \tau, not equal. (Thorsten
Buchheim) No, wait, on further consideration, it is right as
printed. The action at \tau may be non-greedy, but everything
after it is, so we can use it to learn its value. In fact, this
is the only way the values of non-greedy actions are learned.
(Ashique Mahmood)
- p. 128, Exercise 5.4, second line from the end in the exercise,
"on-policy" should be "off-policy".
- p. 129, Exercise 5.5, the word "stationary" should be "sample".
(Kenichi Kato)
- p. 146, the windy gridworld example apparently used alpha=0.5
rather than alpha=0.1 as stated. (Yuxi Li)
- p. 147, the second line should read "17 steps, two more than the
minimum of 15" instead of "two less" as stated. (Barry D. Nichols)
- p. 151, second line of the equation, pi(s_t,a_t) should be
pi(s_{t+1},a_t), and the conditioning on s_t in the first line should
similarly be conditioning on s_{t+1}. (Dan Bernstein) This algorithm,
called Expected Sarsa, has since been extensively studied by Harm van
Seijen et al. in a 2009 paper published in the Proceedings of the IEEE
Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
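A minimal tabular sketch of the corrected update (Expected Sarsa) in Python; Q is a table of action values and policy_probs(s) is assumed, for illustration, to return pi(a|s) for every action:

    import numpy as np

    def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, alpha, gamma):
        # Target uses the expectation over the NEXT state's actions under the policy:
        #     r + gamma * sum_a' pi(a'|s_{t+1}) Q(s_{t+1}, a')
        expected_next = np.dot(policy_probs(s_next), Q[s_next])
        Q[s, a] += alpha * (r + gamma * expected_next - Q[s, a])
        return Q
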
- p. 155, the parameter alpha was 0.01, not 0.1 as stated. (Abinav
Garg)
- p. 157, first sentence of section 6.9. "shown" should be
"showed".
- p. 174, 181, 184, 200, 212, 213: in the boxed algorithms on
all these pages, the setting of the eligibility traces to zero should
appear not in the first line, but as a new first line inside the first
loop (just after the "Repeat..."). (Jim Reggia)
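Taking Sarsa(lambda) with linear function approximation as one example, the intended structure is sketched below (a Python sketch under assumed interfaces: env.reset() returns a state, env.step(a) returns (next state, reward, done), and active_tiles(s, a) returns the indices of the active binary features). The point is only that the trace vector e is zeroed at the start of every episode, while theta persists.

    import numpy as np

    def sarsa_lambda(env, active_tiles, n_weights, n_actions, num_episodes,
                     alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
        # Linear Sarsa(lambda) with binary features and replacing traces.
        theta = np.zeros(n_weights)              # weights persist across episodes

        def q(s, a):
            return theta[active_tiles(s, a)].sum()

        def pick(s):
            if np.random.random() < epsilon:
                return np.random.randint(n_actions)
            return int(np.argmax([q(s, a) for a in range(n_actions)]))

        for _ in range(num_episodes):            # "Repeat (for each episode):"
            e = np.zeros(n_weights)              # traces are zeroed HERE, inside the loop
            s = env.reset()
            a = pick(s)
            done = False
            while not done:
                s2, r, done = env.step(a)
                delta = r - q(s, a)
                e[active_tiles(s, a)] = 1.0      # replacing traces
                if not done:
                    a2 = pick(s2)
                    delta += gamma * q(s2, a2)
                theta += alpha * delta * e
                e *= gamma * lam
                s = s2
                if not done:
                    a = a2
        return theta
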
- p. 197, bottom formula last theta_t(2) should be theta_t(n). (Dan
Bernstein)
- p. 201, the two MSEs should be RMSEs, that is, they should be
square roots of the MSEs. (Tao Wang and Dale Schuurmans)
- p. 212-213. There are a number of small problems with these
algorithms. Instead of describing minimal fixes, I would prefer to
rewrite them; rewritten versions are available here for Sarsa(lambda)
with FA and here for Watkins's Q(lambda) with FA.
- p. 215, Figure 8.11, the y-axis label. "first 20 trials" should
be "first 20 episodes".
- p. 215. The data shown in Figure 8.11 was apparently not
generated exactly as described in the text, as its details (but not its
overall shape) have defied replication. In particular, several
researchers have reported best "steps per episode" in the 200-300
range.
- p. 215. Two-thirds down the page. "\alpha = 0.05 (0.1/m)" should
be "\alpha = 0.05 (0.5/m)". m here is 10, the number of tilings.
- p. 218, Figure 8.13. The funny symbol between the two thetas
at the top should be just a dash, as in theta(1) through theta(5),
which all behave as shown in this line.
- p. 220, the last term of the last equation should have a gamma
before it (before the expectation).
- p. 233, last line of caption: "ne-step" should be "one-step".
(Michael Naish)
- p. 245, the analysis whose results are shown in Figure 9.13 is
slightly in error. The text on the next page says that the error is
reduced according to 1/sqrt{t} times (b-1)/b, but it should be
1/sqrt{t} times sqrt{(b-1)/b}. The corrected version of Figure 9.13
(which is slightly changed) is available here.
- p. 259, the reference to McCallum, 1992 should be to
Chrisman, 1992. And in the references section, on p. 302, the
(incorrect) listing for McCallum, 1992 should not be there. (Paul
Crook)
- p. 267, Table 11.1. The number of hidden units for TD-Gammon 3.0
is given as 80 but should be 160. (Michael Naish)
- p. 282, middle paragraph, it says Fig. 11.10 shows mean arrival rates of
150, 200, and 300, but the last number should be 350 rather than 300.
(Susan Overstreet)
- p. 309, the reference for Tsitsiklis and Van Roy (1997b) should
be to Technical Report LIDS-P-2390, Massachusetts Institute of
Technology. (Ben Van Roy)
- p. 322, in the index entry for TD error, the range listed as
"174-165" should be "174-175". (Jette Randlov)
Other Notes:
- p. 28. The description of the 10-armed testbed could be clearer.
Basically there are 2000 randomly generated 10-armed bandit problems. The
true values Q*(a) for each of these were selected from a normal distribution
with mean 0 and variance 1. Then, on each play with each bandit, the reward was
determined by adding to Q*(a) another normally distributed random
number with mean 0 and variance 1.
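In code, the testbed might be generated as follows (a sketch, not the original implementation; the numpy usage is illustrative):

    import numpy as np

    n_bandits, n_arms = 2000, 10

    # 2000 independent 10-armed bandit problems; each true value Q*(a) ~ N(0, 1)
    q_star = np.random.normal(0.0, 1.0, size=(n_bandits, n_arms))

    def reward(task, action):
        # Reward on one play: the true value plus N(0, 1) noise
        return q_star[task, action] + np.random.normal(0.0, 1.0)
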
- The Gambler's Problem, pages 101-103. The optimal policy
shown in Figure 4.6 is not unique. There are many other optimal
policies, all of which share the optimal value function also shown in the
figure. Which optimal policy you get from your value iteration
algorithm depends on how it breaks ties within the numerical resolution
of your computer. An email discussion of this issue is here. (Daniel Loeb, Daniel Polani, Francesco
Decomite, and many students)
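A minimal value-iteration sketch for the gambler's problem (Python; p_h is the probability of heads, and the variable names are illustrative). The rounding used when extracting the greedy policy is one arbitrary way of breaking near-ties; break them differently and you get a different, equally optimal, policy:

    import numpy as np

    def gamblers_problem(p_h=0.4, goal=100, tol=1e-9):
        # Value iteration; V[s] = probability of reaching the goal from capital s.
        V = np.zeros(goal + 1)
        V[goal] = 1.0
        while True:
            delta = 0.0
            for s in range(1, goal):
                stakes = range(1, min(s, goal - s) + 1)
                returns = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes]
                best = max(returns)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Greedy policy extraction: rounding makes near-equal action values tie,
        # and argmax then picks the smallest such stake.
        policy = np.zeros(goal + 1, dtype=int)
        for s in range(1, goal):
            stakes = list(range(1, min(s, goal - s) + 1))
            returns = np.array([p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes])
            policy[s] = stakes[int(np.argmax(np.round(returns, 5)))]
        return V, policy
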
- p. 127, Figure 5.7. This algorithm is only valid if all policies
are proper,
meaning that they produce episodes that always eventually terminate
(this assumption is made on the first page of the chapter). This
restriction on environments can be lifted if the algorithm is modified
to use epsilon-soft policies, which are proper for all environments.
Such a modification is a good exercise for the reader! Alternative
ideas for off-policy Monte Carlo learning are discussed in this recent research
paper.
- Chapter 5. John Tsitsiklis has obtained some new results
which come very close to solving "one of the most important open
theoretical questions in reinforcement learning" -- the convergence of
Monte Carlo ES. See here.
- Exercise 7.3. This is a difficult exercise. Igor
Karpov made a thorough answer available, but at present only part of it
is still available, from the wayback machine, at
https://web.archive.org/web/20160326154910/http://www.cs.utexas.edu/~ikarpov/Classes/RL/RandomWalk/.
- Chapter 8. The overall conclusions of this chapter have to be
re-evaluated in light of the recent development of true
gradient-descent TD methods. The current view is much more
straightforward and positive. In particular, we now have stable and
efficient algorithms for off-policy
TD learning with linear function approximation, for nonlinear
function approximation, for off-policy control (in preparation),
and for general
option-conditional predictions. It seems likely that these new
ideas will result in a whole new generation of reinforcement learning
algorithms for which function approximation is much more
straightforward.
- p. 212-213. In these two algorithms, it is implicit that the set
of features for the terminal state (and all actions) is the
empty set. It would be better if this case were treated explicitly in
the pseudo-code, and this is now done for the rewritten versions of
these figures mentioned in the errata.
- The last equation on page 214 can be a little confusing. The
minus sign here is meant to be grouped with the 0.0025 (as the spacing
suggests). Thus the consecutive plus and minus signs have the same
effect as a single minus sign. (Chris Hobbs)
- Dyna-Q+ involved two other changes not mentioned in the book.
First, in planning steps, actions were considered from states even if
those actions had never previously been taken from them. Second, the
initial model for these untried actions was that they produced a reward
of zero and returned the agent to the same state. The number of steps
since last taking them was of course the total number of steps so far,
as if they had all been taken on step 0. (Kavosh Asadi)
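The planning step with these two changes might look as follows (a Python sketch with assumed data structures: Q is an array of action values, model maps (s, a) to (reward, next state), last_taken maps (s, a) to the time step at which it was last tried, and visited_states is the list of states seen so far):

    import numpy as np

    def dyna_q_plus_planning(Q, model, last_taken, step, visited_states, n_actions,
                             n_planning, alpha, gamma, kappa):
        # One round of Dyna-Q+ planning steps incorporating the two changes above.
        for _ in range(n_planning):
            s = visited_states[np.random.randint(len(visited_states))]
            a = np.random.randint(n_actions)       # any action, even one never tried from s
            r, s2 = model.get((s, a), (0.0, s))    # untried actions: reward 0, stay in s
            tau = step - last_taken.get((s, a), 0) # never tried => as if taken on step 0
            r_plus = r + kappa * np.sqrt(tau)      # Dyna-Q+ exploration bonus
            Q[s, a] += alpha * (r_plus + gamma * Q[s2].max() - Q[s, a])
        return Q
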
- Discounting and function approximation. Since writing the book,
it has become apparent to me that discounting as it is widely used is
inconsistent with function approximation. In particular, the
formulation of the problem given in Section 3.8 breaks down whenever
there is a restriction on the policy space (such as occurs to greedy
policies when the value function is approximated). In these cases, one
cannot necessarily choose a policy that optimizes the value of all
states simultaneously, as the best policy for one state may not be the
best from another. The only way out is to specify which states one
cares about. For example, one may specify a start state (or start state
distribution) as the state one cares about optimizing. Alternatively,
one might hope to use the distribution of states actually experienced
under the policy. Interestingly, however, in this case one can show that the discount rate
has
no effect on the policy ordering, and that the effective result is the
same as the undiscounted (or "average reward") case outlined in Section
6.7. The lesson, I think, is that we should switch from discounted
reward to average reward as our base case in thinking about
reinforcement learning and artificial intelligence.
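The argument for that last claim can be sketched as follows (notation not from the book: mu_pi is the stationary on-policy state distribution, r(pi) the average reward, v_pi^gamma the discounted value function). Summing the discounted values under the on-policy distribution gives an objective that is just the average reward scaled by a constant, so the discount rate drops out of the policy ordering:

    \begin{align*}
    J_\gamma(\pi) &= \sum_s \mu_\pi(s)\, v_\pi^\gamma(s) \\
      &= \sum_s \mu_\pi(s) \sum_a \pi(a|s) \Big[ r(s,a)
         + \gamma \sum_{s'} p(s'|s,a)\, v_\pi^\gamma(s') \Big] \\
      &= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s')
         \sum_s \mu_\pi(s) \sum_a \pi(a|s)\, p(s'|s,a) \\
      &= r(\pi) + \gamma \sum_{s'} \mu_\pi(s')\, v_\pi^\gamma(s')
         \qquad \text{(by stationarity of } \mu_\pi) \\
      &= r(\pi) + \gamma J_\gamma(\pi)
      \quad\Longrightarrow\quad J_\gamma(\pi) = \frac{r(\pi)}{1-\gamma}.
    \end{align*}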