by Richard S. Sutton and Andrew G. Barto

- p1, second paragraph: "methods.of" --> "methods. That is,
we adopt the perspective of" (Samuel Pettersson)

- p29, second paragraph: "about 1.55" would be better as "about
1.54" (Tetsuro Tanaka)

- p35, Equation 2.9 should say n>0, not n>=0 (John Senneker)
- p68, paragraph 3: this would be clearer with the word "tabular" inserted before "continuing tasks" (Gabriel Paludo Licks)
- p80: "finite number of policies" --> "finite number of deterministic policies" (Samuel Pettersson)
- p80, in the box, at the end of the initialization line, add: "; V(terminal)=0" (Samuel Pettersson)
- p102: The first equation in the middle group is not correct. It and the sentence leading up to it should read: We know that $\tilde v_*$ is the unique solution to the Bellman optimality equation for $v_*$ with altered transition probabilities: $\tilde v_*(s)=\max_a\sum_{s',r} \Bigl[ (1-\varepsilon)p(s',r|s,a) + \sum_{a'}\frac{\varepsilon}{|\mathcal{A}(s)|} p(s',r|s,a')\Bigr]\Bigl[r+\gamma\tilde v_*(s')\Bigr]$ (Samuel Pettersson)
- p129: The convergence of Sarsa also requires the usual conditions on the step sizes (2.7). (Samuel Pettersson)
- p164, 2nd line: "contents of the" --> "contents of the model" (Samuel Pettersson)
- p259, Equation 11.6: The last importance sampling ratio should be \rho_{t+n} instead of \rho_{t+n-1}. (Nikolay Gudkov)
- p259, the last line of the second paragraph: G_{t:n} should be G_{t:t+n}. (Nikolay Gudkov)
- p259, Equation 11.8: w_t-1 --> w_t+n-1 (A Reader)

- p269, in the first full sentence after (11.20): "for v" --> "for v_{\bf w}" (Nikolay Gudkov)
- p269, 4th para, 2nd line: \Pi\bar\delta_{v_{\bf w}} should be \Pi\bar\delta_{\bf w} (Nikolay Gudkov)
- p269, last sentence before (11.22): v should be v_{\bf w} (Nikolay Gudkov)
- p275, line 5: "even given even" --> "even given" (Nikolay Gudkov)
- p278, last equation: {\bf x}_s{\bf x}_s --> {\bf x}(s){\bf x}(s) (Nikolay Gudkov)
- p282, first line after the equations: M_{t-1}=0 --> M_{-1}=0 (Nikolay Gudkov)

- p309, before (12.23): "the truncated version of this return" --> "this final \lambda-return" (Nikolay Gudkov)
- p331, second line of Section 3.5: "the only the" --> "only the" (Ansel Blume)
- p336, Exercise 13.4: "gaussian" --> "Gaussian" (Samuel Pettersson)
- p459, Equation 17.1: This equation is a definition and should
have a dot over the equal sign

- p495: The reference for John (1994) is missing. It should be John, G. H. (1994). When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Association for the Advancement of Artificial Intelligence, p. 1464. (Michael Przystupa)
- p496, in reference for Keiflin and Janak: "Ffrom" --> "From" (Steven Bagley)

- The title of the final section, Section 17.6, was mistakenly printed as a repeat of an earlier section's title. It should be "The Future of Artificial Intelligence." This is also wrong in the table of contents.

- The phrase "function approximation" was mistakenly abbreviated to "function approx." many times in the printed book.

- p11, 5 lines from bottom: "(see (Section 16.1))" --> "(Section 16.1)"
- p19, 8 lines from bottom: "(Section 16.2)" --> "(Section 15.9)"
- p30, Exercise 2.2: The values specified for R_1 and R_3 should
have minus signs in front of them

- p64, after the figure: v_pi --> v_*

- p81: At the top of this page there should be a bold heading:
"Example 4.2: Jack's Car Rental". Related to this, the example
on page 84 should be Example 4.3. Also on page 81, there should
be a paragraph break after "the policy that never moves any
cars." (Sam Ritchie)

- p98, start of last paragraph: For Monte Carlo policy evaluation --> For Monte Carlo policy iteration
- p107, the 3rd line is cut off; it should read: "using the behavior policy that selects right and left with equal probability."

- p117, in 5.5: probabalistic --> probabilistic
- p149, in the pseudocode algorithm, in the upper limit for the
\rho product: "\tau+n-1" --> "\tau+n", and in the comment:
"t+n-1" --> "\tau+n". (Jyothis Vasudevan, Zixuan Jiang, Yifan
Wang, and Zhiqi Pan)

- p153, end of first paragraph: occuring --> occurring
- p155: Equations (7.17) and (7.18) are mistakenly the same. The first equation should actually be a three-line derivation as given in the online book, here: http://www.incompleteideas.net/book/RLbook2018.pdf#page=177. The second equation should be numbered (7.17). (Xiang Gu)
- p156, line 2: (7.11) --> (7.5) (Xiang Gu)
- p156, in the algorithm, the line "G <- 0:" should be
replaced with the two lines: "If t+1 < T:" and then, indented
from that: "G <- Q(S_{t+1},A_{t+1})" (Mark
Rowland and Will Dabney)

- p178, line 10: polices --> policies (Kaniska
Mohanty)

- p180, line 24: "sill" --> "still" (James R)
- p198 2/3 down page: "s \mapsto g" --> "s \mapsto u"
- p202, after (9.7): "(S_t)" --> "(s)" (Dhawal
Gupta)

- p204, bottom: "approximate state-value function" --> "approximate the state-value function"
- p206, end of (9.11): "\Re^d \times \Re^d" --> "\Re^{d\times d}" (Dhawal Gupta)
- p208, beginning of second paragraph: "the these" -->
"these". (Miguel Drummond)

- p212, line 1: "length the interval" --> "length of the interval" (Prabhat Nagarajan)
- p212, second to last line: The i index should start at 1, not 0. (Chris Harding)
- p220, middle of page: horizonal --> horizontal
- p229, In (9.22) and the equation above it labeled (from (9.20)), all the x's should have their time index reduced by 1: x_t --> x_{t-1} and x_{t+1} --> x_t (Frederic Godin)
- p229, bottom of page: forgeting --> forgetting
- p241: "approximated by linear combination" --> "approximated by a linear combination" (Prabhat Nagarajan)
- p244, line 14: w_t --> w_{t-1} (Frederic Godin)
- p246, footnote 1: ",A)" in the call to tiles should be ",[A])"
as this last argument to tiles must be a list (Martha
Steenstrup)

- p248, Exercise 10.1, line 2: "or in" --> "in"

- p256 in 10.3: "Tsitiklis" --> "Tsitsiklis" (Prabhat Nagarajan)
- p259, above (11.6): "Expected Sarsa" --> "Sarsa" (Xiang Gu)
- p267, within Figure 11.3: "\overline{TDE}=0" --> "\min \overline{TDE}"
- p279, end of paragraph below (11.27): "naive residual-gradient" --> "residual gradient" (Abhishek Naik)

- p286 in 11.7: "Mahadeval" --> "Mahadevan" (Prabhat Nagarajan)
- p302, middle: auxilary --> auxiliary
- p304, just before Example 12.1: a second closing parenthesis is needed after "inactive (=0)" (Abhishek Naik)
- p310, shortly after the three equations on top: "sum left" --> "sum on" (Abhishek Naik)
- p321, after equation (13.1): d should be d'
- p327, the first equation should not be an equality, but a
proportional-to. And in the second line following that system of
equations, the word "equal" should be changed to "proportional"
(Mirco Musolesi)

- p327, 5 lines from bottom: "boxed" --> "boxed algorithm" (Douglas De Rizzo Meneghetti)
- p329, bottom: \w\in\Re^m --> \w\in\Re^d
- p337, line 25: critizes --> criticizes (Prabhat Nagarajan)
- p337, bottom: Schall --> Schaal
- p350, third line: "Rescoral-Wgner" --> "Rescorla-Wagner" (Kyle Simpson)
- p350, (14.6): the z indices should be t and t-1 (Abhishek Naik)
- p354, 11 lines from the bottom: "Rescoral-Wagner" --> "Rescorla-Wagner" (Kyle Simpson)
- p371, in the note on section 14.2.2, 3rd line from bottom: Schmajuk --> Schmajuk's
- p372, in the note on section 14.3, line 2: Thorndikes --> Thorndike's
- p398, near the bottom: "receiving action R_t+1" --> "receiving reward R_t+1" (Caleb Bowyer)
- p399, the two eligibility-trace update equations (z) should have a gamma just before their lambdas (Abhishek Naik)
- p399, in the line after the equations, \lambda^w c --> \lambda^{\bf w} and \lambda^w a --> \lambda^{\theta} (Abhishek Naik)
- p400: The left side of (15.3) is missing a logarithm (ln)
between the grad symbol (Nabla) and the policy symbol (pi)
(Jiahao Fan)

- p400, in (15.3) and again 7 lines down: "A_t-\pi(A_t|S_t" --> "A_t-\pi(1|S_t"
- p401, third paragraph, three times: "A_t-\pi(A_t|S_t" --> "A_t-\pi(1|S_t"
- p415, biblio section 15.8: "\pi(A_t|S_t" --> "\pi(1|S_t"

- p436, eight lines from the bottom: "Tesauro and colleages" --> "Tesauro and colleagues" (Raymund Chua)
- p447, 14 lines from the bottom: "Figure 16.7, were $\theta$ is" --> "Figure 16.7, where $\theta$ is" (Kyle Simpson)
- p451, middle left: user-targed --> user-targeted
- p460, in paragraph 3, then again in paragraph 4: auxilary --> auxiliary
- p461, bottom line: \g_\omega(S_{t+1}) --> 1-\g_\omega(S_{t+1})
- p462, 2nd line: \g_\omega(S_{t+2}) --> 1-\g_\omega(S_{t+2})
- p463, 6th line from bottom: C_t=\g(S_t)\cdot\ind{S_t=s'} --> C_t=(1-\g_\omega(S_t))\ind{S_t=s'}

- p465, middle right: there should be no commas in the list defining tau
- p465: This definition and discussion of Markov state is
insufficient. In addition to a Markov state being sufficient to
make the one-step predictions (17.6), it must also be
incrementally updatable as in (17.9).

- p504: "Pavlov, P. I." --> "Pavlov, I. P." (Brian Christian)

- It is easy to make a URL that refers to a specific page of the online version of the book. Use a URL of the form "http://www.incompleteideas.net/book/RLbook2018.pdf#page=X" where X is the number of the page you want to refer to plus 22. For example, to refer to page number 100, you would use "http://www.incompleteideas.net/book/RLbook2018.pdf#page=122".
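  For scripting, the offset rule above can be wrapped in a tiny helper (a sketch; the function name is made up):

  ```python
  def rlbook_url(printed_page):
      """URL for a printed page of the online PDF.

      The PDF's internal page numbering runs 22 pages ahead of the
      printed page numbers (because of the front matter), so add 22.
      """
      base = "http://www.incompleteideas.net/book/RLbook2018.pdf"
      return "%s#page=%d" % (base, printed_page + 22)

  # e.g. rlbook_url(100) gives ...RLbook2018.pdf#page=122
  ```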
- Thermal soaring (Section 16.8) has since been extended to real gliders. See Reddy, G., Ng, J. W., Celani, A., Sejnowski, T. J., Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 562:236-239, 2018.
- The differential Sarsa algorithm for the average-reward case, shown on page 251, only converges to the true action values up to an additive constant. That is, \hat q(s,a,w_\infty) = q_*(s,a) + Q for some scalar Q. (To see this, note that adding an arbitrary constant, say 100, to the estimated values of all actions would have no effect on any of the TD errors (10.10).) If you wanted to estimate the true differential action values, you would have to estimate Q in addition to running the given algorithm. It is not hard to see that under asymptotic on-policy conditions, the average of q_*(S_t,A_t) is zero. It follows that Q is the asymptotic average of \hat q(S_t,A_t,w_t). Thus one could estimate Q by \bar Q, updated by \bar Q_t = \bar Q_{t-1} + beta * (\hat q(S_t,A_t,w_t) - \bar Q_{t-1}). Then if at time t you want an estimate of the true value you would use q_*(s,a) \approx \hat q(s,a,w_t) - \bar Q_t.
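  The \bar Q estimate described above can be sketched in a few lines of Python (hypothetical names; `beta` is a step size, and `q_hat_value` stands in for \hat q(S_t,A_t,w_t)):

  ```python
  def make_offset_tracker(beta=0.01):
      """Track \bar Q, a recency-weighted average of visited \hat q values."""
      state = {"q_bar": 0.0}

      def update(q_hat_value):
          # \bar Q_t = \bar Q_{t-1} + beta * (\hat q(S_t,A_t,w_t) - \bar Q_{t-1})
          state["q_bar"] += beta * (q_hat_value - state["q_bar"])
          return state["q_bar"]

      def corrected(q_hat_value):
          # Estimate of the true differential value: \hat q - \bar Q
          return q_hat_value - state["q_bar"]

      return update, corrected
  ```

  Call `update` once per on-policy step; `corrected` then approximates q_*(s,a) asymptotically, per the argument above.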
- (first printing) In the computation of the reward baseline, described in words after (2.12) on page 37, it probably would be better if the baseline at step t included only rewards up to time t (i.e., not the reward at time t). This would be more consistent with the averages described earlier in this chapter and with the baselines used in later chapters, and is arguably necessary for the theory developed in the box on page 39. Thanks to Frederic Godin for pointing this out.
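  A minimal sketch of the suggested ordering (hypothetical names; the preference update is (2.12), and the key point is that the baseline is applied before R_t is folded into it):

  ```python
  import math

  def gradient_bandit_step(H, baseline, t, a, R, alpha=0.1):
      """One gradient-bandit update of preferences H for chosen action a.

      `baseline` is the average of R_1, ..., R_{t-1}; it is used in the
      update BEFORE being updated to include R_t.
      """
      exps = [math.exp(h) for h in H]
      total = sum(exps)
      pi = [e / total for e in exps]      # softmax action probabilities
      for i in range(len(H)):             # update uses the OLD baseline
          if i == a:
              H[i] += alpha * (R - baseline) * (1 - pi[i])
          else:
              H[i] -= alpha * (R - baseline) * pi[i]
      baseline += (R - baseline) / t      # only now fold in R_t
      return H, baseline
  ```

  With this ordering, the baseline at step t is exactly the sample average of the first t-1 rewards, matching the averages used earlier in the chapter.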
- The results in Figure 4.3 are for asynchronous value iteration---the algorithm given on page 83. Because of this, the results are dependent on the order in which the states are swept through. The given results are for sweeping through the states from Capital=1 to Capital=99, which is probably a poor order in terms of rapidly converging to the optimal value function.
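  That in-place sweep can be sketched as follows (a hypothetical reconstruction, assuming p_h = 0.4 as in the book's example; names are made up):

  ```python
  def gamblers_sweep(V, p_h=0.4, goal=100):
      """One in-place (asynchronous) value-iteration sweep over
      Capital = 1, ..., goal-1, in that order. V[0] = 0 and
      V[goal] = 1 are fixed boundary values."""
      delta = 0.0
      for s in range(1, goal):             # the sweep order matters
          best = 0.0
          for stake in range(1, min(s, goal - s) + 1):
              v = p_h * V[s + stake] + (1 - p_h) * V[s - stake]
              best = max(best, v)
          delta = max(delta, abs(best - V[s]))
          V[s] = best                      # later states in this sweep
      return delta                         # already see the new value

  V = [0.0] * 100 + [1.0]
  while gamblers_sweep(V) > 1e-10:
      pass   # V[s] converges to the probability of reaching the goal
  ```

  Reversing or randomizing the loop over s changes only how quickly the sweeps converge, not the final value function, which is the point of the note above.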
- Exercise 7.3. This is a difficult exercise. Igor Karpov made a thorough answer available; at present only part of it survives, via the Wayback Machine, at https://web.archive.org/web/20160326154910/http://www.cs.utexas.edu/~ikarpov/Classes/RL/RandomWalk/.