Errata and Notes for Reinforcement Learning: An Introduction, second edition
Errata for the first printing of the second edition:
- The title of the final section, Section 17.6, was mistakenly
printed as a repeat of an earlier section's title. It should be "The
Future of Artificial Intelligence." This is also wrong in the table of
contents.
- The phrase "function approximation" was mistakenly abbreviated to
"function approx." many times in the printed book.
- p11, 5 lines from bottom: "(see (Section 16.1))" --> "(Section 16.1)"
- p19, 8 lines from bottom: "(Section 16.2)" --> "(Section 15.9)"
- p30, Exercise 2.2: The values specified for R_1 and R_3 should
have minus signs in front of them
- p64, after the figure: v_pi --> v_*
- p81: At the top of this page there should be a bold heading:
"Example 4.2: Jack's Car Rental". Related to this, the example on page
84 should be Example 4.3. Also on page 81, there should be a paragraph
break after "the policy that never moves any cars." (Sam Ritchie)
- p98, start of last paragraph: For Monte Carlo policy evaluation
--> For Monte Carlo policy iteration
- p107, the 3rd line is cut off; it should read: "using the behavior
policy that selects right and left with equal probability."
- p117, in 5.5: probabalistic --> probabilistic
- p153, end of first paragraph: occuring --> occurring
- p155: Equations (7.17) and (7.18) are mistakenly the same. The
first equation should actually be a three-line derivation as given in
the online book, here:
http://www.incompleteideas.net/book/RLbook2018.pdf#page=177. The second
equation should be numbered (7.17). (Xiang Gu)
- p156, line 2: (7.11) --> (7.5) (Xiang Gu)
- p156, in the algorithm, the line "G <- 0:" should be replaced with
the two lines: "If t+1 < T:" and then, indented from that:
"G <- Q(S_{t+1},A_{t+1})" (Mark Rowland and Will Dabney)
- p178, line 10: polices --> policies (Kaniska Mohanty)
- p180, line 24: "sill" --> "still" (James R)
- p198 2/3 down page: "s \mapsto g" --> "s \mapsto u"
- p204, bottom: "approximate state-value function" --> "approximate the
state-value function"
- p212, line 1: "length the interval" --> "length of the
interval" (Prabhat Nagarajan)
- p212, second to last line: The i index should start at 1, not 0.
(Chris Harding)
- p220, middle of page: horizonal --> horizontal
- p229, In (9.22) and the equation above it labeled (from (9.20)),
all the x's should have their time index reduced by 1: x_t -->
x_{t-1} and x_{t+1} --> x_t (Frederic Godin)
- p229, bottom of page: forgeting --> forgetting
- p241: "approximated by linear combination" --> "approximated
by a linear combination" (Prabhat Nagarajan)
- p244, line 14: w_t --> w_{t-1} (Frederic Godin)
- p248, Exercise 10.1, line 2: "or in" --> "in"
- p256 in 10.3: "Tsitiklis" --> "Tsitsiklis" (Prabhat Nagarajan)
- p259, above (11.6): "Expected Sarsa" --> "Sarsa" (Xiang Gu)
- p267, within Figure 11.3: "\overline{TDE}=0" --> "\min \overline{TDE}"
- p286 in 11.7: "Mahadeval" --> "Mahadevan" (Prabhat Nagarajan)
- p302, middle: auxilary --> auxiliary
- p321, after equation (13.1): d should be d'
- p327, 5 lines from bottom: "boxed" --> "boxed algorithm"
(Douglas De Rizzo Meneghetti)
- p329, bottom: \w\in\Re^m --> \w\in\Re^d
- p337, bottom: Schall --> Schaal
- p350, third line: "Rescoral-Wgner" --> "Rescorla-Wagner" (Kyle
Simpson)
- p354, 11 lines from the bottom: "Rescoral-Wagner" -->
"Rescorla-Wagner" (Kyle Simpson)
- p371, in the note on section 14.2.2, 3rd line from bottom:
Schmajuk --> Schmajuk's
- p372, in the note on section 14.3, line 2: Thorndikes --> Thorndike's
- p400: The left side of (15.3) is missing a logarithm (ln) between
the grad symbol (Nabla) and the policy symbol (pi) (Jiahao Fan)
- p400, in (15.3) and again 7 lines down: "A_t-\pi(A_t|S_t" -->
"A_t-\pi(1|S_t"
- p401, third paragraph, three times: "A_t-\pi(A_t|S_t" -->
"A_t-\pi(1|S_t"
- p415, biblio section 15.8: "\pi(A_t|S_t" --> "\pi(1|S_t"
- p436, eight lines from the bottom: "Tesauro and colleages" -->
"Tesauro and colleagues" (Raymund Chua)
- p447, 14 lines from the bottom: "Figure 16.7, were $\theta$ is" -->
"Figure 16.7, where $\theta$ is" (Kyle Simpson)
- p451, middle left: user-targed --> user-targeted
- p460, in paragraph 3, then again in paragraph 4: auxilary -->
auxiliary
- p461, bottom line: \g_\omega(S_{t+1}) --> 1-\g_\omega(S_{t+1})
- p462, 2nd line: \g_\omega(S_{t+2}) --> 1-\g_\omega(S_{t+2})
- p463, 6th line from bottom: C_t=\g(S_t)\cdot\ind{S_t=s'}
--> C_t=(1-\g_\omega(S_t))\ind{S_t=s'}
- p465, middle right: there should be no commas in the list
defining tau
- p504: "Pavlov, P. I." --> "Pavlov, I. P." (Brian Christian)
Notes:
- Thermal soaring (Section 16.8) has since been extended to real
gliders. See Reddy, G., Ng, J. W., Celani, A., Sejnowski, T. J.,
Vergassola, M. (2018). Soaring like a bird via reinforcement learning
in the field. Nature 562:236-239.
- The differential Sarsa algorithm for the average-reward case,
shown on page 251, only converges to the true action values up to an
additive constant. That is, \hat q(s,a,w_\infty) = q_*(s,a) + Q for
some scalar Q. (To see this, note that adding an arbitrary constant,
say 100, to the estimated values of all actions would have no effect on
any of the TD errors (10.10).) If you wanted to estimate the true
differential action values, you would have to estimate Q in addition to
running the given algorithm. It is not hard to see that under
asymptotic on-policy conditions, the average of q_*(S_t,A_t) is zero.
It follows that Q is the asymptotic average of \hat q(S_t,A_t,w_t).
Thus one could estimate Q by \bar Q, updated by \bar Q_t = \bar Q_{t-1}
+ beta * (\hat q(S_t,A_t,w_t) - \bar Q_{t-1}). Then if at time t you
want an estimate of the true value, you would use q_*(s,a) \approx \hat
q(s,a,w_t) - \bar Q_t. (A minimal code sketch of this correction is
given after these notes.)
- In the computation of the reward baseline, described in words
after (2.12) on page 37, it probably would be better if the baseline at
step t included only the rewards received before time t (i.e., not the
reward at time t). This would be more consistent with the averages
described earlier in this chapter and with the baselines used in later
chapters, and is arguably necessary for the theory developed in the box
on page 39. Thanks to Frederic Godin for pointing this out. (A sketch
of this update order is given after these notes.)
- The results in Figure 4.3 are for asynchronous value iteration---the
algorithm given on page 83. Because of this, the results are dependent
on the order in which the states are swept through. The given results
are for sweeping through the states from Capital=1 to Capital=99, which
is probably a poor order in terms of rapidly converging to the optimal
value function. (A sketch of this in-place sweep is given after these
notes.)
- Exercise 7.3. This is a difficult exercise. Igor
Karpov made a thorough answer available, but at present only part of it
is still available, from the Wayback Machine, at
https://web.archive.org/web/20160326154910/http://www.cs.utexas.edu/~ikarpov/Classes/RL/RandomWalk/.
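- Code sketch for the note above on differential Sarsa (page 251). The
following is a minimal Python illustration of the suggested offset
correction, not code from the book; q_hat, w_t, and beta are
placeholder names for the learned approximate differential action-value
function, its weight vector, and the running-average step size.

    class OffsetEstimator:
        """Running estimate of the offset Q between the learned values
        and the true differential action values."""

        def __init__(self, beta=0.01):
            self.beta = beta    # step size for the running average
            self.q_bar = 0.0    # current estimate of the offset Q

        def update(self, q_value_t):
            # q_value_t is q_hat(S_t, A_t, w_t), the learned value at
            # the state-action pair just visited
            self.q_bar += self.beta * (q_value_t - self.q_bar)

        def corrected(self, q_value):
            # q_hat(s, a, w_t) minus the estimated offset approximates
            # the true differential value q_*(s, a)
            return q_value - self.q_bar

At each step one would call update with q_hat(S_t, A_t, w_t) alongside
the differential Sarsa updates, and read off corrected values with
corrected(q_hat(s, a, w_t)).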
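- Code sketch for the note above on the reward baseline (page 37). A
minimal gradient-bandit step in which the baseline used at time t is
the average of rewards strictly before time t, and R_t is folded into
the baseline only afterward. The function and variable names are
illustrative, not from the book.

    import numpy as np

    def gradient_bandit_step(H, baseline, t, A_t, R_t, alpha=0.1):
        # H: vector of action preferences; baseline: average of
        # R_1..R_{t-1} (taken to be 0 when t == 1); alpha: step size.
        pi = np.exp(H - H.max())
        pi /= pi.sum()                      # soft-max action probabilities
        one_hot = np.zeros_like(H)
        one_hot[A_t] = 1.0
        H = H + alpha * (R_t - baseline) * (one_hot - pi)  # old baseline
        baseline = baseline + (R_t - baseline) / t         # now include R_t
        return H, baseline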
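- Code sketch for the note above on Figure 4.3. A minimal in-place
(asynchronous) value iteration for the gambler's problem, sweeping the
states in the fixed order Capital = 1, ..., 99, so that later states in
a sweep see values already updated earlier in the same sweep. The
parameter names p_h (probability of heads) and theta (convergence
threshold), and their default values, are illustrative assumptions.

    import numpy as np

    def gamblers_value_iteration(p_h=0.4, theta=1e-9):
        V = np.zeros(101)
        V[100] = 1.0              # fold the +1 goal reward into V(100)
        while True:
            delta = 0.0
            for s in range(1, 100):   # fixed sweep order: Capital = 1..99
                stakes = range(1, min(s, 100 - s) + 1)
                best = max(p_h * V[s + a] + (1 - p_h) * V[s - a]
                           for a in stakes)
                delta = max(delta, abs(best - V[s]))
                V[s] = best           # in-place: seen by later states
            if delta < theta:
                return V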