Figures for:
Reinforcement Learning: An Introduction
by
Richard S. Sutton
and
Andrew G. Barto
Below are links to PostScript files for the figures in the book.
Page 1: Tic-Tac-Toe Game
Figure 1.1: Tic-Tac-Toe Tree
Figure 2.1: 10-armed Testbed Results
Figure 2.2: Easy and Difficult Regions
Figure 2.3: Performance on Bandits A and B
Figure 2.4: Effect of Optimistic Initial Action Values
Figure 2.5: Performance of Reinforcement Comparison Method
Figure 2.6: Performance of Pursuit Method
Figure 3.1: Agent-Environment Interaction
Figure 3.2: Pole-Balancing Example
Page 62: Absorbing State Sequence
Figure 3.3: Transition Graph for the Recycling Robot Example
Figure 3.4: Prediction Backup Diagrams
Figure 3.5: Gridworld Example
Figure 3.6: Golf Example
Figure 3.7: "Max" Backup Diagrams
Figure 3.8: Solution to Gridworld Example
Page 62: 4 x 4 Gridworld Example
Figure 4.2: Convergence Example (4 x 4 Gridworld)
Figure 4.4: Policy Sequence in Jack's Car Rental Example
Figure 4.6: Solution to the Gambler's Problem
Figure 4.7: Generalized Policy Iteration
Page 106: Coconvergence of Policy and Value
Figure 5.1: Blackjack Policy Evaluation
Figure 5.3: Backup Diagram for Monte Carlo Prediction
Page 118: Small GPI Diagram
Figure 5.5: Blackjack Solution
Figure 5.8: Two Racetracks
Figure 6.2: TD(0) Backup Diagram
Figure 6.3: Monte Carlo Driving Example
Figure 6.4: TD Driving Example
Figure 6.5: 5-State Random-Walk Process
Figure 6.6: Values Learned in a Sample Run of Walks
Figure 6.7: Learning of TD and MC Methods on Walks
Figure 6.8: Batch Performance of TD and MC Methods
Page 143: You Are the Predictor Example
Page 145: Sequence of States and Actions
Figure 6.10: Windy Gridworld
Figure 6.11: Performance of Sarsa on Windy Gridworld
Figure 6.13: Q-learning Backup Diagram
Figure 6.14: Cliff-Walking Task
Figure 6.15: The Actor-Critic Architecture
Figure 6.17: Solution to Access-Control Queuing Task
Page 156: Tic-Tac-Toe Afterstates
Figure 7.1: N-Step Backups
Figure 7.2: N-Step Results
Page 169: Mixed Backup
Figure 7.3: Backup Diagram for TD(lambda)
Figure 7.4: Weighting of Returns in the lambda-Return
Figure 7.5: The Forward View
Figure 7.6: lambda-Return Algorithm Performance
Page 173: Accumulating Traces
Figure 7.8: The Backward View
Figure 7.9: Performance of TD(lambda)
Figure 7.10: Sarsa(lambda)'s Backup Diagram
Figure 7.12: Tabular Sarsa(lambda)
Figure 7.13: Backup Diagram for Watkins's Q(lambda)
Figure 7.15: Backup Diagram for Peng's Q(lambda)
Figure 7.16: Accumulating and Replacing Traces
Figure 7.17: Error as a Function of Lambda
Figure 7.18: The Right-Action Task
Figure 8.2: Coarse Coding
Figure 8.3: Generalization via Coarse Coding
Figure 8.4: Tile Width Affects Generalization, Not Acuity
Page 206: 2D Grid Tiling
Figure 8.5: Multiple, Overlapping Grid Tilings
Figure 8.6: Tilings
Page 207: One Hash-Coded Tile
Figure 8.7: Radial Basis Functions
Figure 8.10: Mountain-Car Value Functions
Figure 8.11: Mountain-Car Results
Figure 8.12: Baird's Counterexample
Figure 8.13: Blowup of Baird's Counterexample
Figure 8.14: Tsitsiklis and Van Roy's Counterexample
Figure 8.15: Summary Effect of Lambda
Figure 9.2: Circle of Learning, Planning and Acting
Figure 9.3: The General Dyna Architecture
Figure 9.5: Dyna Results
Figure 9.6: Snapshot of Dyna Policies
Figure 9.7: Results on Blocking Task
Figure 9.8: Results on Shortcut Task
Figure 9.10: Peng and Williams Figure
Figure 9.11: Moore and Atkeson Figure
Figure 9.12: The One-Step Backups
Figure 9.13: Full vs Sample Backups
Figure 9.14: Uniform vs On-Policy Backups
Figure 9.15: Heuristic Search as One-Step Backups
Figure 10.1: The Space of Backups
Figure 11.1: A Backgammon Position
Figure 11.2: TD-Gammon Network
Figure 11.3: Backup Diagram for Samuel's Checkers Player
Figure 11.4: The Acrobot
Figure 11.6: Performance on the Acrobot Task
Figure 11.7: Learned Behavior of the Acrobot
Figure 11.8: Four Elevators
Figure 11.9: Elevator Results
Page 280: Channel Assignment Example
Figure 11.10: Performance of Channel Allocation Methods
Figure 11.11: Comparison of Schedule Repairs
Figure 11.12: Comparison of CPU Time