Reinforcement Learning and Artificial Intelligence (RLAI)

Reinforcement Learning and Function Approximation
The ambition of this web page is to serve as a well-deserved home for the Reinforcement Learning and Function Approximation reading group.
Date and Time | Topics | Notes
March 14, 2007 | General Discussion |
March 21, 2007 | Approximation theory | Generic intro
March 28, 2007 | Paper of Pinkus; a little bit of nonlinear approximation based on the paper of DeVore |
April 4, 2007 | NO READING GROUP | Csaba is in Hawaii!!
April 11, 2007 | Finish paper by Pinkus, continue with the paper of Kurkova and Sanguineti | ...Instead we had a great time discussing the prerequisites of generalization; concepts needed: training data, a performance measure (loss function). Without any assumptions, generalization is impossible. Rich's conclusion: generalization is impossible. Csaba's conclusion: a useful model necessarily involves unjustifiable assumptions. Indeed, the purpose of models is to abstract away details, hence introducing some arbitrary choices. It is all about the models, isn't it?
April 18, 2007 | Now we really finish the paper of Pinkus and hopefully the paper of Kurkova and Sanguineti! | ...And we have almost done that! We had a great time discussing many things, from the Banach-Tarski paradox to saturation theory :) Next time we will only discuss n-widths (we are done with the discussion of linear approximation processes).
April 25, 2007 | NOTE: WE MOVED TO CSC349. We will discuss n-widths and then move on to the theory of nonlinear approximation | We finished the paper by Pinkus! It was great, but took quite a long time. Now we move on to the next paper, which will be the paper by Kurkova and Sanguineti. Please see the bottom of this page for suggestions on what to concentrate on.
May 2, 2007 | Paper by Kurkova and Sanguineti; focus is on the difference between linear and nonlinear approximation | Discussion will be led by Hamid.
May 9, 2007 | Continued the paper by Kurkova and Sanguineti | Discussion still led by Hamid.
May 16, 2007 | Finish paper by Kurkova and Sanguineti. Start paper by Lafferty and Wasserman (?) |
Sept 19, 2007 | We will read papers by Schapire and Warmuth (1996) and Forster and Warmuth (2003) | We start again! These papers approach the problem from a new perspective: online learning and regret bounds! We discussed the Schapire and Warmuth (1996) paper.
Sept 26, 2007 | Relative Loss Bounds for Temporal-Difference Learning (Forster and Warmuth, 2003); Least-Squares Policy Evaluation Algorithms with Linear Function Approximation (Nedic and Bertsekas, 2003) | We review the results of Forster and Warmuth and then start reading Nedic and Bertsekas's paper. (The meeting will be in CSC249.) Nedic's paper has been postponed.
Oct 3, 2007 | An Analysis of Temporal-Difference Learning with Function Approximation (Tsitsiklis and Van Roy, 1997); On the Convergence of Temporal-Difference Learning with Linear Function Approximator (Tadic, 2001) | We review Tsitsiklis and Van Roy's paper and then compare their result with Tadic's.
Oct 10, 2007 | Least-Squares Policy Evaluation Algorithms with Linear Function Approximation (Nedic and Bertsekas, 2003) | Alborz presented the paper.
Oct 17, 2007 | Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation (Munos, 2006) |
Nov 7, 2007 | Performance Loss Bounds for Approximate Value Iteration with State Aggregation (Van Roy, 2006) | Amir-massoud presented the paper.
Nov 21, 2007 | Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decisions (Hutter, 2001) | This is not directly related to RLFA, but we will cover it as Marcus will join us next week and we want to be prepared! Csaba will lead the discussion.
Nov 27, 2007 | A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees (de Farias and Van Roy, 2006) | Csaba will lead the discussion.
[Baird95] L.
C. Baird,
"Residual Algorithms:
Reinforcement Learning with Function Approximation," Machine Learning:
Proceedings of the Twelfth International Conference, 9-12 July, Morgan
Kaufmann Publishers, San Francisco, CA, 1995. (pdf)
Abstract: A number of reinforcement learning algorithms
have
been developed that are guaranteed to converge to the
optimal solution when used with lookup tables. It is shown,
however, that these algorithms can easily become unstable when
implemented directly with a general function-approximation system, such
as a sigmoidal multilayer perceptron, a radial-basis-function system, a
memory-based learning system, or even a linear function-approximation
system. A new class of algorithms, residual gradient
algorithms,
is proposed, which perform gradient descent on the mean squared Bellman
residual, guaranteeing convergence. It is shown, however, that they may
learn very slowly in some cases. A larger class of
algorithms,
residual algorithms, is proposed that has the guaranteed convergence of
the residual gradient algorithms, yet can retain the fast learning
speed of direct algorithms. In fact, both direct and residual gradient
algorithms are shown to be special cases of residual algorithms, and it
is shown that residual algorithms can combine the advantages of each
approach. The direct, residual gradient, and residual forms
of
value iteration, Q-learning, and advantage learning are all presented.
Theoretical analysis is given explaining the properties these
algorithms have, and simulation results are given that demonstrate
these properties.
Tags:
Reading Guide:
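The three update rules compared in the abstract can be sketched for linear value estimation as follows. This is an illustrative sketch, not Baird's code; all names are ours, and the residual gradient update is written for deterministic transitions (the stochastic case needs double sampling).

```python
import numpy as np

def direct_td0_update(w, phi, phi_next, r, gamma, alpha):
    """Direct (semi-gradient) TD(0) step: the bootstrapped target is
    treated as a constant when differentiating."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * delta * phi

def residual_gradient_update(w, phi, phi_next, r, gamma, alpha):
    """True gradient descent on the squared Bellman residual
    (deterministic transitions assumed, to avoid double sampling)."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w - alpha * delta * (gamma * phi_next - phi)

def residual_update(w, phi, phi_next, r, gamma, alpha, beta):
    """Residual algorithm: beta in [0, 1] interpolates between the two;
    beta=0 recovers the direct update, beta=1 the residual gradient one."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * delta * (phi - beta * gamma * phi_next)
```

Setting beta between 0 and 1 is how the paper combines the convergence guarantee of residual gradients with the learning speed of direct methods.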
[Boyan95] J. A.
Boyan and A. W. Moore, "Generalization in Reinforcement
Learning: Safely Approximating the Value Function," In G. Tesauro,
D. S. Touretzky, and T. K. Leen (eds.), Advances in Neural
Information Processing Systems 7 (NIPS). MIT Press, 1995. (pdf)
Abstract:
Tags: Experimental
Reading Guide:
[Sutton96] R. S. Sutton,
"Generalization in Reinforcement Learning: Successful
Examples using Sparse Coarse Coding," Advances in Neural Information
Processing Systems 8 (Proceedings of the 1995 conference), pp.
1038-1044, MIT Press, 1996. (pdf)
Abstract: On large problems, reinforcement learning
systems must
use parameterized function approximators such as neural networks in
order to generalize between similar situations and actions. In these
cases, there are no strong theoretical results on the accuracy of
convergence, and computational results have been mixed. In particular,
Boyan and Moore reported at last year's meeting a series of negative
results in attempting to apply dynamic programming together with function
approximation to simple control problems with continuous state spaces.
In this paper, we present positive results for all the control tasks
they attempted, and for one that is significantly larger. The most
important differences are that we used sparse-coarse-coded function
approximators (CMACs) whereas they used mostly global function
approximators, and that we learned online whereas they learned
offline. Boyan and Moore and others have suggested that the problems
they encountered could be solved by using actual outcomes (rollouts),
as in classical Monte Carlo methods, and as in the TD(lambda) algorithm
when lambda=1. However, in our experiments this always resulted in
substantially poorer performance. We conclude that reinforcement
learning can work robustly in conjunction with function approximators,
and that there is little justification at present for avoiding the case
of general lambda.
Tags:
Reading Guide:
[Schapire96]
R. E. Schapire and M.
K. Warmuth, "On the Worst-case Analysis of Temporal Difference Learning
Algorithms," Machine Learning Journal, Vol. 22, 1996. (pdf)
Abstract: We study
the behavior of a family of learning algorithms based on Sutton's
method of temporal differences. In our online learning framework,
learning takes place in a sequence of trials, and the goal of the
learning algorithm is to estimate a discounted sum of all the
reinforcements that will be received in the future. In this setting, we
are able to prove general upper bounds on the performance of a slightly
modified version of Sutton's so-called TD(λ) algorithm. These
bounds are stated in terms of the performance of the best linear
predictor on the given training sequence, and are proved without making
any statistical assumptions of any kind about the process producing the
learner's observed training sequence. We also prove lower bounds on
the
performance of any algorithm for this learning problem, and give a
similar analysis of the closely related problem of learning to predict
in a model in which the learner must produce predictions for a whole
batch of observations before receiving reinforcement.
Tags: Policy Evaluation, Single
Trajectory, Online, Generalized Linear, Regret Bounds, No Experiment
Reading Guide:
[Tsitsiklis97] J. N.
Tsitsiklis and B. Van Roy, "An Analysis of Temporal-Difference
Learning with Function Approximation," IEEE Transactions on Automatic
Control, Vol. 42, No. 5, May 1997, pp. 674-690. (pdf) (pdf as
published)
Abstract:
Tags:
Reading Guide: The most important paper in RL with FA, in Rich's
opinion.
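The algorithm analyzed in this paper, TD(λ) with linear function approximation and eligibility traces, can be sketched as follows. A minimal single-trajectory version; the function and parameter names are ours, not from the paper.

```python
import numpy as np

def td_lambda_linear(features, rewards, gamma=0.9, lam=0.8, alpha=0.05):
    """One pass of TD(lambda) with linear function approximation over a
    single trajectory. features[t] is the feature vector of state s_t;
    rewards[t] is received on the transition s_t -> s_{t+1}, so
    len(features) == len(rewards) + 1."""
    w = np.zeros(features.shape[1])
    z = np.zeros(features.shape[1])     # eligibility trace
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * np.dot(w, phi_next) - np.dot(w, phi)
        z = gamma * lam * z + phi       # decay and accumulate the trace
        w = w + alpha * delta * z       # TD(lambda) update
    return w
```

The paper's convergence result applies to this update when the trajectory is generated on-policy by an ergodic Markov chain and the step sizes diminish appropriately.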
[de Farias00] D. P. de
Farias and B. Van Roy, "On the Existence of Fixed Points for
Approximate Value Iteration and Temporal-Difference Learning," Journal
of Optimization Theory and Applications, Vol. 105, No. 3, June, 2000. (pdf)
Abstract:
Tags:
Reading Guide:
[Grudic01] G. Z. Grudic and
L. H. Ungar, "Rates of Convergence of Performance
Gradient Estimates Using Function Approximation and Bias in
Reinforcement Learning," In Advances in Neural Information Processing
Systems, 2001. (pdf)
Abstract: We address two open theoretical questions in
Policy
Gradient Reinforcement Learning. The first concerns the efficacy of
using function approximation to represent the state action value
function, Q. Theory is presented showing that linear function
approximation representations of Q can degrade the rate of convergence
of performance gradient estimates by a factor of O(ML) relative to when
no function approximation of Q is used, where M is the number of
possible actions and L is the number of basis functions in the function
approximation representation. The second concerns the use of a bias
term in estimating the state action value function. Theory is presented
showing that a nonzero bias term can improve the rate of convergence
of performance gradient estimates by O(1 - (1/M)), where M is the number
of possible actions. Experimental evidence is presented showing that
these theoretical results lead to significant improvement in the
convergence properties of Policy Gradient Reinforcement Learning
algorithms.
[Precup01] D. Precup, R. S.
Sutton, S. Dasgupta, "Off-policy Temporal-Difference
Learning with Function Approximation," Proceedings of the 18th
International Conference on Machine Learning, 2001. (pdf)
Abstract: We introduce the first algorithm for off-policy
temporal-difference learning that is stable with linear function
approximation. Off-policy learning is of interest because it forms the
basis for popular reinforcement learning methods such as Q-learning,
which has been known to diverge with linear function approximation, and
because it is critical to the practical utility of multi-scale,
multi-goal learning frameworks such as options, HAMs, and MAXQ. Our
new algorithm combines TD(lambda) over state-action pairs with
importance sampling ideas from our previous work. We prove that, given
training under any epsilon-soft policy, the algorithm converges w.p.1
to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic,
2001) to the action-value function for an arbitrary target policy.
Variations of the algorithm designed to reduce variance introduce
additional bias but are also guaranteed convergent. We also illustrate
our method empirically on a small policy evaluation problem, showing
reduced variance compared to the most obvious importance sampling
algorithm for this problem. Our current results are limited to episodic
tasks with episodes of bounded length.
Tags:
Reading Guide:
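The importance-sampling correction at the heart of off-policy methods like this one can be illustrated with a much-simplified sketch. The paper's actual algorithm is TD(lambda) over state-action pairs with per-decision corrections; this one-step state-value version and its names are only our illustration.

```python
import numpy as np

def off_policy_td0_step(w, phi, phi_next, r, rho, gamma, alpha):
    """One importance-sampling-corrected TD(0) step. rho is the
    likelihood ratio pi(a|s) / b(a|s) of the target policy pi to the
    behaviour policy b, for the action actually taken."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * rho * delta * phi
```

With rho = 1 this reduces to the ordinary on-policy TD(0) update; rho = 0 (an action the target policy would never take) suppresses the update entirely.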
[Tadic01] V.
Tadic, "On the Convergence of Temporal-Difference Learning with Linear
Function Approximator," Machine Learning, Vol. 42, Issue 3, 2001 (acm portal)
Abstract:
Tags:
Reading Guide:
[Schoknecht02a]
R.
Schoknecht, "Optimality of Reinforcement Learning Algorithms with
Linear Function Approximation," In Advances in Neural Information
Processing Systems, volume 15, 2002. (pdf)
Abstract:
Tags:
Reading Guide:
[Schoknecht02b]
R.
Schoknecht and A. Merke,
"Convergent combinations of reinforcement learning with function
approximation," In Advances in Neural Information Processing Systems,
volume 15, 2002. (pdf)
Tags:
Reading Guide:
[Boyan02] J. A. Boyan,
"Technical Update: Least-Squares Temporal Difference
Learning," Machine Learning, Vol. 49, No. 2-3, 2002. (pdf)
Abstract:
Tags:
Reading Guide:
[de
Farias03] D. P. de Farias and
B. Van Roy, "The Linear Programming Approach to Approximate Dynamic
Programming," Operations Research, Vol. 51, No. 6, November-December
2003, pp. 850-865. (pdf)
Abstract:
Tags:
Reading Guide:
[Nedic03] A.
Nedic and D. P. Bertsekas, "Least-Squares Policy Evaluation Algorithms
with Linear Function Approximation," J. of Discrete Event Systems, Vol.
13, 2003, pp. 79-110. (pdf)
Abstract: We consider policy
evaluation algorithms within the context of infinite-horizon dynamic
programming problems with discounted cost. We focus on discrete-time
dynamic systems with a large number of states, and we discuss two
methods, which use simulation, temporal differences, and linear cost
function approximation. The first method is a new gradient-like
algorithm involving least-squares subproblems and a diminishing
stepsize, which is based on the λ-policy iteration method of Bertsekas
and Ioffe. The second method is the LSTD(λ) algorithm recently proposed
by Boyan, which for λ = 0 coincides with the linear least-squares
temporal-difference algorithm of Bradtke and Barto. At present, there
is only a convergence result by Bradtke and Barto for the LSTD(0)
algorithm. Here, we strengthen this result by showing the convergence
of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
Tags:
Reading Guide:
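The LSTD(0) special case mentioned in the abstract (the Bradtke-Barto algorithm, to which LSTD(λ) reduces at λ = 0) can be sketched as a batch least-squares solve. An illustrative implementation under our own names, not the authors' code; the ridge term is our addition for numerical safety.

```python
import numpy as np

def lstd0(features, rewards, gamma=0.95, reg=1e-6):
    """Batch LSTD(0): accumulate A = sum_t phi_t (phi_t - gamma phi_{t+1})^T
    and b = sum_t r_t phi_t over one trajectory, then solve A w = b.
    A small ridge term keeps A invertible on short trajectories."""
    d = features.shape[1]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        A += np.outer(phi, phi - gamma * phi_next)
        b += rewards[t] * phi
    return np.linalg.solve(A, b)
```

Unlike incremental TD, this uses each sample once to build a linear system and needs no step size, which is the sample-efficiency advantage the LSTD literature emphasizes.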
[Lagoudakis03]
Michail G. Lagoudakis and
Ronald Parr,
"Least-Squares Policy Iteration," Journal of Machine Learning Research,
4, 2003, pp. 1107-1149. (pdf)
Abstract: We
propose a new
approach to reinforcement learning for control problems which combines
value-function approximation with linear architectures and approximate
policy iteration. This new approach is motivated by the least-squares
temporal-difference learning algorithm (LSTD) for prediction problems,
which is known for its efficient use of sample experiences compared to
pure temporal-difference algorithms. Heretofore, LSTD has not had a
straightforward application to control problems mainly because LSTD
learns the state value function of a fixed policy which cannot be used
for action selection and control without a model of the underlying
process. Our new algorithm, least-squares policy iteration (LSPI),
learns the state-action value function which allows for action
selection without a model and for incremental policy improvement within
a policy-iteration framework. LSPI is a model-free, off-policy method
which can use efficiently (and reuse in each iteration) sample
experiences collected in any manner. By separating the sample
collection method, the choice of the linear approximation architecture,
and the solution method, LSPI allows for focused attention on the
distinct elements that contribute to practical reinforcement learning.
LSPI is tested on the simple task of balancing an inverted pendulum and
the harder task of balancing and riding a bicycle to a target location.
In both cases, LSPI learns to control the pendulum or the bicycle by
merely observing a relatively small number of trials where actions are
selected randomly. LSPI is also compared against Q-learning (both with
and without experience replay) using the same value function
architecture. While LSPI achieves good performance fairly consistently
on the difficult bicycle task, Qlearning variants were rarely able to
balance for more than a small fraction of the time needed to reach the
target location.
Tags:
Reading Guide:
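LSPI's structure, LSTDQ evaluation alternating with greedy improvement over the same fixed sample set, can be sketched as follows. Function names and the regularization term are our assumptions; Lagoudakis and Parr's implementation handles termination and sample handling more carefully.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma, d, reg=1e-6):
    """LSTDQ: least-squares fixed-point approximation of Q^pi from
    arbitrarily collected (s, a, r, s') samples. phi(s, a) returns a
    d-dimensional feature vector; policy(s) the evaluated policy's action."""
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, actions, d, gamma=0.95, n_iter=10):
    """LSPI: alternate LSTDQ policy evaluation with greedy policy
    improvement, reusing the same samples in every iteration."""
    w = np.zeros(d)
    for _ in range(n_iter):
        # the policy greedy w.r.t. the current Q-weights
        greedy = lambda s, w=w: max(actions, key=lambda a: np.dot(w, phi(s, a)))
        w = lstdq(samples, phi, greedy, gamma, d)
    return w
```

Because the policy is represented implicitly through the learned Q-weights, no model is needed for action selection, which is the point stressed in the abstract.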
[Parr03] R. Parr and M.
Lagoudakis, "Reinforcement Learning as Classification:
Leveraging Modern Classifiers," Proceedings of the Twentieth
International Conference on Machine Learning (ICML), 2003. (pdf)
Abstract:
Tags:
Reading Guide:
[Forster03]
J. Forster and M. K. Warmuth, "Relative Loss Bounds for
Temporal-Difference Learning," Machine Learning, Vol. 51, 2003. (pdf)
Abstract:
Tags: Policy Evaluation, Single
Trajectory, Online, Generalized Linear, Regret Bounds, No Experiment
Reading Guide:
[de Farias04]
D. P. de Farias and B. Van
Roy, "On Constraint Sampling in the Linear Programming Approach to
Approximate Dynamic Programming," Mathematics of Operations Research,
Vol. 29, No. 3, August 2004, pp. 462-478. (pdf)
Abstract: In the linear programming approach to
approximate dynamic programming, one tries to solve a certain linear program—the ALP—that has a relatively
small number K of variables but an intractable number M of constraints.
In this paper, we study a
scheme that samples and imposes a subset of m << M constraints. A natural question that arises in this context is: How must
m scale with respect to K and M in order to ensure that the resulting approximation is almost as good as one
given by exact solution of the ALP? We show that, given an idealized sampling distribution and
appropriate constraints on the K variables, m can be chosen
independently of M
and need grow only as a polynomial
in K. We interpret this result in a context involving controlled
queueing networks.
Tags:
Reading Guide:
[Szepesvári04] Cs. Szepesvári and
William D. Smart, "Interpolation-based
Q-learning," ICML 2004. (pdf)
Abstract: We consider a variant of Qlearning in continuous state
spaces under the total expected discounted cost criterion combined with
local function approximation methods. Provided that the function
approximator satisfies certain interpolation properties, the resulting
algorithm is shown to converge with probability one. The limit function
is shown to satisfy a fixed point equation of the Bellman type, where
the fixed point operator depends on the stationary distribution of the
exploration policy and approximation properties of the function
approximation method. The basic algorithm is extended in several ways.
In particular, a variant of the algorithm is obtained that is shown to
converge in probability to the optimal Q function. Preliminary computer
simulations confirm the validity of the approach.
Tags:
Valuebased Method,
Theory,
Reading Guide:
[Engel05]
Y. Engel, S. Mannor
and R. Meir, "Reinforcement Learning with Gaussian Processes," ICML
2005. (pdf)
Abstract: Gaussian Process Temporal Difference (GPTD)
learning
offers a Bayesian solution to the policy evaluation problem of
reinforcement learning. In this paper we extend the GPTD framework by
addressing two pressing issues, which were not adequately treated in
the original GPTD paper (Engel et al., 2003). The first is the issue of
stochasticity in the state transitions, and the second is concerned
with action selection and policy improvement. We present a new
generative model for the value function, deduced from its relation with
the discounted return. We derive a corresponding online algorithm for
learning the posterior moments of the
value Gaussian process. We also present a SARSA based extension of
GPTD, termed GPSARSA, that allows the selection of actions and the
gradual improvement of
policies without requiring a worldmodel.
Tags:
Reading Guide:
[Szepesvári05] Cs. Szepesvári
and R. Munos,
"Finite Time Bounds for Sampling Based Fitted Value Iteration,"
ICML'2005, pp. 881-886, 2005. (pdf) (extended version submitted to
JMLR (pdf) )
Abstract: In this paper we consider
sampling based fitted value
iteration for discounted, large (possibly infinite) state space, finite
action Markovian Decision Problems where only a generative model of the
transition probabilities and rewards is available. At each step the
image of the current estimate of the optimal value function under a
Monte-Carlo approximation to the Bellman operator is projected onto
some function space. PAC-style bounds on the weighted $L^p$-norm
approximation error are obtained as a function of the covering number
and the approximation power of the function space, the iteration number
and the sample size.
Tags:
Reading Guide:
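The procedure analyzed here, sampling-based fitted value iteration with a generative model, might be sketched as follows. The fit/predict regressor interface and all names are our assumptions, and the sketch ignores the covering-number conditions the paper's bounds require.

```python
import numpy as np

def fitted_value_iteration(states, actions, sample_step, fit, predict,
                           gamma=0.9, n_iter=30, n_mc=5):
    """Sampling-based fitted value iteration (sketch). At each iteration,
    a Monte-Carlo estimate of the Bellman optimality backup is formed at
    every sampled state using a generative model
    sample_step(s, a) -> (reward, next_state), and the result is projected
    onto a function class via the user-supplied regressor interface
    fit(states, targets) -> model, predict(model, s) -> value."""
    model = fit(states, np.zeros(len(states)))    # start from V = 0
    for _ in range(n_iter):
        targets = []
        for s in states:
            backups = []
            for a in actions:
                draws = [sample_step(s, a) for _ in range(n_mc)]
                backups.append(np.mean([r + gamma * predict(model, s2)
                                        for r, s2 in draws]))
            targets.append(max(backups))          # optimality backup
        model = fit(states, np.array(targets))    # projection step
    return model
```

The paper's error bounds are then functions of the Monte-Carlo sample size n_mc, the number of sampled states, and the approximation power of the function class behind fit.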
[Antos06]
A. Antos, Cs.
Szepesvári and R.
Munos, "Learning near-optimal policies with Bellman-residual
minimization based fitted policy iteration and a single sample
path," Proceedings of the Nineteenth Annual Conference on
Learning Theory, COLT 2006, Springer-Verlag, Berlin, Heidelberg,
LNCS/LNAI 4005, pp. 574-588, 2006. (conference version (pdf), extended version submitted to
MLJ (pdf))
Abstract: We consider the
problem of finding a
nearoptimal
policy in
continuous space, discounted Markovian Decision Problems given the
trajectory of some behaviour policy. We study the policy iteration
algorithm where in successive iterations the action-value functions of
the intermediate policies are obtained by picking a function from some
fixed function set (chosen by the user) that minimizes an unbiased
finite-sample approximation to a novel loss function that upper-bounds
the unmodified Bellman-residual criterion. The main result is a
finite-sample, high-probability bound on the performance of the
resulting policy that depends on the mixing rate of the trajectory,
the capacity of the function set as measured by a novel capacity
concept that we call the VC-crossing dimension, the approximation power
of the function set and the discounted-average concentrability of the
future state distribution. To the best of our knowledge this is the
first theoretical reinforcement learning result for off-policy control
learning over continuous state-spaces using a single trajectory.
Tags: Least-Angle Regression Stepwise (LARS),
Reading Guide:
[VanRoy06]
B. Van Roy, "Performance
Loss Bounds
for Approximate Value Iteration with State Aggregation," Mathematics
of Operations Research, Vol. 31, No. 2, pp. 234-244, 2006. (pdf)
Abstract:
We consider
approximate value iteration with a parameterized approximator in which
the state space is partitioned and the optimal cost-to-go function over
each partition is approximated by a constant. We establish performance
loss bounds for policies derived from approximations associated with
fixed points. These bounds identify benefits to using invariant
distributions of appropriate policies as projection weights. Such
projection weighting relates to what is done by temporal-difference
learning. Our analysis also leads to the first performance loss bound
for approximate value iteration with an average-cost objective.
Tags: Approximate Value
Iteration, Asymptotic convergence result
Reading Guide:
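The scheme studied in the paper, value iteration whose backup is projected onto piecewise-constant functions using per-state weights, can be sketched on a small discounted-reward MDP. A toy illustration with our own names; the paper also treats the average-cost case, which this sketch does not.

```python
import numpy as np

def aggregated_value_iteration(P, R, cells, weights, gamma=0.9, n_iter=100):
    """Approximate value iteration with state aggregation (sketch).
    P[a, s, s2] are transition probabilities, R[a, s] expected rewards,
    cells[s] the partition cell of state s, and weights[s] the per-state
    projection weight (the paper identifies invariant distributions of
    appropriate policies as a good choice of weights)."""
    n_cells = max(cells) + 1
    cells = np.asarray(cells)
    v = np.zeros(n_cells)                     # one constant per cell
    for _ in range(n_iter):
        V = v[cells]                          # lift cell values to states
        TV = (R + gamma * P @ V).max(axis=0)  # Bellman optimality backup
        for c in range(n_cells):              # weighted projection per cell
            idx = np.where(cells == c)[0]
            wsum = sum(weights[s] for s in idx)
            v[c] = sum(weights[s] * TV[s] for s in idx) / wsum
    return v
```

The choice of weights determines the fixed point, which is exactly where the paper's performance-loss bounds bite.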
[de Farias06]
D. P. de Farias and B. Van
Roy, "A Cost-Shaping Linear Program for Average-Cost Approximate
Dynamic Programming with Performance Guarantees," Mathematics of
Operations Research, Vol. 31, No. 3, pp. 597-620, 2006. (pdf)
Abstract: We introduce a new
algorithm based on linear programming for optimization of average-cost
Markov decision processes (MDPs).
The algorithm approximates the differential cost function of a
perturbed MDP via a linear combination of basis functions. We establish a bound on the
performance of the resulting policy that scales gracefully with the
number of states without
imposing the strong Lyapunov condition required by its counterpart in
de Farias and Van Roy [de Farias, D. P., B. Van Roy. 2003. The linear programming
approach to approximate dynamic programming. Oper. Res. 51(6) 850–865].
We investigate implications
of this result in the context of a queueing control problem.
Tags: Valuebased, Linear
Programming
Reading Guide:
[Loth06]
M. Loth, M. Davy, R.
Coulom and P.
Preux, "Equi-gradient Temporal Difference Learning", ICML Workshop on
Kernel machines and Reinforcement Learning, Pittsburgh, USA, June 2006.
(pdf)
Abstract:
Tags:
Reading Guide:
Tags | Options | Notes
Goal | Policy Evaluation; Control learning (offline); Control learning (online) |
Model Assumption | Explicit Model; Generative Model; Single Trajectory |
Online vs Offline | Online; Offline |
On/Off Policy | On-Policy; Off-Policy |
Class of Functions | Lookup table; Generalized Linear; Nonlinear (general) |
Theory | None; Asymptotic; Convergence Rate |
Experiment | None; Toy Problem; Real-world (Simulation OR really real-world!) |
I propose that at our next meeting you should (quickly) finish the discussion of the paper by Pinkus and move on to the discussion of the paper by Kurkova and Sanguineti.
This paper compares linear and nonlinear approximation (of which neural nets are a particular example) and has interesting lessons.
I suggest that you concentrate on Sections 1-2, Section 3.C, and Section 5.C. I know that this is a difficult paper to read; we should not get lost in the details but concentrate on the main ideas (i.e., the results and the formulation of the results).
Logistics: we need someone to volunteer to chair the meeting, so I ask everyone who feels that she or he would be able to do that to drop me an email.
Again, let me emphasize that the goal is to understand the results (look for the theorems!); the proofs at this moment are not that interesting for us. Therefore I suggest that after reading the first three sections, you look for the theorems and work your way back to the definitions that are necessary to understand the theorems.
 Csaba