Reinforcement Learning and Artificial Intelligence (RLAI)

Reinforcement Learning and Function Approximation
The ambition of this web page is to serve as a well-deserved home for the Reinforcement Learning and Function Approximation reading group.
Date and Time | Topics | Notes
March 14, 2007 | General Discussion |
March 21, 2007 | Approximation theory | Generic intro
March 28, 2007 | Paper of Pinkus; a little bit of nonlinear approximation based on the paper of DeVore |
April 4, 2007 | NO READING GROUP | Csaba is in Hawaii!!
April 11, 2007 | Finish paper by Pinkus, continue with the paper of Kurkova and Sanguineti | ...Instead we had a great time discussing the prerequisites of generalization; concepts needed: training data, a performance measure (loss function). Without any assumptions, generalization is impossible. Rich's conclusion: generalization is impossible. Csaba's conclusion: a useful model necessarily involves unjustifiable assumptions. Indeed, the purpose of models is to abstract away details, hence introducing some arbitrary choices. It is all about the models, isn't it?
April 18, 2007 | Now we really finish the paper of Pinkus and hopefully the paper of Kurkova and Sanguineti! | ...And we have almost done that! We had a great time discussing many things, from the Banach-Tarski paradox to saturation theory :) Next time we will only discuss n-widths (we are done with the discussion of linear approximation processes).
April 25, 2007 | NOTE: WE MOVED TO CSC349. We will discuss n-widths and then move on to the theory of nonlinear approximation | We finished the paper by Pinkus! It was great, but took quite a long time. Now we move on to the next paper, which will be the paper by Kurkova and Sanguineti. Please see the bottom of this page for suggestions on what to concentrate on.
May 2, 2007 | Paper by Kurkova and Sanguineti; focus is on the difference between linear and nonlinear approximation | Discussion will be led by Hamid.
May 9, 2007 | Continued the paper by Kurkova and Sanguineti | Discussion still led by Hamid.
May 16, 2007 | Finish paper by Kurkova and Sanguineti. Start paper by Lafferty and Wasserman (?) |
Sept 19, 2007 | We will read papers by Schapire and Warmuth (1996) and Forster and Warmuth (2003) | We start again! These papers approach the problem from a new perspective: online learning and regret bounds! We discussed the Schapire and Warmuth (1996) paper.
Sept 26, 2007 | Relative Loss Bounds for Temporal-Difference Learning (Forster and Warmuth, 2003); Least-Squares Policy Evaluation Algorithms with Linear Function Approximation (Nedic and Bertsekas, 2003) | We review the results of Forster and Warmuth and then start reading Nedic and Bertsekas's paper. (The meeting will be in CSC249.) Nedic's paper has been postponed.
Oct 3, 2007 | An Analysis of Temporal-Difference Learning with Function Approximation (Tsitsiklis and Van Roy, 1997); On the Convergence of Temporal-Difference Learning with Linear Function Approximator (Tadic, 2001) | We review Tsitsiklis and Van Roy's paper and then compare their result with Tadic's.
Oct 10, 2007 | Least-Squares Policy Evaluation Algorithms with Linear Function Approximation (Nedic and Bertsekas, 2003) | Alborz presented the paper.
Oct 17, 2007 | Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation (Munos, 2006) |
Nov 7, 2007 | Performance Loss Bounds for Approximate Value Iteration with State Aggregation (Van Roy, 2006) | Amir-massoud presented the paper.
Nov 21, 2007 | Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decisions (Hutter, 2001) | This is not directly related to RLFA, but we will cover it as Marcus will join us next week and we want to be prepared! Csaba will lead the discussion.
Nov 27, 2007 | A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees (de Farias and Van Roy, 2006) | Csaba will lead the discussion.
[Baird95] L.
C. Baird,
"Residual Algorithms:
Reinforcement Learning with Function Approximation," Machine Learning:
Proceedings of the Twelfth International Conference, 9-12 July, Morgan
Kaufmann Publishers, San Francisco, CA, 1995. (pdf)
Abstract: A number of reinforcement learning algorithms
have
been developed that are guaranteed to converge to the
optimal solution when used with lookup tables. It is shown,
however, that these algorithms can easily become unstable when
implemented directly with a general function-approximation system, such
as a sigmoidal multilayer perceptron, a radial-basis-function system, a
memory-based learning system, or even a linear function-approximation
system. A new class of algorithms, residual gradient
algorithms,
is proposed, which perform gradient descent on the mean squared Bellman
residual, guaranteeing convergence. It is shown, however, that they may
learn very slowly in some cases. A larger class of
algorithms,
residual algorithms, is proposed that has the guaranteed convergence of
the residual gradient algorithms, yet can retain the fast learning
speed of direct algorithms. In fact, both direct and residual gradient
algorithms are shown to be special cases of residual algorithms, and it
is shown that residual algorithms can combine the advantages of each
approach. The direct, residual gradient, and residual forms
of
value iteration, Q-learning, and advantage learning are all presented.
Theoretical analysis is given explaining the properties these
algorithms have, and simulation results are given that demonstrate
these properties.
Tags:
Reading Guide:
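The three update rules compared in the abstract can be sketched for linear value estimation as follows. This is an illustrative sketch, not Baird's code; all names are ours, and the residual gradient update is written for deterministic transitions (the stochastic case needs double sampling).

```python
import numpy as np

def direct_td0_update(w, phi, phi_next, r, gamma, alpha):
    """Direct (semi-gradient) TD(0) step: the bootstrapped target is
    treated as a constant when differentiating."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * delta * phi

def residual_gradient_update(w, phi, phi_next, r, gamma, alpha):
    """True gradient descent on the squared Bellman residual
    (deterministic transitions assumed, to avoid double sampling)."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w - alpha * delta * (gamma * phi_next - phi)

def residual_update(w, phi, phi_next, r, gamma, alpha, beta):
    """Residual algorithm: beta in [0, 1] interpolates between the two;
    beta=0 recovers the direct update, beta=1 the residual gradient one."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * delta * (phi - beta * gamma * phi_next)
```

Setting beta between 0 and 1 is how the paper combines the convergence guarantee of residual gradients with the learning speed of direct methods.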
[Boyan95] J. A.
Boyan and A. W. Moore, "Generalization in Reinforcement
Learning: Safely Approximating the Value Function," In G. Tesauro,
D. S. Touretzky, and T. K. Leen (eds.), Advances in Neural
Information Processing Systems 7 (NIPS). MIT Press, 1995. (pdf)
Abstract:
Tags: Experimental
Reading Guide:
[Sutton96] R. S. Sutton,
"Generalization in Reinforcement Learning: Successful
Examples using Sparse Coarse Coding," Advances in Neural Information
Processing Systems 8 (Proceedings of the 1995 conference), pp.
1038-1044, MIT Press, 1996. (pdf)
Abstract: On large problems, reinforcement learning
systems must
use parameterized function approximators such as neural networks in
order to generalize between similar situations and actions. In these
cases, there are no strong theoretical results on the accuracy of
convergence, and computational results have been mixed. In particular,
Boyan and Moore reported at last year's meeting a series of negative
results in attempting to apply dynamic programming together with function
approximation to simple control problems with continuous state spaces.
In this paper, we present positive results for all the control tasks
they attempted, and for one that is significantly larger. The most
important differences are that we used sparse-coarse-coded function
approximators (CMACs) whereas they used mostly global function
approximators, and that we learned online whereas they learned
offline. Boyan and Moore and others have suggested that the problems
they encountered could be solved by using actual outcomes (rollouts),
as in classical Monte Carlo methods, and as in the TD(lambda) algorithm
when lambda=1. However, in our experiments this always resulted in
substantially poorer performance. We conclude that reinforcement
learning can work robustly in conjunction with function approximators,
and that there is little justification at present for avoiding the case
of general lambda.
Tags:
Reading Guide:
[Schapire96]
R. E. Schapire and M.
K. Warmuth, "On the Worst-case Analysis of Temporal Difference Learning
Algorithms," Machine Learning Journal, Vol. 22, 1996. (pdf)
Abstract: We study
the behavior of a family of learning algorithms based on Sutton's
method of temporal differences. In our online learning framework,
learning takes place in a sequence of trials, and the goal of the
learning algorithm is to estimate a discounted sum of all the
reinforcements that will be received in the future. In this setting, we
are able to prove general upper bounds on the performance of a slightly
modified version of Sutton's so-called TD(λ) algorithm. These
bounds are stated in terms of the performance of the best linear
predictor on the given training sequence, and are proved without making
any statistical assumptions of any kind about the process producing the
learner's observed training sequence. We also prove lower bounds on
the
performance of any algorithm for this learning problem, and give a
similar analysis of the closely related problem of learning to predict
in a model in which the learner must produce predictions for a whole
batch of observations before receiving reinforcement.
Tags: Policy Evaluation, Single
Trajectory, Online, Generalized Linear, Regret Bounds, No Experiment
Reading Guide:
[Tsitsiklis97] J. N.
Tsitsiklis and B. Van Roy, "An Analysis of Temporal-Difference
Learning with Function Approximation," IEEE Transactions on Automatic
Control, Vol. 42, No. 5, May 1997, pp. 674-690. (pdf) (pdf as
published)
Abstract:
Tags:
Reading Guide: The most important paper in RL with FA, in Rich's
opinion.
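The algorithm analyzed in this paper, TD(λ) with linear function approximation and eligibility traces, can be sketched as follows. A minimal single-trajectory version; the function and parameter names are ours, not from the paper.

```python
import numpy as np

def td_lambda_linear(features, rewards, gamma=0.9, lam=0.8, alpha=0.05):
    """One pass of TD(lambda) with linear function approximation over a
    single trajectory. features[t] is the feature vector of state s_t;
    rewards[t] is received on the transition s_t -> s_{t+1}, so
    len(features) == len(rewards) + 1."""
    w = np.zeros(features.shape[1])
    z = np.zeros(features.shape[1])     # eligibility trace
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * np.dot(w, phi_next) - np.dot(w, phi)
        z = gamma * lam * z + phi       # decay and accumulate the trace
        w = w + alpha * delta * z       # TD(lambda) update
    return w
```

The paper's convergence result applies to this update when the trajectory is generated on-policy by an ergodic Markov chain and the step sizes diminish appropriately.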
[de Farias00] D. P. de
Farias and B. Van Roy, "On the Existence of Fixed Points for
Approximate Value Iteration and Temporal-Difference Learning," Journal
of Optimization Theory and Applications, Vol. 105, No. 3, June, 2000. (pdf)
Abstract:
Tags:
Reading Guide:
[Grudic01] G. Z. Grudic and
L. H. Ungar, "Rates of Convergence of Performance
Gradient Estimates Using Function Approximation and Bias in
Reinforcement Learning," In Advances in Neural Information Processing
Systems, 2001. (pdf)
Abstract: We address two open theoretical questions in
Policy
Gradient Reinforcement Learning. The first concerns the efficacy of
using function approximation to represent the state action value
function, Q. Theory is presented showing that linear function
approximation representations of Q can degrade the rate of convergence
of performance gradient estimates by a factor of O(ML) relative to when
no function approximation of Q is used, where M is the number of
possible actions and L is the number of basis functions in the function
approximation representation. The second concerns the use of a bias
term in estimating the state action value function. Theory is presented
showing that a nonzero bias term can improve the rate of convergence
of performance gradient estimates by O(1 - (1/M)), where M is the number
of possible actions. Experimental evidence is presented showing that
these theoretical results lead to significant improvement in the
convergence properties of Policy Gradient Reinforcement Learning
algorithms.
[Precup01] D. Precup, R. S.
Sutton, S. Dasgupta, "Off-policy Temporal-Difference
Learning with Function Approximation," Proceedings of the 18th
International Conference on Machine Learning, 2001. (pdf)
Abstract: We introduce the first algorithm for off-policy
temporal-difference learning that is stable with linear function
approximation. Off-policy learning is of interest because it forms the
basis for popular reinforcement learning methods such as Q-learning,
which has been known to diverge with linear function approximation, and
because it is critical to the practical utility of multi-scale,
multi-goal learning frameworks such as options, HAMs, and MAXQ. Our
new algorithm combines TD(lambda) over state-action pairs with
importance sampling ideas from our previous work. We prove that, given
training under any epsilon-soft policy, the algorithm converges w.p.1
to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic,
2001) to the action-value function for an arbitrary target policy.
Variations of the algorithm designed to reduce variance introduce
additional bias but are also guaranteed convergent. We also illustrate
our method empirically on a small policy evaluation problem, showing
reduced variance compared to the most obvious importance sampling
algorithm for this problem. Our current results are limited to episodic
tasks with episodes of bounded length.
Tags:
Reading Guide:
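The importance-sampling correction at the heart of off-policy methods like this one can be illustrated with a much-simplified sketch. The paper's actual algorithm is TD(lambda) over state-action pairs with per-decision corrections; this one-step state-value version and its names are only our illustration.

```python
import numpy as np

def off_policy_td0_step(w, phi, phi_next, r, rho, gamma, alpha):
    """One importance-sampling-corrected TD(0) step. rho is the
    likelihood ratio pi(a|s) / b(a|s) of the target policy pi to the
    behaviour policy b, for the action actually taken."""
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * rho * delta * phi
```

With rho = 1 this reduces to the ordinary on-policy TD(0) update; rho = 0 (an action the target policy would never take) suppresses the update entirely.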
[Tadic01] V.
Tadic, "On the Convergence of Temporal-Difference Learning with Linear
Function Approximator," Machine Learning, Vol. 42, Issue 3, 2001 (acm portal)
Abstract:
Tags:
Reading Guide:
[Schoknecht02a]
R.
Schoknecht, "Optimality of Reinforcement Learning Algorithms with
Linear Function Approximation," In Advances in Neural Information
Processing Systems, volume 15, 2002. (pdf)
Abstract:
Tags:
Reading Guide:
[Schoknecht02b]
R.
Schoknecht and A. Merke,
"Convergent combinations of reinforcement learning with function
approximation," In Advances in Neural Information Processing Systems,
volume 15, 2002. (pdf)
Tags:
Reading Guide:
[Boyan02] J. A. Boyan,
"Technical Update: Least-Squares Temporal Difference
Learning," Machine Learning, Vol. 49, No. 2-3, 2002. (pdf)
Abstract:
Tags:
Reading Guide:
[de
Farias03] D. P. de Farias and
B. Van Roy, "The Linear Programming Approach to Approximate Dynamic
Programming," Operations Research, Vol. 51, No. 6, November-December
2003, pp. 850-865. (pdf)
Abstract:
Tags:
Reading Guide:
[Nedic03] A.
Nedic and D. P. Bertsekas, "Least-Squares Policy Evaluation Algorithms
with Linear Function Approximation," J. of Discrete Event Systems, Vol.
13, 2003, pp. 79-110. (pdf)
Abstract: We consider policy
evaluation algorithms within the context of infinite-horizon dynamic
programming problems with discounted cost. We focus on discrete-time
dynamic systems with a large number of states, and we discuss two
methods, which use simulation, temporal differences, and linear cost
function approximation. The first method is a new gradient-like
algorithm involving least-squares subproblems and a diminishing
stepsize, which is based on the λ-policy iteration method of Bertsekas
and Ioffe. The second method is the LSTD(λ) algorithm recently proposed
by Boyan, which for λ = 0 coincides with the linear least-squares
temporal-difference algorithm of Bradtke and Barto. At present, there
is only a convergence result by Bradtke and Barto for the LSTD(0)
algorithm. Here, we strengthen this result by showing the convergence
of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
Tags:
Reading Guide:
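The LSTD(0) special case mentioned in the abstract (the Bradtke-Barto algorithm, to which LSTD(λ) reduces at λ = 0) can be sketched as a batch least-squares solve. An illustrative implementation under our own names, not the authors' code; the ridge term is our addition for numerical safety.

```python
import numpy as np

def lstd0(features, rewards, gamma=0.95, reg=1e-6):
    """Batch LSTD(0): accumulate A = sum_t phi_t (phi_t - gamma phi_{t+1})^T
    and b = sum_t r_t phi_t over one trajectory, then solve A w = b.
    A small ridge term keeps A invertible on short trajectories."""
    d = features.shape[1]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        A += np.outer(phi, phi - gamma * phi_next)
        b += rewards[t] * phi
    return np.linalg.solve(A, b)
```

Unlike incremental TD, this uses each sample once to build a linear system and needs no step size, which is the sample-efficiency advantage the LSTD literature emphasizes.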
[Lagoudakis03]
Michail G. Lagoudakis and
Ronald Parr,
"Least-Squares Policy Iteration," Journal of Machine Learning Research,
4, 2003, pp. 1107-1149. (pdf)
Abstract: We
propose a new
approach to reinforcement learning for control problems which combines
value-function approximation with linear architectures and approximate
policy iteration. This new approach is motivated by the least-squares
temporal-difference learning algorithm (LSTD) for prediction problems,
which is known for its efficient use of sample experiences compared to
pure temporal-difference algorithms. Heretofore, LSTD has not had a
straightforward application to control problems mainly because LSTD
learns the state value function of a fixed policy which cannot be used
for action selection and control without a model of the underlying
process. Our new algorithm, least-squares policy iteration (LSPI),
learns the state-action value function which allows for action
selection without a model and for incremental policy improvement within
a policy-iteration framework. LSPI is a model-free, off-policy method
which can use efficiently (and reuse in each iteration) sample
experiences collected in any manner. By separating the sample
collection method, the choice of the linear approximation architecture,
and the solution method, LSPI allows for focused attention on the
distinct elements that contribute to practical reinforcement learning.
LSPI is tested on the simple task of balancing an inverted pendulum and
the harder task of balancing and riding a bicycle to a target location.
In both cases, LSPI learns to control the pendulum or the bicycle by
merely observing a relatively small number of trials where actions are
selected randomly. LSPI is also compared against Q-learning (both with
and without experience replay) using the same value function
architecture. While LSPI achieves good performance fairly consistently
on the difficult bicycle task, Qlearning variants were rarely able to
balance for more than a small fraction of the time needed to reach the
target location.
Tags:
Reading Guide:
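LSPI's structure, LSTDQ evaluation alternating with greedy improvement over the same fixed sample set, can be sketched as follows. Function names and the regularization term are our assumptions; Lagoudakis and Parr's implementation handles termination and sample handling more carefully.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma, d, reg=1e-6):
    """LSTDQ: least-squares fixed-point approximation of Q^pi from
    arbitrarily collected (s, a, r, s') samples. phi(s, a) returns a
    d-dimensional feature vector; policy(s) the evaluated policy's action."""
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, actions, d, gamma=0.95, n_iter=10):
    """LSPI: alternate LSTDQ policy evaluation with greedy policy
    improvement, reusing the same samples in every iteration."""
    w = np.zeros(d)
    for _ in range(n_iter):
        # the policy greedy w.r.t. the current Q-weights
        greedy = lambda s, w=w: max(actions, key=lambda a: np.dot(w, phi(s, a)))
        w = lstdq(samples, phi, greedy, gamma, d)
    return w
```

Because the policy is represented implicitly through the learned Q-weights, no model is needed for action selection, which is the point stressed in the abstract.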
[Parr03] R. Parr and M.
Lagoudakis, "Reinforcement Learning as Classification:
Leveraging Modern Classifiers," Proceedings of the Twentieth
International Conference on Machine Learning (ICML), 2003. (pdf)
Abstract:
Tags:
Reading Guide:
[Forster03]
J. Forster and M. K. Warmuth, "Relative Loss Bounds for
Temporal-Difference Learning," Machine Learning, Vol. 51, 2003. (pdf)
Abstract:
Tags: Policy Evaluation, Single
Trajectory, Online, Generalized Linear, Regret Bounds, No Experiment
Reading Guide:
[de Farias04]
D. P. de Farias and B. Van
Roy, "On Constraint Sampling in the Linear Programming Approach to
Approximate Dynamic Programming," Mathematics of Operations Research,
Vol. 29, No. 3, August 2004, pp. 462-478. (pdf)
Abstract: In the linear programming approach to
approximate dynamic programming, one tries to solve a certain linear program—the ALP—that has a relatively
small number K of variables but an intractable number M of constraints.
In this paper, we study a
scheme that samples and imposes a subset of m << M constraints. A natural question that arises in this context is: How must
m scale with respect to K and M in order to ensure that the resulting approximation is almost as good as one
given by exact solution of the ALP? We show that, given an idealized sampling distribution and
appropriate constraints on the K variables, m can be chosen
independently of M
and need grow only as a polynomial
in K. We interpret this result in a context involving controlled
queueing networks.
Tags:
Reading Guide:
[Szepesvári04] Cs. Szepesvári and
William D. Smart, "Interpolation-based
Q-learning," ICML 2004. (pdf)
Abstract: We consider a variant of Qlearning in continuous state
spaces under the total expected discounted cost criterion combined with
local function approximation methods. Provided that the function
approximator satisfies certain interpolation properties, the resulting
algorithm is shown to converge with probability one. The limit function
is shown to satisfy a fixed point equation of the Bellman type, where
the fixed point operator depends on the stationary distribution of the
exploration policy and approximation properties of the function
approximation method. The basic algorithm is extended in several ways.
In particular, a variant of the algorithm is obtained that is shown to
converge in probability to the optimal Q function. Preliminary computer
simulations confirm the validity of the approach.
Tags:
Valuebased Method,
Theory,
Reading Guide:
[Engel05]
Y. Engel, S. Mannor
and R. Meir, "Reinforcement Learning with Gaussian Processes," ICML
2005. (pdf)
Abstract: Gaussian Process Temporal Difference (GPTD)
learning
offers a Bayesian solution to the policy evaluation problem of
reinforcement learning. In this paper we extend the GPTD framework by
addressing two pressing issues, which were not adequately treated in
the original GPTD paper (Engel et al., 2003). The first is the issue of
stochasticity in the state transitions, and the second is concerned
with action selection and policy improvement. We present a new
generative model for the value function, deduced from its relation with
the discounted return. We derive a corresponding online algorithm for
learning the posterior moments of the
value Gaussian process. We also present a SARSA based extension of
GPTD, termed GPSARSA, that allows the selection of actions and the
gradual improvement of
policies without requiring a worldmodel.
Tags:
Reading Guide:
[Szepesvári05] Cs. Szepesvári
and R. Munos,
"Finite Time Bounds for Sampling Based Fitted Value Iteration,"
ICML'2005, pp. 881-886, 2005. (pdf) (extended version submitted to
JMLR (pdf) )
Abstract: In this paper we consider
sampling based fitted value
iteration for discounted, large (possibly infinite) state space, finite
action Markovian Decision Problems where only a generative model of the
transition probabilities and rewards is available. At each step the
image of the current estimate of the optimal value function under a
Monte-Carlo approximation to the Bellman operator is projected onto
some function space. PAC-style bounds on the weighted $L^p$-norm
approximation error are obtained as a function of the covering number
and the approximation power of the function space, the iteration number
and the sample size.
Tags:
Reading Guide:
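The procedure analyzed here, sampling-based fitted value iteration with a generative model, might be sketched as follows. The fit/predict regressor interface and all names are our assumptions, and the sketch ignores the covering-number conditions the paper's bounds require.

```python
import numpy as np

def fitted_value_iteration(states, actions, sample_step, fit, predict,
                           gamma=0.9, n_iter=30, n_mc=5):
    """Sampling-based fitted value iteration (sketch). At each iteration,
    a Monte-Carlo estimate of the Bellman optimality backup is formed at
    every sampled state using a generative model
    sample_step(s, a) -> (reward, next_state), and the result is projected
    onto a function class via the user-supplied regressor interface
    fit(states, targets) -> model, predict(model, s) -> value."""
    model = fit(states, np.zeros(len(states)))    # start from V = 0
    for _ in range(n_iter):
        targets = []
        for s in states:
            backups = []
            for a in actions:
                draws = [sample_step(s, a) for _ in range(n_mc)]
                backups.append(np.mean([r + gamma * predict(model, s2)
                                        for r, s2 in draws]))
            targets.append(max(backups))          # optimality backup
        model = fit(states, np.array(targets))    # projection step
    return model
```

The paper's error bounds are then functions of the Monte-Carlo sample size n_mc, the number of sampled states, and the approximation power of the function class behind fit.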
[Antos06]
A. Antos, Cs.
Szepesvári and R.
Munos, "Learning near-optimal policies with Bellman-residual
minimization based fitted policy iteration and a single sample
path," Proceedings of the Nineteenth Annual Conference on
Learning Theory, COLT 2006, Springer-Verlag, Berlin, Heidelberg,
LNCS/LNAI 4005, pp. 574-588, 2006. (conference version (pdf), extended version submitted to
MLJ (pdf))
Abstract: We consider the
problem of finding a
nearoptimal
policy in
continuous space, discounted Markovian Decision Problems given the
trajectory of some behaviour policy. We study the policy iteration
algorithm where in successive iterations the action-value functions of
the intermediate policies are obtained by picking a function from some
fixed function set (chosen by the user) that minimizes an unbiased
finite-sample approximation to a novel loss function that upper-bounds
the unmodified Bellman-residual criterion. The main result is a
finite-sample, high-probability bound on the performance of the
resulting policy that depends on the mixing rate of the trajectory,
the capacity of the function set as measured by a novel capacity
concept that we call the VC-crossing dimension, the approximation power
of the function set and the discounted-average concentrability of the
future state distribution. To the best of our knowledge this is the
first theoretical reinforcement learning result for off-policy control
learning over continuous state-spaces using a single trajectory.
Tags: Least-Angle Regression Stepwise (LARS),
Reading Guide:
[VanRoy06]
B. Van Roy, "Performance
Loss Bounds
for Approximate Value Iteration with State Aggregation," Mathematics
of Operations Research, Vol. 31, No. 2, pp. 234-244, 2006. (pdf)
Abstract:
We consider
approximate value iteration with a parameterized approximator in which
the state space is partitioned and the optimal cost-to-go function over
each partition is approximated by a constant. We establish performance
loss bounds for policies derived from approximations associated with
fixed points. These bounds identify benefits to using invariant
distributions of appropriate policies as projection weights. Such
projection weighting relates to what is done by temporal-difference
learning. Our analysis also leads to the first performance loss bound
for approximate value iteration with an average-cost objective.
Tags: Approximate Value
Iteration, Asymptotic convergence result
Reading Guide:
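The scheme studied in the paper, value iteration whose backup is projected onto piecewise-constant functions using per-state weights, can be sketched on a small discounted-reward MDP. A toy illustration with our own names; the paper also treats the average-cost case, which this sketch does not.

```python
import numpy as np

def aggregated_value_iteration(P, R, cells, weights, gamma=0.9, n_iter=100):
    """Approximate value iteration with state aggregation (sketch).
    P[a, s, s2] are transition probabilities, R[a, s] expected rewards,
    cells[s] the partition cell of state s, and weights[s] the per-state
    projection weight (the paper identifies invariant distributions of
    appropriate policies as a good choice of weights)."""
    n_cells = max(cells) + 1
    cells = np.asarray(cells)
    v = np.zeros(n_cells)                     # one constant per cell
    for _ in range(n_iter):
        V = v[cells]                          # lift cell values to states
        TV = (R + gamma * P @ V).max(axis=0)  # Bellman optimality backup
        for c in range(n_cells):              # weighted projection per cell
            idx = np.where(cells == c)[0]
            wsum = sum(weights[s] for s in idx)
            v[c] = sum(weights[s] * TV[s] for s in idx) / wsum
    return v
```

The choice of weights determines the fixed point, which is exactly where the paper's performance-loss bounds bite.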
[de Farias06]
D. P. de Farias and B. Van
Roy, "A Cost-Shaping Linear Program for Average-Cost Approximate
Dynamic Programming with Performance Guarantees," Mathematics of
Operations Research, Vol. 31, No. 3, pp. 597-620, 2006. (pdf)
Abstract: We introduce a new
algorithm based on linear programming for optimization of average-cost
Markov decision processes (MDPs).
The algorithm approximates the differential cost function of a
perturbed MDP via a linear combination of basis functions. We establish a bound on the
performance of the resulting policy that scales gracefully with the
number of states without
imposing the strong Lyapunov condition required by its counterpart in
de Farias and Van Roy [de Farias, D. P., B. Van Roy. 2003. The linear programming
approach to approximate dynamic programming. Oper. Res. 51(6) 850–865].
We investigate implications
of this result in the context of a queueing control problem.
Tags: Valuebased, Linear
Programming
Reading Guide:
[Loth06]
M. Loth, M. Davy, R.
Coulom and P.
Preux, "Equi-gradient Temporal Difference Learning", ICML Workshop on
Kernel machines and Reinforcement Learning, Pittsburgh, USA, June 2006.
(pdf)
Abstract:
Tags:
Reading Guide:
Tags | Options | Notes
Goal | Policy Evaluation; Control learning (offline); Control learning (online) |
Model Assumption | Explicit Model; Generative Model; Single Trajectory |
Online vs Offline | Online; Offline |
On/Off Policy | On-Policy; Off-Policy |
Class of Functions | Lookup table; Generalized Linear; Nonlinear (general) |
Theory | None; Asymptotic; Convergence Rate |
Experiment | None; Toy Problem; Real-world (Simulation OR really real-world!) |
I propose that at our next meeting you should (quickly) finish the discussion of the paper by Pinkus and move on to the discussion of the paper by Kurkova and Sanguineti.
This paper compares linear and nonlinear approximation (of which neural nets are a particular example) and has interesting lessons.
I suggest that you concentrate on Sections 1-2, Section 3.C, and Section 5.C. I know that this is a difficult paper to read; we should not get lost in the details but concentrate on the main ideas (i.e., the results and the formulation of the results).
Logistics: we need someone to volunteer to chair the meeting, so I ask everyone who feels that she or he would be able to do that to drop me an email.
Again, let me emphasize that the goal is to understand the results (look for the theorems!); the proofs at this moment are not that interesting for us. Therefore I suggest that after reading the first three sections, you look for the theorems and work your way back to the definitions that are necessary to understand the theorems.
 Csaba