Reinforcement Learning and Artificial Intelligence (RLAI)
Reinforcement Learning and Function Approximation
Edited by Amir massoud Farahmand and Csaba Szepesvári

The ambition of this web page is to serve as a well-deserved home for the Reinforcement Learning and Function Approximation reading group.

# Introduction

Although current Reinforcement Learning (RL) methods have shown good successes in several application domains, we still need to address a number of important theoretical issues before we have a fully competent and versatile learning method. One important challenge is dealing with problems with large or continuous state/action spaces. In such problems the look-up table approach is no longer feasible, since there is no way to enumerate all the values (there are infinitely many of them). Function approximation (FA) techniques should come to the rescue, but our understanding of the interplay between function approximation and RL methods is limited in a number of important ways. In the early stages of RL research, many examples and counter-examples showed the feasibility or infeasibility of using FA in RL. Gradually, the theory started to develop, and the conditions under which FA can be used in an RL setting were revealed. However, the problem has not been solved completely, and researchers still need to work on it to create methods that work more efficiently for a larger class of problems.
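To make the look-up table versus FA distinction concrete, here is a minimal sketch of a TD(0) update in both forms. This is our own illustration, not taken from any particular paper; all names (`td0_tabular`, `td0_linear`, `alpha`, `gamma`) are ours.

```python
import numpy as np

def td0_tabular(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update: V[s] <- V[s] + alpha * delta."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return V

def td0_linear(w, phi_s, r, phi_s_next, alpha=0.1, gamma=0.9):
    """One linear-FA TD(0) update: w <- w + alpha * delta * phi(s)."""
    delta = r + gamma * phi_s_next @ w - phi_s @ w
    return w + alpha * delta * phi_s

# The look-up table is the special case where phi(s) is a one-hot indicator:
w = td0_linear(np.zeros(3), np.eye(3)[0], 1.0, np.eye(3)[1])  # touches w[0] only
```

With one-hot features the linear update reproduces the tabular one exactly; with fewer features than states it is forced to generalize, which is where the difficulties discussed below begin.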

In the Reinforcement Learning and Function Approximation (RLFA) reading group we will discuss current challenges and approaches, taking a critical standpoint. We discuss theoretical papers alongside examining successful applications. Questions to investigate: Do we need nonlinear architectures, or nonlinear learning methods? Do we need real-valued features? Do we need to construct the function approximation method based on data (e.g., extract basis functions from the data)? What can be learned by looking at the regression, function approximation, control, operations research, financial mathematics, etc. literature? What are the successes there that are relevant to our goals? Can we recycle successful methods and ideas (lasso, sparsity, automatic feature relevance determination)? Should we change our modelling assumptions, or is it OK to work with the good old MDP models?

You are invited (and strongly encouraged) to suggest new papers and complete missing parts by simply editing this page (you may like to take a look at the How To ... guide before doing so). Interested folks are invited to subscribe to the page to get news about the meetings, discussions, etc.

# Schedule

Where? CSC-249
When? Wed 3:00-4:00 PM.
Exceptions are posted separately.

# Paper Bank

The paper bank now lists the relevant papers (with abstracts), but the goal is to put them into context by adding a short summary (use your own words). Tags should give viewers a good idea about the essential characteristics of the paper. The reading guide should help the reader to concentrate on the most interesting aspects of the paper.

## Approximation Theory and Regression

Allan Pinkus: Negative Theorems in Approximation Theory, American Math. Monthly 110 (2003), pp. 900-911.

Ronald DeVore: Nonlinear approximation, Acta Numerica (1998) 7 pp. 51-150.

Věra Kůrková and Marcello Sanguineti: Comparison of worst case errors in linear and neural network approximation, IEEE Trans. on Information Theory, 48 (2002), pp. 264-275. (if you cannot download the paper, send Csaba an e-mail)

Ronald DeVore, Gerard Kerkyacharian, Dominique Picard and Vladimir Temlyakov: On Mathematical Methods for Supervised Learning, J. FoCM 6 (2006), pp. 3-58.

Emmanuel J. Candès: Modern statistical estimation via oracle inequalities, Acta Numerica (2006) 15, pp. 257-325.

John Lafferty and Larry Wasserman: Challenges in Statistical Machine Learning, Statistica Sinica (2006) 16, pp. 307-323.

Holger Wendland, Robert Schaback: Kernel techniques: From machine learning to meshless methods, Acta Numerica 15 (2006), pp. 543-639.

László Györfi, Michael Kohler, Adam Krzyzak, Harro Walk: "A Distribution-Free Theory of Nonparametric Regression", Springer-Verlag, New York (2002). Csaba has this book, if you are interested. This is "the book" on non-parametric regression.

Some more people working in regression or something closely related (in no particular order): Alexandre Tsybakov, Luc Devroye, Iain Johnstone, David L. Donoho, ... (feel free to add more!)

## Value-based Approaches

[Baird95] L. C. Baird, "Residual Algorithms: Reinforcement Learning with Function Approximation," Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA, 1995. (pdf)
Abstract:
A number of reinforcement learning algorithms have been developed that are guaranteed to converge to the optimal solution when used with lookup tables. It is shown, however, that these algorithms can easily become unstable when implemented directly with a general function-approximation system, such as a sigmoidal multilayer perceptron, a radial-basis-function system, a memory-based learning system, or even a linear function-approximation system. A new class of algorithms, residual gradient algorithms, is proposed, which perform gradient descent on the mean squared Bellman residual, guaranteeing convergence. It is shown, however, that they may learn very slowly in some cases. A larger class of algorithms, residual algorithms, is proposed that has the guaranteed convergence of the residual gradient algorithms, yet can retain the fast learning speed of direct algorithms. In fact, both direct and residual gradient algorithms are shown to be special cases of residual algorithms, and it is shown that residual algorithms can combine the advantages of each approach. The direct, residual gradient, and residual forms of value iteration, Q-learning, and advantage learning are all presented. Theoretical analysis is given explaining the properties these algorithms have, and simulation results are given that demonstrate these properties.
Tags:
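For readers skimming the abstract, the update directions can be sketched for the linear case as follows. This is our own simplified single-transition rendering, with `beta` interpolating between the direct update (`beta = 0`) and the residual-gradient update (`beta = 1`), in the spirit of the paper's residual family.

```python
import numpy as np

def residual_update(w, phi, r, phi_next, alpha=0.1, gamma=0.9, beta=1.0):
    """One step of the residual family for a linear value function."""
    delta = r + gamma * phi_next @ w - phi @ w   # Bellman residual
    # beta = 0: direct (TD-style) update, follows delta * phi only.
    # beta = 1: residual gradient, i.e. gradient descent on delta^2 / 2
    # for a deterministic transition.
    return w + alpha * delta * (phi - beta * gamma * phi_next)
```

On a deterministic transition, the `beta = 1` update provably shrinks the squared residual for a small enough step size, which is the source of the convergence guarantee mentioned in the abstract.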

[Boyan95] J. A. Boyan and A. W. Moore, "Generalization in Reinforcement Learning: Safely Approximating the Value Function," In G. Tesauro,  D. S. Touretzky, and T. K. Leen (eds.), Advances in Neural Information Processing Systems 7 (NIPS). MIT Press, 1995. (pdf)
Abstract:
Tags:
Experimental

[Sutton96] R. S. Sutton, "Generalization in Reinforcement Learning: Successful Examples using Sparse Coarse Coding," Advances in Neural Information Processing Systems 8 (Proceedings of the 1995 conference), pp. 1038-1044, MIT Press, 1996. (pdf)
Abstract:
On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases, there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (rollouts), as in classical Monte Carlo methods, and as in the TD(lambda) algorithm when lambda=1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general lambda.
Tags:
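A minimal one-dimensional sketch of the sparse coarse coding (CMAC/tile coding) idea the paper relies on: several overlapping tilings, each contributing exactly one active binary feature. The tiling layout and all names here are our own illustration, not Sutton's exact setup.

```python
import numpy as np

def tile_features(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Binary feature vector for scalar x in [lo, hi]: one active tile
    per tiling, with tilings offset against each other."""
    feats = np.zeros(n_tilings * n_tiles)
    width = (hi - lo) / (n_tiles - 1)        # leave room for the offsets
    for t in range(n_tilings):
        offset = t * width / n_tilings       # stagger successive tilings
        idx = min(int((x - lo + offset) / width), n_tiles - 1)
        feats[t * n_tiles + idx] = 1.0
    return feats
```

Because nearby inputs activate mostly the same tiles, a linear learner on these features generalizes locally, which is the property the abstract credits for the positive results.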

[Schapire96] R. E. Schapire and M. K. Warmuth, "On the Worst-case Analysis of Temporal Difference Learning Algorithms," Machine Learning Journal, Vol. 22, 1996. (pdf) Abstract: We study the behavior of a family of learning algorithms based on Sutton's method of temporal differences. In our on-line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will be received in the future. In this setting, we are able to prove general upper bounds on the performance of a slightly modified version of Sutton's so-called TD(\lambda) algorithm. These bounds are stated in terms of the performance of the best linear predictor on the given training sequence, and are proved without making any statistical assumptions of any kind about the process producing the learner's observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem, and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement.
Tags: Policy Evaluation, Single Trajectory, Online, Generalized Linear, Regret Bounds, No Experiment

[Tsitsiklis97] J. N. Tsitsiklis and B. Van Roy, "An Analysis of Temporal-Difference Learning with Function Approximation,'' IEEE Transactions on Automatic Control, Vol. 42, No. 5, May 1997, pp. 674-690. (pdf) (pdf as published)
Abstract:
Tags:
Reading Guide: The most important paper in RL with FA, in Rich's opinion.

[de Farias00] D. P. de Farias and B. Van Roy, "On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning,'' Journal of Optimization Theory and Applications, Vol. 105, No. 3, June, 2000. (pdf)
Abstract:
Tags:

[Grudic01] G. Z. Grudic and L. H. Ungar, "Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning," In Advances in Neural Information Processing Systems, 2001. (pdf)
Abstract:
We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, Q. Theory is presented showing that linear function approximation representations of Q can degrade the rate of convergence of performance gradient estimates by a factor of O(ML) relative to when no function approximation of Q is used, where M is the number of possible actions and L is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by O(1-(1/M)), where M is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

[Precup01] D. Precup, R. S. Sutton, S. Dasgupta, "Off-policy Temporal-Difference Learning with Function Approximation," Proceedings of the 18th International Conference on Machine Learning, 2001. (pdf)
Abstract:
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal, learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(lambda) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any epsilon-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.
Tags:
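The core importance-sampling correction in the abstract can be sketched as a one-step update. This is our own simplified per-decision TD(0) form, not the TD(lambda) over state-action pairs analysed in the paper; `pi` and `b` hold the target and behaviour policies' action probabilities.

```python
import numpy as np

def off_policy_td0(w, phi_s, a, r, phi_next, pi, b, alpha=0.1, gamma=0.9):
    """TD(0) with a per-decision importance-sampling correction."""
    rho = pi[a] / b[a]                    # importance-sampling ratio
    delta = r + gamma * phi_next @ w - phi_s @ w
    return w + alpha * rho * delta * phi_s
```

When the behaviour policy takes `a` half as often as the target policy would, `rho = 2` doubles the step, re-weighting the sample toward the target policy's distribution; the paper's contribution is showing how to do this stably with eligibility traces.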

[Tadic01] V. Tadic, "On the Convergence of Temporal-Difference Learning with Linear Function Approximator," Machine Learning, Vol. 42, Issue 3, 2001 (acm portal)
Abstract:

Tags:

[Schoknecht02a] R. Schoknecht, "Optimality of Reinforcement Learning Algorithms with Linear Function Approximation," In Advances in Neural Information Processing Systems, volume 15, 2002. (pdf)
Abstract:
Tags:

[Schoknecht02b] R. Schoknecht and A. Merke, "Convergent combinations of reinforcement learning with function approximation," In Advances in Neural Information Processing Systems, volume 15, 2002. (pdf)
Tags:

[Boyan02] J. A. Boyan, "Technical Update: Least-Squares Temporal Difference Learning," Machine Learning 49:2-3, 2002. (pdf)
Abstract:
Tags:

[de Farias03] D. P. de Farias and B. Van Roy, "The Linear Programming Approach to Approximate Dynamic Programming," Operations Research, Vol. 51, No. 6, November-December 2003, pp. 850-865. (pdf)
Abstract:
Tags:

[Nedic03] A. Nedic and D. P. Bertsekas, "Least-Squares Policy Evaluation Algorithms with Linear Function Approximation," J. of Discrete Event Systems, Vol. 13, 2003, pp. 79-110. (pdf)
Abstract: We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, which is based on the λ-policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, there is only a convergence result by Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
Tags:
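A batch LSTD(0) solver can be sketched in a few lines; this is our own minimal rendering of the Bradtke-Barto estimator the abstract refers to, not the paper's LSTD(lambda) analysis. With one-hot features on a toy chain it recovers the exact values.

```python
import numpy as np

def lstd0(transitions, d, gamma=0.9):
    """LSTD(0): transitions is an iterable of (phi, r, phi_next), phi in R^d."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)   # weights w with Phi w approximating V^pi
```

On the deterministic two-state cycle 0 -> 1 -> 0 with reward 1 for leaving state 0, this yields the exact values V(0) = 1/0.19 and V(1) = 0.9/0.19.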

[Lagoudakis03] Michail G. Lagoudakis and Ronald Parr, "Least-Squares Policy Iteration," Journal of Machine Learning Research, 4, 2003, pp. 1107-1149. (pdf)
Abstract: We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method which can use efficiently (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.
Tags:
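The LSPI loop described in the abstract (LSTDQ evaluation inside approximate policy iteration) can be sketched compactly. The toy MDP in the test, the one-hot state-action features, and the fixed iteration count below are our own illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def lstdq(samples, phi, w, n_actions, gamma=0.9):
    """Evaluate the policy that is greedy w.r.t. w, from (s, a, r, s_next) samples."""
    d = len(w)
    A, b = np.zeros((d, d)), np.zeros(d)
    for s, a, r, s_next in samples:
        a_next = max(range(n_actions), key=lambda ap: phi(s_next, ap) @ w)
        A += np.outer(phi(s, a), phi(s, a) - gamma * phi(s_next, a_next))
        b += r * phi(s, a)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lspi(samples, phi, d, n_actions, gamma=0.9, iters=20):
    """Alternate LSTDQ evaluation with implicit greedy improvement."""
    w = np.zeros(d)
    for _ in range(iters):
        w = lstdq(samples, phi, w, n_actions, gamma)
    return w
```

Note how the same batch of samples is reused at every iteration, which is the sample-efficiency point the abstract emphasizes.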

[Parr03] R. Parr and M. Lagoudakis, "Reinforcement Learning as Classification: Leveraging Modern Classifiers," Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003. (pdf)
Abstract:
Tags:

[Forster03] J. Forster and M. K. Warmuth, "Relative Loss Bounds for Temporal-Difference Learning," Machine Learning, Vol. 51, 2003. (pdf)
Abstract:
Tags: Policy Evaluation, Single Trajectory, Online, Generalized Linear, Regret Bounds, No Experiment

[de Farias04] D. P. de Farias and B. Van Roy, "On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming," Mathematics of Operations Research, Vol. 29, No. 3, August 2004, pp. 462-478. (pdf)
Abstract: In the linear programming approach to approximate dynamic programming, one tries to solve a certain linear program—the ALP—that has a relatively small number K of variables but an intractable number M of constraints. In this paper, we study a scheme that samples and imposes a subset of m << M constraints. A natural question that arises in this context is: How must m scale with respect to K and M in order to ensure that the resulting approximation is almost as good as one given by exact solution of the ALP? We show that, given an idealized sampling distribution and appropriate constraints on the K variables, m can be chosen independently of M and need grow only as a polynomial in K. We interpret this result in a context involving controlled queueing networks.
Tags:

[Szepesvári04] Cs. Szepesvári and William D. Smart, "Interpolation-based Q-learning," ICML 2004. (pdf)
Abstract: We consider a variant of Q-learning in continuous state spaces under the total expected discounted cost criterion combined with local function approximation methods. Provided that the function approximator satisfies certain interpolation properties, the resulting algorithm is shown to converge with probability one. The limit function is shown to satisfy a fixed point equation of the Bellman type, where the fixed point operator depends on the stationary distribution of the exploration policy and approximation properties of the function approximation method. The basic algorithm is extended in several ways. In particular, a variant of the algorithm is obtained that is shown to converge in probability to the optimal Q function. Preliminary computer simulations confirm the validity of the approach.
Tags: Value-based Method, Theory,

[Engel05] Y. Engel, S. Mannor and R. Meir, "Reinforcement Learning with Gaussian Processes," ICML 2005. (pdf)
Abstract:
Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.
Tags:

[Szepesvári05] Cs. Szepesvári and R. Munos, "Finite Time Bounds for Sampling Based Fitted Value Iteration," ICML'2005, pp. 881-886, 2005. (pdf) (extended version submitted to JMLR (pdf))
Abstract: In this paper we consider sampling based fitted value iteration for discounted, large (possibly infinite) state space, finite action Markovian Decision Problems where only a generative model of the transition probabilities and rewards is available. At each step the image of the current estimate of the optimal value function under a Monte-Carlo approximation to the Bellman-operator is projected onto some function space. PAC-style bounds on the weighted $L^p$-norm approximation error are obtained as a function of the covering number and the approximation power of the function space, the iteration number and the sample size.
Tags:
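The abstract's procedure (project a Monte-Carlo approximation of the Bellman backup onto a function space at each iteration) can be sketched with a linear space and least-squares projection. The toy interface (`sample_next` returning a reward and next state from the generative model) and all parameter choices are our own illustration.

```python
import numpy as np

def fitted_value_iteration(states, actions, sample_next, phi,
                           n_iters=100, n_mc=10, gamma=0.9, seed=0):
    """Sketch: V_{k+1} = projection onto span(phi) of the sampled backup of V_k."""
    rng = np.random.default_rng(seed)
    Phi = np.array([phi(s) for s in states])
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        targets = []
        for s in states:
            backups = []
            for a in actions:
                # Monte-Carlo estimate of E[r + gamma * V(s') | s, a]
                draws = [sample_next(s, a, rng) for _ in range(n_mc)]
                backups.append(np.mean([r + gamma * (phi(s2) @ w)
                                        for r, s2 in draws]))
            targets.append(max(backups))
        w = np.linalg.lstsq(Phi, np.array(targets), rcond=None)[0]  # projection
    return w
```

The paper's bounds quantify how the approximation power of the chosen space, the number of Monte-Carlo samples `n_mc`, and the iteration count together control the error of the result.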

[Antos06] A. Antos, Cs. Szepesvári and R. Munos, "Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path," Proceedings of the Nineteenth Annual Conference on Learning Theory, COLT 2006, Springer-Verlag, Berlin, Heidelberg, LNCS/LNAI 4005, pp. 574-588, 2006. (conference version (pdf), extended version submitted to MLJ (pdf))
Abstract: We consider the problem of finding a near-optimal policy in continuous space, discounted Markovian Decision Problems given the trajectory of some behaviour policy. We study the policy iteration algorithm where in successive iterations the action-value functions of the intermediate policies are obtained by picking a function from some fixed function set (chosen by the user) that minimizes an unbiased finite-sample approximation to a novel loss function that upper-bounds the unmodified Bellman-residual criterion. The main result is a finite-sample, high-probability bound on the performance of the resulting policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept that we call the VC-crossing dimension, the approximation power of the function set and the discounted-average concentrability of the future state distribution. To the best of our knowledge this is the first theoretical reinforcement learning result for off-policy control learning over continuous state-spaces using a single trajectory.
Tags:

[VanRoy06] B. Van Roy, "Performance Loss Bounds for Approximate Value Iteration with State Aggregation," Mathematics of Operations Research, Vol. 31, No. 2, pp. 234-244, 2006. (pdf)
Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to using invariant distributions of appropriate policies as projection weights. Such projection weighting relates to what is done by temporal-difference learning. Our analysis also leads to the first performance loss bound for approximate value iteration with an average-cost objective.
Tags: Approximate Value Iteration, Asymptotic convergence result

[de Farias06] D. P. de Farias and B. Van Roy, "A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees," Mathematics of Operations Research, Vol. 31, No. 3, pp. 597-620, 2006. (pdf)
Abstract: We introduce a new algorithm based on linear programming for optimization of average-cost Markov decision processes (MDPs). The algorithm approximates the differential cost function of a perturbed MDP via a linear combination of basis functions. We establish a bound on the performance of the resulting policy that scales gracefully with the number of states without imposing the strong Lyapunov condition required by its counterpart in de Farias and Van Roy [de Farias, D. P., B. Van Roy. 2003. The linear programming approach to approximate dynamic programming. Oper. Res. 51(6) 850–865]. We investigate implications of this result in the context of a queueing control problem.
Tags: Value-based, Linear Programming

[Loth06] M. Loth, M. Davy, R. Coulom and P. Preux, "Equi-gradient Temporal Difference Learning", ICML Workshop on Kernel machines and Reinforcement Learning, Pittsburgh, USA, June 2006. (pdf)
Abstract:
Tags:

[Terminator10] T-800 and T-1000, "Ultimate Reinforcement Learning Algorithm: Analysis and Results," Rejected.

Abstract: We find a reliable way to use randomly connected recurrent neural networks and prove that the proposed method can learn a policy up to a certain accuracy (\epsilon) after seeing n samples, just by observing any sample trajectory while following a random policy. We applied this method to the Mission Impossible IV problem, and the result was $1B gross internationally.
Tags: Policy Improvement, Single Trajectory, Off-Policy, Nonlinear, Convergence Rate, Real-world

## Basis Function Selection and Novel Features in RL

[Ratitch04] B. Ratitch and D. Precup, "Sparse Distributed Memories for On-Line Value-Based Reinforcement Learning," In Proceedings of the ECML-2004. (pdf)
Abstract:
In this paper, we advocate the use of Sparse Distributed Memories (SDMs) for on-line, value-based reinforcement learning (RL). SDMs provide a linear, local function approximation scheme, designed to work when a very large/high-dimensional input (address) space has to be mapped into a much smaller physical memory. We present an implementation of the SDM architecture for on-line, value-based RL in continuous state spaces. An important contribution of this paper is an algorithm for dynamic on-line allocation and adjustment of memory resources for SDMs, which eliminates the need for choosing the memory size and structure a priori. In our experiments, this algorithm provides very good performance while efficiently managing the memory resources.
Tags:

[Bouchard-Cote04] Alexandre Bouchard-Cote, "Sparse Distributed Memories in a Bounded Metric State Space: Some Theoretical and Empirical Results," Technical Report, McGill University, 2004. (pdf)
Abstract:
Sparse Distributed Memories (SDM) [7] is a linear, local function approximation architecture that can be used to represent cost-to-go or state-action value functions of reinforcement learning (RL) problems. It offers a possibility to reconcile the convergence guarantees of linear approximators and the potential to scale to higher dimensionality typically exclusive to nonlinear architectures. We investigated both avenues to see if SDM can fulfill its promises in the context of a RL problem with a bounded metric state space. On the theoretical side, we found that using properties of interpolative approximators provided by [3], it can be proven that a version of Q-learning converges with probability one when it is used with a SDM architecture. On the other hand, an important non-divergence result on the SARSA algorithm, an approximate, optimistic policy iteration algorithm, failed to apply in our case because of our loose assumptions on the nature of the state space. However, one of the most important convergence result in reinforcement learning [6], the convergence of TD(λ) was successfully translated into the language of measure theory to cover the case of Markov processes with a general probability state space. This forms the foundation for an eventual proof of convergence of an approximate policy iteration algorithm. On the empirical side, the first step was to design and implement a specialized data structure to store and retrieve “hard locations” (basis points). The specifications of this hash-based data structure will be discussed, as well as statistics on the performances of SDM equipped with this data structure. We conclude by presenting potential extensions to SDM. Their motivations and drawbacks will be examined, focusing on the implications related to the hard location storage data structure. These results form our basic toolbox for our current work targeting a general-purpose, “tweaking-free” reinforcement learning algorithm with flexible assumptions on the space, convergence guarantees and high-dimensional scalability.
Tags:

[Menache05] I. Menache, S. Mannor, and N. Shimkin, "Basis Function Adaptation in Temporal Difference Reinforcement Learning," Annals of Operations Research, 134:1 215-238, 2005. (pdf) (Version Discrepancy?!)
Abstract:
Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis function is important for the success of the algorithm. In the present paper we examine methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis function while simultaneously adapting the (non-linear) basis function parameters. We present two algorithms for this problem. The first uses a gradient-based approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
Tags:

[Keller06] P. W. Keller and D. Precup, and S. Mannor, "Automatic Basis Function Construction for Approximate Dynamic Programming and Reinforcement Learning," ICML 2006. (pdf)
Abstract:
We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castanon (1989) who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error. We then place basis functions in the lower-dimensional space. These are added as new features for the linear function approximator. This approach is applied to a high-dimensional inventory control problem.
Tags:

[Glaubius?] R. Glaubius and W. D. Smart, "Manifold Representations for Value-Function Approximation in Reinforcement Learning," (?). (pdf)
Abstract:
Tags:

A paper or two by Mahadevan

[Dayan93] P. Dayan, "Improving Generalization by Temporal Difference Learning: The Successor Representation," Neural Computation, 5:613-624, 1993. (pdf)
Abstract:
Estimation of returns over time, the focus of temporal difference (TD) algorithms, imposes particular constraints on good function approximators or representations. Appropriate generalisation between states is determined by how similar their successors are, and representations should follow suit. This paper shows how TD machinery can be used to learn such representations, and illustrates, using a navigation task, the appropriately distributed nature of the result.
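The representation the abstract refers to can be learned with ordinary TD updates. A minimal sketch (our own construction, not the paper's navigation task): a uniform random walk on a five-state ring, where row M[s] estimates the expected discounted future occupancy of every state when starting from s, and any value function then becomes linear in the learned features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, alpha = 5, 0.9, 0.05

# Uniform random walk on a 5-state ring; M[s] will hold the expected
# discounted future occupancy of every state when starting from s.
M = np.zeros((n, n))
s = 0
for _ in range(200_000):
    s_next = (s + rng.choice((-1, 1))) % n
    td_target = np.eye(n)[s] + gamma * M[s_next]   # the "reward" is the state indicator
    M[s] += alpha * (td_target - M[s])
    s = s_next

# Any value function is then linear in the successor representation: V = M r.
r = rng.standard_normal(n)
V = M @ r
```

A quick sanity check on the result: each row of the exact successor matrix (I - gamma P)^(-1) sums to 1/(1 - gamma) = 10, and the TD estimate should come close to that.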

## Related Mathematical Tools

[Munos06] Remi Munos, "Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation," Journal of Machine Learning Research, vol. 7, 2006. (pdf)
Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(r^N) (with r < 1) up to a threshold that depends on the approximation error V - AV, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. AV = V), the variance decreases geometrically to zero.
An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes.
Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity dV/da of the performance measure V with respect to some parameter a of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method.
We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of
both of those representations.
Tags:
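A single-step version of the idea can be sketched as follows (our toy example, assuming the transition matrix P is known so that the control variate's conditional expectation is computable; the paper iterates this construction to obtain the geometric rate). Given an approximation V_tilde of V, the estimator V_tilde(x0) + sum_t gamma^t (r + gamma P V_tilde - V_tilde)(x_t) is unbiased for V(x0), and its spread scales with the Bellman residual of V_tilde rather than with the rewards themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, T = 6, 0.9, 100
P = rng.dirichlet(np.ones(n), size=n)            # known transition matrix
r = rng.standard_normal(n)                        # per-state reward
V = np.linalg.solve(np.eye(n) - gamma * P, r)     # exact value function (reference)
V_tilde = V + 0.05 * rng.standard_normal(n)       # some approximation of V
# Bellman residual of the approximation -- computable because P is known:
resid = r + gamma * P @ V_tilde - V_tilde

disc = gamma ** np.arange(T)
plain, cv = [], []
for _ in range(300):
    xs = [0]
    for _ in range(T):
        xs.append(rng.choice(n, p=P[xs[-1]]))
    xs = np.array(xs[:-1])
    plain.append(disc @ r[xs])                    # ordinary Monte Carlo return
    cv.append(V_tilde[0] + disc @ resid[xs])      # control-variate estimator of V[0]
plain, cv = np.array(plain), np.array(cv)
# Both estimate V[0] (up to gamma^T truncation); the corrected estimator's
# variance scales with the approximation error V - V_tilde.
```

If V_tilde were exact, the residual would vanish and the corrected estimator would have zero variance, which is the AV = V case of the abstract.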

## Non-parametric Regression

Maybe some chapters from Györfi et al.'s A Distribution-Free Theory of Nonparametric Regression:
Introduction (1 or 2?)
Least Squares - Consistency (10?)
Least Squares  + Regularization (20/21?)

B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, "Least Angle Regression," Annals of Statistics (with discussion) 32(2), 2004, pp. 407-499. (pdf)
Abstract:
The purpose of model selection algorithms such as All Subsets, Forward Selection, and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression ("LARS"), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived. (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of Ordinary Least Squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as Ordinary Least Squares applied to the full set of covariates.
Tags:
Reading Guide: You may like to read this to understand the work of Loth et al. (Equi-gradient TD learning) better.
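Incremental Forward Stagewise, the "slow" relative that LARS accelerates, is simple enough to sketch directly. Below is a toy illustration on made-up data (our construction, not the paper's algorithm): at every step, nudge by eps the coefficient of the covariate most correlated with the current residual.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, steps=5000):
    """Incremental Forward Stagewise: repeatedly nudge, by eps, the coefficient
    of the covariate most correlated with the current residual."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        corr = X.T @ (y - X @ beta)        # correlation of each covariate with residual
        j = np.argmax(np.abs(corr))
        beta[j] += eps * np.sign(corr[j])
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
X /= np.linalg.norm(X, axis=0)             # unit-norm covariates, as is customary
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta = forward_stagewise(X, y)             # recovers the sparse coefficient pattern
```

As eps shrinks, the coefficient paths traced by this procedure approach the Lasso/LARS paths, which is exactly the connection property (2) of the abstract makes precise.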

## Miscellaneous

[Hutter01] M. Hutter, "Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decisions," ECML 2001 (pdf).
Abstract:
Decision theory formally solves the problem of rational agents in uncertain worlds if the true environmental probability distribution is known. Solomonoff's theory of universal induction formally solves the problem of sequence prediction for unknown distribution. We unify both theories and give strong arguments that the resulting universal AIXI model behaves optimally in any computable environment. The major drawback of the AIXI model is that it is uncomputable. To overcome this problem, we construct a modified algorithm AIXI^tl, which is still superior to any other time t and space l bounded agent. The computation time of AIXI^tl is of the order t x 2^l.
Tags:

## To Consider

Ryo Goto, Hiroshi Matsuo, "State generalization method with support vector machines in reinforcement learning"
Tao Geng , Bernd Porr and Florentin Worgotter, "Fast biped walking with a reflexive controller and real-time policy searching", NIPS 2005.
H. Yao, D. Dongcui, Z. Sun, "Historical Temporal Difference Learning: Some Initial Results," IMSCCS 2006. (check!)

# Slides

You may find these related slides useful:

Remi Munos and Csaba Szepesvari, "A Statistical Learning Approach to Approximate Dynamic Programming," ICML Workshop on Kernel Machines and Reinforcement Learning, 2006. (Part 1) (Part 2) (both in pdf)

# How To ... ?!

This section explains how to fill in the Tags and Reading Guide fields of a paper. The following guide is just a suggestion, and your comments are more than welcome.

## Tags

The Tags field summarizes the paper: whether it is experimental or theoretical, whether a theoretical result is asymptotic or a finite-sample bound, which class of function approximators the paper uses, and so on.
We are going to use the following system of tagging:

| Tags | Options | Notes |
| --- | --- | --- |
| Goal | Policy Evaluation / Control learning (off-line) / Control learning (on-line) | |
| Model Assumption | Explicit Model / Generative Model / Single Trajectory | |
| Online vs Offline | Online / Offline | |
| On/Off Policy | On-Policy / Off-Policy | |
| Class of Functions | Look-up table / Generalized Linear / Nonlinear (general) | |
| Theory | None / Asymptotic / Convergence Rate | |
| Experiment | None / Toy Problem / Real-world (Simulation OR really real-world!) | |

The above categorization is not meant to be complete in any way. You are free to add new (sensible) categories and category values.

For instance, consider a paper, ultimate-RL, which uses a neural network (Nonlinear) and proves that a policy can be learned (Control learning) up to a certain accuracy after seeing n samples (Convergence Rate) just by observing any sample trajectory (Single Trajectory) while following a random policy (Off-Policy), and which applies this method to help produce Mission Impossible IV. You could write its Tags as

Tags:
Control learning (off-line), Single Trajectory, Off-Policy, Nonlinear, Convergence Rate, Real-world

If you read the paper, you may like to help others by filling in the Reading Guide part of each paper. For example, you can pinpoint the important parts/theorems/figures of the paper, telling readers where to look and what to skip. For instance, for the ultimate-RL paper, you might write: "Read Definitions 1 and 3, read Theorem 2, and check Experiment Impossible." This is practically helpful, because people usually do not have enough time to read the whole paper!

Hi,

I propose that at our next meeting you should (quickly) finish the discussion of the paper by Pinkus and move on to the discussion of the paper by Kurkova and Sanguineti.

This paper compares linear and non-linear approximation (of which neural nets are a particular example) and has interesting lessons.

I suggest that you concentrate on Sections 1-2, Section 3.C, and Section 5.C. I know that this is a difficult paper to read; we should not get lost in the details but concentrate on the main ideas (i.e., the results and how they are formulated).

Logistics: We need someone to volunteer to chair the meeting.
Therefore I ask everyone who feels that she or he would be able to do that to drop me an e-mail.

Again, let me emphasize that the goal is to understand the results (look for the theorems!); the proofs at this moment are not that interesting for us. Therefore I suggest that after reading the first 3 sections, you look for the theorems and work your way back to the definitions that are necessary to understand the theorem.

- Csaba

Hi,
Don't forget to come to the reading group meeting today: We will continue our discussion of Pinkus' paper and hopefully move on to the next topic.
See you there!
- Csaba

FYI: Next time we are going to wrap up the Kurkova/Sanguineti paper. That is, we will focus on the main results of the paper, presented in its last section. Questions to consider: What do we learn from the paper? How do these results compare with our previous knowledge about linear function approximation? Are these results significant?