Reinforcement Learning and Artificial Intelligence (RLAI)

NIPS Workshop on
Reinforcement Learning: Benchmarks and Bake-offs
--Rich Sutton, Sep 5 2004
This page is the official page of the above-named workshop, held December 17, 2004.  See also the page for the follow-on 2005 NIPS workshop.

2004 discussion page on standards for reinforcement learning interfaces

Workshop Description

The workshop on Reinforcement Learning: Benchmarks and Bake-offs
will be held on December 17 as part of the NIPS conference.
The workshop will explore the establishment of a standard set of
benchmarks and a series of competitive events (bake-offs) to enhance
reinforcement-learning research.  The workshop will ideally produce
the following outputs: 1) a proposed specification for implementing
benchmark problems, 2) identification of a list of initial benchmarks,
with assignment of responsibility for their implementation, 3)
policies for extending the benchmark set to address new issues, 4)
specific proposals for a series of competitive events comparing
different reinforcement-learning methods on various kinds of problems,
and 5) the formation of a policy committee to guide the construction
and evolution of the benchmarks and competitions.  For the purposes of
this workshop, reinforcement learning (RL) is meant to include a broad
range of interactive learning problems, including POMDPs, navigation,
control problems, probabilistic planning, and sequential prediction
problems with and without actions.

Rationale and Challenges

It has often been suggested that the field of reinforcement learning
would benefit from the establishment of standard benchmark problems
and regular competitive events.  Isolated efforts have
faltered due to a lack of "buy in" from the community.  Competitions
can greatly increase the interest and focus in an area by clarifying
its objectives and challenges, publicly acknowledging the best
algorithms, and generally making the area more exciting and enjoyable.
Standard benchmarks can make it much easier to apply new algorithms to
existing problems and thus provide clear first steps toward their
evaluation.  Competitions and benchmark problems that can be used in
these ways have yet to be established.  Some reasons are:

  1. RL problems are interactive and cannot be represented by simple
    data files as are benchmark problems for other kinds of machine
    learning.  If RL benchmarks are implemented as computer programs, then
    difficulties arise regarding their availability and consistency across
    computers and software environments.
  2. RL benchmarks of a sort already exist as descriptions in published
    papers.  It has been thought that as long as the problem descriptions
    were sufficiently clear then they could be easily re-implemented by
    other scientists in the software environment of their choice. 
    Unfortunately, the published descriptions are often incomplete or ignored.
  3. RL is a relatively young field and the kinds of problems studied are
    constantly changing.  New algorithms frequently require new problems to
    probe their strengths and weaknesses.  Established benchmarks could
    discourage the exploration of important new problem settings.
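
Item 1 above is the crux of the interface question: an RL benchmark must be a program the agent interacts with, not a data file.  As a purely illustrative sketch of what a standard agent-environment interface might look like (all class and function names here are hypothetical, not part of any actual proposal):

```python
# Hypothetical sketch of a standard agent-environment interface.
# None of these names come from any real proposal; they only
# illustrate why an RL benchmark must be an interactive program.

import random


class ChainEnvironment:
    """A toy 5-state chain: move left/right, reward 1 at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def start(self):
        """Begin an episode; return the initial observation."""
        self.state = 0
        return self.state

    def step(self, action):
        """Take action 0 (left) or 1 (right); return (reward, obs, terminal)."""
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        reward = 1.0 if self.state == self.length - 1 else 0.0
        terminal = self.state == self.length - 1
        return reward, self.state, terminal


class RandomAgent:
    """Placeholder agent: chooses actions uniformly at random."""

    def __init__(self, num_actions=2, seed=0):
        self.rng = random.Random(seed)
        self.num_actions = num_actions

    def act(self, observation):
        return self.rng.randrange(self.num_actions)


def run_episode(env, agent, max_steps=100):
    """The harness: the one piece of code a bake-off would standardize."""
    obs = env.start()
    total_reward = 0.0
    for _ in range(max_steps):
        reward, obs, terminal = env.step(agent.act(obs))
        total_reward += reward
        if terminal:
            break
    return total_reward
```

Because the environment is code rather than data, a bake-off would have to fix the harness and distribute the environment implementation itself, which is exactly the availability-and-consistency problem item 1 raises.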

The purpose of this workshop is to explore innovative approaches
to overcoming these issues (leading to outputs 1-5 above).  Examples
of successful prior standards efforts include the UCI database,
RoboCup, the International Planning Competitions, the AAAI Robot
Challenge, and the Trading Agent Competition.  We will endeavor to
attract experts from each of these competitions to report on their
successes and failures, strengths and weaknesses, and lessons learned.


The workshop will consist of an outline of the goals and scope of the
workshop from the organizers, followed by reports from representatives
from existing competitions and benchmark sets, followed by working
sessions to produce workshop outputs 1-5.  Slightly more than half the
time will be reserved for general discussion and working sessions


Organizers

Richard S. Sutton, University of Alberta, Alberta, CANADA
Michael L. Littman, Rutgers University, New Jersey, USA

Proposed invited speakers

Peter Stone (RoboCup)
Sven Koenig (IPC)
Michael Littman (IPC, probabilistic track)
Alan Schultz (AAAI Robotics Challenge)
David Aha (UCI database)
Michael Wellman (TAC)


Participants might include:

John Langford
Andrew Ng
Tom Dietterich
Sven Koenig (IPC)
Peter Stone (RoboCup, TAC)
Michael Littman (IPC, TAC, TREC)
Satinder Singh (TAC)
Michael Bowling (RoboCup)
Rich Sutton
Bob Givan (IPC)
Leslie Kaelbling
Sridhar Mahadevan
Andrew Moore
Hajime Kimura
David Aha (UCI database)
Haakan Younes
Alan Schultz
Paul Cohen
Michael Wellman

Workshop length

The workshop will last for one day, but we hope that it will be on
the first day so that more detailed discussion and work can continue
on the day after.


I suggest that the Workshop page have a suggestions section so that those of us who are unlikely to attend can contribute ideas and so that everyone can consider the issues prior to the Workshop.

The potential problem of benchmarks discouraging new problem settings might be partly avoided by having a "language" for specifying problems. The chosen benchmarks at any time would be specific instances of problems from the universe of problems generated by that language. This would allow researchers to generate scaled and related problems for their own exploratory purposes and allow new benchmarks to be created more easily than if they were one-off exercises.

I would like to see families of benchmarks that are related but vary on a number of dimensions (e.g. complexity, noisiness, degree of look-ahead required, etc). A single benchmark does not, of itself, assist with decomposing the causal factors of performance. The concepts from Paul Cohen's "Empirical methods for Artificial Intelligence" would be relevant here.

I am interested in problems that have recursive structure (i.e. where attainment of subgoals is important to performance) so would like the problem generation language to allow for this.

I am interested in problems that have relational structure (e.g. the alignment of objects in a 2-D world indicates the direction to a goal) so would like the problem generation language to allow for this.

Many problems in the literature have a spatial aspect (i.e. 2-D or 3-D worlds for robotics). I would be uncomfortable if the problem generation language was necessarily spatial. I would prefer to see a language that could generate spatial problems as special cases.

Ross Gayler
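
One way to make the problem-generation language proposed above concrete, purely as an illustration: a benchmark family could be a parameterized generator, so that each official benchmark is one fixed parameter setting while researchers vary the parameters for their own exploratory purposes.  The generator and parameter names below are hypothetical:

```python
# Hypothetical problem-family generator: a gridworld parameterized by
# size and action noise.  A benchmark is one fixed parameter setting;
# researchers could scale the same family along its dimensions.

import random


def make_gridworld(width, height, noise, seed=0):
    """Return (start, step) functions for a noisy 2-D gridworld.

    `noise` is the probability that a chosen action is replaced by a
    random one, so the family varies along size and stochasticity.
    """
    rng = random.Random(seed)
    goal = (width - 1, height - 1)
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # up, down, right, left

    def start():
        return (0, 0)

    def step(state, action):
        if rng.random() < noise:
            action = rng.randrange(len(moves))
        dx, dy = moves[action]
        x = max(0, min(width - 1, state[0] + dx))
        y = max(0, min(height - 1, state[1] + dy))
        next_state = (x, y)
        reward = 1.0 if next_state == goal else 0.0
        return next_state, reward, next_state == goal

    return start, step


# A "benchmark" is then just a named point in the family:
BENCHMARKS = {
    "grid-small-deterministic": dict(width=5, height=5, noise=0.0),
    "grid-large-noisy": dict(width=50, height=50, noise=0.2),
}
```

The spatial case here is only one instantiation; a richer language, as suggested above, could generate relational or recursive problem families in the same parameterized way.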

An appropriate language for specifying RL benchmark problems might be the GGP language (http://games.stanford.edu/language.html). This is intended for use in a game-playing computational challenge, so it may already address many of the issues relevant to benchmarks and bake-offs.

The GGP system allows single-player games (RL equivalent = environment and agent) and multi-player games (RL equivalent = environment and multiple agents). It is not clear from the initial description whether GGP allows for probabilistic games and information that is known to the environment but hidden from the agent. It is also not clear whether the GGP concept of victory can be adequately mapped onto an RL reward.

However, it is clear that the GGP language is currently in a state of flux, so now would be the right time to see if it can be made to suit both purposes.

Ross Gayler (17 September 2004)  
This open web page hosted at the University of Alberta.