Reinforcement Learning
NIPS Workshop: The First Annual Reinforcement
Learning Competition

This is the official page of the above-named workshop, to be held
December 9, 2006 at Whistler, BC, Canada, in conjunction with the 2006
Neural Information Processing Systems Conference.
Edited by Adam White
Workshop News:
- Competition has officially begun: 26th October 2006
- Competition has officially ended: 7th December 2006
Official Results:
The Winners are:
Cat and Mouse
University of Texas, Austin
University of Texas, Austin
University of Texas, Austin
University of Osnabrueck, Germany
Non-stationary Mountain Car
University of Newcastle, Australia
Puddle World
Rutgers University, USA
The Pentathlon
Rutgers University, USA
Workshop presentation of results can be found here.
Competitions can greatly increase the interest and focus in an area by
clarifying its objectives and challenges, publicly acknowledging the
best algorithms, and generally making the area more exciting and
enjoyable. The First Annual Reinforcement Learning competition is
ongoing and will reach culmination six days before this workshop
(December 3rd). The purpose of this one day workshop is to 1) report
the results from the competition, 2) summarize the experiences of and
solution approaches employed by the competition participants, and 3)
plan for future competitions.
There have been two previous events within the machine learning
community that have involved comparing different reinforcement learning
methods, one at a NIPS 2005 workshop and one at an ICML 2006 workshop.
Last year's event at NIPS had over 30 submissions, from 9 countries.
This event differs from the previous ones in that it is a true
competition. The winners will be invited to describe their
approach to each problem at the workshop, improving the attendees'
expertise in applying and implementing reinforcement learning systems.
This year's competition will feature seven events, three discrete
observation problems and three continuous observation problems. The
final event will be a Pentathlon: each participant's agent will be
evaluated on five continuous observation problems consecutively. The
agents that perform best across all five problems will be awarded small
Previous, related workshops:
The competition software will be based on RL-Glue, and the environments
will be drawn from the Reinforcement Learning Library. Please see
the documentation on the RL-Glue website and the example code in RL-Library
to ensure that your agent code is compatible with RL-Glue.
The competition events are divided into two classes, discrete and
continuous. Participants can submit agents
that are specifically designed for each event. The discrete events are:
Cat and Mouse
Tetris
Garnet
The continuous events are:
Non-stationary Mountain Car
Cart-pole
Puddle World
In the pentathlon your agent will be tested on 5 problems
consecutively. For local testing, only 2 of the 5 problems will be
available; your agent will be tested on 3 unknown environments during
competition evaluation. The first two problems used in the pentathlon
will be Mountain Car and Acrobot, described below.
The other 3 will be continuous observation (array of doubles) and
discrete action (integer).
The scoring of the pentathlon will be done as follows:
Let n be
the number of teams competing in the pentathlon (data-files on the
server). For each of the five problems, the teams will be ranked first
to last according to highest cumulative reward. Each team will be
assigned points based on their rank on each problem. The team with the
highest point total will be declared the winner. For example, if we had
5 teams:
Team   | Mountain Car   | Problem 2      | Problem 3      | Problem 4      | Problem 5      | Total
Team 1 | 1st (5 points) | 3rd (3 points) | 3rd (3 points) | 1st (5 points) | 3rd (3 points) | 19 points
Team 2 | 4th (2 points) | 2nd (4 points) | 4th (2 points) | 2nd (4 points) | 4th (2 points) | 14 points
Team 3 | 2nd (4 points) | 5th (1 point)  | 1st (5 points) | 3rd (3 points) | 5th (1 point)  | 14 points
Team 4 | 5th (1 point)  | 4th (2 points) | 5th (1 point)  | 4th (2 points) | 2nd (4 points) | 10 points
Team 5 | 3rd (3 points) | 1st (5 points) | 2nd (4 points) | 5th (1 point)  | 1st (5 points) | 18 points
Team 1 finishes first, team 5 finishes second, and teams 2 and 3 tie for
third. This evaluation scheme extends to n teams by assigning n points
to first place on each problem, n-1 points to second, and so on.
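The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the rank-to-points scheme, not the organizers' scoring code: the function name `pentathlon_points` and the input layout (a dictionary from team to per-problem cumulative rewards) are assumptions, and because the text does not say how equal rewards are ranked, ties here simply fall back to the order in which teams were listed.

```python
def pentathlon_points(rewards):
    """Total pentathlon points per team.

    rewards: dict mapping team -> list of cumulative rewards, one per problem
             (hypothetical layout chosen for this sketch).
    On each problem the best team earns n points, the next n-1, and so on,
    where n is the number of teams.
    """
    teams = list(rewards)
    n = len(teams)
    num_problems = len(next(iter(rewards.values())))
    totals = {team: 0 for team in teams}
    for p in range(num_problems):
        # Highest cumulative reward ranks first. Ties keep input order
        # (an assumption; the rules do not specify tie handling).
        ranked = sorted(teams, key=lambda t: rewards[t][p], reverse=True)
        for rank, team in enumerate(ranked):
            totals[team] += n - rank
    return totals
```

Feeding in rewards whose per-problem ordering matches the five-team example reproduces its totals of 19, 14, 14, 10 and 18 points.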
General Setting
Competition run - one connection to the server. Each competition run
consists of 30 runs (experiments) of an agent on an environment.
Run - each run consists of Emax episodes (set below for each problem).
Results are averaged across the 30 runs.
Each of the Emax episodes within a run will begin at a random
starting state. An episode ends if either a terminal state is reached
or Smax steps are taken (with no punishment assigned for reaching this
threshold). Smax is set for each problem below. Agents cannot terminate
episodes prematurely. We have
chosen not to explicitly separate training and testing phases. Each
solver should strive to maximize the rewards it accumulates over the
Emax episodes.
The environments running on the competition server will be different
from the local test copies distributed with the competition software.
For instance, in the cat and mouse event, the policy of the cat may
change or the size of the maze may change. Participants are encouraged
to test their agents on a variety of problem configurations before
benchmarking on the server. The
details of what will be changed and how will not be specified.
The server environments will have stationary dynamics unless otherwise
stated in the problem description.
Agents can try to transfer knowledge gained during one run to the
next, but the starting states and problem
parameters may be varied from run to run.
The beginning of a run is signaled to the agent through a call to the
agent_init function. This function should re-initialize the agent. The
agent is also passed the Task_spec string as a parameter to agent_init.
This encodes information about the environment the agent is being
tested on. Finally, the agent_cleanup function is called to signal to
the agent that the current run has finished. For more information
please see the RL-Glue Documentation.
The evaluation methodology, described above, roughly corresponds to the
following pseudo code:
for run = 1 : NUM_RUNS
    RL_init()            --> applies small random variations to the environment and generates the start set
    for episode = 1 : Emax
        RL_episode(Smax) --> runs a single episode with a cutoff of Smax steps
    end for
end for
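As a rough, runnable counterpart to the pseudo code, the Python sketch below runs the same nested loop against a toy problem. Everything in it is illustrative: `ToyEnvironment`, `RandomAgent`, `run_episode` and `evaluate` are hypothetical names, the toy dynamics are invented, and the agent method signatures only mirror the RL-Glue function names mentioned above rather than the actual C or Java interfaces.

```python
import random

NUM_RUNS = 30  # the text states that results are averaged across 30 runs


class ToyEnvironment:
    """Hypothetical stand-in problem: walk along 0..10, terminal at 10, reward -1 per step."""

    def start(self, rng):
        self.pos = rng.randrange(0, 10)  # random starting state
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        return -1.0, self.pos, self.pos >= 10  # reward, observation, terminal


class RandomAgent:
    """Picks one of two actions uniformly at random."""

    def agent_init(self, task_spec):
        self.rng = random.Random(0)  # re-initialize at the start of every run

    def agent_start(self, observation):
        return self.rng.randrange(2)

    def agent_step(self, reward, observation):
        return self.rng.randrange(2)

    def agent_end(self, reward):
        pass

    def agent_cleanup(self):
        pass


def run_episode(env, agent, s_max, rng):
    """Run one episode with a cutoff of s_max steps; return its cumulative reward."""
    action = agent.agent_start(env.start(rng))
    total = 0.0
    for _ in range(s_max):
        reward, observation, terminal = env.step(action)
        total += reward
        if terminal:
            agent.agent_end(reward)
            return total
        action = agent.agent_step(reward, observation)
    return total  # cutoff reached: no extra penalty is applied


def evaluate(agent, e_max, s_max, num_runs=NUM_RUNS, seed=0):
    """Average, over runs, of the reward accumulated across e_max episodes."""
    rng = random.Random(seed)
    run_totals = []
    for _ in range(num_runs):
        env = ToyEnvironment()            # stands in for RL_init()
        agent.agent_init("toy-task-spec")  # placeholder, not a real Task_spec string
        run_totals.append(sum(run_episode(env, agent, s_max, rng) for _ in range(e_max)))
        agent.agent_cleanup()
    return sum(run_totals) / num_runs
```

Calling `evaluate(RandomAgent(), e_max=5, s_max=20, num_runs=3)` returns the mean cumulative reward per run for this toy setup.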
How to
- Email the organizers to obtain a user name and password for your team
- Download the competition software
- Implement your agents according to the RL-Glue standard (sample
agents are included in the competition software)
- Once you are ready to make a competition run, connect to the server
- Repeat for each task you wish to solve
Note: You do not need a valid
user name and password to do local testing with the competition software.
- Each team is allowed only one account. The organizers reserve the
right to restrict access to any individual or team.
- Each team is allowed ONLY 10 competition runs for each
environment on the server.
- On December 3rd the server will be deactivated and (1) in the 6
individual events, the BEST (max cumulative reward) data-file for each
team-environment pair will be entered in the competition and (2) for
the pentathlon, the MOST RECENT data-file for each team will be entered
in the competition.
- Users will not be allowed to access data-files on the server.
However, competition runs will output descriptive stats to the terminal.
Since users are only allowed 10 connections per environment,
participants are encouraged to test their agents locally (see the
instructions in the software download). Only connect to the
server when you are satisfied with the performance of your agent for a
variety of parameter settings for a given environment-benchmark pair.
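The difference between the two entry rules matters in practice: in the individual events a poor late run costs nothing, while in the pentathlon the last run is the one that counts. The sketch below illustrates that distinction only; the function name `select_entry` and the `(submission_order, cumulative_reward)` tuple layout are assumptions, not the server's data-file format.

```python
def select_entry(data_files, pentathlon=False):
    """Pick the data-file that would be entered for one team on one environment.

    data_files: list of (submission_order, cumulative_reward) tuples
                (hypothetical layout chosen for this sketch).
    """
    if not data_files:
        return None
    if pentathlon:
        # Pentathlon: the MOST RECENT data-file is entered.
        return max(data_files, key=lambda f: f[0])
    # Individual events: the BEST (max cumulative reward) data-file is entered.
    return max(data_files, key=lambda f: f[1])
```

For three runs scoring -300, -120 and -250 in that order, the individual-event rule enters the second run and the pentathlon rule enters the third.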
[tar.gzip file, documentation within.]
Release Date: 22nd November 2006
Description: Competition
software with small bug fixes -- Highly recommended
- Acrobot task spec was not allocating enough memory. This
caused memory faults on some platforms.
Interact with the Server
Here we describe how to connect a C agent and a Java agent to the
competition environments on the server.
Evaluating an agent "remotely" on the server is basically the same as
evaluating it "locally", except that we don't have to worry about the
server. To evaluate an agent "remotely", follow these steps:
1. Edit RLcommon.h to match the environment types
2. Compile your agent with the correct client
3. Run the client (Agent)
For example, to evaluate ZeroAgent.cpp on the Cat and Mouse problem:
Open a terminal and go to the client directory.
Edit RLcommon.h:
typedef int* Observation;
typedef int Action;
Edit makeAgentC:
AGENTV = ZeroAgent.cpp
>> make -f makeAgentC clean
>> make -f makeAgentC
>> ./RL_client 3490
(3490 is the port number for the Cat and Mouse environment)
To evaluate ZeroAgent.java on the Cart-pole problem:
Open a terminal and go to the client directory.
Edit RLcommon.h:
typedef double* Observation;
typedef int Action;
Edit makeAgentJava:
AGENTV = ZeroAgent.java
>> make -f makeAgentJava clean
>> make -f makeAgentJava
>> java ClientAgent ZeroAgent 3495
(3495 is the port number for the Cart-pole environment)
Your agent class name (ZeroAgent in this example) must be passed as the
first parameter to ClientAgent.
For example, to evaluate ZeroAgent.cpp on the Pentathlon event:
Open a terminal and go to the client directory.
Edit RLcommon.h:
typedef double* Observation;
typedef int Action;
Edit makeAgentC:
AGENTV = ZeroAgent.cpp
>> make -f makeAgentC clean
>> make -f makeAgentC
>> ./RL_client 3946
(3946 is the port number for the Pentathlon)
For example, to evaluate ZeroAgent.java on the Pentathlon event:
Open a terminal and go to the client directory.
Edit RLcommon.h:
typedef double* Observation;
typedef int Action;
Edit makeAgentJava:
AGENTV = ZeroAgent.java
>> make -f makeAgentJava clean
>> make -f makeAgentJava
>> java ClientAgent ZeroAgent 3946
(3946 is the port number for the Pentathlon)
Each port corresponds to a different environment as follows:
Cat and Mouse
Tetris
Garnet
Mountain Car
Puddle World
Competition Problems
General Cat and Mouse (n x n grid):
There are several solid obstacles, several stationary pieces of cheese,
one mouse hole and a non-stationary cat. The cat moves (after the mouse
moves) to minimize its distance to the mouse, choosing randomly between
equal quality moves. The cat cannot see or move to the mouse if it is
hiding in its hole. Each map is randomly generated (on init) at the
beginning of each trial, but does not change between consecutive
episodes. The agent gets a small positive reward for each step that the
mouse occupies the same grid space as a piece of cheese and a large
negative reward if the cat and mouse occupy the same grid space. An
episode ends when the cat catches the mouse. The agent's objective is
to navigate the mouse through the maze, collecting as much cheese
as possible while avoiding the cat.
This is a generalization of Robert Schapire's formulation.
Episodic task
Action Space:
mouse movement
a ∈ [0,7]
|a| = 8
Observation Space:
mouse's position, cat's position, and whether the mouse is in its hole
[mp, cp, flag] where each {mp, cp} ∈ [0, n*n - 1] and flag ∈ [0,1]
Reward:
small positive reward if [{mp} == {cheese position}]
-100 if [{cp} == {mp}]
0 otherwise
Performance measure:
cumulative reward
Rmax = +5
Rmin = -100
Emax = 10,000
Smax = 300
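The cat's rule (move to minimize distance to the mouse, breaking ties at random) can be sketched as follows. Several details are assumptions made only for this illustration, since the page does not state them: positions are decoded row-major from the index in [0, n*n - 1], the cat has the same eight compass moves as the mouse, distance is measured as the larger of the row and column gaps, and the cat stays put while the mouse is hidden. The name `cat_move` is hypothetical.

```python
import random

# Eight compass moves (assumed for the cat; the page only gives the mouse 8 actions).
MOVES = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def cat_move(cat, mouse, n, obstacles, mouse_in_hole, rng=random):
    """Return the cat's next position index on an n x n grid.

    cat, mouse: position indices in [0, n*n - 1], decoded row-major (assumption).
    obstacles:  set of blocked position indices.
    """
    if mouse_in_hole:
        return cat  # the cat cannot see the mouse; staying put is an assumption
    cat_row, cat_col = divmod(cat, n)
    mouse_row, mouse_col = divmod(mouse, n)
    best, best_dist = [], None
    for d_row, d_col in MOVES:
        row, col = cat_row + d_row, cat_col + d_col
        if not (0 <= row < n and 0 <= col < n):
            continue  # off the grid
        index = row * n + col
        if index in obstacles:
            continue  # solid obstacle
        dist = max(abs(row - mouse_row), abs(col - mouse_col))
        if best_dist is None or dist < best_dist:
            best, best_dist = [index], dist
        elif dist == best_dist:
            best.append(index)
    # Choose randomly between equal-quality moves.
    return rng.choice(best) if best else cat
```

On a 5 x 5 grid with the cat in one corner and the mouse in the opposite corner, the only distance-minimizing move is the diagonal step; blocking that cell leaves two equally good moves to choose between.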
Tetris:
Episodic task
Action Space:
move piece left, right, rotate clockwise, rotate counter-clockwise, null action
a ∈ [0,4]
|a| = 5
Observation Space:
full game board (10 x 20)
binary vector (1 x 200)
|observation| = 200
Reward:
+1 for each line eliminated per time step
-100 for death
0 otherwise
Performance measure:
cumulative reward
Rmax = 2
Rmin = -100
Emax = 10,000
Smax = 500
Garnet:
The Garnet environment is a randomly generated MDP (random states,
actions and rewards) with probabilistic state transitions and a
non-stationary element: every k iterations the transitions are changed by
randomly deleting n state connections and creating n new links between
previously unconnected states. The agent is provided with a random
binary observation vector that is not large enough to uniquely identify
states. The agent's objective in this continuing task is to locate and
try to stay in regions of the MDP that result in the highest average
reward.
Non-stationary, random MDP
Continuing task
Action Space:
discrete actions, stochastic transitions
a ∈ [0, n-1]
n is a problem parameter
Observation Space:
vector of active binary
feature indices
eg: if state = 5
random feature vector:
normal distribution with mean 0 and
variance 1
Performance measure:
average reward
Rmax = inf
Rmin = -inf
Smax = 10,000
Emax - none, continuing task
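The non-stationary element of Garnet (delete n connections, create n new ones between previously unconnected states) can be illustrated on a bare connection graph. This sketch ignores actions, transition probabilities and rewards, treats a connection as an ordered pair of states, and allows self-loops; all of those are simplifying assumptions, and `rewire` is a hypothetical name rather than anything from the competition software.

```python
import random


def rewire(edges, num_states, n, rng):
    """Delete n random connections and add n new ones between unconnected states.

    edges: set of (state, next_state) pairs describing the current connections.
    Raises ValueError if fewer than n connections (or n free pairs) exist.
    """
    edges = set(edges)
    all_pairs = {(s, t) for s in range(num_states) for t in range(num_states)}
    unconnected = sorted(all_pairs - edges)  # pairs with no link before the change
    removed = rng.sample(sorted(edges), n)
    added = rng.sample(unconnected, n)
    return (edges - set(removed)) | set(added)
```

In a full environment this step would be applied once every k iterations, leaving the number of connections unchanged while shifting where they lead.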
Non-stationary Mountain Car:
Force of gravity changes every k steps
Episodic task
Action Space:
full reverse, neutral, full forward
a ∈ [0,2]
|a| = 3
Observation Space:
position and velocity
[x, x_dot]
x ∈ [-1.2, 0.6], x_dot ∈ [-0.07, 0.07]
Reward:
-100 if velocity > 0.0005 & position > 0.5 : termination occurs
0 if velocity <= 0.0005 & position > 0.5 : termination occurs
-1 otherwise
Performance measure:
cumulative reward
Rmax = 0
Rmin = -100
Emax = 500
Smax = 500
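Read literally, the Mountain Car reward rule penalizes arriving at the goal too fast. The short sketch below encodes just that rule with the thresholds listed above; the function name `mountain_car_reward` and the `(reward, terminal)` return convention are assumptions for illustration, and the changing gravity is not modeled.

```python
def mountain_car_reward(position, velocity):
    """Return (reward, terminal) for one step, using the thresholds listed above."""
    if position > 0.5:
        if velocity > 0.0005:
            return -100.0, True  # reached the goal region too fast
        return 0.0, True         # reached the goal region slowly enough
    return -1.0, False           # every other step costs -1
```

An agent therefore has to slow down before crossing position 0.5 rather than simply maximizing speed.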
Cart-pole:
Episodic task
Action Space:
discretized forces applied to cart
a ∈ [-10,10]
|a| = 21
Observation Space:
pole angle (from vertical), pole angular velocity, cart position (from center), cart velocity
θ ∈ [-π/6, π/6]
θ_dot ∈ [-5, 5]
x ∈ [-2.4, 2.4]
x_dot ∈ [-10, 10]
Reward:
+2 if |pole angle| <= π/60 and |pos| <= 0.05
0 if |pole angle| >= π/6 or |pos| >= 2.4 : termination occurs
+1 otherwise
Performance measure:
cumulative reward
Rmax = 2
Rmin = 0
Emax = 500
Smax = 500
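The three-level Cart-pole reward can be written out as a small function. This is a sketch of the rule as listed above, not code from the competition software; `cart_pole_reward` and its `(reward, terminal)` return value are illustrative choices.

```python
import math


def cart_pole_reward(pole_angle, cart_position):
    """Return (reward, terminal) given the pole angle (radians) and cart position."""
    if abs(pole_angle) >= math.pi / 6 or abs(cart_position) >= 2.4:
        return 0.0, True   # pole fell or cart left the track: episode ends
    if abs(pole_angle) <= math.pi / 60 and abs(cart_position) <= 0.05:
        return 2.0, False  # bonus for keeping the pole upright near the center
    return 1.0, False      # ordinary balanced step
```

The bonus region is narrow (about 3 degrees of tilt and 0.05 units of displacement), so balancing alone earns +1 per step while precise centering earns +2.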
Puddle World:
Episodic task
Action Space:
move agent north, east, south, west -- stochastic
a ∈ [0,3]
|a| = 4
Observation Space:
agent grid position
{x, y} ∈ [0, 1]
Reward:
-1 per step, plus -400 * <distance inside the puddle ∈ [0, 0.1]> if the agent is in a puddle
Performance measure:
cumulative reward
Rmax = -1
Rmin = -401
Emax = 500
Smax = 500
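The Puddle World reward is a per-step cost plus a penalty that grows with how deep the agent is inside a puddle. Since the puddle layout is not given on this page, the sketch below takes the depth as an input instead of computing it from (x, y); the function name `puddle_reward` and the range check are assumptions.

```python
def puddle_reward(depth_in_puddle):
    """Per-step reward given the agent's distance inside a puddle (0.0 when outside)."""
    if not 0.0 <= depth_in_puddle <= 0.1:
        raise ValueError("depth is expected to lie in [0, 0.1]")
    return -1.0 - 400.0 * depth_in_puddle
```

Outside the puddle the step simply costs -1; halfway to the maximum stated depth (0.05) it costs about -21.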
Pentathlon Event:
Delayed Mountain Car:
Observation is delayed by k steps. That is, on step k the observation given to the
agent is equal to the state of the environment from k-1 steps in the past. The first
k observations are random, and the agent does not receive the last k
observations before termination.
Episodic task
Action Space:
full reverse, neutral, full forward
a ∈ [0,2]
|a| = 3
Observation Space:
position and velocity
[x, x_dot]
x ∈ [-1.2, 0.6], x_dot ∈ [-0.07, 0.07]
Reward:
-1 per time step
Performance measure:
cumulative reward
Emax = 500
Smax = 500
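One way to picture the observation delay is a fixed-length queue sitting between the environment and the agent. The sketch below implements one reading of the description, namely a delay of exactly k steps with random filler for the first k observations; the class name `DelayedObservations` and the `random_observation` callback are hypothetical and are not part of the competition software.

```python
from collections import deque


class DelayedObservations:
    """Delays observations by k steps, emitting filler for the first k steps."""

    def __init__(self, k, random_observation):
        self.k = k
        self.random_observation = random_observation  # callable producing filler
        self.buffer = deque()

    def push(self, true_observation):
        """Store the current true observation and return what the agent sees now."""
        self.buffer.append(true_observation)
        if len(self.buffer) <= self.k:
            return self.random_observation()  # nothing old enough to show yet
        return self.buffer.popleft()          # the observation from k steps ago
```

When an episode ends, the k observations still in the buffer are never delivered, which matches the statement that the agent does not receive the last k observations.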
Acrobot:
Episodic task
Action Space:
full reverse, neutral, full forward
a ∈ [0,2]
|a| = 3
Observation Space:
angle and angular velocity of two joints
[θ1, θ2, θ1_dot, θ2_dot]
{θ1, θ2} not restricted
θ1_dot ∈ [-4π, 4π]
θ2_dot ∈ [-9π, 9π]
Reward:
-1 per time step
Performance measure:
cumulative reward
Task spec gives suggested operational ranges for θ1,
θ2, but values are not bounded in code. θ1_dot and θ2_dot are bounded
as above.
Emax = 500
Smax = 500
3 Unknown Environments ...
Rmax = 5
Rmin = -500
Emax = 500
Smax = 500
Participants will be allowed
to submit to any number of events and solve any number of problems
within the Discrete and Continuous events.
Prizes will be awarded to participants as follows:
Best Cat and Mouse Agent - Certificate
Best Tetris Agent - Certificate
Best GARNET Agent - Certificate
Best Non-stationary Mountain Car Agent - Certificate
Best Puddle World Agent - Certificate
Best Cart-pole Agent - Certificate
first place -- iPod Nano + Certificate
second place -- Certificate
third place -- Certificate
Adam White, University of Alberta, Alberta, Canada (chair)
Richard S. Sutton, University of Alberta, Alberta, Canada
Michael L. Littman, Rutgers University, New Jersey, USA
Doina Precup, McGill University, Montreal, Canada
Peter Stone, University of Texas, Austin, Texas, USA
Technical Organization Committee
Andrew Butcher, University of Alberta, Alberta, Canada