Reinforcement Learning Competition
NIPS Workshop: The First Annual Reinforcement Learning Competition
This page is the official page of the above-named workshop, to be held December 9, 2006, at Whistler, BC, Canada, in conjunction with the 2006 Neural Information Processing Systems Conference.
Edited by Adam White
Workshop News:
- Competition has officially begun: 26th October 2006
- Competition has officially ended: 7th December 2006
Official Results:
The Winners are:
Cat and Mouse: University of Texas, Austin
Garnet: University of Texas, Austin
Tetris: University of Texas, Austin
Cart-Pole: University of Osnabrueck, Germany
Non-stationary Mountain Car: University of Newcastle, Australia
Puddle World: Rutgers University, USA
The Pentathlon: Rutgers University, USA
Workshop presentation of results can be found here.
Workshop Description
Competitions can greatly increase the interest and focus in an area by clarifying its objectives and challenges, publicly acknowledging the best algorithms, and generally making the area more exciting and enjoyable. The First Annual Reinforcement Learning Competition is ongoing and will conclude six days before this workshop (December 3rd). The purpose of this one-day workshop is to 1) report the results of the competition, 2) summarize the experiences of and the solution approaches employed by the competition participants, and 3) plan for future competitions.
There have been two previous events within the machine learning community that involved comparing different reinforcement learning methods: one at a NIPS 2005 workshop and one at an ICML 2006 workshop. Last year's event at NIPS received over 30 submissions from 9 countries. This year's competition differs from those events in that it is explicitly competitive. The winners will be invited to describe their approach to each problem at the workshop, improving attendees' expertise in applying and implementing reinforcement learning systems.
This year's competition will feature seven events: three discrete observation problems, three continuous observation problems, and a final Pentathlon event, in which each participant's agent will be evaluated on five continuous observation problems consecutively. The agents that perform best across all five problems will be awarded small prizes.
Competition Logistics
The competition software will be based on RL-Glue and the environments will be drawn from the Reinforcement Learning Library (RL-Library). Please see the documentation on the RL-Glue website and the example code in RL-Library to ensure that your agent code is compatible with RL-Glue.
The competition events are divided into two classes, discrete and
continuous. Participants can submit agents
that are specifically designed for each event. The discrete events are:
Cat and Mouse
Tetris
Garnet
The continuous events are:
Non-stationary Mountain Car
Cart-Pole
Puddle World
In the Pentathlon your agent will be tested on 5 problems consecutively. For local testing, only 2 of the 5 problems will be available; your agent will be tested on 3 unknown environments during the competition evaluation. The first two problems used in the Pentathlon will be Mountain Car and Acrobot, described below. The other 3 will have continuous observations (array of doubles) and discrete actions (integer).
The scoring of the Pentathlon will be done as follows: let n be the number of teams competing in the Pentathlon (data-files on the server). For each of the five problems, the teams will be ranked first to last according to highest cumulative reward. Each team will be assigned points based on its rank on each problem. The team with the highest point total will be declared the winner. For example, if we had 5 teams:
Team | Mountain Car   | Acrobot        | Unknown        | Unknown        | Unknown
T1   | 1st (5 points) | 3rd (3 points) | 3rd (3 points) | 1st (5 points) | 3rd (3 points)
T2   | 4th (2 points) | 2nd (4 points) | 4th (2 points) | 2nd (4 points) | 4th (2 points)
T3   | 2nd (4 points) | 5th (1 point)  | 1st (5 points) | 3rd (3 points) | 5th (1 point)
T4   | 5th (1 point)  | 4th (2 points) | 5th (1 point)  | 4th (2 points) | 2nd (4 points)
T5   | 3rd (3 points) | 1st (5 points) | 2nd (4 points) | 5th (1 point)  | 1st (5 points)

Summary:
T1: 19 points | T2: 14 points | T3: 14 points | T4: 10 points | T5: 18 points
Team 1 finishes first, Team 5 finishes second, and Teams 2 and 3 tie for third. This evaluation scheme extends to n teams by assigning n points to first place on each problem, n-1 points to second, and so on.
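For concreteness, here is a small sketch of the points calculation (an illustration only, not the scoring code used on the server). The ranks are the ones from the example table above, and with n teams a rank of r on a problem is worth n - r + 1 points:

#include <iostream>
#include <vector>

int main() {
    // Ranks (1 = highest cumulative reward) of the 5 example teams on the 5 pentathlon problems.
    const int n = 5;  // number of competing teams
    std::vector<std::vector<int>> ranks = {
        {1, 3, 3, 1, 3},   // T1
        {4, 2, 4, 2, 4},   // T2
        {2, 5, 1, 3, 5},   // T3
        {5, 4, 5, 4, 2},   // T4
        {3, 1, 2, 5, 1}    // T5
    };
    for (int t = 0; t < n; ++t) {
        int points = 0;
        for (int r : ranks[t])
            points += n - r + 1;   // first place earns n points, second n-1, and so on
        std::cout << "T" << (t + 1) << ": " << points << " points\n";
    }
    return 0;
}

Running this reproduces the summary above: 19, 14, 14, 10, and 18 points for T1 through T5.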
------------------------
General Setting
------------------------
Definitions:
Competition run - one connection to the server. Each competition run consists of 30 experiments (runs) of an agent on an environment.
Run - each run consists of Emax episodes (set below for each problem). Results are averaged across the 30 runs.
Each of the Emax episodes within a run will begin at a random starting state. An episode ends if either a terminal state is reached or Smax steps are taken (with no punishment assigned to reaching this threshold). Smax is set for each problem below. Agents cannot terminate episodes prematurely. We have chosen not to explicitly separate training and testing phases: each solver should strive to maximize the reward it accumulates over all Emax episodes of a run.
The environments running on the competition server will be different from the local test copies distributed with the competition software. For instance, in the Cat and Mouse event, the policy of the cat or the size of the maze may change. Participants are encouraged to test their agents on a variety of problem configurations before benchmarking on the server. The details of what will be changed, and how, will not be specified. The server environments will have stationary dynamics unless otherwise stated in the problem description.
Agents can try to transfer knowledge gained during one run to the next, but the starting states and problem parameters may vary from run to run.
The beginning of a run is signaled to the agent through a call to the agent_init function, which should re-initialize the agent. The agent is also passed the Task_spec string as a parameter to agent_init; this encodes information about the environment the agent is being tested on. Finally, the agent_cleanup function is called to signal to the agent that the current run has finished. For more information please see the RL-Glue documentation.
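As a rough sketch, an agent for the C/C++ client therefore implements functions along the following lines. This is only an outline: the exact signatures and type names are defined by the RL-Glue headers and RLcommon.h in the competition software, and the Observation/Action typedefs shown here correspond to a discrete event as in the examples below.

#include "RLcommon.h"   // defines Observation, Action, Reward and the Task_spec type for the chosen event

// Called at the beginning of each run; task_spec describes the environment being run.
void agent_init(Task_specification task_spec) {
    // re-initialize the agent: parse task_spec, allocate learning data structures
}

// Called at the first step of every episode; returns the first action.
Action agent_start(Observation o) {
    return 0;   // a trivial "ZeroAgent" always picks action 0
}

// Called on every subsequent step, with the reward for the previous action.
Action agent_step(Reward r, Observation o) {
    return 0;   // learning update and action selection would go here
}

// Called when an episode reaches a terminal state.
void agent_end(Reward r) {
    // final learning update for the episode
}

// Called at the end of the run.
void agent_cleanup() {
    // free any allocated memory
}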
The evaluation methodology described above roughly corresponds to the following pseudocode:
NUM_RUNS = 30
for run = 1:NUM_RUNS
    srand(run)
    RL_init()            // applies small random variations to the environment and generates the start-state set
    for episode = 1:Emax
        RL_episode(Smax) // runs a single episode with a cutoff of Smax steps
    end for
    RL_cleanup()
end for
How to Compete
- Email to obtain a user name and password for your team
- Download the competition software
- Implement your agents according to the RL-Glue standard (sample agents are included in the competition software)
- Once you are ready to make a competition run, connect to the server
- Repeat for each task you wish to solve
Note: You do not need a valid user name and password to do local testing with the competition software.
Competition Rules
- Each team is allowed only one account. The organizers reserve the right to restrict access to any individual or team.
- Each team is allowed ONLY 10 competition runs for each environment on the server.
- On December 3rd the server will be deactivated and (1) in the 6 individual events, the BEST (maximum cumulative reward) data-file for each team-environment pair will be entered in the competition, and (2) for the Pentathlon, the MOST RECENT data-file for each team will be entered in the competition.
- Users will not be allowed to access data-files on the server. However, competition runs will output descriptive stats to the terminal.
Since users are only allowed 10 connections per environment, participants are encouraged to test their agents locally (see instructions in the software download). Only connect to the server when you are satisfied with the performance of your agent for a variety of parameter settings for a given environment-benchmark pair.
Software Download:
CompetitionSoftwareV2.4.3 [tar.gzip file, documentation within.]
Release Date: 22nd November 2006
Description: Competition software with small bug fixes -- highly recommended
Changes:
- Acrobot task spec was not allocating enough memory. This caused memory faults on some platforms.
How to Interact with the Server
Here we describe how to connect a C agent and a Java agent to the competition environments on the server. Evaluating an agent "remotely" on the server is basically the same as evaluating it "locally", except that the environment side is already running on the server. To evaluate an agent remotely, follow these steps:
1. Edit RLcommon.h to match the environment types
2. Compile your agent with the correct client
3. Run the client (Agent)
For example, to evaluate the ZeroAgent.cpp on the Cat and Mouse problem:
Open a terminal and go to the Client directory.
Edit RLcommon.h:
typedef int* Observation;
typedef int Action;
Edit makeAgentC:
AGENTV = ZeroAgent.cpp
Compile:
>> make -f makeAgentC clean
>> make -f makeAgentC
Run:
./RL_client 129.128.22.27 3490 (3490 is the port
number for the cat and mouse environment)
----
To evaluate the ZeroAgent.java on the Cart-pole problem:
Open a terminal and go to the Client directory.
Edit RLcommon.h:
typedef double* Observation;
typedef int Action;
Edit makeAgentJava:
AGENTV = ZeroAgent.java
Compile:
>> make -f makeAgentJava clean
>> make -f makeAgentJava
Run:
java ClientAgent ZeroAgent 129.128.22.27
3495 (3495 is the port number for the cart-pole environment)
Your agent class name (ZeroAgent in this example) must be passed as the
first parameter to ClientAgent.
----
For example, to evaluate the ZeroAgent.cpp on the Pentathlon event:
Open a terminal and go to the Client directory.
Edit RLcommon.h:
typedef double* Observation;
typedef int Action;
Edit makeAgentC:
AGENTV = ZeroAgent.cpp
Compile:
>> make -f makeAgentC clean
>> make -f makeAgentC
Run:
./RL_client 129.128.22.27 3496 (3496 is the port number for the Pentathlon)
----
For example, to evaluate the ZeroAgent.java on the Pentathlon event:
Open a terminal and go to the Client directory.
Edit RLcommon.h:
typedef double* Observation;
typedef int Action;
Edit makeAgentJava:
AGENTV = ZeroAgent.java
Compile:
>> make -f makeAgentJava clean
>> make -f makeAgentJava
Run:
java ClientAgent ZeroAgent 129.128.22.27 3496 (3496 is the port number for the Pentathlon)
----------------------------------------------------------------------------
Each port corresponds to a different environment as follows:
----------------------------------------------------------------------------
Cat and Mouse               | 3490
Tetris                      | 3491
Garnet                      | 3492
Non-stationary Mountain Car | 3493
Puddle World                | 3494
Cart-Pole                   | 3495
Pentathlon                  | 3496
Competition Problems
------------------
Discrete
------------------
General Cat and Mouse (n x n grid):
There are several solid obstacles, several stationary pieces of cheese,
one mouse hole and a non-stationary cat. The cat moves (after the mouse
moves) to minimize its distance to the mouse, choosing randomly between
equal quality moves. The cat cannot see or move to the mouse if it is
hiding in its hole. Each map is randomly generated (on init) at the
beginning of each trial, but does not change between consecutive
episodes. The agent gets a small positive reward for each step that the
mouse occupies the same grid space as a piece of cheese and a large
negative reward if the cat and mouse occupy the same grid space. An
episode ends when the cat catches the mouse. The agent's objective is
to navigate the mouse through the maze, collecting as much cheese
as possible while avoiding the cat.
This is a generalization of Robert
Schapire's formulation.
Episodic task
Action Space:
mouse movement
a ∈ [0,7]
|a| = 8
Observation Space:
mouse's position, cat's position, and whether the mouse is in its hole
[mp, cp, flag] where {mp, cp} ∈ [0, n*n - 1] and flag ∈ [0,1]
Reward:
if [{mp} == {cheese position}] then
+5
if [{cp} == {mp}] then -100
0 otherwise
Performance measure:
cumulative reward
Rmax = +5
Rmin = -100
Emax = 10,000
Smax = 300
Tetris:
Episodic task
Action Space:
move piece left, move piece right, rotate clockwise, rotate counterclockwise, null action
a ∈ [0,4]
|a| = 5
Observation Space:
full game board (10 x 20)
binary vector (1 x 200)
|observation| = 200
Reward:
+1 for each line eliminated per time
step
-100 for death
0 otherwise
Performance measure:
cumulative reward
Rmax = 2
Rmin = -100
Emax = 10,000
Smax = 500
GARNET:
The Garnet environment is a randomly generated MDP (random states, actions, and rewards) with probabilistic state transitions and a non-stationary element: every k iterations the transitions are changed by randomly deleting n state connections and creating n new links between previously unconnected states. The agent is provided with a random binary observation vector that is not large enough to uniquely identify states. The agent's objective in this continuing task is to locate, and try to stay in, regions of the MDP that yield the highest average reward.
Non-stationary, random MDP
Continuing task
Action Space:
discrete actions, stochastic transitions
a ∈ [0, n-1]
n is a problem parameter
Observation Space:
vector of active binary feature indices
e.g., if state = 5
random feature vector: [00010001000110000001]
observation: [3,7,11,12,19]
Reward:
normal distribution with mean 0 and
variance 1
Performance measure:
average reward
Rmax = inf
Rmin = -inf
Smax = 10,000
Emax = none (continuing task)
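To make the observation format concrete, here is a small sketch (an illustration only) of how the active-index observation relates to the underlying binary feature vector, using the example above:

#include <iostream>
#include <string>
#include <vector>

int main() {
    // Binary feature vector for some state (the example from the spec).
    std::string features = "00010001000110000001";

    // The observation is the list of indices whose feature bit is set.
    std::vector<int> observation;
    for (int i = 0; i < (int)features.size(); ++i)
        if (features[i] == '1')
            observation.push_back(i);

    for (int idx : observation)
        std::cout << idx << " ";   // prints: 3 7 11 12 19
    std::cout << "\n";
    return 0;
}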
-----------------------
Continuous
-----------------------
Non-stationary Mountain Car:
The force of gravity changes every k steps.
Episodic task
Action Space:
full reverse, neutral, full forward
a ∈ [0,2]
|a| = 3
Observation Space:
position and velocity
[x, x_dot]
x ∈ [-1.2, 0.6], x_dot ∈ [-0.07, 0.07]
Reward:
-100 if velocity > 0.0005 and position > 0.5 (termination occurs)
0 if velocity <= 0.0005 and position > 0.5 (termination occurs)
-1 otherwise
Performance measure:
cumulative reward
Rmax = 0
Rmin = -100
Emax = 500
Smax = 500
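A small sketch of the reward and termination rule above (an illustration only, not the server's environment code; the helper name is hypothetical):

// Reward and termination for a single step, given the car's current position and velocity.
struct StepResult { double reward; bool terminal; };

StepResult mountainCarReward(double position, double velocity) {
    if (position > 0.5) {                // the car has passed the goal position
        if (velocity > 0.0005)
            return { -100.0, true };     // arrived too fast: penalty, episode terminates
        return { 0.0, true };            // arrived within the velocity limit: episode terminates
    }
    return { -1.0, false };              // every other step costs -1
}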
Cart-pole:
Episodic task
Action Space:
discretized forces applied to cart
a ∈ [-10,10]
|a| = 21
Observation Space:
pole angle (from vertical), pole
angular velocity, cart position (from center), cart velocity
θ ∈ [-π/6, π/6]
θ_dot ∈ [-5, 5]
x ∈ [-2.4, 2.4]
x_dot ∈ [-10, 10]
Reward:
+2 if |pole angle| <= π/60 and |cart position| <= 0.05 {NEW}
0 if |pole angle| >= π/6 or |cart position| >= 2.4 (termination occurs)
+1 otherwise
Performance measure:
cumulative reward
Rmax = 2
Rmin = 0
Emax = 500
Smax = 500
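A corresponding sketch of the Cart-Pole reward rule (an illustration only, not the server's environment code; the helper name is hypothetical):

// Reward and termination for a single step, given pole angle (radians) and cart position.
struct CartPoleResult { double reward; bool terminal; };

CartPoleResult cartPoleReward(double poleAngle, double cartPosition) {
    const double PI = 3.14159265358979323846;
    if (poleAngle <= -PI / 6 || poleAngle >= PI / 6 ||
        cartPosition <= -2.4 || cartPosition >= 2.4)
        return { 0.0, true };            // pole fell or cart left the track: episode terminates
    if (poleAngle >= -PI / 60 && poleAngle <= PI / 60 &&
        cartPosition >= -0.05 && cartPosition <= 0.05)
        return { 2.0, false };           // bonus for balancing near vertical at the centre
    return { 1.0, false };               // otherwise +1 per balanced step
}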
Puddle World:
Episodic task
Action Space:
move agent north, east, south, or west -- stochastic
a ∈ [0,3]
|a| = 4
Observation Space:
agent grid position
{x, y} ∈ [0, 1], real-valued
Reward:
-1 per step, plus -400 * (distance inside the puddle, ∈ [0, 0.1]) if the agent is in a puddle
Performance measure:
cumulative reward
Rmax = -1
Rmin = -401
Emax = 500
Smax = 500
----------------------
Pentathlon Event:
----------------------
Delayed Mountain Car:
Observations are delayed by k steps. That is, the observation given to the agent on a given step is the state of the environment from k steps in the past. The first k observations are random, and the agent does not receive the last k observations before termination.
Episodic task
Action Space:
full reverse, neutral, full forward
a ∈ [0,2]
|a| = 3
Observation Space:
position and velocity
[x, x_dot]
x ∈ [-1.2, 0.6], x_dot ∈ [-0.07, 0.07]
Reward:
-1 per time step
Performance measure:
cumulative reward
Emax = 500
Smax = 500
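The delayed-observation mechanism described above can be pictured with a simple buffer (an illustrative sketch of the idea only, not the server's implementation; the class and helper names are placeholders):

#include <cstdlib>
#include <deque>
#include <vector>

// Keeps the last k true states and hands the agent the state from k steps ago.
class DelayBuffer {
public:
    explicit DelayBuffer(int k) : k_(k) {}

    std::vector<double> observe(const std::vector<double>& trueState) {
        buffer_.push_back(trueState);
        if ((int)buffer_.size() <= k_) {
            // No state from k steps ago exists yet, so the first k observations are random.
            return { randomIn(-1.2, 0.6), randomIn(-0.07, 0.07) };
        }
        std::vector<double> delayed = buffer_.front();
        buffer_.pop_front();
        return delayed;   // the state the environment was in k steps earlier
    }

private:
    static double randomIn(double lo, double hi) {
        return lo + (hi - lo) * std::rand() / RAND_MAX;
    }
    int k_;
    std::deque<std::vector<double>> buffer_;
};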
Acrobot:
Episodic task
Action Space:
full reverse, neutral, full forward
a ∈ [0,2]
|a| = 3
Observation Space:
angle and angular velocity of two joints
[θ1, θ2, θ1_dot, θ2_dot ]
{θ1, θ2} not restricted
θ1_dot ∈ [-4π, 4π]
θ2_dot ∈ [-9π, 9π]
Reward:
-1 per time step
Performance measure:
cumulative reward
Task spec gives suggested operational ranges for θ1,
θ2, but values are not bounded in code. θ1_dot and θ2_dot are bounded
as above.
Emax = 500
Smax = 500
3 Unknown Environments ...
Rmax = 5
Rmin = -500
Emax = 500
Smax = 500
Participants will be allowed
to submit to any number of events and solve any number of problems
within the Discrete and Continuous events.
Competition Prizes
Prizes will be awarded to participants as follows:
Best Cat and Mouse Agent - Certificate
Best Tetris Agent - Certificate
Best GARNET Agent - Certificate
Best Non-stationary Mountain Car Agent - Certificate
Best Puddle World Agent - Certificate
Best Cart-pole Agent - Certificate
Pentathlon:
first place -- iPod Nano + Certificate
second place -- Certificate
third place -- Certificate
Organization Committee
Adam White, University of Alberta, Alberta, Canada (chair)
Richard S. Sutton, University of Alberta, Alberta, Canada
Michael L. Littman, Rutgers University, New Jersey, USA
Doina Precup, McGill University, Montreal, Canada
Peter Stone, University of Texas, Austin, Texas, USA
Technical Organization Committee
Andrew Butcher, University of Alberta, Alberta, Canada