

NIPS Workshop: The First Annual
Reinforcement Learning Competition




This is the official page of the above-named workshop, to be held December 9, 2006 in Whistler, BC, Canada, in conjunction with the 2006 Neural Information Processing Systems Conference.

Edited by Adam White

Workshop News:



Official Results:
The Winners are:

    Cat and Mouse
        University of Texas, Austin
    Garnet
        University of Texas, Austin
    Tetris
        University of Texas, Austin
    Cart-Pole
        University of Osnabrueck, Germany
    Non-stationary Mountain Car
        University of Newcastle, Australia
    Puddle World
        Rutgers University, USA
    The Pentathlon Winner
        Rutgers University, USA

Workshop presentation of results can be found here.

Workshop Description

Competitions can greatly increase the interest and focus in an area by clarifying its objectives and challenges, publicly acknowledging the best algorithms, and generally making the area more exciting and enjoyable. The First Annual Reinforcement Learning Competition is ongoing and will reach its culmination six days before this workshop (December 3rd). The purpose of this one-day workshop is to 1) report the results of the competition, 2) summarize the experiences of and solution approaches employed by the competition participants, and 3) plan for future competitions.

There have been two previous events within the machine learning community that involved comparing different reinforcement learning methods, one at a NIPS 2005 workshop and one at an ICML 2006 workshop. Last year's event at NIPS had over 30 submissions from 9 countries. This competition differs from those events in that it is explicitly competitive. The winners will be invited to describe their approach to each problem at the workshop, improving attendees' expertise in applying and implementing reinforcement learning systems. This year's competition features seven events: three discrete-observation problems, three continuous-observation problems, and a final Pentathlon in which each participant's agent is evaluated on five continuous-observation problems consecutively. The agents that perform best across all five Pentathlon problems will be awarded small prizes.

back to top

Competition Logistics


The competition software will be based on RL-Glue, and the environments will be drawn from the Reinforcement Learning Library. Please see the documentation on the RL-Glue website and the example code in RL-Library to ensure that your agent code is compatible with RL-Glue. The competition events are divided into two classes, discrete and continuous. Participants can submit agents that are specifically designed for each event. The discrete events are:

Cat and Mouse
Tetris
Garnet

The continuous events are:

Non-stationary Mountain Car
Cart-Pole
Puddle World


In the Pentathlon your agent will be tested on 5 problems consecutively. For local testing, only 2 of the 5 problems will be available; your agent will be tested on 3 unknown environments during competition evaluation. The first two problems used in the Pentathlon will be Mountain Car and Acrobot, described below. The other 3 will have continuous observations (an array of doubles) and discrete actions (an integer).

The scoring of the pentathlon will be done as follows:

Let n be the number of teams competing in the Pentathlon (i.e., the number of team data files on the server). For each of the five problems, the teams will be ranked first to last by cumulative reward, highest first. Each team is assigned points based on its rank on each problem, and the team with the highest point total is declared the winner. For example, with 5 teams:

Team    Mountain Car     Acrobot          Unknown 1        Unknown 2        Unknown 3
T1      1st (5 points)   3rd (3 points)   3rd (3 points)   1st (5 points)   3rd (3 points)
T2      4th (2 points)   2nd (4 points)   4th (2 points)   2nd (4 points)   4th (2 points)
T3      2nd (4 points)   5th (1 point)    1st (5 points)   3rd (3 points)   5th (1 point)
T4      5th (1 point)    4th (2 points)   5th (1 point)    4th (2 points)   2nd (4 points)
T5      3rd (3 points)   1st (5 points)   2nd (4 points)   5th (1 point)    1st (5 points)

Summary:

Team    Total points
T1      19
T2      14
T3      14
T4      10
T5      18

Team 1 finishes first, Team 5 finishes second, and Teams 2 and 3 tie for third. This evaluation scheme extends to n teams by assigning n points to first place on each problem, n-1 points to second place, and so on.
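
To make the rule above concrete, the following C++ sketch (not official competition code; the cumulative-reward numbers are made up purely for illustration) ranks each team on each problem by cumulative reward and converts the ranks to points exactly as described.

    // Sketch of the pentathlon scoring rule described above (not official
    // competition code).  For each problem the n teams are ranked by cumulative
    // reward; first place earns n points, second place n-1, and so on.  The
    // reward values below are made up purely for illustration.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const std::vector<std::string> teams = {"T1", "T2", "T3", "T4", "T5"};
        // cumulativeReward[t][p] = cumulative reward of team t on problem p.
        const std::vector<std::vector<double>> cumulativeReward = {
            {-110.0, -250.0, -240.0, -120.0, -260.0},   // T1
            {-400.0, -210.0, -300.0, -180.0, -310.0},   // T2
            {-200.0, -410.0, -150.0, -260.0, -400.0},   // T3
            {-500.0, -350.0, -350.0, -300.0, -240.0},   // T4
            {-300.0, -180.0, -200.0, -350.0, -100.0},   // T5
        };
        const std::size_t n = teams.size();
        std::vector<std::size_t> points(n, 0);

        for (std::size_t p = 0; p < cumulativeReward[0].size(); ++p) {
            // Order team indices by cumulative reward on problem p, highest first.
            std::vector<std::size_t> order(n);
            for (std::size_t t = 0; t < n; ++t) order[t] = t;
            std::sort(order.begin(), order.end(),
                      [&](std::size_t a, std::size_t b) {
                          return cumulativeReward[a][p] > cumulativeReward[b][p];
                      });
            // The team ranked r-th (0-based) receives n - r points.
            for (std::size_t r = 0; r < n; ++r) points[order[r]] += n - r;
        }

        for (std::size_t t = 0; t < n; ++t)
            std::cout << teams[t] << ": " << points[t] << " points" << std::endl;
        return 0;
    }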

------------------------
General Setting
------------------------

Definitions:
    Competition run - one connection to the server. Each competition run consists of 30 runs of an agent on an environment.
    Run - a single experiment consisting of Emax episodes (Emax is set below for each problem). Results are averaged across the 30 runs.

Each of the Emax episodes within a run begins at a random starting state. An episode ends either when a terminal state is reached or when Smax steps have been taken (no punishment is assigned when this cutoff is hit); Smax is set for each problem below. Agents cannot terminate episodes prematurely. We have chosen not to explicitly separate training and testing phases: each agent should strive to maximize the reward it accumulates over all Emax episodes.
 
The environments running on the competition server will be different from the local test copies distributed with the competition software. For instance, in the cat and mouse event, the policy of the cat may change or the size of the maze may change. Participants are encouraged to test their agents on a variety of problem configurations before benchmarking on the server. The details of what will be changed and how will not be specified. The server environments will have stationary dynamics unless otherwise stated in the problem description.

Agents can try to transfer knowledge gained during one run to the next, but the starting states and problem parameters may vary from run to run.

The beginning of a run is signaled to the agent through a call to the agent_init function, which should re-initialize the agent. The agent is also passed the Task_spec string as a parameter to agent_init; this string encodes information about the environment the agent is being tested on. Finally, the agent_cleanup function is called to signal to the agent that the current run has finished. For more information, please see the RL-Glue documentation.

The evaluation methodology, described above, roughly corresponds to the following pseudo code:

NUM_RUNS = 30

for run = 1:NUM_RUNS
    srand(run)
    RL_init()            --> applies small random variations to the environment and generates the start set
    for episode = 1:Emax
        RL_episode(Smax) --> runs a single episode with a cutoff of Smax steps
    end for
    RL_cleanup()         --> signals the end of the run
end for
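
On the agent side, these calls arrive through the RL-Glue agent interface. Below is a minimal C++ sketch of a random-action agent for one of the discrete events. agent_init and agent_cleanup are described above; the remaining callback names (agent_start, agent_step, agent_end) and the Reward and Task_specification typedefs are taken from the RL-Glue documentation and should be checked against the sample agents shipped with the competition software. The Observation and Action typedefs normally come from RLcommon.h (as edited in the examples further down this page) and are repeated here only to keep the sketch self-contained; the agent is compiled and linked with the supplied client (makeAgentC), not run on its own.

    /* Minimal random-action agent sketch for one of the discrete events.
     * agent_init and agent_cleanup are described above; the remaining callbacks
     * (agent_start, agent_step, agent_end) and the Reward / Task_specification
     * typedefs follow the RL-Glue documentation and should be checked against
     * the sample agents in the competition software. */
    #include <cstdlib>

    typedef int*   Observation;        // as edited in RLcommon.h for the discrete events
    typedef int    Action;
    typedef double Reward;             // assumed
    typedef char*  Task_specification; // assumed

    static int numActions = 8;         // e.g. the 8 mouse moves in Cat and Mouse

    void agent_init(Task_specification task_spec) {
        // Start of a run: re-initialize the agent.  A real agent would parse
        // task_spec here to recover the action and observation ranges; this
        // sketch keeps a hard-coded action count.
        (void)task_spec;
        std::srand(0);
    }

    Action agent_start(Observation o) {
        // First step of an episode: choose an action from the first observation.
        (void)o;
        return std::rand() % numActions;
    }

    Action agent_step(Reward r, Observation o) {
        // A learning agent would update its value estimates from r and o here.
        (void)r; (void)o;
        return std::rand() % numActions;
    }

    void agent_end(Reward r) {
        // The episode reached a terminal state; r is the final reward.
        (void)r;
    }

    void agent_cleanup() {
        // End of the run: free anything allocated in agent_init.
    }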

back to top


How to compete

  1. Email the organizers (Adam White) to obtain a user name and password for your team
  2. Download the competition software
  3. Implement your agents according to the RL-Glue standard (sample agents are included in the competition software)
  4. Once you are ready to make a competition run, connect to the server
  5. Repeat for each task you wish to solve
Note: You do not need a valid user name and password to do local testing with the competition software.
back to top

Competition Rules

Since users are allowed only 10 connections per environment, participants are encouraged to test their agents locally (see the instructions in the software download). Only connect to the server when you are satisfied with your agent's performance across a variety of parameter settings for a given environment-benchmark pair.

back to top


Software


Download: CompetitionSoftwareV2.4.3  [tar.gzip file, documentation within.]
Release Date: 22nd November 2006
Description: Competition software with small bug fixes -- Highly recommended
Changes:
back to top

How to Interact with the Server


Here we describe how to connect a C/C++ agent and a Java agent to the competition environments on the server.

Evaluating an agent "remotely" on the server is essentially the same as evaluating it "locally", except that the environment runs on the server rather than on your machine. To evaluate an agent remotely, follow these steps:

1. Edit RLcommon.h to match the environment types
2. Compile your agent with the correct client
3. Run the client (Agent)

For example, to evaluate the ZeroAgent.cpp on the Cat and Mouse problem:

Open a terminal and go to the Client directory.

Edit RLcommon.h:
    typedef int* Observation; 
    typedef int Action; 

Edit makeAgentC:
    AGENTV = ZeroAgent.cpp

Compile:
    >> make -f makeAgentC clean
    >> make -f makeAgentC

Run:
    ./RL_client 129.128.22.27 3490 (3490 is the port number for the cat and mouse environment)

----

To evaluate the ZeroAgent.java on the Cart-pole problem:

Open a terminal and go to the Client directory.

Edit RLcommon.h:
    typedef double* Observation; 
    typedef int Action; 

Edit makeAgentJava:
    AGENTV = ZeroAgent.java

Compile:
    >> make -f makeAgentJava clean
    >> make -f makeAgentJava


Run:
    java ClientAgent ZeroAgent 129.128.22.27 3495   (3495 is the port number for the cart-pole environment)

Your agent class name (ZeroAgent in this example) must be passed as the first parameter to ClientAgent.

----

For example, to evaluate the ZeroAgent.cpp on the Pentathlon event:

Open a terminal and go to the Client directory.

Edit RLcommon.h:
    typedef double* Observation; 
    typedef int Action; 

Edit makeAgentC:
    AGENTV = ZeroAgent.cpp

Compile:
    >> make -f makeAgentC clean
    >> make -f makeAgentC

Run:
    ./RL_client 129.128.22.27 3496 (3496 is the port number for the Pentathlon)

----

For example, to evaluate the ZeroAgent.java on the Pentathlon event:

Open a terminal and go to the Client directory.

Edit RLcommon.h:
    typedef double* Observation; 
    typedef int Action; 

Edit makeAgentJava:
    AGENTV = ZeroAgent.java

Compile:
    >> make -f makeAgentJava clean
    >> make -f makeAgentJava

Run:
    java ClientAgent ZeroAgent 129.128.22.27 3496   (3496 is the port number for the Pentathlon)

----------------------------------------------------------------------------
Each port corresponds to a different environment as follows:
----------------------------------------------------------------------------

Cat and Mouse                  3490
Tetris                         3491
Garnet                         3492
Non-stationary Mountain Car    3493
Puddle World                   3494
Cart-Pole                      3495
Pentathlon                     3496


back to top

Competition Problems

------------------
Discrete
------------------

General Cat and Mouse (n x n grid):

There are several solid obstacles, several stationary pieces of cheese, one mouse hole, and a non-stationary cat. The cat moves (after the mouse moves) to minimize its distance to the mouse, choosing randomly between equal-quality moves. The cat cannot see or move to the mouse while the mouse is hiding in its hole. Each map is randomly generated (on init) at the beginning of each run but does not change between consecutive episodes. The agent gets a small positive reward for each step that the mouse occupies the same grid space as a piece of cheese and a large negative reward if the cat and mouse occupy the same grid space. An episode ends when the cat catches the mouse. The agent's objective is to navigate the mouse through the maze, collecting as much cheese as possible while avoiding the cat.
 
This is a generalization of Robert Schapire's formulation.

    Episodic task
    Action Space:
        mouse movement
        a ∈ [0,7]
        |a| = 8
    Observation Space:
        mouse's position, cat's position, and whether the mouse is in the hole
        [mp, cp, flag] where {mp, cp} ∈ [0, n*n - 1] and flag ∈ {0,1}
        (a decoding sketch follows this specification)
    Reward:
        +5 if mp == cheese position
        -100 if cp == mp
        0 otherwise
    Performance measure:
        cumulative reward
    Rmax = +5
    Rmin = -100
    Emax = 10,000
    Smax = 300
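
As referenced in the specification above, here is a small C++ sketch of unpacking a Cat and Mouse observation. It assumes the three values arrive in the order shown, [mp, cp, flag], and that cell indices are laid out row-major on the n x n grid; neither detail is stated explicitly above, so verify both against the local test environment.

    // Unpack a Cat and Mouse observation [mp, cp, flag] into grid coordinates.
    // Assumptions (not stated in the problem description): o[0] is the mouse
    // cell, o[1] the cat cell, o[2] the in-hole flag, and cells are indexed
    // row-major on the n x n grid.
    #include <cstdio>

    typedef int* Observation;   // mirrors the RLcommon.h edit for discrete events

    struct CatMouseView {
        int mouseRow, mouseCol;
        int catRow, catCol;
        int inHole;
    };

    CatMouseView decodeCatAndMouse(Observation o, int n) {
        CatMouseView v;
        v.mouseRow = o[0] / n;  v.mouseCol = o[0] % n;
        v.catRow   = o[1] / n;  v.catCol   = o[1] % n;
        v.inHole   = o[2];
        return v;
    }

    int main() {
        int obs[3] = {17, 42, 0};                 // made-up observation on a 10 x 10 grid
        CatMouseView v = decodeCatAndMouse(obs, 10);
        std::printf("mouse (%d,%d)  cat (%d,%d)  in hole: %d\n",
                    v.mouseRow, v.mouseCol, v.catRow, v.catCol, v.inHole);
        return 0;
    }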

Tetris:
    Episodic task
    Action Space:
        move piece left, move piece right, rotate clockwise, rotate counterclockwise, null action
        a ∈ [0,4]
        |a| = 5
    Observation Space:
        full game board (10 x 20)
        binary vector (1 x 200)
        |observation| = 200
        (see the board-indexing sketch after this specification)
    Reward:
        +1 for each line eliminated per time step
        -100 for death
        0 otherwise
    Performance measure:
        cumulative reward
    Rmax = 2
    Rmin = -100
    Emax = 10,000
    Smax = 500
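
As noted in the Tetris specification, the observation is the 10 x 20 board flattened into a length-200 binary vector. The C++ sketch below reads it as a row-major board with 10 columns per row; the ordering is an assumption, not something the specification states, so check it against the local test environment.

    // View the flat Tetris observation (length-200 binary vector) as a 10 x 20
    // board.  Row-major ordering with 10 columns per row is an assumption, not
    // something the problem description specifies.
    #include <cstdio>

    typedef int* Observation;   // mirrors the RLcommon.h edit for discrete events

    const int BOARD_COLS = 10;
    const int BOARD_ROWS = 20;

    // 1 if the cell at (row, col) is occupied, 0 otherwise.
    int cellAt(Observation o, int row, int col) {
        return o[row * BOARD_COLS + col];
    }

    int main() {
        int board[BOARD_COLS * BOARD_ROWS] = {0};  // made-up empty board
        board[19 * BOARD_COLS + 4] = 1;            // one occupied cell in the bottom row
        std::printf("cell (19,4) occupied: %d\n", cellAt(board, 19, 4));
        return 0;
    }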

GARNET:

The Garnet environment is a randomly generated MDP (random states, actions, and rewards) with probabilistic state transitions and a non-stationary element: every k iterations the transitions are changed by randomly deleting n state connections and creating n new links between previously unconnected states. The agent is provided with a random binary observation vector that is not large enough to uniquely identify states. The agent's objective in this continuing task is to locate, and try to stay in, the regions of the MDP that yield the highest average reward.

    Non-stationary, random MDP
    Continuing task
    Action Space:
        discrete actions, stochastic transitions
        a ∈ [0, n-1]
        n is a problem parameter
    Observation Space:
        vector of active binary feature indices
        e.g., if state = 5 and its random feature vector is
            [00010001000110000001]
        then the observation is
            [3, 7, 11, 12, 19]
        (see the expansion sketch after this specification)
    Reward:
        normal distribution with mean 0 and variance 1
    Performance measure:
        average reward
    Rmax = inf
    Rmin = -inf
    Smax = 10,000
    Emax = none, continuing task
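
As referenced in the Garnet specification, the observation is the list of active feature indices rather than the full binary vector. The C++ sketch below expands such a list into a dense 0/1 vector, reproducing the example above; the total number of features and the number of active indices are assumed to be recoverable from the Task_spec, which is an assumption rather than part of the specification.

    // Expand a Garnet observation (indices of the active binary features) into
    // a dense 0/1 feature vector, reproducing the example above:
    // observation [3, 7, 11, 12, 19] with 20 features -> 00010001000110000001.
    // The feature count and the number of active indices are assumed to come
    // from the Task_spec.
    #include <cstdio>
    #include <vector>

    typedef int* Observation;   // mirrors the RLcommon.h edit for discrete events

    std::vector<int> expandFeatures(Observation activeIndices, int numActive,
                                    int numFeatures) {
        std::vector<int> features(numFeatures, 0);
        for (int i = 0; i < numActive; ++i) features[activeIndices[i]] = 1;
        return features;
    }

    int main() {
        int obs[5] = {3, 7, 11, 12, 19};                  // example from the spec
        std::vector<int> f = expandFeatures(obs, 5, 20);
        for (int bit : f) std::printf("%d", bit);
        std::printf("\n");                                 // prints 00010001000110000001
        return 0;
    }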
back to top

-----------------------
Continuous
-----------------------

Non-stationary Mountain Car:

    The force of gravity changes every k steps
    Episodic task
    Action Space:
        full reverse, neutral, full forward
        a ∈ [0,2]
        |a| = 3
    Observation Space:
        position and velocity
        [x, x_dot]
        x ∈ [-1.2, 0.6], x_dot ∈ [-0.07, 0.07]
    Reward:
        -100 if velocity > 0.0005 and position > 0.5 : termination occurs
        0 if velocity <= 0.0005 and position > 0.5 : termination occurs
        -1 otherwise
    Performance measure:
        cumulative reward
    Rmax = 0
    Rmin = -100
    Emax = 500
    Smax = 500

Cart-Pole:
    Episodic task
    Action Space:
        discretized forces applied to the cart
        a ∈ [-10,10]
        |a| = 21
    Observation Space:
        pole angle (from vertical), pole angular velocity, cart position (from center), cart velocity
        θ ∈ [-π/6, π/6]
        θ_dot ∈ [-5, 5]
        x ∈ [-2.4, 2.4]
        x_dot ∈ [-10, 10]
    Reward:
        +2 if |pole angle| <= π/60 and |position| <= 0.05   {NEW}
        0 if |pole angle| >= π/6 or |position| >= 2.4 : termination occurs
        +1 otherwise
    Performance measure:
        cumulative reward
    Rmax = 2
    Rmin = 0
    Emax = 500
    Smax = 500

Puddle World:
    Episodic task
    Action Space:
        move agent north, east, south, or west -- stochastic
        a ∈ [0,3]
        |a| = 4
    Observation Space:
        agent grid position
        {x, y} ∈ [0, 1]
        real-valued
    Reward:
        -1 per step
        plus -400 * <distance inside the puddle ∈ [0, 0.1]> if the agent is in a puddle
    Performance measure:
        cumulative reward
    Rmax = -1
    Rmin = -401
    Emax = 500
    Smax = 500

back to top
----------------------
Pentathlon Event:
----------------------

Delayed Mountain Car:

    The observation is delayed by k steps: the observation given to the agent corresponds to the state of the environment from k steps in the past. The first k observations of an episode are random, and the agent does not receive the last k observations before termination. (A small illustration of the delay follows this specification.)
    Episodic task
    Action Space:
        full reverse, neutral, full forward
        a ∈ [0,2]
        |a| = 3
    Observation Space:
        position and velocity
        [x, x_dot]
        x ∈ [-1.2, 0.6], x_dot ∈ [-0.07, 0.07]
    Reward:
        -1 per time step
    Performance measure:
        cumulative reward
    Emax = 500
    Smax = 500
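
The delay can be pictured as a k-slot buffer sitting between the environment and the agent. The short C++ sketch below (an illustration only, not competition code, with integer stand-ins for states) prints what a k-step-delayed observer sees, with random placeholders for the first k observations.

    // Illustration of a k-step observation delay as described above: the agent's
    // observation on each step is the state from k steps earlier, and the first
    // k observations are random placeholders.  States here are just integers
    // used for demonstration.
    #include <cstdio>
    #include <cstdlib>
    #include <deque>

    int main() {
        const std::size_t k = 3;         // delay length (a problem parameter)
        std::deque<int> buffer;          // holds the most recent true states

        for (int step = 0; step < 10; ++step) {
            int state = step * 100;      // made-up "true" environment state
            buffer.push_back(state);

            int observed;
            if (buffer.size() > k) {     // the state from k steps ago is available
                observed = buffer.front();
                buffer.pop_front();
            } else {
                observed = std::rand();  // first k observations are random
            }
            std::printf("step %d: true state %d, agent observes %d\n",
                        step, state, observed);
        }
        return 0;
    }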


Acrobot:
    Episodic task
    Action Space:
        full reverse, neutral, full forward
        a ∈ [0,2]
        |a| = 3
    Observation Space:
        angle and angular velocity of the two joints
        [θ1, θ2, θ1_dot, θ2_dot]
        {θ1, θ2} not restricted
        θ1_dot ∈ [-4π, 4π]
        θ2_dot ∈ [-9π, 9π]
    Reward:
        -1 per time step
    Performance measure:
        cumulative reward
    The Task_spec gives suggested operational ranges for θ1 and θ2, but these values are not bounded in the code. θ1_dot and θ2_dot are bounded as above.
    Emax = 500
    Smax = 500

3 Unknown Environments ...
    Rmax = 5
    Rmin = -500
    Emax = 500
    Smax = 500

Participants will be allowed to submit to any number of events and solve any number of problems within the Discrete and Continuous events.

back to top

Competition Prizes

Prizes will be awarded to participants as follows:

Best Cat and Mouse Agent - Certificate
Best Tetris Agent - Certificate
Best GARNET Agent - Certificate

Best Non-stationary Mountain Car Agent - Certificate
Best Puddle World Agent - Certificate
Best Cart-pole Agent - Certificate

Pentathlon:
     first place -- iPod Nano + Certificate
     second place -- Certificate
     third place -- Certificate
back to top

Organization Committee

Adam White, University of Alberta, Alberta, Canada (chair)
Richard S. Sutton, University of Alberta, Alberta, Canada
Michael L. Littman, Rutgers University, New Jersey, USA
Doina Precup, McGill University, Montreal, Canada
Peter Stone, University of Texas, Austin, Texas, USA


Technical Organization Committee

Andrew Butcher, University of Alberta, Alberta, Canada
back to top
