Reinforcement Learning and Artificial Intelligence (RLAI)
Standards for reinforcement learning benchmarks - discussion of Dec 2004
This page records an email discussion concerning standards for reinforcement learning benchmarks, from Dec 2004, following a NIPS workshop on this topic.  In particular, we discussed the idea of software "glue" that connects RL learners and RL problems  to each other in a standard way.  There was not a clear resolution at that time, but it did lead to a 2005 NIPS workshop.


Some of the efforts currently underway to develop software for RL benchmarks are:



Bandit by Joannès Vermorel. The Bandit project includes the implementation of many multi-armed bandit strategies as well as several benchmarking datasets. 

Much of the following is from an email conversation.
Rich Sutton, Jan 2, 2005

John Langford (12/22/04):

I spent some quality time with RLBench and just released a version 3.
There are two protocol changes which Rich convinced me are worthwhile.

1) In addition to outputting the initial action set, the simulators
output the initial (observation,reward,state) triple.  This simply
orders things so the environment moves first.

2) I shifted the simulators so that the time horizon is built into the
simulator.  (It seems that knowing the explicit halting time is a cheat
typically not available for real world problems.) In particular, if the
pole or the bike crashes, an 'end of episode' is signaled by sending an
empty string state and observation. 

This modification allows full expression of gamma discounting, hard
horizons, and variable-length horizons, as we discussed.  I don't know
how else to cope with the RL people Rich mentioned who think in odd
ways, although suggestions are welcome.

This modification is not entirely complete as I don't know what "end of
episode" should mean in the context of tetris or the maze game.

Outstanding issues that we really need to address:

1) Toplevel.
I'm convinced now that a toplevel environment/agent evaluation program
as Rich suggests is a good idea, as it removes the complexity of
evaluating performance from both the algorithm designer _and_ the
simulator designer.

For example, when creating a simulation, it seems it is often a good
idea to make a policy for the simulation just to debug it.  (In fact
'optimal_pole' is just such a program.)  Nobody writing a simulator
wants to cope with the complexity of calling subprograms, etc...

I am going to work on this in the near future.

2) Integration. 
The translators between CLSS and RLBench (both directions) seem
important.  I believe the RLBench protocol is stabilized now (although
I've believed that in the past, too).

I believe Martin is working on this.

It seems quite plausible that we want to fully integrate CLSS and
RLBench into one suite.

3) Sourceforge & License
I checked out sourceforge.  It looks pretty easy to setup a project.
Before we do that, perhaps we should have a license fight.  My
impression is "GPL" for a license.  I believe Drew is thinking along the
lines of "LGPL" or "BSD".  It looks like CLSS is under BSD.   Perhaps
Rich and Nick can comment?

4) Documentation
I'm terrible at this as I'm not very good about thinking about what
other people don't know.  Comments from anyone would be helpful.
Martin's documentation looked very nice.

5) New Simulators
Nick, I hear you made an interface to a robot simulator.  How feasible
is adding it into the distribution?

6) Little Details
Drew, try playing with the tetris code, as it still isn't compiling on my
Fedora Core 3 Linux machine.  Also, consider tweaking tetris and maze to
output the empty string on whatever seems like it should be "end of
episode".

Nick Roy (12/22/04):

Yep, I added Carmen support to RLBench.  At the moment it's mostly just a
glorified mazeworld, but there's also partial support for dynamic
objects (e.g., people), and I thought of some other extensions.

It depends on having Carmen around, of course, and there's no
documentation yet.  I've attached a tarball -- you should be able to plug
it straight into the RLBench repository.  It belongs in the g
subdirectory, not dg.

I think the LGPL is deprecated now, so either GPL or BSD.  The only way
the GPL can go horribly wrong is if you sign over the copyright to the
FSF.  I don't have a strong feeling one way or the other.  Mike, Sebastian
and I decided on the GPL for Carmen because we felt it would be
maximally acceptable to other users, but we've kept ownership of the
copyright so that we could in principle fork the source in the future.

JL: How viable is including all of Carmen?

NR: Not very viable: it's pretty large and depends on lots of other
packages.  It can be annoying.  We could modify the makefile to notice
that Carmen isn't around, print a message and move on, or maybe do
something smart with autoconf.

Drew Bagnell (12/24/04):

For the record, I view the GPL as incompatible with academic work, and wouldn't be willing to release code under it. It may be fine for protecting corporate property, but our goal is dissemination. I would support allowing arbitrary (open) licenses with a preference for "just plain free"-- i.e. public domain, with BSD as the worst case.

JL (Jan 3 2005):

I suggest we settle on a BSD license for the benchmark suite.

(1) It satisfies Drew's ultimatum.
(2) It fits with Martin's choice.

It doesn't quite fit with Carmen, but it seems (at most) only a stub
from Carmen will be in the suite.

Rich Sutton (1/2/05):

I've been thinking about the top-level separation into three modules, which here I will call environment, agent, and interface (we can discuss naming separately).  I think John and Drew have a good point about interfacing via pipes and separate processes (as I understand it).  This gives us a gold standard for independence of the agent and the environment modules.  In particular, it probably achieves the maximal generality of languages and OSs.  Still, in some cases people might prefer procedural interfaces.  To some extent we can finesse such issues with glue code, but there are a number of other issues, such as working with multiple agents and environments in the same program.

It is not clear to me what the best way to go here is, so I have tried to step up a level.  I have been trying to write down the essential functions we want from each of the three modules, independent of how they are done (e.g., by pipes, methods on objects, or straight procedure calls).  If we can get these clear and standardized it will be a good step.  So the following is a draft proposal.  I tried to include the cases covered by RLBench.  Remember, the names and arguments here are not important, just the functionality for each module.  Let me know what you think.


The environment (plant, simulator) should be able to understand (respond appropriately to) the following:

env_init() --> task specification.  Initialize yourself and return an optional specification of your i/o interface - the space of actions you accept and the space of rewards and observations you return.  (These will be made available to the agent.)  For a given environment, env_init would normally occur exactly once.

env_start() --> first observation.  For a continuing task, this is done once.  For an episodic task, this is done at the beginning of each episode.  Note no reward is returned.  In the case of an episodic environment, end-of-episode is signaled by some special observation.  This special observation cannot be returned by env_start.

env_step(action) --> reward, observation.  Do one time step of the environment.

env_state() --> state key.  Save the current state of the environment such that it can be recreated later upon presentation of the key.  The key could in fact be the state object, but returning just a key (a logical pointer to the state information) saves passing the state back and forth and also protects against the agent cheating by looking inside the state.

env_step(action, state_key) --> reward, observation.  Do one step, randomly sampling starting from the state specified by the state_key.  Here we suggest overloading the env_step function, but of course we could use a different name.

env_random_seed() --> random seed key.  Save the random seed object such that it can be restored upon presentation of the key.  Same comments as for env_state.

env_step(action, state_key, random_seed_key) --> reward, observation.  One step, effectively deterministic because the random seed is set. 
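
To make the proposed env_* functions concrete, here is a minimal Python sketch of a toy episodic environment (a bounded random walk).  The dynamics, the task-specification dictionary, and the choice to use the state and seed objects themselves as keys are illustrative assumptions, not part of the proposal.

    import random

    class RandomWalkEnv:
        """Toy episodic environment implementing the proposed env_* functions."""

        END_OF_EPISODE = "end"   # the special end-of-episode observation

        def env_init(self):
            # Task specification: the i/o interface made available to the agent.
            return {"actions": ["left", "right"],
                    "observations": "integer position in [-5, 5]",
                    "rewards": "float"}

        def env_start(self):
            self.position = 0
            self.rng = random.Random()
            return self.position                  # first observation, no reward

        def env_step(self, action, state_key=None, random_seed_key=None):
            if state_key is not None:
                self.position = state_key         # restore a saved state
            if random_seed_key is not None:
                self.rng.setstate(random_seed_key)  # restore a saved random seed
            move = 1 if action == "right" else -1
            if self.rng.random() < 0.1:           # occasionally the move is reversed
                move = -move
            self.position += move
            if abs(self.position) >= 5:
                return (1.0 if self.position > 0 else -1.0), self.END_OF_EPISODE
            return 0.0, self.position

        def env_state(self):
            return self.position                  # here the key is the state itself

        def env_random_seed(self):
            return self.rng.getstate()            # here the key is the seed object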


The agent (controller) should understand the following:

agent_init(task_spec).  Initialize the agent for interfacing to an environment with the given i/o interface according to some standard description language.

agent_start(first_observation) --> first action.  Do the first step of a run or episode.  Note there is no reward input.

agent_step(reward,observation) --> action.  Do one step of the agent. 

agent_end(reward).  Do the final step of the episode.

Note: the last three of these functions could be combined into one, but it is easier on the agent writer to separate them.
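
A matching sketch of the agent side, again purely illustrative: a random-action agent implementing the four agent_* functions, assuming the task-specification dictionary used in the environment sketch above.

    import random

    class RandomAgent:
        """Trivial agent implementing the proposed agent_* functions."""

        def agent_init(self, task_spec):
            self.actions = task_spec["actions"]   # assumed task_spec format
            self.rng = random.Random()

        def agent_start(self, first_observation):
            # First step of a run or episode; note there is no reward input.
            return self.rng.choice(self.actions)

        def agent_step(self, reward, observation):
            # A learning agent would update its estimates here before acting.
            return self.rng.choice(self.actions)

        def agent_end(self, reward):
            # Final step of the episode; a learning agent would do a last update.
            pass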


The RL interface should provide the following functionality:

RL_benchmark() --> performance, standard error.  The idea is to provide one overall measure of performance defining the benchmark.  I'm not sure exactly what the inputs and outputs of RL_benchmark should be - maybe a number of steps or episodes should be input.  But I think we need this one top-level number.  It would be built out of the following routines in various ways as appropriate to the particular benchmark.

RL_init().  Initialize everything, passing the environment's i/o specification to the agent.

RL_start() --> first observation, first action.  Do the first step of a run or episode.  The action is saved so that it is used on the next step.

RL_step() --> reward, observation, action.  Do one time step.  RL_step uses the saved action and saves the returned action for the next step.  The action returned from one call must be used in the next, so it is better to handle this implicitly so that the user doesn't have to keep track of the action (for example, of which action goes with which interface when there are multiple interface instances).  If the end-of-episode observation occurs, then no action (or a null action) is returned.

RL_episode() --> first observation, first action, ..., last reward.  Do one episode.  As you might imagine, this is done by calling RL_start, then RL_step until the end-of-episode observation occurs.

RL_total_reward(gamma) --> total gamma-discounted reward of the current or just completed episode or run.

RL_num_steps() --> number of steps of the current or just completed episode or run.
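
The RL_* functions could be glued together roughly as follows.  This sketch assumes the environment and agent objects above live in the same process (procedure calls rather than pipes), that the environment exposes its special end-of-episode observation as END_OF_EPISODE, and that RL_benchmark reports the mean and standard error of undiscounted episode returns over a given number of episodes.

    import statistics

    class RLInterface:
        """Sketch of the proposed RL_* glue, built on the env_*/agent_* sketches."""

        def __init__(self, env, agent):
            self.env, self.agent = env, agent

        def RL_init(self):
            self.agent.agent_init(self.env.env_init())

        def RL_start(self):
            self.rewards = []
            observation = self.env.env_start()
            self.action = self.agent.agent_start(observation)  # saved for next step
            return observation, self.action

        def RL_step(self):
            reward, observation = self.env.env_step(self.action)
            self.rewards.append(reward)
            if observation == self.env.END_OF_EPISODE:
                self.agent.agent_end(reward)
                self.action = None                # no action at end of episode
            else:
                self.action = self.agent.agent_step(reward, observation)
            return reward, observation, self.action

        def RL_episode(self):
            steps = [self.RL_start()]
            while True:
                reward, observation, action = self.RL_step()
                steps.append((reward, observation, action))
                if action is None:
                    return steps

        def RL_total_reward(self, gamma=1.0):
            return sum(r * gamma ** t for t, r in enumerate(self.rewards))

        def RL_num_steps(self):
            return len(self.rewards)

        def RL_benchmark(self, num_episodes=100):
            returns = []
            for _ in range(num_episodes):
                self.RL_episode()
                returns.append(self.RL_total_reward())
            return (statistics.mean(returns),
                    statistics.stdev(returns) / num_episodes ** 0.5)

With the toy classes above, one would construct RLInterface(RandomWalkEnv(), RandomAgent()), call RL_init() once, and then call RL_benchmark().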

John Langford, in response to the above (Jan 3 2005); passages quoted from Rich's message above are marked with '>':

> or straight procedure calls).  If we can get these clear and
> standardized it will be a good step.  So the following is a draft

I agree with the goal.  I'm going to critique what's below and make an
alternative proposal.

> p.s. I put up all of our recent email on this on the RL benchmarks
> discussion page (/rlbenchmarks.html).

Fine with me.  (The difficulty for me with the webpage is that it
doesn't transparently do email. In particular, changes are not announced
to people.)

> The ENVIRONMENT (plant, simulator) should be able to understand
> (respond appropriately to) the following:
>
> env_init() --> task specification.  Initialize yourself and return an
> optional specification of your i/o interface - the space of actions
> you accept and the space of rewards and observations you return.
> (These will be made available to the agent.)  For a given environment,
> env_init would normally occur exactly once.

It seems that the space of actions does need to be specified.

The space of rewards is always a floating point number, so I don't think
it needs to be specified. 

In RLBench, we avoided specifying the space of observations, except
implicitly (the thing which a line varies over).  I find this elegant
because it means that the interface is very simple - we don't need to
specify a language for observations. 

The downside here is that we may make the RL problem artificially hard,
although I haven't really seen this. 

In practice, observations always seem to have constant dimensionality,
and it is easy for an agent to detect this constant.  So the tradeoff
right now is future growth and slightly more complex agents vs. a more
complex observation-specification language and (maybe) simpler agents.

> env_start() --> first observation.  For a continuing task, this is
> done once.  For an episodic task, this is done at the beginning of
> each episode.  Note no reward is returned.  In the case of an episodic
> environment, end-of-episode is signaled by some special observation.
> This special observation cannot be returned by env_start.

I see no reason to forbid returning a reward on the first timestep.  If
the reward is always 0, then it is as if there was no reward specified.
If it is random, then there is just some minor additional stochasticity
in the environment.  Allowing the reward to be returned eliminates a
special case.

In RLBench, env_start() is sending an empty string.

> env_state() --> state key.  Save the current state of the environment
> such that it can be recreated later upon presentation of the key.  The
> key could in fact be the state object, but returning just a key (a
> logical pointer to the state information) saves passing the state back
> and forth and also protects against the agent cheating by looking
> inside the state.

I don't see a reason to distinguish between state and state key. 

I don't believe we should be too paranoid about people cheating on a
benchmark.  Cheating is unscientific, and doing so will (justly) result
in the ending of reputations.  Too much paranoia is cumbersome.

In a competition, we should worry about cheating.  And, in that case,
it's very easy to make a little wrapper program that scrambles the state
in an unrecognizable way.

So, what I'm saying is that "state" captures "state key", without
unloading extra complexity on the simulation designer.
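
As a sketch of that wrapper idea (assuming a generative-style env_step(action, state) -> reward, observation, next state, as in the proposal below), the wrapper can hand the agent opaque random keys and keep the real states in a private table; the initial state would be passed through hide() in the same way.

    import os

    class StateScrambler:
        """Hides a generative environment's states behind opaque keys, so the
        agent can pass states back in but cannot inspect them."""

        def __init__(self, env):
            self.env = env
            self.table = {}                       # opaque key -> real state

        def hide(self, state):
            key = os.urandom(8).hex()             # unrecognizable random key
            self.table[key] = state
            return key

        def env_step(self, action, state_key):
            reward, observation, next_state = self.env.env_step(
                action, self.table[state_key])
            return reward, observation, self.hide(next_state)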

> env_step(action, state_key, random_seed_key) --> reward, observation.
> One step, effectively deterministic because the random seed is set.

This set of functions is sufficient to do deterministic generative
model, generative model, and trace model.  However, I do not like mixing
them.  There are several reasons:

1) Several of these functions can be derived from others.  Specifying
"we want these functions" invites errors on the part of the simulation
designer.   To avoid these errors, we want a _minimal_ set of functions,
from which we construct all others.

2) I'm scared that in use, people would be unclear about which interface
they are using.  I tend to think that "solved the problem in the trace
model" is _much_ more significant than "solved the problem in a
deterministic generative model" (simply because the trace model is much
more like "real life"), and I want to never wonder which is happening.

> agent_start(first_observation) --> first action.  Do the first step of
> a run or episode.  Note there is no reward input.
>
> agent_step(reward,observation) --> action.  Do one step of the
> agent.

Another reason that I like having a reward in the first time step is
that a reward _is_ an observation.  Having the reward there sometimes
and not there other times is having a variable size observation space.
This is adding complexity without point.

> agent_end(reward).  Do the final step of the episode.
>
> Note: the last three of these functions could be combined into one,
> but it is easier on the agent writer to separate them.

I completely disagree, for the reasons given above.

Stated another way, I see no reason why an agent should be able to infer
(based on the function used) whether or not this is the last timestep.
That's providing an additional implicit observation---something which
seems quite messy to me.

> The RL INTERFACE should provide the following functionality:
>
> RL_benchmark() --> performance, standard error.  The idea is to
> provide one overall measure of performance defining the benchmark.
> I'm not sure exactly what the inputs and outputs of RL_benchmark should
> be - maybe a number of steps or episodes should be input.  But I think
> we need this one top-level number.  It would be built out of the
> following routines in various ways as appropriate to the particular
> benchmark.

I agree.

An alternative proposal:

There are 3 kinds of environment.

--------------------------------------------------------------------

Trace environment:  (The most AIish environment where progress will
remain real progress in AI.)

env_init() -> specifies action space and (maybe or maybe not) observation
space.  Specifies first reward and first observation.

env_step(action) -> reward, observation

---------------------------------------------------------------------

Generative environment: (Useful for complex realistic simulators where
learned policies may be useful in the real world.)

env_init() -> specifies action space and (maybe or maybe not) observation
space.  Specifies first reward, first observation, first state.

env_step(action, state) -> reward, observation, next state

--------------------------------------------------------------------

Deterministic Generative Environment: (The determinism can significantly
reduce the complexity allowing us to solve harder problems, for now.
The random seed is (logically) a part of the state.)

env_init() -> specifies action space and (maybe or maybe not) observation
space.  Specifies first reward, first observation, first state, first
random number.

env_step(action, state, random number) -> reward, observation, next
state, next random number

-------------------------------------------------------------------

It might appear that the above has a similar complexity to the other
proposal, but it does not because we can automatically convert upwards.
Consequently, the goal of a simulator writer is to come up with just
_two_ functions as close to "deterministic generative" as possible.
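
A sketch of the upward conversion, assuming the signatures in this proposal (env_init handling is omitted for brevity): a generative environment can be derived from a deterministic generative one by supplying the random numbers internally, and a trace environment from a generative one by keeping the state hidden.

    import random

    class GenerativeFromDeterministic:
        """Derive a generative environment from a deterministic generative one."""

        def __init__(self, det_env):
            self.det_env = det_env
            self.rng = random.Random()

        def env_step(self, action, state):
            r = self.rng.random()                 # supply the randomness ourselves
            reward, observation, next_state, _ = self.det_env.env_step(action, state, r)
            return reward, observation, next_state

    class TraceFromGenerative:
        """Derive a trace environment from a generative one."""

        def __init__(self, gen_env, first_state):
            self.gen_env = gen_env
            self.state = first_state              # kept internal, never shown to the agent

        def env_step(self, action):
            reward, observation, self.state = self.gen_env.env_step(action, self.state)
            return reward, observation

In this sense a simulator writer who provides the deterministic generative functions gets the other two interfaces for free.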

---------------------------------------------------------------------

Like environments, there are 3 kinds of agents, depending on the source
of information they use to learn.

---------------------------------------------------------------------

The trace agent:

agent_init(task_spec).  Initialize the agent for interfacing to an
environment with the given i/o interface according to some standard
description language.

agent_step(reward, observation) --> action

Every agent must implement this interface because all evaluations are
done in the trace model.

-------------------------------------------------------------------

The generative agent:

agent_init(task_spec).  Initialize the agent for interfacing to an
environment with the given i/o interface according to some standard
description language.

agent_step(reward, observation, state) --> state, action.

-------------------------------------------------------------------

The deterministic generative agent:

agent_init(task_spec).  Initialize the agent for interfacing to an
environment with the given i/o interface according to some standard
description language.

agent_step(reward, observation, state, random seed) --> state, action,
random seed.

-------------------------------------------------------------------

The other two interfaces are simply the means by which the
agent interacts with models providing more information.

-------------------------------------------------------------------

The RL Interface (this gets murkier, as I haven't thought about it as much.)

I don't believe that we need to standardize the internal mechanisms of
the super program combining the environment and the agent, because we
are the only ones writing it, and no one else needs to examine the
internals.  There are two operations that need to be handled.

1) Training the RL learner.

train(environment, agent, complexity_limit) -> trained agent.

Here complexity_limit is something like "100 episodes" or "1000
actions".

2) Testing the RL learner.

test(trace environment, agent, complexity limit) -> performance.

The complexity limit here is always measured in episodes.  The
performance should (at a minimum) be the observed return.
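
A sketch of these two operations, assuming trace-style environment and agent objects.  The run_episode helper, the env_start call, and the use of None to mark end of episode are assumptions made for this sketch, not part of the proposal.

    def run_episode(environment, agent):
        """Run one episode and return the observed (undiscounted) return."""
        reward, observation = environment.env_start()   # assumed episode reset
        episode_return = reward
        while observation is not None:                  # None marks end of episode here
            action = agent.agent_step(reward, observation)
            reward, observation = environment.env_step(action)
            episode_return += reward
        return episode_return

    def train(environment, agent, num_episodes):
        """Train the agent; the complexity limit is a number of episodes."""
        for _ in range(num_episodes):
            run_episode(environment, agent)
        return agent

    def test(trace_environment, agent, num_episodes):
        """Performance is the average observed return over the test episodes."""
        return sum(run_episode(trace_environment, agent)
                   for _ in range(num_episodes)) / num_episodes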

-----------------------------------------------------------------

It isn't clear how to think about "which interface is better". 

It seems that we want sufficient richness of functionality.  Does this
one appear sufficiently rich?

We should also think about simplicity of use and implementation.  Is
this one sufficiently simple?  Are the arguments about the relative
complexity of the two interfaces compelling?

-John
