Reinforcement Learning and Artificial Intelligence (RLAI)

RL Interface and Benchmarks


The ambition of this page is to keep the members of the RL benchmark and Interface team up to date with recent developments and provide general interest groups with information, standards, documentation and product availability.
Edited by Adam White

Introduction & Motivation
More to come soon....

Next Meeting:
when: 2:30pm ~ October 7th 2005
where: RLAI Lab

- update group about our recent public release
- discuss documentation and its maintenance
- Should we support Windows? Currently only Unix/Mac/Linux
    - and is it even possible??
- Phase II implementation
    - Mark and I have a proposal of how this might work/look
- Any workshop considerations

Previous meetings:

September 15th 2005

- everything Rich and I (Adam) talked about Tuesday
- de_init for agents and envs
- web page
- the group's feedback (or lack thereof)

August 2nd 2005
During this meeting we fleshed out the overall framework of the system, talked about compatibility concerns, and finalized some macro issues. Below are some notes from the discussion (thanks, Anna :)

How it should work
- if you wanted to run a competition, you would want to restrict access to the world
- the main model would be going to the repository, getting the programs
- like UCI database - not monolithic, just pick which of what you use
- client/server issue - can my code run on your machine?

- standard benchmarks which are visible, then competitions where they're not
- provide benchmark in various forms (eg training set, with labels changed or order changes) - so any table lookup reinforcement learning problem, change the actions

- focus on benchmarks, run across machines
- RL Bench (CMU), CL^2 in Germany, us
- RL Bench communicate over a text-based protocol

Software Standard
Mark's Proposal

Agent and environment can be written in any language, as long as they comply with the RL Interface (7.0) (agent_start(), agent_step(), env_start(), env_step(), etc.). Also, the agent and environment can be in another executable (communication by pipes), or on another machine (network socket communication).
RL_Interface written in C
RL_Interface - calls the RL Interface 7.0 (agent_start(), agent_step(), etc.), but the agent and environment it calls may be adapters instead of actual agents or environments. Adapters will translate between languages, or communicate over pipes/sockets.
- putting a network communication layer in between
- we create libraries that translate (examples): Lisp agent->network->RL Interface, C environment->network->RLInterface
- could communicate within the agent to the RLInterface directly (opening sockets) but then dependent on following our standards
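A minimal sketch of the adapter idea in the notes above. The types, the function table, and the text messages are all assumptions made for illustration (the notes do not fix any of them), and the "remote" agent is faked in-process so that the sketch needs no pipes or sockets:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed placeholder types; the notes leave the real types open. */
typedef int Observation;
typedef int Action;

/* Hypothetical table of the entry points the interface calls. */
typedef struct {
    Action (*agent_start)(Observation o);
    Action (*agent_step)(double reward, Observation o);
} AgentFunctions;

/* A real agent written in C and linked directly. */
static Action c_agent_start(Observation o) { return o % 2; }
static Action c_agent_step(double reward, Observation o)
{
    (void)reward;
    return o % 2;
}

/* Stand-in for an agent in another process or language. It only sees
   text, as it would over a pipe or socket. */
static void fake_remote_agent(const char *request, char *reply, size_t size)
{
    double reward = 0.0;
    int obs = 0;
    assert(request != NULL && reply != NULL && size > 0);
    if (sscanf(request, "step %lf %d", &reward, &obs) != 2)
        sscanf(request, "start %d", &obs);
    snprintf(reply, size, "%d", (obs + 1) % 2);
}

/* Adapters: same signatures as a real agent, but they only translate. */
static Action adapter_start(Observation o)
{
    char request[64], reply[32];
    snprintf(request, sizeof request, "start %d", o);
    fake_remote_agent(request, reply, sizeof reply);
    return atoi(reply);
}

static Action adapter_step(double reward, Observation o)
{
    char request[64], reply[32];
    snprintf(request, sizeof request, "step %f %d", reward, o);
    fake_remote_agent(request, reply, sizeof reply);
    return atoi(reply);
}

/* The interface would call through one of these tables without knowing
   whether a real agent or an adapter sits behind it. */
const AgentFunctions C_AGENT = { c_agent_start, c_agent_step };
const AgentFunctions ADAPTER_AGENT = { adapter_start, adapter_step };
```

The interface code would call, say, ADAPTER_AGENT.agent_step(reward, obs) exactly as it calls C_AGENT.agent_step(reward, obs). In a real adapter the call to fake_remote_agent would become a write and a read on the pipe or socket.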

How to distribute RL_Interface?
- want to be able to compile it with C for speed
- Should be modular (users choose what they need)
- Set of agents and environments to choose from

Need something outside of the agent and environment to track statistics
- Rich - RL Interface has various benchmarks you could run
    - average reward, etc.
- big win if the person writing the agent code doesn't have to know anything about the interface - sets number of episodes and performance measure without doing anything extra
- trivial main, benchmark mains, you can write your own mains which determine experimental setup and gui, etc.
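As a rough sketch of the point above, the benchmark below tracks average reward per episode entirely outside the agent and environment. The toy corridor environment, the always-right agent, the StepResult struct, and all signatures are assumptions made up for this example:

```c
#include <assert.h>

typedef int Observation;
typedef int Action;
typedef struct { double reward; Observation obs; int terminal; } StepResult;

/* Toy environment (hypothetical): a corridor of cells 0..4. Action 1
   moves right, every step costs -1, and cell 4 ends the episode. */
static int position;

Observation env_start(void) { position = 0; return position; }

StepResult env_step(Action a)
{
    StepResult r;
    if (a == 1 && position < 4)
        position++;
    r.reward = -1.0;
    r.obs = position;
    r.terminal = (position == 4);
    return r;
}

/* Toy agent (hypothetical): always moves right, knows nothing about
   episodes or performance measures. */
Action agent_start(Observation o) { (void)o; return 1; }
Action agent_step(double reward, Observation o) { (void)reward; (void)o; return 1; }
void agent_end(double reward) { (void)reward; }

/* Benchmark code: owns the episode count and the performance measure. */
double benchmark_average_return(int num_episodes)
{
    double total = 0.0;
    assert(num_episodes > 0);
    for (int e = 0; e < num_episodes; e++) {
        Action a = agent_start(env_start());
        for (;;) {
            StepResult r = env_step(a);
            total += r.reward;
            if (r.terminal) {
                agent_end(r.reward);
                break;
            }
            a = agent_step(r.reward, r.obs);
        }
    }
    return total / num_episodes;
}
```

A trivial main would just call benchmark_average_return(10) and print the result; a different main could swap in another measure without touching the agent or the environment.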

------------End of Notes------------------------------

Below is the diagram of the structure of the new RL-interface. This is sure to change slightly, but it gives a good overall sense of the ambition and direction of the project (thanks again, Anna :).

Our RL Framework
Below are my (Mark) thoughts on the task_spec and data types for actions, observations and rewards:

>In the meeting it was mentioned that the task_spec could specify a
>"level" to which it complies. I think this might be the best way
>to go. That way it can be pretty simple (lower levels) while still
>allowing for more complication (higher levels).
>So, then, the first level could be that all the data types are
>integers and the task_spec simply specifies how many states and
>actions there are.
Level 1 Environments:

Eg: "1,5,2,-1" for a world with 5 states, 2 actions, and -1 being
the terminal state.

>Level 2 could be for a continuous world. All data types would
>be floating point (should actions be? Continuous actions seems
>more complicated, maybe best to leave to a higher level?). The
>task_spec would specify the range of the state space (and action
>space?) (Does the range of the reward space need to be supplied?
>Does that add anything?).
Level 2 Environments:

eg: "2,42,4242,0,7,-1" for a world with states numbered from 42 to 4242
and actions numbered from 0 to 7, with -1 being the terminal state.

>Level 3 could be for a multi-dimensional discrete world. So,
>the state and action space would be an array of integers. The
>task_spec would specify the number of dimensions, followed by
>the size of each dimension (range of values in that dimension).
Level 3 Environments:
(newlines for email clarity, not part of the specification)

Eg: "3,2,0,10,1,6,3,1,3,1,3,1,4,0,1" would be a world with 2 state
dimensions, the first ranging from 0 to 10, the second from 1 to 6. 3
action dimensions, the first ranging from 1 to 3, the second also from 1
to 3, and the third from 1 to 4. The terminal state would have the first
dimension be 0 and the second dimension be 1.

The state would be terminal if all of the state variables equal the
terminalState values. Something like -MAX_INT could be used for, say,
just the first state to indicate that it is terminal.

>Level 4 could be for a multi-dimensional continuous world. State
>and action space would be an array of floating point. task_spec
>would be the # of dimensions followed by the range of each
Level 4 Environments:
Same as level 3, but starting with a 4 not a 3.

>And, finally, level 5 could be for a general multi-dimensional
>world. So, the task_spec would specify the # of dimensions, followed
>by whether or not that dimension is continuous, followed by the
>range in that dimension. (for a world where a possible action
>might be [5, 0.3, 2], for example).
Level 5 Environments:

Where the discrete flags are simply 1 or 0, 1 for a continuous
state/action and 0 for a discrete state/action.

Eg: "5,3,1,0,1,0,1,10,1,5,8,2,0,1,8,1,0,3.141,0.5,7,7"

Would be a world with 3 state dimensions, the first being continuous values
from 0 to 1, the second being discrete values from 1 to 10, and the third
being continuous values from 5 to 8. There are 2 action dimensions, the
first being discrete values from 1 to 8, and the second being continuous
values from 0 to 3.141. The terminal state would be 0.5, 7, 7.
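One way the level 5 string above could be parsed, as a sketch only. The struct layout, the function names, and the fixed MAX_DIMS limit are assumptions for illustration; only the field order (level, state dimensions, action dimensions, terminal state) comes from the example:

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DIMS 16  /* arbitrary limit chosen for this sketch */

typedef struct {
    int continuous;   /* 1 = continuous, 0 = discrete */
    double min, max;
} Dimension;

typedef struct {
    int level;
    int num_state_dims;
    int num_action_dims;
    Dimension state[MAX_DIMS];
    Dimension action[MAX_DIMS];
    double terminal[MAX_DIMS];
} TaskSpec5;

/* Reads one comma-separated number and advances *cursor past it.
   Returns 0 on success, -1 if no number could be read. */
static int next_number(const char **cursor, double *out)
{
    char *end;
    if (**cursor == ',')
        (*cursor)++;
    *out = strtod(*cursor, &end);
    if (end == *cursor)
        return -1;
    *cursor = end;
    return 0;
}

/* Reads a count followed by (flag, min, max) for each dimension. */
static int parse_dims(const char **cursor, Dimension *dims, int *count)
{
    double v;
    if (next_number(cursor, &v) != 0 || v < 1 || v > MAX_DIMS)
        return -1;
    *count = (int)v;
    for (int i = 0; i < *count; i++) {
        double flag;
        if (next_number(cursor, &flag) != 0 ||
            next_number(cursor, &dims[i].min) != 0 ||
            next_number(cursor, &dims[i].max) != 0)
            return -1;
        dims[i].continuous = (flag != 0.0);
    }
    return 0;
}

/* Parses a level 5 task_spec. Returns 0 on success, -1 on any error. */
int parse_task_spec5(const char *spec, TaskSpec5 *out)
{
    const char *cursor = spec;
    double v;
    assert(spec != NULL && out != NULL);
    if (next_number(&cursor, &v) != 0 || v != 5.0)
        return -1;
    out->level = 5;
    if (parse_dims(&cursor, out->state, &out->num_state_dims) != 0)
        return -1;
    if (parse_dims(&cursor, out->action, &out->num_action_dims) != 0)
        return -1;
    for (int i = 0; i < out->num_state_dims; i++)
        if (next_number(&cursor, &out->terminal[i]) != 0)
            return -1;
    return 0;
}

/* A state is terminal if every variable equals the terminal value. */
int is_terminal_state(const TaskSpec5 *spec, const double *state)
{
    for (int i = 0; i < spec->num_state_dims; i++)
        if (state[i] != spec->terminal[i])
            return 0;
    return 1;
}
```

Discrete values are stored as doubles here purely to keep the sketch short; a real parser might keep them as integers.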

>For levels 2 and above the terminal state would need to be specified
>in the task_spec as well.


Below is the current (untested) code. In this code the task_spec is assumed to be of the form "n m", where n is the number of states and m the number of actions.


This is what we sent out to some of the people who will be attending and contributing to the NIPS workshop.

The environment and agent functions required by the RL Interface take and return actions, observations and rewards. But the data type of actions, observations and rewards depends on the task_spec contents, and C requires that the type be specified a priori. So, how can one write signatures for these methods that will never have to change? I've had 2 ideas on this:

1) Make the type of actions, observations, and rewards a void*. So, in other words, just a pointer to a space in memory. The size and contents of this memory will depend on the task_spec. This allows the actual size and type to be flexible, while still permanently setting the type in the code. The main disadvantage is that it is ugly and prone to errors (like seg faults...).

2) Make a different version of each method for each level of the task_spec. So, instead of agent_step(), there would be agent_step1(), agent_step2(), agent_step3(), agent_step4(), and agent_step5(). So, the agent would have to export functions for every level that it supports. We could also add a function to the RL Interface spec that would allow the interface to ask the agent which levels it supports. I haven't fully thought this idea through, because I like the first one better...

Mark Lee  
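A small sketch of what the first idea (void*) in Mark's post above might look like from the agent's side. The helper names are hypothetical, and the level numbering follows the earlier level 1 to 4 proposal (1 and 3 hold integers, 2 and 4 hold floating point):

```c
#include <assert.h>
#include <stddef.h>

/* Level taken from the task_spec; remembered by the agent at init time. */
static int task_level = 1;

void agent_set_level(int level)
{
    assert(level >= 1 && level <= 4);
    task_level = level;
}

/* The signature never changes: the observation is just a pointer.
   How the memory is read depends on the level. Returns the first
   component of the observation as a double. */
double first_observation_value(const void *observation)
{
    assert(observation != NULL);
    if (task_level == 1 || task_level == 3)
        return (double)*(const int *)observation;   /* int or int array */
    return *(const double *)observation;            /* double or double array */
}
```

This also shows the disadvantage mentioned above: nothing stops a caller from passing an int while the level says double, and the compiler cannot catch it.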

In response to Mark Lee's comments on varying the data types in C:

Making things into generic void* is the way to go for something like this. It's nice and flexible, as Mark mentioned, and easy enough to use.
Since things have to be sent through network sockets at some point, we are going to lose our explicit data types anyways. So, the RLInterface is going to need to have a set of functions for extracting the size of the data chunks that it expects to send and receive, from the Task_Spec string. Since it needs this functionality anyways, why not include a set of functions for parsing the void* chunk of memory?
So, I'm thinking something like:

RLInterface::setEnvTaskSpec(char* task_spec)
Parse task_spec for the level, number of states/actions, ranges of the states/actions, discrete/continuousness of the states/actions and terminal state value.
This would automatically be called when env_init is called, the interface would simply intercept and parse the task_spec before also passing it onto the user.

Then, a series of functions like the following could be implemented for the users who aren't comfortable with either parsing the task_spec themselves or dealing with the void* pointer:

//Functions for querying about the environment itself.
int RLInterface::getEnvironmentLevel();
int RLInterface::getNumStateVariables();
int RLInterface::getNumActionVariables();
bool RLInterface::getStateIsContinuous(int stateDimensionNumber);
void* RLInterface::getStateMinValue(int stateDimensionNumber); //min/max could be combined into a struct for more clarity.
void* RLInterface::getStateMaxValue(int stateDimensionNumber);
void* RLInterface::getStateValue(void* state, int stateDimensionNumber);
bool RLInterface::getActionIsContinuous(int actionDimensionNumber);
void* RLInterface::getActionMinValue(int actionDimensionNumber);
void* RLInterface::getActionMaxValue(int actionDimensionNumber);
void* RLInterface::getActionValue(void* action, int actionDimensionNumber);
bool RLInterface::checkIfTerminalState(void* state); //checks given state against the stored terminal state.

The retrieved values would still have to be cast by the user into the appropriate type (int) or (double) depending on the discreteness of that variable, but we are providing the functions for checking that, so that shouldn't be a problem. Though, the following functions could additionally be added:

int RLInterface::getDiscreteStateValue(void* state, int stateDimensionNumber);
double RLInterface::getContinuousStateValue(void* state, int stateDimensionNumber);
int RLInterface::getDiscreteStateMinValue(int stateDimensionNumber);
double RLInterface::getContinuousStateMinValue(int stateDimensionNumber);
int RLInterface::getDiscreteStateMaxValue(int stateDimensionNumber);
double RLInterface::getContinuousStateMaxValue(int stateDimensionNumber);
int RLInterface::getDiscreteActionValue(void* action, int actionDimensionNumber);
double RLInterface::getContinuousActionValue(void* action, int actionDimensionNumber);
int RLInterface::getDiscreteActionMinValue(int actionDimensionNumber);
double RLInterface::getContinuousActionMinValue(int actionDimensionNumber);
int RLInterface::getDiscreteActionMaxValue(int actionDimensionNumber);
double RLInterface::getContinuousActionMaxValue(int actionDimensionNumber);

These would automatically cast the value from a void* to the actual data type. They would also check whether that state/action is actually discrete or continuous, and throw an exception if the user tries to extract a discrete value from a continuous state dimension (and similar such errors), in order to prevent casting things into nonsense values. Naturally these checks would be bypassed by the void* versions, so the more advanced programmers wouldn't have to put up with the extremely minor performance hit these checks would incur (of course, they could always parse the state and construct the action variables themselves based on the task_spec).

Function for setting action variables in the void* memory chunk, this would take an action variable "action" that points to a section of memory, the dimension of this action variable that we want to change, and a pointer to the value that we want to store in this dimension of the action. The action variable passed in is modified to reflect this change.
void RLInterface::setActionVariable(void* action, int actionDimensionNumber, void* actionDimensionValue);
This function would return a pointer to the value stored in the requested dimension of the action.
void* RLInterface::getActionVariable(void* action, int actionDimensionNumber);

These functions could also do things like throw exceptions if the user tries to set values outside of the given state/action's allowable range, or tries to access the 7th state dimension when there are only 6: things like that, to prevent segmentation faults and hopefully provide some meaningful feedback as to what the user is doing wrong.
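A sketch of the checked accessors described above, written in plain C. Since C has no exceptions, this version returns status codes instead, which is a substitution on my part. The memory layout (one union slot per action dimension), the struct names, and the function names are all assumptions:

```c
#include <assert.h>
#include <stddef.h>

enum { RL_OK = 0, RL_ERR_DIMENSION = -1, RL_ERR_TYPE = -2, RL_ERR_RANGE = -3 };

/* Per-dimension info that would come from the parsed task_spec. */
typedef struct { int continuous; double min, max; } ActionDim;
typedef struct { size_t num_dims; const ActionDim *dims; } ActionSpec;

/* Assumed layout of the void* chunk: one slot per dimension. */
typedef union { int i; double d; } ActionSlot;

/* Stores a discrete value after checking dimension, type and range. */
int set_discrete_action(void *action, const ActionSpec *spec,
                        size_t dim, int value)
{
    assert(action != NULL && spec != NULL);
    if (dim >= spec->num_dims)
        return RL_ERR_DIMENSION;
    if (spec->dims[dim].continuous)
        return RL_ERR_TYPE;
    if (value < spec->dims[dim].min || value > spec->dims[dim].max)
        return RL_ERR_RANGE;
    ((ActionSlot *)action)[dim].i = value;
    return RL_OK;
}

/* Reads a discrete value after checking dimension and type. */
int get_discrete_action(const void *action, const ActionSpec *spec,
                        size_t dim, int *out)
{
    assert(action != NULL && spec != NULL && out != NULL);
    if (dim >= spec->num_dims)
        return RL_ERR_DIMENSION;
    if (spec->dims[dim].continuous)
        return RL_ERR_TYPE;
    *out = ((const ActionSlot *)action)[dim].i;
    return RL_OK;
}
```

The continuous versions would look the same with double in place of int. A caller that ignores the return code gets no protection, which is one reason exceptions were suggested in the first place.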

Also, the data type of a reward should not need to vary (I think), so just make it always a double or a float. None of the current environment levels say anything about what the reward data type is, so I say make it a double unless we add additional environment levels.

I'd write this code if everyone thinks it's a good approach to doing it.

This would work, but it's a really ugly solution to the problem. I'd recommend strongly against it. Though, there should be a function added to the RLInterface and the agent, that is something like:
bool RLInterface::getAgentSupportsEnvironmentLevel(int level);
In case a given agent doesn't work with say non-discrete variables.

So, that all said, any flaws in my approach or should I go ahead and code it up?

-Thomas Pittman  

Just posted my not-tested SarsaAgent.cpp code. Also changed the links for RLInterface.c and Agent.h because I had to change agent_end to return void.

One assumption I'm making is that the states are integers numbered from 0 to n-1. One assumption I'm not making is that actions are anything other than the Action type, which I'm happy enough about. Then I make my own action array and index it from 0 to m-1, of course.

I didn't want to make a state array because it's one thing iterating over an action array and another entirely needing to create a structure the size of the number of states, and looking up the *index* into it on every step. That seems like a horrible waste of time. So should we enforce that states are ints, or should we generalize soon?


Just some random thought:

The terminal condition could be handled as follows: the observation struct would have 3 fields: reward, state, and terminal_flag. This makes it easy for the environment to set and for the interface to check. It simplifies task_specs too, because I think we can come up with counterexamples every time someone proposes a numeric terminal value. Also, C can't do the Python trick of sometimes returning a string and sometimes returning a number.
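For illustration, a minimal version of that three-field struct with a toy counting environment. The field types, the environment, and the run_episode helper are assumptions; only the three field names come from the suggestion above:

```c
#include <assert.h>

typedef struct {
    double reward;
    int state;
    int terminal_flag;   /* 1 when the episode has ended, else 0 */
} Observation;

#define GOAL_STATE 3     /* arbitrary goal for this toy environment */
static int current_state;

Observation env_start(void)
{
    Observation o = { 0.0, 0, 0 };
    current_state = 0;
    return o;
}

/* Action 1 advances the counter, action 0 stays put. */
Observation env_step(int action)
{
    Observation o;
    assert(action == 0 || action == 1);
    current_state += action;
    o.reward = -1.0;
    o.state = current_state;
    o.terminal_flag = (current_state == GOAL_STATE);
    return o;
}

/* The interface only has to look at the flag; no special state value
   is reserved. Returns the number of steps taken. */
int run_episode(void)
{
    int steps = 0;
    Observation o = env_start();
    while (!o.terminal_flag) {
        o = env_step(1);
        steps++;
    }
    return steps;
}
```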


Further discussions of task_spec:

We think that agents and environments will usually be able to handle only one kind of everything (continuous/discrete, multidimensional), and should have functions which return what main_spec they can handle. This will be an integer, but defined in RL_Interface in words - DISCRETE_SPACE, etc. We'll discuss.

So main_spec is separate from task_spec - main_spec is something that the agent and environment can either handle or not. task_spec is the information the environment needs given the kind it is - number of states, number of dimensions, range of continuous values, etc.

States in main_spec level 1 (integer everything) start counting from 0.

It will be the responsibility of the RL_Interface to cast things appropriately based on the specs, and to ensure that the agent and environment are okay with that spec. The agent and environment need functions (name suggestions?) which return true/false, or which return the task level they can handle.

It's not going to be too complicated with supported task specs, because now we are assuming that the agent only handles one type. This seems fair.


Some notes on ideas for the RL interface. These won't make all that much sense by themselves, but may give some sense what i am thinking about on this (and i don't have time just now to polish).

version number - this interface supports env versions X or higher, agent versions Y or higher
calls may have an N in them to indicate expected num of args, absent args default to...zero.

For env designers:
version number - this env has a version number
env must be able to describe the numericity of the sensations it generates:
env must be able to describe the ranges for each type of sensation with non-zero numericity:
cardinal: {0,1,..,N-1}, provide N for each
interval: [min,max), provide min,max for each
env must be able to describe the data types for each type of sensation with non-zero numericity:
env must be able to describe the numericity of the actions it expects to receive:
env must be able to describe the ranges for each type of action with non-zero numericity:
cardinal: {0,1,..,N-1}, provide N for each
interval: [min,max), provide min,max for each
env must be able to describe the data types for each type of action with non-zero numericity:
For example: tabular env: generates N, expects M, one argument each
For example: mountain car:

For agent designers:
version number - this agent has a version number
agent must describe the numbers of the sensation space it accepts:
agent must describe the envelope of the actions it generates:
agent may be able to take info on the actions it should generate:
(and same for their ranges)
agent may be able to take info on the sensations it should expect:
(and same for their ranges)

Or better might be to make multiple calls, one per variable, providing all info about it.
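A rough sketch of the "one call per variable" idea, using the cardinal and interval ranges from the notes above. The struct, the function names, and the two example variables are invented for this illustration:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { VAR_CARDINAL, VAR_INTERVAL } VarKind;

typedef struct {
    VarKind kind;
    int n;            /* cardinal: legal values are 0..n-1 */
    double min, max;  /* interval: legal values are in [min, max) */
} VarInfo;

/* Hypothetical query: describes sensation variable number `index` of a
   toy environment with one cardinal and one interval variable.
   Returns 0 on success, -1 if there is no such variable. */
int env_describe_sensation(int index, VarInfo *out)
{
    assert(out != NULL);
    switch (index) {
    case 0:
        out->kind = VAR_CARDINAL; out->n = 5;
        out->min = 0.0; out->max = 0.0;
        return 0;
    case 1:
        out->kind = VAR_INTERVAL; out->n = 0;
        out->min = 0.0; out->max = 1.0;
        return 0;
    default:
        return -1;
    }
}

/* Checks whether x is a legal value for the described variable. */
int var_contains(const VarInfo *v, double x)
{
    assert(v != NULL);
    if (v->kind == VAR_CARDINAL)
        return x >= 0.0 && x < (double)v->n && x == (double)(long)x;
    return x >= v->min && x < v->max;
}
```

The interface could loop over index until the call returns -1, collecting everything it needs to know about each variable in one place.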


Rich, I think I understand the overall flavor of your approach, but like you said the details aren't entirely clear. Perhaps when you get a chance you could comment on how it differs from the following (a marriage of several people's input).

I think the task_spec should include a version number X and a level. Why is the level important? Well, in the interface (C) globals have to be declared. So if each task spec level corresponds to one of what we determine are the meaningful/possible scenarios, then we can declare things as void (as in Mark's previous posting) and cast the types based on the level. In this way the level describes the types that we expect and the expected format of actions and sensations.

I imagine the following levels, this is for the task_spec generated by the env only....others would easily follow.
format :: (Version_num, level, [sensation_minS], [sensation_maxS], [action_minA], [action_maxA])
maxA - max action value
minA - min action value
maxS - max sensation value
minS - hmmmmm

level 1:
(X, 1, 0, N-1, 0, N-1)
Discrete actions and discrete sensations ... standardize numbering starting at zero.
In this case the state can be multi-dimensional, but can always be converted into an integer.

sensations are ints
actions are ints

level 2:
(X, 2, 0, N-1, [minA::INT], [maxA::INT])
Discrete actions but multiple actions and discrete sensations ...
therefore we must give an array of ints specifying the min action value in each dimension and a corresponding max array.

sensations are ints
actions are an array of ints

level 3:
(X, 3, 0, N-1, [minA::DOUBLE], [maxA::DOUBLE])
Continuous single or multiple actions and discrete sensations ...
Now we must pass arrays of doubles to specify the range of each dimension of the actions
The assumption here is that once you're working with continuous actions, whether the agent emits a single output or a multidimensional one, it is easier to work with arrays of doubles instead of having two separate cases for double and [double] actions.

sensations are ints
actions are an array of doubles

level 4:
(X, 4, [minS::DOUBLE], [maxS::DOUBLE], 0, N-1 )
Continuous sensations that may be single or multi dimensional. Actions are discrete.
Now we must pass arrays of doubles to specify the range of each dimension of the state and we standardize action labeling from 0 to N-1. Again same assumption with sensation here as with actions previously. If we are working with continuous might as well work with double arrays.

sensations an array of doubles
actions are ints

level 5:
(X, 5, [minS::DOUBLE], [maxS::DOUBLE], [minA::INT], [maxA::INT])
Same as level 4 except actions are now multi-D. So we have to specify range of each action dimension, hence arrays of ints.

sensations an array of doubles
actions an array of ints

level 6:
(X, 6, [minS::DOUBLE], [maxS::DOUBLE], [minA::DOUBLE], [maxA::DOUBLE])
Everything is continuous, sensations and actions. May be single or multidimensional.

sensations an array of doubles
actions an array of doubles

This way the level corresponds to the types of the sensations and actions without having it explicitly passed as a string "action-int" for example.
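The six levels above amount to a small lookup table from level to types. A sketch, with enum and function names that are assumptions:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TYPE_INT, TYPE_INT_ARRAY, TYPE_DOUBLE_ARRAY } ValueType;

typedef struct {
    ValueType sensation;
    ValueType action;
} LevelTypes;

/* Looks up the sensation and action types implied by a level.
   Returns 0 on success, -1 for an unknown level. */
int level_types(int level, LevelTypes *out)
{
    static const LevelTypes table[6] = {
        { TYPE_INT,          TYPE_INT          },  /* level 1 */
        { TYPE_INT,          TYPE_INT_ARRAY    },  /* level 2 */
        { TYPE_INT,          TYPE_DOUBLE_ARRAY },  /* level 3 */
        { TYPE_DOUBLE_ARRAY, TYPE_INT          },  /* level 4 */
        { TYPE_DOUBLE_ARRAY, TYPE_INT_ARRAY    },  /* level 5 */
        { TYPE_DOUBLE_ARRAY, TYPE_DOUBLE_ARRAY },  /* level 6 */
    };
    assert(out != NULL);
    if (level < 1 || level > 6)
        return -1;
    *out = table[level - 1];
    return 0;
}
```

The interface could use the result to decide how to cast a void* sensation or action before handing it on.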
As someone who has worked a lot with continuous sensations and actions, I think the min and max (which cover the range, extreme values, and cardinality in the discrete case) are very important. If you are tile coding you want to know the range so you can scale the sensation into [0,1]. I think now that the density would be important too: not the number of states per se, or the distribution, but a number between 0 and 1 that specifies the sparseness of the sensations produced. This is important for tile coding, where I want to choose the number of tilings and memory size. Maybe that's providing too much info... not sure. It might look like this:

(X, 6, [minS::DOUBLE], [maxS::DOUBLE], [minA::DOUBLE], [maxA::DOUBLE], [densityS::DOUBLE] )
an array specifying the densities of each dimension of the state. Of course this only applies to continuous sensations, levels 4 through 6.
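To show why the min and max arrays matter, here is a sketch of the scaling step an agent might run before tile coding. The function name and argument order are assumptions; the formula is plain min-max scaling, (s - minS) / (maxS - minS), applied per dimension:

```c
#include <assert.h>
#include <stddef.h>

/* Scales each sensation dimension into [0,1] using the ranges that
   the task_spec would supply. */
void scale_sensation(const double *sensation, const double *min_s,
                     const double *max_s, double *scaled, size_t num_dims)
{
    for (size_t i = 0; i < num_dims; i++) {
        assert(max_s[i] > min_s[i]);
        scaled[i] = (sensation[i] - min_s[i]) / (max_s[i] - min_s[i]);
    }
}
```

Without the ranges in the spec, every agent would have to hard-code them per environment, which is what the task_spec is meant to avoid.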

I think this gives the interface the info it needs to decide if an agent and env can talk to each other, and reduces the burden on the end user. I think this specification, if well documented, is much simpler and faster than multiple functions. It also gives the agent lots of info about the task. My hope/dream is that the env and agent writers have to define only around 4 functions and this spec. More complication may make our lives and the interface clearer, but I think that is backwards. The interface can/will be horrible inside (no one will look anyway), but writing agents and envs will be fast and simple. We will finally reach a point where we can literally write a new env and quickly plug it into a prewritten agent, as long as these specs are handled properly.

Crap? Ok? - let me know!!!

I've made a new proposal for the task_spec format. Let me know what you think.


The new task spec looks good, Mark. The only thing that I would change would be to use a 0 for discrete and a 1 for continuous (simply for the sake of proper boolean representation), instead of the 1 and 2 that you proposed.

-Thomas Pittman  

0 and 1 are more convenient. I agree with that change.

Adam, in response to your post on the task-spec page:

I think having 6 task spec numbers might make the code a bit uglier. I was just thinking about how the code would look and I think it'd be something like:

if (level 1)
elif (level 2)

To me this new way seems a bit more straightforward and clean. Separating the levels for state and action gives many fewer levels, because we don't have to take the product of the two (you know what I mean?).

Some dimensions discrete and some continuous: I was just thinking about environments like... for example, maybe you're driving a car and you can control the gas pedal (continuous), but you also control the high beams (discrete). If everything is assumed to be continuous then how does that work? Does the environment say that the high beams are continuous from 0 to 1? And what does 0.5 mean? I just think it's maybe more flexible to allow both.


Brian pointed out that we should get feedback from people on the front lines of RL applications, to make sure our interface would be useful to them. Do we need replay? Do we need an "undo"? Do we need to feed in the seed to generate exactly the same environmental response? Do we need to account for batch updating?

How do we enforce these things? Do we need a concerted effort to take these into account, either through communication with people running experiments or by reading their papers to see what they used?


So I have written some code to interface with an environment written for RLbench (assuming it's compiled already in a makefile or something). It supports all three generative models of RLbench, specified by the user through a compiler flag. It also parses the output of RLbench to construct a task spec according to Mark's last post on it.

A few assumptions I have made:
1 - Even though the task spec specifies whether the actions & state are int, double or mixed and the number of dimensions, I have assumed that everything (actions and states) should be declared and passed around as vectors of doubles.

This eliminates declaring things as void* and significantly reduces code length and complexity. An example of this complexity sheds a lot of light:

env_step method
action - can be INT, [INT], double, [DOUBLE], or a mix
out :
reward - double
state - can be INT, double, [DOUBLE], or a mix

So trying to implement this is a pain and I don't think it gives added clarity to the user. If the task spec tells them the state is (int, double, double, int) then that's enough. If we use vectors of doubles we have the same type across the interface code, we can still access elements array style (vec[3]), and vectors support .size(), which is nice.

I have talked this over with Anna and we feel this reduces complexity and doesn't limit expressiveness.
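The post above talks about vectors with .size(); as a plain C stand-in, here is a sketch of the same idea with a small struct that carries its own size. The struct, the MAX_COMPONENTS limit, and the rounding helper are assumptions. It uses the (int, double, double, int) state from the post:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMPONENTS 8   /* arbitrary limit for this sketch */

/* Everything (states and actions) travels as an array of doubles. */
typedef struct {
    size_t size;
    double values[MAX_COMPONENTS];
} DoubleVec;

/* Reads a component that the task_spec says is an int. Rounds to the
   nearest integer so that small floating point error cannot turn
   2.9999999 into 2. */
int component_as_int(const DoubleVec *v, size_t index)
{
    assert(v != NULL && index < v->size);
    double x = v->values[index];
    return (int)(x >= 0.0 ? x + 0.5 : x - 0.5);
}
```

For a state of (2, 0.25, 3.5, 7), component_as_int(&state, 3) gives the int 7 back, while state.values[1] is read directly as a double. The agent still needs the task_spec to know which components are which.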

2 - the interface will be extended to support agent passing a state and/or random seed from its step method.

3 - the observation struct will be extended to support an optional return "state".

I will post the code tomorrow after I have chatted with Mark about some things. I am making a similar program to interface with RLbench agents, which now forces me to define the following:

task_spec interaction proposal
The environment passes a task_spec to the interface, and the agent passes its task_spec to the interface. The interface will compare them and determine whether the two can "talk". If so, the environment's task_spec is passed to the agent's init method.

The form of the agent's task spec would be the following (much like Mark's env one):

The first part is the version info

State(S) & Action(A) Info: Both have the same format:

level #dimensions

where level is a number specifying whether the space is continuous or discrete, and #dimensions is a number specifying the number of dimensions in the space, but it's slightly different here.
if #dimensions == 0
single dimension state/action
if #dimensions == 1
multi dimension state/action

level can either be 1, 2, or 3 where the meanings of each are as follows:
1 means that the space is discrete
2 means that the space is continuous
3 means that the space is partially continuous

means version number 1, continuous multi-dimensional state space, and continuous single dimensional action

If everyone is ok with the idea of 2 task_specs and the interface comparing them AND the agent task_spec I have proposed then I will go ahead with the RLbench agent interfacing code.
Next would be doing the same thing for CSLL.
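A sketch of the comparison step in the proposal above. The proposal does not say what rule decides whether the two can "talk", so this assumes the simplest one: version, level, and the single/multi dimension flag must all match exactly. The struct and function names are hypothetical, and the specs are taken as already parsed:

```c
#include <assert.h>
#include <stddef.h>

/* Level values from the proposal. */
enum { SPACE_DISCRETE = 1, SPACE_CONTINUOUS = 2, SPACE_PARTIAL = 3 };

typedef struct {
    int level;       /* 1, 2 or 3 as above */
    int multi_dim;   /* 0 = single dimension, 1 = multi dimension */
} SpaceInfo;

typedef struct {
    int version;
    SpaceInfo state;
    SpaceInfo action;
} SpecSummary;

static int same_space(SpaceInfo a, SpaceInfo b)
{
    return a.level == b.level && a.multi_dim == b.multi_dim;
}

/* ASSUMED rule: everything must match exactly. */
int specs_can_talk(const SpecSummary *env, const SpecSummary *agent)
{
    assert(env != NULL && agent != NULL);
    return env->version == agent->version
        && same_space(env->state, agent->state)
        && same_space(env->action, agent->action);
}

/* Returns the env task_spec to pass to the agent's init method, or
   NULL if the two cannot talk. */
const char *task_spec_for_agent(const SpecSummary *env,
                                const SpecSummary *agent,
                                const char *env_task_spec)
{
    return specs_can_talk(env, agent) ? env_task_spec : NULL;
}
```

A looser rule (for example, letting a "partially continuous" agent accept any environment) would only change specs_can_talk.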


i think i have a reasonable way to handle less-standard performance measures such as the expected-reward measure yaki brought up.  we do it through side-calls directly to the environment.  so, you write your benchmark, calling rl_step, and on each step you make an additional call to the environment to get the additional performance measure.  (this could be anything you wanted and could have any name, but maybe something like env_expected_reward() would be appropriate for yaki's case.)  this strategy would allow complete generality in additional performance measures.  And i don't think we should view this as too much of a climb down from a pure interface.  i think in practice people will want to do lots of things in their main and benchmark programs -- for example graphical displays of various env and agent variables -- that are idiosyncratic to their particular purposes.
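A sketch of the side-call strategy described above. The toy environment, its fixed reward pattern, the always-action-1 agent, and the signatures are assumptions; rl_step and env_expected_reward are the names suggested in the post:

```c
#include <assert.h>

/* Toy two-action environment (hypothetical). Rewards follow a fixed
   repeating pattern so that the sketch needs no random numbers. */
static int step_count;
static int last_action;

void env_start(void)
{
    step_count = 0;
    last_action = 0;
}

/* Normal interface call: returns the reward actually received.
   Action 1 pays 1 on every second step; action 0 never pays. */
double env_step(int action)
{
    assert(action == 0 || action == 1);
    last_action = action;
    step_count++;
    return (action == 1 && step_count % 2 == 0) ? 1.0 : 0.0;
}

/* Side-call, outside the standard interface: the expected reward of
   the most recent action. */
double env_expected_reward(void)
{
    return last_action == 1 ? 0.5 : 0.0;
}

/* Stand-in for the interface step; the agent here always picks 1. */
double rl_step(void)
{
    return env_step(1);
}

/* Benchmark: one ordinary step, then one side-call, on every step. */
double benchmark_expected_reward(int num_steps)
{
    double total_expected = 0.0;
    env_start();
    for (int t = 0; t < num_steps; t++) {
        (void)rl_step();
        total_expected += env_expected_reward();
    }
    return total_expected;
}
```

Only the benchmark knows about env_expected_reward; the agent and the interface are unchanged.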


I was thinking about State and Action as defined in Globals.h. Rich suggested that we declare them as integers for the public release in order to present the simple case. However, we (Adam and I) have been more inclined to declare them as double arrays to present the general case (because an array of doubles can store an integer, a double, an array of integers, or an array of doubles). The problem with an array of doubles when the actual type is something like an integer is that both the env and agent must do some ugly casting on each step.

I've thought of an alternative that might be both simple and general. Using a union. So, State would be declared as:

       typedef union {
           int i;
           double d;
           int* ai;
           double* ad;
       } State;

in Globals.h. Then, let's say that the state is a single integer. The env code would look like this:

          State s;
          s.i = intVal;
          return s;

and the agent code would look like this:

     agent_step(State s)
         int intVal = s.i;

So, in theory this method is fairly simple to use (no ugly casting needed) and really general (Globals.h should not need to be changed except for really bizarre state types (like linked lists)). Does this sound ok to everyone?
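Putting the fragments above together into something that compiles, as a sketch. The union is the one proposed in the post; the four helper functions and their names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

typedef union {
    int i;
    double d;
    int *ai;
    double *ad;
} State;

/* Environment side: a single-integer state. */
State env_make_int_state(int intVal)
{
    State s;
    s.i = intVal;
    return s;
}

/* Environment side: an array-of-doubles state. The union stores only
   the pointer, so the array must outlive the State. */
State env_make_array_state(double *values)
{
    State s;
    assert(values != NULL);
    s.ad = values;
    return s;
}

/* Agent side: read back the member the task_spec says is in use. */
int agent_read_int_state(State s)
{
    return s.i;
}

double agent_read_array_state(State s, size_t index)
{
    return s.ad[index];
}
```

One caveat: the union carries no tag, so the agent and the environment must agree (through the task_spec) on which member is active. Reading a different member than the one that was written gives garbage, just as a wrong cast would.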


We should plan on the type definitions for sensation and action to be made specially for the individual envs. This is what will happen almost always. We should plan on and for it. We should provide a few examples where the type definitions are different for different env-agent combinations. It would be nice if we could have two definitions in play in the same executable, as when running several agent-env combinations. Inside the interface, generic type definitions should be used. In the agent and environment, more specific definitions should be used.


This open web page is hosted at the University of Alberta.