by Richard S. Sutton and Juan Carlos Santamaria
This document presents a standard interface for programming reinforcement learning simulations in C++. There are three basic objects: agents, environments, and simulations. The agent is the learning agent and the environment is the task that it interacts with. The simulation manages the interaction between the agent and the environment, collects data, and manages the display, if any.
The outputs of the agent are termed actions, and the inputs of the agent are termed sensations. In the simplest case, the sensations are the states of the environment, but the interface allows them to be arbitrarily related to the true states of the environment. Here's the standard figure:
Whereas the reward is a number, the actions and sensations are instances of classes derived from the Action and Sensation abstract classes respectively. The implementation of actions and sensations can be arbitrary as long as they are understood properly by the agent and the environment. Obviously the agent and environment have to be chosen to be compatible with each other in this way.
The interaction between the agent and environment is handled in discrete time. We assume we are working with simulations here; there are no real-time constraints enforced by the interface. In other words, the environment waits for the agent while the agent is selecting its action and the agent waits for the environment while the environment is computing its next state.
The interface supports either trial-based or continually running simulations. Continually running simulations are treated simply as a single trial that never ends.
In a typical use of the interface, the user first defines any
needed specialized object classes and then creates her new agent,
environment, and simulation by creating an instance of each. The
agent and environment are then passed to Simulation::init
which initializes and interconnects them. Finally, calls to
Simulation::steps
or
Simulation::trials
actually run the simulation.
Here is a prototypical example:
A complete example including definitions of specific agent and environment classes is given in the final section of this document.

    #include "rli.h"

    // Declare My_Agent
    class My_Agent : public Agent
    {
    public:
      void init( int argc, char *argv[] );
      Action *start_trial( const Sensation *ps );
      Action *step( const Sensation *ps, const Action *pa,
                    double reward, const Sensation *pnext_s );
    };

    // Implementation of My_Agent
    . . .

    // Declare My_Env
    class My_Env : public Environment
    {
    public:
      void init( int argc, char *argv[] );
      Sensation *start_trial( void );
      void step( const Action *pa, double &reward, Sensation *&pnext_s );
    };

    // Implementation of My_Env
    . . .

    int main( int argc, char *argv[] )
    {
      My_Agent *pa = new My_Agent;
      My_Env *pe = new My_Env;
      Simulation sim( pa, pe );

      // Initialize simulation
      sim.init( argc, argv );

      // Run 1000 steps
      sim.steps( 1000 );
    }
Note that all the documentation for all objects and routines includes HTML anchors to facilitate automatic indexing into this document. Links to the source code are also provided. For any C++ entity with a bracketed descriptor to its right (e.g., [function]), its source code can be brought up by clicking on the bracketed words.
The agent is the entity that interacts with the environment, receiving sensations and selecting actions. The agent may or may not learn, may or may not build a model of the environment, etc.
Agent
The base class of all agents. Specific agents are instances of
subclasses derived from Agent
. User defined agent
classes (subclasses of Agent
) will normally provide
specialized definitions of the following three functions.
void Agent::init
( int argc, char *argv[] )
This function is normally provided by the user for her specialized
agent class. Agent::init()
should initialize the
instance of the agent, creating any needed data structures. If the
agent learns or changes in any way with experience, then this function
should reset it to its original, naive condition. The input arguments
provide the generic command-line initialization parameters using
the standard format (i.e., argc is the number of
command-line parameters and argv is the array of pointers
to strings). Normally, Agent::init()
is called once when
the simulation is first assembled and initialized. The default
implementation for Agent::init()
does nothing.
If needed, the agent can consult with the environment or the simulation as part of its initialization (although at present no standard interface has been defined for this sort of interaction). The agent can access the simulation, and through it the environment, via the Agent::psim member pointer. The environment and simulation are both guaranteed to exist and to have been initialized by the time Agent::init() is called.
Action* Agent::start_trial
( const Sensation* ps)
This function is usually provided by the user for her specialized
agent class. It is called at the beginning of each new trial.
Agent::start_trial()
should perform any needed
initialization of the agent to prepare it for beginning a new trial.
It should return a pointer to the first action of the agent in the new
trial, in response to ps (a pointer to the first sensation
of the trial). The agent instance should provide the memory in which the action is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed. A typical definition for Agent::start_trial is:
    Action *My_Agent::start_trial( const Sensation *ps )
    {
      // Memory space to store the value of an action
      My_action *pa = new My_action;

      // policy() is a function that stores the value of the
      // action corresponding to ps in pa.
      policy( (My_Sensation *)ps, pa );
      return pa;
    }
where policy()
is a function that implements the
decision-making policy of the agent.
Action* Agent::step
( const Sensation* ps, const Action* pa,
double reward,
const Sensation* pnext_s )
This is the main function for Agent
, where all the
learning takes place. It must be provided by the user and will be
called once by the simulation instance on each step of the simulation.
Agent::step()
informs the agent that, in response to the sensation pointed to by ps and its (previously chosen) action pointed to by pa, the environment returned the payoff in reward and the sensation pointed to by pnext_s. This function returns a pointer to the action to be taken in response to the sensation pointed to by pnext_s. The agent instance should provide the memory in which the action is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed.
In a trial-based task, pnext_s may
take on the special value 0
, indicating that the trial
has terminated with this step. The author of
Agent::step()
is responsible for checking for this and
adjusting its learning and other processes accordingly. In this case,
the value returned from Agent::step()
will be ignored.
The sensation and action pointed to by ps and pa on one call to Agent::step are always the same as the sensation pointed to by pnext_s and the action returned on the previous call. Thus, there is a sense in which these arguments are unnecessary and provided just as a convenience: they could simply be remembered by the agent from the previous call. This is permitted and often necessary for efficient agent code (to prevent redundant processing of sensations and actions). For this to work, Agent::step must never be called directly by the user.
The environment basically defines the problem to be solved. It determines the dynamics of the environment, the rewards, and the trial terminations.
Environment
The base class of all environments. Specific environments are
instances of subclasses derived from Environment
. User
defined environment classes (subclasses of Environment
)
will normally provide specialized definitions of the following three
functions.
void Environment::init
( int argc, char *argv[] )
This function is normally provided by the user for her specialized
environment class. Environment::init()
should initialize
the instance of the environment, creating any needed data structures.
If the environment changes in any way with experience, then this
function should reset it to its original, naive condition. The input
arguments provide the generic command-line initialization parameters
using the standard format (i.e., argc is the number of
command-line parameters and argv is the array of pointers
to strings). Normally, Environment::init()
is called once
when the simulation is first assembled and initialized. The default
method for Environment::init()
does nothing.
If needed, the environment can consult with the simulation as part
of its initialization. The environment can access the
simulation by accessing the Environment::psim member
pointer. The corresponding agent is not available at the
time Environment::init()
is called.
Sensation* Environment::start_trial
( void )
This function must be provided by the user for her specialized
environment class. It is normally called at the beginning of each new
trial. Environment::start_trial()
should perform any
needed initialization of the environment to prepare it for beginning a
new trial. It should return a pointer to the first sensation of the
trial. The environment instance should provide the memory in which the sensation is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed.
void Environment::step
( const Action* pa,
double &reward,
Sensation *&pnext_s )
This is the main function for Environment
. It must
be provided by the user and will be called once by the simulation
instance on each step of the simulation.
Environment::step()
causes the environment to
undergo a transition from its current state to a next state dependent on the action pointed to by pa. The function returns the payoff of the state transition in the reference reward and the pointer to the next sensation in the reference pnext_s. The environment instance should provide the memory in which the sensation is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed.
If the transition is into a terminal state, then the pointer to
the next sensation returned must have the special value
0
.
The simulation is the base object of the interface. It manages
the interaction between the agent and the environment. The functions
Simulation::init()
, Simulation::steps()
,
and Simulation::trials()
are not intended to
be changed by the user. They define the heart of the interface, the
uniform usage that all agents and environments are meant to conform
to. A simulation class is created by deriving from
Simulation
and providing the implementation to the Simulation::collect_data()
virtual function.
Simulations can be specialized to provide reporting (see Simulation::start_trial()
and Simulation::collect_data()
)
and display capabilities. For example, a display may start or stop the
simulation and show its progress in various ways. Display updates can
be triggered in Simulation::collect_data(),
Agent::step()
, Environment::step()
, or
whatever, by calling user-provided functions associated with the
specialized simulation object (accessible via Agent::psim
or Environment::psim
member
pointers).
Simulation
The base class of all simulations. Earlier we saw a prototypical example of the use of a simulation. A simulation instance is associated with an agent instance and an environment instance at the moment of creation. This is done in the constructor of Simulation, which takes the form Simulation::Simulation( Agent *pa, Environment *pe ).
void Simulation::init
( int argc, char *argv[] )
Simulation::init()
initializes the simulation
instance, the agent, and the environment, which should be instances of
classes derived from the respective abstract classes. It also calls
Simulation::start_trial()
after initializing the agent and environment instances, making the
simulation object ready for Simulation::steps()
and/or
Simulation::trials()
. See the source code (by
clicking on the bracketed "function" above right) to see just how this
works.
void Simulation::start_trial
( void )
This function forces the beginning
of a new trial. This is done primarily by calls to
Environment::start_trial
and
Agent::start_trial()
to get the first sensation of the
environment and first action of the agent respectively. User-defined
specialized methods may also compute average or accumulated rewards per
trial or other data and update displays.
void Simulation::steps
( long num_steps )
Runs the simulation for num_steps steps, starting from
whatever state the environment is in. If the terminal state is
reached, the simulation is immediately prepared for a new trial by
calling Simulation::start_trial()
. The switch from the
terminal state to the new starting state does not count as a
step. Thus, this function allows the user to control the execution of
her simulation by providing the total number of steps directly.
void Simulation::trials
( long num_trials, long max_steps_per_trial )
Runs the simulation for num_trials trials, starting
from whatever state the environment is in. Each trial can be no longer
than max_steps_per_trial steps. Each trial begins by
calling Simulation::start_trial()
and ends when the
terminal state is reach or when max_steps_per_trial steps
is reached, whichever comes first. Thus, this function allows the user
to control the execution of her simulation by providing the total
number of trials directly.
void Simulation::collect_data
( const Sensation* ps, const Action* pa,
double reward,
const Sensation* pnext_s )
This function is called once on each step of the simulation. The default method does nothing, but user-defined specialized methods might accumulate rewards or other data and update displays. This is the preferred way to gain access to the simulation's behavior.
The simulation class holds pointers to the agent and environment instances. This facilitates cross-referencing the instances when it is needed.
    class Simulation
    {
    public:
      Agent* pagt;        // pointer to the agent instance
      Environment* penv;  // pointer to the environment instance
      . . .
    };
The following source code demonstrates the use of the interface on a double integrator, a linear dynamical system with a two-dimensional continuous state. The optimal agent implements the optimal policy for solving the problem. The CMAC agent approximates the Q-function with a CMAC and uses SARSA as the learning algorithm. The source code is divided into several modules. As is traditional in C++, each module named xxx has an interface file (xxx.h) and an implementation file (xxx.cc). The example includes two different agents: optimal and CMAC. The following list provides a brief description of each module. The code compiles and runs in UNIX using the GNU C++ compiler (g++).
All the source code, including the RL interface, double-integrator, agents, the makefile, and README is stored in a tar file here.