by Richard S. Sutton and Juan Carlos Santamaria
This document presents a standard interface for programming reinforcement learning simulations in C++. There are three basic objects: agents, environments, and simulations. The agent is the learning agent and the environment is the task that it interacts with. The simulation manages the interaction between the agent and the environment, collects data, and manages the display, if any.
The outputs of the agent are termed actions, and the inputs of the agent are termed sensations. In the simplest case, the sensations are the states of the environment, but the interface allows them to be arbitrarily related to the true states of the environment. Here's the standard figure:
Whereas the reward is a number, the actions and sensations are instances of classes derived from the Action and Sensation abstract classes respectively. The implementation of actions and sensations can be arbitrary as long as they are understood properly by the agent and the environment. Obviously the agent and environment have to be chosen to be compatible with each other in this way.
The interaction between the agent and environment is handled in discrete time. We assume we are working with simulations here; there are no real-time constraints enforced by the interface. In other words, the environment waits for the agent while the agent is selecting its action and the agent waits for the environment while the environment is computing its next state.
The interface supports either trial-based or continually running simulations. Continually running simulations are treated simply as a single trial that never ends.
In a typical use of the interface, the user first defines any
needed specialized object classes and then creates her new agent,
environment, and simulation by creating an instance of each. The
agent and environment are then passed to Simulation::init
which initializes and interconnects them. Finally, calls to
Simulation::steps
or
Simulation::trials
actually run the simulation.
Here is a prototypical example:
A complete example including definitions of specific agent and environment classes is given in the final section of this document.

    #include "rli.h"

    // Declare My_Agent
    class My_Agent : public Agent
    {
    public:
      void init( int argc, char *argv[] );
      Action *start_trial( const Sensation *ps );
      Action *step( const Sensation *ps, const Action *pa,
                    double reward, const Sensation *pnext_s );
    };

    // Implementation of My_Agent
    . . .

    // Declare My_Env
    class My_Env : public Environment
    {
    public:
      void init( int argc, char *argv[] );
      Sensation *start_trial( void );
      void step( const Action *pa, double &reward, Sensation *&pnext_s );
    };

    // Implementation of My_Env
    . . .

    int main( int argc, char *argv[] )
    {
      My_Agent *pa = new My_Agent;
      My_Env *pe = new My_Env;
      Simulation sim( pa, pe );

      // Initialize simulation
      sim.init( argc, argv );

      // Run 1000 steps
      sim.steps( 1000 );
    }
Note that all the documentation for all objects and routines includes HTML anchors to facilitate automatic indexing into this document. Links to the source code are also provided. For any C++ entity with a bracketed descriptor to its right (e.g., [function]), its source code can be brought up by clicking on the bracketed words.
The agent is the entity that interacts with the environment, receiving sensations and selecting actions. The agent may or may not learn, may or may not build a model of the environment, etc.
Agent
The base class of all agents. Specific agents are instances of
subclasses derived from Agent
. User defined agent
classes (subclasses of Agent
) will normally provide
specialized definitions of the following three functions.
void Agent::init
( int argc, char *argv[] )
This function is normally provided by the user for her specialized
agent class. Agent::init()
should initialize the
instance of the agent, creating any needed data structures. If the
agent learns or changes in any way with experience, then this function
should reset it to its original, naive condition. The input arguments
provide the generic command-line initialization parameters using
the standard format (i.e., argc is the number of
command-line parameters and argv is the array of pointers
to strings). Normally, Agent::init()
is called once when
the simulation is first assembled and initialized. The default
implementation for Agent::init()
does nothing.
If needed, the agent can consult with the environment or the simulation as part of its initialization (although at present no standard interface has been defined for this sort of interaction). The agent can access the simulation, and through it the environment, via the Agent::psim member pointer. The environment and simulation are both guaranteed to exist and to have been initialized by the time Agent::init() is called.
Action* Agent::start_trial
( const Sensation* ps)
This function is usually provided by the user for her specialized
agent class. It is called at the beginning of each new trial.
Agent::start_trial()
should perform any needed
initialization of the agent to prepare it for beginning a new trial.
It should return a pointer to the first action of the agent in the new
trial, in response to ps (a pointer to the first sensation
of the trial). The agent instance should provide the memory in which the action is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed. A typical definition for Agent::start_trial is:
    Action *My_Agent::start_trial( const Sensation *ps )
    {
      // Memory space to store the value of an action
      My_action *pa = new My_action;

      // policy() is a function that stores the value of the
      // action corresponding to ps in pa.
      policy( (My_Sensation *)ps, pa );
      return pa;
    }
where policy()
is a function that implements the
decision-making policy of the agent.
Action* Agent::step
( const Sensation* ps, const Action* pa,
double reward,
const Sensation* pnext_s )
This is the main function for Agent
, where all the
learning takes place. It must be provided by the user and will be
called once by the simulation instance on each step of the simulation.
Agent::step()
informs the agent that, in response to the sensation pointed to by ps and its (previously chosen) action pointed to by pa, the environment returned the payoff in reward and the sensation pointed to by pnext_s. This function returns a pointer to the action to be taken in response to the sensation pointed to by pnext_s. The agent instance should provide the memory in which the action is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed.
In a trial-based task, pnext_s may
take on the special value 0
, indicating that the trial
has terminated with this step. The author of
Agent::step()
is responsible for checking for this and
adjusting its learning and other processes accordingly. In this case,
the value returned from Agent::step()
will be ignored.
The sensation and action pointed to by ps and pa on one call to Agent::step are always the same as the sensation pointed to by pnext_s and the action returned on the previous call. Thus, there is a sense in which these arguments are unnecessary and provided just as a convenience: they could simply be remembered by the agent from the previous call. This is permitted and often necessary for efficient agent code (to prevent redundant processing of sensations and actions). For this to work, Agent::step must never be called directly by the user.
The environment basically defines the problem to be solved. It determines the dynamics of the environment, the rewards, and the trial terminations.
Environment
The base class of all environments. Specific environments are
instances of subclasses derived from Environment
. User
defined environment classes (subclasses of Environment
)
will normally provide specialized definitions of the following three
functions.
void Environment::init
( int argc, char *argv[] )
This function is normally provided by the user for her specialized
environment class. Environment::init()
should initialize
the instance of the environment, creating any needed data structures.
If the environment changes in any way with experience, then this
function should reset it to its original, naive condition. The input
arguments provide the generic command-line initialization parameters
using the standard format (i.e., argc is the number of
command-line parameters and argv is the array of pointers
to strings). Normally, Environment::init()
is called once
when the simulation is first assembled and initialized. The default
method for Environment::init()
does nothing.
If needed, the environment can consult with the simulation as part
of its initialization. The environment can access the
simulation by accessing the Environment::psim member
pointer. The corresponding agent is not available at the
time Environment::init()
is called.
Sensation* Environment::start_trial
( void )
This function must be provided by the user for her specialized
environment class. It is normally called at the beginning of each new
trial. Environment::start_trial()
should perform any
needed initialization of the environment to prepare it for beginning a
new trial. It should return a pointer to the first sensation of the
trial. The environment instance should provide the memory in which the sensation is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed.
void Environment::step
( const Action* pa,
double &reward,
Sensation *&pnext_s )
This is the main function for Environment
. It must
be provided by the user and will be called once by the simulation
instance on each step of the simulation.
Environment::step()
causes the environment to
undergo a transition from its current state to a next state dependent on the action pointed to by pa. The function returns the payoff of the state transition in the reference reward and the pointer to the next sensation in the reference pnext_s. The environment instance should provide the memory in which the sensation is stored, and that memory should persist after the function returns. Memory allocation should be done with the new operator; the simulator takes responsibility for freeing the memory with delete when the object is no longer needed.
If the transition is into a terminal state, then the pointer to
the next sensation returned must have the special value
0
.
The simulation is the base object of the interface. It manages
the interaction between the agent and the environment. The functions
Simulation::init()
, Simulation::steps()
,
and Simulation::trials()
are not intended to
be changed by the user. They define the heart of the interface, the
uniform usage that all agents and environments are meant to conform
to. A simulation class is created by deriving from
Simulation
and providing the implementation to the Simulation::collect_data()
virtual function.
Simulations can be specialized to provide reporting (see Simulation::start_trial()
and Simulation::collect_data()
)
and display capabilities. For example, a display may start or stop the
simulation and show its progress in various ways. Display updates can
be triggered in Simulation::collect_data(),
Agent::step()
, Environment::step()
, or
whatever, by calling user-provided functions associated with the
specialized simulation object (accessible via Agent::psim
or Environment::psim
member
pointers).
Simulation
The base class of all simulations. Earlier we saw a prototypical example of the use of a simulation. A simulation instance is associated with an agent instance and an environment instance at the moment of creation. This is done in the constructor of Simulation, which takes the form Simulation::Simulation( Agent *pa, Environment *pe ).
void Simulation::init
( int argc, char *argv[] )
Simulation::init()
initializes the simulation
instance, the agent, and the environment, which should be instances of
classes derived from the respective abstract classes. It also calls
Simulation::start_trial()
after initializing the agent and environment instances, making the
simulation object ready for Simulation::steps()
and/or
Simulation::trials()
. See the source code (by
clicking on the bracketed "function" above right) to see just how this
works.
void Simulation::start_trial
( void )
This function forces the beginning
of a new trial. This is done primarily by calls to
Environment::start_trial
and
Agent::start_trial()
to get the first sensation of the
environment and first action of the agent respectively. User-defined
specialized methods may also compute average or accumulated rewards per
trial or other data and update displays.
void Simulation::steps
( long num_steps )
Runs the simulation for num_steps steps, starting from
whatever state the environment is in. If the terminal state is
reached, the simulation is immediately prepared for a new trial by
calling Simulation::start_trial()
. The switch from the
terminal state to the new starting state does not count as a
step. Thus, this function allows the user to control the execution of
her simulation by providing the total number of steps directly.
void Simulation::trials
( long num_trials, long max_steps_per_trial )
Runs the simulation for num_trials trials, starting
from whatever state the environment is in. Each trial can be no longer
than max_steps_per_trial steps. Each trial begins by
calling Simulation::start_trial()
and ends when the
terminal state is reach or when max_steps_per_trial steps
is reached, whichever comes first. Thus, this function allows the user
to control the execution of her simulation by providing the total
number of trials directly.
void Simulation::collect_data
( const Sensation* ps, const Action* pa,
double reward,
const Sensation* pnext_s )
This function is called once on each step of the simulation. The default method does nothing, but user-defined specialized methods might accumulate rewards or other data and update displays. This is the preferred way to gain access to the simulation's behavior.
The simulation class holds pointers to the agent and environment instances. This facilitates cross-referencing the instances when it is needed.
    class Simulation
    {
    public:
      Agent* pagt;        // pointer to the agent instance
      Environment* penv;  // pointer to the environment instance
      . . .
    };
The following source code demonstrates the use of the interface on a double integrator, a linear dynamical system with a two-dimensional continuous state. The optimal agent implements the optimal policy for solving the problem. The CMAC agent approximates the Q-function with a CMAC and uses SARSA as the learning algorithm. The source code is divided into several modules. As is traditional in C++, each module named xxx has an interface file (xxx.h) and an implementation file (xxx.cc). The example includes two different agents: optimal and CMAC. The following list provides a brief description of each module. The code compiles and runs in UNIX using the GNU C++ compiler (g++).
All the source code, including the RL interface, double-integrator, agents, the makefile, and README is stored in a tar file here.