Reinforcement learning interface documentation (python) version 5

The ambition of this web page is to describe fully how to use the Python module defining a standard reinforcement learning interface. We describe 1) how to construct an interface object for a given agent and environment, 2) the inputs and outputs of the interface object, and 3) the inputs and outputs of the functions (procedures) defining the agent and environment. Not covered are the internal workings of the interface object or of any particular agent and environment.

The RLI (Reinforcement Learning Interface) module provides a standard interface for computational experiments with reinforcement-learning agents and environments. The interface is designed to facilitate comparison of different agent designs and their application to different problems (environments). This documentation presents the general ideas of the interface and a few examples of its use. After that, a pointer to the source code for the RLinterface class and its principal methods (episode, steps, and episodes) is given to answer any remaining questions.

An RLinterface is a Python object, created by calling RLinterface(agentFunction, environmentFunction). The agentFunction and environmentFunction define the agent and environment that will participate in the interface. There will be libraries of standard agentFunctions and environmentFunctions, and of course you can write your own. An environmentFunction normally takes an action from the agentFunction and produces a sensation and reward, while the agentFunction does the reverse:

environmentFunction(action) ==> sensation, reward

agentFunction(sensation, reward) ==> action

(An action is defined as anything accepted by environmentFunction and a sensation is defined as anything produced by environmentFunction; rewards must be numbers.) Together, the agentFunction and environmentFunction can be used to generate episodes -- sequences of sensations s, actions a, and rewards r:
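For concreteness, here is a minimal sketch of such a pair of functions for a hypothetical two-state problem (the problem, its dynamics, and the names myEnv and myAgent are invented for illustration and are not part of the RLI module):

import random

envState = 0                              # the environment's current state

def myEnv(action):
    global envState
    envState = action % 2                 # the action simply selects the next state
    reward = 1 if envState == 1 else 0    # reward of 1 for reaching state 1
    return envState, reward               # sensation, reward

def myAgent(sensation, reward):
    return random.choice([0, 1])          # a random policy, just to satisfy the contract

As written, these handle only the "normal" calls shown above; the "reduced" episode-starting calls described below are also needed before they can be passed to an RLinterface.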

from RLinterface import RLinterface
rli = RLinterface(myAgent, myEnv)

rli.episode(maxSteps) ==> s0, a0, r1, s1, a1, r2, s2, a2, ..., rT, 'terminal'

where 'terminal' is a special sensation recognized by RLinterface and agentFunction. (In a continuing problem there would be just one never-terminating episode.)

To produce the initial s0 and a0, the agentFunction and environmentFunction must also support being called with fewer arguments:

environmentFunction() ==> sensation

agentFunction(sensation) ==> action

When the environmentFunction is called in this way (with no arguments) it should start a new episode -- reset the environment to a characteristic initial state (or distribution of states) and produce just a sensation without a reward. When the agentFunction is called in this way (with just one argument) it should not try to process a reward on this step and should also initialize itself for the beginning of an episode. The agentFunction and environmentFunction will always be called in this "reduced" way before being called in the "normal" way.
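One natural way to support both calling forms in Python is with default argument values. Here is a revised sketch of the hypothetical myEnv and myAgent from above (again invented for illustration; this version also ends an episode, with the 'terminal' sensation, when state 1 is reached):

import random
from RLinterface import RLinterface

envState = 0

def myEnv(action=None):
    global envState
    if action is None:                    # start of a new episode
        envState = 0                      # reset to the initial state
        return envState                   # return a sensation only, no reward
    envState = action % 2
    if envState == 1:                     # reaching state 1 ends the episode
        return 'terminal', 1
    return envState, 0                    # sensation, reward

def myAgent(sensation, reward=None):
    # reward is None only on the first call of an episode; this random agent
    # keeps no state and does not learn, so both cases are handled the same way
    return random.choice([0, 1])

rli = RLinterface(myAgent, myEnv)

With these definitions, rli.episode(100) would return a sequence of the form s0, a0, r1, s1, a1, ..., rT, 'terminal'.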

Episodes can be generated by calling rli.episode(maxNumSteps) as above or, alternatively (and necessarily for continuing problems), segments of an episode can be generated by calling rli.steps(numSteps), which returns the sequence of experience on the next numSteps steps. For example, suppose rli is a freshly made RLinterface and we run it for a single step, then for one more step, and then for two steps after that:

rli.steps(1) ==> s0, a0

rli.steps(1) ==> r1, s1, a1

rli.steps(2) ==> r2, s2, a2, r3, s3, a3

Each call to rli.steps continues the current episode. To start a new episode, call rli.episode(1), which returns the same result as the first line above. Note that if rli.steps(numSteps) is called on an episodic problem it will run for numSteps steps even if episodes terminate and start along the way. Thus, for example,

rli.episode(1) ==> s0, a0

rli.steps(4) ==> r1, s1, a1, r2, 'terminal', s0, a0, r1, s1, a1

The method rli.episodes(numEpisodes, maxStepsPerEpisode, maxStepsTotal) is also provided for efficiently running multiple episodes.
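For example, with the rli constructed above, a batch of episodes might be run like this (the limits here are arbitrary):

experience = rli.episodes(10, 1000, 5000)   # 10 episodes, at most 1000 steps each, 5000 steps in all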

Examples

Here we do Q-learning with a random policy, presuming an MDP with N states and M actions.


import numpy as np
import random

N, M = 10, 4                  # numbers of states and actions (example sizes)
Q = np.zeros((N, M))          # N x M array of action values, initialized to zero
alpha = 0.1                   # step size
gamma = 0.9                   # discount rate
lastS, lastA = None, None     # previous sensation and action, saved for the update

def agentFunction(s, r=None):
    global lastS, lastA
    if r is not None:         # Q-learning update for the previous state-action pair
        target = r if s == 'terminal' else r + gamma * Q[s, :].max()
        Q[lastS, lastA] += alpha * (target - Q[lastS, lastA])
    if s == 'terminal':
        return None
    lastS, lastA = s, random.randrange(M)   # random policy; better to do epsilon greedy
    return lastA

state = 0

def environmentFunction(a=None):
    if a is None:             # start of an episode
        return state          # return the initial sensation only
    ...                       # update the state and compute the reward from action a
    return s, 0               # next sensation s (computed above) and reward

rli = RLinterface(agentFunction, environmentFunction)
rli.steps(1000)
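As the comment above suggests, epsilon-greedy action selection is usually better than a purely random policy. Here is one possible sketch, reusing the Q, M, np, and random names from the example above (the value of epsilon is arbitrary):

epsilon = 0.1                             # probability of choosing an exploratory action

def epsilonGreedy(s):
    if random.random() < epsilon:         # explore: pick an action at random
        return random.randrange(M)
    return int(np.argmax(Q[s, :]))        # exploit: pick a greedy action for state s

With this in place, the line in agentFunction that chooses a random action could instead set lastA to epsilonGreedy(s).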
If additional arguments are needed for the routines, use lambda expressions:


def agentFunction(agent, s, r=None):
    ...
    return a

def environmentFunction(environment, a=None):
    if a is None:
        return s0
    else:
        ...
        return s, r

env = makeEnvironment(...)
agt = makeAgent(...)
rli = RLinterface(lambda s, r=None: agentFunction(agt, s, r),
                  lambda a=None: environmentFunction(env, a))
rli.episodes(10, 100000)
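An alternative to lambda expressions, if preferred, is functools.partial from the Python standard library, which binds the leading arguments in the same way:

from functools import partial

rli = RLinterface(partial(agentFunction, agt), partial(environmentFunction, env))
rli.episodes(10, 100000)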

Calling Sequences for the RLinterface methods

Here are the details for calling the RLinterface methods introduced above:

RLinterface(agentFunction, environmentFunction)
  • This function sets up an interface object, which can then be used to run simulated episodes and steps. The two arguments are both functions, and are described below.

    agentFunction(s [, r])
    This function does the learning and chooses the actions for the agent. It will be called with a sensation s, and optionally with a reward r. If there is no reward, this indicates that it is the first move of an episode. The agent function should always return an action. For agents that have a specific first move for an episode, the code will look something like this:

    def agentFunction(s, r=None):
        if r is None:                 # start of episode
            return a0                 # return initial action
        else:                         # learn from previous action
            learn with s and r (and previously saved info)
            a = choose next action
            return a                  # return next action

    When the first move is chosen in the same way as the other moves, the code will look like this:

    def agentFunction(s, r=None):
        if r is not None:             # learn from previous action
            learn with s and r (and previously saved info)
        a = choose next action
        return a                      # return next action

    environmentFunction([a])
    This function does the environment's work, such as determining the next state or sensation after a move. It may be called with or without an action a. If there is no action, this indicates the start of a new episode; in this case the function should return only the initial sensation. Otherwise, it should return a new sensation and a reward.

    def environmentFunction(a=None):
        if a is None:                 # start of episode
            return s0                 # return initial sensation
        else:
            s, r = do action a
            return s, r               # return next sensation and reward

  • The object created by RLinterface has the following methods; a short usage sketch follows the list:

    step()
    Runs the simulation for exactly one step. Returns the list of sensations, actions, and rewards from that step.

    steps(numSteps)
    stepsQ(numSteps)
    Runs the simulation for numSteps steps, regardless of episode endings (if any). If steps is used, it will return a list of the sensations, actions, and rewards in the simulation. If this is not wanted, use stepsQ instead (the quicker and quieter version).

    episode([maxSteps])
    episodeQ([maxSteps])
    Runs a single episode (until the sensation 'terminal' is reached). If episode is used, it will return a list of the sensations, actions, and rewards in the episode. If this is not wanted, use episodeQ instead (the quicker and quieter version). If maxSteps is specified, the simulation will stop after that many steps even if the end of the episode hasn't been reached.

    episodes(numEpisodes [, maxSteps, maxStepsTotal])
    episodesQ(numEpisodes [, maxSteps, maxStepsTotal])
    Runs numEpisodes episodes. If episodes is used, it will return a list of the sensations, actions, and rewards in the episodes. If this is not wanted, use episodesQ instead (the quicker and quieter version). If maxSteps is specified, it indicates the maximum number of steps allowed for each episode. If maxStepsTotal is specified, it limits the total number of steps for all of the episodes together (regardless of whether an episode has finished or the specified number of episodes have run).
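Putting these methods together, a typical session might look like the following sketch (the step and episode counts are arbitrary, and myAgent and myEnv are the hypothetical functions sketched earlier on this page):

from RLinterface import RLinterface

rli = RLinterface(myAgent, myEnv)

first = rli.step()                    # exactly one step of simulation
trace = rli.episode(100)              # one episode of at most 100 steps, returning its trace
rli.episodeQ(100)                     # the same, but without building the return list
chunk = rli.steps(50)                 # 50 more steps, continuing across episode boundaries
runs = rli.episodes(20, 100, 1500)    # 20 episodes, at most 100 steps each and 1500 in total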

Source Code for RLinterface Module

You can get source code for the RLinterface module by downloading the RLtoolkit.