Reinforcement Learning and Artificial Intelligence (RLAI)
Reinforcement learning interface documentation (python), version 5

The RLI (Reinforcement Learning Interface) module provides a standard interface
for computational experiments with reinforcement-learning agents and
environments. The interface is designed to facilitate comparison of different
agent designs and their application to different problems (environments). This
documentation presents the general ideas of the interface and a few examples of
its use. After that is the source code for the RLinterface class and its three
methods (episode, steps, and episodes) to answer any remaining questions.

An RLinterface is a Python object, created by calling
RLinterface(agentFunction, environmentFunction). The agentFunction and
environmentFunction define the agent and environment that will participate in
the interface. There will be libraries of standard agentFunctions and
environmentFunctions, and of course you can write your own. An
environmentFunction normally takes an action from the agentFunction and
produces a sensation and reward, while the agentFunction does the reverse:

environmentFunction(action) ==> sensation, reward
agentFunction(sensation, reward) ==> action

(An action is defined as anything accepted by environmentFunction and a
sensation is defined as anything produced by environmentFunction; rewards must
be numbers.) Together, the agentFunction and environmentFunction can be used to
generate episodes -- sequences of sensations s, actions a, and rewards r:

from RLinterface import RLinterface
rli = RLinterface(myAgent, myEnv)
rli.episode(maxSteps) ==> s0, a0, r1, s1, a1, r2, s2, a2, ..., rT, 'terminal'
where 'terminal' is a special sensation recognized by RLinterface and
agentFunction. (In a continuing problem there would be just one
never-terminating episode.)
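
Because the value returned by rli.episode is a flat list in exactly the order
shown above, individual quantities can be recovered by position. Here is a
small sketch of computing the undiscounted return; the slice index assumes the
s, a, r layout shown above and is not part of the RLI library:

ep = rli.episode(maxSteps)        # s0, a0, r1, s1, a1, ..., rT, 'terminal'
rewards = ep[2::3]                # r1, r2, ..., rT -- every third element
undiscountedReturn = sum(rewards)
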
To produce the initial s0 and a0, the agentFunction and environmentFunction
must also support being called with fewer arguments:
environmentFunction() ==> sensation
agentFunction(sensation) ==> action
When the environmentFunction is called in this way (with no arguments) it
should start a new episode -- reset the environment to a characteristic initial
state (or distribution of states) and produce just a sensation without a
reward. When the agentFunction is called in this way (with just one argument)
it should not try to process a reward on this step and should also initialize
itself for the beginning of an episode. The agentFunction and
environmentFunction will always be called in this "reduced" way before being
called in the "normal" way.
Episodes can be generated by calling rli.episode(maxSteps) as above or,
alternatively (and necessarily for continuing problems), segments of an episode
can be generated by calling rli.steps(numSteps), which returns the sequence of
experience on the next numSteps steps. For example, suppose rli is a freshly
made RLinterface and we run it for a single step, then for one more step, and
then for two steps after that:

rli.steps(1) ==> s0, a0
rli.steps(1) ==> r1, s1, a1
rli.steps(2) ==> r2, s2, a2, r3, s3, a3
Each call to rli.steps continues the current episode. To start a new episode,
call rli.episode(1), which returns the same result as the first line above.
Note that if rli.steps(numSteps) is called on an episodic problem it will run
for numSteps steps even if episodes terminate and start along the way. Thus,
for example,

rli.episode(1) ==> s0, a0
rli.steps(4) ==> r1, s1, a1, r2, 'terminal', s0, a0, r1, s1, a1
The method rli.episodes(numEpisodes, maxStepsPerEpisode, maxStepsTotal) is also
provided for efficiently running multiple episodes.

Here we do Q-learning with a random policy, presuming an MDP with N states and M actions.
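
A minimal sketch of what such an example could look like is given below. The
function names myAgent and myEnv, the chain-walk dynamics, and the constants N,
M, alpha, and gamma are illustrative assumptions, not part of the RLI library;
the agent learns Q-values while behaving with a purely random policy:

from random import randrange
from RLinterface import RLinterface

N, M = 10, 2                          # number of states and actions (assumed)
alpha, gamma = 0.1, 0.9               # step size and discount rate (assumed)
Q = [[0.0] * M for i in range(N)]     # Q-value table, Q[state][action]

def myAgent(s, r=None):               # random policy with Q-learning updates
    global lastS, lastA
    if r != None:                     # normal call: learn from the last step
        target = r if s == 'terminal' else r + gamma * max(Q[s])
        Q[lastS][lastA] += alpha * (target - Q[lastS][lastA])
    if s == 'terminal':               # no action is needed after termination
        return None
    a = randrange(M)                  # choose an action at random
    lastS, lastA = s, a               # remember for the next learning update
    return a

def myEnv(a=None):                    # toy chain-walk MDP (assumed dynamics)
    global state
    if a == None:                     # reduced call: start a new episode
        state = 0
        return state                  # initial sensation only, no reward
    state = max(0, state - 1) if a == 0 else state + 1
    if state == N - 1:                # reaching the right end ends the episode
        return 'terminal', 1.0
    return state, 0.0

rli = RLinterface(myAgent, myEnv)
rli.episodes(100, 1000)               # 100 episodes, at most 1000 steps each
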
Here are the details for calling the RLinterface methods introduced above:

RLinterface(agentFunction, environmentFunction)
Creates a new RLinterface object using the given agent and environment
functions.

agentFunction(s [, r])
Your agent function should have code of the form:

def agentFunction(s, r=None):
    if r == None:                         # start of episode
        return a0                         # return initial action
    else:                                 # learn from previous action
        learn with s and r (and previously saved info)
        a = choose next action
        return a                          # return next action
Where the first move is chosen the same as other moves, the code will look like
this:

def agentFunction(s, r=None):
    if r != None:                         # learn from previous action
        learn with s and r (and previously saved info)
    a = choose next action
    return a                              # return next action
environmentFunction([a])
Your environment function should have code of the form:

def environmentFunction(a=None):
    if a == None:                         # start of episode
        return s0                         # return initial sensation
    else:
        s, r = do action a
        return s, r                       # return next sensation and reward
The object created by RLinterface has the following methods:

step()
Runs the interface for one step.

steps(numSteps)
stepsQ(numSteps)
Run the interface for numSteps steps. If steps is used, it will return a list
of the sensations, actions and rewards in the simulation. If this is not
wanted, use stepsQ instead (the quicker and quieter version).

episode([maxSteps])
episodeQ([maxSteps])
Run one episode. If episode is used, it will return a list of the sensations,
actions and rewards in the episode. If this is not wanted, use episodeQ instead
(the quicker and quieter version). If maxSteps is specified, the simulation
will stop after that many steps even if the end of the episode hasn't been
reached.

episodes(numEpisodes [, maxSteps, maxStepsTotal])
episodesQ(numEpisodes [, maxSteps, maxStepsTotal])
Run numEpisodes episodes. If episodes is used, it will return a list of the
sensations, actions and rewards in the episodes. If this is not wanted, use
episodesQ instead (the quicker and quieter version). If maxSteps is specified,
it indicates the maximum number of steps allowed for each episode. If
maxStepsTotal is specified, it limits the number of steps for all of the
episodes together (regardless of whether an episode has finished, or the
specified number of episodes have run).
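
As a usage sketch (reusing the assumed myAgent and myEnv from the example
above, not actual library output), the list-returning methods and their Q
variants could be called like this:

rli = RLinterface(myAgent, myEnv)
trace = rli.episode(1000)         # returns s0, a0, r1, s1, a1, ..., rT, 'terminal'
rli.episodeQ(1000)                # same run, but no list is built or returned
rli.episodesQ(50, 1000, 20000)    # 50 episodes, at most 1000 steps each, 20000 in all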