by Rich Sutton, after a design created in conjunction with Henry Kautz.
This software creates random Markov decision processes (MDPs) that can be used to benchmark learning and planning algorithms. The MDPs can be unstructured, with enumerated states {0,1,2,...,N-1}, or they may be structured (factored) into state variables {x1,x2,x3,...} (each of which is enumerated) with limited influences from one time step to the other, as in dynamic Bayes nets. The details of these two cases, and the routines available for each, are given below, first for unstructured MDPs and then for structured MDPs.
Support is also provided in both cases for partially observable MDPs (POMDPs). This is done simply by postprocessing the states of the MDPs to observations, as detailed below.
At the end of this page is information on obtaining and installing RandomMDPs.lisp.
To create a CLOS object of type MDP you need only provide the three integers N, M, and B, as in (make-random-MDP N M B). Once the MDP has been constructed, you can query it for a sample next state or expected reward by calling next-state or next-reward. Finally, you can evaluate a given policy with evaluate-policy. These are detailed below.
When you make an unstructured MDP, it will be of this type. Structured-MDPs inherit from (are a subclass of) MDPs.
Returns a CLOS object of type MDP with N states, M actions per state, and B successor states for each state-action pair. The B successors are selected without replacement, so B must be less than N. The expected reward for each state-action pair is selected according to a normal distribution with mean 0 and variance 1.
Returns a sample next state for the MDP when it is in state s and action a is taken. For unstructured MDPs, s must be an integer < N, whereas, for structured MDPs, s must be a list of values for the state variables.
Returns the expected reward for the MDP when it is in state s and action a is taken. For unstructured MDPs, s must be an integer < N, whereas, for structured MDPs, s must be a list of values for the state variables.
This routine is used to evaluate a given policy at start-state of the MDP, that is, to compute the expected gamma-discounted reward starting from that state. The policy is a function taking a state as its sole input and producing a list of the M probabilities with which it would take each possible action in that state. The value is estimated iteratively, stopping when the estimate is approximately within threshold of the true value. Threshold defaults to 0.0000001. Here is a complete example:
(defun policy (state) '(0.5 0.5)) ;Define the random 2-action policy : (setq MDPs (loop repeat 50 ;Make 50 MDPs, each of 100 states, collect (make-random-MDP 100 2 3))) ;2 actions, and a branching factor of 3 (loop for MDP in MDPs do ;Evaluate the policy in state 0 (print (evaluate-policy 'policy MDP .9 0))) ;for each MDP, printing the results
Returns the expected reward for the MDP when it is in state s and action a is taken.
The generic routines next-state and expected-reward can be used as described above for unstructured MDPs.
When you make a structured MDP, it will be of this type.
A random structured MDP is a CLOS object with some number of variables, {v1, v2, v3, ..., vn}, each of which is an integer in {0,1,2,Nk}. The number of values that each variable can take on (Nk) can vary from variable to variable. The first argument to make-random-structured-MDP, N-list, is a list specifying the upper limits, (N1 N2 ... Nn). This argument also specifies the number of variables (n) by its length. For example, if N-list is (2 2 2 2 2) then you are requesting a state consisting of 5 binary variables.
The second argument, M, specifies the number of actions, which is assumed to be the same in all states. An action corresponds to a set of rules, one per state variable, for determining the next value of that variable when the action is taken. The argument num-transition-rules specifies how many of these rules are nontrivial, where the trivial rule simply copies the variable's previous value. For example, you might have 10 state variables and wish a randomly selected 5 of them to be influenced by the choice of action, the others retaining their values.
The argument num-transition-antecedents specified the complexity of the rules -- how many of the state variables in the preceding time step participate in the rule. These preceding variables---antecedents---are selected at random without replacement to form the rule. An exhaustive conditional probability table is constructed based on the antecedents and filled with random probabilities for the next values of the variable (these are random partitions as described for unstructured MDPs, but see simplex-generator below).
For each action, there are also randomly constructed rules for reward components that are added together to determine the expected reward on the transition. The argument num-reward-rules specifies the number of such rules, and num-reward-antecedents specifies the number of their antecedent variables. As before, we make an exhaustive table conditioning on the values of the antecedent variables, this time populating the table with random numbers chosen according to a standard normal distribution (mean zero, unit variance). Each rule then produces one such number; these are added up to determine the reward.
The final optional argument simplex-generator provides a way to customize the partition of unity used to determine the values of the state variables. If this argument is not provided, then the partition is the uniform one (a random simplex) as described above. If a function is provided here it is used instead of this.
This routine is used to evaluate a given policy at start-state of the MDP, that is, to compute the expected gamma-discounted reward starting from that state. Gamma must be less than 1. The policy is a function taking a state as its sole input and producing a sample of the possible next states (note that this is different from the policy used in evaluate-policy). The value is estimated in a Monte Carlo fashion, that is, by generating many sample trajectories and averaging their returns. The number of sample trajectories num-samples defaults to 1000. Since trajectories can be infinite they must cut off at some length L; they are cut off when gamma^L < threshold, where threshold defaults to 0.001. The start-state defaults to the state in which all state variables are zero. Here is a complete example:
(defun policy (state) (random 2) ;Define the random 2-action policy : (setq MDPs (loop repeat 50 ;Make 50 collect (make-random-structured-MDP ;random, structured MDPs '(2 2 2 2 2 2 2 2 2 2) ;each with 10 binary variables 2 ;and 2 actions 5 ; each of which affects the value of 5 variables 3 ; as a function of 3 antecedent variables 1 ; and each of which has one reward rule 3) ; which also depends on three antecedents (loop for MDP in MDPs do ;Evaluate the policy in state '(0 0 0 0 0 0 0 0 0 0) (print (MC-evaluate-policy 'policy MDP .9))) ;for each MDP, printing the results