The Party Problem
Created by Rich Sutton, January 9, 2006
This page describes a Markov decision process (MDP) based on life as a student and the decisions one must make both to have a good time and to remain in good academic standing. This MDP is used as an example and in some homework exercises for CMPUT 499/609 (a course at the University of Alberta).
[State-transition diagram of the Party Problem MDP]
States:

R = Rested
T = Tired
D = homework Done
U = homework Undone
8p = eight o'clock pm

Actions:

P = Party
R = Rest
S = Study
any = any action (every available action has the same effect)
Note that not all actions are possible in all states.
Red numbers are rewards
Green numbers are transition probabilities (all those not labeled are probability 1.0)
The gray rectangle denotes a terminal state.
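
For the assignments below, it is convenient to encode the diagram as a table mapping each (state, action) pair to its possible outcomes. The Python sketch below shows one such encoding; the state names, rewards, and probabilities in it are placeholders only and must be replaced with the actual values (red and green numbers) read from the diagram.

    # One possible encoding of the diagram: each (state, action) pair maps to a
    # list of (probability, next_state, reward) outcomes.  All names and numbers
    # below are PLACEHOLDERS; replace them with the values from the diagram.
    MDP = {
        ("RU8p", "P"): [(1.0, "TU10p", +2.0)],   # placeholder: Party while Rested/Undone at 8pm
        ("RU8p", "R"): [(1.0, "RU10p",  0.0)],   # placeholder
        ("RU8p", "S"): [(1.0, "RD10p", -1.0)],   # placeholder
        # ... remaining (state, action) pairs from the diagram ...
    }
    TERMINAL = "end"   # the gray rectangle in the diagram (placeholder name)

    def actions(state):
        """Actions available in a state (not every action is possible everywhere)."""
        return sorted({a for (s, a) in MDP if s == state})
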
Party Problem Programming Assignment #1:

    Implement a program that models the Party Problem described above. Use any programming language of your choice. Assume that the agent follows an equiprobable random policy (i.e., the probability of picking a particular action in a given state is 1 divided by the number of actions that can be performed from that state). Run your program for 50 episodes. For each episode, have your program print out, in a readable form, the agent's sequence of experience (the ordered sequence of states, actions, and rewards occurring in the episode) as well as the sum of the rewards received in that episode (the return with respect to the start state).
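
The following Python sketch shows one way such a simulator might look, assuming the MDP is stored as a (state, action) -> [(probability, next state, reward)] table as sketched above. The few transitions included here are placeholders, not the actual dynamics of the Party Problem.

    import random

    # Placeholder dynamics: (state, action) -> list of (probability, next_state, reward).
    # Replace these entries with the actual transitions and rewards from the diagram.
    MDP = {
        ("RU8p",  "P"): [(1.0, "TU10p", +2.0)],
        ("RU8p",  "S"): [(1.0, "end",   -1.0)],
        ("TU10p", "R"): [(0.5, "end",    0.0), (0.5, "end", +1.0)],
    }
    START_STATE = "RU8p"
    TERMINALS = {"end"}

    def actions(state):
        """Actions available in a given state."""
        return sorted({a for (s, a) in MDP if s == state})

    def step(state, action):
        """Sample (next_state, reward) from the transition distribution."""
        outcomes = MDP[(state, action)]
        roll, cumulative = random.random(), 0.0
        for prob, next_state, reward in outcomes:
            cumulative += prob
            if roll <= cumulative:
                return next_state, reward
        return outcomes[-1][1], outcomes[-1][2]   # guard against rounding error

    def run_episode():
        """One episode under the equiprobable random policy; returns trajectory and return."""
        state, trajectory, episode_return = START_STATE, [], 0.0
        while state not in TERMINALS:
            action = random.choice(actions(state))        # equiprobable random policy
            next_state, reward = step(state, action)
            trajectory.append((state, action, reward))
            episode_return += reward
            state = next_state
        return trajectory, episode_return

    if __name__ == "__main__":
        for episode in range(50):
            trajectory, episode_return = run_episode()
            sequence = " ".join(f"{s} --{a}/{r:+g}-->" for s, a, r in trajectory)
            print(f"Episode {episode + 1:2d}: {sequence} [terminal]   Return = {episode_return:+g}")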

    What to Hand In (on paper):

The second part of this assignment (below) is due one week after the first (see schedule).

 
Party Problem Programming Assignment #2:

    Implement the policy iteration algorithm (described in Figure 4.3, p. 98) to learn the optimal policy for the Party Problem described above. Set the initial policy to "Rock & Roll all night and Party every day" (i.e., the policy should choose to party regardless of what state the agent is in). Perform each policy evaluation step until the largest change in any state's value (capital delta in Figure 4.3) is smaller than 0.001 (theta in Figure 4.3). Print out the policy and the value function for each iteration (policy change) of the algorithm in a readable form.
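
Below is a sketch of how policy iteration in the style of Figure 4.3 might be organized in Python. The dynamics are again placeholders to be replaced with the values from the diagram, and the discount rate GAMMA is set to 1 here as an assumption, since the assignment does not specify one.

    # Policy iteration in the style of Figure 4.3, for a tabular MDP encoded as
    # (state, action) -> list of (probability, next_state, reward).
    # The dynamics below are PLACEHOLDERS; substitute the actual Party Problem
    # transitions and rewards from the diagram.
    MDP = {
        ("RU8p",  "P"): [(1.0, "TU10p", +2.0)],
        ("RU8p",  "S"): [(1.0, "TU10p", -1.0)],
        ("TU10p", "P"): [(1.0, "end",   +2.0)],
        ("TU10p", "R"): [(0.5, "end",    0.0), (0.5, "end", +3.0)],
    }
    STATES = sorted({s for (s, _) in MDP})
    GAMMA = 1.0      # assumption: undiscounted episodic task
    THETA = 0.001    # evaluation stopping threshold required by the assignment

    def actions(state):
        return sorted({a for (s, a) in MDP if s == state})

    def lookahead(state, action, V):
        """Expected one-step return: sum over outcomes of p * (r + GAMMA * V(next))."""
        return sum(p * (r + GAMMA * V.get(next_state, 0.0))
                   for p, next_state, r in MDP[(state, action)])

    def policy_iteration():
        # Initial policy: "Party every day" -- choose P wherever it is available.
        policy = {s: ("P" if "P" in actions(s) else actions(s)[0]) for s in STATES}
        V = {s: 0.0 for s in STATES}
        iteration, stable = 0, False
        while not stable:
            # Policy evaluation: sweep until the largest change (delta) drops below THETA.
            while True:
                delta = 0.0
                for s in STATES:
                    old_value = V[s]
                    V[s] = lookahead(s, policy[s], V)
                    delta = max(delta, abs(old_value - V[s]))
                if delta < THETA:
                    break
            # Policy improvement: make the policy greedy with respect to V.
            stable = True
            for s in STATES:
                best_action = max(actions(s), key=lambda a: lookahead(s, a, V))
                if best_action != policy[s]:
                    policy[s] = best_action
                    stable = False
            iteration += 1
            print(f"Iteration {iteration}:")
            for s in STATES:
                print(f"  {s}: action = {policy[s]}, V = {V[s]:.3f}")
        return policy, V

    if __name__ == "__main__":
        policy_iteration()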

    What to Hand In (on paper):
