Reinforcement Learning and Artificial Intelligence (RLAI)
The Party Problem
This page describes a Markov decision process
based on life as a student and the decisions one must make to both have
a good time and remain in good academic standing. This MDP is
used as an example and in some homework exercises
for CMPUT 499/609 (a course at the University of Alberta).
States:
R = Rested
T = Tired
D = homework Done
U = homework Undone
8p = eight o'clock pm
Actions:
P = Party
R = Rest
S = Study
any = any action (every action has the same effect)
Note: not all actions are possible in all states.
Red numbers are rewards.
Green numbers are transition probabilities (all those not labeled have probability 1.0).
The gray rectangle denotes a terminal state.
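The full state set and transitions are given only in the diagram, but one possible way to encode the labels above in code (Python here; any language works) is sketched below. The composite state built at the end is a hypothetical example, not necessarily one of the diagram's actual states.

    # One possible encoding of the labels above. The composite state built at
    # the end is a hypothetical example, not the diagram's actual state set.
    from collections import namedtuple

    State = namedtuple("State", ["rest", "homework", "time"])

    ACTIONS = {"P": "Party", "R": "Rest", "S": "Study"}  # "any" = all actions alike

    # Hypothetical example: Rested, homework Undone, at eight o'clock pm.
    start = State(rest="R", homework="U", time="8p")
    print(start, "-> label:", "".join(start))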
Party Problem Programming Assignment #1:
Implement a program that models the Party Problem described above. Use any programming language of your choice. Assume that the agent follows a random equiprobable policy (i.e., the probability of picking a particular action in a given state is 1 divided by the number of actions that can be performed from that state). Run your program for 50 episodes. For each episode, have your program print, in a readable form, the agent's sequence of experience (i.e., the ordered sequence of states, actions, and rewards that occur in the episode) as well as the sum of the rewards received in that episode (i.e., the Return with respect to the start state).
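A minimal sketch of such a simulator in Python is given below. The transition table in it is a hypothetical placeholder: the state names, probabilities, and rewards shown must be replaced with the actual values (green and red numbers) from the diagram above.

    # Minimal sketch of the Assignment #1 simulator. The MDP table below is a
    # hypothetical placeholder -- fill it in with the actual states, actions,
    # transition probabilities (green numbers), and rewards (red numbers)
    # from the diagram above.
    import random

    # MDP[state][action] = list of (probability, next_state, reward) triples.
    MDP = {
        "RU8p": {"P": [(1.0, "TU10p", 2.0)],      # hypothetical entries
                 "R": [(1.0, "RU10p", 0.0)],
                 "S": [(1.0, "RD10p", -1.0)]},
        "TU10p": {"R": [(1.0, "END", 0.0)],
                  "S": [(1.0, "END", -1.0)]},
        "RU10p": {"P": [(1.0, "END", 2.0)],
                  "S": [(1.0, "END", -1.0)]},
        "RD10p": {"any": [(1.0, "END", 4.0)]},
    }
    TERMINAL = {"END"}        # hypothetical terminal-state label
    START = "RU8p"

    def run_episode():
        """Follow the random equiprobable policy; return (experience, Return)."""
        state, experience, ret = START, [], 0.0
        while state not in TERMINAL:
            action = random.choice(sorted(MDP[state]))       # equiprobable choice
            outcomes = MDP[state][action]
            probs = [p for p, _, _ in outcomes]
            _, next_state, reward = random.choices(outcomes, weights=probs)[0]
            experience.append((state, action, reward))
            ret += reward
            state = next_state
        return experience, ret

    returns = []
    for episode in range(1, 51):
        experience, ret = run_episode()
        print(f"Episode {episode}: {experience}  Return = {ret}")
        returns.append(ret)
    print("Average Return over 50 episodes:", sum(returns) / len(returns))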
What to Hand In (on paper):
The sequences of experience from each episode, including the
Return observed in that episode.
The values of each state (computed by hand using the Bellman
equations).
The average Return from the fifty episodes.
The source code of your program.
The second part of this assignment (below) is due one week after the first (see schedule).
Party Problem Programming Assignment #2:
Implement the policy iteration algorithm (described
in Figure 4.3, p. 98) to learn the optimal policy for the Party Problem
described above. Set the initial policy to "Rock & Roll all night
and Party every day" (i.e., the policy should choose to party regardless of
what state the agent is in). Perform each policy evaluation step until
the largest change in a state's value (capital delta in Figure 4.3) is
smaller than 0.001 (theta in Figure 4.3). Print out the policy and the
value function for each iteration (policy change) of the algorithm in a readable form.
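A minimal Python sketch of policy iteration in the style of Figure 4.3 is given below, assuming the same MDP[state][action] = [(probability, next_state, reward), ...] table as in the Assignment #1 sketch (the single entry shown is a hypothetical placeholder) and no discounting (gamma = 1), since episodes are short and terminate.

    # Minimal sketch of policy iteration for the Party Problem. The MDP entry
    # below is a hypothetical placeholder; fill the table in from the diagram.
    MDP = {
        "RU8p": {"P": [(1.0, "END", 2.0)],
                 "R": [(1.0, "END", 0.0)],
                 "S": [(1.0, "END", -1.0)]},
    }
    THETA = 0.001                      # evaluation stopping threshold (theta)

    def q(state, action, V):
        """Expected one-step return of `action` in `state` under values V (gamma = 1)."""
        return sum(p * (r + V.get(s2, 0.0)) for p, s2, r in MDP[state][action])

    # Initial policy: "Party every day" -- choose P wherever it is available
    # (fall back to an arbitrary legal action in states where P is not possible).
    policy = {s: ("P" if "P" in acts else sorted(acts)[0]) for s, acts in MDP.items()}
    V = {s: 0.0 for s in MDP}

    policy_stable, iteration = False, 0
    while not policy_stable:
        # Policy evaluation: sweep until the largest value change (delta) < theta.
        while True:
            delta = 0.0
            for s in MDP:
                v_old = V[s]
                V[s] = q(s, policy[s], V)
                delta = max(delta, abs(v_old - V[s]))
            if delta < THETA:
                break
        # Policy improvement: make the policy greedy with respect to V.
        policy_stable = True
        for s in MDP:
            best = max(sorted(MDP[s]), key=lambda a: q(s, a, V))
            if best != policy[s]:
                policy[s], policy_stable = best, False
        iteration += 1
        print(f"Iteration {iteration}: policy = {policy}  V = {V}")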
What to Hand In (on paper):
The policy and value function for each iteration of the
algorithm.