Reinforcement Learning and Artificial Intelligence (RLAI) CMPUT 609: Reinforcement Learning for Artificial Intelligence

The ambition of this web page is to be the official, central site for information, software, and handouts for the 2010 version of CMPUT 609 (a course at the University of Alberta).  There are also pointers here to slides for the course and to related courses elsewhere.

For up-to-date course materials, see Sutton's home page, or go directly here.

The marking of the written assignments will be done by Mohammad Shafiei (shafieik@cs.ualberta.ca), so you must get an electronic copy to him by class time on the day the assignment is due.
Assignments must be turned in on time because answer sheets will be handed out on the day the assignment is due.  There is a scanner in CSC B-23 that is accessible to all graduate students.  If your answers are already in pdf form, you can email them directly to Mohammad.

See the web page for the book and check out the "errata/notes" page.

Reading diary entries should be emailed to sutton@cs.ualberta.ca. He will read them and send back a cryptic marking: points out of four.  It is no longer necessary to bring a hardcopy of your entry to class on the day it is due.

Slides

Web viewable (slightly old):
Source files (powerpoint) as a tar archive (Jan 06)

A Few of the Related Courses Elsewhere

Exercise 2.5 asks for pseudocode, meaning a way of writing the code that makes the algorithmic ideas clear without presuming any particular language, and without worrying about all the details.  There should be no need to explain data structures or implementation details of that type.  For examples of the kind of pseudocode we are looking for, see that used in later chapters, such as in Figures 4.1 (page 92) or 6.9 (page 146). In addition, here is a template for the answer to this question:

Repeat for a = 1 to n:        ; initialization
...

Repeat forever:
a = ...
r = bandit(a)
...
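To make the shape of the template concrete (this is an illustration of the pseudocode style, not the official answer to Exercise 2.5), here is a minimal Python sketch of an n-armed bandit loop using epsilon-greedy action selection and incremental sample-average estimates. The Gaussian reward model and the epsilon value are assumptions made for the example only.

```python
import random

def run_bandit(n, true_values, epsilon=0.1, steps=1000):
    """Epsilon-greedy agent with sample-average estimates on an n-armed bandit."""
    Q = [0.0] * n        # value estimates, one per arm   (initialization)
    counts = [0] * n     # number of pulls per arm

    def bandit(a):
        # Hypothetical reward model: true value of arm a plus Gaussian noise.
        return random.gauss(true_values[a], 1.0)

    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n)                  # explore
        else:
            a = max(range(n), key=lambda i: Q[i])    # exploit
        r = bandit(a)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]               # incremental sample average
    return Q, counts

Q, counts = run_bandit(3, [0.0, 0.5, 1.0])
```

Your pseudocode answer should read at about this level of detail, without committing to any particular language.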

Here is the mark distribution for the exercises:

2.1:  2pts.
2.5:  8pts.
2.55: 28pts.
2.8:  2pts. (extra credit)

Be sure to answer every sub-question in every exercise; that is how the marking is done.

Questions 3.5, 3.8, & 3.11 - 4 pts. each
Question 3.9               - 5 pts.
Questions 3.10 & 3.17      - 6 pts. each

Exercise 4.1 - 6 pts.
Exercise 4.2 - 7 pts.
Exercise 4.3 - 9 pts.
Exercise 4.5 - 8 pts.
Exercise 4.9 - 4 pts.

Exercise 5.1 - 3 pts.

Exercise 5.2 - 4 pts.
Exercise 5.5 - 4 pts.

Exercise 7.2 - 3 pts.
Exercise 7.6 - 6 pts.

Exercise 8.1 - 3 pts.
Exercise 8.2 - 4 pts.
Exercise 8.6 - 3 pts.
Exercise 8.7 - 2 pts.

Exercise 9.1 - 4 pts.
Exercise 9.2 - 3 pts.
Exercise 9.3 - 3 pts.
Exercise 9.5 - 6 pts.
Exercise 9.6 - 2 pts. (Extra Credit)

A good survey paper on the neuroscience of reinforcement learning can be found here: http://dx.doi.org/10.2976/1.2732246.

Here is the problem that is due next Tuesday.

A man is standing in a queue waiting for service, with N people ahead of him. He knows the utility r of waiting out the queue and the probability p that a person will be served in unit time. On the other hand, he incurs a cost c for every unit of time spent waiting. The problem is to determine his waiting policy if he wishes to maximize his expected return.

Of course, if he quits the queue, he will not be served.

The problem is to set this up as an MDP (states, actions, rewards), write the Bellman optimality equation, and then solve it to identify the optimal policy.
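Once you have derived an answer, you may want to check it numerically. Below is a value-iteration sketch for one possible formalization of the queue problem (states 0..N counting the people ahead, with a quit option worth 0 in every state). The formalization itself is what the exercise asks you to produce, so treat this only as a sanity check, not as the official solution.

```python
def queue_values(N, r, p, c, sweeps=500):
    """Value iteration for one possible formalization of the queue problem.

    States 0..N are the number of people ahead (0 = he is at the head);
    in each state he may quit (value 0) or wait one more unit of time.
    """
    V = [0.0] * (N + 1)
    for _ in range(sweeps):
        new = V[:]
        # At the head of the queue: pay c; get served (reward r) with prob p.
        new[0] = max(0.0, -c + p * r + (1 - p) * V[0])
        for i in range(1, N + 1):
            # Waiting: pay c; the person being served finishes with prob p.
            new[i] = max(0.0, -c + p * V[i - 1] + (1 - p) * V[i])
        V = new
    return V

V = queue_values(5, r=10.0, p=0.5, c=1.0)
```

In this formalization the values decrease by c/p per person ahead until they reach zero, at which point quitting becomes optimal; check whether your analytical answer agrees.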

If you have questions, do not hesitate to ask me.

Here is the next problem, due next Thursday.

A driver is looking for parking on the way to his destination. Each parking place is free with probability p, independently of whether the other parking places are free. The driver cannot observe whether a parking place is free until he reaches it. If he parks k places from his destination, he incurs a cost k. If he reaches the destination without having parked, the cost is C. The goal of the driver is to minimize the expected cost of parking.

Note that the driver knows p, C, and how far he is from his destination. Help the driver by determining an optimal policy. For this, write the problem as an MDP, write down the Bellman optimality equation, and then find the optimal policy.
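As with the queue problem, a short dynamic-programming sketch can be used to check a derivation numerically. The recursion below corresponds to one plausible formalization (J[k] is the expected cost when about to inspect the place k spots from the destination, having not yet parked, with J[0] = C); it is a sanity check under those assumptions, not the official solution.

```python
def parking_costs(K, p, C):
    """Expected cost-to-go for one plausible formalization of the parking problem.

    J[k] = expected cost when about to inspect the place k spots from the
    destination, not yet parked; J[0] = C (arrived without parking).
    """
    J = [0.0] * (K + 1)
    J[0] = C
    for k in range(1, K + 1):
        # Place k is free with prob p; if free, park iff k beats driving on.
        J[k] = p * min(k, J[k - 1]) + (1 - p) * J[k - 1]
    return J

J = parking_costs(3, p=0.5, C=10.0)
```

Under this formalization the induced policy has a threshold form: park at the first free place k satisfying k <= J[k-1]. See whether your Bellman equation yields the same structure.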

Here is the problem that is due the coming Thursday.

Assume that we are given a normed vector space V.
Let T: V -> V be an L_T-Lipschitz operator, and let S: V -> V be an L_S-Lipschitz operator.
Prove that the composition of T and S is an (L_T L_S)-Lipschitz operator.
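For reference, here is the standard definition being assumed (the proof amounts to applying it once for S and once for T):

```latex
% An operator T : V \to V on a normed vector space V is L-Lipschitz iff
\|Tx - Ty\| \le L \, \|x - y\| \qquad \text{for all } x, y \in V.
```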