Reinforcement Learning and Artificial Intelligence (RLAI)
CMPUT 609: Reinforcement Learning for Artificial Intelligence
The aim of this web page is to be the official, central site for information, software, and handouts for the 2010 version of CMPUT 609 (a course at the University of Alberta). There are also pointers here to slides for the course and to related courses elsewhere. For up-to-date course materials, see Sutton's home page.
The marking of the written assignments will be done by Mohammad Shafiei (shafieik@cs.ualberta.ca), so you must get an electronic copy to him by class time on the day the assignment is due. Assignments must be turned in on time because answer sheets will be handed out on the day the assignment is due. There is a scanner in CSC B-23, accessible to all graduate students, that can be used for this. If your answers are already in PDF form, you can email them directly to Mohammad.
See the web page for the book and check out the "errata/notes" page.
Reading diary entries should be emailed to sutton@cs.ualberta.ca. He will read them and send back a cryptic marking: points out of four. It is no longer necessary to bring a hardcopy of your entry to class on the day it is due.
Exercise 2.5 asks for pseudocode, meaning a way of writing the code that makes the algorithmic ideas clear without presuming any particular language and without worrying about all the details. There should be no need to explain data structures or implementation details of that type. For examples of the kind of pseudocode we are looking for, see that used in later chapters, such as in Figures 4.1 (page 92) or 6.9 (page 146). In addition, here is a template for the answer to this question:
Repeat for a = 1 to n:    ; initialization
    ...
Repeat forever:
    a = ...
    r = bandit(a)
    ...
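For illustration only, the template above can be fleshed out into runnable Python. This sketch assumes ε-greedy action selection with sample-average value estimates and a hypothetical stationary Bernoulli `bandit()`; it is one way to realize the template, not the required form of the answer.

```python
import random

n = 10            # number of bandit arms (assumed)
epsilon = 0.1     # exploration rate (assumed)
steps = 10000     # "Repeat forever" truncated to a fixed number of steps

# A hypothetical bandit: each arm pays 1 with its own fixed probability.
true_p = [random.random() for _ in range(n)]
def bandit(a):
    return 1.0 if random.random() < true_p[a] else 0.0

# Repeat for a = 1 to n:  ; initialization
Q = [0.0] * n     # action-value estimates
N = [0] * n       # action counts

# Repeat forever:
for _ in range(steps):
    if random.random() < epsilon:
        a = random.randrange(n)                  # explore
    else:
        a = max(range(n), key=lambda i: Q[i])    # exploit the greedy action
    r = bandit(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                    # sample-average update
```

Your pseudocode answer should stay at the level of the template; the point of this sketch is only to show that every line of the template corresponds to something concrete.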
A good survey paper on the neuroscience of reinforcement learning can be found here: http://dx.doi.org/10.2976/1.2732246.
Here is the problem that is due next Tuesday.
A man is standing in a queue waiting for service, with N people ahead of him. He knows the utility of waiting out the queue, r, and the probability p that a person will be served in unit time. On the other hand, he incurs a cost of c for every unit of time spent waiting. The problem is to determine his waiting policy if he wishes to maximize his expected return. Of course, if he quits the queue, he will not be served.
The problem is to set this up as an MDP (states, actions, rewards), write the Bellman optimality equation, and then solve it to identify the optimal policy.
If you have questions, do not hesitate to ask me.
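One way to sanity-check a candidate formulation numerically (not a substitute for the written solution): take the state to be the number of people still ahead, with actions "wait" and "quit", and iterate the Bellman optimality backup. The state encoding and the terminal conventions below are assumptions, not the official answer.

```python
def queue_values(N, r, p, c, iters=10000):
    """Value iteration for one assumed formulation of the queue problem.

    State n = number of people ahead (n = 0 means the man himself is
    next to be served).  Action 'quit' ends the episode with return 0;
    action 'wait' costs c for the unit of time, and with probability p
    the person at the front (the man himself, when n = 0) is served.
    """
    V = [0.0] * (N + 1)
    for _ in range(iters):
        newV = list(V)
        for n in range(N + 1):
            served = r if n == 0 else V[n - 1]          # who gets served next
            wait = -c + p * served + (1 - p) * V[n]     # Bellman backup for 'wait'
            newV[n] = max(0.0, wait)                    # 'quit' yields return 0
        V = newV
    return V
```

If `queue_values(N, r, p, c)[N] > 0`, waiting from the back of the queue beats quitting under this formulation.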
Here is the next problem, due next Thursday.
A driver is looking for parking on the way to his destination. Each parking place is free with probability p, independently of whether other parking places are free or not. The driver cannot observe whether a parking place is free until he reaches it. If he parks k places from his destination, he incurs a cost k. If he reaches the destination without having parked, the cost is C. The goal of the driver is to minimize the expected cost of parking.
Note that the driver knows p and C and how far he is from his destination. Help the driver by determining an optimal policy. For this, write the problem as an MDP, determine the Bellman optimality equation, and then find the optimal policy.
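As with the queue problem, a short backward recursion can be used to check a candidate Bellman equation numerically. The indexing convention below (V[0] = C at the destination, places visited in decreasing order of distance) is an assumption for illustration, not the required formulation.

```python
def parking_costs(K, p, C):
    """Backward recursion for one assumed formulation of the parking problem.

    V[k] = expected cost when the driver is about to reach the parking
    place k positions from the destination, before seeing whether it is
    free.  V[0] = C: he has reached the destination without parking.
    If place k is free (probability p), he parks there (cost k) only
    when that beats the expected cost V[k-1] of driving on.
    """
    V = [0.0] * (K + 1)
    V[0] = C
    for k in range(1, K + 1):
        V[k] = p * min(k, V[k - 1]) + (1 - p) * V[k - 1]
    return V
```

Under this formulation the optimal policy is a threshold rule: park at a free place k exactly when k <= V[k-1], which the recursion makes easy to tabulate.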
Here is the problem that is due the coming Thursday.
Assume that we are given a normed vector space V. Let T: V -> V be an L_T-Lipschitz operator, and let S: V -> V be an L_S-Lipschitz operator. Prove that the composition of T and S is an (L_T L_S)-Lipschitz operator.
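The expected argument is a two-step chain of the Lipschitz bounds; in LaTeX notation, the key inequality is (a sketch only, the write-up should spell out the definitions):

```latex
% For any x, y \in V, apply the Lipschitz property of T, then of S:
\|T(S(x)) - T(S(y))\| \le L_T \, \|S(x) - S(y)\| \le L_T L_S \, \|x - y\|,
% so T \circ S is (L_T L_S)-Lipschitz.
```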
Here is the mark distribution for the exercises:
2.1: 2 pts.
2.5: 8 pts.
2.55: 28 pts.
2.8: 2 pts. (extra credit)
Be sure to answer every sub-question in every exercise; that's how the marking is done.
Questions 3.5, 3.8, & 3.11 - 4 pts. each
Question 3.9 - 5 pts.
Questions 3.10 & 3.17 - 6 pts. each
Exercise 4.1 - 6 pts.
Exercise 4.2 - 7 pts.
Exercise 4.3 - 9 pts.
Exercise 4.5 - 8 pts.
Exercise 4.9 - 4 pts.
Exercise 5.1 - 3 pts.