RLAI open page

	Reinforcement Learning and Artificial Intelligence (RLAI)
	CMPUT 607: Reinforcement Learning in Practice

The ambition of this web page is to be the official, central site for information, software, and handouts for the above-named course at the University of Alberta. This course was taught in the winter/fall of 2005.

If you are taking the course in any capacity, please subscribe to this web page by clicking the "subscribe" link at the bottom of the page (unless you are already subscribed). Then you will be kept apprised of announcements related to the course.

You can add comments or questions to all web pages with an "Extend" link at the bottom of the page. If you click the notify box you might get a timely response.

Course description, basic info on the course
Online version of the RLAI textbook (Sutton & Barto, 1998)
Python information
New proposal for functionality of a reinforcement learning interface (version 7)
The reward hypothesis
The value function hypothesis
Graphing software in python
The RLtoolkit python package
Download Numarray for python here
Our RLbenchmark code archive (Feb 7)
A proof that the discounted RL problem degenerates to the average-reward case, when one weights the discounted values of states by the frequency with which they occur in the asymptotic stationary distribution, is here.
Reading for Thurs 1/20: Tiles reference manual. Get to it from here.

3 thought questions due yesterday (but assigned today); email before 10pm Monday 1/24

Reading for Tues 1/25: "Generalization in RL" paper (Sutton, 1996)

3 thought questions due Monday 1/24

Reading for Thurs 1/27:

original TD(lambda) paper (background)
LSTD(lambda)
3 thought questions due Wednesday 1/26

discussion leader: Cosmin

Tuesday Feb 1 we considered some new problems and some more imaginative applications of tile coding.
On Thursday Feb 3 we are going to consider RL applied to Kuhn poker, as Mike Bowling presented in MLRG on Monday. Take a look at the little handout in advance; copies are in the ML lab, CSC 2-65.
Due Feb 7, 10pm (monday). Three thoughtful questions or comments on the paper to be discussed the next day, or alternatively a brief review of that paper.
Feb 8. The paper to be discussed this day is "Online learning with random representations" by Steve Whitehead and yours truly.
Due Feb 9, 9pm (Wednesday). Proposals due for a class project. Email to Rich is fine, and references to the current literature are not required, but the proposed project should be clear and feasible. The project should pertain to reinforcement learning in practice, or "real-life" reinforcement learning, meaning the application of RL ideas to a problem that exists and is established independently of the RL ideas. The project should pertain to the match: how RL can be made applicable to the problem without losing the point of its solution. On the other hand, the problem must be a suitable one in the sense of involving online learning and decision-making. The project could be a straightforward application, perhaps a slightly idealized one (as long as it grapples with some real issue not lost in the idealization), or it could be a more systematic theoretical or empirical study of an issue in applying/adapting RL to a real-life problem. [Here's a link for discussing project proposals]
Feb 10. This class will finish discussion of the random rep'ns paper. http://www.cs.ualberta.ca/%7Esutton/papers/sutton-92b.pdf
Due Feb 14: Let's try this a little differently. Instead of questions on the papers to be discussed the next day, just comment on them in free form. Write a brief paper review if you want to. Remember, the real point is to get you thinking about what the paper means to you. Writing something can make a vague feeling about the paper much firmer in your mind, such that you may be able to better articulate it in class.
Due Feb 15. The papers to be discussed today concern automatic setting/adaptation of step-size parameters, i.e., IDBD and K1. These are for incremental supervised learning of course, but carry over particularly well to RL. There is also a set of related issues: average reward (as opposed to discounting), policy gradient methods (i.e., actor-critic), and continuous actions. This is a lot to cover, and there are not very good papers on it, so we won't finish these topics. But we will get them started.
Due Feb 17. RL Entrepreneur Game: Proposals and investments. No powerpoint needed, but please prepare a one-page precis (summary) of your proposal that addresses the four elements:
1. application idea in real world < 5 years
2. business plan
3. feasibility argument
4. technology challenge (reason for research funding)
I recommend that you explicitly identify these elements in your one-page precis. Remember this is a sales pitch. You have to convince the venture capitalist that your plan is feasible and you are serious about it. Ties are optional. Make 13 copies of your one-page precis and bring them to class (for the investors).
Due Monday Feb 28: 3 comments or questions on tomorrow's reading
Due Tuesday March 1: The reading for this day is "TD models: Modeling the world at a mixture of time scales".
Due Wednesday March 2: 3 comments or questions on tomorrow's reading
Due Thursday March 3: The reading for this day is the first part (sections 1-3, up to page 19) of "Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning".
Due Monday March 7: 3 comments or questions on tomorrow's reading
Due Tuesday March 8: The reading for this day is the second part (sections 4-, from page 19) of "Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning".
Due Wednesday March 9: 3 comments or questions on tomorrow's reading
Due Thursday March 10: The reading for this day is "Programming Robots Using Reinforcement Learning and Teaching" by Long-Ji Lin. It is not available online, but you can find copies in the Machine Learning Lab (CSC 2-65). If you hurry up, you might be lucky and get one of the copies that don't have the second page upside down ;)
Also on Thursday, March 10: Programming problem 1 (Ship Steering) will be described; it is due next class.
Due Tuesday, March 15: Agent solving Programming problem 1 (Ship Steering). Deliver a single python file defining an agent as an email attachment sent to rich@richsutton.com. Your program will be benchmarked in class on the programming problem, and people's solutions will be discussed.
Due Thursday, March 17: Another attempt at Programming problem 1 (Ship Steering). Deliver agent as above. Also, check out the Schwartz paper on R-learning for average reward.
Due Tuesday, March 22: Another attempt at Programming problem 1 (Ship Steering). Deliver agent as above. Also, check out the Williams paper on REINFORCE.
Due Wednesday, March 23: 3 thoughtful questions on tomorrow's reading, by 10pm.
Due Thursday, March 24: Read the policy gradient paper here.
Due Monday, March 28: 3 thoughtful questions on tomorrow's paper.
Due Tuesday, March 29: Read the paper on Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System.
Due Wednesday, Mar 30: Read the excerpts from Adaptive Signal Processing handed out in class (and available in the rlai lab). Three thoughtful questions on the readings due by 10pm.
Thursday, Mar 31: We will discuss the reading.
Due Monday, April 4, 10pm: Three thoughtful questions or comments, or a review, of the reading for tomorrow. Send these to bowling@cs.ualberta.ca AND to rich@richsutton.com.
Tuesday, April 5: Michael Bowling on RL in robotics. Discussion of the paper on simultaneous adversarial multi-robot learning.
Due Wednesday, April 6, 10pm: Three thoughtful questions or comments, or a review, of the reading for tomorrow. Send these to bowling@cs.ualberta.ca AND to rich@richsutton.com.
Thursday, April 7: More Michael Bowling. Discussion of the paper on autonomous helicopter flight via reinforcement learning.
Tuesday, April 12: Nathan and Adam will report on their experiences using RL in the game of Hearts.
Due Wednesday, April 13, 10 pm: Three thoughtful questions or comments, or a review, of the reading for tomorrow. Send these to rich@richsutton.com AND to David at silver@cs.ualberta.ca.
Thursday, April 14: Last class. David will lead discussion and present some of his experience and results applying RL to the game of Go. The reading is a paper by Nici Schraudolph, which can be found here.

Programming Problem 1: Ship Steering

Write an agent to solve the ship steering problem. The code for this is here: Ship5, Mar 22. You may find the module manualagent.py useful for experimentation. You should make a new version of shipagent.py (which is now just a slightly modified version of manualagent). It probably wouldn't help you much, but please do not look at shipenv.py. You run the program by shipbenchmark.py.

This code fragment gives the main I/O variables (actions and sensations):

def init():
    "return ranges for actions and sensations, in specification language 2"
    # actions: [thrust, rudder],      sens: [h, hdot, v]
    return 2, [(-1,+2), (-90,+90)],   [(-180,+180), (-180,+180), (-10,+40)]
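
As a starting point for experimentation, the ranges returned by init() can be used to draw legal random actions and to rescale sensations before feeding them to a function approximator such as a tile coder. The sketch below is only an illustration: the helper names random_action and normalize_sensation are hypothetical and not part of the course code, and the agent interface expected by shipbenchmark.py is not shown here.

```python
import random


def init():
    "return ranges for actions and sensations, in specification language 2"
    # actions: [thrust, rudder],      sens: [h, hdot, v]
    return 2, [(-1, +2), (-90, +90)], [(-180, +180), (-180, +180), (-10, +40)]


def random_action(action_ranges, rng=random):
    """Hypothetical helper: sample each action component uniformly in its range."""
    return [rng.uniform(low, high) for (low, high) in action_ranges]


def normalize_sensation(sensation, sensation_ranges):
    """Hypothetical helper: rescale each sensation component to [0, 1]."""
    return [(value - low) / (high - low)
            for value, (low, high) in zip(sensation, sensation_ranges)]


version, action_ranges, sensation_ranges = init()
thrust, rudder = random_action(action_ranges)
# A sensation at the midpoint of every range maps to 0.5 in every component.
scaled = normalize_sensation([0.0, 0.0, 15.0], sensation_ranges)
```

A uniformly random agent like this is only useful as a baseline to compare a learning agent against.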

Results of RL Entrepreneur Rounds 1 and 2:

    AI Inc
        Brian Tanner CEO
        Investors:
            Nathan $2m
            Rich     $1m
            Jonas $.875m
            Sverrir $.75m
            David   $.75m
            Alborz $.5625m
            Cosmin $.25m
        Total capitalization: $5.8125 million

    Fruit Machine Inc
        David Silver CEO
        Investors:
            Aloak $1.5m
            Miloje $1.25m
            Eddie $1.25m
            Rich    $1m
            Jun     $.25m
        Total capitalization: $5.25 million

    Alife Inc
        Cosmin Paduraru CEO
        Investors:
            Brian     $1.25m
            David     $1m
            Adam     $.75m
            Jun         $.75m
            Katsunari $.25m
            Sverrir      $.25m
        Total capitalization: $4.25 million

    Learning for Learning (L4L) Enterprises
        Nathan Sturtevant CEO
        Investors:
            Alborz $1m
            Adam   $1m
            Cosmin $.75m
            Kevin    $.5m
        Total capitalization: $3.25 million

investments:
Rich:         AI 500+250, Fruit 500+500
Alborz:       L4L 500+500, AI 500+62.5
Miloje:       Fruit 500+500, Fruit 250
Cosmin:       AI 250, L4L 500+250
Jun:          Fruit 250, Alife 500+250
Kevin:        L4L 250, L4L 250
Katsunari:    Alife 250, Fruit 250
David:        AI 500+125, Alife 500+500
Sverrir:      AI 500+250, Alife 250
Jonas:        AI 500+125, AI 250
Eddie:        Fruit 500+125, Fruit 500+125
Aloak:        Fruit 500+250, Fruit 500+250
Brian:        Alife 500+500, Alife 250
Nathan:       AI 500+500, AI 500+500
Adam:         Alife 500+250, L4L 500+500

NSERC Matching Funds Table

Direct Investment    Addn'l Match    Match       Total w/match
$0.5m                $0.5m           $0.5m       $1m
$1.0m                $0.5m           $1m         $2m
$1.5m                $0.25m          $1.25m      $2.75m
$2.0m                $0.25m          $1.5m       $3.5m
$2.5m                $0.125m         $1.625m     $4.125m
$3.0m                $0.125m         $1.75m      $4.75m
$3.5m                $0.0625m        $1.8125m    $5.3125m
$4.0m                $0.0625m        $1.875m     $5.875m
----
All after-shakeout investments are worth half of face value, with no matching, i.e., $250K each.
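
The matching table follows a simple pattern: each successive $0.5m of direct investment in a company earns an additional match that halves after every second investment. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and the halving rule is inferred from the table rather than from any stated rule.

```python
def additional_match(n):
    """Match (in $m) earned by the n-th $0.5m direct investment, n = 1, 2, ..."""
    return 0.5 / 2 ** ((n - 1) // 2)


def total_with_match(n):
    """Direct investment plus cumulative match (in $m) after n investments of $0.5m."""
    direct = 0.5 * n
    match = sum(additional_match(k) for k in range(1, n + 1))
    return direct + match


# Reproduce the rows of the table, one per $0.5m of direct investment.
for n in range(1, 9):
    print(f"${0.5 * n:.1f}m direct -> ${total_with_match(n)}m total with match")
```

For example, the investments list shows AI Inc receiving seven matched investments plus two after-shakeout investments of $250K each, and total_with_match(7) + 2 * 0.25 gives 5.8125, which agrees with its listed total capitalization.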

You can find our gridworld with mines at http://www.cs.ualberta.ca/~cosmin/mines.txt, together with a map to test it on at http://www.cs.ualberta.ca/~cosmin/map.txt. We tested it a bit and it seems to work with Brian and Aloak's agent.

Sverrir and Cosmin Anonymous, Mon Jan 31 20:33:45 2005

So in the last class I asked Rich about how large we should set the memory size for tile coding. In simulated environments it is not a big deal, and as Rich said you can use as much memory as you can, but I can't do that with my AIBO. In my last experiment I had to wait 2-3 seconds each time I wanted to save the weight vector (it was around 300000 elements). So in the end I decided to save my weight vector right after the end of the episode, but now I am facing a non-episodic task, so I think in these situations we have a trade-off between the number of collisions and the cycle time. Alborz, Fri Feb 4 10:56:15 2005

Rich: If I understand you right, Alborz, you have a special situation in which you occasionally want to copy and transfer the weight vector. That process will scale with its size, which is the memory size, and so you may well want to make it much smaller. Rich, Sun Feb 6
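
The trade-off discussed above can be illustrated with a toy tile coder. This is a minimal sketch, not the course's Tiles software: the function names are hypothetical, the input is assumed to lie in [0, 1)^2, and a plain modulo stands in for a real hash function.

```python
def tile_indices(x, y, num_tilings, tiles_per_dim, memory_size):
    """Map a point in [0, 1)^2 to one weight index per tiling.

    Each tiling is shifted by a fraction of a tile width. The flat tile
    number is folded into memory_size slots with a modulo (an assumed
    stand-in for hashing), so a small memory forces tiles to share weights.
    """
    side = tiles_per_dim + 1           # the shift can spill into one extra tile
    indices = []
    for t in range(num_tilings):
        shift = t / num_tilings        # fraction of one tile width
        col = int(x * tiles_per_dim + shift)
        row = int(y * tiles_per_dim + shift)
        flat = (t * side + row) * side + col
        indices.append(flat % memory_size)
    return indices


def count_collisions(num_tilings, tiles_per_dim, memory_size):
    """Number of tiles beyond the available slots under the modulo folding."""
    side = tiles_per_dim + 1
    total_tiles = num_tilings * side * side
    return total_tiles - min(total_tiles, memory_size)


# The weight vector has memory_size elements, so the cost of saving it
# shrinks with memory_size while the number of collisions grows.
for memory_size in (1000, 324, 100):
    print(memory_size, count_collisions(4, 8, memory_size))
```

With 4 tilings of 9 x 9 tiles there are 324 distinct tiles, so any memory size below 324 necessarily produces collisions in this sketch.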

Remember to send in thoughtful questions or comments on tomorrow's readings (IDBD and K1, say 2 questions each).

If you are having trouble thinking up questions, here are a couple to get you started:

1. Could IDBD/K1 be used with TD(lambda) linear learner rather than an LMS linear learner?

2. Could the meta-gradient idea be used for other generalization parameters beyond alpha? How about lambda, or the width of the tiles of basis functions in a tile-coding or RBF function approximator? How about for selecting among the hidden units in a random representation?

3. What is the relationship of K1 to LSTD(lambda)?

BTW, you can't use these questions as yours. But you could use thoughtful answers, even provisional answers, as your thoughtful comments.

Rich

p.s. thanks to whoever fixed the link to the K1 paper. Rich, Mon Feb 14 19:05:55 2005
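
As a concrete reference point for the IDBD questions above, here is a minimal sketch of IDBD for a linear LMS learner, with one adaptive step size per input. The class name, parameter names, default values, and the toy task are assumptions made for illustration and are not code from the course.

```python
import math
import random


class IDBD:
    """Linear LMS learner with one adaptive step size per input (IDBD)."""

    def __init__(self, num_inputs, initial_alpha=0.05, theta=0.01):
        self.theta = theta                                   # meta step size
        self.w = [0.0] * num_inputs                          # weights
        self.beta = [math.log(initial_alpha)] * num_inputs   # log step sizes
        self.h = [0.0] * num_inputs                          # traces of recent weight changes

    def alphas(self):
        """Current per-input step sizes."""
        return [math.exp(b) for b in self.beta]

    def update(self, x, target):
        """One IDBD step on input vector x; returns the error before the update."""
        delta = target - sum(wi * xi for wi, xi in zip(self.w, x))
        for i, xi in enumerate(x):
            # Meta-gradient step on the log step size, then the usual LMS step.
            self.beta[i] += self.theta * delta * xi * self.h[i]
            alpha = math.exp(self.beta[i])
            self.w[i] += alpha * delta * xi
            self.h[i] = (self.h[i] * max(0.0, 1.0 - alpha * xi * xi)
                         + alpha * delta * xi)
        return delta


# Toy task (an assumption for this sketch): only the first input is relevant.
rng = random.Random(0)
learner = IDBD(num_inputs=3)
for _ in range(2000):
    x = [rng.choice((-1.0, 1.0)) for _ in range(3)]
    learner.update(x, target=2.0 * x[0])
```

Inspecting learner.alphas() after a run like this is one way to start thinking about question 2, since the same trace-based update is what would have to be generalized to other parameters.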

Adam has made a matlab version of the quick, supervised-learning "throwaway" code. It is available at http://www.cs.ualberta.ca/~sutton/rlip/code.zip. Thanks Adam! rich, Mon Feb 14 19:51:46 2005

Here is the average-reward proof: http://www.cs.ualberta.ca/%7Esutton/rlip/average-reward-proof.pdf Anonymous, Wed Mar 16 11:56:28 2005
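
The core of the argument can be sketched in a few lines. This is a summary in standard notation, which may differ from the notation and level of detail in the linked PDF; it assumes a fixed policy with a well-defined stationary distribution.

```latex
% d_\pi: stationary state distribution under policy \pi
% v_\pi^\gamma: discounted value function, r(\pi): average reward per step
\begin{align*}
J(\pi) &= \sum_s d_\pi(s)\, v_\pi^\gamma(s) \\
       &= \sum_s d_\pi(s) \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\,
          \bigl[ r + \gamma\, v_\pi^\gamma(s') \bigr] \\
       &= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s')
          \sum_s d_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s,a) \\
       &= r(\pi) + \gamma \sum_{s'} d_\pi(s')\, v_\pi^\gamma(s')
          \qquad \text{(stationarity of } d_\pi \text{)} \\
       &= r(\pi) + \gamma\, J(\pi)
\quad\Longrightarrow\quad
J(\pi) = \frac{r(\pi)}{1-\gamma}.
\end{align*}
```

So the discount rate only rescales this objective by a constant factor and does not change the ordering of policies, which is the sense in which the discounted problem reduces to the average-reward one.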

On Tuesday, March 29: Read the paper on Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System at: http://www.eecs.umich.edu/~baveja/Papers/RLDSjair.pdf

Adam Adam, Thu Mar 24 16:44:43 2005

This open web page is hosted at the University of Alberta.