CMPUT 607: Reinforcement Learning in Practice
Rich Sutton Jan 15 2005
The ambition of this web page is to be the official, central site for information, software, and handouts for the above-named course at the University of Alberta.  This course was taught in the winter/fall of 2005.

If you are taking the course in any capacity, please subscribe to this web page by clicking the "subscribe" link at the bottom of the page (unless you are already subscribed).  Then you will be kept apprised of announcements related to the course.

You can add comments or questions to all web pages with an "Extend" link at the bottom of the page.  If you click the notify box you might get a timely response.

Programming Problem 1: Ship Steering

Write an agent to solve the ship steering problem.  The code for this is here: Ship5, Mar 22.  You may find the module manualagent.py useful for experimentation.  You should make a new version of shipagent.py (which is now just a slightly modified version of manualagent.py).  It probably wouldn't help you much, but please do not look at shipenv.py.  You run the program with shipbenchmark.py.

This code fragment gives the main I/O variables (actions and sensations):

def init():
    "return ranges for actions and sensations, in specification language 2"
    # actions:[thrust,  rudder], sens:[h,           hdot,        v]
    return 2, [(-1,+2), (-90,+90)],   [(-180,+180), (-180,+180), (-10,+40)]
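
As a rough illustration of how these ranges might be used, the sketch below reads the action ranges returned by init() and either samples an action inside them or clips an action back into them. The helpers random_action and clip_action are hypothetical names and are not part of the Ship5 code. Treating each (low, high) pair as a continuous interval is an assumption about specification language 2, not something stated above.

```python
import random

def init():
    "return ranges for actions and sensations, in specification language 2"
    # actions:[thrust,  rudder], sens:[h,           hdot,        v]
    return 2, [(-1, +2), (-90, +90)], [(-180, +180), (-180, +180), (-10, +40)]

def random_action(action_ranges, rng=random):
    """Hypothetical helper: sample one value uniformly from each (low, high) range."""
    return [rng.uniform(low, high) for low, high in action_ranges]

def clip_action(action, action_ranges):
    """Hypothetical helper: force each action component back into its range."""
    return [min(max(a, low), high) for a, (low, high) in zip(action, action_ranges)]

# The first returned value is presumably the specification-language version.
_, action_ranges, sensation_ranges = init()
print(random_action(action_ranges))
print(clip_action([5, -200], action_ranges))
```

A random agent of this kind is only a baseline for checking that the benchmark runs; a learning agent would replace random_action with its own policy.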

Results of RL Entrepreneur Rounds 1 and 2:

    AI Inc
        Brian Tanner CEO
            Nathan $2m
            Rich     $1m
            Jonas  $.875m
            Sverrir $.75m
            David   $.75m
            Alborz  $.5625m
            Cosmin $.25m
        Total capitalization: $5.8125 million

    Fruit Machine Inc
        David Silver CEO
            Aloak  $1.5m
            Miloje  $1.25m
            Eddie  $1.25m
            Rich    $1m
            Jun     $.25m
        Total capitalization: $5.25 million

    Alife Inc
        Cosmin Paduraru CEO
            Brian     $1.25m
            David     $1m
            Adam     $.75m
            Jun         $.75m
            Katsunari $.25m
            Sverrir      $.25m
        Total capitalization: $4.25 million

    Learning for Learning (L4L) Enterprises
        Nathan Sturtevant CEO
            Alborz  $1m
            Adam   $1m
            Cosmin $.75m
            Kevin    $.5m
        Total capitalization: $3.25 million

Rich:         AI 500+250, Fruit 500+500
Alborz:       L4L 500+500, AI 500+62.5
Miloje:       Fruit 500+500, Fruit 250
Cosmin:       AI 250, L4L 500+250
Jun:          Fruit 250, Alife 500+250
Kevin:        L4L 250, L4L 250
Katsunari:    Alife 250, Fruit 250
David:        AI 500+125, Alife 500+500
Sverrir:      AI 500+250, Alife 250
Jonas:        AI 500+125, AI 250
Eddie:        Fruit 500+125, Fruit 500+125
Aloak:        Fruit 500+250, Fruit 500+250
Brian:        Alife 500+500, Alife 250
Nathan:       AI 500+500, AI 500+500
Adam:         Alife 500+250, L4L 500+500

NSERC Matching Funds Table

Direct Investment    Addn'l Match    Match       Total w/match
$0.5m                                $0.5m       $1m
$1.0m                $0.5m           $1m         $2m
$1.5m                $0.25m          $1.25m      $2.75m
$2.0m                $0.25m          $1.5m       $3.5m
$2.5m                $0.125m         $1.625m     $4.125m
$3.0m                $0.125m         $1.75m      $4.75m
$3.5m                $0.0625m        $1.8125m    $5.3125m
$4.0m                $0.0625m        $1.875m     $5.875m
All after-shakeout investments are worth half of face value, with no matching, i.e., $250K each.
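
The table appears to follow a halving pattern: each $0.5m of direct investment earns an additional match, and that additional match halves after every two increments. The sketch below reproduces the Match and Total columns under that reading. The function names are hypothetical, and the blank first-row entry is assumed to be a $0.5m match, which is consistent with the Match column.

```python
from fractions import Fraction

STEP = Fraction(1, 2)  # direct investment grows in $0.5m increments

def nserc_match(direct_millions):
    """Cumulative match in $ millions for a direct investment listed in the table."""
    steps = Fraction(direct_millions) / STEP
    if steps.denominator != 1 or not 1 <= steps <= 8:
        raise ValueError("table covers $0.5m to $4.0m in $0.5m steps")
    match = Fraction(0)
    for k in range(int(steps)):
        # increments 0 and 1 earn 0.5, increments 2 and 3 earn 0.25, and so on
        match += Fraction(1, 2) / 2 ** (k // 2)
    return match

def total_with_match(direct_millions):
    """Direct investment plus cumulative match, in $ millions."""
    return Fraction(direct_millions) + nserc_match(direct_millions)

for direct in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
    print(direct, float(nserc_match(direct)), float(total_with_match(direct)))
```

Fractions are used so that the small amounts such as $0.0625m are added exactly.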

You can find our gridworld with mines at http://www.cs.ualberta.ca/~cosmin/mines.txt, together with a map to test it on at http://www.cs.ualberta.ca/~cosmin/map.txt. We tested it a bit and it seems to work with Brian and Aloak's agent.

Sverrir and Cosmin 

In the last class I asked Rich how large we should set the memory size for tile coding. In simulated environments it is not a big deal, and as Rich said you can use as much memory as you can, but I can't do that with my AIBO. In my last experiment I had to wait 2-3 seconds each time I wanted to save the weight vector (it was around 300,000 elements). So in the end I decided to save my weight vector right after the end of each episode, but now I am facing a non-episodic task, so I think in these situations we have a trade-off between the number of collisions and the cycle time.

Rich: If I understand you right, Alborz, you have a special situation in which you occasionally want to copy and transfer the weight vector.  That process will scale with its size, which is the memory size, and so you may well want to make it much smaller.   Rich, Sun Feb 6
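
To make the trade-off discussed above concrete, here is a rough sketch of hashed tile indexing. Each tile is hashed into a weight vector of a fixed memory size, so a smaller memory means a shorter vector to save but more tiles sharing a slot. The tile layout (a 2D grid per tiling), the use of Python's built-in hash, and the function names are all illustrative assumptions; this is not the tile-coding software used in the course.

```python
def hashed_index(tile, memory_size):
    """Map a tile identifier (a tuple of ints) to a slot in the weight vector."""
    return hash(tile) % memory_size

def count_collisions(num_tilings, tiles_per_dim, memory_size):
    """Count tiles that land in a slot already used by another tile."""
    tiles = [(t, x, y)
             for t in range(num_tilings)
             for x in range(tiles_per_dim)
             for y in range(tiles_per_dim)]
    used_slots = {hashed_index(tile, memory_size) for tile in tiles}
    return len(tiles) - len(used_slots)

if __name__ == "__main__":
    # 8 tilings of a 20x20 grid give 3200 distinct tiles.
    for memory_size in (500, 5000, 300000):
        print(memory_size, count_collisions(8, 20, memory_size))
```

Running this for a few memory sizes gives a feel for how many collisions a given size costs, which can then be weighed against the time needed to copy a vector of that size.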

Remember to send in thoughtful questions or comments on tomorrow's readings (IDBD and K1, say 2 questions each).

If you are having trouble thinking up questions, here are a couple to get you started:

1. Could IDBD/K1 be used with a TD(lambda) linear learner rather than an LMS linear learner?

2. Could the meta-gradient idea be used for other generalization parameters beyond alpha? How about lambda, or the width of the tiles or basis functions in a tile-coding or RBF function approximator? How about for selecting among the hidden units in a random representation?

3. What is the relationship of K1 to LSTD(lambda)?

BTW, you can't use these questions as yours. But you could use thoughtful answers, even provisional answers, as your thoughtful comments.


P.S. Thanks to whoever fixed the link to the K1 paper.

Adam has made a MATLAB version of the quick, supervised-learning "throwaway" code. It is available at http://www.cs.ualberta.ca/~sutton/rlip/code.zip. Thanks Adam!

Here is the average-reward proof: http://www.cs.ualberta.ca/%7Esutton/rlip/average-reward-proof.pdf 

For Tuesday, March 29: Read the paper "Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System" at http://www.eecs.umich.edu/~baveja/Papers/RLDSjair.pdf

