Reinforcement Learning and Artificial Intelligence (RLAI)
Research diary entries
As part of studying for this course you will compile a log or diary of your thoughts and questions on the readings. For each reading, you will make an entry in the diary, submit it by email, and bring a hardcopy to class (for distributed marking). Each entry should be readable English prose (not notes) and should be a few paragraphs -- say half a page or a page long. The bottom line is that it should show that you read the chapter and then went one step further and actually thought about the content in some meaningful way.
In the body of the email message
you should write your thoughts on the contents of the chapter.
What did you learn from it? What did you like or not like?
What more does it make you wonder about? As a first step, you may
find it useful to briefly summarize in your own words what the chapter
was about. As a second and third step, you might write down two
"thought questions" about the overall content of the
reading. A thought question is a way of helping you to think
about what
you have read and to react to it in some meaningful way. Your questions
can be general, even vague, but they should be heartfelt. For
each question, list at least two possible answers or kinds of
answers. After that, say what answer you think is best, and why.
Grading: maximum of 4 points per entry
    2 points for turning something in on time and in roughly the right format
    1 point if the entry makes it clear that you read the chapter
    1 point if the entry makes it clear that you thought about the content of the chapter
Entries are due by class time of the lecture where the chapter is covered. Email them to sutton@cs.ualberta.ca with the subject line "RL RDE for chapter X", and bring a hardcopy to class.
Examples of thought questions
Here are two questions of
the right form, but which don't show any real
engagement with the content of the reading. You would get only 2
points for an RDE of the right form that ended with these.
1. Is reinforcement learning really applicable to more than a narrow range of topics?
    A) Yes
    B) No

2. Do we need to know about policy gradient search for the midterm?
    A) Yes
    B) No
    C) I'm not telling
Below are some 4-point
questions/answers. Some of these are quite
long, but that is not essential or even desirable. They just have
to show you have put some thought into the chapter, preferably about
its whole, overall content as opposed to asking particularly about one
small part of it.
1. How could curiosity fit in the reinforcement learning framework?
    A) The concept of curiosity could be applied to the exploration policy (as a meta-layer, somehow).
    B) A "curiosity satisfaction" number could be added to the reward signal from the environment to encourage an agent to learn about new things.
2. When you are formulating an application as a reinforcement learning problem, what is a good way to inject domain knowledge?
    A) Constrain the action set in certain states (don't allow bad moves; see the sketch below).
    B) Add reward to behaviors that you know (heuristically) are good.
    C) Change the initial value function.
    D) Describe the state space with added parameters representing knowledge about the domain.
    E) Use a model for extra data.
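As one concrete illustration of answer A, here is a minimal sketch of constraining the action set with domain knowledge. It is not from the chapter: the grid-world state, the action names, and the is_obviously_bad() test are made-up assumptions used only to show the shape of the idea.

    # A minimal, hypothetical sketch of answer A: inject domain knowledge by
    # restricting which actions an agent may even consider in each state.
    # The grid-world state and the is_obviously_bad() rule are illustrative.

    ALL_ACTIONS = ["up", "down", "left", "right"]

    def is_obviously_bad(state, action):
        # Domain knowledge goes here, e.g. "never move off the grid".
        # For illustration, forbid moving "up" from the top row.
        row, col = state
        return action == "up" and row == 0

    def legal_actions(state):
        """Return only the actions the domain knowledge permits in this state."""
        return [a for a in ALL_ACTIONS if not is_obviously_bad(state, a)]

    # An epsilon-greedy agent would then choose among legal_actions(state)
    # instead of ALL_ACTIONS, so known-bad moves are never considered.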
1. In section X, why do we do only a one-step backup of the value function? (See the sketch below.)
    A) No particular reason, we'll get to other options.
    B) The improvement in the error of the value function is most significant for the first update. With discounted reward, the improvement decreases on each step and it is more time efficient to focus on the most important updates.
    C) If there is a stochastic reward signal and the update is a step in the wrong direction, or if there was significant error in the value in the first place and the update only took a small step in the right direction, the benefit of propagating the new value back is not worth the work.
    D) In certain cases, it would be worth propagating it back, like when data is sparse. Propagating the changes back on each step is not as clear as repeating the trajectory.
    E) It requires a transition model (like for Dynamic Programming). It doesn't make sense to just use the trajectory experienced.
    F) Each step would take an increasing amount of computation, and we only have a time step's worth of computation resources. But this could be addressed by putting a limit on the number of updates per time step.
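For reference, here is a minimal sketch of what a one-step backup looks like as a tabular TD(0) update. It only illustrates the operation the question is about; the state names, reward, and parameter values are made-up assumptions, not anything from the text.

    # A minimal sketch of a one-step TD(0) backup on a tabular value function.
    # The states, reward, and parameter values are made up for illustration.
    from collections import defaultdict

    V = defaultdict(float)   # state-value estimates, initialized to zero
    alpha = 0.1              # step size
    gamma = 0.9              # discount rate

    def one_step_backup(s, r, s_next):
        """Update V[s] from a single observed transition (s, r, s_next)."""
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error

    # A single observed transition from state "A" to state "B" with reward 1:
    one_step_backup("A", 1.0, "B")
    # Only V["A"] changes; states visited earlier in the trajectory are not
    # re-updated, which is exactly the choice the answers above are weighing.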
2. In the elevator control problem, the elevators were treated as completely independent agents and each elevator controller did not consider the behavior of the other elevators. Doesn't that lead to problems like more than one elevator rushing to pick up the same person?
    A) Yes, but the simplification made the elevator controllers better, which balanced out occasional wrong decisions. Multiple elevators picking up the same person is not as serious a problem as a person not being picked up, and the value function reflects this.
    B) Yes, but it happens rarely enough that it doesn't matter.
    C) No. Other constraints added to the problem (not allowing an elevator to stop when another elevator was stopped) prevented this from happening.
    D) No. Having a shared reward signal adds enough interaction, and the elevators, selecting actions independently, end up working together.