Reinforcement Learning and Artificial Intelligence (RLAI)
Research diary entries
As part of studying for this course you will compile a log or diary of your thoughts and questions on the readings. For each reading, you will make an entry in the diary, submit it by email, and bring a hardcopy to class (for distributed marking). Each entry should be readable English prose (not notes) and should be a few paragraphs -- say half a page or a page long. The bottom line is that it should show that you read the chapter and then went one step further and actually thought about the content in some meaningful way.
In the body of the email message
you should write your thoughts on the contents of the chapter.
What did you learn from it? What did you like or not like?
What more does it make you wonder about? As a first step, you may
find it useful to briefly summarize in your own words what the chapter
was about. As a second and third step, you might write down two
"thought questions" about the overall content of the
reading. A thought question is a way of helping you to think
about what
you have read and to react to it in some meaningful way. Your questions
can be general, even vague, but they should be heartfelt. For
each question, list at least two possible answers or kinds of
answers. After that, say what answer you think is best, and why.
Grading: maximum of 4 points per entry
    2 points for turning something in on time and in roughly the right format
    1 point if the entry makes it clear that you read the chapter
    1 point if the entry makes it clear that you thought about the content of the chapter
Entries are due by class time of the lecture where the chapter is covered. Email them to sutton@cs.ualberta.ca with the subject line "RL RDE for chapter X", and bring a hardcopy to class.
Examples of thought questions
Here are two questions of
the right form, but which don't show any real
engagement with the content of the reading. You would get only 2
points for an RDE of the right form that ended with these.
1. Is reinforcement learning really applicable to more than a narrow range of topics?
    A) Yes
    B) No

2. Do we need to know about policy gradient search for the midterm?
    A) Yes
    B) No
    C) I'm not telling
Below are some 4-point
questions/answers. Some of these are quite
long, but that is not essential or even desirable. They just have
to show you have put some thought into the chapter, preferably about
its whole, overall content as opposed to asking particularly about one
small part of it.
1. How could curiosity fit in the reinforcement learning framework?
    A) The concept of curiosity could be applied to the exploration policy (as a meta-layer, somehow).
    B) A "curiosity satisfaction" number could be added to the reward signal from the environment to encourage an agent to learn about new things.
2. When you are formulating an application as a reinforcement learning problem, what is a good way to inject domain knowledge?
    A) Constrain the action set in certain states (don't allow bad moves; see the sketch below).
    B) Add reward to behaviors that you know (heuristically) are good.
    C) Change the initial value function.
    D) Describe the state space with added parameters representing knowledge about the domain.
    E) Use a model for extra data.
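As one concrete illustration of answer A, here is a minimal sketch of constraining the action set with domain knowledge. It is not from the chapter: the grid-world state, the action names, and the is_obviously_bad() test are made-up assumptions used only to show the shape of the idea.

    # A minimal, hypothetical sketch of answer A: inject domain knowledge by
    # restricting which actions an agent may even consider in each state.
    # The grid-world state and the is_obviously_bad() rule are illustrative.

    ALL_ACTIONS = ["up", "down", "left", "right"]

    def is_obviously_bad(state, action):
        # Domain knowledge goes here, e.g. "never move off the grid".
        # For illustration, forbid moving "up" from the top row.
        row, col = state
        return action == "up" and row == 0

    def legal_actions(state):
        """Return only the actions the domain knowledge permits in this state."""
        return [a for a in ALL_ACTIONS if not is_obviously_bad(state, a)]

    # An epsilon-greedy agent would then choose among legal_actions(state)
    # instead of ALL_ACTIONS, so known-bad moves are never considered.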
1. In section X, why do we do only a one-step backup of the value function? (See the sketch below.)
    A) No particular reason, we'll get to other options.
    B) The improvement in the error of the value function is most significant for the first update. With discounted reward, the improvement decreases on each step and it is more time efficient to focus on the most important updates.
    C) If there is a stochastic reward signal and the update is a step in the wrong direction, or if there was significant error in the value in the first place and the update only took a small step in the right direction, the benefit of propagating the new value back is not worth the work.
    D) In certain cases, it would be worth propagating it back, like when data is sparse. Propagating the changes back on each step is not as clear as repeating the trajectory.
    E) It requires a transition model (like for Dynamic Programming). It doesn't make sense to just use the trajectory experienced.
    F) Each step would take an increasing amount of computation, and we only have a time step's worth of computation resources. But this could be addressed by putting a limit on the number of updates per time step.
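For reference, here is a minimal sketch of what a one-step backup looks like as a tabular TD(0) update. It only illustrates the operation the question is about; the state names, reward, and parameter values are made-up assumptions, not anything from the text.

    # A minimal sketch of a one-step TD(0) backup on a tabular value function.
    # The states, reward, and parameter values are made up for illustration.
    from collections import defaultdict

    V = defaultdict(float)   # state-value estimates, initialized to zero
    alpha = 0.1              # step size
    gamma = 0.9              # discount rate

    def one_step_backup(s, r, s_next):
        """Update V[s] from a single observed transition (s, r, s_next)."""
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error

    # A single observed transition from state "A" to state "B" with reward 1:
    one_step_backup("A", 1.0, "B")
    # Only V["A"] changes; states visited earlier in the trajectory are not
    # re-updated, which is exactly the choice the answers above are weighing.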
2. In the elevator control problem, the elevators were treated as completely independent agents and each elevator controller did not consider the behavior of the other elevators. Doesn't that lead to problems like more than one elevator rushing to pick up the same person?
    A) Yes, but the simplification made the elevator controllers better, which balanced out occasional wrong decisions. Multiple elevators picking up the same person is not as serious a problem as a person not being picked up, and the value function reflects this.
    B) Yes, but it happens rarely enough that it doesn't matter.
    C) No. Other constraints added to the problem (not allowing an elevator to stop when another elevator was stopped) prevented this from happening.
    D) No. Having a shared reward signal adds enough interaction, and the elevators, selecting actions independently, end up working together.