Reinforcement Learning and Artificial Intelligence (RLAI)
Thought questions
For each reading (or as indicated in the schedule) you should submit
two thought questions in hardcopy in class on the day they are due.
Send them by email to sutton@cs.ualberta.ca only if this is not
possible for you. (This is a change in policy.)
A thought question is a question about the overall content of a
reading. It is meant to force you to actually think about what
you have read and to react to it in some meaningful way. Your questions
can be general, even vague, but they should be heartfelt. For
each question, list at least two possible answers or kinds of
answers. Be prepared to read one of your questions and its
possible answers aloud in class. There will be a random draw to
determine which students will read their questions. If you can't attend
a class, ask another student to read your question should your name
come up in the random draw. If you don't attend and there is no
one else to read your question, then you will get zero points for that
reading's thought questions.
This year what you send in doesn't have to be in the form of
questions. The grading will be as follows (×2 for the two
questions):
- 0 if you do nothing
- 1 for sending something
- 2 for saying something about the content of the chapter
- 3 for convincing me that you read and thought about the contents of the chapter
Thought questions are due by class time of the lecture where the chapter is covered.
Points to remember:
If you have to submit by email:
- Paste questions into the body of the email (no attachments)
- Put "[thought]" in the subject line
Questions must be in before the start of class for full marks, but you must still do them even if they are late.
Give answers YOU think are correct. If you are restating the content of the textbook, you must add your own thoughts.
Brainstorm. A great thought question is the start of a research idea,
not a potential exam question. Think about the chapter as a whole as
well as its picky details. It may help to ask yourself, "What caught my
attention in this chapter? How would I begin to investigate this
interesting question?"
I will monitor this page. Extend it with questions you want to discuss.
Examples of thought questions
Here are two questions that are of the right form but don't show any real
engagement with the content of the reading. You would get 1
point for each of these.
1. Is reinforcement learning really applicable to more than a narrow range of topics?
A) Yes
B) No
2. Do we need to know about policy gradient search for the midterm?
A) Yes
B) No
C) I'm not telling
Below are some 3-point questions/answers. Some of these are quite
long, but that is not essential or even desirable. They just have
to show that you have put some thought into the chapter, preferably about
its overall content rather than about one small part of it.
1. How could curiosity fit in the reinforcement learning framework?
A) The concept of curiosity could be applied to the exploration policy (as a meta-layer, somehow).
B) A "curiosity satisfaction" number could be added
to the reward signal from the environment to encourage an agent to
learn about new things.
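(Purely as an illustration of answer B, and not something from the reading: one simple way to add a "curiosity satisfaction" number to the reward is a count-based novelty bonus that shrinks as a state becomes familiar. The class name and the 1/sqrt(count) form below are my own assumptions for the sketch.)

    import collections

    class CuriosityBonus:
        """Adds a novelty bonus to the environment reward (illustrative only)."""
        def __init__(self, scale=0.1):
            self.scale = scale
            self.visits = collections.Counter()

        def augment(self, state, env_reward):
            # The more often a state has been visited, the smaller the bonus.
            self.visits[state] += 1
            return env_reward + self.scale / (self.visits[state] ** 0.5)

    curiosity = CuriosityBonus(scale=0.5)
    print(curiosity.augment("s1", 1.0))   # first visit: large bonus
    for _ in range(9):
        curiosity.augment("s1", 1.0)
    print(curiosity.augment("s1", 1.0))   # eleventh visit: bonus has shrunk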
2. When you are formulating an application as a reinforcement learning problem, what is a good way to inject domain knowledge?
A) Constrain the action set in certain states (don't allow bad moves)
B) Add reward to behaviors that you know (heuristically) are good.
C) Change the initial value function.
D) Describe the state space with added parameters representing knowledge about the domain.
E) Use a model to generate extra data
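(Again just an illustration, this time of answer A, not from the text: domain knowledge can be injected by filtering the action set before the agent chooses. The "never undo the previous move" rule below is a made-up example of a known-bad action.)

    ALL_ACTIONS = ["up", "down", "left", "right"]
    REVERSE = {"up": "down", "down": "up", "left": "right", "right": "left"}

    def allowed_actions(state, last_action):
        """Remove actions the designer knows are bad in this state (illustrative)."""
        actions = list(ALL_ACTIONS)
        if last_action in REVERSE:
            actions.remove(REVERSE[last_action])   # don't immediately undo the last move
        return actions

    print(allowed_actions(state=None, last_action="up"))   # 'down' has been removed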
1. In section X, why do we do only a one-step backup of the value function?
A) No particular reason; we'll get to other options.
B) The improvement in the error of the value
function is most significant for the first update. With discounted
reward, the improvement decreases on each step and it is more time
efficient to focus on the most important updates.
C) If there is a stochastic reward signal and the
update is a step in the wrong direction, or if there was significant
error in the value in the first place and the update only took a small
step in the right direction, the benefit of propagating the new value
back is not worth the work.
D) In certain cases, it would be worth propagating
it back, like when data is sparse. Propagating the changes back on each
step is not as clear as repeating the trajectory.
E) It requires a transition model (like for Dynamic
Programming). It doesn't make sense to just use the trajectory
experienced.
F) Each step would take an increasing amount of
computation, and we only have a time step's worth of computation
resources. But this could be addressed by putting a limit on the number
of updates per time step.
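(For anyone who wants to see what a one-step backup looks like in code, here is a minimal tabular TD(0) sketch of my own, not from the text: only the value of the current state is updated from the single observed transition, and the change is not propagated back to earlier states.)

    def td0_backup(V, s, r, s_next, alpha=0.1, gamma=0.9):
        """One-step temporal-difference update of the value table V (illustrative)."""
        target = r + gamma * V.get(s_next, 0.0)
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
        return V

    V = {}
    V = td0_backup(V, s="A", r=1.0, s_next="B")
    print(V)   # only V["A"] changes; earlier states are untouched by a one-step backup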
2. In the elevator control problem, the elevators were treated as
completely independent agents and each elevator controller did not
consider the behavior of the other elevators. Doesn't that lead to
problems like more than one elevator rushing to pick up the same person?
A) Yes, but the simplification made the elevator
controllers better, which balanced out occasional wrong decisions.
Multiple elevators picking up the same person is not as serious a
problem as a person not being picked up, and the value function
reflects this.
B) Yes, but it happens rarely enough that it doesn't matter.
C) No. Other constraints added to the problem (not
allowing an elevator to stop when another elevator was stopped)
prevented this from happening.
D) No. Having a shared reward signal adds enough
interaction, and the elevators independently select actions that end up
working together.