Comprehensive midterm exam covering chapters 2-8. Review last year’s exam in the dropbox; see the note there about bringing one page of your notes into the exam. Review w1-w5.

The successful exam taker:
- Should know the meaning and interrelationships of rewards, returns, values, and policies
- Should know the three great solution methods, DP, MC, and TD: their essences and differences, their major relative strengths and weaknesses, and their unification by n-step methods (Chap 7) and sample-based planning methods (Chap 8)
- Should know how to go back and forth between backup diagrams, algorithm names, and their equations
- Should know the 4 value functions and their:
  - definitions, Bellman equations, and backup diagrams (see the equations after this list)
  - DP methods for computing them
  - TD methods for learning them
  - MC methods for learning them
- Should be able to work an example MDP or MRP and compute values and optimal policies
- Should know what the environment dynamics function p(s',r|s,a) is
- Should know what a bandit problem is
- Should know what a finite MDP is
- Should know about GPI and how it leads to optimality -- policy iteration, the dance of policy and value, pi and q_pi (see the policy-iteration sketch after this list)
  - How to improve a policy from its value function
  - How to evaluate a given policy
  - How rewards express our objectives
  - How to convert a verbal description to an MDP
- Should know the meaning of the step-size and discount-rate parameters
- Should know the meaning and impact of the parameter n in n-step methods (see the n-step return sketch after this list)
- Should know about the explore/exploit dilemma
  - e.g., in action selection (e.g., bandits; see the epsilon-greedy sketch after this list)
  - how it leads to on-policy and off-policy strategies for learning solutions to MDPs
- Should know the senses in which learning and planning are interchangeable
- Should know the difference between learning and planning
- Should know about bootstrapping and the Markov property
  - how it is powerful
  - how it is limited
- Should know how samples can be averaged incrementally (see the incremental-average sketch after this list)
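
For reference on the value-function item, one standard way to write the Bellman equations for the four value functions, in terms of the dynamics function p(s',r|s,a) named above, with gamma the discount rate:

```latex
v_\pi(s)   = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')\Bigr]
v_*(s)     = \max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_*(s')\bigr]
q_*(s,a)   = \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma \max_{a'} q_*(s',a')\bigr]
```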
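
For the GPI item, a minimal policy-iteration sketch in Python on a made-up two-state, two-action MDP (the states, rewards, transition probabilities, and gamma below are assumptions for illustration, not a course example). It stores p(s',r|s,a) explicitly, evaluates the current policy with repeated Bellman expectation backups, improves it greedily, and stops when the policy is stable.

```python
# Policy iteration on a hypothetical 2-state, 2-action MDP.
# Dynamics p(s', r | s, a) stored as {(s, a): [(prob, s_next, reward), ...]}.
# All numbers here are made up for illustration.

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

P = {
    (0, 0): [(1.0, 0, 0.0)],                 # stay in state 0, reward 0
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],  # usually move to state 1, reward 1
    (1, 0): [(1.0, 0, 2.0)],                 # return to state 0, reward 2
    (1, 1): [(1.0, 1, 0.0)],                 # stay in state 1, reward 0
}

def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation: sweep the Bellman expectation backup to convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            a = policy[s]
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def greedy(V):
    """Policy improvement: act greedily with respect to one-step lookahead q-values."""
    policy = {}
    for s in STATES:
        q = {a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)]) for a in ACTIONS}
        policy[s] = max(q, key=q.get)
    return policy

policy = {s: 0 for s in STATES}          # start from an arbitrary policy
while True:
    V = evaluate(policy)                 # policy evaluation
    improved = greedy(V)                 # policy improvement
    if improved == policy:               # stable policy => optimal for this finite MDP
        break
    policy = improved

print("optimal policy:", policy, "values:", {s: round(v, 3) for s, v in V.items()})
```

Because the dynamics dictionary is available, this sketch is planning (DP); TD and MC methods would estimate the same values from sampled experience without access to P.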
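
For the n-step item, a small sketch of the n-step return on a stored episodic trajectory (the helper name and the numbers are made up): with n = 1 it reduces to the TD(0) target, and once n reaches past the end of the episode it is the full Monte Carlo return, which is the sense in which n-step methods unify TD and MC.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: sum up to n discounted rewards, then bootstrap
    from V(S_{t+n}) if the episode has not ended by then. rewards[k] is the reward
    received on the transition out of states[k] (i.e., R_{k+1})."""
    T = len(rewards)                      # terminal time: number of rewards in the episode
    G = 0.0
    for k in range(t, min(t + n, T)):
        G += gamma ** (k - t) * rewards[k]
    if t + n < T:                         # episode still running at t+n: bootstrap
        G += gamma ** n * values[t + n]
    return G

# toy usage with made-up numbers
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
values  = [0.5, 0.4, 0.6, 0.3, 0.1, 0.0]    # estimates V(S_0)..V(S_5); S_5 is terminal
print(n_step_return(rewards, values, t=0, n=2, gamma=0.9))   # bootstraps from V(S_2)
print(n_step_return(rewards, values, t=0, n=10, gamma=0.9))  # full Monte Carlo return
```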
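
For the bandit / explore-exploit item, an epsilon-greedy sketch on a hypothetical 4-armed bandit (the arm means, epsilon, and step count are assumptions). Exploiting picks the arm with the highest current estimate; exploring occasionally picks an arm at random so the estimates of all arms keep improving.

```python
import random

TRUE_MEANS = [0.2, 0.5, -0.1, 0.8]   # made-up true values of the arms
EPSILON = 0.1
STEPS = 5000

k = len(TRUE_MEANS)
Q = [0.0] * k        # sample-average estimates of each arm's value
N = [0] * k          # how many times each arm has been pulled

for _ in range(STEPS):
    if random.random() < EPSILON:
        a = random.randrange(k)                  # explore: random arm
    else:
        a = max(range(k), key=lambda i: Q[i])    # exploit: greedy arm
    reward = random.gauss(TRUE_MEANS[a], 1.0)    # noisy reward from the chosen arm
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]               # incremental sample-average update
    
print("estimates:", [round(q, 2) for q in Q], "pulls:", N)
```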
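
For the last item, a tiny check with arbitrary numbers that the incremental update Q <- Q + (R - Q)/n reproduces the ordinary sample average without storing past rewards; replacing 1/n with a constant step-size alpha gives a recency-weighted average instead, which connects to the step-size item above.

```python
samples = [2.0, 4.0, 9.0, 1.0]   # arbitrary rewards

Q, n = 0.0, 0
for r in samples:
    n += 1
    Q += (r - Q) / n             # incremental mean: no need to store past samples

assert abs(Q - sum(samples) / len(samples)) < 1e-12
print(Q)                         # 4.0, the ordinary sample average
```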