$t$                       discrete time step
$T$                       final time step of an episode
$s_t$                     state at $t$
$a_t$                     action at $t$
$r_t$                     reward at $t$, dependent, like $s_t$, on $a_{t-1}$ and $s_{t-1}$
$R_t$                     return (cumulative discounted reward) following $t$
$R_t^{(n)}$               $n$-step return (Section 7.1)
$R_t^{\lambda}$           $\lambda$-return (Section 7.2)
$\pi$                     policy
$\pi(s)$                  action taken in state $s$ under deterministic policy $\pi$
$\pi(s,a)$                probability of taking action $a$ in state $s$ under stochastic policy $\pi$
$\mathcal{S}$             set of all nonterminal states
$\mathcal{S}^+$           set of all states, including the terminal state
$\mathcal{A}(s)$          set of actions possible in state $s$
$\mathcal{P}_{ss'}^{a}$   probability of transition from state $s$ to state $s'$ under action $a$
$\mathcal{R}_{ss'}^{a}$   expected immediate reward on transition from $s$ to $s'$ under action $a$
$V^{\pi}(s)$              value of state $s$ under policy $\pi$ (expected return)
$V^{*}(s)$                value of state $s$ under the optimal policy
$V$, $V_t$                estimates of $V^{\pi}$ or $V^{*}$
$Q^{\pi}(s,a)$            value of taking action $a$ in state $s$ under policy $\pi$
$Q^{*}(s,a)$              value of taking action $a$ in state $s$ under the optimal policy
$Q$, $Q_t$                estimates of $Q^{\pi}$ or $Q^{*}$
$\delta_t$                temporal-difference error at $t$
$e_t(s)$                  eligibility trace for state $s$ at $t$
$e_t(s,a)$                eligibility trace for a state-action pair
$\gamma$                  discount-rate parameter
$\alpha$, $\beta$         step-size parameters
$\lambda$                 decay-rate parameter for eligibility traces
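As a sketch of how these symbols combine, the relations below restate the standard definitions of the return, the $\lambda$-return, and the tabular TD($\lambda$) update with accumulating traces; they are given here only to illustrate the notation and are not part of the table itself.

\[
\begin{aligned}
R_t &= \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
  && \text{(return following $t$)} \\
R_t^{\lambda} &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)}
  && \text{($\lambda$-return as a mixture of $n$-step returns)} \\
\delta_t &= r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)
  && \text{(temporal-difference error)} \\
e_t(s) &=
  \begin{cases}
    \gamma \lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \\
    \gamma \lambda\, e_{t-1}(s) & \text{otherwise}
  \end{cases}
  && \text{(accumulating eligibility trace)} \\
V_{t+1}(s) &= V_t(s) + \alpha\, \delta_t\, e_t(s)
  && \text{(TD($\lambda$) value update)}
\end{aligned}
\]

Note how the three parameters enter: $\gamma$ discounts future rewards in the return, $\lambda$ controls how quickly traces decay, and $\alpha$ scales each update.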