| Symbol | Meaning |
|---|---|
| $t$ | discrete time step |
| $T$ | final time step of an episode |
| $s_t$ | state at $t$ |
| $a_t$ | action at $t$ |
| $r_t$ | reward at $t$, dependent, like $s_t$, on $a_{t-1}$ and $s_{t-1}$ |
| $R_t$ | return (cumulative discounted reward) following $t$ |
| $R_t^{(n)}$ | $n$-step return (Section 7.1) |
| $R_t^{\lambda}$ | $\lambda$-return (Section 7.2) |
| $\pi$ | policy |
| $\pi(s)$ | action taken in state $s$ under deterministic policy $\pi$ |
| $\pi(s,a)$ | probability of taking action $a$ in state $s$ under stochastic policy $\pi$ |
| $\mathcal{S}$ | set of all nonterminal states |
| $\mathcal{S}^{+}$ | set of all states, including the terminal state |
| $\mathcal{A}(s)$ | set of actions possible in state $s$ |
| $\mathcal{P}_{ss'}^{a}$ | probability of transition from state $s$ to state $s'$ under action $a$ |
| $\mathcal{R}_{ss'}^{a}$ | expected immediate reward on transition from $s$ to $s'$ under action $a$ |
| $V^{\pi}(s)$ | value of state $s$ under policy $\pi$ (expected return) |
| $V^{*}(s)$ | value of state $s$ under the optimal policy |
| $V$, $V_t$ | estimates of $V^{\pi}$ or $V^{*}$ |
| $Q^{\pi}(s,a)$ | value of taking action $a$ in state $s$ under policy $\pi$ |
| $Q^{*}(s,a)$ | value of taking action $a$ in state $s$ under the optimal policy |
| $Q$, $Q_t$ | estimates of $Q^{\pi}$ or $Q^{*}$ |
| $\delta_t$ | temporal-difference error at $t$ |
| $e_t(s)$ | eligibility trace for state $s$ at $t$ |
| $e_t(s,a)$ | eligibility trace for a state-action pair |
| $\gamma$ | discount-rate parameter |
| $\alpha$, $\beta$ | step-size parameters |
| $\lambda$ | decay-rate parameter for eligibility traces |
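
As a compact reminder of how several of these symbols fit together, the display below gives the discounted return, the one-step temporal-difference error, and the accumulating eligibility-trace update. This is a sketch assuming the discounted-return and accumulating-trace conventions referenced in the sections cited above:

$$
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),
$$

$$
e_t(s) =
\begin{cases}
\gamma \lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t,\\[2pt]
\gamma \lambda\, e_{t-1}(s) & \text{otherwise.}
\end{cases}
$$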