Action-value methods for our n-armed bandit problem were first proposed by
Thathachar and Sastry (1985). These are often called
estimator algorithms in the learning automata literature. The term
action value is due to Watkins (1989). The first to use
ε-greedy methods may also have been Watkins (1989, p. 187), but the idea is so
simple that some earlier use seems likely.