Action-value methods for our n-armed bandit problem were first proposed by Thathachar and Sastry (1985). These are often called estimator algorithms in the learning automata literature. The term action value is due to Watkins (1989). The first to use ε-greedy methods may also have been Watkins (1989, p. 187), but the idea is so simple that some earlier use seems likely.