The One-Step Trap (in AI Research)
Rich Sutton
Written up for X on July 18, 2024
The one-step trap is the common mistake of thinking that all or
most of an AI agent’s learned predictions can be one-step ones,
with all longer-term predictions generated as needed by iterating
the one-step predictions. The most important place where the trap
arises is when the one-step predictions constitute a model of the
world and of how it evolves over time. It is appealing to think
that one can learn just a one-step transition model and then “roll
it out” to predict all the longer-term consequences of a way of
behaving. The one-step model is thought of as being analogous to
physics, or to a realistic simulator.
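Concretely, the pattern looks something like this minimal sketch
(the names model, policy, and rollout are illustrative placeholders,
not a real API):

    def rollout(model, policy, s0, horizon):
        """Iterate a learned one-step model to predict horizon steps ahead."""
        s = s0
        trajectory = [s]
        for _ in range(horizon):
            a = policy(s)         # choose an action
            s = model(s, a)       # one-step prediction of the next state
            trajectory.append(s)  # each step builds on a *predicted* state
        return trajectory

Note that every prediction after the first is made from a state the
model itself predicted, which is exactly where the trouble below
begins.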
The appeal of this mistake is that it contains a grain of truth:
if all one-step predictions can be made with perfect accuracy,
then they can be used to make all longer-term predictions with
perfect accuracy. However, if the one-step predictions are not
perfectly accurate, then all bets are off. In practice, iterating
one-step predictions usually produces poor results. The one-step
errors compound and accumulate into large errors in the long-term
predictions. In addition, computing long-term predictions from
one-step ones is computationally prohibitive. In a
stochastic world, or for a stochastic policy, the future is not a
single trajectory, but a tree of possibilities, each of which must
be imagined and weighted by its probability. As a result, the
computational complexity of computing a long-term prediction from
one-step predictions is exponential in the length of the
prediction, and thus generally infeasible.
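Both failure modes are easy to see in a toy example of my own
construction (the numbers are purely illustrative): a one-step model
that is off by just 1% per step, and a count of the trajectories an
exact stochastic rollout would have to enumerate.

    # Error compounding: true dynamics double the state each step; the
    # learned one-step model is off by only 1%. The relative error after
    # h steps is 1.01**h - 1, so a tiny one-step error grows geometrically.
    def true_step(s):
        return 2.0 * s

    def model_step(s):
        return 2.0 * 1.01 * s  # one-step prediction with a 1% error

    s_true = s_pred = 1.0
    for h in range(1, 101):
        s_true, s_pred = true_step(s_true), model_step(s_pred)
        if h in (1, 10, 50, 100):
            print(f"h={h:3d}  relative error = {abs(s_pred - s_true) / s_true:.1%}")
    # h=1: 1.0%, h=10: 10.5%, h=50: 64.5%, h=100: 170.5%

    # Exponential branching: with b equally likely outcomes per step, an
    # exact H-step prediction must enumerate and weight b**H trajectories.
    b, H = 4, 20
    print(f"trajectories to enumerate: {b ** H:,}")  # 1,099,511,627,776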
The bottom line is that one-step models of the world are hopeless,
yet extremely appealing, and are widely used in POMDPs, Bayesian
analyses, control theory, and compression theories of AI.
The solution, in my opinion, is to form temporally abstract models
of the world using options and general value functions (GVFs), as
in the references below.
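To make the contrast concrete, here is a minimal tabular sketch in
the spirit of the GVFs of the Horde paper below. The cumulant,
continuation function, and toy random-walk environment are
illustrative assumptions, not from that work. The point is that the
long-term prediction is learned directly by temporal-difference
learning, with a state-dependent continuation γ, rather than
computed by iterating a one-step model.

    import random

    n_states = 10
    v = [0.0] * n_states   # learned long-term prediction, one entry per state
    alpha = 0.1            # step size

    def cumulant(s_next):      # the signal whose discounted sum we predict
        return 1.0 if s_next == n_states - 1 else 0.0

    def continuation(s_next):  # state-dependent gamma; 0 ends the prediction
        return 0.0 if s_next == n_states - 1 else 0.9

    def env_step(s):           # toy random walk (illustrative assumption)
        return min(s + 1, n_states - 1) if random.random() < 0.5 else max(s - 1, 0)

    for _ in range(10_000):
        s = random.randrange(n_states - 1)
        s_next = env_step(s)
        # TD(0): move v(s) toward the cumulant plus the continued prediction
        target = cumulant(s_next) + continuation(s_next) * v[s_next]
        v[s] += alpha * (target - v[s])

    print([round(x, 2) for x in v])  # predictions rise toward 1 near the goal

Each such GVF answers one long-horizon predictive question in a
single learned step; Horde runs many of them in parallel, and
options extend the same idea to models of extended courses of
action.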
Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and
semi-MDPs: A framework for temporal abstraction in reinforcement
learning. Artificial Intelligence 112:181-211.
Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M.,
White, A., Precup, D. (2011). Horde: A scalable real-time
architecture for learning knowledge from unsupervised sensorimotor
interaction. In Proceedings of the Tenth International Conference
on Autonomous Agents and Multiagent Systems, Taipei, Taiwan.
Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D.,
Timbers, F., Tanner, B., White, A. (2023). Reward-respecting
subtasks for model-based reinforcement learning. Artificial
Intelligence 324.