Reinforcement Learning and Artificial Intelligence is Solved
These are some notes by Rich Sutton on the topic of artificial
intelligence and what more remains before we can consider it
solved. The notes are from Dec 2003, but they concern ideas we
are still investigating, namely intrinsic motivation and knowledge
representation. The ambition of this page is to get people to think more about these ideas and comment on them here.
One could say that the problem of AI is solved, essentially solved, by
RL as in the textbook. We can do learning. We can do
planning. We understand the way these do, and do not,
interact. We can gain knowledge about the world and use it
flexibly, as in Dyna. What more could you want? You might
want to do all these things more efficiently, but perhaps that is a
detail. Are we really done with the what and the why of what is
computed? I like this question even though the answer,
ultimately, is no. It reminds us of our ambition and guides us in
the right future directions.
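The learning-plus-planning combination referred to above, as in Dyna, can be sketched in a few lines. This is only an illustrative tabular version; the function names, the corridor world, and all parameter values are my own for the example, not anything from the textbook:

```python
import random
from collections import defaultdict

def dyna_q(env_step, actions, episodes=30, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, start=0, max_steps=200):
    """Tabular Dyna-Q: learn from real experience, remember it in a model,
    and use the model to generate extra simulated updates (planning)."""
    Q = defaultdict(float)   # Q[(state, action)] action-value estimates
    model = {}               # model[(state, action)] = (reward, next_state, done)
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            # epsilon-greedy action selection with random tie-breaking
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                best = max(Q[(s, a_)] for a_ in actions)
                a = random.choice([a_ for a_ in actions if Q[(s, a_)] == best])
            r, s2, done = env_step(s, a)
            # (1) direct RL: Q-learning update from real experience
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) model learning: record what the world just did
            model[(s, a)] = (r, s2, done)
            # (3) planning: replay simulated experience drawn from the model
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * max(Q[(ps2, a_)] for a_ in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            if done:
                break
            s = s2
    return Q

def corridor(s, a):
    """Toy world for the sketch: a five-state corridor. Action 1 moves right,
    action 0 moves left; reward 1 on reaching state 4, which ends the episode."""
    s2 = min(4, s + 1) if a == 1 else max(0, s - 1)
    return (1.0 if s2 == 4 else 0.0), s2, s2 == 4
```

The point of the sketch is only the three-way split: the same update rule is applied to real experience and to experience simulated from the learned model, which is how knowledge of the world gets used flexibly for planning.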
Before we have completed the outline of AI, we will have to fully
address at least one more major issue. Let us call it the issue
of knowledge. What kind of knowledge should there be? This
question is the ultimate computational theory question, the ultimate
what? and why?
What is knowledge? I would like to say that it is always
experience or compilations of experience, but that is too
strong. There is also policy knowledge. And one would
think there is also knowledge that is just compiled computations, such
as “I thought about that a long time and I couldn’t find any way to
make it work”. But right now I am thinking about that most basic
kind of knowledge, that about the way the world behaves, not about
ourselves. This is "what would happen if" knowledge. Let us
call this world knowledge.
The problem of world knowledge defies the conventional separation into
state and transition knowledge. There is just prediction and
accuracy of prediction, and ability to predict key events of
interest. There is no state-wise partition of what you know.
But then what should you know? What should you predict? The
null hypothesis would have to be that what we strive to predict would
be ultimately determined by our genetic inheritance, and that other
things would become of interest because of their relationship to these.
Thus, things can become of interest, for prediction, for any of three reasons:
1. Because they have been predesignated, perhaps arbitrarily, as being
of interest. Rewards are like this, but not just rewards.
We might view these other things as the designer’s guesses about what
might, at some point in the future, be useful toward obtaining rewards.
2. Because they have been found to be causally related, or sometimes causally related, to things of interest for Reason 1 above.
3. Because it has been found that they can be learned about. This
is in large part a modulation of the first two. Those are
intrinsic reasons for interest. This one is about the
fruitfulness of trying to learn about them. This is
curiosity. This is the reward that comes just from learning itself.
Ok. So world knowledge is the ability to predict inputs, or more
typically functions of inputs, that are of interest as just defined.
The functions of inputs might be things like their discounted cumulated
sum over time (even for non-rewards). Such composite measures may
be far more important to us than individual signals at particular times.
The individual inputs should be called observations.
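Such a discounted cumulated sum of a signal can be computed with a single backward pass. A small sketch (the function name and discount value are illustrative, and the signal here is any sequence of observations, not necessarily rewards):

```python
def discounted_sum(signal, gamma=0.9):
    """For each time t, the discounted cumulated sum of the signal from t onward:
    G_t = x_t + gamma * x_{t+1} + gamma^2 * x_{t+2} + ...
    Computed in one backward pass using G_t = x_t + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for x in reversed(signal):
        G = x + gamma * G
        out.append(G)
    return out[::-1]
```

For example, a signal that is 1 only at the final step yields geometrically fading predictions at earlier steps, which is what makes such composite measures temporally abstract.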
This brings up an important question: Does it necessarily all
come down to predicting next observations? Or is there a
meaningful alternative at a higher scale? [below it is proposed
that the answers to these two questions are YES and NO.]
I think there is a clear, identifiable, big-science kind of problem in
the creation of a world model, also known as a large-ish collection of
knowledge, that is completely grounded in a causal, temporal sequence
of observations and actions, sensors and effectors. Where all
knowledge is predictions about the future of the sequence. A
grand challenge. We know in some sense that this must be
possible. It is a direct challenge which should be accessible
even to people without knowledge of RL. And we want the knowledge
in a form suitable for planning. It should at least permit
simulation of future experience, presumably at a high level.
This grand challenge could be a good basis for collaborative work.
Grounded world knowledge
Knowledge is predictions
Bits to bits. Data to data.
It has appeal for roboticists, for psycho-philosophers like me, via the
emphasis on experience and on having a life, via the call for
verification, via the call for pulling the parts of AI together.
Let us call it the Grounded World Modeling Problem (GWMP). It has these key features:
1. You have sensors rather than state information.
2. You want the model to be suitable for planning.
3. You want it to be learnable/verifiable (because it is grounded).
4. It can express a wide range of world knowledge.
5. All the knowledge is expressed as predictions about future experience.
A conceptual breakthrough (perhaps) in the predictive modeling
problem: There is an outstanding question: do all predictive
statements come down, ultimately, to one-step predictions? Of
course there are multi-step predictions. But to evaluate them,
does it always come down to the accuracy of next-step predictions?
I am thinking more and more that the answer is YES. No more
complex/structured/interesting notion is needed. All the rest can
be done by TD. In particular, we may be able to define K (the
quantity of knowledge in a model, aka the accuracy of a model) as the
expected accuracy of the sequence of one-step predictions given the
equiprobable policy. Transient K can be handled as in
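The proposed definition of K can be estimated directly by sampling: run the equiprobable random policy and average the probability the model assigns to whatever observation actually occurs next. This is only a sketch of that definition; the function names and the toy ring world are mine, not part of the proposal:

```python
import random

def knowledge_K(env_step, model_prob, actions, start=0, steps=5000, seed=1):
    """Monte-Carlo estimate of K, the quantity of knowledge in a model:
    the expected probability the model assigns to the observation that
    actually occurs next, with actions chosen equiprobably at random."""
    rng = random.Random(seed)
    s, total = start, 0.0
    for _ in range(steps):
        a = rng.choice(actions)         # equiprobable random policy
        s2 = env_step(s, a, rng)
        total += model_prob(s, a, s2)   # model's probability of the true next observation
        s = s2
    return total / steps

# Illustrative world: a deterministic 3-state ring; the action is the step size.
ring = lambda s, a, rng: (s + a) % 3
perfect = lambda s, a, s2: 1.0 if s2 == (s + a) % 3 else 0.0  # knows the world
uniform = lambda s, a, s2: 1.0 / 3.0                          # knows nothing
```

A perfect model scores K = 1, and a model that just guesses uniformly scores 1/3 in this world, so K behaves as an accuracy-of-model measure.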
The beginnings of a new idea: We have a proposed definition of K
from yesterday. But there remains the question of the action
selections. It seems inadequate to base K on the equiprobable
random policy. This leaves us caring about all sorts of crazy
random dances that have no point, that don’t get us anywhere.
The beginning of a solution is to note these last few phrases.
We care not so much about prediction for all possible actions as about
being able to cause all possible sensations. If we have learned
one set of ways of behaving which lets us control the sensations
completely, perhaps produce any desired/possible sensation sequence,
then we have learned all we need to know about the world. Note
that there may be much more to learn. We may not know the
sensations which would follow many dances, but we do not need to.
We know how to absolutely control, if not predict, all the bits of
interest. This is all that one could ever need in any subsequent task.
In talking with Satinder again, we refined this one more step.
The above criterion for full knowledge is too strong in that it asks
for complete control whereas typically this will not be possible.
Suppose we had a sufficient statistic. This means that we
can predict the probability of any sequence or, equivalently, that we
can predict the probability of any next observation after any possible
sequence. If we can control the observations as well as one could
with this, then we say we have full knowledge of the world.
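One crude stand-in for such a sufficient statistic is a fixed-length window of recent observations: if the window is long enough for the world in question, the estimated conditionals determine the probability of any next observation after any possible sequence. A sketch along those lines, with all names my own:

```python
from collections import defaultdict

class HistoryPredictor:
    """Estimates P(next observation | last k observations) from experience.
    The length-k window stands in for a sufficient statistic of the sequence:
    when k suffices for the world, these one-step conditionals determine the
    probability of any observation sequence."""
    def __init__(self, k=1):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, sequence):
        """Count next-observation outcomes after every length-k window."""
        for t in range(self.k, len(sequence)):
            hist = tuple(sequence[t - self.k:t])
            self.counts[hist][sequence[t]] += 1

    def prob(self, history, obs):
        """Estimated probability that obs follows the given history."""
        hist = tuple(history[-self.k:])
        total = sum(self.counts[hist].values())
        return self.counts[hist][obs] / total if total else 0.0
```

On a strictly alternating stream like "abab...", a window of one observation is already sufficient, and the predictor's conditionals become exact.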
From: Michael Littman
Date: November 4, 2004 4:10:03 PM MST
In your message, you asked whether all knowledge prediction is
one-step prediction. I think a better question might be, is one-step
prediction accuracy a sufficient signal for learning? I believe the
answer to this question is no. Here's a thought experiment.
First, consider the well-defined problem of predicting the next
character in an English text. Shannon and others used this task as a
way to estimate the entropy of English. Current estimates are roughly
1.1 bits per character. Machine learning methods can achieve on the
order of 1.22 bits per character. This doesn't look like a
significant difference. It certainly doesn't look as significant as
the difference in understanding and world knowledge between a person
and current machine-learning methods.
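The bits-per-character figures being compared here are cross-entropies of a predictive model on text. A minimal sketch of the measurement, with a context-free unigram baseline (the function names are mine, and a serious model would of course condition on the prefix):

```python
import math
from collections import Counter

def bits_per_character(text, model_probs):
    """Cross-entropy of a predictive model on a text, in bits per character.
    model_probs(prefix) returns a dict mapping each possible next character
    to its predicted probability."""
    total = 0.0
    for t in range(len(text)):
        p = model_probs(text[:t]).get(text[t], 0.0)
        if p <= 0.0:
            return float('inf')   # the model ruled out something that happened
        total -= math.log2(p)
    return total / len(text)

def unigram_model(train_text):
    """Context-free baseline: predict each character by its overall frequency,
    ignoring the prefix entirely."""
    counts = Counter(train_text)
    n = len(train_text)
    probs = {c: counts[c] / n for c in counts}
    return lambda prefix: probs
```

The thought experiment's point survives the sketch: two models can sit within a tenth of a bit of each other on this measure while differing enormously in understanding.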
Further, imagine if Stephen Hawking and I were both tested on
cosmology papers. I suspect the difference in per-character entropy
would be insignificant in spite of the fact that his understanding
would be much deeper and more substantial than mine. A learning system
with the drive of improving its per-character entropy wouldn't have
much motivation to learn deep understanding. A little perhaps, but
hardly enough to get excited about.
On the other hand, consider this example, due to Geoff Gordon. (We
had a chat at the symposium on this topic.) We can present Hawking
and me a cosmology paper character by character and repeatedly ask
"will the next theorem use an application of the uncertainty
principle?" This is a prediction that a neutral observer could verify
and would likely show a very significant difference between the two of us.
I'd like to highlight a few things from this example. First, the
prediction is abstract, both in temporality and in the knowledge
level. I think this is good and it agrees with our intuitions about
the importance of abstraction in creating and using knowledge.
Second, the verifier itself appears to need high-level knowledge.
This seems to be a stumbling block. Can we come up with a way of
dealing with this issue? I guess one thing we could use for leverage
is the fact that the prediction verifier doesn't need as much
knowledge as the predictor. For example, a lesser physicist than
Hawking can check whether the predictions are correct. In fact, I
could probably check my own predictions (and use these checks as a
signal for learning).
So, the prediction story, the knowledge story, and the abstraction
story are intertwined in a way that we should try to articulate.