Reinforcement Learning and Artificial Intelligence (RLAI)
Artificial intelligence is solved
--Rich Sutton, October 25, 2004
These are some notes by Rich Sutton on the topic of artificial intelligence and what more remains before we can consider it solved.  The notes are from Dec 2003, but they concern ideas we are still investigating, namely intrinsic motivation and knowledge representation.  The ambition of this page is to get people to think more about these ideas and comment on them here.

One could say that the problem of AI is solved, essentially solved, by RL as in the textbook.  We can do learning.  We can do planning.  We understand the way these do, and do not, interact.  We can gain knowledge about the world and use it flexibly, as in Dyna.  What more could you want?  You might want to do all these things more efficiently, but that maybe is a detail.  Are we really done with the what and the why of what is computed?  I like this question even though the answer, ultimately, is no.  It reminds us of our ambition and guides us in the right future directions.


Before we will have completed the outline of AI, we will have to fully address at least one more major issue.  Let us call it the issue of knowledge.  What kind of knowledge should there be?  This question is the ultimate computational theory question, the ultimate what? and why?

What is knowledge?  I would like to say that it is always experience or compilations of experience, but that is too strong.   There is also policy knowledge.  And one would think there is also knowledge that is just compiled computations, such as “I thought about that a long time and I couldn’t find any way to make it work”.  But right now I am thinking about that most basic kind of knowledge, that about the way the world behaves, not about ourselves.  This is "what would happen if" knowledge.  Let us call this world knowledge. 


The problem of world knowledge defies the conventional separation into state and transition knowledge.  There is just prediction and accuracy of prediction, and ability to predict key events of interest.  There is no state-wise partition of what you know. 

But then what should you know?  What should you predict?  The null hypothesis would have to be that what we strive to predict would be ultimately determined by our genetic inheritance, and that other things would become of interest because of their relationship to that.   

Thus, things can become of interest, for prediction, for any of three reasons:

1. Because they have been predesignated, perhaps arbitrarily, as being of interest.  Rewards are like this, but not just rewards.  We might view these other things as the designer’s guesses about what might, at some point in the future, be useful toward obtaining rewards.

2. Because they have been found to be causally related, or sometimes causally related, to things of interest for Reason 1 above.

3. Because it has been found that they can be learned about.  This is in large part a modulation of the first two.  Those are intrinsic reasons for interest.  This one is about the fruitfulness of trying to learn about them.  This is curiosity.  This is the reward that comes just from learning.  

Ok.  So world knowledge is the ability to predict inputs, or more typically functions of inputs, that are of interest as just defined.

The functions of inputs might be things like their discounted cumulated sum over time (even for non-rewards).  Such composite measures may be far more important to us than individual signals at particular time steps.
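As a concrete sketch of one such composite measure, here is the discounted cumulated sum of an arbitrary signal (the function name and the discount parameter gamma are illustrative choices, not anything fixed in these notes):

```python
def discounted_sum(signal, gamma=0.9):
    """Discounted cumulated sum of a signal at each time step:
    G[t] = signal[t] + gamma*signal[t+1] + gamma^2*signal[t+2] + ...
    Computed in one backward pass."""
    g = 0.0
    out = [0.0] * len(signal)
    for t in reversed(range(len(signal))):
        g = signal[t] + gamma * g
        out[t] = g
    return out
```

Note the signal need not be a reward; any function of the observations can be summarized this way.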

The individual inputs should be called observations.

This brings up an important question:  Does it necessarily all come down to predicting next observations?  Or is there a meaningful alternative at a higher scale?  [below it is proposed that the answers to these two questions are YES and NO.]


I think there is a clear, identifiable, big-science kind of problem in the creation of a world model, also known as a large-ish collection of knowledge, that is completely grounded in a causal, temporal sequence of observations and actions, sensors and effectors, where all knowledge is predictions about the future of the sequence.  A grand challenge.  We know in some sense that this must be possible.  It is a direct challenge which should be accessible even to people without knowledge of RL.  And we want the knowledge in a form suitable for planning.  It should at least permit simulation of future experience, presumably at a high level.

This grand challenge could be a good basis for collaborative work.
Grounded world knowledge
    Knowledge is predictions
    Bits to bits. Data to data.
It has appeal for roboticists, and for psycho-philosophers like me, via the emphasis on experience and on having a life, via the call for verification, and via the call for pulling the parts of AI together.


Let us call it the Grounded World Modeling Problem (GWMP).  It has these key features:
1. You have sensors rather than state information.
2. You want the model to be suitable for planning.
3. You want it to be learnable/verifiable (because it is grounded).
4. It can express a wide range of world knowledge.
5. All the knowledge is expressed as predictions about future experience.
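As a hedged sketch of what these five features might mean in code (all names here are my own, not part of the problem statement), a GWMP-style model might expose an interface like this:

```python
class PredictiveModel:
    """Sketch of a GWMP-style model: it sees only observations, never
    underlying state (feature 1), and everything it holds is a
    prediction about future experience (feature 5)."""

    def update(self, action, observation):
        """Learn from one step of grounded experience; because the data
        is grounded, predictions are verifiable (feature 3)."""
        raise NotImplementedError

    def predict(self, action):
        """Predicted next observation if `action` is taken -- the basic
        operation planning would build on (feature 2)."""
        raise NotImplementedError


class PersistenceModel(PredictiveModel):
    """Trivial concrete example: predict that the last observation repeats."""

    def __init__(self):
        self.last = None

    def update(self, action, observation):
        self.last = observation

    def predict(self, action):
        return self.last
```

Even this trivial model is verifiable from the data stream alone, which is the point of grounding.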


A conceptual breakthrough (perhaps) in the predictive modeling problem:  There is an outstanding question: do all predictive statements come down, ultimately, to one-step predictions?  Of course there are multi-step predictions.  But to evaluate them, does it always come down to the accuracy of next-step predictions?

I am thinking more and more that the answer is YES.  No more complex/structured/interesting notion is needed.  All the rest can be done by TD.  In particular, we may be able to define K (the quantity of knowledge in a model, aka the accuracy of a model) as the expected accuracy of the sequence of one-step predictions given the equiprobable policy.  Transient K can be handled as in average-reward-case RL.
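A minimal sketch of estimating K as the expected one-step prediction accuracy under the equiprobable policy (the toy world, the models, and all names are illustrative assumptions, not anything proposed in the notes):

```python
import random

def one_step_accuracy(model, step, n_actions, start_obs, n_steps=5000, seed=0):
    """Estimate K for `model`: the fraction of correct one-step
    predictions along a trajectory generated by the equiprobable
    random policy.  `model(obs, a)` is the predicted next observation;
    `step(obs, a)` is the world's actual next observation."""
    rng = random.Random(seed)
    obs, correct = start_obs, 0
    for _ in range(n_steps):
        a = rng.randrange(n_actions)        # equiprobable random policy
        nxt = step(obs, a)                  # what actually happens
        correct += (model(obs, a) == nxt)   # was the prediction right?
        obs = nxt
    return correct / n_steps

# Illustrative deterministic world: the observation cycles among {0, 1, 2}.
world = lambda obs, a: (obs + a) % 3
perfect = world                 # a model with full world knowledge: K = 1
ignorant = lambda obs, a: 0     # a model that always predicts 0: K < 1
```

For stochastic worlds one would score predicted probabilities (e.g. log loss) rather than exact matches, but the shape of the definition is the same.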


The beginnings of a new idea:  We have a proposed definition of K from yesterday.  But there remains the question of the action selections.  It seems inadequate to base K on the equiprobable random policy.  This leaves us caring about all sorts of crazy random dances that have no point, that don’t get us anywhere.

The beginning of a solution is to note these last few phrases.  We care not so much about prediction for all possible actions as about being able to cause all possible sensations.  If we have learned one set of ways of behaving which lets us control the sensations completely, perhaps produce any desired/possible sensation sequence, then we have learned all we need to know about the world.  Note that there may be much more to learn.  We may not know the sensations which would follow many dances, but we do not need to.  We know how to absolutely control, if not predict, all the bits of interest.  This is all that one could ever need in any subsequent planning problem.


In talking with Satinder again, we refined this one more step.  The above criterion for full knowledge is too strong in that it asks for complete control whereas typically this will not be possible.  Suppose we had a sufficient statistic.  This means that we can predict the probability of any sequence or, equivalently, that we can predict the probability of any next observation after any possible sequence.  If we can control the observations as well as one could with this, then we say we have full knowledge of the world.

    From:       Michael Littman
    Date:        November 4, 2004 4:10:03 PM MST

   In your message, you asked whether all knowledge prediction is
one-step prediction.  I think a better question might be: is one-step
prediction accuracy a sufficient signal for learning?  I believe the
answer to this question is no.  Here's a thought experiment.

   First, consider the well defined problem of predicting the next
character in an English text.  Shannon and others used this task as a
way to estimate the entropy of English.  Current estimates are roughly
1.1 bits per character.  Machine learning methods can achieve on the
order of 1.22 bits per character.  This doesn't look like a
significant difference.  It certainly doesn't look as significant as
the difference in understanding and world knowledge between a person
and current machine-learning methods.

   Further, imagine if Stephen Hawking and I were both tested on
cosmology papers.  I suspect the difference in per character entropy
would be insignificant in spite of the fact that his understanding
would be much deeper and more substantial than mine.  A learning
system with the drive of improving its per character entropy wouldn't
have much motivation to learn deep understanding.  A little perhaps,
but hardly enough to get excited about.

   On the other hand, consider this example, due to Geoff Gordon.  (We
had a chat at the symposium on this topic.)  We can present Hawking
and me a cosmology paper character by character and repeatedly ask
"will the next theorem use an application of the uncertainty
principle?"  This is a prediction that a neutral observer could verify
and would likely show a very significant difference between the two.

   I'd like to highlight a few things from this example.  First, the
prediction is abstract, both in temporality and in the knowledge
level.  I think this is good and it agrees with our intuitions about
the importance of abstraction in creating and using knowledge.
Second, the verifier itself appears to need high-level knowledge.

   This seems to be a stumbling block.  Can we come up with a way of
dealing with this issue?  I guess one thing we could use for leverage
is the fact that the prediction verifier doesn't need as much
knowledge as the predictor.  For example, a lesser physicist than
Hawking can check whether the predictions are correct.  In fact, I
could probably check my own predictions (and use these checks as a
signal for learning).

   So, the prediction story, the knowledge story, and the abstraction
story are intertwined in a way that we should try to articulate.
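An aside on the entropy figures in the email: a per-character entropy estimate is just the average negative log2 probability a model assigns to a text.  A minimal sketch, assuming a toy unigram model with add-one smoothing (this is not how Shannon's estimate, or the machine-learning figure cited above, was actually computed):

```python
import math

def bits_per_character(text, counts):
    """Cross-entropy (bits/char) of `text` under a unigram model given
    by `counts`, with add-one smoothing (one extra slot reserved for
    unseen characters)."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    p = lambda c: (counts.get(c, 0) + 1) / (total + vocab)
    return -sum(math.log2(p(c)) for c in text) / len(text)
```

A model with deeper knowledge assigns higher probability to what actually comes next and so scores fewer bits per character; Littman's point is that the numerical gap can be small even when the gap in understanding is large.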
