Reinforcement Learning and Information Access

or

What is the Real Learning Problem in Information Access?


by Rich Sutton
University of Massachusetts
rich@cs.umass.edu


Presented at the AAAI Stanford Spring Symposium on
Machine Learning and Information Access
March 26, 1996

with many thanks to Rik Belew and Jude Shavlik


Introductory Patter

In this talk we will try to take a new look at the learning problem in information access. How is it structured? What training information is really available, or likely to be available? Will there be delays between decision making and the receipt of relevant feedback? I am a newcomer to information access, but I have experience in reinforcement learning, and one of the main lessons of reinforcement learning is that it is really important to understand the true nature of the learning problem you want to solve.

Here is an example that illustrates the whole idea. In 1989 Gerry Tesauro at IBM built the world's best computer player of backgammon. It was a neural network trained from 15,000 examples of human-expert moves. Then he tried a reinforcement learning approach. He trained the same network not from expert examples, but simply by playing it against itself and observing the outcomes. After a month of self-play, the program became the new world champ of computers. Now it is an extremely strong player, on a par with the world's best grandmasters, who are now learning from it! The self-play approach worked so well primarily because it could generate new training data itself. The expert-trained network was always limited by its 15,000 examples, laboriously constructed by human experts. Self-play training data may be individually less informative, but so much more of it can be generated so cheaply that it is a big win in the long run.
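To make the self-play idea concrete, here is a toy sketch in Python. It is not Tesauro's program: the "game" is a trivial random walk and the value function is a lookup table rather than a neural network. It only illustrates the data-generation loop, in which the program plays itself, observes the outcomes, and learns from them.

# Toy illustration of self-play training with temporal-difference updates.
# Not TD-Gammon: the "game" here is a simple random walk and the value
# function is a table, not a neural network -- but the loop is the same:
# play yourself, observe the outcome, learn from it.
import random

N_STATES = 7                       # states 0..6; reaching 0 loses, 6 wins
values = [0.5] * N_STATES          # estimated probability of winning
values[0], values[-1] = 0.0, 1.0   # terminal outcomes supply the training signal
ALPHA = 0.1                        # step-size parameter

def play_one_game():
    state = N_STATES // 2
    while 0 < state < N_STATES - 1:
        next_state = state + random.choice([-1, +1])      # the "move"
        # TD update: shift this state's value toward the next state's value
        values[state] += ALPHA * (values[next_state] - values[state])
        state = next_state

for _ in range(10000):             # "a month of self-play," in miniature
    play_one_game()

print([round(v, 2) for v in values])   # interior states approach 1/6, 2/6, ..., 5/6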

The same may be true for information access. Right now we use training sets of documents labeled by experts as relevant or not relevant. Such training data will always be expensive, scarce, and small. How much better it would be if we could generate some kind of training data online, from the normal use of the system. The data may be imperfect and unclear, but certainly it will be plentiful! It may also be truer in an important sense. Expert-labeled training sets are artificial, and do not accurately mirror real usage. In backgammon, the expert-trained system could only learn to mimic the experts, not to win the game. Only the online-trained system was able to learn to play better than the experts. Its training data was more real.

This then is the challenge: to think about information access and uncover the real structure of the learning problem. How can learning be done online? Learning thrives on data, data, data! How can we get the data we need online, from the normal operation of the system, without relying on expensive, expert-labeled training sets?

This talk proceeds in three parts. The first is an introduction to reinforcement learning. The second examines how parts of the learning problem in information access are like those solved by reinforcement learning methods. But the information access problem doesn't map exactly onto the reinforcement learning problem. It has a special structure all its own. In the third part of the talk we examine some of this special structure and what kind of new learning methods might be applied to it.

The rest of this document is an approximation of the slides presented in the talk.


Conclusions (in advance)


Reinforcement Learning


Classical Machine Learning - Supervised Learning

	situation1  --->  action1     then correct-action1
	situation2  --->  action2     then correct-action2
		      .
		      .
		      .






Reinforcement Learning

	        situation1  --->  action1
	reward2	situation2  --->  action2 
	reward3	situation3  --->  action3 
	                     .
	                     .
	                     .
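The difference between the two protocols above can be written out as two small loops. This is only a sketch; the teacher, environment, and learner objects are hypothetical placeholders, not a real API.

# Supervised learning: after each action, the correct action is revealed.
def supervised_loop(teacher, learner):
    for situation, correct_action in teacher.examples():
        action = learner.act(situation)              # learner's guess
        learner.learn(situation, correct_action)     # told exactly what was right

# Reinforcement learning: the only feedback is a reward, which arrives on the
# next step and never says which action would have been correct.
def reinforcement_loop(env, learner, n_steps):
    situation = env.reset()
    for _ in range(n_steps):
        action = learner.act(situation)
        situation, reward = env.step(action)         # evaluative, delayed feedback
        learner.learn(reward)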










It's not just a harder problem, it's a real problem


Applications of RL


Key Ideas of RL Algorithms

Value Functions

TD Methods


A Large Space of RL Algorithms


Major Components of an RL Agent







Policy - what to do

Reward - what is good

Value - what is good because it predicts reward

Model - what follows what
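One way to see how these four components fit together is a small agent skeleton like the following. The class and method names are made up for this sketch (the learning rule happens to be one-step Q-learning); it is not a standard API.

# Illustrative only: the four components above in one small agent skeleton.
import random

class SimpleAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions
        self.value = {}      # Value: what is good because it predicts reward
        self.model = {}      # Model: what follows what -- (state, action) -> next state
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def policy(self, state):
        """Policy: what to do -- here, epsilon-greedy with respect to the values."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value.get((state, a), 0.0))

    def learn(self, state, action, reward, next_state):
        """Reward: what is good -- it arrives from the environment each step."""
        self.model[(state, action)] = next_state     # learn what follows what
        best_next = max(self.value.get((next_state, a), 0.0) for a in self.actions)
        old = self.value.get((state, action), 0.0)
        self.value[(state, action)] = old + self.alpha * (reward + self.gamma * best_next - old)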


Info-Access Applications of RL

Anytime you have decisions to be made
Anytime you want to make long-term predictions

Classical IR Querying/Routing/Filtering as RL

	Situation = Query or user model + Documents
	Actions	  = Present document?  Rankings
	Reward	  = User feedback on presented docs

Pro RL:
	Feedback is selective
	and does not exactly fit SL framework

Con RL:
	Feedback does not exactly fit RL framework
	Problem is not sequential

e.g.,
Bartell, Cottrell & Belew, 1995
Boyan, Freitag & Joachims, 1996
Schütze, Hull & Pedersen, 1995
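A hedged sketch of the slide's mapping, with the situation, action, and reward slots made explicit. Everything here, the environment class and the toy feedback rule, is a made-up stub for illustration, not any real retrieval system.

# Sketch of the mapping above: situation = query (or user model) + documents,
# action = which document to present, reward = user feedback on what was presented.
# The RoutingEnvironment class and its toy "feedback" rule are stubs for illustration.
import random

class RoutingEnvironment:
    def __init__(self, queries, docs):
        self.queries, self.docs = queries, docs

    def situation(self):
        return random.choice(self.queries), self.docs

    def feedback(self, query, doc):
        # Stand-in for real user feedback (clicked / saved / discarded).
        return 1.0 if any(word in doc for word in query.split()) else 0.0

def run_interaction(env, choose_doc):
    query, docs = env.situation()        # situation
    doc = choose_doc(query, docs)        # action: present a document
    return env.feedback(query, doc)      # reward: user feedback

env = RoutingEnvironment(
    queries=["reinforcement learning", "backgammon"],
    docs=["td learning and backgammon", "a survey of reinforcement learning"],
)
print(run_interaction(env, lambda q, ds: random.choice(ds)))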

Multi-Step Info-Access Problems

But in a sense all these are the same

Learning a complex, interactive, goal-directed, input-sensitive sequence of steps

That's exactly what RL is good for.

The Multi-Step, Sequential Nature of IA


Imagine an Ideal Info-Access System


Shortcutting


Compare...

The classical context

No way the queries can be used to learn about the docs

The Web

There will always be more readings than writings

Thus, we can learn about the docs

Popularity Ratings, Priors on Documents

Q. How do you decide what to access today?

A. Recommendations:


"Its hard to find the good stuff on the web"

But in classical IR there is no concept of good stuff
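One concrete (and purely illustrative) way to give "good stuff" a meaning: count how often each document is actually read, treat that count as a prior, and fold it into the ranking score. The particular combination rule below is an arbitrary choice for this sketch, not something proposed in the talk.

# Illustrative only: a "popularity prior" accumulated from normal use of the system,
# combined with a per-query relevance score. The log-count bonus is an arbitrary choice.
import math
from collections import Counter

read_counts = Counter()            # filled in from readings, the plentiful online data

def record_reading(doc_id):
    read_counts[doc_id] += 1

def score(doc_id, relevance, weight=0.25):
    """Rank by query relevance plus a small bonus for documents people actually read."""
    return relevance + weight * math.log1p(read_counts[doc_id])

# Two equally relevant documents; the much-read one ranks first.
for _ in range(50):
    record_reading("doc-a")
record_reading("doc-b")
print(sorted(["doc-a", "doc-b"], key=lambda d: score(d, relevance=1.0), reverse=True))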

Differences and Similarities between Users


Summary