Reinforcement Learning and Artificial Intelligence (RLAI) Temporal-Difference Networks (TD Nets)

Edited by Eddie Rafols

The ambition of this page is to provide a centralized location to discuss and present current TD Network research and plans for the future.

Temporal-Difference Networks are an approach to knowledge representation in which knowledge is defined as a set of predictions about the future.

More information regarding TD Networks can be found in the TD Networks paper. [pdf]

The group meetings are on hiatus as we focus on submitting conference papers on specific topics relating to TD Nets.

Submitted Papers, Planned Papers and Papers in Preparation:
Using Predictive Representations to Improve Generalization in Reinforcement Learning - Submitted
'More TD Nets'
TD Nets with Options

There will be no meeting on Dec 14

Tuesday, December 7th:

• Eddie presented an updated version of the equations for TD nets with options, revised to follow the timing conventions of the latest version of the TD nets paper.  The updated version is now available here (Updated Dec. 12/04).
• Rich would like to present his ideas regarding continuous (or small) time TD networks, and start to discuss possible implications for future work.  A version of these equations is here.
• Brian would like to discuss his new results to do with batch training of TD Networks and possible directions for a discovery algorithm.

For Tuesday, November 30th:
- I (Eddie) will be presenting the mathematics underlying TD Nets and Options.  I should have some handouts prepared so that everyone can follow along on their own papers.  I may post the pdf file prior to the meeting depending on when it gets done.  We can argue about the timing (I'm not convinced that I have everything subscripted correctly).
- Rich has some cool new equations that extend TD Nets and Options to continuous time that, I believe, he would like to show off in this meeting as well.

Monte Carlo Networks, TD(1) and TD(0) Networks
Brian
Just figured I'd fill some space up here with what's new with me.  We can now do Monte Carlo, TD(1) (MC targets when you have 'em, TD targets when you don't), and TD(0) networks.  We can augment our state with history.  We can leverage conjunctions of predictions and actions.  There is a lot we can do.  It is time to have fun! I've been playing with a couple of basic grid worlds and my display program - and some of it is being learned and some is not.  I want to find out how much this matters.
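As a rough sketch of how these three kinds of targets relate, assume a chain of nodes in which node i predicts the observation i+1 steps ahead. That chain structure, and the function names and arguments below, are assumptions made for illustration only; the networks described above may be set up differently.

```python
def td0_targets(next_observation, next_predictions):
    """TD(0) targets: node 0 is trained toward the next observation,
    node i > 0 toward node i-1's prediction at the next time step."""
    return [next_observation] + list(next_predictions[:-1])


def mixed_targets(observations, next_predictions, t):
    """Targets at time t using Monte Carlo where possible, TD otherwise.

    observations     -- observations recorded so far, indexed by time step
    next_predictions -- the chain's predictions at time t + 1
    """
    if t + 1 >= len(observations):
        raise ValueError("need at least the observation at time t + 1")
    td = td0_targets(observations[t + 1], next_predictions)
    targets = []
    for i in range(len(next_predictions)):
        k = t + i + 1  # time step that node i's prediction is about
        if k < len(observations):
            targets.append(observations[k])  # Monte Carlo: the actual outcome
        else:
            targets.append(td[i])  # outcome not seen yet: bootstrap instead
    return targets
```

When every outcome has already been observed, `mixed_targets` returns pure Monte Carlo targets; when an outcome lies beyond the recorded data, that node falls back to its TD(0) target.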

A few weeks ago we talked about implementing a perfect 'fake' TD Net and experimenting with what it could buy us as a state representation.  I'd like to come up with an interesting problem domain - something where we have a reward, and we have observations, and we can see what happens when we try to throw an agent on top of the TD Net.  I'm thinking not only can we learn the TD Net online, but we can also run the agent online in order to maximize the reward.  To me, this is super exciting - and I can't wait to try some things out.  There are some technical hurdles that we'll have to overcome - but it seems like an exciting place to be.  Also, once we get this running, we can see how much options help us.  Seems like we are within striking distance of some very cool stuff!

PS: I'm thinking of something less ambitious than our big grid world, but something with a decent number of states and interesting behavior required.

On Nov 15 2004, there was a special Monday meeting, 1:00--2:00.  We discussed Cosma Shalizi's paper "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", in preparation for his talk the next day.

On Nov 16 there was a special meeting - the talk of Cosma Shalizi:

Title: Building Predictive Hidden-State Models from Time Series, with Neuronal Applications

Abstract: I present a new method for nonlinear prediction of discrete time series, under minimal structural and statistical assumptions. First, I give a mathematical construction for optimal predictors of such processes, in the form of hidden Markov models; this formalism is closely related to the "predictive state representation" of Littman, Sutton and Singh. This leads to an algorithm, CSSR (Causal-State Splitting Reconstruction), which approximates the ideal predictor from data. I will discuss the reliability of CSSR, its data requirements, its performance in simulations, and its strengths and weaknesses in comparison to variable-length Markov models and cross-validated hidden Markov models.  Finally, I describe two in-progress applications to neuronal systems: measuring synchrony across neurons, and decoding populations of spike trains.

Meeting October 26 2004

1) Discussion of Brian's further thoughts on solving the 2-state, switch/stay problem.  It may be possible to conjunct on the last action and handle everything.  Makes sense, but does it generalize to options?

Brian: Just to elaborate; the problem that I've been encountering has to do with freeing ourselves of history and making predictions strictly based on previous predictions and current action/observation.  Even in the simplest of systems, this hasn't been happening.  It appears that the cause is the lack of conjunction between actions and predictions.  Basically - we have a set of predictions that are action conditioned, and we want now to calculate this set of predictions for the next time step.  Well, without knowing which action was actually taken, it seems that this may not be possible.  It seems intuitive that we perhaps should provide conjunction of predictions with the observations/actions that they are conditioned on.

2) Discussion of standardizing on a scalable, flexible gridworld.  Perhaps several sensory systems (Markov, bit2bit, quad walls, colors, motivational reservoirs).

3) Perhaps provide a perfect, cheating TD net, producing a predictive representation.  See if you can then solve the problem (maximizing reward, planning) with that TD state representation.

Brian: It seems that if this were possible, if we could act 'optimally' with a sensor-deprived agent in a complex environment using TD Networks as a state representation, then we could make a strong case that TD Networks are a powerful idea that deserves more attention. (Or at least attention from people other than us) :)
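The conjunction idea from item 1 above can be sketched as a feature construction: each prediction is copied into a block of features selected by the action that was actually taken, so every action gets its own set of weights over the previous predictions. This is only one possible reading of "conjunction", and the function name and arguments are hypothetical.

```python
def conjoin_with_action(predictions, last_action, num_actions):
    """Build a feature vector with one block of predictions per action.

    Only the block for the action actually taken is filled in; the
    other blocks are zero, so each action has separate weights.
    """
    if not 0 <= last_action < num_actions:
        raise ValueError("last_action must be in range(num_actions)")
    features = []
    for action in range(num_actions):
        for y in predictions:
            features.append(y if action == last_action else 0.0)
    return features
```

For example, with predictions `[0.2, 0.9]`, two actions, and action 1 taken, the features are `[0.0, 0.0, 0.2, 0.9]`.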

For next week:

1. Brian will try his idea and report
2. We will get a dense gridworld going with things moving on the screen.  Cosmin and Eddie are going to get together on this.
3. Brian: Should we consider a different name than TD Networks?  We have discussed in the past that the idea could be implemented with Monte Carlo, and that eventually some combination of MC and TD will be best.  Should we change the name now, before it's too late?

Eddie: In the RL Kids meeting, we brought up the issue of trying TD Nets in non-gridworld tasks.  I think that it is important to look at this so that we don't end up with algorithms that are only relevant to gridworlds.  Brian brought up the idea of trying TD Nets for Rock-Paper-Scissors.  This could be an interesting problem as this is an example that goes beyond the current two-action/two-observation implementation of TD Nets.  If we treat the target of the TD Net as win or lose, this should be an interesting example for TD Nets with more than two actions.  In addition to this, if we take the problem of predicting the next observation (rock, paper or scissors), we are, in effect, creating an opponent modeler.

There are several issues that were brought up regarding TD Nets:
1) How to handle more than two actions
2) How to handle more than two observations
3) How to handle non-symmetric question networks/how to best choose a network for a given problem.
(I know that I am forgetting something here, so if anyone can amend this list to what we had sketched on the board, please correct this)

An issue that was brought up and that we need to consider is that, right now, TD Nets are being used as binary classifiers.  It seems that we need to decide how TD Nets should work when we are dealing with a multi-class problem.  The example that was brought up was if we needed to predict colours (perhaps red, blue or green).  We need to decide how the network would handle this.  For each colour, currently, we are making the prediction of whether any individual colour is expected, but what happens if we have a high activation for both red and blue?  These predictions are supposed to be mutually exclusive so what happens if we think both are likely to occur?  Since our predictions are not constrained to sum up to one, we can find ourselves in the situation where the TD Net is predicting that two colours may be active while in the problem, they should be mutually exclusive.
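A small numerical illustration of the problem, assuming each colour is predicted by its own independent logistic unit (the activation values below are made up for the example):

```python
import math


def sigmoid(z):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-z))


def colour_predictions(activations):
    """Apply an independent sigmoid to each colour's activation."""
    return {colour: sigmoid(z) for colour, z in activations.items()}


# Hypothetical activations: red and blue are both strongly active.
predictions = colour_predictions({"red": 2.0, "blue": 1.5, "green": -3.0})
total = sum(predictions.values())
```

Here both `predictions["red"]` and `predictions["blue"]` are well above 0.5, and `total` exceeds 1, even though only one colour can actually occur.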

I can't resist giving my opinion on this: I believe that discretizing the prediction space can help solve this multi-class problem.  If it is known that a node may have to take on multiple values, I think that it would be appropriate to tile the prediction space.  If each different 'intermediate prediction' could be represented by a different bit in the input vector, I believe that we would gain a significant amount of representational power.

To give an example, if we had a node that, optimally, would take on values of 0, 1/3, 2/3 or 1, depending on what 'state' the agent was in, by tiling the prediction space, we would be able to assign separate weights to each of these different predictions.  As it is, if we have one weight for the node that could settle on these four different values, I don't believe that it would be possible for the weight to converge on a proper value.

Perhaps I am biased toward representing everything as a unit vector, but in my opinion, this seems to be a simple way to make learning easier and make the task of proving convergence simpler as well.
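The tiling described above might look something like the following minimal sketch, which assumes equally spaced tiles centred on 0, 1/3, 2/3 and 1 (for four tiles) and a one-hot encoding. The function name and the choice of tile placement are illustrative assumptions, not a settled design.

```python
def tile_prediction(y, num_tiles=4):
    """One-hot encode a prediction in [0, 1] by its nearest tile centre.

    With num_tiles=4 the centres are 0, 1/3, 2/3 and 1, so each of
    those prediction values activates a different input bit.
    """
    if num_tiles < 2:
        raise ValueError("need at least two tiles")
    y = min(1.0, max(0.0, y))  # clamp to the valid prediction range
    index = min(num_tiles - 1, int(y * (num_tiles - 1) + 0.5))
    bits = [0] * num_tiles
    bits[index] = 1
    return bits
```

A downstream node then sees a separate input bit, and hence a separate weight, for each intermediate prediction value rather than a single real-valued input.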

Alborz: I am interested in Brian's idea about the fake TD-Net. I would like to have a discussion with him about this issue.