Reinforcement Learning and
Networks (TD Nets)
Edited by Eddie Rafols
The ambition of this
page is to provide a centralized location to discuss and present
current TD Network research and plans for the future.
Temporal-Difference Networks are an approach to knowledge
representation that defines knowledge as a set of predictions about the future.
More information regarding TD Networks can be found in the TD
Networks paper. [pdf]
The group meetings are on hiatus as we focus on submitting
papers on specific topics relating to TD Nets.
Submitted Papers, Planned Papers
and Papers in
Representations to Improved Generalization in Reinforcement Learning
'More TD Nets'
TD Nets with Options
There will be no meeting on Dec
Tuesday, December 7th:
- Eddie presented an updated version of the equations
for TD nets with options, revised to follow the timing conventions of
the latest version of the TD nets paper. The updated version is now
(Updated Dec. 12/04).
- Rich would like to present his ideas regarding continuous (or
small) time TD networks, and start to discuss possible implications for
future work. A version of these equations is here.
- Brian would like to discuss his new results to do with batch
training of TD Networks and possible directions for a discovery
Tuesday, November 30th:
- I (Eddie) will be presenting the mathematics underlying TD Nets and
Options. I should have some handouts prepared so that everyone
can follow along on their own papers. I may post the pdf file
prior to the meeting depending on when it gets done. We can argue
about timing (I'm not convinced that I have everything subscripted correctly).
- Rich has some cool new equations that extend TD Nets and Options to
continuous time that, I believe, he would like to show off in this
meeting as well.
Monte Carlo Networks, TD(1) and TD(0) Networks
Just figured I'd fill some space up here with what's new with me.
We can now do Monte Carlo, TD(1) (MC targets when you have 'em, TD targets
when you don't), and TD(0) networks. We can augment our state
history. We can leverage conjunctions of predictions and
actions. There is a lot we can do. It is time to have fun! I'm
playing with a couple of basic grid worlds and my display program - and
some of it is being learned and some is not. I want to find out how
much this matters.
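As a rough sketch of how these three kinds of targets differ, the Python below computes targets for a simple action-free chain of prediction nodes, where node i predicts the observation i + 1 steps ahead. The chain layout, the function name chain_targets, and the data layout are assumptions made for illustration, not the code used in these experiments. TD(1) here follows the description above: use the Monte Carlo target when the future observation has been seen, and fall back to the TD target otherwise.

```python
def chain_targets(observations, predictions, t, mode):
    """Return training targets at time t for a chain of prediction nodes.

    Hypothetical setup: node i predicts the observation i + 1 steps ahead.
    observations[k] is the observation bit at time k.
    predictions[k][i] is node i's prediction made at time k.
    mode is "mc", "td1" or "td0". Time t + 1 must already be recorded.
    """
    if mode not in ("mc", "td1", "td0"):
        raise ValueError("mode must be 'mc', 'td1' or 'td0'")
    if t + 1 >= len(observations) or t + 1 >= len(predictions):
        raise ValueError("need data for time t + 1")
    targets = []
    for i in range(len(predictions[t])):
        # TD target: the next observation for node 0, otherwise the
        # previous node's prediction made one step later.
        td_target = observations[t + 1] if i == 0 else predictions[t + 1][i - 1]
        # MC target: the observation actually seen i + 1 steps later, if any.
        k = t + i + 1
        mc_target = observations[k] if k < len(observations) else None
        if mode == "td0":
            targets.append(td_target)
        elif mode == "mc":
            targets.append(mc_target)  # None means "no target yet"
        else:  # "td1": MC target when we have it, TD target when we don't
            targets.append(td_target if mc_target is None else mc_target)
    return targets
```

In this sketch the "mc" mode simply leaves a node without a target (None) until its observation arrives, which is one of several ways the waiting could be handled.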
A few weeks ago we talked about implementing a perfect 'fake' TD Net
and experimenting with what it could buy us as a state representation.
I'd like to come up with an interesting problem domain - something
where we have a reward, and we have observations, and we can see what
happens when we try to throw an agent on top of the TD Net. I'm
thinking not only can we learn the TD Net online, but we can also run
the agent online in order to maximize the reward. To me, this is
exciting - and I can't wait to try some things out. There are
technical hurdles that we'll have to overcome - but it seems like an
exciting place to be. Also, once we get this running, we can see how
much options help us. Seems like we are within striking distance of
very cool stuff!
PS: I'm thinking of something less ambitious than our big grid world,
but something with a decent number of states and interesting behavior
On Nov 15 2004, there was a special Monday meeting,
1:00--2:00. We discussed Cosma Shalizi's paper "Blind Construction of
Optimal Nonlinear Recursive Predictors for Discrete Sequences", in
preparation for his talk the next day.
On Nov 16 there was a special meeting - the talk of Cosma Shalizi:
Title: Building Predictive Hidden-State Models from Time Series, with
Abstract: I present a new method for nonlinear prediction of discrete
time series, under minimal structural and statistical assumptions.
First, I give a mathematical construction for optimal predictors of
such processes, in the form of hidden Markov models; this formalism is
closely related to the "predictive state representation" of Littman,
Sutton and Singh. This leads to an algorithm, CSSR (Causal-State
Splitting Reconstruction), which approximates the ideal predictor from
data. I will discuss the reliability of CSSR, its data requirements,
its performance in simulations, and its strengths and weaknesses in
comparison to variable-length Markov models and cross-validated hidden
Markov models. Finally, I describe two in-progress applications
to neuronal systems: measuring synchrony across neurons, and decoding
populations of spike trains.
Meeting October 26 2004
1) Discussion of Brian's further thoughts on solving the 2-state
switch/stay problem. May be possible to conjunct on last action
and handle everything. Makes sense; does it generalize to options?
Just to elaborate; the problem that I've been encountering has to do
with freeing ourselves of history and making predictions strictly based
on previous predictions and current action/observation. Even in
the simplest of systems, this hasn't been happening. It appears
that the cause is the lack of conjunction between actions and
predictions. Basically - we have a set of predictions that are
action conditioned, and we want now to calculate this set of
predictions for the next time step. Well, without knowing which
action was actually taken, it seems that this may not be
possible. It seems intuitive that we perhaps should provide a
conjunction of predictions with the observations/actions that they are
2) Discussion of standardizing on a scalable, flexible gridworld.
Perhaps several sensory systems (Markov, bit2bit, quad walls, colors,
3) Perhaps provide a perfect, cheating TD
net, producing a predictive representation. See if you can then
solve the problem, maxing reward, planning, with that TD state representation.
It seems that if this were possible, if we could act 'optimally' with a
sensor-deprived agent in a complex environment using TD Networks as a
state representation, then we could make a strong case that TD Networks
are a powerful idea that deserves more attention. (Or at least
attention from people other than us) :)
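For item 1 above, one way to picture a conjunction between the last action and the predictions is to give each action its own copy of the prediction vector and fill in only the copy for the action actually taken. The sketch below is an illustration under that assumption; the function name conjunct_with_action and the block layout are hypothetical and are not taken from Brian's implementation.

```python
def conjunct_with_action(predictions, action, n_actions):
    """Build a feature vector conjoining predictions with the action taken.

    The result has one block of len(predictions) entries per action.
    Only the block for `action` holds the predictions; the rest stay 0.
    """
    if not 0 <= action < n_actions:
        raise ValueError("action must be in range(n_actions)")
    width = len(predictions)
    features = [0.0] * (width * n_actions)
    start = action * width
    features[start:start + width] = predictions
    return features
```

With two actions (say switch and stay), the same predictions land in different positions depending on which action was taken, so a linear update can learn separate weights for each case.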
For next week:
- Brian will try his idea and report
- We will get a dense gridworld going with things moving on the
screen. Cosmin and Eddie are going to get together on this.
- Brian: Should we consider a different name than TD Networks? We have
discussed in the past
that the idea could be implemented with Monte Carlo, and that
eventually some combination of MC and TD will be best. Should we
change the name now, before it's too late?
Eddie: In the RL Kids meeting,
we brought up the issue of trying TD Nets in non-gridworld tasks.
I think that it is important to look at this so that we don't end up
with algorithms that are only relevant to gridworlds. Brian
brought up the idea of trying TD Nets for Rock-Paper-Scissors.
This could be an interesting problem as this is an example that goes
beyond the current two action/two observation implementation of TD
Nets. If we treat the target of the TD Net as win or lose, this
should be an interesting example for TD Nets with more than two
actions. In addition to this, if we take the problem of
predicting the next observation (rock, paper or scissors), we are, in
effect, creating an opponent modeler.
There are several issues that were brought up regarding TD Nets:
1) How to handle more than two actions
2) How to handle more than two observations
3) How to handle non-symmetric question networks/how to best choose a
network for a given problem.
(I know that I am forgetting something here, so if anyone can amend
this list to what we had sketched on the board, please correct this)
An issue that was brought up that we need to consider is that, right
now, TD Nets are being used as binary classifiers. It seems that
we need to decide how TD Nets should work when we are dealing with a
multi-class problem. The example that was brought up was if we
needed to predict colours (perhaps red, blue or green). We need
to decide how the network would handle this. For each colour,
currently, we are making the prediction of whether any individual
colour is expected, but what happens if we have a high activation for
both red and blue? These predictions are supposed to be mutually
exclusive so what happens if we think both are likely to occur?
Since our predictions are not constrained to sum up to one, we can find
ourselves in the situation where the TD Net is predicting that two
colours may be active while in the problem, they should be mutually
exclusive.
I can't resist giving my opinion on this: I believe that
discretizing the prediction space can help solve this multi-class
problem. If it is known that a node may have to take on multiple
values, I think that it would be appropriate to tile the prediction
space. If each different 'intermediate prediction' could be
represented by a different bit in the input vector, I believe that we
would gain a significant amount of representational power.
To give an example, if we had a node that, optimally, would take on
values of 0, 1/3, 2/3 or 1, depending on what 'state' the agent was in,
by tiling the prediction space, we would be able to assign separate
weights to each of these different predictions. As it is, if we
have one weight for the node that could settle on these four different
values, I don't believe that it would be possible for the weight to
converge on a proper value.
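A minimal sketch of what tiling the prediction space could look like, assuming equal-width tiles over [0, 1] and one input bit per tile. The function name tile_prediction and the choice of equal-width tiles are assumptions for illustration only.

```python
def tile_prediction(value, n_tiles):
    """One-hot encode a prediction in [0, 1] into n_tiles equal-width tiles."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("prediction must lie in [0, 1]")
    # A value of exactly 1.0 is placed in the last tile.
    index = min(int(value * n_tiles), n_tiles - 1)
    bits = [0] * n_tiles
    bits[index] = 1
    return bits
```

With four tiles, the values 0, 1/3, 2/3 and 1 from the example each switch on a different bit, so each can be given its own weight.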
Perhaps I am biased toward representing everything as a unit
vector, but in my opinion, this seems to be a simple way to make
learning easier and make the task of proving convergence simpler as