Reinforcement Learning and Artificial Intelligence (RLAI)

The value-function hypothesis

The ambition of this web page is to state, refine, clarify, and, most of all, promote discussion of the following scientific hypothesis:

All efficient methods for solving sequential decision problems determine (learn or compute) value functions as an intermediate step.

Is this true? False? A definition? Unfalsifiable? You are encouraged to comment on the hypothesis, even in minimal ways. For example, you might submit an extension "Yes" to indicate that you believe the hypothesis, or similarly "No" or "Not sure".  These minimal responses will be collected and tallied at some point, and you may want to change yours later, so please include your name in some way.

Definitions: A value function is an estimate of expected cumulative future reward, usually as a function of state or state-action pair.  The reward may be discounted, with lesser weight being given to delayed reward, or it may be cumulative only within individual episodes of interaction with the environment.  Finally, in the average-reward case, the values are all relative to the mean reward received when following the current policy.
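The three cases above can be written out as a sketch in standard notation (here \(\pi\) is the policy being followed, \(\gamma\) the discount rate, \(T\) the terminal time of an episode, and \(\bar{r}^\pi\) the average reward under \(\pi\)):

```latex
% Discounted: lesser weight given to delayed reward, 0 <= gamma < 1
V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s\right]

% Episodic: reward cumulative only within an individual episode ending at time T
V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_{t+1} \,\middle|\, s_0 = s\right]

% Average-reward: values relative to the mean reward under the current policy
V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \left(r_{t+1} - \bar{r}^\pi\right) \,\middle|\, s_0 = s\right]
```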

I think the necessity of value functions is something we have learned through long experience over the last 20 years.  Dynamic programming computes value functions.  All the most effective reinforcement learning methods estimate value functions.  We can't prove directly that value functions are necessary; at this point it is just a matter of experience.
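To make "dynamic programming computes value functions" concrete, here is a minimal sketch of value iteration on a toy three-state MDP (the states, transitions, and rewards are made up for illustration; nothing here is from the page itself):

```python
# Value iteration on a toy 3-state MDP (all numbers are illustrative).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing goal state
}
gamma = 0.9  # discount rate

V = {s: 0.0 for s in P}  # the value function: the intermediate step
for _ in range(100):  # sweep until (approximately) converged
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }

# The greedy policy is then simply read off the value function.
policy = {
    s: max(
        P[s],
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]),
    )
    for s in P
}
print(V, policy)  # V[1] -> 1.0, V[0] -> 0.9; policy heads toward state 2
```

The point of the sketch is the order of operations: the value function is computed first, and the policy falls out of it afterward.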

People are eternally proposing that value functions aren't necessary, that policies can be found directly, as in "policy search" methods (don't ask me what this means), but in the end the systems that perform the best always use values.  And not just relative values (of actions from each state, which are essentially a policy), but absolute values giving a genuine absolute estimate of the expected cumulative future reward.

To prove the value-function hypothesis in any general sense would be big news.  In my opinion this is one of the most important open problems in artificial intelligence, and one that could have an essentially analytic (mathematical) solution.
-Rich

If one accepts the Reward hypothesis (as I do), then the value-function hypothesis immediately follows. If we accept that all our goals can be achieved by maximizing a scalar reward, then the value function is essential. Our goal is to maximize future reward, and we can't achieve this without knowing or estimating our expected future reward at each state. Otherwise our learning would be undirected and meaningless. How could we achieve our goal of winning if, when we lose, we have no sense of which choices, which state-action pairs, led to our failure? Value functions are essential to assigning credit to good and poor decisions.
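The credit-assignment point can be sketched with a one-line temporal-difference update. Below is a TD(0) sketch on a made-up five-state random walk (the chain, step size, and episode count are illustrative assumptions, not from the page): the value function is what carries credit for the final outcome back to earlier states.

```python
import random

random.seed(0)

# A 5-state random walk: start in the middle, step left or right at random;
# reaching the right end yields reward 1, the left end reward 0.
N = 5
alpha, episodes = 0.1, 2000
V = [0.0] * (N + 2)  # V[0] and V[N + 1] are terminal, value 0

for _ in range(episodes):
    s = (N + 1) // 2
    while 0 < s < N + 1:
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == N + 1 else 0.0
        # TD(0): shift V[s] toward the one-step estimate r + V[s2].
        # This is how credit for the eventual win or loss propagates back.
        V[s] += alpha * (r + V[s2] - V[s])
        s = s2

print([round(v, 2) for v in V[1:N + 1]])  # approaches 1/6, 2/6, ..., 5/6
```

Without the value estimates V[s], a losing episode would leave no trace of which earlier choices were to blame; with them, every state along the way is adjusted toward the outcome that followed it.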

Is there a clear distinction between sequential decision problems and search in a problem space? If not, could sequential decision problems be reformulated as optimisation problems? If so, computing value functions as an intermediate step is equivalent to partially solving an optimisation problem. This works fine for problems with an optimal substructure, but it is difficult to prove for problems that lack one. Moreover, in dynamic, distributed environments it is impractical for a single agent to generate an optimal solution to a sequential decision problem, as it lacks global knowledge. Hence, agents must construct partial solutions as an intermediate step to constructing a global solution.

Doesn't this hypothesis depend on the specific representation used to encode the value function?  If it is a multilayer connectionist approximation, then the value-function hypothesis is probably not generally true, because constant updates can lead the overall approximation to local optima.  Of course, in some problems that will not happen, but in others it will, because a modification of weights to accommodate a particular state-action value will often deleteriously alter a different, unrelated state-action value, since the connectionist representation of the solution is distributed.  For this reason, it will sometimes be better to search directly through the space of such structures (i.e., policy search), since such a search is less susceptible to such oscillations.  In fact, it is likely that human behavior is a combination of local value-function-like updates and more holistic policy perturbations, which are a more creative form of exploration.  Both likely have a role to play.  So the word "All" could be changed to "Some" to keep the hypothesis mostly intact.

What exactly do you mean by "determine (learn or compute) value functions"?

I know it seems obvious, but I consider that it is not.

Can you distinguish a "learned or computed" value from a guess, or from a completely randomly generated value?

So are you referring exclusively to exact methods that will "solve" the problem correctly every time, i.e., with guaranteed convergence? What does "solving" mean?

So I have to conclude that this is "enclosed or implicit" in the "efficient" condition that you put at the beginning of the sentence, or in the word "solving".