RLAI Reinforcement Learning and Computer Go (RLGO)
Value function

The purpose of this web page is to propose and discuss the win/lose value function hypothesis:

Any strong Computer Go program must compute a win/lose value function as an intermediate step, one that corresponds directly to the expected result at the end of the game (win, lose or draw).



Definitions:

A win/lose value function is the expected reward at the end of the game, where the rewards are defined to be: win = +1, draw = 0, lose = -1. All non-terminal states have zero reward.

A score value function is the expected reward at the end of the game, where the reward is defined to be the score differential (positive for winning scores and negative for losing scores). Both value functions are written out formally after these definitions.

A heuristic function is not defined in terms of expected reward; it attempts to judge each move or position on some unspecified scale.
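
In standard reinforcement-learning notation, the two value functions defined above can be written as follows (a sketch; the symbols z, for the terminal win/lose reward, and d, for the final score differential, are introduced here for convenience and are not part of the original definitions):

```latex
% Win/lose value function: expected terminal reward, where
% z = +1 for a win, 0 for a draw, -1 for a loss, and all
% non-terminal states give zero reward.
V_{\text{win/lose}}(s) = \mathbb{E}\left[\, z \mid s_t = s \,\right],
\qquad z \in \{+1, 0, -1\}

% Score value function: expected final score differential,
% positive for winning scores, negative for losing scores.
V_{\text{score}}(s) = \mathbb{E}\left[\, d \mid s_t = s \,\right]
```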



Most existing Computer Go programs attempt to estimate a score value function (if they estimate a value function at all). According to the win/lose value function hypothesis, this cannot lead to a strong Computer Go program: a program that chooses its moves according to expected score has no concept of the risk associated with a move or position, and may prefer a large but uncertain winning margin over a small but near-certain win. The toy example below makes this concrete.
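
To make the risk argument concrete, here is a toy Python calculation (my own illustration; the moves and probabilities are invented for the example). Each hypothetical candidate move is a distribution over final score differentials, and the two value functions rank the moves in opposite order.

```python
# Toy illustration of expected score vs. win/lose value.
# Each candidate move is a list of (probability, final score
# differential) pairs; the numbers are invented for the example.

candidate_moves = {
    "safe":  [(0.9, +1.0), (0.1, -10.0)],   # usually wins, by a little
    "risky": [(0.5, +20.0), (0.5, -1.0)],   # a coin flip, but wins big
}

def expected_score(outcomes):
    """Score value: expected final score differential."""
    return sum(p * d for p, d in outcomes)

def win_lose_value(outcomes):
    """Win/lose value: expected reward with win = +1, draw = 0, lose = -1."""
    def reward(d):
        return 1.0 if d > 0 else (-1.0 if d < 0 else 0.0)
    return sum(p * reward(d) for p, d in outcomes)

for name, outcomes in candidate_moves.items():
    print(f"{name}: expected score = {expected_score(outcomes):+.1f}, "
          f"win/lose value = {win_lose_value(outcomes):+.1f}")

# Output:
#   safe: expected score = -0.1, win/lose value = +0.8
#   risky: expected score = +9.5, win/lose value = +0.0
```

A score-maximising player picks "risky" and wins only half its games; a win/lose-maximising player picks "safe" and wins 90% of them.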

Some Computer Go programs use an unspecified heuristic function to rank and select moves. According to the hypothesis, a program that chooses its moves according to a heuristic function can only play strong Go if the heuristic approximates a value function (in particular, the win/lose value function). This is because an unspecified heuristic function makes no claim about the future value of a position, so its judgements can never be verified later in the game, leaving no signal from which to learn and improve the heuristic.

Note that the win/lose value function hypothesis is a specialised form of the value function hypothesis.


Is planning part of the value function? Or do we plan over a value function? 

It does seem that estimating a win/lose value function is a good objective, slightly but perhaps significantly better than estimating score values, though of course score values might be a good input feature for the win/lose value function.
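
One hedged sketch of that last idea (an assumption of mine, not something the page specifies): a predicted score differential can be squashed through tanh so that large margins saturate toward +1/-1, putting the estimate on the same scale as the win/lose value.

```python
import math

def win_lose_from_score(predicted_score, beta=0.5):
    """Map a score-value estimate to a rough win/lose-value estimate.

    beta controls how quickly confidence saturates with margin;
    the value 0.5 is arbitrary and would need tuning in practice.
    """
    return math.tanh(beta * predicted_score)

print(win_lose_from_score(+0.5))   # small lead  -> mildly positive
print(win_lose_from_score(+20.0))  # large lead  -> close to +1
```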
