Reinforcement Learning and Computer Go (RLGO): The Question Hypothesis

The aim of this web page is to propose and discuss the following hypothesis:

All useful features of a state in Go can be interpreted as answers to questions about the current and future value of observations.

Motivation

If the question hypothesis is correct, then we only need one algorithm to construct all of our features: an algorithm to answer a question. If the question hypothesis is incorrect, we will need to use a variety of different algorithms to construct our features.

The question hypothesis can be considered a specialised version of the Empirical Knowledge Hypothesis.

Assume that we make some observations Z as the game proceeds. Observations can be made about both the state (for example an eye) and state transitions (for example connecting or capturing a group). Any concrete fact can be used, as long as it can be deterministically computed from the state of the board.

The current value of an observation may be a useful feature. But more generally, we may be interested in asking questions about an observation at some future time (or times). These correspond naturally to predictive features that we may wish to use.

Definition

A question q is defined as:

q(z, c, πb, πw, w)

where

z is an observation about a state or state transition
c is the player to play first
πb is the policy followed by black
πw is the policy followed by white
w(k) is some non-negative weighting function defined over each time-step from now (k=0) to the future (up to k=∞)
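The question tuple above can be sketched as a plain data structure. This is only an illustrative encoding, not part of the page's formalism; the type names and field names are my own.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Illustrative types: an observation is any deterministic function of the
# board state (or of a state transition), and a policy maps a state to a move.
Observation = Callable[[Any], float]
Policy = Callable[[Any], Any]

@dataclass
class Question:
    z: Observation              # observation about a state or state transition
    c: str                      # the player to play first
    pi_b: Policy                # policy followed by black
    pi_w: Policy                # policy followed by white
    w: Callable[[int], float]   # non-negative weight over time-steps k >= 0
```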

We define the outcome ω of an observation z to be the weighted average value of z from now (time t) until the end of the game.

ωt(z, w) = [ ∑(k=0 to ∞) w(k)·zt+k ] / [ ∑(k=0 to ∞) w(k) ]
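As a concrete sketch, the outcome can be computed from a finite trace of observation values, since games terminate and the infinite sums are effectively finite. The function and variable names here are my own:

```python
def outcome(z_values, w):
    """Weighted average of the observation from time t to the end of the game.

    z_values: the sequence z_t, z_{t+1}, ..., z_T observed along one game.
    w: a non-negative weighting function over offsets k = 0, 1, 2, ...
    """
    num = sum(w(k) * z for k, z in enumerate(z_values))
    den = sum(w(k) for k in range(len(z_values)))
    return num / den

# With all weights equal to one, the outcome is the plain average:
outcome([0.0, 0.0, 1.0, 1.0], lambda k: 1.0)  # -> 0.5
```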

The answer to a question q at time t is defined as the expected value of the outcome for observation z where c is to play first, black follows policy πb and white follows policy πw.

at(q) = E(πb, πw, c) { ωt(z, w) }
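One natural way to estimate this expectation is by Monte-Carlo rollouts: average the outcome over simulated games in which both players follow the question's policies. The sketch below assumes a `simulate` helper and a question object with the fields named in the definition; neither is defined on this page.

```python
def estimate_answer(question, state, simulate, n_rollouts=100):
    """Monte-Carlo estimate of the answer a_t(q).

    'simulate' is an assumed helper, not defined on this page: it plays one
    game from 'state' with question.c to move first, black following
    question.pi_b and white following question.pi_w, and returns the
    sequence of observation values z_t, z_{t+1}, ..., z_T along that game.
    """
    def outcome(z_values, w):
        # weighted average of z from now to the end of the game
        num = sum(w(k) * z for k, z in enumerate(z_values))
        den = sum(w(k) for k in range(len(z_values)))
        return num / den

    total = sum(outcome(simulate(state, question), question.w)
                for _ in range(n_rollouts))
    return total / n_rollouts
```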

Example questions

Each of the following features can be expressed as a question of the above form, by making appropriate choices for z, c, πb, πw, and w.
• Does my group have an eye already?
• Can I make an eye if that is my only goal?
• Will I make an eye at x1 and an eye at x2 if I try to capture the opponent at x3?
• Will this point become territory for black at the end of the game?
• Is my group in the corner likely to come under attack soon?
• Can this stone be captured next move if I do nothing?
• Can I capture an opponent group without my own group getting captured?
• Can I capture an opponent group whilst responding to any of the moves that I believe threaten to capture my own group?
It is worth noting that the complexity of a question is often determined by the complexity of the policies, rather than by the complexity of the observation. Note that the last question builds the dependency on the opponent's believed threats into the policy itself.
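To illustrate that point, here is a toy encoding of "Can this stone be captured next move if I do nothing?". The complexity sits in the policies, not the observation: my policy always passes, while the opponent's tries to capture. All names below are stand-ins for real Go machinery, not anything defined on this page.

```python
def stone_present(state):
    return 1.0 if state["stone_alive"] else 0.0   # observation z (toy)

def pass_policy(state):
    return "pass"                                  # I do nothing

def capture_policy(state):
    return "best_capturing_move"                   # opponent tries to capture

question = {
    "z": stone_present,
    "c": "white",                            # the opponent plays first
    "pi_b": pass_policy,                     # black (me) passes
    "pi_w": capture_policy,                  # white tries to capture
    "w": lambda k: 1.0 if k == 1 else 0.0,   # look exactly one move ahead
}
```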

How does segmentation fit into this framework? For example, how do we ask the question "what is the group of stones connected to x?"

One answer to this is to use implicit segmentation. The group of stones connected to x is described implicitly by a scope operator representing the required correlation to x. When we wish to ask a question about the group of stones connected to x, we use a policy that restricts moves to the group around x.

Note that a correlation feature can be constructed by answering the question "will these two intersections be part of the same region at the end of the game?". As usual, we can specify different policies for the correlation (e.g. whether we are trying to maximise correlation or whether we have some other aim).
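A toy sketch of that correlation question: the observation checks whether two intersections lie in the same region, and the weighting pushes the question toward the end of the game. The region representation and the fixed horizon H are assumptions of this sketch, since the game length is not known in advance.

```python
# Toy stand-in: 'state' maps each intersection to a region identifier,
# in place of a real connected-region computation on the board.
def same_region(state, x1, x2):
    return 1.0 if state[x1] == state[x2] else 0.0   # observation z

# Weight only late time-steps to approximate "at the end of the game";
# the horizon H is an assumed bound on the game length.
H = 100
def end_weight(k):
    return 1.0 if k >= H else 0.0
```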

You say that this is a specialized version of the Empirical Knowledge Hypothesis. It might be even more related to the Predictive Representations Hypothesis. The web page for that has not been written yet, but it is essentially that predictive features--the answers to questions about the future--may be particularly good when it comes to generalization between states.

But I would not say all useful features have to be answers to predictive questions. That seems too strong and too limiting. An extreme case is just the locations of the stones. These features are the basis for all the others, but are not themselves predictive, just descriptive. I can also imagine other useful features that are non-predictive functions of them. The total number of stones on the board, for example.

Current observations (such as the location of stones, or the number of stones on the board) can also be described in the question framework, simply by setting all weights to zero except the current time-step. I guess the misleading part is that 'the present' is included in 'the future' as a special case. I have updated the wording of the question hypothesis to make this clearer.
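That special case is easy to see concretely: with all weight on k = 0, the outcome formula collapses to the current observation value, while uniform weights average to the end of the game. The observation trace below is made up for illustration.

```python
def present_only(k):
    return 1.0 if k == 0 else 0.0   # all weight on the current time-step

def uniform(k):
    return 1.0                       # equal weight on every time-step

def outcome(z_values, w):
    num = sum(w(k) * z for k, z in enumerate(z_values))
    den = sum(w(k) for k in range(len(z_values)))
    return num / den

z = [0.0, 1.0, 1.0, 1.0]        # hypothetical observation values z_t, ...
outcome(z, present_only)         # -> 0.0, the current value only
outcome(z, uniform)              # -> 0.75, the average to the end of the game
```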

I think we need better definitions of what constitutes a 'useful feature' and what should be accepted as 'questions and answers'.

First of all, what constitutes a useful feature?
Is a feature useful if it helps to improve our play? Or: is a feature useful if it can be used in a cheap calculation (linear space and time in the number of features?) to compute an optimal (or just 'good enough'??) move?

If I understand correctly, the essential components of a question are:
- a binary-valued(?) function defined over a state/state-transition,
- some policies assumed for the players,
- and some non-negative weights (why should the weights be normalized??)

I think it is critical to say something about what policies we are allowed to choose. If there is no restriction on this set, then the `question hypothesis' can be shown to hold trivially:

Simply choose the optimal policies for both players and ask the question:
- the binary-valued function takes the value 1 iff player #1 wins
- the policies are the optimal policies for both players
- the weights are all one (we could use discounting, but why?)

Another question changes the binary-valued function to one that takes the value 1 iff player #2 wins.

The same goes for a draw.

Now, the answers to these questions just give us everything we need to know about the positions in order to play optimally.
These are useful features!

Hence, in order to make the `question hypothesis' more interesting, we should maybe restrict what policies we are allowed to choose in the questions.

Still, in this case, if we did choose such a policy set, then I have the feeling that the `question hypothesis' will be hard to make non-trivial (it either fails too easily, or can be shown to hold too easily, because we have so much flexibility).