Date: Mon, 1 Mar 1999 12:45:33 +0100 (MEZ)
From: Daniel Polani <polani@Informatik.Uni-Mainz.DE>
To: rich@cs.umass.edu
Subject: Question about Gambler's Problem in your RL Book
Dear Prof. Sutton,
I've been giving a course on Reinforcement Learning this winter, and an
interesting question arose in connection with the gambler's problem in
your RL book.
There was some discussion in my course about why the optimal policy
shown in Fig. 4.6 should look the way it does. So I decided to code the
problem, and noticed that the results are not always stable (you
indicate that in Exercise 4.8). In fact, the optimization is quite
stable when the problem size is odd, but it isn't for even problem
sizes.
The optimal policies in the even case usually exhibit a kind of fractal
structure, which is very volatile during iteration, while there is some
fixed underlying structure of filled triangles (of which your figure
shows four). I noticed that the number of filled triangles is the same
as the power of 2 in the problem size. In other words, if the problem
size is 2^n * k, where k is odd, then the solutions close to the optimal
policy have 2^n filled triangles in them, plus some volatile (unstable)
fractal structure. In particular, if the problem size is odd, then
n = 0, 2^n = 1, and you (stably) find a single filled triangle in it.
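Polani's tied optimal stakes can be probed with a few lines of value
iteration. The sketch below is ours, not from the correspondence; it is
in Python rather than the thread's Matlab or Lisp, and assumes the
book's p = 0.4 and a goal of 100. greedy_actions returns every stake
whose one-step backup ties the maximum within a tolerance, which is
where the volatile "fractal" ties show up:

```python
GOAL = 100     # the gambler wins on reaching this capital
P_HEADS = 0.4  # coin-flip win probability, as in Fig. 4.6

def value_iteration(goal=GOAL, p=P_HEADS, theta=1e-12):
    """Sweep until the largest per-state change falls below theta."""
    V = [0.0] * (goal + 1)
    V[goal] = 1.0                        # reward 1 for reaching the goal
    while True:
        delta = 0.0
        for s in range(1, goal):         # in-place (Gauss-Seidel) sweep
            old = V[s]
            V[s] = max(p * V[s + a] + (1 - p) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(V[s] - old))
        if delta < theta:
            return V

def greedy_actions(V, s, p=P_HEADS, tol=1e-9):
    """All stakes at capital s whose backup ties the best within tol."""
    goal = len(V) - 1
    backups = {a: p * V[s + a] + (1 - p) * V[s - a]
               for a in range(1, min(s, goal - s) + 1)}
    best = max(backups.values())
    return sorted(a for a, v in backups.items() if v >= best - tol)
```

At capital 50, for instance, staking everything is among the greedy
stakes, and V(50) comes out as p; at capital 70 several distinct stakes
tie, which is exactly the ambiguity discussed in these emails.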
Now to my questions:
1. Can you confirm my above observations?
2. Do you know of any theoretical treatment of this question?
3. Although my observations seem generally consistent with your result
in Fig. 4.6, I could not reproduce your precise result for the optimal
policy, probably due to precision loss in the calculation. Did you have
similar problems calculating the solution, or is your result derived
from theoretical considerations? If not, how did you avoid precision
problems?
Thank you very much for your time answering these questions,

Daniel Polani
Institut fuer Informatik
Fachbereich Mathematik
Johannes Gutenberg-Universitaet
Staudingerweg 9
55099 Mainz
Germany
polani@informatik.uni-mainz.de, Tel. +6131/393608
http://www.informatik.uni-mainz.de/PERSONEN/Polani.html

P.S.: Thank you very much for your smooth, consistent, and unifying
introductory book on RL; it was very useful in preparing my RL course.
To: polani@informatik.mathematik.uni-mainz.de
From: Rich Sutton <rssutton@att.com>
Subject: Re: Question about Gambler's Problem in your RL Book
Dear Daniel Polani,
Thanks for your kind words and descriptions of your experiences teaching reinforcement learning and using our book. I am afraid I can't give you any definitive answer on the gambler's problem. I remember one student in a class I taught made a careful study of it, mainly computationally, but I don't recall his exact conclusions. What I do remember, I think, is that the solution shown in the book is just one of many optimal policies for the task. This is perhaps one of the main points of the exercise: to drive home the fact that there are potentially many optimal policies, whereas the optimal values are unique. The policy one finds often has to do with trivial things like numerical resolution or the order in which one's program breaks ties for the maximal action value. This is probably why you did not necessarily get exactly the result in Figure 4.6. I believe yours is still an optimal policy, however. You might check this using the optimal value function you found: see if my policy locally maximizes your value function to within the numerical resolution of your machine. The code I used is available from the web page for the book if you want to check the order I used to break ties (or to look for a bug in my code!).
I don't know of any theoretical analysis of this problem. We took the problem from a standard text, probably a DP text, but I couldn't say which now and I am disappointed to see that we didn't note which in our book. You might check the exercises in some of the older Bertsekas DP texts. But in any event, we slightly changed the problem so any prior analysis might not apply. You probably now understand the problem as well as anybody does.
Best regards,
Rich Sutton
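Sutton's suggestion above, checking whether a policy "locally maximizes"
a given value function, can be mechanized in a few lines. A minimal
sketch in Python; the helper name and tolerance are our own choices, not
from the correspondence:

```python
def is_greedy(policy, V, p, tol=1e-9):
    """True if every stake in policy backs up, under V, to within tol of
    the best one-step backup. policy maps capital -> stake; V[goal] = 1
    is the terminal win value and V[0] = 0 the terminal loss value."""
    goal = len(V) - 1
    for s, a in policy.items():
        backups = [p * V[s + b] + (1 - p) * V[s - b]
                   for b in range(1, min(s, goal - s) + 1)]
        chosen = p * V[s + a] + (1 - p) * V[s - a]
        if chosen < max(backups) - tol:
            return False
    return True
```

For a toy goal of 4 and p = 0.4 the optimal values can be worked out by
hand as V = [0, 0.16, 0.4, 0.64, 1.0]; staking everything at capital 2
passes the check, while the timid stake of 1 there fails.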
From: Daniel Loeb
To: Rich Sutton
I have a question about "The Gambler's Problem",
which appears as Example 4.3 in your book
"Reinforcement Learning: An Introduction".
Your solution implies there is only one optimal solution
to the gambler's problem.
For example, starting with $70 it is only optimal to bet $5.
I believe that it is optimal to bet either $5 or $20 or $30.
Am I missing something?
Here is my complete solution.
o = your solution
* = other solutions
[ASCII plot (alignment lost): optimal stakes versus capital from $0 to
$100; 'o' marks the book's bets, '*' the additional optimal bets.]
In particular, the simple policy of always making the
largest allowed bet is optimal.
This seems to be the complete set of optimal policies for any
p under 50%.
The number of levels of subdivisions seems to depend on
the number of powers of 2 which go into the goal ($100).
I wonder if a complete solution is known for infinitely
divisible money. You can get a good idea what that might look
like by choosing a goal which is a large power of two.
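Loeb's bold-play claim is easy to spot-check by evaluating just that one
policy. A Python sketch of our own (not Loeb's Matlab, which follows),
assuming p = 0.4 as in the book's figure:

```python
def bold_play_values(goal=100, p=0.4, theta=1e-12):
    """Iterative policy evaluation of 'always stake min(s, goal - s)'."""
    V = [0.0] * (goal + 1)
    V[goal] = 1.0                        # reward 1 for reaching the goal
    while True:
        delta = 0.0
        for s in range(1, goal):
            a = min(s, goal - s)         # the bold (maximal) stake
            new = p * V[s + a] + (1 - p) * V[s - a]
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < theta:
            return V
```

At the dyadic capitals the values come out in closed form, e.g.
V(50) = p, V(25) = p^2, and V(75) = p + p(1 - p), which one can compare
against the values found by value iteration, consistent with Loeb's
observation that bold play is optimal for p under 50%.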
Here is the code I used (in Matlab) to find my solutions:
function [b, v] = optcasino(n, prob)
figure;
title(prob);
hold on;
v = zeros(1, n-1);
delta = 1;
iter = 0;
while ( delta > 1e-50 )
  iter = iter + 1
  [b, v, delta] = casino(v, prob);
  plot(1:(n-1), v);
end
hold on;
% bar(b);
plot(b(:,1), b(:,2), 'm*');
=======
function [bnew, vnew, delta] = casino(vold, prob)
n = 1 + length(vold);
vold = [ 0, vold, n ];
bnew = [];
for i = 1:(n-1)
  bets = [1:min(i, n-i)];
  vs = prob*vold(1+i+bets) + (1-prob)*vold(1+i-bets);
  if ( i==15 )
    vs - max(vs);   % leftover debug statement
  end
  vnew(i) = max(vs);
  goodbets = find( vs >= vnew(i)*(1-1e-9) );
  gbets = ones(length(goodbets), 2);
  gbets(:,1) = i;
  gbets(:,2) = goodbets';
  bnew = [ bnew; gbets ];
end
delta = max( abs( [0,vnew,n] - vold ) )

Yours, Daniel Loeb, Susquehanna International Group, LLP
http://dept-info.labri.fr/~loeb
Work: loeb@sig.com, 401 City Line, Bala Cynwyd, PA 19004,
610/617-2671
Home loeb@delanet.com
Date: Fri, 30 Nov 2001 09:00:47 -0500
To: Dan Loeb <loeb@sig.com>
From: Andy Barto <barto@cs.umass.edu>
Subject: Re: Reinforcement Learning: Example 4.3
Cc: Rich Sutton <sutton@research.att.com>
Dan,
Many thanks for your comment about Example 4.3. Indeed, we write as if the
solution of Fig. 4.6 is the only one, when there is a whole class of
them. You've done a nice job of characterizing the entire class. We are
aware of this, and students in our course have similarly described all
the solutions. But thanks very much for your message. A future edition
of the book will do this better.
Sincerely,
A. Barto
PS: I haven't thought about the case of infinitely divisible money.
From: Francesco.Decomite@lifl.fr
Subject: Gambler's problem
Date: May 26, 2004 6:46:57 AM MDT
To: rich@richsutton.com
Dear Sir,
While studying your book, I have been trying to rewrite and test the
algorithms.
I have been stuck on one problem for about a week now, working on the
gambler's problem (Example 4.3, page 101):
no matter how hard I try, I never obtain the well-shaped lower graph of
Figure 4.6 (page 103), but only approximations.
Even when using your code, the shape is approximate. Another point is
that, as long as p < 0.5, all the graphs look similar.
I attach a graph obtained from your code with two values of p (I just
added a function, lp, to output the final policy, and divided epsilon
by 100; I also attach the modified code).
My question then is: do you obtain the graph by analytic methods, or by
some other reasoning? If so, can you give me some hints?
Best regards
F. de Comite

Francesco De Comite
tel (33) 03 28 77 85 72
fax (33) 03 28 77 85 37
www.lifl.fr/~decomite
;;; Gambler's problem. The gambler has a stake s between 0 and 100. At each
;;; play he wagers an integer <= s. He wins that much with prob p, else he
;;; loses that much. If he builds his stake to 100 he wins (thus he never
;;; wagers more than (- 100 s)); if his stake falls to 0 he loses.
;;; Thus, the stake s is the state, and the actions are the size of the bet.
;;; Here we implement value iteration.
(defvar V (make-array 101 :initial-element 0))
(setf (aref V 100) 1)
(defvar p .12)

(defun backup-action (s a)
  (+ (* p (aref V (+ s a)))
     (* (- 1 p) (aref V (- s a)))))

(defun vi (&optional (epsilon .0000000001))
  "Value iteration to the criterion epsilon"
  (loop collect (loop for i from 1 to 99 collect (list i (aref V i)))
        while (< epsilon
                 (loop for s from 1 below 100
                       for old-V = (aref V s)
                       do (setf (aref V s)
                                (loop for a from 1 upto (min s (- 100 s))
                                      maximize (backup-action s a)))
                       maximize (abs (- old-V (aref V s)))))))

(defun policy (s &optional (epsilon .0000000001))
  (loop with best-value = -1
        with best-action
        for a from 1 upto (min s (- 100 s))
        for this-value = (backup-action s a)
        do (when (> this-value (+ best-value epsilon))
             (setq best-value this-value)
             (setq best-action a))
        finally (return best-action)))

(defun lp ()
  (loop for i from 1 to 99 collect (list i (policy i))))