[math-fun] Lee Sedol wins game 4, humanity may not yet be complete toast (but almost)
That was interesting, and it revealed how the Monte Carlo approach to go is flawed. LS had to be brilliant and play some "improbable" moves the machine may not have appreciated until too late. He did. As a result, LS got a won position by a small but clear (to strong players) margin -- I don't know exactly, but let's say 2-5 points -- but to do so LS used up all of his 2-hour time budget plus two of his three overtime periods. Meanwhile AG had only used up 1 hour. From then on, the rules required LS to make every move within 1 minute.

So at that point, if LS had been up against a top human (TH), TH would have tried to complicate and pose LS one or two difficult problems, which hopefully he'd be unable to solve in 1 minute. Hence LS would either lose on time or risk giving up his advantage via a mistake. There was still enough scope in the position for creating that kind of complexity. Also, TH could have used his spare hour to think a lot of it out ahead of time, then sprung it on LS suddenly, playing his prepared moves in 1 second each, so that LS could not think on TH's time when it got difficult.

But no. AG instead played a few totally stupid and pointless time-wasting moves (even I could see that), as well as a lot of trivially-easy-to-answer forcing moves. That took the heat off and gave LS an easy ride to the win. Also, AG's time-usage algorithm is very uniform. Completely uniform would be to always use exactly 1 minute per move (say), and while AG is not completely uniform, it is fairly close to it, much more so than a human. A human will play an easy move in 2 seconds; AG will grind on for 30.

So what is the matter with Monte Carlo go? AG works as follows. It looks ahead and then evaluates a leaf position P. The eval(P) is a linear combination of (i) a neural network applied to P, estimating win probability, and (ii) the result of playing out N "random games" from P to the end (or what it thinks is the end, anyhow) and seeing what percentage of them it wins. The random games use random moves drawn from the approximately-human move probability distribution of its "policy neural network." (iii) The coefficients in the linear combination depend on N and are tuned to maximize experimental performance.

This approach works badly when the position is clearly won or clearly lost, because then, even for large N, essentially all the random games are wins (or losses), meaning part (ii) generates little or no useful information, and most of the "signal" is actually random "noise." Part (i) to some extent saves them from that pathology, but not enough, especially since they trained it to output win probability, not predicted final score. As a result, Monte Carlo go will, in won positions, decide between move A and move B, assuming they both win, largely at random. Ditto in lost positions. That's not the smartest way to play won positions -- although usually good enough -- and it is definitely not the smartest way to play lost positions when your opponent is in severe time trouble!

How could this problem be cured? The cheap and easy fix would be to also make the neural net predict final score and final territory, which by the way would be 361 bits of learnable info per training position, not merely 1 bit, so it'd learn 361 times faster. Also make the Monte Carlo playouts in step (ii) know about score, not just winning. Then, in a lost position, let's say some node's winprob estimate is P=0.05. Really, due to random noise (which should be estimated by the computer), it is, say, 0.05 +- 0.03.
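For concreteness, here is a minimal Python sketch of the kind of leaf evaluation described above, together with the proposed score-aware tiebreak. Everything in it -- value_net(), rollout(), eval_leaf(), pick_move(), the fixed mixing weight lam -- is a hypothetical placeholder for illustration, not AlphaGo's actual code (whose mixing coefficients, as noted, depend on N and are tuned experimentally).

    import math, random

    def value_net(position):
        # Stand-in for the value network: returns an estimated win probability.
        return 0.5  # placeholder

    def rollout(position):
        # Stand-in for one policy-network-guided playout; returns
        # (win, final score margin) from the player-to-move's viewpoint.
        score = random.gauss(0.0, 10.0)  # placeholder
        return (1 if score > 0 else 0), score

    def eval_leaf(position, n_playouts=100, lam=0.5):
        # Mix the network's win-prob with the playout win rate, as in (i)-(iii).
        wins, scores = 0, []
        for _ in range(n_playouts):
            w, s = rollout(position)
            wins += w
            scores.append(s)
        p_rollout = wins / n_playouts
        p = (1 - lam) * value_net(position) + lam * p_rollout
        # Standard error of the playout win-rate estimate -- the "noise",
        # which carries almost no information when p_rollout is near 0 or 1.
        noise = math.sqrt(max(p_rollout * (1 - p_rollout), 1e-9) / n_playouts)
        mean_score = sum(scores) / n_playouts
        return p, noise, mean_score

    def pick_move(candidates, n_playouts=100):
        # Proposed fix: when several moves' win-probs are indistinguishable
        # within the noise (clearly won or clearly lost positions), fall back
        # on expected final score instead of choosing essentially at random.
        evals = [(move,) + eval_leaf(pos, n_playouts) for move, pos in candidates]
        best_p = max(p for _, p, _, _ in evals)
        near_best = [e for e in evals if e[1] >= best_p - 2 * e[2]]
        return max(near_best, key=lambda e: e[3])[0]

The only point of pick_move() is the tiebreak logic; a real engine would apply the same idea inside the tree search, not just when choosing among root moves.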
My point is, if and when the noise is comparable to or larger than the signal, then it is worth looking at final-score estimates too. Although those are the wrong signal (winning is the "only thing," said the football coach), they happen to be a better-quality source of signal, supplying more bits and enabling a better signal/noise ratio if you use them appropriately rather than ignoring them. I believe this would improve the machine's play in won and especially in lost positions.

The more correct fix is to employ "BPIP search" (invented by me about 20 years ago), not UCT Monte Carlo search, which is sort of a flawed randomized approximation to BPIP. Nobody in the AI community paid much attention to BPIP search, and now they are paying for that. BPIP involves an eval(P) which outputs a probability distribution, not a mere number.

-- Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)
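[A toy sketch of what a distributional eval(P) might look like -- not the actual BPIP algorithm, just the general idea of carrying a discrete {score: probability} distribution instead of a single number, and one common way such distributions can be combined: under an independence assumption, the value distribution at a node where it is our turn is that of the maximum of the children's value distributions.]

    def max_of_independent(dist_a, dist_b):
        # Distribution of max(A, B) for independent discrete A, B,
        # each given as a {score: probability} dict.
        scores = sorted(set(dist_a) | set(dist_b))
        cdf = lambda d, x: sum(p for s, p in d.items() if s <= x)
        out, prev = {}, 0.0
        for x in scores:
            c = cdf(dist_a, x) * cdf(dist_b, x)   # P(max <= x)
            out[x] = c - prev
            prev = c
        return {s: p for s, p in out.items() if p > 1e-12}

    # Two candidate continuations: a "safe" one and a "risky" one.
    safe  = {+2: 0.9, -1: 0.1}    # wins by a little, reliably
    risky = {+15: 0.4, -5: 0.6}   # usually loses, sometimes wins big

    win_prob = lambda d: sum(p for s, p in d.items() if s > 0)
    print(win_prob(safe), win_prob(risky))    # 0.9 vs 0.4: single numbers
    print(max_of_independent(safe, risky))    # full backed-up score distribution

The distribution preserves information -- how much you win or lose by, and how reliably -- that a single win-probability number throws away.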
On Mar 13, 2016, at 11:16 AM, Warren D Smith <warren.wds@gmail.com> wrote:
> Nobody in the AI community paid much attention to BPIP search, and now they are paying for that.
Yes, I see. They have created a world champion go player on their first public try. To worldwide acclaim. I’m sure they’re all wringing their hands wishing they had done things your way. Maybe you should offer to help?

Regards,
Jon