[math-fun] Lee Sedol wins game 4, humanity may not yet be complete toast (but almost)
That was interesting, and it revealed how the Monte Carlo approach to go is flawed. LS had to be brilliant and play some "improbable" moves the machine may not have appreciated until too late. He did. As a result, LS got a won position by a small but clear (to strong players) margin -- I don't know exactly, but let's say 2-5 points -- but to do so LS used up all of his 2-hour time budget plus two of his three overtime periods. Meanwhile AG had only used up 1 hour. From then on, the rules required LS to make every move within 1 minute.

So at that point, if LS had been up against a top human (TH), TH would have tried to complicate and pose LS one or two difficult problems, which hopefully he'd be unable to solve in 1 minute. Hence LS would either lose on time or risk giving up his advantage via a mistake. There was still enough scope in the position for creating that kind of complexity. Also, TH could have used his spare hour to think a lot of it out ahead of time, then sprung it on LS suddenly, playing his prepared moves in 1 second each, so that LS could not think on TH's time when it got difficult.

But no. AG instead played a few totally stupid and pointless time-wasting moves (even I could see that), as well as a lot of trivially-easy-to-answer forcing moves. That took the heat off and gave LS an easy ride to the win. Also, AG's time-usage algorithm is very uniform. Completely uniform would be to always use exactly 1 minute per move (say), and while AG is not completely uniform, it is fairly close to it, much more so than a human. A human will play an easy move in 2 seconds; AG will grind on for 30.

So what is the matter with Monte Carlo go? AG works as follows. It looks ahead and then evaluates a leaf position P. The eval(P) is a linear combination of (i) a neural network applied to P, estimating win probability, and (ii) the result of playing out N "random games" from P to the end (or what it thinks is the end, anyhow) and seeing what percentage of them it wins. The random games use random moves drawn from the approximately-human move probability distribution of its "policy neural network." (iii) The coefficients in the linear combination depend on N and are tuned to maximize experimental performance.

This approach works badly when the position is clearly won or clearly lost, because then, even for large N, essentially all the random games are wins (or losses), meaning part (ii) generates little or no useful information, and most of the "signal" is actually random "noise." Part (i) to some extent saves them from that pathology, but not enough, especially since they trained it to output win probability, not predicted final score. As a result, Monte Carlo go will, in won positions, decide between move A and move B, assuming they both win, largely at random. Ditto in lost positions. That's not the smartest way to play won positions -- although usually good enough -- and it is definitely not the smartest way to play lost positions when your opponent is in severe time trouble!

How could this problem be cured? The cheap and easy fix would be to also make the neural net predict final score and final territory, which by the way would be 361 bits of learnable info per training position, not merely 1 bit, so it'd learn 361 times faster. Also make the Monte Carlo playouts in step (ii) know about score, not just winning. Then, in a lost position, let's say some node's winprob estimate is P=0.05. Really, due to random noise (which should be estimated by the computer), it is, say, 0.05 +- 0.03.
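For concreteness, here is a minimal Python sketch of the kind of leaf evaluation described above, together with the proposed score-aware tiebreak. Everything in it -- value_net(), rollout(), eval_leaf(), pick_move(), the fixed mixing weight lam -- is a hypothetical placeholder for illustration, not AlphaGo's actual code (whose mixing coefficients, as noted, depend on N and are tuned experimentally).

    import math, random

    def value_net(position):
        # Stand-in for the value network: returns an estimated win probability.
        return 0.5  # placeholder

    def rollout(position):
        # Stand-in for one policy-network-guided playout; returns
        # (win, final score margin) from the player-to-move's viewpoint.
        score = random.gauss(0.0, 10.0)  # placeholder
        return (1 if score > 0 else 0), score

    def eval_leaf(position, n_playouts=100, lam=0.5):
        # Mix the network's win-prob with the playout win rate, as in (i)-(iii).
        wins, scores = 0, []
        for _ in range(n_playouts):
            w, s = rollout(position)
            wins += w
            scores.append(s)
        p_rollout = wins / n_playouts
        p = (1 - lam) * value_net(position) + lam * p_rollout
        # Standard error of the playout win-rate estimate -- the "noise",
        # which carries almost no information when p_rollout is near 0 or 1.
        noise = math.sqrt(max(p_rollout * (1 - p_rollout), 1e-9) / n_playouts)
        mean_score = sum(scores) / n_playouts
        return p, noise, mean_score

    def pick_move(candidates, n_playouts=100):
        # Proposed fix: when several moves' win-probs are indistinguishable
        # within the noise (clearly won or clearly lost positions), fall back
        # on expected final score instead of choosing essentially at random.
        evals = [(move,) + eval_leaf(pos, n_playouts) for move, pos in candidates]
        best_p = max(p for _, p, _, _ in evals)
        near_best = [e for e in evals if e[1] >= best_p - 2 * e[2]]
        return max(near_best, key=lambda e: e[3])[0]

The only point of pick_move() is the tiebreak logic; a real engine would apply the same idea inside the tree search, not just when choosing among root moves.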
My point is, if and when the noise is comparable to or larger than the signal, then it is worth looking at final-score estimates too. Although those are the wrong signal (winning is the "only thing," said the football coach), they happen to be a better-quality source of signal, supplying more bits and enabling a better signal/noise ratio if you use them appropriately rather than ignoring them. I believe this would improve the machine's play in won and especially in lost positions.

The more correct fix is to employ "BPIP search" (invented by me about 20 years ago), not UCT Monte Carlo search, which is sort of a flawed randomized approximation to BPIP. Nobody in the AI community paid much attention to BPIP search, and now they are paying for that. BPIP involves an eval(P) which outputs a probability distribution, not a mere number.

-- Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)
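[A toy sketch of what a distributional eval(P) might look like -- not the actual BPIP algorithm, just the general idea of carrying a discrete {score: probability} distribution instead of a single number, and one common way such distributions can be combined: under an independence assumption, the value distribution at a node where it is our turn is that of the maximum of the children's value distributions.]

    def max_of_independent(dist_a, dist_b):
        # Distribution of max(A, B) for independent discrete A, B,
        # each given as a {score: probability} dict.
        scores = sorted(set(dist_a) | set(dist_b))
        cdf = lambda d, x: sum(p for s, p in d.items() if s <= x)
        out, prev = {}, 0.0
        for x in scores:
            c = cdf(dist_a, x) * cdf(dist_b, x)   # P(max <= x)
            out[x] = c - prev
            prev = c
        return {s: p for s, p in out.items() if p > 1e-12}

    # Two candidate continuations: a "safe" one and a "risky" one.
    safe  = {+2: 0.9, -1: 0.1}    # wins by a little, reliably
    risky = {+15: 0.4, -5: 0.6}   # usually loses, sometimes wins big

    win_prob = lambda d: sum(p for s, p in d.items() if s > 0)
    print(win_prob(safe), win_prob(risky))    # 0.9 vs 0.4: single numbers
    print(max_of_independent(safe, risky))    # full backed-up score distribution

The distribution preserves information -- how much you win or lose by, and how reliably -- that a single win-probability number throws away.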
On Mar 13, 2016, at 11:16 AM, Warren D Smith <warren.wds@gmail.com> wrote:
> Nobody in the AI community paid much attention to BPIP search, and now they are paying for that.
Yes, I see. They have created a world champion go player on their first public try. To worldwide acclaim. I’m sure they’re all wringing their hands wishing they had done things your way. Maybe you should offer to help?

Regards,
Jon