So my preliminary paper "The chance that N independent statistical tests fail simultaneously" http://rangevoting.org/CombinedTestFail.html has now had 5 mathematicians in a row fail to obtain the right formula for F2, followed by 2 disputants on the ElectionIntegrity email list, both claiming statistical expertise, disputing it... followed by Paul F. Velleman, an actual professor of statistics at Cornell, telling me that not only was I right, this was in fact a well-known literature area and I was probably just reinventing the wheel. The magic key words he suggested are "multiple comparison problem." Amazing.

Well, Velleman is certainly more right than the others. It turns out there are several entire books on this area of statistics, and the whole "multiple comparison problem" is, at least generally speaking, a well-known phenomenon. Evidently, however, it is not well known enough. In particular, at least one election-integrity paper by Josh Mitteldorf and others is invalidated because he did not know about this effect, and it would not surprise me if every paper he ever wrote or co-wrote on that topic is also invalidated, plus quite likely some of Mitteldorf's other papers in non-election areas. (His response? He emailed me "I don't have time to continue this discussion." You know what, Josh? If a goodly fraction of my life's scientific work had just been invalidated, I'd find the effing time, especially if somebody was very generously and helpfully pointing it out to me.)

And Mitteldorf is by no means the only victim -- a large number of medical experimental papers also contain wrong statistical calculations because their authors did not know about this effect, which probably means lives have been lost. I would guess thousands of papers are invalidated.

QUOTE from Yoav Benjamini & Yosef Hochberg: "Controlling the false discovery rate: a practical and powerful approach to multiple testing", J.
Royal Statistical Society, Series B 57,1 (1995) 289-300:

"Even though MCPs have been in use since the early 1950s, and in spite of advocacy for their use (e.g. mandatory for some journals, as well as in institutions like the FDA), researchers have not yet widely adopted these procedures. In medical research for example, Godfrey (1985), Pocock et al (1987) and Smith et al (1987) examined samples of reports of comparative studies from major medical journals. They found that researchers overlook various kinds of multiplicity, and as a result reporting tends to exaggerate treatment differences."

Want more? Here's a second QUOTE, from the review paper D.A. Berry: The difficult and ubiquitous problems of multiplicities, Pharmaceutical Statistics 6 (2007) 155-160:

"Most scientists are oblivious to the problems of multiplicities. Yet they are everywhere. In one or more of its forms, multiplicities are present in every statistical application. They may be out in the open or hidden. And even if they are out in the open, recognizing them is but the first step in a difficult process of inference. Problems of multiplicities are the most difficult that we statisticians face. They threaten the validity of every statistical conclusion."

Anybody get the picture yet?

So now, I am trying to look into the literature Velleman so helpfully pointed me toward. Books on this topic include:

Rupert G. Miller Jr: Simultaneous statistical inference, Springer-Verlag, 2nd ed, 1981. QA276 .M474
Jason C. Hsu: Multiple comparisons: Theory and methods. London, UK: Chapman and Hall, 1996.
Larry E. Toothaker: Multiple comparisons for researchers, Newbury Park, Calif.: Sage Publications, 1991. Q180.55.M4 T66
Shanti Swarup Gupta & Deng-Yuan Huang: Multiple statistical decision theory: recent developments, New York: Springer-Verlag, 1981. QA279.7 .G87
Peter H. Westfall & S.S. Young: Resampling-based multiple testing: Examples and methods for p-value adjustment. New York, NY: Wiley, 1993.
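Since the Benjamini-Hochberg paper quoted above is the standard reference for the "false discovery rate" approach, here is a minimal Python sketch of their step-up procedure -- my own illustrative code, not code from the paper:

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Given a list of p-values, return the (sorted) indices of the
    hypotheses rejected while controlling the false discovery rate
    at level q, assuming the tests are independent.
    """
    m = len(pvalues)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k (1-based rank) with p_(k) <= (k/m)*q.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank
    # Reject the hypotheses having the k smallest p-values.
    return sorted(order[:k])
```

For example, benjamini_hochberg([0.005, 0.02, 0.03, 0.9], q=0.05) rejects the first three hypotheses, whereas the plain divide-by-T cutoff 0.05/4 = 0.0125 would reject only the first -- which is the sense in which FDR control is more powerful than familywise-error control.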
There also are several web pages devoted to this area, including Wikipedia's and "Beware of multiple comparisons" http://www.graphpad.com/guides/prism/6/statistics/index.htm?beware_of_multip...

In addition to the books devoted just to this topic, apparently at least 10 general-purpose statistics guidebooks at least make some mention of the Multiple Comparisons Problem, e.g.:

Lorena Madrigal: Statistics for anthropology, Cambridge U.P., 2012.
Soleman H. Abu-Bader: Using statistical methods in social work practice: a complete SPSS guide, Lyceum Books, 2006.
Robert L. Launer & Andrew F. Siegel (eds): Modern data analysis, New York: Academic Press, 1982.
John McDonald: Handbook of Biological Statistics.

Another good magic keyphrase is "false discovery rate."

What do I think of all this literature? I'm trying to figure that out... I'll let you know after I read some more of it.

A simple and safe idea, which many sources recommend, is this: if you are doing T different tests, then divide your p-level cutoff for statistical significance (e.g. if seeking 99.9% confidence, it would be 0.001) by T for each test, then proceed. That'll protect you. I already knew that since I was a child, but it is a weak idea. If you want to wring the most confidence from your tests, you need stronger methods, i.e. a better understanding than just that.

A large fraction (in fact, virtually all of the ones I have looked at so far) of the statistics theory papers on this topic DO NOT GIVE CLEARLY STATED THEOREMS, WITH PROOFS. For a topic as clearly tricky as this, I think that is unacceptable behavior. So I will say straight off that the workers in this area have, in the vast majority, done poor work.

The allegedly strongest result in one line of work on this is

Yosef Hochberg: A Sharper Bonferroni Procedure for Multiple Tests of Significance, Biometrika 75,4 (1988) 800-802

which is available electronically:
http://www-stat.wharton.upenn.edu/~steele/Courses/956/Resource/MultipleCompa...
and http://svn.donarmstrong.com/don/trunk/projects/research/linkage/papers/multi... and it still seems to me to be a quite weak result.

-- Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)
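P.S. For concreteness, here is a minimal Python sketch contrasting the plain divide-by-T Bonferroni cutoff with Hochberg's 1988 step-up procedure. This is my own illustration, not code from Hochberg's paper; note Hochberg's procedure is "sharper" in that it always rejects at least as much as Bonferroni, but its validity proof needs independence (or certain positive-dependence) assumptions that plain Bonferroni does not:

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Plain Bonferroni: divide the significance cutoff by the
    number of tests T.  Controls the familywise error rate at
    alpha with no independence assumption."""
    m = len(pvalues)
    return sorted(i for i, p in enumerate(pvalues) if p <= alpha / m)

def hochberg_reject(pvalues, alpha=0.05):
    """Hochberg's (1988) step-up procedure: sort the p-values
    ascending, find the largest 1-based rank k with
    p_(k) <= alpha/(m - k + 1), and reject the hypotheses with
    the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha / (m - rank + 1):
            k = rank
    return sorted(order[:k])
```

For example, with p-values [0.04, 0.03, 0.02] at alpha = 0.05, the Bonferroni cutoff 0.05/3 rejects nothing, while Hochberg rejects all three, because the largest p-value 0.04 already passes its step-up test p_(3) <= alpha/1.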