So my preliminary paper "The chance that N independent statistical tests fail simultaneously" http://rangevoting.org/CombinedTestFail.html has now had 5 mathematicians in a row fail to obtain the right formula for F2, followed by 2 disputants on the ElectionIntegrity email list, both claiming statistical expertise, disputing it... followed by Paul F. Velleman, an actual professor of statistics at Cornell, telling me that not only was I right, this was in fact a well-known literature area and I was probably just reinventing the wheel. The magic key words he suggested are "multiple comparison problem." Amazing.

Well, Velleman is certainly more right than the others. It turns out there are several entire books on this area of statistics, and the whole "multiple comparison problem" is, at least generally speaking, a well-known phenomenon. Evidently, however, it is not well known enough. In particular, at least one election-integrity paper by Josh Mitteldorf and others is invalidated because he did not know about this effect, and it would not surprise me if every paper he ever wrote or co-wrote on that topic is also invalidated, plus quite likely some of Mitteldorf's other papers in non-election areas. (His response? He emailed me "I don't have time to continue this discussion." You know what, Josh? If a goodly fraction of my life's scientific work had just been invalidated, I'd find the effing time, especially if somebody was very generously and helpfully pointing it out to me.)

And Mitteldorf is by no means the only victim -- a large number of medical experimental papers also contain wrong statistical calculations because their authors did not know about this effect, which probably means lives have been lost. I would guess thousands of papers are invalidated.

QUOTE from Yoav Benjamini & Yosef Hochberg: "Controlling the false discovery rate: a practical and powerful approach to multiple testing", J.
Royal Statistical Society, Series B 57,1 (1995) 289-300:

"Even though MCPs have been in use since the early 1950s, and in spite of advocacy for their use (e.g. mandatory for some journals, as well as in institutions like the FDA), researchers have not yet widely adopted these procedures. In medical research for example, Godfrey (1985), Pocock et al (1987) and Smith et al (1987) examined samples of reports of comparative studies from major medical journals. They found that researchers overlook various kinds of multiplicity, and as a result reporting tends to exaggerate treatment differences."

Want more? Here's a second QUOTE, from the review paper D.A. Berry: The difficult and ubiquitous problems of multiplicities, Pharmaceutical Statistics 6 (2007) 155-160:

"Most scientists are oblivious to the problems of multiplicities. Yet they are everywhere. In one or more of its forms, multiplicities are present in every statistical application. They may be out in the open or hidden. And even if they are out in the open, recognizing them is but the first step in a difficult process of inference. Problems of multiplicities are the most difficult that we statisticians face. They threaten the validity of every statistical conclusion."

Anybody get the picture yet?

So now, I am trying to look into the literature Velleman so helpfully pointed me toward. Books on this topic include:

Rupert G. Miller Jr: Simultaneous statistical inference, Springer-Verlag, 2nd ed, 1981. QA276 .M474
Jason C. Hsu: Multiple comparisons: Theory and methods. London, UK: Chapman and Hall, 1996.
Larry E. Toothaker: Multiple comparisons for researchers, Newbury Park, Calif.: Sage Publications, 1991. Q180.55.M4 T66
Shanti Swarup Gupta & Deng-Yuan Huang: Multiple statistical decision theory: recent developments, New York: Springer-Verlag, 1981. QA279.7 .G87
Peter H. Westfall & S.S. Young: Resampling-based multiple testing: Examples and methods for p-value adjustment. New York, NY: Wiley, 1993.
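Since the Benjamini-Hochberg paper quoted above is the standard reference for the "false discovery rate" approach, here is a minimal Python sketch of their step-up procedure -- my own illustrative code, not code from the paper:

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Given a list of p-values, return the (sorted) indices of the
    hypotheses rejected while controlling the false discovery rate
    at level q, assuming the tests are independent.
    """
    m = len(pvalues)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k (1-based rank) with p_(k) <= (k/m)*q.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank
    # Reject the hypotheses having the k smallest p-values.
    return sorted(order[:k])
```

For example, benjamini_hochberg([0.005, 0.02, 0.03, 0.9], q=0.05) rejects the first three hypotheses, whereas the plain divide-by-T cutoff 0.05/4 = 0.0125 would reject only the first -- which is the sense in which FDR control is more powerful than familywise-error control.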
There also are several web pages devoted to this area, including Wikipedia's and "Beware of multiple comparisons" http://www.graphpad.com/guides/prism/6/statistics/index.htm?beware_of_multip...

In addition to the books devoted just to this topic, apparently at least 10 general-purpose statistics guidebooks at least make some mention of the Multiple Comparisons Problem, e.g.:

Lorena Madrigal: Statistics for anthropology, Cambridge U.P., 2012.
Soleman H. Abu-Bader: Using statistical methods in social work practice: a complete SPSS guide, Lyceum Books, 2006.
Robert L. Launer & Andrew F. Siegel (eds): Modern data analysis, New York: Academic Press, 1982.
John McDonald: Handbook of Biological Statistics.

Another good magic keyphrase is "false discovery rate."

What do I think of all this literature? I'm trying to figure that out... I'll let you know after I read some more of it.

A simple and safe idea, which many sources recommend, is this: if you are doing T different tests, then divide your p-level cutoff for statistical significance (e.g. if seeking 99.9% confidence, it would be 0.001) by T for each test, then proceed. That'll protect you. I already knew that since I was a child, but it is a weak idea. If you want to wring the most confidence from your tests, you need stronger methods, i.e. a better understanding than just that.

A large fraction (in fact, virtually all of the ones I have looked at so far) of the statistics theory papers on this topic DO NOT GIVE CLEARLY STATED THEOREMS, WITH PROOFS. For a topic as clearly tricky as this, I think that is unacceptable behavior. So I will say straight off that the workers in this area have, in the vast majority, done poor work.

The allegedly strongest result in one line of work on this is

Yosef Hochberg: A Sharper Bonferroni Procedure for Multiple Tests of Significance, Biometrika 75,4 (1988) 800-802

which is available electronically:
http://www-stat.wharton.upenn.edu/~steele/Courses/956/Resource/MultipleCompa...
and http://svn.donarmstrong.com/don/trunk/projects/research/linkage/papers/multi... and it still seems to me to be a quite weak result.

-- Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)
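P.S. For concreteness, here is a minimal Python sketch contrasting the plain divide-by-T Bonferroni cutoff with Hochberg's 1988 step-up procedure. This is my own illustration, not code from Hochberg's paper; note Hochberg's procedure is "sharper" in that it always rejects at least as much as Bonferroni, but its validity proof needs independence (or certain positive-dependence) assumptions that plain Bonferroni does not:

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Plain Bonferroni: divide the significance cutoff by the
    number of tests T.  Controls the familywise error rate at
    alpha with no independence assumption."""
    m = len(pvalues)
    return sorted(i for i, p in enumerate(pvalues) if p <= alpha / m)

def hochberg_reject(pvalues, alpha=0.05):
    """Hochberg's (1988) step-up procedure: sort the p-values
    ascending, find the largest 1-based rank k with
    p_(k) <= alpha/(m - k + 1), and reject the hypotheses with
    the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha / (m - rank + 1):
            k = rank
    return sorted(order[:k])
```

For example, with p-values [0.04, 0.03, 0.02] at alpha = 0.05, the Bonferroni cutoff 0.05/3 rejects nothing, while Hochberg rejects all three, because the largest p-value 0.04 already passes its step-up test p_(3) <= alpha/1.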