hihi, all - I thought about this same problem some years ago, and will try to recreate where I ended up.

The original problem was to estimate the size of a picture archive on the web from repetitions in a sequence of (presumed independent) experiments, each of which extracted (presumably uniformly at random) and presented one picture. The question was when to stop looking at the set, so I wanted a reasonable estimate of the total size N of the set.

Clearly, until there is a repeated element, there is no maximum likelihood estimate for the size N, since larger sets are more likely to produce n distinct selections, for any fixed N > n > 1. What was interesting to me is that as soon as there is one repeat, a maximum likelihood estimate for N can be made, and it turns out to be quadratic in the number n of selections made up to and including the repeated one (the expression is something like n*n/3, but the specific formula was slightly different for different parities of n, or maybe for different remainders of n (mod 3), I forget).

However, the likelihood curve is VERY flat near that maximum, so the confidence interval estimate is very wide. I expected that more selections with more repeats should help get a sharper estimate for N. I verified experimentally that more repeats do make the peak narrower, but I could not quantify the improvement enough to get an analytic expression.

I tried to find the problem statement in some published application, and the closest I could get was the population estimation problem in ecological sampling (capture-recapture).

more later,
cal

Chris Landauer
Aerospace Integration Science Center
The Aerospace Corporation
cal@aero.org
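
PS - since the likelihood is easy to write down, here is a minimal Python sketch (a reconstruction under the stated assumptions, not the original code; n = 40 is just an example value). With uniform sampling with replacement, seeing d distinct pictures in n draws has likelihood proportional to N*(N-1)*...*(N-d+1) / N^n, and the single-repeat case above is just d = n-1:

    import math

    def log_likelihood(N, n, d):
        # log-likelihood (up to an N-independent constant) of archive size N,
        # given d distinct pictures seen in n uniform draws with replacement:
        # L(N) proportional to N*(N-1)*...*(N-d+1) / N^n
        if N < d:
            return float("-inf")
        return sum(math.log(N - i) for i in range(d)) - n * math.log(N)

    def mle(n, d, upper):
        # brute-force integer search for the maximum-likelihood N
        return max(range(d, upper + 1), key=lambda N: log_likelihood(N, n, d))

    n = 40          # first repeat observed on the 40th selection (example value)
    d = n - 1       # so the first n-1 selections were distinct
    Nhat = mle(n, d, 10 * n * n)
    print("MLE for N:", Nhat)

    # how flat is the peak? compare the likelihood at half and double the MLE
    peak = log_likelihood(Nhat, n, d)
    for N in (Nhat // 2, 2 * Nhat):
        print("L(%d)/L(%d) = %.3f" % (N, Nhat, math.exp(log_likelihood(N, n, d) - peak)))

The printed ratios stay uncomfortably close to 1, which is the flatness I remembered; rerunning with a longer draw sequence (larger n, hence more repeats for the same N) pulls them down, which is the narrowing I saw experimentally but never managed to quantify.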