[math-fun] Getting a complete set
hihi, all -

i thought about this same problem some years ago, and will try to recreate where i ended - the original problem was to estimate the size of a picture archive on the web, from the repetition (presumed independent) of a single experiment that extracted (presumably uniformly at random) and presented one picture

the question was when to stop looking at the set, so i wanted a reasonable estimate for the total size N of the set

clearly, until there is a repeated element, there is no maximum likelihood estimate for the size N (since larger sets are more likely to produce n distinct selections, for any fixed N > n > 1)

what was interesting to me is that as soon as there is one repeat, a maximum likelihood estimate for N can be made, which turns out to be quadratic in the number n of selections made up to and including the repeated one (the expression is something like O(n*n/3), but the specific formula was slightly different for different parities of n, or maybe for different remainders of n (mod 3), i forget)

however, the curve is VERY flat near that maximum, so the confidence interval estimate is very wide - i expected that more selections with more repeats should help get a sharper estimate for N

i verified experimentally that more repeats make the peak narrower, but i could not quantify the improvement enough to get an analytic expression

i tried to find the problem statement in some published application, and the closest i could get was the population estimation problem in ecological sampling

more later,
cal

Chris Landauer
Aerospace Integration Science Center
The Aerospace Corporation
cal@aero.org
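The maximum-likelihood estimate described above is easy to reproduce numerically. The following is a minimal Python sketch (not from the original post); it assumes uniform sampling with replacement and conditions on the event that the first repeated picture appears on draw n, so the likelihood of N is N(N-1)...(N-n+2)/N^(n-1) times (n-1)/N:

```python
import math

def log_like_first_repeat(N: int, n: int) -> float:
    """Log-likelihood that the first repeated item appears on draw n,
    when drawing uniformly with replacement from N items:
    P = [N (N-1) ... (N-n+2) / N^(n-1)] * (n-1)/N."""
    if N < n - 1:
        return float("-inf")  # N items cannot supply n-1 distinct draws
    # numerator terms log(N - i) for i = 0 .. n-2, minus n log N, plus log(n-1)
    return (sum(math.log(N - i) for i in range(n - 1))
            - n * math.log(N) + math.log(n - 1))

def mle_N(n: int, n_max: int = 10_000) -> int:
    """Integer N maximizing the likelihood, by grid search up to n_max."""
    return max(range(max(n - 1, 1), n_max),
               key=lambda N: log_like_first_repeat(N, n))
```

For n = 3 the integer maximizer is N = 2, and for n = 4 it is N = 5; comparing log_like_first_repeat at the maximizer and at twice that value shows how flat the curve is near its peak, which matches the wide confidence intervals noted above.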
--- Chris Landauer <cal@rush.aero.org> wrote:
> hihi, all -
>
> i thought about this same problem some years ago, and will try to recreate where i ended - the original problem was to estimate the size of a picture archive on the web, from the repetition (presumed independent) of a single experiment that extracted (presumably uniformly at random) and presented one picture
>
> the question was when to stop looking at the set, so i wanted a reasonable estimate for the total size N of the set
>
> clearly, until there is a repeated element, there is no maximum likelihood estimate for the size N (since larger sets are more likely to produce n distinct selections, for any fixed N > n > 1)
>
> what was interesting to me is that as soon as there is one repeat, a maximum likelihood estimate for N can be made, which turns out to be quadratic in the number n of selections made up to and including the repeated one (the expression is something like O(n*n/3), but the specific formula was slightly different for different parities of n, or maybe for different remainders of n (mod 3), i forget)
>
> however, the curve is VERY flat near that maximum, so the confidence interval estimate is very wide - i expected that more selections with more repeats should help get a sharper estimate for N
>
> i verified experimentally that more repeats make the peak narrower, but i could not quantify the improvement enough to get an analytic expression
>
> i tried to find the problem statement in some published application, and the closest i could get was the population estimation problem in ecological sampling
>
> more later,
> cal
>
> Chris Landauer
> Aerospace Integration Science Center
> The Aerospace Corporation
> cal@aero.org
This is a straightforward Bayesian problem, although it could get messy to work through the combinatorics.

Assuming N objects, calculate the probability P(n1,n2,...|N) of drawing n1 objects once, n2 objects twice, etc., where the total number of draws is n1 + 2 n2 + 3 n3 + ... . Let P0(N) be your prior probability for N. Then the posterior probability for N is

    P(N|n1,n2,...) = A P0(N) P(n1,n2,...|N),

where

    A = 1 / sum[all N, P0(N) P(n1,n2,...|N)]

is a normalization constant (assuming your observations are sufficient to make the posterior proper). Note that you may ignore in P(n1,...|N) any factor independent of N.

Gene
participants (2)
- Chris Landauer
- Eugene Salamin