[math-fun] Differential Privacy and Poll Sampling
The pollsters in the recent presidential election didn't do a very good job of predicting the result. Apparently, one of the better polls -- the LA Times poll -- was heavily criticized for including one sample that appeared to be an outlier, but that affected the result quite significantly, in a way that produced (in this particular case) a more *correct* prediction.

It occurred to me that it might be a good idea to use some of the ideas from *differential privacy* to reduce the possibility of a single sample point significantly affecting the result. After all, the whole point of differential privacy is to ensure that a single sample point can't be identified using a small number of questions.

The polls have also been criticized for "oversampling" -- i.e., applying the wrong weighting factors to different sample points. So my modest proposal is to *oversample* rare combinations, but then *underweight* these oversampled points in such a way as to produce the overall correct weighting. The amount of oversampling should be calculable from the basic equations of differential privacy, which try to minimize any change due to the inclusion or exclusion of any single sample point; a rough sketch follows below.

https://en.wikipedia.org/wiki/Differential_privacy

"The definition [of differential privacy] gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly."
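Here's a minimal sketch (in Python) of the single-sample-influence part of the idea. Formally, a randomized algorithm M is epsilon-differentially private if Pr[M(D) in S] <= exp(epsilon) * Pr[M(D') in S] for all datasets D, D' differing in one record and all output sets S. The sketch caps each survey weight and releases the weighted proportion through the standard Laplace mechanism; the function name, the weight cap w_cap, the epsilon values, and the sample data are all illustrative assumptions, not anything from an actual poll's methodology.

    import random

    def dp_weighted_estimate(responses, weights, epsilon=1.0, w_cap=3.0):
        """responses: list of 0/1 votes; weights: raw survey weights.
        Returns a noisy weighted proportion that is epsilon-DP with
        respect to changing any one respondent's answer (weights are
        clipped to w_cap to bound each respondent's influence)."""
        clipped = [min(w, w_cap) for w in weights]
        total = sum(clipped)
        estimate = sum(w * r for w, r in zip(clipped, responses)) / total
        # Flipping one answer moves the estimate by at most w_cap/total,
        # so Laplace noise with scale w_cap/(total*epsilon) suffices
        # (the Laplace mechanism).
        sensitivity = w_cap / total
        # The difference of two Exp(1) draws is a standard Laplace(0, 1).
        noise = random.expovariate(1.0) - random.expovariate(1.0)
        return estimate + (sensitivity / epsilon) * noise

    # Example: ten respondents, one with an outsized raw weight of 5.0
    # that the cap reduces to 3.0 before the estimate is computed.
    votes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    wts   = [1.0, 1.2, 0.8, 5.0, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0]
    print(dp_weighted_estimate(votes, wts, epsilon=0.5))

The capping plays the role of the underweighting: the outlier still contributes, but its maximum influence on the released number is bounded, and the noise scale turns that bound into a formal exp(epsilon) guarantee rather than an ad hoc one.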
Please take this discussion elsewhere. —Dan
On Nov 11, 2016, at 9:50 AM, Henry Baker <hbaker1@pipeline.com> wrote:

> Apparently, one of the better polls -- the LA Times poll -- was heavily criticized for including one sample that appeared to be an outlier, but that affected the result quite significantly, in a way that produced (in this particular case) a more *correct* prediction.