[math-fun] Zipf, codes & cumulative probability
f'(n) is a relative probability distribution; f''(x) is the analogous continuous function, almost a probability density; F''(x) is the corresponding almost-cumulative probability function. (F'' is the integral of f''.)
Let F(x) be F''(x) adjusted so that F(.5) = 0 and F( m + .5 ) = 1. Let f(x) be dF(x)/dx, and let G(y) be the inverse of F(x).
There's an argument for making f(n) = F(n+.5) - F(n-.5), since that's the approximate chunk of code space this method gives to n. One version of my method needs to invert f(), though, and I'm not sure that's always so easy. (If I thought this would all be as easy (& applicable to as many kinds of distributions) with differences and sums instead of derivatives and integrals, I'd do that, since it's more correct.)
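To make this concrete, here is a small Python sketch of F, f, and G. It assumes the underlying relative distribution is Zipf's f'(n) = 1/n -- the post never says so explicitly, but that assumption makes the m = 8 example further down come out right:

    # Sketch, assuming the Zipf case: f'(n) = 1/n, so f''(x) = 1/x and
    # F''(x) = ln(x) + C.  Rescaling so that F(.5) = 0 and F(m+.5) = 1
    # gives F(x) = ln(2x) / ln(2m+1).
    from math import log

    def F(x, m):
        # cumulative function, 0 at x = .5 and 1 at x = m + .5
        return log(2.0 * x) / log(2.0 * m + 1.0)

    def f(x, m):
        # density dF/dx
        return 1.0 / (x * log(2.0 * m + 1.0))

    def G(y, m):
        # inverse of F: G(F(x)) = x
        return (2.0 * m + 1.0) ** y / 2.0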
Here's an existence proof of a code. [...] To encode n, generate F(n-1) (if n > 1), F(n), and F(n+1) (if n < m) to at least a couple more bits of precision than -log2(f(n)) would call for. Use just enough bits of F(n) to distinguish it from both neighbors (or from the one neighbor if n = 1 or n = m).
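A rough Python sketch of that construction, under the same Zipf assumption as above, and using ordinary floating point in place of the careful extra-precision step (fine for small m):

    from math import log

    def F(x, m):
        # Zipf assumption again: F(x) = ln(2x) / ln(2m+1)
        return log(2.0 * x) / log(2.0 * m + 1.0)

    def first_bits(x, k):
        # first k bits of the binary expansion of x, for 0 <= x < 1
        return format(int(x * 2 ** k), '0' + str(k) + 'b')

    def encode(n, m):
        # shortest prefix of F(n)'s expansion that differs from the
        # same-length prefix of each neighbor's expansion
        neighbors = []
        if n > 1:
            neighbors.append(F(n - 1, m))
        if n < m:
            neighbors.append(F(n + 1, m))
        target = F(n, m)
        k = 1
        while any(first_bits(v, k) == first_bits(target, k) for v in neighbors):
            k += 1
        return first_bits(target, k)

    # [encode(n, 8) for n in range(1, 9)] gives
    # ['00', '01', '1010', '1011', '110', '11100', '11101', '1111'],
    # which is the m = 8 code shown below.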
I believe the code you get that way uses the whole code space.
No, it does get skips (although never overlaps). Example (here m = 8):

    1  00
    2  01
    3  1010    << 100* is skipped
    4  1011
    5  110
    6  11100
    7  11101
    8  1111
It'll be fairly efficient, but not as good as a Huffman code.
Yes, particularly when there isn't a skip! Here are some numbers for small m:

    1 .. 2 :  2.15066010309  uniq_bin 0.901524769097  huffman 0.901524769097
    1 .. 3 :  2.15066010309  uniq_bin 0.982262472148  huffman 0.982262472148
    1 .. 4 :  2.15066010309  uniq_bin 0.996698112389  huffman 0.996698112389
    1 .. 5 :  2.15066010309  uniq_bin 0.973892050742  huffman 0.985497700151
    1 .. 6 :  2.15066010309  uniq_bin 0.981575348361  huffman 0.981575348361
    1 .. 7 :  2.15066010309  uniq_bin 0.981487394329  huffman 0.981487394329
    1 .. 8 :  2.15066010309  uniq_bin 0.888319512382  huffman 0.97949565162
                                      ^^^ this is the code above with a skip

...and some large m's:

    1 .. 16      :  uniq_bin 0.978314000279  huffman 0.988963192851
    1 .. 256     :  uniq_bin 0.99063060087   huffman 0.993030526135
    1 .. 4096    :  uniq_bin 0.994895289226  huffman 0.9972730518
    1 .. 65536   :  uniq_bin 0.994403026895  huffman 0.996899550792
    1 .. 1048576 :  uniq_bin 0.991023380916  huffman 0.997858670277

These are "comparing apples to apples" by using F(n+.5) - F(n-.5) as the probabilities for both methods.
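Here is a small Python sketch of how such a comparison can be computed, taking "efficiency" to mean entropy divided by expected code length (an assumption; the post doesn't define it), with p(n) = F(n+.5) - F(n-.5) under the same Zipf assumption, and the m = 8 code lengths hard-coded from the example above:

    from math import log
    import heapq

    def probs(m):
        # p(n) = F(n+.5) - F(n-.5) with F(x) = ln(2x) / ln(2m+1)
        Z = log(2.0 * m + 1.0)
        return [(log(2 * n + 1) - log(2 * n - 1)) / Z for n in range(1, m + 1)]

    def entropy(p):
        return -sum(q * log(q, 2) for q in p)

    def huffman_lengths(p):
        # codeword lengths of a Huffman code for probabilities p
        heap = [(q, [i]) for i, q in enumerate(p)]
        heapq.heapify(heap)
        lengths = [0] * len(p)
        while len(heap) > 1:
            qa, a = heapq.heappop(heap)
            qb, b = heapq.heappop(heap)
            for i in a + b:
                lengths[i] += 1
            heapq.heappush(heap, (qa + qb, a + b))
        return lengths

    def efficiency(p, lengths):
        return entropy(p) / sum(q * L for q, L in zip(p, lengths))

    p = probs(8)
    uniq_bin_lengths = [2, 2, 4, 4, 3, 5, 5, 4]   # from the m = 8 example
    print(efficiency(p, uniq_bin_lengths))        # ~0.888
    print(efficiency(p, huffman_lengths(p)))      # ~0.979

This roughly reproduces the 0.888... / 0.979... entries in the 1 .. 8 row, which at least suggests the assumed distribution and definition of efficiency are the intended ones.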
Another interesting thing to think about is that no finite-length code's codeword lengths can grow more slowly than log2(n) in the limit.
It has to be ever-so-slightly bigger than log2(n). Something like

    size(n) = log2(n) + size( log2(n) )

Notice that that goes berserk if you recur enough times, so it can't be right. It looks like Elias might have been the one to have thought this through, but maybe not.

The whole idea of "optimum" codes is a little strange: Huffman's proof that his method is optimum (among "prefix-free codes") is a sort of touchstone for entropy coders, but the differences in efficiency are like the .991 vs. .998 above. It's just that when you're analyzing an algorithm, once you know it's optimal, you don't have to think about that issue anymore.

The difference between one universal code that encodes a number like 2^(2^(2^(2^(2^(2^2))))) with

    2^(2^(2^(2^(2^2)))) + 2^(2^(2^(2^2))) + 2^(2^(2^2)) + 100

bits compared to one that uses

    2^(2^(2^(2^(2^2)))) + 2^(2^(2^(2^2))) + 2^(2^(2^2)) + 50

bits is hypernegligible. But fascinating.

But, granted that it's a perverse question when applied to coding efficiency, what's the mathematical question equivalent to asking for the absolutely most bit-squeezingly efficient code for super-super-super...astronomical numbers, after stripping away the veils of bits and efficiency? Something like: the slowest way to approach an asymptote, or the most nearly non-converging sequence?

--Steve
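(For reference, a minimal sketch of Elias's omega code, which makes essentially the size(n) = log2(n) + size(log2(n)) recursion precise -- assuming that's the construction being half-remembered above:)

    def elias_omega(n):
        """Elias omega code for a positive integer n."""
        assert n >= 1
        code = "0"                 # terminating zero
        while n > 1:
            b = bin(n)[2:]         # binary of n, leading 1 included
            code = b + code        # prepend it
            n = len(b) - 1         # then encode its length - 1 the same way
        return code

    # elias_omega(1)  -> '0'
    # elias_omega(2)  -> '100'
    # elias_omega(16) -> '10100100000'   (length ~ log2 + log2 log2 + ...)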