[math-fun] Zipf, codes & cumulative probability
f'(n) is a relative probability distribution; f''(x) is the analogous continuous function, almost a probability density; F''(x) is the corresponding almost-cumulative probability function. (F'' is the integral of f''.)
Let F(x) be F''(x) adjusted so that F(.5) = 0 and F( m + .5 ) = 1. Let f(x) be dF(x)/dx, and let G(y) be the inverse of F(x).
There's an argument for making f(n) = F(n+.5) - F(n-.5), since that's the approximate chunk of code space this method gives to n. One version of my method needs to invert f(), though, and I'm not sure that's always so easy. (If I thought this would all be as easy (& applicable to as many kinds of distributions) with differences and sums instead of derivatives and integrals, I'd do that, since it's more correct.)
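To make this concrete, here is a small Python sketch of F, f, and G. It assumes the underlying relative distribution is Zipf's f'(n) = 1/n -- the post never says so explicitly, but that assumption makes the m = 8 example further down come out right:

    # Sketch, assuming the Zipf case: f'(n) = 1/n, so f''(x) = 1/x and
    # F''(x) = ln(x) + C.  Rescaling so that F(.5) = 0 and F(m+.5) = 1
    # gives F(x) = ln(2x) / ln(2m+1).
    from math import log

    def F(x, m):
        # cumulative function, 0 at x = .5 and 1 at x = m + .5
        return log(2.0 * x) / log(2.0 * m + 1.0)

    def f(x, m):
        # density dF/dx
        return 1.0 / (x * log(2.0 * m + 1.0))

    def G(y, m):
        # inverse of F: G(F(x)) = x
        return (2.0 * m + 1.0) ** y / 2.0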
Here's an existence proof of a code. [...] To encode n, generate F(n-1) (if n > 1), F(n), and F(n+1) (if n < m) to at least a couple more bits of precision than -log2(f(n)) would call for. Use just enough bits of F(n) to distinguish it from both neighbors (or from the one neighbor if n = 1 or n = m).
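A rough Python sketch of that construction, under the same Zipf assumption as above, and using ordinary floating point in place of the careful extra-precision step (fine for small m):

    from math import log

    def F(x, m):
        # Zipf assumption again: F(x) = ln(2x) / ln(2m+1)
        return log(2.0 * x) / log(2.0 * m + 1.0)

    def first_bits(x, k):
        # first k bits of the binary expansion of x, for 0 <= x < 1
        return format(int(x * 2 ** k), '0' + str(k) + 'b')

    def encode(n, m):
        # shortest prefix of F(n)'s expansion that differs from the
        # same-length prefix of each neighbor's expansion
        neighbors = []
        if n > 1:
            neighbors.append(F(n - 1, m))
        if n < m:
            neighbors.append(F(n + 1, m))
        target = F(n, m)
        k = 1
        while any(first_bits(v, k) == first_bits(target, k) for v in neighbors):
            k += 1
        return first_bits(target, k)

    # [encode(n, 8) for n in range(1, 9)] gives
    # ['00', '01', '1010', '1011', '110', '11100', '11101', '1111'],
    # which is the m = 8 code shown below.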
I believe the code you get that way uses the whole code space.
No, it does get skips (although never overlaps). Example (here m = 8):

    1  00
    2  01
    3  1010    << 100* is skipped
    4  1011
    5  110
    6  11100
    7  11101
    8  1111
It'll be fairly efficient, but not as good as a Huffman code.
Yes, particularly when there isn't a skip! Here are some numbers for small m:

    1 .. 2 :  2.15066010309  uniq_bin 0.901524769097  huffman 0.901524769097
    1 .. 3 :  2.15066010309  uniq_bin 0.982262472148  huffman 0.982262472148
    1 .. 4 :  2.15066010309  uniq_bin 0.996698112389  huffman 0.996698112389
    1 .. 5 :  2.15066010309  uniq_bin 0.973892050742  huffman 0.985497700151
    1 .. 6 :  2.15066010309  uniq_bin 0.981575348361  huffman 0.981575348361
    1 .. 7 :  2.15066010309  uniq_bin 0.981487394329  huffman 0.981487394329
    1 .. 8 :  2.15066010309  uniq_bin 0.888319512382  huffman 0.97949565162
                                      ^^^ this is the code above with a skip

...and some large m's:

    1 .. 16      :  uniq_bin 0.978314000279  huffman 0.988963192851
    1 .. 256     :  uniq_bin 0.99063060087   huffman 0.993030526135
    1 .. 4096    :  uniq_bin 0.994895289226  huffman 0.9972730518
    1 .. 65536   :  uniq_bin 0.994403026895  huffman 0.996899550792
    1 .. 1048576 :  uniq_bin 0.991023380916  huffman 0.997858670277

These are "comparing apples to apples" by using F(n+.5) - F(n-.5) as the probabilities for both methods.
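Here is a small Python sketch of how such a comparison can be computed, taking "efficiency" to mean entropy divided by expected code length (an assumption; the post doesn't define it), with p(n) = F(n+.5) - F(n-.5) under the same Zipf assumption, and the m = 8 code lengths hard-coded from the example above:

    from math import log
    import heapq

    def probs(m):
        # p(n) = F(n+.5) - F(n-.5) with F(x) = ln(2x) / ln(2m+1)
        Z = log(2.0 * m + 1.0)
        return [(log(2 * n + 1) - log(2 * n - 1)) / Z for n in range(1, m + 1)]

    def entropy(p):
        return -sum(q * log(q, 2) for q in p)

    def huffman_lengths(p):
        # codeword lengths of a Huffman code for probabilities p
        heap = [(q, [i]) for i, q in enumerate(p)]
        heapq.heapify(heap)
        lengths = [0] * len(p)
        while len(heap) > 1:
            qa, a = heapq.heappop(heap)
            qb, b = heapq.heappop(heap)
            for i in a + b:
                lengths[i] += 1
            heapq.heappush(heap, (qa + qb, a + b))
        return lengths

    def efficiency(p, lengths):
        return entropy(p) / sum(q * L for q, L in zip(p, lengths))

    p = probs(8)
    uniq_bin_lengths = [2, 2, 4, 4, 3, 5, 5, 4]   # from the m = 8 example
    print(efficiency(p, uniq_bin_lengths))        # ~0.888
    print(efficiency(p, huffman_lengths(p)))      # ~0.979

This roughly reproduces the 0.888... / 0.979... entries in the 1 .. 8 row, which at least suggests the assumed distribution and definition of efficiency are the intended ones.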
Another interesting thing to think about is that no finite-length code's codeword lengths can grow more slowly than log2(n) in the limit.
It has to be ever-so-slightly bigger than log2(n). Something like

    size(n) = log2(n) + size( log2(n) )

Notice that that goes berserk if you recur enough times, so it can't be right. It looks like Elias might have been the one to have thought this through, but maybe not.

The whole idea of "optimum" codes is a little strange: Huffman's proof that his method is optimum (among "prefix-free codes") is a sort of touchstone for entropy coders, but the differences in efficiency are like the .991 vs. .998 above. It's just that when you're analyzing an algorithm, once you know it's optimal, you don't have to think about that issue anymore.

The difference between one universal code that encodes a number like 2^(2^(2^(2^(2^(2^2))))) with

    2^(2^(2^(2^(2^2)))) + 2^(2^(2^(2^2))) + 2^(2^(2^2)) + 100

bits compared to one that uses

    2^(2^(2^(2^(2^2)))) + 2^(2^(2^(2^2))) + 2^(2^(2^2)) + 50

bits is hypernegligible. But fascinating.

But, granted that it's a perverse question when applied to coding efficiency, what's the mathematical question equivalent to asking for the absolutely most bit-squeezingly efficient code for super-super-super...astronomical numbers, after stripping away the veils of bits and efficiency? Something like: the slowest way to approach an asymptote, or the most nearly non-converging sequence?

--Steve
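(For reference, a minimal sketch of Elias's omega code, which makes essentially the size(n) = log2(n) + size(log2(n)) recursion precise -- assuming that's the construction being half-remembered above:)

    def elias_omega(n):
        """Elias omega code for a positive integer n."""
        assert n >= 1
        code = "0"                 # terminating zero
        while n > 1:
            b = bin(n)[2:]         # binary of n, leading 1 included
            code = b + code        # prepend it
            n = len(b) - 1         # then encode its length - 1 the same way
        return code

    # elias_omega(1)  -> '0'
    # elias_omega(2)  -> '100'
    # elias_omega(16) -> '10100100000'   (length ~ log2 + log2 log2 + ...)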