[math-fun] Re: More ZipfHuffing
While Huffman encoding does extremely well on Zipf distributions, with the efficiency increasing towards 100% as the size of the symbol set increases, the problem is that the encoding is particular to the size of the symbol set.

David Eppstein proposed an encoding which is independent of the number of symbols: the symbol of relative frequency 1/i (non-normalized) is encoded in 2*il(i)-1 bits. E.g., the symbol of relative frequency (r.f.) 1/1 is encoded in 1 bit, that of r.f. 1/2 in 3 bits, and that of r.f. 1/3 in 3 bits. Here, "il(i)" means "integer-length(i)", the function in Common Lisp that counts the number of bits in the base-2 representation of i. Every code word in this Eppstein encoding thus has an _odd_ number of bits, so it is not the highest-efficiency encoding.

It is nevertheless pretty good: from 2 to 15 symbols, the efficiency increases from 55% to 89%. From 16 symbols onward, however, the efficiency drops, albeit very slowly. For 2^20 symbols the efficiency falls below 70%, to 69.6%, which is only slightly better than a straightforward block encoding of 20 bits per symbol. Asymptotically, this encoding appears to tend to 50% efficiency -- i.e., twice as many bits as should be necessary.

Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the efficiency tends towards 100% asymptotically?
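For concreteness, here is a minimal Python sketch of this efficiency calculation (the function names are mine, and the figure in the final comment is the one quoted above; the length 2*il(i)-1 is the same length profile as the classical Elias gamma code):

from math import log2

def il(i):
    # Common Lisp integer-length: number of bits in the base-2 form of i
    return i.bit_length()

def eppstein_bits(i):
    # code length 2*il(i)-1: 1 bit for i=1, 3 bits for i=2..3, 5 bits for i=4..7, ...
    return 2 * il(i) - 1

def zipf_efficiency(n, bits):
    # Entropy of the Zipf distribution p_i proportional to 1/i over n symbols,
    # divided by the expected code length under the length function `bits`.
    h = sum(1.0 / i for i in range(1, n + 1))        # harmonic normalizer H_n
    p = [1.0 / (i * h) for i in range(1, n + 1)]
    entropy = -sum(q * log2(q) for q in p)
    avg_len = sum(q * bits(i) for i, q in enumerate(p, start=1))
    return entropy / avg_len

print(zipf_efficiency(2**20, eppstein_bits))         # approx 0.696, as quoted above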
I'm not sure exactly how you define efficiency, but I'm pretty sure that this is impossible. If the first k values use half the available code space, then as n goes to infinity that half of the code space is spent on values carrying an asymptotically vanishing fraction -- O(1/log(n)) -- of the probability distribution. I don't see how this can be anything but inefficient.

Franklin T. Adams-Watters
Franklin, Steve, David, et al:

After a little thinking, I reinvented ternary encoding (or arbitrary radix encoding for radices of the form 2^k-1). The idea is similar to normal decimal notation, except that we use a radix of 3 or 7 or 15 or ... The reason for not using binary encoding is that we have to encode a "comma" of some sort to delimit the end of the integer, so we use up one of the "digits" for this purpose. The larger the radix, the smaller the coding loss.

Here are ternary encodings for small integers, writing the comma as "_"; each of the 4 characters 0, 1, 2, _ takes 2 bits:

1    1_      4 bits
2    2_      4 bits
3    10_     6 bits
4    11_     6 bits
5    12_     6 bits
6    20_     6 bits
7    21_     6 bits
8    22_     6 bits
9    100_    8 bits
...

Actually, since the first digit is always 1 or 2, we can subtract 1 bit from the (ternary) encoding without any loss. The calculation below does not make this optimization.

For a Zipf alphabet of 2^20 symbols, this ternary encoding scheme gets an efficiency of 88.4%; a radix-7 (septenary) scheme gets an efficiency of 97%. Once again, the smoothness of the Zipf distribution allows us to ignore some of the inefficiencies in the first few symbols.
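In the same Python setting as the sketch above (reusing zipf_efficiency; the helper below is my own illustration, without the first-digit optimization):

def comma_code_bits(i, radix):
    # Base-`radix` digits of i, plus one terminating "comma" digit.
    # With radix = 2**k - 1, the radix+1 characters pack exactly into k bits.
    bits_per_char = radix.bit_length()      # k, since radix = 2**k - 1
    n_digits = 0
    while i > 0:
        n_digits += 1
        i //= radix
    return (n_digits + 1) * bits_per_char   # +1 character for the comma

print(zipf_efficiency(2**20, lambda i: comma_code_bits(i, 3)))  # approx 0.884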
You can improve on this using Fibonacci (or Zeckendorf) encoding -- every string ends in 11. So we have:

1    11        2 bits
2    011       3 bits
3    0011      4 bits
4    1011      4 bits
5    00011     5 bits
6    10011     5 bits
7    01011     5 bits
8    000011    6 bits
etc.

Note, by the way, that changing the radix as the number of symbols increases is *not* using the same encoding regardless of the number of symbols.

Franklin T. Adams-Watters
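A minimal Python sketch of this encoder (my illustration, again paired with zipf_efficiency from the first sketch; the greedy digit selection is the standard Zeckendorf construction):

def fibonacci_code(i):
    # Fibonacci numbers 1, 2, 3, 5, 8, ... up to the largest one <= i.
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= i:
        fibs.append(fibs[-1] + fibs[-2])
    while fibs[-1] > i:
        fibs.pop()
    # Greedy Zeckendorf representation, least-significant digit first,
    # followed by a final 1 so that every code word ends in "11".
    bits = ['0'] * len(fibs)
    for j in range(len(fibs) - 1, -1, -1):
        if fibs[j] <= i:
            bits[j] = '1'
            i -= fibs[j]
    return ''.join(bits) + '1'

for i in range(1, 9):
    print(i, fibonacci_code(i))   # reproduces the table above
print(zipf_efficiency(2**20, lambda i: len(fibonacci_code(i))))  # approx 0.89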
I ran some tests of Fibonacci/Zeckendorf vs. ternary and septenary (w/optimization) encoding on Zipf distributions of n symbols:

Coding efficiency:

n       Fib      Tern     Sept
2^1     39.4%    30.6%    18.4%
2^2     64.0%    50.4%    35.8%
2^3     78.0%    67.2%    49.5%
2^4     86.0%    75.5%    58.4%
2^5     90.6%    82.6%    67.0%
2^6     92.7%    86.9%    73.4%
2^7     93.8%    89.6%    77.3%
2^8     94.2%    92.0%    81.7%
2^9     94.2%    93.0%    84.5%
2^10    94.0%    93.9%    86.6%
2^11    93.5%    94.7%    89.1%
2^12    93.1%    94.8%    90.3%
2^13    92.6%    95.1%    91.6%
2^14    92.1%    95.2%    93.3%
2^15    91.5%    95.1%    93.8%
2^16    91.0%    95.2%    94.6%
2^17    90.5%    95.1%    95.6%
2^18    90.0%    94.9%    95.9%
2^19    89.6%    94.9%    96.5%
2^20    89.1%    94.6%    97.0%

For large n, Fibonacci encoding takes approximately log2(n)/log2(phi) bits, ternary encoding takes approximately log2(n)/log2(sqrt(3)) bits, and septenary encoding takes approximately log2(n)/log2(7^(1/3)) bits. Since phi = 1.618..., sqrt(3) = 1.732..., and 7^(1/3) = 1.913..., ternary encoding will eventually beat Fibonacci encoding, and septenary encoding will eventually beat ternary encoding. However, as you pointed out, none of them can approach 100% efficiency: for large enough n, the efficiency of each tends down to log2 of its effective base (e.g., log2(phi), about 69%, for Fibonacci).

There are, however, codes whose lengths grow as log2(n)+log2(log2(n))+log2(log2(log2(n)))+..., and these do not lose efficiency as n becomes larger.
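One classical code with exactly this growth rate is the Elias omega code (my example; the thread does not name one). A sketch in the same Python setting -- though note that at n = 2^20 it is still less efficient than the ternary scheme above, since its advantage is only asymptotic:

def elias_omega(i):
    # Recursively prefix the binary form of i with the binary form of its
    # length minus 1, and so on; a final '0' terminates the chain of groups.
    code = '0'
    while i > 1:
        b = bin(i)[2:]        # binary form of i, which always starts with a 1
        code = b + code
        i = len(b) - 1
    return code

print(elias_omega(1), elias_omega(2), elias_omega(4))   # 0 100 101000
print(zipf_efficiency(2**20, lambda i: len(elias_omega(i))))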
participants (2): franktaw@netscape.net, Henry Baker