[math-fun] Re: More ZipfHuffing
While Huffman encoding does extremely well on Zipf distributions, with the efficiency increasing towards 100% as the size of the symbol set increases, the problem is that the encoding is particular to the size of the symbol set.

David Eppstein proposed an encoding which is independent of the number of symbols: the symbol of relative frequency 1/i (non-normalized) is encoded in 2*il(i)-1 bits. E.g., the symbol of relative frequency (r.f.) 1/1 is encoded in 1 bit, that of r.f. 1/2 in 3 bits, and that of r.f. 1/3 in 3 bits. Here, "il(i)" means "integer-length(i)", the function in Common Lisp that counts the number of bits in the base-2 representation of i. Every code word in this Eppstein encoding thus has an _odd_ number of bits, so it is not the highest-efficiency encoding.

It is nevertheless pretty good: from 2 to 15 symbols, the efficiency increases from 55% to 89%. From 16 symbols onward, however, the efficiency drops, albeit very slowly. For 2^20 symbols the efficiency falls below 70%, to 69.6%, which is only slightly better than a straightforward block encoding of 20 bits per symbol. Asymptotically, this encoding appears to tend to 50% efficiency -- i.e., twice as many bits as should be necessary.

Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the efficiency tends towards 100% asymptotically?
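For concreteness, here is a minimal Python sketch of this efficiency calculation (the function names are mine, and the figure in the final comment is the one quoted above; the length 2*il(i)-1 is the same length profile as the classical Elias gamma code):

from math import log2

def il(i):
    # Common Lisp integer-length: number of bits in the base-2 form of i
    return i.bit_length()

def eppstein_bits(i):
    # code length 2*il(i)-1: 1 bit for i=1, 3 bits for i=2..3, 5 bits for i=4..7, ...
    return 2 * il(i) - 1

def zipf_efficiency(n, bits):
    # Entropy of the Zipf distribution p_i proportional to 1/i over n symbols,
    # divided by the expected code length under the length function `bits`.
    h = sum(1.0 / i for i in range(1, n + 1))        # harmonic normalizer H_n
    p = [1.0 / (i * h) for i in range(1, n + 1)]
    entropy = -sum(q * log2(q) for q in p)
    avg_len = sum(q * bits(i) for i, q in enumerate(p, start=1))
    return entropy / avg_len

print(zipf_efficiency(2**20, eppstein_bits))         # approx 0.696, as quoted above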
I'm not sure exactly how you define efficiency, but I'm pretty sure that this is impossible. If the first k values use half the available code space, then as n goes to infinity that half of the code space is spent on values carrying an asymptotically vanishing fraction -- O(1/log(n)) -- of the probability distribution. I don't see how this can be anything but inefficient.

Franklin T. Adams-Watters
Franklin, Steve, David, et al:

After a little thinking, I reinvented ternary encoding (or arbitrary radix encoding for radices of the form 2^k-1). The idea is similar to normal decimal notation, except that we use a radix of 3 or 7 or 15 or ... The reason for not using binary encoding is that we have to encode a "comma" of some sort to delimit the end of the integer, so we use up one of the "digits" for this purpose. The larger the radix, the smaller the coding loss.

Here are ternary encodings for small integers, writing the comma as "_"; each of the 4 characters 0, 1, 2, _ takes 2 bits:

1    1_      4 bits
2    2_      4 bits
3    10_     6 bits
4    11_     6 bits
5    12_     6 bits
6    20_     6 bits
7    21_     6 bits
8    22_     6 bits
9    100_    8 bits
...

Actually, since the first digit is always 1 or 2, we can subtract 1 bit from the (ternary) encoding without any loss. The calculation below does not make this optimization.

For a Zipf alphabet of 2^20 symbols, this ternary encoding scheme gets an efficiency of 88.4%; a radix-7 (septenary) scheme gets an efficiency of 97%. Once again, the smoothness of the Zipf distribution allows us to ignore some of the inefficiencies in the first few symbols.
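In the same Python setting as the sketch above (reusing zipf_efficiency; the helper below is my own illustration, without the first-digit optimization):

def comma_code_bits(i, radix):
    # Base-`radix` digits of i, plus one terminating "comma" digit.
    # With radix = 2**k - 1, the radix+1 characters pack exactly into k bits.
    bits_per_char = radix.bit_length()      # k, since radix = 2**k - 1
    n_digits = 0
    while i > 0:
        n_digits += 1
        i //= radix
    return (n_digits + 1) * bits_per_char   # +1 character for the comma

print(zipf_efficiency(2**20, lambda i: comma_code_bits(i, 3)))  # approx 0.884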
You can improve on this using Fibonacci (or Zeckendorf) encoding -- every string ends in 11. So we have:

1    11        2 bits
2    011       3 bits
3    0011      4 bits
4    1011      4 bits
5    00011     5 bits
6    10011     5 bits
7    01011     5 bits
8    000011    6 bits
etc.

Note, by the way, that changing the radix as the number of symbols increases is *not* using the same encoding regardless of the number of symbols.

Franklin T. Adams-Watters
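A minimal Python sketch of this encoder (my illustration, again paired with zipf_efficiency from the first sketch; the greedy digit selection is the standard Zeckendorf construction):

def fibonacci_code(i):
    # Fibonacci numbers 1, 2, 3, 5, 8, ... up to the largest one <= i.
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= i:
        fibs.append(fibs[-1] + fibs[-2])
    while fibs[-1] > i:
        fibs.pop()
    # Greedy Zeckendorf representation, least-significant digit first,
    # followed by a final 1 so that every code word ends in "11".
    bits = ['0'] * len(fibs)
    for j in range(len(fibs) - 1, -1, -1):
        if fibs[j] <= i:
            bits[j] = '1'
            i -= fibs[j]
    return ''.join(bits) + '1'

for i in range(1, 9):
    print(i, fibonacci_code(i))   # reproduces the table above
print(zipf_efficiency(2**20, lambda i: len(fibonacci_code(i))))  # approx 0.89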
I ran some tests of Fibonacci/Zeckendorf vs. ternary and septenary (w/optimization) encoding on Zipf distributions of n symbols:

Coding efficiency:

n       Fib      Tern     Sept
2^1     39.4%    30.6%    18.4%
2^2     64.0%    50.4%    35.8%
2^3     78.0%    67.2%    49.5%
2^4     86.0%    75.5%    58.4%
2^5     90.6%    82.6%    67.0%
2^6     92.7%    86.9%    73.4%
2^7     93.8%    89.6%    77.3%
2^8     94.2%    92.0%    81.7%
2^9     94.2%    93.0%    84.5%
2^10    94.0%    93.9%    86.6%
2^11    93.5%    94.7%    89.1%
2^12    93.1%    94.8%    90.3%
2^13    92.6%    95.1%    91.6%
2^14    92.1%    95.2%    93.3%
2^15    91.5%    95.1%    93.8%
2^16    91.0%    95.2%    94.6%
2^17    90.5%    95.1%    95.6%
2^18    90.0%    94.9%    95.9%
2^19    89.6%    94.9%    96.5%
2^20    89.1%    94.6%    97.0%

For large n, Fibonacci encoding takes approximately log2(n)/log2(phi) bits, ternary encoding takes approximately log2(n)/log2(sqrt(3)) bits, and septenary encoding takes approximately log2(n)/log2(7^(1/3)) bits. Since phi = 1.618..., sqrt(3) = 1.732..., and 7^(1/3) = 1.913..., ternary encoding will eventually beat Fibonacci encoding, and septenary encoding will eventually beat ternary encoding. However, as you pointed out, none of them can approach 100% efficiency: for large enough n, the efficiency of each tends down to log2 of its effective base (e.g., log2(phi), about 69%, for Fibonacci).

There are, however, codes whose lengths grow as log2(n)+log2(log2(n))+log2(log2(log2(n)))+..., and these do not lose efficiency as n becomes larger.
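One classical code with exactly this growth rate is the Elias omega code (my example; the thread does not name one). A sketch in the same Python setting -- though note that at n = 2^20 it is still less efficient than the ternary scheme above, since its advantage is only asymptotic:

def elias_omega(i):
    # Recursively prefix the binary form of i with the binary form of its
    # length minus 1, and so on; a final '0' terminates the chain of groups.
    code = '0'
    while i > 1:
        b = bin(i)[2:]        # binary form of i, which always starts with a 1
        code = b + code
        i = len(b) - 1
    return code

print(elias_omega(1), elias_omega(2), elias_omega(4))   # 0 100 101000
print(zipf_efficiency(2**20, lambda i: len(elias_omega(i))))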
participants (2): franktaw@netscape.net, Henry Baker