Hi Steve & David:
I duplicated Steve's results with my own version of Huffman encoding,
so I agree with his results 100%.
David's suggestion is slightly inferior to pure Huffman.
Huffman gives outstanding efficiency for these Zipf distributions --
e.g., 99.67% efficient for the first 16,384 elements of the series 1/i
(i=1 16384). So it's a little hard to argue for the need for a more
efficient encoding.
The reason for this exceptional efficiency seems to be that Zipf is
a very smooth distribution. Here its "long tail" actually helps, since
any bad encoding of the most frequent 2 or 3 symbols won't affect the
overall efficiency very much.
However, even the simplest encoding (14 bits) is 71% efficient, and
appears to be in the 70% range even for the first million elements
of the harmonic series. Once again, the long tail helps, because the
bulk of the tail is reasonbly efficiently encoded.
An interesting connection between Huffman and Egyptian fractions is the inner
step of Huffman:
Sort the symbols into ascending order of probabilities.
Take the symbols with the two smallest probabilities and "merge" them,
thus adding their probabilities. Then bubble this probability up the
list until it is sorted again. (This is obviously not the most efficient
algorithm; I'm focussed on the probability numbers themselves.) We are
obviously building up sums from Egyptian fractions (probabilities not
normalized), with the topmost sum being H_n (nth harmonic number).
At 12:17 AM 6/17/2006, Steve Witham wrote:
>Henry,
>
>Thanks for an interesting question.
>Lots of web pages mention Zipf and Huffman in the same sentence,
>but no one seems to have asked just what you asked before.
>
>Dusted off and was forced to debug a Huffman-coding routine.
>Here it is generating codes for various populations of thingies.
>The digits indicate the number of bits to represent each thingy.
>Line 16 matches the example I gave last night; my
>hand-calculations were close enough.
>
>Line 15 seems perfect. It represents an ideal of what normally
>(or at least sometimes) happens, but I don't think it ever happens
>quite that way again. But, I'm not actually sure, more exploring
>to do.
>
> --Steve
>
>1: 0
>2: 11
>3: 122
>4: 1233
>5: 12344
>6: 133344
>7: 1334444
>8: 22334444
>9: 223344455
>10: 2234444455
>11: 22344445555
>12: 233344445555
>13: 2333444555555
>14: 23344444555555
>15: 233444455555555
>16: 2334444555555566
>17: 23344445555556666
>18: 233444555555556666
>19: 2334445555555666666
>20: 23344455555566666666
>21: 233444555556666666666
>22: 2334455555556666666666
>23: 23344555555666666666666
>24: 234444555555666666666666
>25: 2344445555556666666666677
>26: 23444455555666666666666677
>27: 234444555556666666666667777
>28: 2344455555556666666666667777
>29: 23444555555566666666666777777
>30: 234445555556666666666666777777
>31: 2344455555566666666666677777777
>32: 23444555555666666666667777777777
>33: 234445555556666666666777777777777
>34: 2344455555666666666666777777777777
>35: 23444555556666666666677777777777777
>36: 234445555566666666667777777777777777
>37: 2344455555666666666777777777777777777
>38: 23444555566666666666777777777777777777
>39: 234445555666666666677777777777777777777
>40: 2344555555666666666677777777777777777777
>41: 23445555556666666666777777777777777777788
>42: 234455555566666666677777777777777777777788
>43: 2344555555666666666777777777777777777778888
>44: 23445555566666666666777777777777777777778888
>45: 234455555666666666667777777777777777777888888
>46: 2344555556666666666777777777777777777777888888
>47: 23445555566666666667777777777777777777788888888
>48: 234455555666666666677777777777777777778888888888
>49: 2344555556666666666777777777777777777888888888888
>50: 23445555566666666677777777777777777777888888888888
>51: 234455555666666666777777777777777777788888888888888
>52: 2344555556666666667777777777777777778888888888888888
>53: 23445555566666666677777777777777777888888888888888888
>54: 234455555666666667777777777777777777888888888888888888
>55: 2344555556666666677777777777777777788888888888888888888
>56: 23445555666666666677777777777777777788888888888888888888
>57: 234455556666666666777777777777777778888888888888888888888
>58: 2344555566666666677777777777777777778888888888888888888888
>59: 23445555666666666777777777777777777888888888888888888888888
>60: 234455556666666667777777777777777788888888888888888888888888
>61: 2344555566666666677777777777777778888888888888888888888888888
>62: 23445555666666667777777777777777778888888888888888888888888888
>63: 234455556666666677777777777777777888888888888888888888888888888
>64: 2344555566666666777777777777777788888888888888888888888888888888
>65: 23445555666666667777777777777777888888888888888888888888888888899
>66: 234455556666666677777777777777788888888888888888888888888888888899
>67: 2344555566666666777777777777777888888888888888888888888888888889999
>68: 23445555666666677777777777777777888888888888888888888888888888889999
>69: 234455556666666777777777777777778888888888888888888888888888888999999
>70: 2344555566666667777777777777777888888888888888888888888888888888999999
>71: 23445555666666677777777777777778888888888888888888888888888888899999999