Veit Elser <ve10@cornell.edu> wrote:
> 1. Extract sequences of letters, including spaces (as word separators), from actual text.

I tried that decades ago. It picks up far too much junk: proper names, abbreviations, acronyms, jargon, computer codes, ham radio codes, misspelled words, foreign words, misspelled foreign words, etc. If I were to try it again today, now that lots of people send HTML email, it would probably tell me that "msonormal" was the most common English word.

I've tried searching online for word lists that give each word's frequency. In a perverse equivalent of Gödel's theorem, it appears that every such list is either incomplete or contains trash. For instance, http://norvig.com/ngrams/count_1w.txt starts promisingly:

the 23135851162
of 13151942776
and 12997637966
to 12136980858
a 9081174698
in 8469404971
for 5933321709
is 4705743816

but if, for instance, I search it for the anagrams of "post", I get:

post 392956436
stop 77749471
spot 26750929
tops 11771127
pots 3854743
opts 662207
tsop 205591
tpos 43379
ostp 41988
ptos 38390
otps 23858
ptso 21839

Any idea where I can find a clean and complete list?

Thanks.
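
For anyone who wants to repeat the anagram check, here is a minimal Python sketch of one way it might be done. It assumes the list has one whitespace-separated "word count" pair per line, which is how the excerpts above look; the function names parse_counts and anagrams are hypothetical, and the sample data is just the counts quoted above plus one non-anagram line, not the full file.

```python
# Sample lines copied from the counts quoted above. The real file is assumed
# (not verified here) to use the same "word count" layout on every line.
SAMPLE = """\
the 23135851162
post 392956436
stop 77749471
spot 26750929
tops 11771127
pots 3854743
opts 662207
tsop 205591
tpos 43379
ostp 41988
ptos 38390
otps 23858
ptso 21839
"""


def parse_counts(lines):
    """Yield (word, count) pairs, skipping lines that do not fit the layout."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            yield parts[0], int(parts[1])


def anagrams(lines, target):
    """Return (word, count) pairs using exactly the letters of target,
    most frequent first."""
    key = sorted(target.lower())
    hits = [(w, c) for w, c in parse_counts(lines) if sorted(w.lower()) == key]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for word, count in anagrams(SAMPLE.splitlines(), "post"):
        print(word, count)
```

To run it against a downloaded copy of the list, pass an open file object instead of SAMPLE.splitlines(). Note that this only exposes the junk entries; it does nothing to separate "tsop" from "tops".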