For arbitrarily long strings or texts, I use my own algorithm to measure their "distance". I choose a window size N (as in a "dissociated press" algorithm, see http://en.wikipedia.org/wiki/Dissociated_press) and count the number of occurrences of all N-character substrings in both source texts. The amount of agreement between the two texts is based on these counts. Balanced binary trees are used to keep track of all the counts, so the algorithm is O(n log n) given an input dataset size of n characters. I normalize the result to fall in the range (0.0...1.0). There are more details at this page: http://mrob.com/pub/math/sloandora/index.html which describes my search engine and exploration tool for the Sloane Integer Sequences database. Using this substring-counting metric (which I call a "character-level concordance metric") the exploration tool does a very good job of predicting which database entry the user should see given his ratings on previously-viewed entries. - Robert On Thu, Aug 26, 2010 at 15:24, Marc LeBrun <mlb@well.com> wrote:
A very rough kind of ³lexical distance² between strings (possibly infinite) is just the number of symbols by which they differ.
Is there an Official Name for this?
-- Robert Munafo -- mrob.com