Re: [math-fun] Lexical distance?

27 Aug 2010

      For arbitrarily long strings or texts, I use my own algorithm to measure
their "distance". I choose a window size N (as in a "dissociated press"
algorithm, see http://en.wikipedia.org/wiki/Dissociated_press) and count the
number of occurrences of all N-character substrings in both source texts.
The amount of agreement between the two texts is based on these counts.
Balanced binary trees are used to keep track of all the counts, so the
algorithm is O(n log n) given an input dataset size of n characters. I
normalize the result to fall in the range (0.0...1.0).

There are more details at this page:
http://mrob.com/pub/math/sloandora/index.html which describes my search
engine and exploration tool for the Sloane Integer Sequences database. Using
this substring-counting metric (which I call a "character-level concordance
metric") the exploration tool does a very good job of predicting which
database entry the user should see given his ratings on previously-viewed
entries.

- Robert

On Thu, Aug 26, 2010 at 15:24, Marc LeBrun <mlb@well.com> wrote:
...
A very rough kind of ³lexical distance² between strings (possibly infinite)
is just the number of symbols by which they differ.
Is there an Official Name for this?
-- 
  Robert Munafo  --  mrob.com