[math-fun] Re: number of longest common substrings problem

12 Jun 2004

      On Fri, 11 Jun 2004, N. J. A. Sloane wrote:
...
Given two strings X and Y of length n from some fixed
alphabet, let f(X,Y) = number of longest common substrings of X and Y.
What is a(n) = max f(X,Y) over all choices of X and Y?
There seem to be many different ways to interpret this problem. In 
particular internet seaches indicate there is a difference between common 
substrings and common subsequences. From Neil's description it seems he is 
referring to what some call common "subsequences".

Maple, for example, in its StringTools package has the two procedures
LongestCommonSubString(s1,s2) and LongestCommonSubSequence(s1,s2)
where s1 and s2 are strings.

These have the following descriptions (which seem to agree with other 
definitions found by googling around.)

--------------------------------------------------------------------
A substring of a string S is a contiguous sequence of the characters 
appearing in S. The empty string is a substring of every string. A 
subsequence of a string S is a sequence of characters from S, which may 
not be contiguous in S. Every substring of S is a subsequence of S. For 
example, "bc" is a substring of "abc", and "ac" is a subsequence of "abc" 
which is not a substring. 
The LongestCommonSubString(s1, s2) command returns from its input strings, 
s1 and s2, a common substring of maximum length. 
Many common substrings of maximum length may exist. Which among the 
candidates is returned depends upon the suffix structure of the pair of 
strings, but is deterministic. 
The LongestCommonSubSequence(s1, s2) command is similar, but searches for 
subsequences of the pair of input strings rather than substrings. 
--------------------------------------------------------------------

Once we know what a longest common subsequence or substring is, there is 
the problem of how to count them: with multiplicities or not?

Anyhow, I did one of the simplest interpretations, presumably NOT one that 
Aho intended, nevertheless perhaps new and of interest:

I'm counting (contiguous) substrings and not counting multiplicity:

What I did was to find the set CS(s1,s2) of common substrings (in the 
above sense) for binary strings s1 and s2 of length n and count the number 
A(s1,s2) of strings in CS(s1,s2) that are the longest. Then a(n) is the 
max of A(s1,s2) over all such strings. If I haven't made an error 
this gives the sequence a(n), n = 1,...,11:

1, 2, 2, 2, 3, 3, 4, 6, 6, 7, 7

For example the fact that a(7) = 4 is given by the two strings
s = 0001011,t = 0011010 for which the set of maximum length 
common substrings is {011,001,101,010} and the fact that no other pair of 
strings of length 7 has more than 4 common maximum length substrings.

Apparently this sequence is not is the OEIS. But it seems simple enough 
that maybe someone can find a formula for it. And maybe someone using 
something faster than Maple can compute more terms even if no one can 
figure out a formula.

--Edwin

[math-fun] Re: number of longest common substrings problem

Edwin Clark