Longest common substring in linear time

45

13

This challenge is about writing code to solve the following problem.

Given two strings A and B, your code should output the start and end indices of a substring of A with the following properties.

  • The substring of A should also match some substring of B.
  • There should be no longer substring of A that satisfies the first property.

For example:

A = xxxappleyyyyyyy

B = zapplezzz

The substring apple with indices 4 8 (indexing from 1) would be a valid output.

Functionality

You can assume the input will be on standard in or in a file in the local directory, that is your choice. The file format will simply be two strings, separated by a new line. The answer should be a full program and not just a function.

I would eventually like to test your code on two substrings taken from the strings in http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ .

Score

This is code-golf with a twist. Your code must be run in O(n) time, where n is the total length of the input.

Languages and libraries

You can use any language which has a freely available compiler/interpreter/etc. for Linux. You should only use standard open source libraries not designed to solve this task. In case of dispute, I will count this as any library which either comes as standard with your language or one that you can install in a default ubuntu machine from a default repository.

Useful information

There are at least two ways to solve this problem in linear time. One is to first compute the suffix tree and the second is to first compute the suffix array and the LCP array.

user9206

Posted 2015-03-02T16:04:07.520

Reputation:

4O(n) time Are you sure it is possible? – Savenkov Alexey – 2015-03-02T16:12:20.440

2@Capture_A_Lag Yes! I know you can do it via a suffix tree or suffix array and maybe other ways too. – None – 2015-03-02T16:13:27.117

17@Lembik Sorry, but these are very complex algorithms and it's not really fun to golve 100+ lines of code. – FUZxxl – 2015-03-03T12:10:03.967

@FUZxxl I have updated the question with links to clear explanations with examples and useful source code. I hope that helps! – None – 2015-03-03T12:53:19.743

Not worth attempting. If you weren't worried about O(n) and planning to use huge strings to test, maybe then it would. – mbomb007 – 2015-03-05T19:26:37.530

@mbomb007 I wasn't planning on testing with huge strings! I just meant the strings will be substrings of the huge strings I linked to. This was to provide example input to be helpful. – None – 2015-03-05T19:28:54.517

Still, O(n) is pretty discouraging. – mbomb007 – 2015-03-05T19:35:57.763

@mbomb007 Surely not for 007 :) Would you be happy with n log n instead? I have tried to add lots of detail for the algorithms but can add more if that would help. – None – 2015-03-05T19:37:56.347

4The article at the second link you provide under "Useful information" says that "constructing [a suffix tree] requires O(N^2) time" – KSFT – 2015-03-06T21:46:01.120

@KSFT This part sets out the time and space complexity .... "In fact, the reduction in the number of nodes is such that the time and space requirements for constructing a suffix tree are reduced from O(N2) to O(N). " – None – 2015-03-07T09:55:02.713

Wow...I cannot understand any of those explanations. – KSFT – 2015-03-11T23:06:00.537

@KSFT The suffix tree ones or the suffix array ones? I can try to help if there is a particular point you are stuck on. – None – 2015-03-12T06:27:01.610

1I'm pretty close to a Java solution that works without any of the suffix tree brain surgery. I have been running tests simple strings and sentences. The time complexity of the solution is bounded well below O(n^2) but space complexity is high due to String objects and data structures required. I need to know these details before posting the solution: 1. Are you open to a Java solution? 2. If so, what is your machine config and JVM heap size? – Azee – 2015-05-19T19:13:32.213

Java is fine. I have a standard ubuntu set up with 8GB of RAM. – None – 2015-05-19T19:15:50.397

@Azee See Lembik's reply (you may not have got a notification because your answer was converted to a comment.) – Martin Ender – 2015-05-19T20:07:06.937

The first example file seems to have a lot of mistakes. It's difficult to separate my misinterpretation of the algorithm from the mistakes in the document... For example, Figure 6.1 has 3 misplaced $. Figure 6.2 has two extra $ and incorrect labelling of the top edge. Figure 6.4 is missing the final b on edge 2... and that's all the farther I've gotten so far. – mellamokb – 2015-06-14T00:25:22.550

@mellamokb Oh yes... Gusfield is a world expert but apparently not at drawing figures! – None – 2015-06-14T19:39:19.310

@Agawa001 I am not sure 100% sure what you mean. Say you have two strings of length 10^6 each. What is the longest your idea could possibly take? – None – 2015-06-16T18:34:17.280

I have just about completed a C# implementation (not golfed) that implements the suffix tree. Still working out all the little bugs, which are fairly difficult to track down... I'm still not quite grasping how to convert the suffix tree code into a code that finds longest common substring. – mellamokb – 2015-06-18T05:06:26.657

I'm not smart enough to write an O(n), but I did like the idea so wrote up a working solution. var a = "xxxappleyyyyyyy"; var b = "zapplezzz"; (a,b)=>{ var len = Math.min(b.length,a.length); while(len > 0){ for(var i1 = 0; i1 < a.length - len; i1++){ for(var i2 = 0; i2 < b.length - len; i2++){ var str1 = a.substring(i1,i1+len); var str2 = b.substring(i2,i2+len); if(str1 === str2){ return {Index1: i1+1, Index2: i1 + len, String: str1}; } } } len-- } } – applejacks01 – 2016-07-08T19:18:15.893

3@Lembik You should just make the question [fastest-code], where the program with the best worst-case in big-oh notation wins. Then you'll at least get some answers, and in the even that someone can solve it in O(n), they'll win. – mbomb007 – 2016-07-11T18:37:24.663

@mbomb007 It's a good idea. Not sure how to supply the test input though. – None – 2016-07-13T15:01:08.597

Done a simple program for the answer, but its O(N^2). – Syamesh K – 2016-07-22T09:38:50.787

What alphabet will the input strings be drawn from? [a-z]? [a-zA-Z]? [a-zA-Z0-9]? – Gareth – 2017-01-04T21:17:53.460

@Gareth [a-z] is ok. – None – 2017-01-04T22:28:29.147

9This must be the question with the most deleted answers per valid answer... – FlipTack – 2017-01-26T20:35:10.043

1@FlipTack: Agreed. I've protected it because it's attracting people who think "I can solve that" without reading the question, and my experience is that such questions tend end up needing protection eventually (additionally, the protection banner is, here, a clue that the question has obvious wrong answers). – None – 2017-02-07T02:50:26.473

1Isn't this always worst case O(nlog(n))? I mean, judging by the amount of attention obviously not, but can someone explain to me why? I've read the full wiki page on suffix trees and it only solidified my confusion on the worst case performance of suffix trees being O(n)... – Magic Octopus Urn – 2017-03-28T20:25:24.683

Answers

39

Python 2, 646 bytes

G=range;w=raw_input;z=L,m,h=[0]*3
s=w();t=len(s);s+='!%s#'%w();u=len(s);I=z*u
def f(s,n):
 def r(o):
    b=[[]for _ in s];c=[]
    for x in B[:N]:b[s[x+o]]+=x,
    map(c.extend,b);B[:N]=c
 M=N=n--~n/3;t=n%3%2;B=G(n+t);del B[::3];r(2);u=m=p=r(1)>r(0);N-=n/3
 for x in B*1:v=s[x:x+3];m+=u<v;u=v;B[x/3+x%3/2*N]=m
 A=1/M*z or f(B+z,M)+z;B=[x*3for x in A if x<N];J=I[r(0):n];C=G(n)
 for k in C:b=A[t]/N;a=i,j=A[t]%N*3-~b,B[p];q=p<N<(s[i:i-~b],J[i/3+b+N-b*N])>(s[j+t/M:j-~b],J[j/3+b*N]);C[k]=x=a[q];I[x]=k;p+=q;t+=1-q
 return C
S=f(map(ord,s)+z*40,u)
for i in G(u):
 h-=h>0;j=S[I[i]-1]
 while s[i+h]==s[j+h]:h+=1
 if(i<t)==(t<j)<=h>m:m=h;L=min(i,j)
print-~L,L+m

This uses the skew algorithm described in "Simple Linear Work Suffix Array Construction" by Kärkkäinen and Sanders. The C++ implementation included in that paper already feels a little "golfy", but there is still plenty of room for making it shorter. For example, we can recurse until arriving at an array of length one, instead of short circuiting as in the paper, without violating the O(n) requirement.

For the LCP part, I followed "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications" by Kusai et al.

The program outputs 1 0 if the longest common substring is empty.

Here is some development code that includes an earlier version of the program that follows the C++ implementation a bit more closely, some slower approaches for comparison, and a simple test case generator:

from random import *

def brute(a,b):
    L=R=m=0

    for i in range(len(a)):
        for j in range(i+m+1,len(a)+1):
            if a[i:j] in b:
                m=j-i
                L,R=i,j

    return L+1,R

def suffix_array_slow(s):
    S=[]
    for i in range(len(s)):
        S+=[(s[i:],i)]
    S.sort()
    return [x[1] for x in S]

def slow1(a,b):
    # slow suffix array, slow lcp

    s=a+'!'+b
    S=suffix_array_slow(s)

    L=R=m=0

    for i in range(1,len(S)):
        x=S[i-1]
        y=S[i]
        p=s[x:]+'#'
        q=s[y:]+'$'
        h=0
        while p[h]==q[h]:
            h+=1
        if h>m and len(a)==sorted([x,y,len(a)])[1]:
            m=h
            L=min(x,y)
            R=L+h

    return L+1,R

def verify(a,b,L,R):
    if L<1 or R>len(a) or a[L-1:R] not in b:
        return 0
    LL,RR=brute(a,b)
    return R-L==RR-LL

def rand_string():
    if randint(0,1):
        n=randint(0,8)
    else:
        n=randint(0,24)
    a='zyxwvutsrq'[:randint(1,10)]
    s=''
    for _ in range(n):
        s+=choice(a)
    return s

def stress_test(f):
    numtrials=2000
    for trial in range(numtrials):
        a=rand_string()
        b=rand_string()
        L,R=f(a,b)
        if not verify(a,b,L,R):
            LL,RR=brute(a,b)
            print 'failed on',(a,b)
            print 'expected:',LL,RR
            print 'actual:',L,R
            return
    print 'ok'

def slow2(a,b):
    # slow suffix array, linear lcp

    s=a+'!'+b+'#'
    S=suffix_array_slow(s)

    I=S*1
    for i in range(len(S)):
        I[S[i]]=i

    L=R=m=h=0

    for i in range(len(S)):
        if I[i]:
            j=S[I[i]-1]
            while s[i+h]==s[j+h]:
                h+=1
            if h>m and len(a)==sorted([i,j,len(a)])[1]:
                m=h
                L=min(i,j)
                R=L+h
            h-=h>0

    return L+1,R

def suffix_array(s,K):
    # skew algorithm

    n=len(s)
    s+=[0]*3
    n0=(n+2)/3
    n1=(n+1)/3
    n2=n/3
    n02=n0+n2
    adj=n0-n1

    def radix_pass(a,o,n=n02):
        c=[0]*(K+3)
        for x in a[:n]:
            c[s[x+o]+1]+=1
        for i in range(K+3):
            c[i]+=c[i-1]
        for x in a[:n]:
            j=s[x+o]
            a[c[j]]=x
            c[j]+=1

    A=[x for x in range(n+adj) if x%3]+[0]*3

    radix_pass(A,2)
    radix_pass(A,1)
    radix_pass(A,0)

    B=[0]*n02
    t=m=0

    for x in A[:n02]:
        u=s[x:x+3]
        m+=t<u
        t=u
        B[x/3+x%3/2*n0]=m

    A[:n02]=1/n02*[0]or suffix_array(B,m)
    I=A*1
    for i in range(n02):
        I[A[i]]=i+1

    B=[3*x for x in A if x<n0]
    radix_pass(B,0,n0)

    R=[]

    p=0
    t=adj
    while t<n02:
        x=A[t]
        b=x>=n0
        i=(x-b*n0)*3-~b
        j=B[p]
        if p==n0 or ((s[i:i+2],I[A[t]-n0+1])<(s[j:j+2],I[j/3+n0]) if b else (s[i],I[A[t]+n0])<(s[j],I[j/3])):R+=i,;t+=1
        else:R+=j,;p+=1

    return R+B[p:n0]

def solve(a,b):
    # linear

    s=a+'!'+b+'#'
    S=suffix_array(map(ord,s),128)

    I=S*1
    for i in range(len(S)):
        I[S[i]]=i

    L=R=m=h=0

    for i in range(len(S)):
        if I[i]:
            j=S[I[i]-1]
            while s[i+h]==s[j+h]:
                h+=1
            if h>m and len(a)==sorted([i,j,len(a)])[1]:
                m=h
                L=min(i,j)
                R=L+h
            h-=h>0

    return L+1,R

stress_test(solve)

Mitch Schwartz

Posted 2015-03-02T16:04:07.520

Reputation: 4 899

This looks very promising! – None – 2017-01-25T09:01:49.847

1

Correct me if I'm wrong, but isn't this actually 739 bytes? I copied into https://mothereff.in/byte-counter and deleted 2 spaces from lines 6-9, but I'm not sure if that's correct.

– Patrick Roberts – 2017-01-26T00:34:51.900

2@PatrickRoberts Those are tabs. – Mitch Schwartz – 2017-01-26T01:12:19.240

2Nice answer! You might want to have a look at GSACA a novel linear time SACA from 2016. Reference implementation is 246 lines full of comments (170 without comments) and seems very golfable. You'll find it on github. – Christoph – 2017-02-01T09:15:41.377

We clearly need a version in Jelly now :) – None – 2017-02-14T15:55:10.883

I think that 0--n/3 can be n/3. – Yytsi – 2017-02-17T09:09:02.817

@TuukkaX Out of curiosity, do you remember if you were experiencing any specific emotions when you wrote that comment? Or, what was the mindset? – Mitch Schwartz – 2017-02-17T12:24:49.707

1@MitchSchwartz I'm currently trying to stay on noPMO, so I can't feel emotions strongly right now (probably due to inbalanced brain chemicals). At the time of reading the code quickly, my syntax golfing motor spotted that, and I don't remember feeling any specific emotions. Did you think of the same thing or why the question? :) Now I'm curious. – Yytsi – 2017-02-17T12:35:24.890

1@TuukkaX That's an interesting response that I wasn't expecting. Well, I'm not sure if I should be phrasing this in some special way, but the fact that your original comment wasn't actually correct played some part in why I decided to ask. :) – Mitch Schwartz – 2017-02-26T02:50:44.170

what is w and what to input in the python2 solution? why still using python2 as it outdates now? – Josef Klotzner – 2019-08-02T18:31:36.977

oh - understood ... you replaced "raw_input" by "w" ... then it is clear. this is not clever:

  1. whenever i "enter" a line i close and save the comment, but did not want to ....
  2. as i was not able to put my question into comments, i wrote it here as "aswer", for which i don't need any stupid whatsoever points, but ...

as i now answered my question myself it is useless to be placed here. ... No way to delete it .... sorry for inconvenience – Josef Klotzner – 2019-08-02T18:33:59.383

1@Ing.JosefKlotzner Python 2 is shorter than Python 3. We don't care about being up-to-date here, we care about better scores. For this question ([tag:code-golf]), a shorter answer is better. Python 2 scores better much of the time. Also, newlines are not possible in comments. – mbomb007 – 2019-08-02T19:49:25.567