45
13
This challenge is about writing code to solve the following problem.
Given two strings A and B, your code should output the start and end indices of a substring of A with the following properties.
- The substring of A should also match some substring of B.
- There should be no longer substring of A that satisfies the first property.
For example:
A = xxxappleyyyyyyy
B = zapplezzz
The substring apple
with indices 4 8
(indexing from 1) would be a valid output.
Functionality
You can assume the input will be on standard in or in a file in the local directory, that is your choice. The file format will simply be two strings, separated by a new line. The answer should be a full program and not just a function.
I would eventually like to test your code on two substrings taken from the strings in http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ .
Score
This is code-golf with a twist. Your code must be run in O(n)
time, where n
is the total length of the input.
Languages and libraries
You can use any language which has a freely available compiler/interpreter/etc. for Linux. You should only use standard open source libraries not designed to solve this task. In case of dispute, I will count this as any library which either comes as standard with your language or one that you can install in a default ubuntu machine from a default repository.
Useful information
There are at least two ways to solve this problem in linear time. One is to first compute the suffix tree and the second is to first compute the suffix array and the LCP array.
- Here is a full and (perhaps over-)detailed explanation of linear time suffix tree construction (it turns out some of the figures are messed up unfortunately). There is also a very nice SO answer about linear time suffix tree construction at https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english . It includes a link to source code too. Another detailed explanation can be found here, this time giving a full solution in C.
- Section 2 of http://www.cs.cmu.edu/~guyb/realworld/papersS04/KaSa03.pdf gives a linear time suffix array construction algorithm and Appendix A has C++ source code. This answer tells you how then to compute the longest common substring https://cs.stackexchange.com/questions/9555/computing-the-longest-common-substring-of-two-strings-using-suffix-arrays . Section 5 of https://courses.csail.mit.edu/6.851/spring12/scribe/lec16.pdf which also has an associated video lecture https://courses.csail.mit.edu/6.851/spring12/lectures/L16.html also explains the same algorithm starting at 1:16:00.
4
O(n) time
Are you sure it is possible? – Savenkov Alexey – 2015-03-02T16:12:20.4402@Capture_A_Lag Yes! I know you can do it via a suffix tree or suffix array and maybe other ways too. – None – 2015-03-02T16:13:27.117
17@Lembik Sorry, but these are very complex algorithms and it's not really fun to golve 100+ lines of code. – FUZxxl – 2015-03-03T12:10:03.967
@FUZxxl I have updated the question with links to clear explanations with examples and useful source code. I hope that helps! – None – 2015-03-03T12:53:19.743
Not worth attempting. If you weren't worried about
O(n)
and planning to use huge strings to test, maybe then it would. – mbomb007 – 2015-03-05T19:26:37.530@mbomb007 I wasn't planning on testing with huge strings! I just meant the strings will be substrings of the huge strings I linked to. This was to provide example input to be helpful. – None – 2015-03-05T19:28:54.517
Still,
O(n)
is pretty discouraging. – mbomb007 – 2015-03-05T19:35:57.763@mbomb007 Surely not for 007 :) Would you be happy with n log n instead? I have tried to add lots of detail for the algorithms but can add more if that would help. – None – 2015-03-05T19:37:56.347
4The article at the second link you provide under "Useful information" says that "constructing [a suffix tree] requires O(N^2) time" – KSFT – 2015-03-06T21:46:01.120
@KSFT This part sets out the time and space complexity .... "In fact, the reduction in the number of nodes is such that the time and space requirements for constructing a suffix tree are reduced from O(N2) to O(N). " – None – 2015-03-07T09:55:02.713
Wow...I cannot understand any of those explanations. – KSFT – 2015-03-11T23:06:00.537
@KSFT The suffix tree ones or the suffix array ones? I can try to help if there is a particular point you are stuck on. – None – 2015-03-12T06:27:01.610
1I'm pretty close to a Java solution that works without any of the suffix tree brain surgery. I have been running tests simple strings and sentences. The time complexity of the solution is bounded well below O(n^2) but space complexity is high due to String objects and data structures required. I need to know these details before posting the solution: 1. Are you open to a Java solution? 2. If so, what is your machine config and JVM heap size? – Azee – 2015-05-19T19:13:32.213
Java is fine. I have a standard ubuntu set up with 8GB of RAM. – None – 2015-05-19T19:15:50.397
@Azee See Lembik's reply (you may not have got a notification because your answer was converted to a comment.) – Martin Ender – 2015-05-19T20:07:06.937
The first example file seems to have a lot of mistakes. It's difficult to separate my misinterpretation of the algorithm from the mistakes in the document... For example, Figure 6.1 has 3 misplaced $. Figure 6.2 has two extra $ and incorrect labelling of the top edge. Figure 6.4 is missing the final
b
on edge 2... and that's all the farther I've gotten so far. – mellamokb – 2015-06-14T00:25:22.550@mellamokb Oh yes... Gusfield is a world expert but apparently not at drawing figures! – None – 2015-06-14T19:39:19.310
@Agawa001 I am not sure 100% sure what you mean. Say you have two strings of length 10^6 each. What is the longest your idea could possibly take? – None – 2015-06-16T18:34:17.280
I have just about completed a C# implementation (not golfed) that implements the suffix tree. Still working out all the little bugs, which are fairly difficult to track down... I'm still not quite grasping how to convert the suffix tree code into a code that finds longest common substring. – mellamokb – 2015-06-18T05:06:26.657
I'm not smart enough to write an O(n), but I did like the idea so wrote up a working solution. var a = "xxxappleyyyyyyy"; var b = "zapplezzz"; (a,b)=>{ var len = Math.min(b.length,a.length); while(len > 0){ for(var i1 = 0; i1 < a.length - len; i1++){ for(var i2 = 0; i2 < b.length - len; i2++){ var str1 = a.substring(i1,i1+len); var str2 = b.substring(i2,i2+len); if(str1 === str2){ return {Index1: i1+1, Index2: i1 + len, String: str1}; } } } len-- } } – applejacks01 – 2016-07-08T19:18:15.893
3@Lembik You should just make the question [fastest-code], where the program with the best worst-case in big-oh notation wins. Then you'll at least get some answers, and in the even that someone can solve it in O(n), they'll win. – mbomb007 – 2016-07-11T18:37:24.663
@mbomb007 It's a good idea. Not sure how to supply the test input though. – None – 2016-07-13T15:01:08.597
Done a simple program for the answer, but its O(N^2). – Syamesh K – 2016-07-22T09:38:50.787
What alphabet will the input strings be drawn from?
[a-z]
?[a-zA-Z]
?[a-zA-Z0-9]
? – Gareth – 2017-01-04T21:17:53.460@Gareth [a-z] is ok. – None – 2017-01-04T22:28:29.147
9This must be the question with the most deleted answers per valid answer... – FlipTack – 2017-01-26T20:35:10.043
1@FlipTack: Agreed. I've protected it because it's attracting people who think "I can solve that" without reading the question, and my experience is that such questions tend end up needing protection eventually (additionally, the protection banner is, here, a clue that the question has obvious wrong answers). – None – 2017-02-07T02:50:26.473
1Isn't this always worst case
O(nlog(n))
? I mean, judging by the amount of attention obviously not, but can someone explain to me why? I've read the full wiki page on suffix trees and it only solidified my confusion on the worst case performance of suffix trees beingO(n)
... – Magic Octopus Urn – 2017-03-28T20:25:24.683