Fastest Longest Common Subsequence Finder

11

1

Your task is to solve the Longest Common Subsequence problem for n strings of length 1000.

A valid solution to the LCS problem for two or more strings S1, … Sn is any string T of maximal length such that the characters of T appear in all Si, in the same order as in T.

Note that T does not have to be a substring of Si.

We've already solved this problem in the shortest amount of code. This time, size doesn't matter.

Example

The strings axbycz and xaybzc have 8 common subsequences of length 3:

abc abz ayc ayz xbc xbz xyc xyz

Any of these would be a valid solution for the LCS problem.

Details

Write a full program that solves the LCS problem, as explained above, abiding the following rules:

  • The input will consist of two or more strings of length 1000, consisting of ASCII characters with code points between 0x30 and 0x3F.

  • You have to read the input from STDIN.

    You have two choices for the input format:

    • Each string (including the last) is followed by a linefeed.

    • The strings are chained together with no separator and no trailing linefeed.

  • The number of strings will be passed as a command-line parameter to your program.

  • You have to write the output, i.e., any one of the valid solutions to the LCS, to STDOUT, followed by one linefeed.

  • Your language of choice has to have a free (as in beer) compiler/interpreter for my operating system (Fedora 21).

  • If you require any compiler flags or a specific interpreter, please mention it in your post.

Scoring

I will run your code with 2, 3, etc. strings until it takes longer than 120 seconds to print a valid solution. This means that you have 120 seconds for each value of n.

The highest amount of strings for which your code finished in time is your score.

In the event of a tied score of n, the submission that solved the problem for n strings in the shortest time will be declared the winner.

All submissions will be timed on my machine (Intel Core i7-3770, 16 GiB RAM, no swap).

The n strings of the (n-1)th test will be generated by calling rand n (and stripping the linefeeds, if requested), where rand is defined as follows:

rand()
{
    head -c$[500*$1] /dev/zero |
    openssl enc -aes-128-ctr -K 0 -iv $1 |
    xxd -c500 -ps |
    tr 'a-f' ':-?'
}

The key is 0 in the above code, but I reserve the right to change it to an undisclosed value if I suspect anybody of (partially) hardcoding the output.

Dennis

Posted 2015-07-17T18:41:17.290

Reputation: 196 637

Can we throw exceptions? – HyperNeutrino – 2015-08-07T20:49:14.647

@JamesSmith As long as the output is correct, sure. – Dennis – 2015-08-07T20:55:56.997

Since i'm reading with bufferedreader, can i throw ioexception from public static void main(...)? – HyperNeutrino – 2015-08-07T20:56:55.043

@JamesSmith I don't really know Java, so I don't know what that is, but don't worry about exceptions. – Dennis – 2015-08-07T21:04:14.773

4@JamesSmith Since code length does not matter for this challenge, can't you simply catch the exceptions? – Reto Koradi – 2015-08-08T06:05:46.597

I could, though I prefer to leave it thrown. – HyperNeutrino – 2015-08-08T21:20:04.060

Answers

5

C, n = 3 in ~7 seconds

Algorithm

The algorithm is a direct generalization of the standard dynamic programming solution to n sequences. For 2 strings A and B, the standard recurrence looks like this:

L(p, q) = 1 + L(p - 1, q - 1)           if A[p] == B[q]
        = max(L(p - 1, q), L(p, q - 1)) otherwise

For 3 strings A, B, C I use:

L(p, q, r) = 1 + L(p - 1, q - 1, r - 1)                          if A[p] == B[q] == C[r]
           = max(L(p - 1, q, r), L(p, q - 1, r), L(p, q, r - 1)) otherwise

The code implements this logic for arbitrary values of n.

Efficiency

The complexity of my code is O(s ^ n), with s the length of the strings. Based on what I found, it looks like the problem is NP-complete. So while the posted algorithm is very inefficient for larger values of n, it might actually not be possible to do massively better. The only thing I saw are some approaches that improve efficiency for small alphabets. Since the alphabet is moderately small here (16), that could lead to an improvement. I still predict that nobody will find a legitimate solution that goes higher than n = 4 in 2 minutes, and n = 4 looks ambitious already.

I reduced the memory usage in the initial implementation, so that it could handle n = 4 given enough time. But it only produced the length of the sequence, not the sequence itself. Check the revision history of this post for seeing that code.

Code

Since loops over n-dimensional matrices require more logic than fixed loops, I'm using a fixed loop for the lowest dimension, and only use the generic looping logic for the remaining dimensions.

#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define N_MAX 4

int main(int argc, char* argv[]) {
    int nSeq = argc - 1;
    if (nSeq > N_MAX) {
        nSeq = N_MAX;
    }

    char** seqA = argv + 1;

    uint64_t totLen = 1;
    uint64_t lenA[N_MAX] = {0};
    uint64_t offsA[N_MAX] = {1};
    uint64_t offsSum = 0;
    uint64_t posA[N_MAX] = {0};

    for (int iSeq = 0; iSeq < nSeq; ++iSeq) {
        lenA[iSeq] = strlen(seqA[iSeq]);
        totLen *= lenA[iSeq] + 1;

        if (iSeq + 1 < nSeq) {
            offsA[iSeq + 1] = totLen;
        }

        offsSum += offsA[iSeq];
    }

    uint16_t* mat = calloc(totLen, 2);
    uint64_t idx = offsSum;

    for (;;) {
        for (uint64_t pos0 = 0; pos0 < lenA[0]; ++pos0) {
            char firstCh = seqA[0][pos0];
            int isSame = 1;
            uint16_t maxVal = mat[idx - 1];

            for (int iSeq = 1; iSeq < nSeq; ++iSeq) {
                char ch = seqA[iSeq][posA[iSeq]];
                isSame &= (ch == firstCh);

                uint16_t val = mat[idx - offsA[iSeq]];
                if (val > maxVal) {
                    maxVal = val;
                }
            }

            if (isSame) {
                mat[idx] = mat[idx - offsSum] + 1;
            } else {
                mat[idx] = maxVal;
            }

            ++idx;
        }

        idx -= lenA[0];

        int iSeq = 1;
        while (iSeq < nSeq && posA[iSeq] == lenA[iSeq] - 1) {
            posA[iSeq] = 0;
            idx -= (lenA[iSeq] - 1) * offsA[iSeq];
            ++iSeq;
        }
        if (iSeq == nSeq) {
            break;
        }

        ++posA[iSeq];
        idx += offsA[iSeq];
    }

    int score = mat[totLen - 1];

    char* resStr = malloc(score + 1);
    resStr[score] = '\0';

    for (int iSeq = 0; iSeq < nSeq; ++iSeq) {
        posA[iSeq] = lenA[iSeq] - 1;
    }

    idx = totLen - 1;
    int resIdx = score - 1;

    while (resIdx >= 0) {
        char firstCh = seqA[0][posA[0]];
        int isSame = 1;
        uint16_t maxVal = mat[idx - 1];
        int maxIdx = 0;

        for (int iSeq = 1; iSeq < nSeq; ++iSeq) {
            char ch = seqA[iSeq][posA[iSeq]];
            isSame &= (ch == firstCh);

            uint16_t val = mat[idx - offsA[iSeq]];
            if (val > maxVal) {
                maxVal = val;
                maxIdx = iSeq;
            }
        }

        if (isSame) {
            resStr[resIdx--] = firstCh;
            for (int iSeq = 0; iSeq < nSeq; ++iSeq) {
                --posA[iSeq];
            }
            idx -= offsSum;
        } else {
            --posA[maxIdx];
            idx -= offsA[maxIdx];
        }
    }

    free(mat);

    printf("score: %d\n", score);
    printf("%s\n", resStr);

    return 0;
}

Instructions for Running

To run:

  • Save the code in a file, e.g. lcs.c.
  • Compile with high optimization options. I used:

    clang -O3 lcs.c
    

    On Linux, I would try:

    gcc -Ofast lcs.c
    
  • Run with 2 to 4 sequences given as command line arguments:

    ./a.out axbycz xaybzc
    

    Single quote the command line arguments if necessary, since the alphabet used for the examples contains shell special characters.

Results

test2.sh and test3.sh are the test sequences from Dennis. I don't know the correct results, but the output looks at least plausible.

$ ./a.out axbycz xaybzc
score: 3
abc

$ time ./test2.sh 
score: 391
16638018802020>3??3232270178;47240;24331395?<=;99=?;178675;866002==23?87?>978891>=9<6<9381992>>7030829?255>6804:=3>:;60<9384=081;0:?66=51?0;5090724<85?>>:2>7>3175?::<9199;5=0:494<5<:7100?=95<91>1887>33>67710==;48<<327::>?78>77<6:2:02:<7=5?:;>97<993;57=<<=:2=9:8=118563808=962473<::8<816<464<1==925<:5:22?>3;65=>=;27?7:5>==3=4>>5>:107:20575347>=48;<7971<<245<??219=3991=<96<??735698;62?<98928

real  0m0.012s
user  0m0.008s
sys   0m0.003s

$ time ./test3.sh 
score: 269
662:2=2::=6;738395=7:=179=96662649<<;?82?=668;2?603<74:6;2=04=>6;=6>=121>1>;3=22=3=3;=3344>0;5=7>>7:75238;559133;;392<69=<778>3?593?=>9799<1>79827??6145?7<>?389?8<;;133=505>2703>02?323>;693995:=0;;;064>62<0=<97536342603>;?92034<?7:=;2?054310176?=98;5437=;13898748==<<?4

real  0m7.218s
user  0m6.701s
sys   0m0.513s

Reto Koradi

Posted 2015-07-17T18:41:17.290

Reputation: 4 870

Apologies if that wasn't clear, but you have to print the LCS, not just its length. – Dennis – 2015-08-14T21:07:54.257

@Dennis I see. Some of my optimizations were in vain then. I'll have to go back to a version that stores the full matrix so that I can reconstruct the string. That won't be able to run for n=4, but should still finish below 10 seconds for n=3. I think I was around 6-7 seconds when I still had the full matrix. – Reto Koradi – 2015-08-14T21:34:08.990

Again, sorry. The question wasn't very clear about this... When you post your output, I'll be able to compare it to BrainSteel's. The length your program reports exceeds the length of his output by 5 for n=2. By the way, I had to define N_MAX as a macro and add the compiler flag -std=c99 to compile your code with GCC. – Dennis – 2015-08-14T21:38:00.940

@Dennis No problem. It said that the solution "is a string", so that should have been clear enough. I almost exclusively use C++, so I'm never sure what's allowed in C. This code started out as C++, but once I realized that I wasn't really using any C++ features, I switched it over to C. clang on my Mac was happy with it, but it probably uses a different C version by default, or is just more lenient. – Reto Koradi – 2015-08-14T22:08:06.173

1@Dennis Ok, I added the traceback logic so that I can produce the string. Takes about 7 seconds now for n=3. – Reto Koradi – 2015-08-15T17:28:58.463

3

This answer is currently broken due to a bug. Fixing soon...

C, 2 strings in ~35 seconds

This is very much a work in progress (as shown by the horrible messiness), but hopefully it sparks some good answers!

The code:

#include "stdlib.h"
#include "string.h"
#include "stdio.h"
#include "time.h"

/* For the versatility */
#define MIN_CODEPOINT 0x30
#define MAX_CODEPOINT 0x3F
#define NUM_CODEPOINT (MAX_CODEPOINT - MIN_CODEPOINT + 1)
#define CTOI(c) (c - MIN_CODEPOINT)

#define SIZE_ARRAY(x) (sizeof(x) / sizeof(*x))

int LCS(char** str, int num);
int getshared(char** str, int num);
int strcount(char* str, char c);

int main(int argc, char** argv) {
    char** str = NULL;
    int num = 0;
    int need_free = 0;
    if (argc > 1) {
        str = &argv[1];
        num = argc - 1;
    }
    else {
        scanf(" %d", &num);
        str = malloc(sizeof(char*) * num);
        if (!str) {
            printf("Allocation error!\n");
            return 1;
        }

        int i;
        for (i = 0; i < num; i++) {
            /* No string will exceed 1000 characters */
            str[i] = malloc(1001);
            if (!str[i]) {
                printf("Allocation error!\n");
                return 1;
            }

            scanf(" %1000s", str[i]);

            str[i][1000] = '\0';
        }

        need_free = 1;
    }

    clock_t start = clock();

    /* The call to LCS */
    int size = LCS(str, num);

    double dt = ((double)(clock() - start)) / CLOCKS_PER_SEC;
    printf("Size: %d\n", size);
    printf("Elapsed time on LCS call: %lf s\n", dt);

    if (need_free) {
        int i;
        for (i = 0; i < num; i++) {
            free(str[i]);
        }
        free(str);
    }

    return 0;
}

/* Some terribly ugly global variables... */
int depth, maximum, mapsize, lenmap[999999][2];
char* (strmap[999999][20]);
char outputstr[1000];

/* Solves the LCS problem on num strings str of lengths len */
int LCS(char** str, int num) {
    /* Counting variables */
    int i, j;

    if (depth + getshared(str, num) <= maximum) {
        return 0;
    }

    char* replace[num];
    char match;
    char best_match = 0;
    int earliest_set = 0;
    int earliest[num];
    int all_late;
    int all_early;
    int future;
    int best_future = 0;
    int need_future = 1;

    for (j = 0; j < mapsize; j++) {
        for (i = 0; i < num; i++)
            if (str[i] != strmap[j][i])
                break;
        if (i == num) {
            best_match = lenmap[j][0];
            best_future = lenmap[j][1];
            need_future = 0;
            if (best_future + depth < maximum || !best_match)
                goto J;
            else {
                match = best_match;
                goto K;
            }
        }
    }

    for (match = MIN_CODEPOINT; need_future && match <= MAX_CODEPOINT; match++) {

    K:
        future = 1;
        all_late = earliest_set;
        all_early = 1;
        for (i = 0; i < num; replace[i++]++) {
            replace[i] = strchr(str[i], match);
            if (!replace[i]) {
                future = 0;
                break;
            }

            if (all_early && earliest_set && replace[i] - str[i] > earliest[i])
                all_early = 0;
            if (all_late && replace[i] - str[i] < earliest[i])
                all_late = 0;
        }
        if (all_late) {
            future = 0;
        }

    I:
        if (future) {
            if (all_early || !earliest_set) {
                for (i = 0; i < num; i++) {
                    earliest[i] = (int)(replace[i] - str[i]);
                }
            }

            /* The recursive bit */
            depth++;
            future += LCS(replace, num);
            depth--;

            best_future = future > best_future ? (best_match = match), future : best_future;
        }
    }

    if (need_future) {
        for (i = 0; i < num; i++)
            strmap[mapsize][i] = str[i];
        lenmap[mapsize][0] = best_match;
        lenmap[mapsize++][1] = best_future;
    }

J:
    if (depth + best_future >= maximum) {
        maximum = depth + best_future;
        outputstr[depth] = best_match;
    }

    if (!depth) {
        mapsize = 0;
        maximum = 0;
        puts(outputstr);
    }

    return best_future;
}

/* Return the number of characters total (not necessarily in order) that the strings all share */
int getshared(char** str, int num) {
    int c, i, tot = 0, min;
    for (c = MIN_CODEPOINT; c <= MAX_CODEPOINT; c++) {
        min = strcount(str[0], c);
        for (i = 1; i < num; i++) {
            int count = strcount(str[i], c);
            if (count < min) {
                min = count;
            }
        }
        tot += min;
    }

    return tot;
}

/* Count the number of occurrences of c in str */
int strcount(char* str, char c) {
    int tot = 0;
    while ((str = strchr(str, c))) {
        str++, tot++;
    }
    return tot;
}

The relevant function that performs all of the LCS computation is the function LCS. The code above will time its own call to this function.

Save as main.c and compile with: gcc -Ofast main.c -o FLCS

The code can be ran either with command line arguments or through stdin. When using stdin, it expects a number of strings followed by the strings themselves.

~ Me$ ./FLCS "12345" "23456"
2345
Size: 4
Elapsed time on LCS call: 0.000056 s

Or:

~ Me$ ./FLCS
6 
2341582491248123139182371298371239813
2348273123412983476192387461293472793
1234123948719873491234120348723412349
1234129388234888234812834881423412373
1111111112341234128469128377293477377
1234691237419274737912387476371777273
1241231212323
Size: 13
Elapsed time on LCS call: 0.001594 s

On a Mac OS X box with a 1.7Ghz Intel Core i7 and the test case Dennis provided, we get the following output for 2 strings:

16638018800200>3??32322701784=4240;24331395?<;=99=?;178675;866002==23?87?>978891>=9<66=381992>>7030829?25265804:=3>:;60<9384=081;08?66=51?0;509072488>2>924>>>3175?::<9199;330:494<51:>748571?153994<45<??20>=3991=<962508?7<2382?489
Size: 386
Elapsed time on LCS call: 33.245087 s

The approach is very similar to my approach to the earlier challenge, here. In addition to previous optimization, we now check for a total number of shared characters between strings at every recursion and exit early if there's no way to get a longer substring than what already exists.

For now, it handles 2 strings okay but tends to get crash-y on more. More improvements and a better explanation to come!

BrainSteel

Posted 2015-07-17T18:41:17.290

Reputation: 5 132

1I think I have missed something. With 2 strings isn't this a classic dynamic programming problem that should take about 1000^2 steps to solve? In other words, a fraction of a second. – None – 2015-08-09T20:13:32.463

@Lembik Yeah, it should. This method was built for handling many more than 2 strings, but it ended up scaling too poorly with string length for it to have good results. I've got many more tricks up my sleeve, and if any of them actually work... Things should improve immensely. – BrainSteel – 2015-08-09T21:05:26.287

There seems to be a problem somewhere. @RetoKoradi's code finds a valid common substring of length 391 for n=2, while your code reports a length of 386 and prints a string of length 229. – Dennis – 2015-08-16T04:46:08.673

@Dennis Umm... Yeah, yeah it does... Oh dear. Well, this is embarrassing. I'm working on it :) I'll edit the answer to reflect the bug. – BrainSteel – 2015-08-16T13:08:42.197