C, n = 3 in ~7 seconds
Algorithm
The algorithm is a direct generalization of the standard dynamic programming solution to n
sequences. For 2 strings A
and B
, the standard recurrence looks like this:
L(p, q) = 1 + L(p - 1, q - 1) if A[p] == B[q]
= max(L(p - 1, q), L(p, q - 1)) otherwise
For 3 strings A
, B
, C
I use:
L(p, q, r) = 1 + L(p - 1, q - 1, r - 1) if A[p] == B[q] == C[r]
= max(L(p - 1, q, r), L(p, q - 1, r), L(p, q, r - 1)) otherwise
The code implements this logic for arbitrary values of n
.
Efficiency
The complexity of my code is O(s ^ n), with s
the length of the strings. Based on what I found, it looks like the problem is NP-complete. So while the posted algorithm is very inefficient for larger values of n
, it might actually not be possible to do massively better. The only thing I saw are some approaches that improve efficiency for small alphabets. Since the alphabet is moderately small here (16), that could lead to an improvement. I still predict that nobody will find a legitimate solution that goes higher than n = 4
in 2 minutes, and n = 4
looks ambitious already.
I reduced the memory usage in the initial implementation, so that it could handle n = 4
given enough time. But it only produced the length of the sequence, not the sequence itself. Check the revision history of this post for seeing that code.
Code
Since loops over n-dimensional matrices require more logic than fixed loops, I'm using a fixed loop for the lowest dimension, and only use the generic looping logic for the remaining dimensions.
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#define N_MAX 4
int main(int argc, char* argv[]) {
int nSeq = argc - 1;
if (nSeq > N_MAX) {
nSeq = N_MAX;
}
char** seqA = argv + 1;
uint64_t totLen = 1;
uint64_t lenA[N_MAX] = {0};
uint64_t offsA[N_MAX] = {1};
uint64_t offsSum = 0;
uint64_t posA[N_MAX] = {0};
for (int iSeq = 0; iSeq < nSeq; ++iSeq) {
lenA[iSeq] = strlen(seqA[iSeq]);
totLen *= lenA[iSeq] + 1;
if (iSeq + 1 < nSeq) {
offsA[iSeq + 1] = totLen;
}
offsSum += offsA[iSeq];
}
uint16_t* mat = calloc(totLen, 2);
uint64_t idx = offsSum;
for (;;) {
for (uint64_t pos0 = 0; pos0 < lenA[0]; ++pos0) {
char firstCh = seqA[0][pos0];
int isSame = 1;
uint16_t maxVal = mat[idx - 1];
for (int iSeq = 1; iSeq < nSeq; ++iSeq) {
char ch = seqA[iSeq][posA[iSeq]];
isSame &= (ch == firstCh);
uint16_t val = mat[idx - offsA[iSeq]];
if (val > maxVal) {
maxVal = val;
}
}
if (isSame) {
mat[idx] = mat[idx - offsSum] + 1;
} else {
mat[idx] = maxVal;
}
++idx;
}
idx -= lenA[0];
int iSeq = 1;
while (iSeq < nSeq && posA[iSeq] == lenA[iSeq] - 1) {
posA[iSeq] = 0;
idx -= (lenA[iSeq] - 1) * offsA[iSeq];
++iSeq;
}
if (iSeq == nSeq) {
break;
}
++posA[iSeq];
idx += offsA[iSeq];
}
int score = mat[totLen - 1];
char* resStr = malloc(score + 1);
resStr[score] = '\0';
for (int iSeq = 0; iSeq < nSeq; ++iSeq) {
posA[iSeq] = lenA[iSeq] - 1;
}
idx = totLen - 1;
int resIdx = score - 1;
while (resIdx >= 0) {
char firstCh = seqA[0][posA[0]];
int isSame = 1;
uint16_t maxVal = mat[idx - 1];
int maxIdx = 0;
for (int iSeq = 1; iSeq < nSeq; ++iSeq) {
char ch = seqA[iSeq][posA[iSeq]];
isSame &= (ch == firstCh);
uint16_t val = mat[idx - offsA[iSeq]];
if (val > maxVal) {
maxVal = val;
maxIdx = iSeq;
}
}
if (isSame) {
resStr[resIdx--] = firstCh;
for (int iSeq = 0; iSeq < nSeq; ++iSeq) {
--posA[iSeq];
}
idx -= offsSum;
} else {
--posA[maxIdx];
idx -= offsA[maxIdx];
}
}
free(mat);
printf("score: %d\n", score);
printf("%s\n", resStr);
return 0;
}
Instructions for Running
To run:
- Save the code in a file, e.g.
lcs.c
.
Compile with high optimization options. I used:
clang -O3 lcs.c
On Linux, I would try:
gcc -Ofast lcs.c
Run with 2 to 4 sequences given as command line arguments:
./a.out axbycz xaybzc
Single quote the command line arguments if necessary, since the alphabet used for the examples contains shell special characters.
Results
test2.sh
and test3.sh
are the test sequences from Dennis. I don't know the correct results, but the output looks at least plausible.
$ ./a.out axbycz xaybzc
score: 3
abc
$ time ./test2.sh
score: 391
16638018802020>3??3232270178;47240;24331395?<=;99=?;178675;866002==23?87?>978891>=9<6<9381992>>7030829?255>6804:=3>:;60<9384=081;0:?66=51?0;5090724<85?>>:2>7>3175?::<9199;5=0:494<5<:7100?=95<91>1887>33>67710==;48<<327::>?78>77<6:2:02:<7=5?:;>97<993;57=<<=:2=9:8=118563808=962473<::8<816<464<1==925<:5:22?>3;65=>=;27?7:5>==3=4>>5>:107:20575347>=48;<7971<<245<??219=3991=<96<??735698;62?<98928
real 0m0.012s
user 0m0.008s
sys 0m0.003s
$ time ./test3.sh
score: 269
662:2=2::=6;738395=7:=179=96662649<<;?82?=668;2?603<74:6;2=04=>6;=6>=121>1>;3=22=3=3;=3344>0;5=7>>7:75238;559133;;392<69=<778>3?593?=>9799<1>79827??6145?7<>?389?8<;;133=505>2703>02?323>;693995:=0;;;064>62<0=<97536342603>;?92034<?7:=;2?054310176?=98;5437=;13898748==<<?4
real 0m7.218s
user 0m6.701s
sys 0m0.513s
Can we throw exceptions? – HyperNeutrino – 2015-08-07T20:49:14.647
@JamesSmith As long as the output is correct, sure. – Dennis – 2015-08-07T20:55:56.997
Since i'm reading with bufferedreader, can i throw ioexception from
public static void main(...)
? – HyperNeutrino – 2015-08-07T20:56:55.043@JamesSmith I don't really know Java, so I don't know what that is, but don't worry about exceptions. – Dennis – 2015-08-07T21:04:14.773
4@JamesSmith Since code length does not matter for this challenge, can't you simply catch the exceptions? – Reto Koradi – 2015-08-08T06:05:46.597
I could, though I prefer to leave it thrown. – HyperNeutrino – 2015-08-08T21:20:04.060