C++ - an improvement over FUZxxl's solution
I deserve absolutely no credit for the computation method itself, and if no better approach shows up in the meantime, the bounty should by rights go to FUZxxl.
#define _CRT_SECURE_NO_WARNINGS // a Microsoft thing about strcpy security issues
#include <vector>
#include <string>
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstring>
#include <cstdio> // for the printf used in the final output
using namespace std;
#include "divsufsort.h"
// graceful exit of sorts
void panic(const char * msg)
{
cerr << msg;
exit(1); // nonzero status: the program failed
}
// approximate timing of the various steps
struct tTimer {
time_t begin;
tTimer() { begin = time(NULL); }
void print(const char * msg)
{
time_t now = time(NULL);
cerr << msg << " in " << now - begin << "s\n";
begin = now;
}
};
// load input pattern
unsigned char * read_sequence (const char * filename, int& len)
{
ifstream file(filename);
if (!file) panic("could not open file");
string str, line;
while (getline(file, line)) str += line; // concatenate lines, dropping the newlines
unsigned char * res = new unsigned char[str.length() + 1];
len = str.length()+1;
strcpy((char *)res, str.c_str());
return res;
}
#ifdef FUZXXL_METHOD
/*
* Compute for all k the number of k-mers. kmers will contain at index i the
* number of (i + 1) mers. The count is computed as an array of differences,
* where kmers[i] == kmersum[i] - kmersum[i-1] + 1 and then summed up by the
* caller. This algorithm is a little bit unclear, but when you write subsequent
* suffixes of an array on a piece of paper, it's easy to see how and why it
* works.
*/
static void count(const unsigned char *buf, const int *sa, int *kmers, int n)
{
int i, cl, cp;
/* the first item needs special treatment */
/*
kuroi neko: since SA now includes the null string, kmers[0] is indeed 0 instead of 1
*/
// kmers[0]++;
for (i = 1; i < n; i++) {
/* The longest common prefix of the two suffixes */
cl = n - (sa[i - 1] > sa[i] ? sa[i - 1] : sa[i]);
#ifdef ICAP
cl = (cl > ICAP ? ICAP : cl);
#endif
for (cp = 0; cp < cl; cp++)
if (buf[sa[i - 1] + cp] != buf[sa[i] + cp])
break;
/* add new prefixes to the table */
kmers[cp]++;
}
}
#else // Kasai et al. method
// compute kmer cumulative count using Kasai et al. LCP construction algorithm
void compute_kmer_cumulative_sums(const unsigned char * t, const int * sa, int * kmer, int len)
{
// build inverse suffix array
int * isa = new int[len];
for (int i = 0; i != len; i++) isa[sa[i]] = i;
// enumerate common prefix lengths
memset(kmer, 0, len*sizeof(*kmer));
int lcp = 0;
int limit = len - 1;
for (int i = 0; i != limit; i++)
{
int k = isa[i];
int j = sa[k - 1];
while (t[i + lcp] == t[j + lcp]) lcp++;
// lcp now holds the longest common prefix length between suffix i and its predecessor in the SA, which is just what we need to compute our kmer counts
kmer[lcp]++;
if (lcp > 0) lcp--;
}
delete[] isa;
}
#endif // FUZXXL_METHOD
int main (int argc, char * argv[])
{
if (argc != 2) panic ("missing data file name");
tTimer timer;
int blen;
unsigned char * sequence;
sequence = read_sequence(argv[1], blen);
timer.print("input read");
vector<int> sa(blen, 0);
if (divsufsort(sequence, &sa[0], blen) != 0) panic("divsufsort failed");
timer.print("suffix table constructed");
vector<int> kmers(blen, 0);
#ifdef FUZXXL_METHOD
count(sequence, &sa[0], &kmers[0], blen);
timer.print("FUZxxl count done");
#else
compute_kmer_cumulative_sums(sequence, &sa[0], &kmers[0], blen);
timer.print("Kasai count done");
#endif
/* sum up kmers differences */
for (int i = 1; i < blen; i++) kmers[i] += kmers[i - 1] - 1;
timer.print("sum done");
/* human output */
if (blen>10) blen = 10; // output limited to the first few values to avoid cluttering display or saturating disks
for (int i = 0; i != blen; i++) printf("%d ", kmers[i]);
return 0;
}
I simply used Kasai et al.'s algorithm to compute the LCPs in O(n).
The rest is a mere adaptation of FUZxxl's code, using more concise C++ features here and there.
I left the original computation code in place (behind the FUZXXL_METHOD define) to allow comparisons.
Since the slowest steps are SA construction and the LCP count, I removed most of the other optimizations to avoid cluttering the code for negligible gain.
I extended the SA table to include the empty suffix (the null terminator appended by read_sequence). That makes the LCP computation easier.
I did not provide a length-limitation option, the slowest step being now the SA computation, which cannot be downsized (or at least I don't see how it could be).
I also removed the binary output option and limited the display to the first 10 values.
I assume this code is just a proof of concept, so there is no need to clutter the display or saturate disks.
Building the executable
I had to compile the whole project (including the lite version of divsufsort) for x64 to overcome the Win32 2 GB allocation limit.
The divsufsort code throws a bunch of warnings due to its heavy use of ints instead of size_ts, but that will not be an issue for inputs under 2 GB (which would require 26 GB of RAM anyway :D).
Linux build
Compile main.cpp and divsufsort.c using the commands:
g++ -c -O3 -fomit-frame-pointer divsufsort.c
g++ -O3 main.cpp -o kmers divsufsort.o
I assume the regular divsufsort library should work fine on native Linux, as long as you can allocate a bit more than 3 GB.
Performance
The Kasai algorithm requires the inverse SA table, which eats up 4 more bytes per character, for a total of 13 (1 byte of text plus 4 bytes each for the SA, the inverse SA and the k-mer table) instead of 9 with FUZxxl's method.
Memory consumption for the reference input is thus above 3 GB.
On the other hand, computation time is dramatically improved, and the whole algorithm is now O(n):
input read in 5s
suffix table constructed in 37s
FUZxxl count done in 389s
Kasai count done in 27s
14 92 520 2923 15714 71330 265861 890895 2482912 5509765 (etc.)
Further improvements
SA construction is now the slowest process.
Some bits of the divsufsort algorithm are meant to be parallelized with some compiler builtin unknown to me, but if necessary the code should be easy to adapt to more classic multithreading (à la C++11, for instance).
The lib also has a truckload of parameters, including various bucket sizes and the number of distinct symbols in the input string. I only had a cursory look at them, but I suspect compressing the alphabet might be worth a try if your strings are endless ACTG permutations (and you're desperate for performance).
There are some parallelizable methods for computing the LCP from the SA too, but since the code should run in under a minute on a processor slightly more powerful than my puny i3-2100 @ 3.1 GHz, and the whole algorithm is O(n), I doubt this would be worth the effort.
In J, a naïve solution would simply be [:~.]\ but I guess that won't cut it. – FUZxxl – 2015-02-02T11:10:10.707
@FUZxxl Well... how large a k does it give you in one minute and how much RAM does it use? – None – 2015-02-02T11:11:25.247
Well, for 2 it returns in about 5 seconds; for 3 it takes 3.2 GB of RAM and about a minute. I didn't try what it would do for four. Let me try... – FUZxxl – 2015-02-02T11:22:57.620
@FUZxxl Well... please feel free to submit it as an answer :) Remember I cut it off at 2GB of RAM. – None – 2015-02-02T11:24:53.763
Is a solution required to abort itself after one minute or are you going to do this? – FUZxxl – 2015-02-02T13:35:41.620
Is there any restriction on how input has to be provided? Can we assume input comes from stdin and is attached to a file? – FUZxxl – 2015-02-02T13:37:08.287
Does the order matter in which we output the entries in one line? With what character are they separated? – FUZxxl – 2015-02-02T13:41:09.847
Are we allowed to put trailing spaces into the output? – FUZxxl – 2015-02-02T13:52:27.087
@FUZxxl You can assume a file called chr2.fa as input. If you want it on stdin you will need to pipe it. The order of the output matters. It should be for k = 1,2,3,4,5,6,7,... in that order. The output is one number per line. – None – 2015-02-02T17:26:34.617
Can we use the special properties of the given file to improve the effectiveness of our program on the test? (While producing correct output for any input.) – randomra – 2015-02-02T18:01:38.637
I get 110 for k==2. Could you double-check the correct result? – Keith Randall – 2015-02-02T18:02:09.850
@KeithRandall The first few solutions I get are 15, 93, 521, 2924, 15715, 71331, 265862, 890896, and 2482913. I'm not sure if they are correct, but OP thinks so. – FUZxxl – 2015-02-02T18:06:07.277
@FUZxxl: ah, I need to remove \n first. You have a bug also though, the +1 after ftello() allows one zero byte to sneak in at the end. – Keith Randall – 2015-02-02T18:30:17.700
Huch? Are you sure? Let me remove that… – FUZxxl – 2015-02-02T18:33:48.097
When you say "You can decompress the input before starting the timing.": a) Does that include a licence to read the input from a ready-decompressed file? b) Does it include a licence to strip the newlines from that prepared file? – Peter Taylor – 2015-02-02T18:37:43.477
@KeithRandall You are right, it is all out by one currently. – None – 2015-02-02T18:58:23.927
@PeterTaylor Yes to reading from a ready-decompressed file. No to stripping the newlines beforehand. Handling newlines should be part of the code that is timed. – None – 2015-02-02T18:59:00.633
@randomra I updated the question. Please let me know if it helps. – None – 2015-02-02T20:02:04.900
I found a wonderful method to compute the indices for all k in very brief time, but the preprocessing takes so long -.- – FUZxxl – 2015-02-02T20:47:16.523
@FUZxxl What is it? I am very interested. – None – 2015-02-02T21:13:54.280
@Lembik I won't tell you. I'm currently in the process of altering the solution so it gives output earlier (but is slower overall). It takes 9 minutes to preprocess the data on my machine. – FUZxxl – 2015-02-02T21:24:34.300
@Lembik I won't tell you because I'm afraid someone implements the approach and beats me. I can tell you later, when it's done. – FUZxxl – 2015-02-02T21:27:14.687
@Lembik I improved my solution. It's not the new strategy I am thinking about though. – FUZxxl – 2015-02-02T23:36:22.040
@Lembik Would you mind rating my solution? – FUZxxl – 2015-02-06T17:17:40.823