Counting k-mers


The task is to count the number of distinct substrings of length k, for k = 1, 2, 3, 4, ...

Output

Output one line for each k you manage to complete, containing a single number: the count of distinct k-mers. Print the lines in order of increasing k until you run out of time.

Score

Your score is the highest k you can get to on my computer in under 1 minute.

You should use http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr2.fa.gz as your input and ignore newlines. Your code should, however, be case sensitive.

You can decompress the input before starting the timing.

The following (inefficient) code counts the number of distinct 4-mers.

awk -v substr_length=4 '(len=length($0))>=substr_length{for (i=1; i<=len-substr_length+1; i++) substrs[substr($0,i,substr_length)]++}; END{for (i in substrs) print substrs[i], i}' file.txt|wc

Memory limits

To make your code behave nicely on my computer and to make the task more challenging, I will limit the RAM you can use to 2GB by using

ulimit -v 2000000

before running your code. I am aware this is not a rock solid way of limiting RAM usage so please don't use imaginative ways to get round this limit by, for example, spawning new processes. Of course you can write multi-threaded code but if someone does, I will have to learn how to limit the total RAM used for that.

Tie Breaker

In the case of a tie for some maximum k I will time how long it takes to give the outputs up to k+1 and the quickest one wins. In the case that they run in the same time to within a second up to k+1, the first submission wins.

Languages and libraries

You can use any language which has a freely available compiler/interpreter/etc. for Linux and any libraries which are also freely available for Linux.

My machine

The timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 Eight-Core Processor on an Asus M5A78L-M/USB3 motherboard (Socket AM3+, 8GB DDR3). This also means I need to be able to run your code. As a consequence, only use easily available free software and please include full instructions for how to compile and run your code.


Test output

FUZxxl's code outputs the following (though not all of it within 1 minute), which I believe is correct.

14
92
520
2923
15714
71330
265861
890895
2482912
5509765
12324706
29759234

League table

  • k >= 4000 FUZxxl (C)
  • k = 16 by Keith Randall (C++)
  • k = 10 by FUZxxl (C)

How much can you specialize your code to the input?

  • Clearly it would ruin the competition if you just precomputed the answers and had your code output them. Don't do that.
  • Ideally, anything your code needs to learn about the data that it will use to run more quickly it can learn at run-time.
  • You can however assume the input will look like the data in the *.fa files at http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ .

user9206

Posted 2015-02-02T09:46:16.443

Reputation:

In J, a naïve solution would simply be [:~.]\ but I guess that won't cut it. – FUZxxl – 2015-02-02T11:10:10.707

@FUZxxl Well... how large a k does it give you in one minute and how much RAM does it use? – None – 2015-02-02T11:11:25.247

Well, for 2 it returns in about 5 seconds; for 3 it takes 3.2 GB of RAM and about a minute. I didn't try what it would do for four. Let me try... – FUZxxl – 2015-02-02T11:22:57.620

@FUZxxl Well... please feel free to submit it as an answer :) Remember I cut it off at 2GB of RAM. – None – 2015-02-02T11:24:53.763

Is a solution required to abort itself after one minute or are you going to do this? – FUZxxl – 2015-02-02T13:35:41.620

Is there any restriction on how input has to be provided? Can we assume input comes from stdin and is attached to a file? – FUZxxl – 2015-02-02T13:37:08.287

Does the order matter in which we output the entries in one line? With what character are they separated? – FUZxxl – 2015-02-02T13:41:09.847

Are we allowed to put trailing spaces into the output? – FUZxxl – 2015-02-02T13:52:27.087

@FUZxxl You can assume a file called chr2.fa as input. If you want it on stdin you will need to pipe it. The order of the output matters. It should be for k = 1,2,3,4,5,6,7,... in that order. The output is one number per line. – None – 2015-02-02T17:26:34.617

Can we use the special properties of the given file to improve the effectiveness of our program on the test? (While producing correct output for any input.) – randomra – 2015-02-02T18:01:38.637

I get 110 for k==2. Could you double-check the correct result? – Keith Randall – 2015-02-02T18:02:09.850

@KeithRandall The first few solutions I get are 15, 93, 521, 2924, 15715, 71331, 265862, 890896, and 2482913. I'm not sure if they are correct, but OP thinks so. – FUZxxl – 2015-02-02T18:06:07.277

@FUZxxl: ah, I need to remove \n first. You have a bug also though, the +1 after ftello() allows one zero byte to sneak in at the end. – Keith Randall – 2015-02-02T18:30:17.700

Huh? Are you sure? Let me remove that… – FUZxxl – 2015-02-02T18:33:48.097

When you say "You can decompress the input before starting the timing.": a) Does that include a licence to read the input from a ready-decompressed file? b) Does it include a licence to strip the newlines from that prepared file? – Peter Taylor – 2015-02-02T18:37:43.477

@KeithRandall You are right it is all out by one currently. – None – 2015-02-02T18:58:23.927

@PeterTaylor Yes to reading from a ready-decompressed file. No to stripping the newlines beforehand. Handling newlines should be part of the code that is timed. – None – 2015-02-02T18:59:00.633

@randomra I updated the question. Please let me know if it helps. – None – 2015-02-02T20:02:04.900

I found a wonderful method to compute the indices for all k in very brief time, but the preprocessing takes so long -.- – FUZxxl – 2015-02-02T20:47:16.523

@FUZxxl What is it? I am very interested. – None – 2015-02-02T21:13:54.280

@Lembik I won't tell you. I'm currently in the process of altering the solution so it gives output earlier (but is slower overall). It takes 9 minutes to preprocess the data on my machine. – FUZxxl – 2015-02-02T21:24:34.300

@Lembik I won't tell you because I'm afraid someone implements the approach and beats me. I can tell you later, when it's done. – FUZxxl – 2015-02-02T21:27:14.687

@Lembik I improved my solution. It's not the new strategy I am thinking about though. – FUZxxl – 2015-02-02T23:36:22.040

@Lembik Would you mind rating my solution? – FUZxxl – 2015-02-06T17:17:40.823

Answers


C (≥ 4000)

This code neither uses less than 2 GB of RAM (it uses slightly more) nor produces any output in the first minute. But if you wait for about six minutes, it prints out all the k-mer counts at once.

An option is included to limit the highest k for which we count the k-mers. When k is limited to a reasonable range, the code terminates in less than one minute and uses less than 2 GiB of RAM. OP has rated this solution and it terminates in less than one minute for a limit not significantly higher than 4000.

How does it work?

The algorithm has four steps.

  1. Read the input file into a buffer and strip newlines.
  2. Suffix-sort an array of indices into the input buffer. For instance, the suffixes of the string mississippi are:

    mississippi
    ississippi
    ssissippi
    sissippi
    issippi
    ssippi
    sippi
    ippi
    ppi
    pi
    i
    

    These strings sorted in lexicographic order are:

    i
    ippi
    issippi
    ississippi
    mississippi
    pi
    ppi
    sippi
    sissippi
    ssippi
    ssissippi
    

    It is easy to see that, for every k, equal substrings of length k appear as prefixes of adjacent entries in the suffix-sorted array.

  3. An integer array is populated in which we store at each index k the number of distinct k-mers. This is done in a slightly convoluted fashion to speed up the process. Consider two adjacent entries of the suffix-sorted array.

       p     l
       v     v
    issippi
    ississippi
    

    p denotes the length of the longest common prefix of the two entries, and l denotes the length of the second entry. For such a pair, we find a new distinct substring of length k for every p < k ≤ l. Since p is usually much smaller than l, it is impractical to increment a large number of array entries for each pair. Instead we store the array as a difference array, where entry k holds the difference between the number of k-mers and the number of (k - 1)-mers. This turns an update of the form

    0  0  0  0 +1 +1 +1 +1 +1 +1  0  0  0
    

    into a much faster update of the form

    0  0  0  0 +1  0  0  0  0  0 -1  0  0
    

    By observing that l is different for every pair (in fact, every 0 < l < n appears exactly once), we can omit the explicit subtractions and instead subtract 1 at each position when converting the differences back into counts; a standalone sketch of this bookkeeping follows the list.

  4. Differences are converted into amounts and printed out.
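
To make the bookkeeping in step 3 concrete, here is a small standalone sketch (my own illustration, not part of the submission) that runs the same difference-array scheme on the mississippi example above, using a naive qsort-based suffix sort instead of libdivsufsort:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text = "mississippi";

static int
cmp_suffix(const void *a, const void *b)
{
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

int
main(void)
{
    size_t n = strlen(text), i, k;
    int *sa = malloc(n * sizeof *sa);
    int *kmers = calloc(n, sizeof *kmers);

    /* step 2: suffix-sort the indices (naively; the answer uses divsufsort) */
    for (i = 0; i < n; i++) sa[i] = (int)i;
    qsort(sa, n, sizeof *sa, cmp_suffix);

    /* step 3: one +1 at index p per adjacent pair replaces a whole range update */
    kmers[0]++;                       /* the first entry needs special treatment */
    for (i = 1; i < n; i++) {
        size_t l = n - (size_t)(sa[i-1] > sa[i] ? sa[i-1] : sa[i]);
        size_t p = 0;
        while (p < l && text[sa[i-1] + p] == text[sa[i] + p]) p++;
        kmers[p]++;
    }

    /* step 4: convert differences into counts, subtracting 1 at each step */
    for (k = 1; k < n; k++) kmers[k] += kmers[k-1] - 1;
    for (k = 0; k < n; k++) printf("%zu-mers: %d\n", k + 1, kmers[k]);
    return 0;
}

For mississippi this prints the counts 4, 7, 7, 7, 7, 6, 5, 4, 3, 2, 1 for k = 1 through 11, which you can verify by hand.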

The source code

This source uses libdivsufsort to construct the suffix array. The code generates output according to the specification when compiled with this invocation:

cc -O -o dsskmer dsskmer.c -ldivsufsort

Alternatively, the code can generate binary output when compiled with the following invocation:

cc -O -o dsskmer -DBINOUTPUT dsskmer.c -ldivsufsort

To limit the highest k for which k-mers are to be counted, supply -DICAP=k where k is the limit:

cc -O -o dsskmer -DICAP=64 dsskmer.c -ldivsufsort

Compile with -O3 instead of -O if your compiler supports this option.

#include <stdlib.h>
#include <stdio.h>

#include <divsufsort.h>

#ifndef BINOUTPUT
static char stdoutbuf[1024*1024];
#endif

/*
 * load input from stdin and remove newlines. Save length in flen.
 */
static unsigned char*
input(flen)
    int *flen;
{
    size_t n, len, pos, i, j;
    off_t slen;
    unsigned char *buf, *sbuf;

    if (fseek(stdin, 0L, SEEK_END) != 0) {
        perror("Cannot seek stdin");
        abort();
    }

    slen = ftello(stdin);
    if (slen == -1) {
        perror("Cannot tell stdin");
        abort();
    }

    len = (size_t)slen;

    rewind(stdin);

    /* Prepare for one extra trailing \0 byte */
    buf = malloc(len + 1);
    if (buf == NULL) {
        perror("Cannot malloc");
        abort();
    }

    pos = 0;

    while ((n = fread(buf + pos, 1, len - pos, stdin)) != 0)
        pos += n;

    if (ferror(stdin)) {
        perror("Cannot read from stdin");
        abort();
    }

    /* remove newlines */
    for (i = j = 0; i < len; i++)
        if (buf[i] != '\n')
            buf[j++] = buf[i];

    /* try to reclaim some memory */
    sbuf = realloc(buf, j);
    if (sbuf == NULL)
        sbuf = buf;

    *flen = (int)j;
    return sbuf;
}

/*
 * Compute for all k the number of k-mers. kmers will contain at index i the
 * number of (i + 1) mers. The count is computed as an array of differences,
 * where kmers[i] == kmersum[i] - kmersum[i-1] + 1 and then summed up by the
 * caller. This algorithm is a little bit unclear, but when you write subsequent
 * suffixes of an array on a piece of paper, it's easy to see how and why it
 * works.
 */
static void
count(buf, sa, kmers, n)
    const unsigned char *buf;
    const int *sa;
    int *kmers;
    int n;
{
    int i, cl, cp;

    /* the first item needs special treatment */
    kmers[0]++;

    for (i = 1; i < n; i++) {
        /* The longest common prefix of the two suffixes */
        cl = n - (sa[i-1] > sa[i] ? sa[i-1] : sa[i]);
#ifdef ICAP
        cl = (cl > ICAP ? ICAP : cl);
#endif
        for (cp = 0; cp < cl; cp++)
            if (buf[sa[i-1] + cp] != buf[sa[i] + cp])
                break;

        /* add new prefixes to the table */
        kmers[cp]++;
    }
}

extern int
main()
{
    unsigned char *buf;
    int blen, ilen, *sa, *kmers, i;

    buf = input(&blen);

    sa = malloc(blen * sizeof *sa);

    if (divsufsort(buf, sa, blen) != 0) {
        puts("Cannot divsufsort");
        abort();
    }

#ifdef ICAP
    ilen = ICAP;
    kmers = calloc(ilen + 1, sizeof *kmers);
#else
    ilen = blen;
    kmers = calloc(ilen, sizeof *kmers);
#endif

    if (kmers == NULL) {
        perror("Cannot malloc");
        abort();
    }

    count(buf, sa, kmers, blen);

#ifndef BINOUTPUT
    /* sum up kmers differences */
    for (i = 1; i < ilen; i++)
        kmers[i] += kmers[i-1] - 1;


    /* enlarge buffer of stdout for better IO performance */
    setvbuf(stdout, stdoutbuf, _IOFBF, sizeof stdoutbuf);

    /* human output */
    for (i = 0; i < ilen; i++)
        printf("%d\n", kmers[i]);
#else
    /* binary output in host endianess */
    fprintf(stderr, "writing out result...\n");
    fwrite(kmers, sizeof *kmers, ilen, stdout);
#endif

    return 0;
}

The binary file format can be converted into the human-readable output format with the following program:

#include <stdlib.h>
#include <stdio.h>

static int inbuf[BUFSIZ];
static char outbuf[BUFSIZ];

extern int main()
{
    int i, n, sum = 1;

    setbuf(stdout, outbuf);

    while ((n = fread(inbuf, sizeof *inbuf, BUFSIZ, stdin)) > 0)
        for (i = 0; i < n; i++)
            printf("%d\n", sum += inbuf[i] - 1);

    if (ferror(stdin)) {
        perror("Error reading input");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

sample output

Sample output in the binary format for the file chr2.fa can be found here. Please decompress with bzip2 -d first. Output is provided in binary format only because it compresses much better (3.5 kB vs. 260 MB) than output in human-readable format. Beware though that the reference output has a size of 924 MB when uncompressed. You might want to use a pipe like this:

bzip2 -dc chr2.kmerdiffdat.bz2 | ./unbin | less

FUZxxl

Posted 2015-02-02T09:46:16.443

Reputation: 9 656

This is very interesting. Can you break down the time taken by each part? It seems constructing the suffix array is less than 10% of the time. – None – 2015-02-06T14:45:33.757

Step one takes about three seconds, step two takes about 20 seconds, and step three takes the entire rest. The time taken by step four is negligible. – FUZxxl – 2015-02-06T14:49:13.893

If you like, you can place strategic fprintf(stderr, "doing step n\n"); statements to figure out how long the steps take. – FUZxxl – 2015-02-06T14:59:16.517

@FUZxxl Did you plot the distribution of the maximum p-values to see how much you can speed up step 3 by cutting off when p > k (counting k-mers only up to some limit)? Although if step 2 takes 20 secs there might not be much room there. Great explanation, btw. – randomra – 2015-02-06T15:23:08.347

@randomra It should indeed be possible to cap the highest k we observe and I was able to cut the runtime to under one minute with such an approach. Let me check. – FUZxxl – 2015-02-06T15:41:46.797

Yeah, it works. I didn't bother checking before. Wait a minute as I update the post. – FUZxxl – 2015-02-06T15:53:13.853

I always get gcc -Wall -O3 -o dsskmer -DICAP=64 dsskmer.c /tmp/ccsuEc2x.o: In function `main': dsskmer.c:(.text.startup+0xf9): undefined reference to `divsufsort'. How exactly are you compiling it? I just followed the instructions in the file INSTALL to install libdivsufsort. – None – 2015-02-06T20:00:04.537

Ah sorry, you have to compile with -ldivsufsort. I forgot to say that. – FUZxxl – 2015-02-06T20:24:25.867

and that has to be the last argument. – FUZxxl – 2015-02-06T20:25:01.997

OK... I had to do sudo ldconfig as well. Now I get cat ~/Downloads/chr2.fa | ./dsskmer Cannot seek stdin: Illegal seek – None – 2015-02-06T21:08:47.683

@Lembik Do not use cat. Use shell redirection like in ./dsskmer <~/Downloads/chr2.fa. The code needs to know how long the input file is and that's not possible with a pipe. – FUZxxl – 2015-02-06T21:15:03.367

Well.. it's very impressive! I get to 4000 at least. Also I notice that the number of kmers goes up very slowly at this point. I wonder if there is some way of taking advantage of this. For example, the last 4 values up to k = 4000 are 240581805 240581816 240581827 240581838 – None – 2015-02-06T21:43:13.133

@Lembik that's because the file only has about that many entries (go figure). It's the reason why the compressed representation is so small; the difference is all zeros at the end. (they actually all decrement by one, but that one is not part of the difference representation). – FUZxxl – 2015-02-06T22:11:06.860

@Lembik Well, I guess that bounty is mine then :-). Let's see if someone finds a better solution. How did you actually get the idea for this challenge? – FUZxxl – 2015-02-06T22:12:32.623

It's something that people really want to do in bioinformatics. But they are always complaining it is too slow and uses too much memory :) I will leave the bounty awarding until the end of the period if that is OK. – None – 2015-02-06T22:13:54.213

@Lembik Doesn't seem too slow for me. Maybe they're just bad programmers. – FUZxxl – 2015-02-06T22:14:58.477


C++, k=16, 37 seconds

Computes all of the 16-mers in the input. Each 16-mer is packed 4 bits to a symbol into a 64-bit word (with one bit pattern reserved for EOF). The 64-bit words are then sorted. The answer for each k can be read off by looking at how often the 4*k top bits of the sorted words change.

For the test input I use about 1.8GB, just under the wire.

I'm sure the reading speed could be improved.
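
For instance, here is a rough sketch (my own, with an assumed helper name read_stripped, not part of this answer) of one way to speed the reading up: slurp the whole file with a single fread and strip newlines in place, then run the 4-bit packing loop over the in-memory buffer instead of calling fgetc once per byte.

#include <stdio.h>
#include <stdlib.h>

/* Sketch only: read the whole file, drop '\n' in place, and return the
 * compacted buffer plus its length. The packing loop can then walk this
 * buffer directly. */
static unsigned char *read_stripped(const char *name, long *outlen) {
    FILE *f = fopen(name, "rb");
    if (f == NULL) { perror("fopen"); exit(1); }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *buf = (unsigned char *)malloc(size);
    if (buf == NULL) { perror("malloc"); exit(1); }
    if (fread(buf, 1, size, f) != (size_t)size) { perror("fread"); exit(1); }
    fclose(f);

    long j = 0;
    for (long i = 0; i < size; i++)     /* compact, skipping newlines */
        if (buf[i] != '\n')
            buf[j++] = buf[i];

    *outlen = j;
    return buf;
}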

Output:

14
92
520
2923
15714
71330
265861
890895
2482912
5509765
12324706
29759234
69001539
123930801
166196504
188354964

Program:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

int main(int argc, char *argv[]) {
    // read file
    printf("reading\n");
    FILE *f = fopen(argv[1], "rb");
    fseek(f, 0, SEEK_END);
    int32_t size = ftell(f);
    printf("size: %d\n", size);
    fseek(f, 0, SEEK_SET);

    // table to convert 8-bit input into 4-bit tokens.  Reserve 15 for EOF.
    int ntokens = 0;
    int8_t squash[256];
    for(int i = 0; i < 256; i++) squash[i] = -1;

    uint64_t *buf = (uint64_t*)malloc(8*size);

    int32_t n = 0;
    uint64_t z = 0;
    for(int32_t i = 0; i < size; i++) {
        char c = fgetc(f);
        if(c == '\n') continue;
        int8_t s = squash[c];
        if(s == -1) {
            if(ntokens == 15) {
                printf("too many tokens\n");
                exit(1);
            }
            squash[c] = s = ntokens++;
        }
        z <<= 4;
        z += s;
        n++;

        if(n >= 16) buf[n-16] = z;
    }
    for(int32_t i = 1; i < 16; i++) {
        z <<= 4;
        z += 15;
        buf[n-16+i] = z;
    }
    printf("   n: %d\n", n);

    // sort these uint64_t's
    printf("sorting\n");
    std::sort(buf, buf+n);

    for(int32_t k = 1; k <= 16; k++) {
        // count unique entries
        int32_t shift = 64-4*k;
        int32_t cnt = 1;
        int64_t e = buf[0] >> shift;
        for(int32_t i = 1; i < n; i++) {
            int64_t v = buf[i] >> shift;
            if((v & 15) == 15) continue; // ignore EOF entries
            if(v != e) {
                cnt++;
                e = v;
            }
        }

        printf("%d\n", cnt);
    }
}

Compile with g++ -O3 kmer.cc -o kmer and run with ./kmer chr2.fa.

Keith Randall

Posted 2015-02-02T09:46:16.443

Reputation: 19 865

Please obey the output format; do not print anything besides the counts, one per line. – FUZxxl – 2015-02-03T10:37:37.243

This is a cool solution otherwise. – FUZxxl – 2015-02-03T11:32:32.717

@FUZxxl: fixed. – Keith Randall – 2015-02-03T16:19:51.443

How long do you think it takes to find the length of the longest repeated substring? I was thinking you automatically know all the answers for any k longer than that. – None – 2015-02-03T19:23:22.363

@Lembik: Longest repeated substring can be done in linear time. But I don't think it is very helpful - there are really long repeated substrings in the sample input, thousands of symbols at least. – Keith Randall – 2015-02-03T20:49:34.770

@Lembik My “improved” approach was to sort an array of suffixes of the input file lexicographically. Then for all k the equal k-mers are adjacent. The problem is that sorting like that takes quite a while, around 8 minutes on my machine. I was too lazy to devise a better sorting scheme. The advantage is that after you have created that sorted array, you can generate the answer for each k rather quickly. If you want, you can even generate the answer for all k at once. – FUZxxl – 2015-02-03T23:21:57.400

@FUZxxl You can compute the suffix array in linear time so this could be promising. Whether you can actually get it under 1 minute is an interesting question. Some papers online seem to say you can usefully use multi cores to speed up suffix array construction too. – None – 2015-02-04T10:08:59.990

@Lembik Indeed. It just takes me a while to implement. Let me try… – FUZxxl – 2015-02-04T10:16:42.140

@Lembik The algorithm I tried (SA-IS, dubbed the fastest one for this purpose) still takes about two minutes to suffix-sort the entire array. I don't think it can be parallelized. – FUZxxl – 2015-02-04T11:43:41.313

@FUZxxl That's interesting. I found http://algo2.iti.kit.edu/sanders/papers/KulSan06a.pdf and also https://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks . – None – 2015-02-04T13:40:09.743

@Lembik That's weird. Maybe I made a mistake copying the code? – FUZxxl – 2015-02-04T13:46:31.477

@FUZxxl Which of the existing suffix array implementations have you managed to try? I couldn't get some of them to run at all. – None – 2015-02-04T21:41:06.643

@Lembik I actually only tried SA-IS and it took over two minutes, which is so much more than one minute that it didn't seem worth trying the others. – FUZxxl – 2015-02-05T00:52:04.997

Let us continue this discussion in chat. – None – 2015-02-05T09:19:13.000

After discussing it with FUZxxl, it seems you can make the suffix array of chr2.fa in under 30 seconds using only 1.1GB of RAM. – None – 2015-02-05T19:32:38.450


C++ - an improvement over FUZxxl's solution

I deserve absolutely no credit for the computation method itself, and if no better approach shows up in the meantime, the bounty should go to FUZxxl by right.

#define _CRT_SECURE_NO_WARNINGS // a Microsoft thing about strcpy security issues
#include <vector>

#include <string>
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstring>
using namespace std;

#include "divsufsort.h"

// graceful exit of sorts
void panic(const char * msg)
{
    cerr << msg;
    exit(0);
}

// approximative timing of various steps
struct tTimer {
    time_t begin;
    tTimer() { begin = time(NULL); }
    void print(const char * msg)
    {
        time_t now = time(NULL);
        cerr << msg << " in " << now - begin << "s\n";
        begin = now;
    }
};

// load input pattern
unsigned char * read_sequence (const char * filename, int& len)
{
    ifstream file(filename);
    if (!file) panic("could not open file");

    string str;
    std::string line;
    while (getline(file, line)) str += line;
    unsigned char * res = new unsigned char[str.length() + 1];
    len = str.length()+1;
    strcpy((char *)res, str.c_str());
    return res;
}

#ifdef FUZXXL_METHOD
/*
* Compute for all k the number of k-mers. kmers will contain at index i the
* number of (i + 1) mers. The count is computed as an array of differences,
* where kmers[i] == kmersum[i] - kmersum[i-1] + 1 and then summed up by the
* caller. This algorithm is a little bit unclear, but when you write subsequent
* suffixes of an array on a piece of paper, it's easy to see how and why it
* works.
*/
static void count(const unsigned char *buf, const int *sa, int *kmers, int n)
{
    int i, cl, cp;

    /* the first item needs special treatment */
    /*
        kuroi neko: since SA now includes the null string, kmers[0] is indeed 0 instead of 1
    */
//  kmers[0]++;

    for (i = 1; i < n; i++) {
        /* The longest common prefix of the two suffixes */
        cl = n - (sa[i - 1] > sa[i] ? sa[i - 1] : sa[i]);
#ifdef ICAP
        cl = (cl > ICAP ? ICAP : cl);
#endif
        for (cp = 0; cp < cl; cp++)
        if (buf[sa[i - 1] + cp] != buf[sa[i] + cp])
            break;

        /* add new prefixes to the table */
        kmers[cp]++;
    }
}

#else // Kasai et al. method

// compute kmer cumulative count using Kasai et al. LCP construction algorithm
void compute_kmer_cumulative_sums(const unsigned char * t, const int * sa, int * kmer, int len)
{
    // build inverse suffix array
    int * isa = new int[len];
    for (int i = 0; i != len; i++) isa[sa[i]] = i;

    // enumerate common prefix lengths
    memset(kmer, 0, len*sizeof(*kmer));
    int lcp = 0;
    int limit = len - 1;
    for (int i = 0; i != limit; i++)
    {
        int k = isa[i];
        int j = sa[k - 1];
        while (t[i + lcp] == t[j + lcp]) lcp++;

        // lcp now holds the length of the longest common prefix between the k-th
        // suffix and its predecessor in sorted order, which is just what we need
        // to compute our kmer counts
        kmer[lcp]++;
        if (lcp > 0) lcp--;
    }
    delete[] isa;
}

#endif // FUZXXL_METHOD

int main (int argc, char * argv[])
{
    if (argc != 2) panic ("missing data file name");

    tTimer timer;
    int blen;
    unsigned char * sequence;
    sequence = read_sequence(argv[1], blen);
    timer.print("input read");

    vector<int>sa;
    sa.assign(blen, 0);
    if (divsufsort(sequence, &sa[0], blen) != 0) panic("divsufsort failed");
    timer.print("suffix table constructed");

    vector<int>kmers;
    kmers.assign(blen,0);

#ifdef FUZXXL_METHOD
    count(sequence, &sa[0], &kmers[0], blen);
    timer.print("FUZxxl count done");
#else
    compute_kmer_cumulative_sums(sequence, &sa[0], &kmers[0], blen);
    timer.print("Kasai  count done");
#endif

    /* sum up kmers differences */
    for (int i = 1; i < blen; i++) kmers[i] += kmers[i - 1] - 1;
    timer.print("sum done");

    /* human output */

    if (blen>10) blen = 10; // output limited to the first few values to avoid cluttering display or saturating disks

    for (int i = 0; i != blen; i++) printf("%d ", kmers[i]);
    return 0;
}

I simply used the Kasai et al. algorithm to compute LCPs in O(n).
The rest is a mere adaptation of FUZxxl's code, using more concise C++ features here and there.

I left the original computation code in place to allow comparisons.

Since the slowest processes are SA construction and the LCP count, I removed most of the other optimizations to avoid cluttering the code for negligible gain.

I extended SA table to include the zero-length prefix. That makes LCP computation easier.

I did not provide a length limitation option, since the slowest step is now the SA computation, which cannot be downsized (or at least I don't see how it could be).

I also removed the binary output option and limited the display to the first 10 values.
I assume this code is just a proof of concept, so there is no need to clutter displays or saturate disks.

Building the executable

I had to compile the whole project (including the lite version of divsufsort) for x64 to overcome the Win32 2 GB allocation limit.

The divsufsort code throws a bunch of warnings due to heavy use of ints instead of size_ts, but that will not be an issue for inputs under 2 GB (which would require 26 GB of RAM anyway :D).

Linux build

compile main.cpp and divsufsort.c using the commands:

g++ -c -O3 -fomit-frame-pointer divsufsort.c 
g++ -O3 main.cpp -o kmers divsufsort.o

I assume the regular divsufsort library should work fine on native Linux, as long as you can allocate a bit more than 3 GB.

Performances

The Kasai algorithm requires the inverse SA table, which eats up 4 more bytes per character for a total of 13 (instead of 9 with FUZxxl's method).

Memory consumption for the reference input is thus above 3 GB.

On the other hand, computation time is dramatically improved, and the whole algorithm is now in O(n):

input read in 5s
suffix table constructed in 37s
FUZxxl count done in 389s
Kasai  count done in 27s
14 92 520 2923 15714 71330 265861 890895 2482912 5509765 (etc.)

Further improvements

SA construction is now the slowest process.
Some bits of the divsufsort algorithm are meant to be parallelized with some compiler built-in feature unknown to me, but if necessary the code should be easy to adapt to more classic multithreading (à la C++11, for instance).
The lib also has a truckload of parameters, including various bucket sizes and the number of distinct symbols in the input string. I only had a cursory look at them, but I suspect compressing the alphabet might be worth a try if your strings are endless ACTG permutations (and you're desperate for performance); a rough sketch of that remapping follows below.

There also exist parallelizable methods for computing the LCP from the SA, but since the code should run in under one minute on a processor slightly more powerful than my puny i3-2100@3.1GHz and the whole algorithm is O(n), I doubt this would be worth the effort.
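
On the alphabet-compression idea, a minimal sketch (my own, not part of this answer) of what such a remapping could look like: rewrite the input bytes to a dense 0..n-1 range before handing the buffer to divsufsort, the same trick FUZxxl's tree-based solution further down uses for its 1-mers.

/* Sketch only: remap input bytes to a dense 0..n-1 alphabet in place,
 * so the suffix sorter and any bucket tables only deal with symbols
 * that actually occur. Returns the number of distinct symbols. */
static int compress_alphabet(unsigned char *buf, long len) {
    unsigned char code[256] = {0};
    int nsym = 0;
    long i;

    for (i = 0; i < len; i++)        /* mark symbols that occur */
        code[buf[i]] = 1;
    for (i = 0; i < 256; i++)        /* assign dense codes to marked symbols */
        if (code[i])
            code[i] = (unsigned char)nsym++;
    for (i = 0; i < len; i++)        /* rewrite the buffer with the codes */
        buf[i] = code[buf[i]];
    return nsym;
}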

user16991

Posted 2015-02-02T09:46:16.443

Reputation:

+1 for giving credit where credit is due ;) (and nice speed-up!) – Martin Ender – 2015-02-07T16:56:38.027

This is neat. Nice improvement. – FUZxxl – 2015-02-07T17:09:36.767

I didn't even know about LCP arrays. The Kasai et al. algorithm is really neat, too. It's a bit sad that it takes so much memory. – FUZxxl – 2015-02-07T17:13:35.140

@FUZxxl ah well, it's always the same old speed/memory tradeoff, but I think a 45% mem. cost increase is an acceptable price for reaching an O(n) complexity. – None – 2015-02-07T17:19:55.057

@kuroineko True. Notice that my O(n²) approach can save even more memory: It's possible to write out the index array into a temporary file and then read it in chunks as you don't need random access, cutting down the memory usage. – FUZxxl – 2015-02-07T20:49:51.930

@FUZxxl Indeed. Using disk as a buffer would make it rather slow, but I wonder if the O(n) algorithm would still win with disk swap on random memory blocks. Anyway, people in genetics labs are surely rich enough to afford SSDs :). Seriously, I know there is another algorithm that produces LCP in a different order, but since order does not matter in our case it might be worth a try. – None – 2015-02-08T01:03:56.187

Would it work to compute BWT instead of the suffix array and then the LCP array instead? The BWT uses less space than the suffix array. – None – 2015-02-08T13:23:30.457

Indeed. Now we only need to find a fast BWT algorithm and implement the BWT->LCP transform :) – None – 2015-02-08T14:56:09.073

I may be compiling it wrong but I get https://bpaste.net/show/77b03516372a in Linux. – None – 2015-02-08T16:03:19.403

What is your exact command line when you compiled with g++4.7 ? (I am not a C++ person so I may also be missing options that are obvious to an expert.) My g++ is version 4.8.2 . – None – 2015-02-08T16:31:16.893

Doesn't libdivsufsort also compute the BWT? http://encode.ru/threads/1879-SACA-K-vs-divsufsort-for-computing-BWT also seems relevant (it seems to be called divbwt there). – None – 2015-02-08T16:33:40.807

OK, Linux build fixed. About BWT, we can indeed use libdivsufsort to compute it, but I could not find a comprehensive enough description of an algorithm to build the LCP from the BWT, and frankly I don't feel like ploughing through raw research papers to try and implement one. – None – 2015-02-08T19:23:43.843

@kuroineko I have an idea how to construct the data we need in linear time without using an LCP array. Let me check… – FUZxxl – 2015-02-09T07:37:33.507


C (can solve for up to 10 in a minute on my machine)

This is a very simple solution. It constructs a tree of the k-mers found and counts them. To conserve memory, characters are first converted into integers from 0 to n - 1, where n is the number of different characters in the input, so we don't need to provide space for characters that never appear. Additionally, less memory is allocated for the leaves than for the other nodes, so as to conserve memory further. This solution uses about 200 MiB of RAM during its runtime on my machine. I'm still improving it, so perhaps the next iteration will be even faster!

To compile, save the code below in a file named kmers.c and then execute on a POSIX-like operating system:

cc -O -o kmers kmers.c

You might want to substitute -O3 for -O if your compiler supports that. To run, first unpack chr2.fa.gz into chr2.fa and then run:

./kmers <chr2.fa

This produces the output step by step. You might want to limit both time and space. Use something like

( ulimit -t 60 -v 2000000 ; ./kmers <chr2.fa )

to reduce resources as needed.

#include <limits.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/*
 * A tree for the k-mers. It contains k layers where each layer is an
 * array of struct tr. The last layer contains a 1 for every kmer found,
 * the others pointers to the next layer or NULL.
 */
struct tr {
    struct tr *n;
};

/* a struct tr to point to */
static struct tr token;

static void     *mem(size_t s);
static int       add(struct tr*, unsigned char*, int, size_t);
static unsigned char    *input(size_t*);
extern int       main(void);

/*
 * Allocate memory, fail if we can't.
 */
static void*
mem(s)
    size_t s;
{
    void *p;

    p = calloc(s, 1);
    if (p != NULL)
        return p;

    perror("Cannot malloc");
    abort();
}

/*
 * add s to b, return 1 if added, 0 if present before. Assume that n - 1 layers
 * of struct tr already exist and only add an n-th layer if needed. In the n-th
 * layer, a NULL pointer denotes non-existance, while a pointer to token denotes
 * existance.
 */
static int
add(b, s, n, is)
    struct tr *b;
    unsigned char *s;
    int n;
    size_t is;
{
    struct tr **nb;
    int added;
    int i;

    for (i = 0; i < n - 2; i++) {
        b = b[s[i]].n;
    }

    nb = &b[s[n - 2]].n;

    if (*nb == NULL || *nb == &token)
        *nb = mem(is * sizeof *b);

    added = (*nb)[s[n - 1]].n == NULL;
    (*nb)[s[n - 1]].n = &token;

    return (added);
}

/*
 * load input from stdin and remove newlines. Save length in flen.
 */
static unsigned char*
input(flen)
    size_t *flen;
{
    size_t n, len, pos, i, j;
    unsigned char *buf;

    if (fseek(stdin, 0L, SEEK_END) != 0) {
        perror("Cannot seek stdin");
        abort();
    }

    len = ftello(stdin);
    if (len == -1) {
        perror("Cannot tell stdin");
        abort();
    }

    rewind(stdin);

    /* no need to zero out, so no mem() */
    buf = malloc(len);
    if (buf == NULL) {
        perror("Cannot malloc");
        abort();
    }

    pos = 0;

    while ((n = fread(buf + pos, 1, len - pos, stdin)) != 0)
        pos += n;

    if (ferror(stdin)) {
        perror("Cannot read from stdin");
        abort();
    }

    /* remove newlines */
    for (i = j = 0; i < len; i++)
        if (buf[i] != '\n')
            buf[j++] = buf[i];

    *flen = j;
    return buf;
}

extern int
main()
{
    struct tr *b;
    size_t flen, c, i, k, is;
    unsigned char *buf, itab[1 << CHAR_BIT];

    buf = input(&flen);

    memset(itab, 0, sizeof itab);

    /* process 1-mers */
    for (i = 0; i < flen; i++)
        itab[buf[i]] = 1;

    is = 0;
    for (i = 0; i < sizeof itab / sizeof *itab; i++)
        if (itab[i] != 0)
            itab[i] = is++;

    printf("%zd\n", is);

    /* translate characters into indices */
    for (i = 0; i < flen; i++)
        buf[i] = itab[buf[i]];

    b = mem(is * sizeof *b);

    /* process remaining k-mers */
    for (k = 2; k < flen; k++) {
        c = 0;

        for (i = 0; i < flen - k + 1; i++)
            c += add(b, buf + i, k, is);

        printf("%zd\n", c);
    }

    return 0;
}

Improvements

  1. 8 → 9: Read the entire file in the beginning, pre-process it once and keep it in-core. This greatly improves throughput.
  2. Use less memory, write correct output.
  3. Fix output format again.
  4. Fix off-by-one.
  5. 9 → 10: Don't throw away what you already did.

FUZxxl

Posted 2015-02-02T09:46:16.443

Reputation: 9 656

The output should really be one number per line. It seems currently to output strings from the input file. – None – 2015-02-02T17:22:46.737

@Lembik Ah, I see! It seems like I misunderstood the output format. Give me a minute to fix it. – FUZxxl – 2015-02-02T17:23:24.870

@Lembik Output format fixed again. If you like, I can add code that kills the program after one minute. – FUZxxl – 2015-02-02T17:28:09.247

Thanks. You are currently the winner :) timeout 60s works OK for me so no need to build in a way to kill the code after 1 minute. – None – 2015-02-02T17:38:00.807

You could also use ulimit -t 60. – FUZxxl – 2015-02-02T17:41:54.993