Filter a large file in small memory

7

The challenge is a follow-up to Filter a large file quickly, which had two submissions that were as fast as wc! This time the difficulty is that RAM is severely restricted, which means that new methods will have to be used.

  • Input: Each line has three space-separated positive integers.
  • Output: All input lines A, B, T that satisfy either of the following criteria (a naive reference sketch follows the list below).

    1. There exists another input line C, D, U where D = A and 0 <= T - U < 100.
    2. There exists another input line C, D, U where B = C and 0 <= U - T < 100.
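
To make the specification concrete, here is a naive quadratic reference filter (my own sketch, not part of the challenge) that applies the two criteria literally. It keeps the whole input in memory and compares every pair of lines, so it is only useful for sanity-checking tiny inputs, and it treats "another input line" as a strictly different line.

import sys

# Read every line as an (a, b, t) triple -- only workable for tiny inputs.
lines = [tuple(map(int, line.split())) for line in sys.stdin]

for i, (a, b, t) in enumerate(lines):
    for j, (c, d, u) in enumerate(lines):
        if i == j:
            continue
        # Criterion 1: the other line's D equals this line's A and 0 <= T - U < 100.
        # Criterion 2: the other line's C equals this line's B and 0 <= U - T < 100.
        if (d == a and 0 <= t - u < 100) or (b == c and 0 <= u - t < 100):
            sys.stdout.write("%d %d %d\n" % (a, b, t))
            break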

To make a test file, use the following Python script, which will also be used for testing. It will make a 1.3 GB file. You can of course reduce nolines for testing.

import random    
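# Run with Python 2 and redirect stdout to create the ~1.3 GB file, e.g.
# python makefile.py > largefile.file (the script name here is illustrative).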
nolines = 50000000 # 50 million
for i in xrange(nolines):
    print random.randint(0,nolines-1), random.randint(0,nolines-1), random.randint(0,nolines-1)

Rules. Your code must use no more than 200 MB of RAM. I will enforce this using ulimit -v 200000. Your code must therefore also run on Ubuntu using only easily installable and free software.

Score. You can report your unofficial score by testing on your machine: (the time in seconds needed by Keith Randall's submission from Filter a large file quickly, divided by the time in seconds taken by your submission) * 100.
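For example, if Keith Randall's submission from the earlier challenge takes 60 seconds on your machine and yours takes 150 seconds, your score is (60 / 150) * 100 = 40 (the numbers here are only illustrative).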

Winning criterion. The winner will be the one with the highest score when I test the code on my computer using a file created by the sample Python code. I will first run

sync && sudo bash -c 'echo  3 > /proc/sys/vm/drop_caches'

before each test run.

The deadline is one week from the time of the first correct entry.

My Machine. The timings will be run on my machine: a standard 8 GB RAM Ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.

Some relevant timing information

largefile.file is a file created using the sample Python code.

sync && sudo bash -c 'echo  3 > /proc/sys/vm/drop_caches'

time wc largefile.file

real    0m26.835s
user    0m18.363s
sys     0m0.495s

time sort -n largefile.file  > /dev/null

real    1m32.344s
user    2m9.530s
sys     0m6.543s

user9206

Posted 2014-05-14T20:14:15.340

Reputation:

So many [fastest-code]s. You could change it to least memory used in total while filtering the file, including variables, code, etc. – Justin – 2014-05-14T20:57:35.217

@Quincunx Luckily I am not the only person on codegolf.se :) To be fair, fastest-code questions are a) really cool and b) still in the minority on codegolf.se. It is also important to notice that the winning language has been different each time for my questions and one question had 19 answers. Who would have thought nimrod would win for example? Least memory is an interesting idea but I think you will end up with something so slow you can't test it. Maybe next time. – None – 2014-05-14T21:02:47.533

@Quincunx I should also add that it's not trivial to write any code that works, let alone is fast, for this problem because of the RAM restriction. So I don't really look at it as a standard fastest code problem. – None – 2014-05-14T21:28:17.803

The RAM restriction is fairly trivial to work with :) – NPSF3000 – 2014-05-15T04:43:23.473

Is there a restriction on disk space? – Οurous – 2014-05-15T05:00:24.663

@Ourous Shhh! You're giving away ma secrets :P – NPSF3000 – 2014-05-15T05:29:47.733

@Ourous No, although I won't test code that might damage my computer! If you are going to use a lot of disk space, please say how much in your answer. – None – 2014-05-15T07:01:56.733

Answers

1

C, score 55.55

Same as my other implementation, but it does the radix sort using disk files. Uses ~25MB of memory, mostly for buffers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define R 3
#define B 9
#define M ((1<<B)-1)
#define BUFSIZE 4096
#define PARSEBUF 65536

typedef struct {
  int a,b,t;
} entry;

typedef struct {
  int fd;
  int lo;
  int hi; // bytes buf[lo]..buf[hi-1] are valid.
  entry e; // scratch
  char buf[PARSEBUF];
} ParseState;

entry *parse(ParseState *s) {
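  // Return the next (a, b, t) entry from the input, topping up the 64KB read
  // buffer when it runs low; returns NULL once the input is exhausted.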
  int n = s->hi - s->lo;

  // if we need to, grab some more data
  if(n < 100 && s->fd != -1) {
    memmove(&s->buf[0], &s->buf[s->lo], n);
    int k = read(s->fd, &s->buf[n], PARSEBUF - n);
    if(k == 0) s->fd = -1;
    n += k;
    s->lo = 0;
    s->hi = n;
  }
  if(n == 0) return NULL;

  // parse line
  char *p = &s->buf[s->lo];
  int a = 0;
  while(*p != ' ') {
    a *= 10;
    a += *p - '0';
    p++;
  }
  p++;
  int b = 0;
  while(*p != ' ') {
    b *= 10;
    b += *p - '0';
    p++;
  }
  p++;
  int t = 0;
  while(*p != '\n') {
    t *= 10;
    t += *p - '0';
    p++;
  }
  p++;
  s->lo += p - &s->buf[s->lo];
  s->e.a = a;
  s->e.b = b;
  s->e.t = t;
  return &s->e;
}

typedef struct {
  int fd;
  int lo;
  int hi;
  entry buf[BUFSIZE];
} ReadBuf;

entry *bread(ReadBuf *b) {
  if(b->lo == b->hi) {
    int k = read(b->fd, &b->buf[0], BUFSIZE*sizeof(entry));
    if(k == 0) return NULL;
    b->lo = 0;
    b->hi = k / sizeof(entry);
  }
  return &b->buf[b->lo++];
}

typedef struct {
  int fd;
  int offset;   // offset in file to write buffer
  int n;        // # of entries
  entry buf[BUFSIZE];
} WriteBuf;

void bflush(WriteBuf *b) {
  int size = b->n * sizeof(entry);
  pwrite(b->fd, &b->buf[0], size, b->offset);
  b->offset += size;
  b->n = 0;
}

void bwrite(WriteBuf *b, entry *e) {
  b->buf[b->n++] = *e;
  if(b->n == BUFSIZE)
    bflush(b);
}

int count[R][1<<B];
int accum[R][1<<B];

unsigned char bcount[16384];

ParseState s;
ReadBuf rd;
WriteBuf wr[1<<B];
entry recent[65536];
char print[65536];

int main(int argc, char *argv[]) {
  int fd = open(argv[1], O_RDONLY);
  int fd1 = open("tmp1", O_RDWR | O_CREAT | O_TRUNC, 0666);
  int fd2 = open("tmp2", O_RDWR | O_CREAT | O_TRUNC, 0666);

  // parse input, write first file, do stats for radix sort
  entry *e;
  s = (ParseState){fd, 0, 0};
  wr[0] = (WriteBuf){fd1, 0, 0};
  while((e = parse(&s)) != NULL) {
    bwrite(&wr[0], e);
    // compute counts for radix sort
    int t = e->t;
    for(int r = 0; r < R; r++) {
      count[r][t&M]++;
      t >>= B;
    }
  }
  bflush(&wr[0]);

  // accumulate count entries
  for(int r = 0; r < R; r++) {
    int sum = 0;
    for(int i = 0; i < 1<<B; i++) {
      accum[r][i] = sum;
      sum += count[r][i];
    }
  }

  // radix sort
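  // (The sort runs on disk: R = 3 passes over B = 9 bits of t each, 27 bits in
  // total, enough for t < 50 million.  Each pass streams the entries from one
  // temp file and appends each entry to its bucket's region of the other file;
  // the regions start at the prefix sums in accum[], so after the final pass
  // the entries are ordered by t.)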
  int fda = fd1;
  int fdb = fd2;
  for(int r = 0; r < R; r++) {
    lseek(fda, 0, SEEK_SET);
    rd = (ReadBuf){fda, 0, 0};
    for(int i = 0; i < 1<<B; i++) {
      wr[i] = (WriteBuf){fdb, accum[r][i]*sizeof(entry), 0};
    }
    while((e = bread(&rd)) != NULL) {
      bwrite(&wr[(e->t>>r*B)&M], e);
    }
    for(int i = 0; i < 1<<B; i++) {
      bflush(&wr[i]);
    }
    int t = fda;
    fda = fdb;
    fdb = t;
  }

  // find matches
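  // Scan the time-sorted entries with a sliding window covering the last 100
  // time units (recent[lo..hi)).  bcount[] keeps a saturating count, per
  // low-14-bit value of b, of the entries currently in the window; it acts as
  // a cheap approximate membership test, so the exact rescan of the window
  // below only runs when the new entry's a might match some b in the window.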
  lseek(fda, 0, SEEK_SET);
  ReadBuf rd = {fda, 0, 0};
  int lo = 0;
  int hi = 0;
  while((e = bread(&rd)) != NULL) {
    recent[hi] = *e;
    print[hi] = 0;
    while(recent[lo].t <= e->t - 100) {
      int x = recent[lo].b & 16383;
      if(bcount[x] != 255) bcount[x]--;
      if(print[lo])
        printf("%d %d %d\n", recent[lo].a, recent[lo].b, recent[lo].t);
      lo++;
    }
    if(bcount[e->a & 16383] != 0) {
      // somewhere in the window is a b that matches the
      // low 14 bits of a.  Find out if there is a full match.
      for(int k = lo; k < hi; k++) {
        if(e->a == recent[k].b)
          print[k] = print[hi] = 1;
      }
    }
    int x = e->b & 16383;
    if(bcount[x] != 255) bcount[x]++;
    hi++;
    if(hi == 65536) {
      if(lo == 0) {
    printf("recent buffer too small!\n");
    exit(1);
      }
      memmove(&recent[0], &recent[lo], (hi - lo) * sizeof(entry));
      memmove(&print[0], &print[lo], (hi - lo) * sizeof(char));
      hi -= lo;
      lo = 0;
    }
  }
}

Keith Randall

Posted 2014-05-14T20:14:15.340

Reputation: 19 865

That is an amazing score! Do you think it would be even higher if you used closer to 200MB of memory? – None – 2014-05-17T19:59:19.403

@Lembik: no, adding more buffer has almost no effect. – Keith Randall – 2014-05-18T06:23:50.310

3

Tcl with SQLite, score less than 0.05

Shove the lines into an SQL database, then make SQLite do the work. Requires the Tcl bindings for SQLite 3 (OpenBSD package sqlite3-tcl; Ubuntu package libsqlite3-tcl).

Usage: tclsh filter.tcl <input.txt >output.txt

package require sqlite3

sqlite3 db ""
db eval {
    begin transaction;
    create table lines(i integer primary key,
               a integer, b integer, t integer);
}
while {[gets stdin line] >= 0} {
    lassign [split $line] a b t
    db eval {insert into lines values(null,$a,$b,$t)}
}
db eval {
    create view pairs as
      select first.i, second.i as j
      from lines as first, lines as second
      where first.a == second.b and
        first.t - second.t between 0 and 99 and
        first.i != second.i;
    select a, b, t from lines
      where i in (select i from pairs union select j from pairs);
} {
    puts "$a $b $t"
}
db close

Comparison of time

The motto of SQLite is, "Small. Fast. Reliable. Choose any three," but my SQL query is not fast. My program took more than six hours with my slow hard disk. I compare times for three programs: cfilter (the C answer above), hfilter (from another answer), and my filter.tcl.

With 50 million lines of input, cfilter and hfilter both crash because my data size limit is only 1 GB. (The default limit in OpenBSD is 512 MB.) If I raise my limit to 8 GB with ulimit -Sd 8388608, then I can run both cfilter and hfilter. I also lower my physical memory limit with -m 3000000 so that other programs stay in RAM, not in swap space. I have only 4 GB of RAM.

$ ulimit -Sd 8388608 -m 3000000
$ time ./cfilter input.txt >output.txt 
    0m9.03s real     0m5.17s user     0m1.97s system
$ ln -sf input.txt test.file
$ time ./hfilter >output2.txt
    0m25.62s real     0m22.08s user     0m2.04s system
$ time tclsh8.6 filter.tcl <input.txt >output3.txt
^Z  371m19.57s real     0m0.00s user     0m0.00s system
[1] + Suspended            tclsh8.6 filter.tcl < input.txt > output3.txt

I suspended filter.tcl after more than 6 hours. top shows that filter.tcl used less than 15 MB of memory and less than 12 minutes of processor time. Most of the 6 hours went to waiting for my slow hard disk (5400 rpm).

For score, 10 seconds for cfilter, divided by more than 6 hours for filter.tcl, times 100, equals less than 0.05.

Comparison of results

The different filters do not select the same lines! Consider these 6 lines of input:

$ cat example.txt
1 801 33
802 1 33
2 803 499
804 2 400
805 2 400
3 3 555

My filter.tcl outputs five lines, all except 3 3 555.

$ tclsh8.6 filter.tcl <example.txt
1 801 33
802 1 33
2 803 499
804 2 400
805 2 400

But cfilter only outputs three lines:

$ ./cfilter example.txt         
804 2 400
805 2 400
2 803 499

cfilter seems to have a bug when T - U == 0. It fails to output the lines 1 801 33 and 802 1 33, though I believe them to be a matching pair.

Now for hfilter and the Scala program by the same author:

$ ln -sf example.txt test.file
$ ./hfilter
804 2 400
2 803 499
805 2 400
2 803 499
3 3 555
3 3 555
$ scala Filterer example.txt /dev/stdout
804 2 400
2 803 499
802 1 33
1 801 33
805 2 400
2 803 499
3 3 555
3 3 555

hfilter also misses the lines 1 801 33 and 802 1 33, but the Scala version finds them. Both programs output 2 803 499 twice (because it is in two matching pairs), but I might ignore such duplication. Both programs also output 3 3 555. This line is in a pair with itself because A = B. This is only correct if the phrase "another input line" includes the same line.

When the input is the random 50 million lines from the Python script, these differences might disappear. The probability that any particular matching pair has T - U = 0, or that any particular line has A = B, is close to zero when the input has random numbers up to 50 million.
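(As a rough check: with 50 million lines and fields drawn uniformly from 0 to 49,999,999, a given line has A = B with probability 1 in 50 million, so only about one such line is expected in the entire file, and only roughly 1 in 100 matching pairs should have T - U exactly 0.)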

kernigh

Posted 2014-05-14T20:14:15.340

Reputation: 2 615

you are right, there is a bug in my code when two lines have a matching T. – Keith Randall – 2014-05-18T06:30:20.243

This is a great detailed analysis. Thank you! – None – 2014-05-18T08:14:45.487