Joining text files with 600M+ lines

7

4

I have two files, huge.txt and small.txt. huge.txt has around 600M rows and is 14 GB; each line has four space-separated words (tokens) followed by another space-separated column with a number. small.txt has 150K rows and is about 3 MB; each line has a space-separated word and a number.

Both files are sorted using the sort command, with no extra options. The words in both files may include apostrophes (') and dashes (-).

The desired output would contain all columns from the huge.txt file plus the second column (the number) from small.txt wherever the first word of huge.txt matches the first word of small.txt.

My attempts below failed miserably with the following error:

cat huge.txt|join -o 1.1 1.2 1.3 1.4 2.2 - small.txt > output.txt

join: memory exhausted  

What I suspect is that the sorting order isn't right somehow even though the files are pre-sorted using:

sort -k1 huge.unsorted.txt > huge.txt
sort -k1 small.unsorted.txt > small.txt

The problems seem to appear around words that have apostrophes (') or dashes (-). I also tried dictionary ordering with the -d option, but ran into the same error at the end.

I tried loading the files into MySQL, creating indexes and joining them, but that looked like it would take weeks on my laptop. (I don't have a computer with more memory or a fast disk/SSD for this task.)

I see two ways out of this but don't know how to implement either of them.

  1. How do I sort the files in a way that the join command considers them to be sorted properly?

  2. I was thinking of calculating MD5 or some other hashes of the strings to get rid of the apostrophes and dashes, while leaving the numbers at the end of the lines intact. I would do the sorting and joining with the hashes instead of the strings themselves, and finally "translate" the hashes back to strings. Since there would be only 150K hashes it's not that bad. What would be a good way to calculate an individual hash for each of the strings? Some AWK magic? (See the sketch below.)
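For illustration, a minimal sketch of that idea in Python (using hashlib rather than AWK; the file names are the ones from the question, and huge.txt would need the same transformation applied to its first column):

import hashlib

# Sketch: replace the key with its MD5 hex digest, which contains only
# [0-9a-f] and should therefore sort consistently regardless of locale.
# The original word is kept next to the hash so it can be translated
# back after the join.
with open('small.txt') as src, open('small.hashed.txt', 'w') as dst:
    for line in src:
        word, number = line.split()
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        dst.write('%s %s %s\n' % (digest, word, number))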

See file samples at the end.

Sample of huge.txt

had stirred me to 46 
had stirred my corruption 57 
had stirred old emotions 55 
had stirred something in 69 
had stirred something within 40 

Sample of small.txt

caley 114881 
calf 2757974 
calfed 137861 
calfee 71143 
calflora 154624 
calfskin 148347 
calgary 9416465 
calgon's 94846 
had 987654

Desired output:

had stirred me to 46 987654
had stirred my corruption 57 987654
had stirred old emotions 55 987654
had stirred something in 69 987654
had stirred something within 40 987654

dnkb

Posted 2010-05-26T19:29:10.260

Reputation: 347

ok, you provided huge.txt and small.txt .. can you please provide the desired output/result? – akira – 2010-05-27T18:57:01.837

please see above – dnkb – 2010-05-27T20:22:05.747

Being nosy here but I have to ask. What kind of analysis are you doing with all that data? – Nifle – 2010-05-27T23:10:51.570

@Nifle: master plan to take over the world :) – akira – 2010-05-28T11:23:49.213

@Nifle, @akira: almost :) actually this is about processing the famous google web corpus in order to compile stimuli for a psycholinguistic experiment. the numbers are frequencies of the strings on the english language www as google saw it in 2006. i'm sorry if this is a disappointingly lame reason to churn through all this data :) – dnkb – 2010-05-28T15:40:51.277

@dnkb: did you try my approach? – akira – 2010-05-28T17:08:50.173

not yet. i'm experimenting with something else, once i get home i'll see if it worked. if not i'll try yours. – dnkb – 2010-05-28T20:43:50.277

@akira: Yay, my silly trick worked, see new answer below. Thank you for your help though, I appreciate it. – dnkb – 2010-06-01T02:17:13.410

Answers

1

I know it's embarrassingly simple but it works.
Based on the assumption that my original files contain only lowercase characters, I simply replaced the problematic apostrophes and dashes with two uppercase letters, re-sorted, then joined the files, and finally changed the letters back to the original characters. That's it.

Thanks again for everyone contributing an answer or insightful comment.

The sorting took about 2 hours for huge.txt (14 GB), the joining less than an hour.

cat small.txt | tr "'-" "AD" | sort -k1 > small.AD
cat huge.txt | tr "'-" "AD" | sort -k1 | join -o 1.1,1.2,1.3,1.4,1.5,2.2 - small.AD | tr "AD" "'-" > output.txt

dnkb

Posted 2010-05-26T19:29:10.260

Reputation: 347

i am still interested in the speed of my approach. can i download the files somewhere? or recreate them here with .. whatever? – akira – 2010-06-01T05:08:46.893

@akira: it's available on 6 DVDs from UPenn only, not downloadable unfortunately. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 I'd also be very interested to see. My gut feeling is that with traditional 2.5" laptop hard drive the non-sequential disk access that's required to traverse the index would probably slow things down. With a decent SSD it may be faster.

– dnkb – 2010-06-01T20:16:40.383

@akira: You can test it, however, by generating, say, 5M unique random strings and corresponding integers (frequencies), then selecting the 150K most frequent ones. This will be small.txt. Then, using the same 5M random strings, randomly construct four-grams and throw in yet another integer afterwards. Generate 600M rows to construct huge.txt. E.g.

asdf wert dfw werhhyr 345345
frtko de serrte flxee 423443

Finally try (inner) joining them on any column. This should replicate the complexity rather well. – dnkb – 2010-06-01T20:32:40.390
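For example, a quick sketch of that recipe in Python (a hypothetical script, with the counts shrunk so it finishes quickly; raise them to 5M strings and 600M rows for a realistic test):

import random
import string

N_WORDS, N_SMALL, N_HUGE = 50000, 1500, 600000   # scaled-down counts

# random "vocabulary" of 8-letter lowercase strings
words = [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
         for _ in range(N_WORDS)]

# small.txt: the most "frequent" words with their counts
freqs = dict((w, random.randint(1, 10000000)) for w in words)
top = sorted(freqs, key=freqs.get, reverse=True)[:N_SMALL]
with open('small.unsorted.txt', 'w') as f:
    for w in top:
        f.write('%s %d\n' % (w, freqs[w]))

# huge.txt: random four-grams over the same vocabulary plus a count
with open('huge.unsorted.txt', 'w') as f:
    for _ in range(N_HUGE):
        f.write('%s %d\n' % (' '.join(random.sample(words, 4)),
                             random.randint(1, 100000)))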

9

IMO the best way to do this would be to use the programming/scripting language you know best and:

  1. load small.txt into an in-memory hash/map/associative array keyed on the words
  2. Process huge.txt line by line, adding the column looked up from the hash and writing the result into an output file
  3. Buffer input and output so that it happens in chunks of at least 4K
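A rough sketch of this approach in Python (file names as in the question; Python's buffered file objects already read and write in chunks, which takes care of step 3):

# Step 1: load small.txt into a dict keyed on the word.
lookup = {}
with open('small.txt') as small:
    for line in small:
        word, number = line.split()
        lookup[word] = number

# Step 2: stream huge.txt and append the looked-up number.
# Only lines whose first word appears in small.txt are written,
# which matches what join does by default.
with open('huge.txt') as huge, open('output.txt', 'w') as out:
    for line in huge:
        fields = line.split()
        number = lookup.get(fields[0])
        if number is not None:
            out.write(' '.join(fields + [number]) + '\n')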

Michael Borgwardt

Posted 2010-05-26T19:29:10.260

Reputation: 3 047

Thanks Michael. The problem is that what I outlined above is the most simplistic scenario. I have to carry out the operation above for two huge files (10+ GB) as well, where loading one into memory is not an option. That's why I'm looking to use pre-sorted files and join. – dnkb – 2010-05-26T21:17:57.630

@dnkb: pre-sorted files are of no help when both files are too large to fit into memory, because you still need random access on one of them, which means endless HD thrashing. You need a grace or hybrid hash join http://en.wikipedia.org/wiki/Hash_join - but any remotely professional RDBMS will implement that. Your time is probably best spent trying to get the MySQL based solution to work.

– Michael Borgwardt – 2010-05-26T21:33:38.237

I beg to differ: if the files are pre-sorted they can be merged using only sequential access, as in my answer. – David Z – 2010-05-26T23:51:38.203

@David: you're right. I shouldn't answer questions at those times... – Michael Borgwardt – 2010-05-27T21:22:30.657

7

To build on Michael Borgwardt's answer: as long as both files are sorted, you can put them together by basically performing one step of a mergesort. It'll be a little different than standard mergesort because you only want to keep one of the files. This will, of course, have to be implemented in your favorite programming language.

Here's a sketch of the algorithm:

line1 = read a line from file 1
line2 = read a line from file 2
start of loop:
if (first word of line1 == first word of line2) {
    write all fields of line1
      and second field of line2 to output
    line1 = read a line from file 1
    go to start of loop
}
else if (first word of line1 < first word of line2) {
    write line1 to output
    line1 = read a line from file 1
    go to start of loop
}
else (first word of line1 > first word of line2) {
    line2 = read a line from file 2
    go to start of loop
}

Here's a Python version (since Python is just what I happen to know best, not necessarily the best language for the job):

file1 = open('file1', 'r')
file2 = open('file2', 'r')
w2, n2 = file2.readline().split()
for line1 in file1:
  w11, w12, w13, w14, n15 = line1.split()
  if w11 == w2:
    print w11, w12, w13, w14, n15, n2
    continue
  elif w11 < w2:
    print w11, w12, w13, w14, n15
    continue
  else:
    while w11 > w2:
      w2, n2 = file2.readline().split()
    if w11 == w2:
      print w11, w12, w13, w14, n15, n2
    elif w11 < w2:
      print w11, w12, w13, w14, n15

and for completeness, after some digging here's what I came up with for Awk:

BEGIN {
  getline line2 <"file2";
  split(line2, a);
}
{
  # advance file2 until its key catches up with the current file1 key
  while (a[1] < $1) {
    if ((getline line2 <"file2") <= 0) break;
    split(line2, a);
  }
  if (a[1] == $1) print $0, a[2];
  else print $0;
}

Invoke as awk -f program.awk <file1.

David Z

Posted 2010-05-26T19:29:10.260

Reputation: 5 688

Thanks. The devil is in the sorting and in the < and > comparisons. GNU sort seems to kinda ignore/mistreat apostrophes; that's where I believe my problems stem from.

If I could sort the files "properly", in line with the <, >, lt, gt operator implementations, there wouldn't be a problem in the first place.

In fact I tried coding pretty much the above logic in Perl, but it failed because of differences between what Perl and sort consider to be a "greater" or a "smaller" string. – dnkb – 2010-05-26T23:40:27.403

Hmm, well you could use a custom comparison function in the merge, one that corresponds to the way GNU sort handles the files. – David Z – 2010-05-26T23:55:45.337

Yes. Any tips how to do that? Or how to figure out what sort does? – dnkb – 2010-05-27T02:36:31.027

Excellent post. Longest I've ever seen. +1+1+1+1 – jamesbtate – 2010-05-27T19:32:02.423

2

My answer is similar to Michael Borgwardt's, but you don't have to load all of either file into memory. If the files are both sorted, then you walk through the first file one line at a time and do a binary search on the second file to find the target line in question. That's a lot of HD access, but it keeps memory consumption low.
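A rough sketch of that idea in Python (a hypothetical helper, assuming small.txt is sorted in plain byte order and its keys are unique; the files are opened in binary mode so that seeking to arbitrary byte offsets is safe):

import os

def lookup(f, size, key):
    # binary search by byte offset in a sorted file
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        if mid > 0:
            f.readline()             # discard the partial line we landed in
        line = f.readline()
        if not line:                 # ran past the last line: search lower
            hi = mid
            continue
        word, number = line.split()
        if word == key:
            return number
        elif word < key:
            lo = mid + 1
        else:
            hi = mid
    return None

size = os.path.getsize('small.txt')
with open('small.txt', 'rb') as small, open('huge.txt', 'rb') as huge, \
        open('output.txt', 'wb') as out:
    for line in huge:
        fields = line.split()
        number = lookup(small, size, fields[0])
        if number is not None:
            out.write(b' '.join(fields + [number]) + b'\n')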

Michael H.

Posted 2010-05-26T19:29:10.260

Reputation: 341

i support this answer. if i were to address the problem, i would probably do this: create as many .cdb files from small.txt as needed to provide very fast lookup and then go line by line over huge.txt and query the term in all .cdb files. if you want to implement binary search in files yourself then that is fine as well. – akira – 2010-05-27T19:33:37.993

1

OK, this approach uses http://cr.yp.to/cdb.html as a quicker way to look up the content of 'small.txt':

  • Go and install cdbmake (part of the 'freecdb' package in Ubuntu, but there are a lot of implementations available).
  • Use awk to pipe small.txt to cdbmake.

    % awk '    { printf "+%d,%d:%s->%s\n", \
                    length($1),length($2),$1,$2 } \
           END { print "" }' small.txt | cdbmake small.cdb small.cdbtmp
    

(This transforms a line of 'small.txt' from something like "key value" into "+ks,vs:key->value".)

  • Now you go line by line over 'huge.txt' and print it out, looking up the first word in 'small.cdb':

    #!/usr/bin/env python
    import cdb
    import fileinput
    
    c = cdb.init("small.cdb")
    for l in fileinput.input(['huge.txt']):
        print l.strip(),
        v = c.get(l.split()[0])
        print "" if v is None else v
    

You would have to install python-cdb of course to make this tiny snippet work (and it works only with Python 2.5 or newer because of the 'conditional expression'). Anyway, there are a lot of bindings for whatever language you like. You could also use cdbget (a command-line tool) and invoke it over and over again, but spawning a new process for millions of lines is a bit inefficient.

Anyway, keep this in mind:

  • Each .cdb file cannot be bigger than 4 GB. So if you have to process a 'small.txt' with a size of 10 GB you obviously have to split it up into multiple files and create 'small1.cdb', 'small2.cdb', 'small3.cdb' and so on. It should be an easy task.
  • You do not need to sort 'small.txt', a lookup in a cdb file is pretty fast anyway.
  • I have not timed my little test case here, it is based on what you provided. :)

akira

Posted 2010-05-26T19:29:10.260

Reputation: 52 754

0

Instead of MySQL, you might try PostgreSQL, which can likely handle this task more gracefully. See their guide on efficiently populating a database.

hemp

Posted 2010-05-26T19:29:10.260

Reputation: 330

an rdbms is not the right hammer for this kind of a nail – akira – 2010-05-27T21:29:41.813