6

Is there a Unix utility for sorting large files containing fixed-length binary records?

In other words, I'm looking for something like sort(1), but for binary files with fixed-length records.

I could convert the files into text, then sort using sort(1), and then convert back into the binary representation, but I'm looking for something more time and space efficient.

HughE
  • What's the typical file size involved? – ewwhite Aug 15 '13 at 20:21
  • Have you used C before? You could `mmap` the file and implement sort of `quicksort` with `memcmp`. – ott-- Aug 15 '13 at 20:41
  • `libc` already implements `qsort`, and with `mmap` an implementation is fairly trivial, so C isn't really a constraint; Perl or Python, for example, can also `mmap`. But which Unix? How large are the files? – Alex Aug 17 '13 at 23:30
  • `mmap` based solutions would basically need two disk seeks for every permutation (assuming the input file is larger than main memory), and this would be very slow – Arnaud Le Blanc Aug 21 '13 at 15:35
  • You may be able to make use of either 'od' or 'hexdump' to finagle the data into something 'sort' can then better handle. Barring any example of the dataset, I can't be more helpful than that. – Nex7 Aug 22 '13 at 04:09
  • I ended up writing my own in python (I think, it was a long time ago). – HughE Jul 29 '16 at 18:14
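For files that fit in memory, the approach discussed in these comments (and presumably roughly what the OP's Python script did — the details below are assumptions, not the OP's actual code) boils down to slicing the buffer into records and sorting them bytewise:

```python
RECORD_SIZE = 32  # assumed fixed record length


def sort_records(data: bytes, size: int = RECORD_SIZE) -> bytes:
    """Split a buffer into fixed-length records and sort them bytewise.

    Python bytes compare like memcmp: unsigned, byte by byte.
    """
    if len(data) % size:
        raise ValueError("file length is not a multiple of the record size")
    records = [data[i:i + size] for i in range(0, len(data), size)]
    records.sort()
    return b"".join(records)
```

Reading the whole file with `open(path, "rb").read()` (or an `mmap`) and writing the joined result back gives the full round trip; for files larger than RAM an external merge sort is needed instead, as Arnaud Le Blanc's comment about seek costs suggests.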

3 Answers

8

One solution could be to convert the input file to hex, with each record encoded on a separate line, sort that, and convert back to binary:

record_size=32
cat input \
    |xxd -cols $record_size -plain \
    |sort \
    |xxd -cols $record_size -plain -revert

However, it's slow (xxd converts at about 40 MB/s on my machine).

So, since I needed it, I've written binsort, which does the job:

binsort --size 32 ./input ./output

With --size 32 it assumes 32-byte fixed-length records, reads ./input, and writes the sorted records to ./output.

Arnaud Le Blanc
  • Wouldn't this be better off on [SO] ? – user9517 Aug 15 '13 at 20:55
  • Iain: not really. If the OP was asking how to write a program to do the sort, then yes, but they're asking about what programs exist for the job, so no. – mc0e Aug 21 '13 at 08:12
5

Unix's sort utility can handle binary data keyed on byte locations within the records if you refer to them relative to the first "record", e.g. `-k1.28,1.32`.
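For comparison, here is what keying on a byte range means, sketched in Python rather than sort(1) (the function name and signature are illustrative): `-k1.28,1.32` corresponds to comparing bytes 28 through 32 of each record, 1-based and inclusive.

```python
def sort_by_field(data: bytes, size: int, start: int, end: int) -> bytes:
    """Sort fixed-length records by a byte range, 1-based and inclusive,
    mirroring sort(1)'s -k1.START,1.END applied to one long "line"."""
    records = [data[i:i + size] for i in range(0, len(data), size)]
    records.sort(key=lambda r: r[start - 1:end])  # bytes compare bytewise
    return b"".join(records)
```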

Unix sort is less flexible regarding its notion of end-of-line. Depending on your data, you may be able to do a much simpler stream edit than the xxd-based one user68497 proposes and use null-terminated lines. This is still likely to involve a great deal of copying of data in memory, though, and won't come close to the speed of an mmap-based approach.

If you do use Unix sort in some manner, be careful of locale: sort assumes its input is text, and locale affects sort order.
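One way to pin this down is to force the C locale when invoking sort. A minimal check from Python (assuming a Unix sort(1) on the PATH; the helper name is illustrative):

```python
import os
import subprocess


def sort_lines(data: bytes) -> bytes:
    """Run sort(1) with LC_ALL=C so raw byte values, not locale
    collation, decide the order."""
    env = {**os.environ, "LC_ALL": "C"}  # override any inherited locale
    return subprocess.run(["sort"], input=data, env=env,
                          capture_output=True, check=True).stdout
```

In the C locale all uppercase letters sort before lowercase ones (`B` is 0x42, `a` is 0x61), which is what you want when the "text" is really encoded binary keys; locale-dependent collation can reorder bytes unpredictably.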

mc0e
2

Turns out you're in luck; there is a GNU-style Unix program that does exactly this: bsort.

bsort is a highly efficient implementation of an in-place radix sort, with careful attention paid to memory access patterns when working with files larger than RAM. By efficient I mean it was able to beat http://sortbenchmark.org's 2014 energy-efficient 10^8-record sort on hardware from mid 2014: the record was 889 Joules, and an early prototype of this was able to sort the same data in 335 Joules on a stock MacBook Pro. For "small" data sets that fit entirely in RAM (triple-digit megabytes), it's about 3 times faster than libc's qsort.