Optimally sort a dataset of N keys when first key is already ordered

2

0

I frequently run into the situation where partially ordered data needs to be sorted. The first column is already sorted, later ones are not. Like this two column example:

1 5
1 3
2 10
2 -1
2 3
3 11
3 -200
3 20

The desired output is that produced by

sort -k 1,1g -k 2,2g

which works but has the problem that nothing will come out of sort until all of the input has been read. When the input is several gigabytes of text, that can take a while, during which time nothing downstream from the sort in a pipeline can execute. It also isn't very efficient in terms of memory usage since the entire dataset must reside there at once, even though to achieve the desired sort, only a tiny fraction really needs to be there.

With a script it wouldn't be difficult to break this up into chunks and then sort each chunk. Does the sort command have an option somewhere to notify it that the data in that column is already ordered? I don't see it in sort 8.4, but perhaps I just missed it?

If the sort encounters an out of order value in the column it has been told is already ordered it should exit. That indicates an error in upstream processing.

mathog

Posted 2018-02-21T22:42:39.477

Reputation: 31

Are you sure you need g? Wouldn't n be enough? – choroba – 2018-02-21T22:56:30.043

Answers

1

I cannot do this with sole sort. It may not be possible.

In my solution awk handles the first column and runs sort as many times as necessary. The script takes input from stdin, prints to stdout).

#!/usr/bin/awk -f

BEGIN { command = "sort -k 2,2g" }

{
if ( NR==1 ) {
   val=$1
   buf=$0
}
else
if ( $1 < val ) {
   print "Unsorted 1st column detected. Processing last valid chunk and aborting." > "/dev/stderr"
   exit 1
   }
else {
if ( $1 == val )
   buf=buf"\n"$0
else
   {
   print buf | command
   close(command)
   buf=$0
   val=$1
   }
   }
}

END { print buf | command }

Notes:

  • close(command) is crucial. Without it, all pipes to command would go to a single sort.
  • In my opinion awk's comparison operators handle numbers quite well. To be really sure the solution works the way sort would work, you need to retrieve exit status of sort -c -k 1,1g for val"\n"$1 and separately for $1"\n"val, then build the script logic on that. This would run two sort processes per input line, I expect huge performance hit.

Kamil Maciorowski

Posted 2018-02-21T22:42:39.477

Reputation: 38 429

0

Perl to the rescue!

It does exactly what you asked for, it pipes chunks of the input starting with the same number to sort that operates on the second column only.

#!/usr/bin/perl
use warnings;
use strict;

my $sort;

my $first = -1;
while (<>) {
    my ($x, $y) = split;
    if ($first != $x) {
        die "Unsorted line $." if $first > $x;
        $first = $x;
        open $sort, '|-', 'sort -k2,2n' or die $!;
    }
    print {$sort} $_;
}

The only problem might be the initial value of $first: If your input starts with a negative number in the 1st column, you need to specify a lesser value.

I used the n sort instead of g as it seems a bit faster on my machine.

choroba

Posted 2018-02-21T22:42:39.477

Reputation: 14 741

0

Kamil Cuk posted this answer in the other thread. I am going to try to delete that thread now, since apparently having this in two places is a no no, and would like to preserve that answer here.

Following script prints lines with the same first column into a temporary file and sorts it.

file=$1
# create temporary file
temp=$(mktemp)
trap 'rm "$temp"' EXIT
i_last=""
while read -r i j; do
    if [ "$i" != "$i_last" ]; then
        # output sorted temporary file
        sort -n $temp
        # and truncate temporary file
        > $temp
        # increment first column pointer
        i_last=$i
    fi
    # print all lines into temporary file
    echo "$i" "$j" >>$temp
done <"$file"
# dont forget leftovers
sort -n "$temp"

mathog

Posted 2018-02-21T22:42:39.477

Reputation: 31

Sure, but what I was asking was not how to do this with a script, but rather how to do it with just linux sort. That is, is there a "presorted" flag of some type for a key. – mathog – 2018-02-22T18:12:28.263