0

I have a bunch of (rather large i.e. >100MB) comma separated files which needs sorting on 4 processor box with SunOS 5.10

Sort appears to run rather slow (minutes)

I am wondering if there is any way to speed things up, for example making use of more than one processor/core or perhaps by just using some clever sort options?

PS: I am using the entire line as a key so just sort filename > filename.sorted

user9517
  • 114,104
  • 20
  • 206
  • 289
Adrian
  • 229
  • 2
  • 4
  • 12

3 Answers3

2

Here is the script I wrote for this purpose. On a 4 processor machine it improved the sort performance by 100% ! (Thanks Bash for the tip!)

#! /bin/ksh

MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

usage ()
{
     echo Parallel sort
     echo usage: psort file1 file2
     echo Sorts text file file1 and stores the output in file2
     echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
     echo  and each chunk will be sorted in parallel
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

#Cleanup any lefover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE

#Splitting $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX

for file in $CHUNK_FILE_PREFIX*
do
    sort $file > $file.sorted &
done
wait

#Merging chunks to $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE

#Cleanup any lefover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
Adrian
  • 229
  • 2
  • 4
  • 12
2

See: need high performance /bin/sort; any suggestions?

Ole Tange
  • 2,836
  • 5
  • 29
  • 45
1

I found this script a while ago: distsort.sh

I don't remember what I used it for or if it worked, so let me know if it works for you.

Not Now
  • 3,532
  • 17
  • 18