I'm trying to deduplicate a set of lines pulled from a file with egrep, using sort -u, and then count them. About 10% of the lines (each 100 characters long, from the alphabet [ATCG]) are duplicated. There are two files of about 3 GB each; 50% of the lines aren't relevant, so perhaps 300 million lines.
LC_ALL=C grep -E <files> | sort --parallel=24 -u | wc -m
Between LC_ALL=C and using -x to accelerate grep, the slowest part by far is the sort. Reading the man pages led me to --parallel=n, but experimentation showed absolutely no improvement. A little digging with top showed that even with --parallel=24, the sort process only ever runs on one processor at a time.
I have 4 chips with 6 cores each and 2 threads per core, giving a total of 48 logical processors. Here is the lscpu output, since /proc/cpuinfo would be too long.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 21
Model: 1
Stepping: 2
CPU MHz: 1400.000
BogoMIPS: 5199.96
What am I missing? Even if the process is IO-bound, shouldn't I see parallel processing anyway? The sort process uses 99% of the processor it is actually on at any given time, so I should be able to see parallelization if it's happening. Memory isn't a concern: I have 256 GB to play with and none of it is used by anything else.
Something I discovered by redirecting grep's output to a file and then reading that file with sort:
LC_ALL=C grep -E <files> > reads.txt ; sort reads.txt -u | wc -m
default, file: 1m50s
--parallel=24, file: 1m15s
--parallel=48, file: 1m6s
--parallel=1, no file: 10m53s
--parallel=2, no file: 10m42s
--parallel=4, no file: 10m56s
(others still running)
These benchmarks make it pretty clear that sort isn't parallelizing at all when its input comes from a pipe. When it is allowed to read from a file, sort splits the load as instructed.
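The pipe-versus-file comparison above can be reproduced on a small synthetic input with something like the sketch below. It assumes GNU sort (for --parallel and -S). The line count, the thread count of 4, and the 64M buffer are arbitrary values picked for illustration, not taken from the runs above. The explicit -S on the piped path is only a guess at something worth testing: sort cannot see the size of a pipe's input up front, so its default buffer sizing may differ from the file case.

```shell
#!/bin/sh
# Sketch: same data fed to sort through a pipe and as a file argument.
# Assumes GNU sort; all sizes below are illustrative, not from the original runs.
export LC_ALL=C

tmp=$(mktemp)

# Generate 2000 random 100-character lines over [ATCG]; every 10th line is duplicated.
awk 'BEGIN {
    srand(1)
    split("A T C G", base, " ")
    for (i = 0; i < 2000; i++) {
        s = ""
        for (j = 0; j < 100; j++) s = s base[int(rand() * 4) + 1]
        print s
        if (i % 10 == 0) print s
    }
}' > "$tmp"

# Path 1: input arrives on a pipe, with an explicit buffer size (assumption to test).
piped=$(cat "$tmp" | sort --parallel=4 -S 64M -u | wc -l)

# Path 2: sort reads the file itself.
from_file=$(sort --parallel=4 -u "$tmp" | wc -l)

echo "unique lines via pipe: $piped"
echo "unique lines via file: $from_file"

rm -f "$tmp"
```

Both paths should report the same number of unique lines. To compare speed, scale the input up and wrap each sort invocation in time while watching top for the number of busy cores.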
What sort is that, on which distribution? The standard sort doesn't know that option. – ott-- – 2015-07-09T20:17:27.777
uname -a gives "3.13.0-46-generic #79-Ubuntu SMP" and lsb_release -a claims 14.04.2, codename trusty, and the version of sort is the one that's part of the GNU coreutils, according to man sort. – Jeremy Kemball – 2015-07-09T20:22:21.943
It seems to me there are portions here that need to be re-read: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
I'm not sure I understand what you're getting at @Hannu, could you be more specific? sort --parallel=2 doesn't parallelize either. Neither does 4 or 8. nproc gives back 48 like it should. – Jeremy Kemball – 2015-07-09T20:36:53.473
I'd say... don't use coreutils for this. Amusingly we had a very similar question and well.... every other method works better http://superuser.com/a/485987/10165 – Journeyman Geek – 2015-07-10T02:23:48.703
@JeremyKemball - the fact that n>8 would not add much, for starters; it also seemed to me that your use of -u (unique) was not in line with what you asked for; better use unique, not sort -u – Hannu – 2015-07-10T08:19:10.663
sort | uniq and sort -u are explicitly equivalent if you don't use keys to sort the lines. Is there a reason to use an extra pipe? – Jeremy Kemball – 2015-07-10T13:40:10.267
From how I understand the text under the link above, there is a difference. But then, I'm NOT a native English speaker. – Hannu – 2015-07-10T18:22:44.197
They differ in the last-resort comparison if you have specified ordering/sorting options, so that things that sort identically are unique'd by those sorting options. I don't have -d, -f, -n, -h, -M, -v, or -k, so sort | uniq is explicitly identical to sort --unique, doubly so since I have very boring strings. – Jeremy Kemball – 2015-07-11T19:36:48.283
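The distinction discussed in the last few comments can be illustrated with a tiny input. This is a minimal sketch, not something from the thread: the three-line sample and the choice of -f as the ordering option are assumptions made for the example. Without ordering options the two forms should agree; with -f, sort -u deduplicates by the folded comparison, while uniq compares the raw lines.

```shell
#!/bin/sh
# Sketch: when "sort -u" and "sort | uniq" agree, and when they can differ.
# The sample input and the use of -f are illustrative assumptions.
export LC_ALL=C

input=$(printf 'a\nA\na\n')

# No ordering options: both forms compare whole lines byte by byte.
plain_u=$(printf '%s\n' "$input" | sort -u)
plain_uniq=$(printf '%s\n' "$input" | sort | uniq)

# With -f (fold case): sort -u treats "a" and "A" as equal,
# while uniq still sees them as different lines.
fold_u=$(printf '%s\n' "$input" | sort -f -u | wc -l)
fold_uniq=$(printf '%s\n' "$input" | sort -f | uniq | wc -l)

if [ "$plain_u" = "$plain_uniq" ]; then
    echo "no options: identical output"
fi
echo "with -f: sort -u keeps $fold_u line(s), sort | uniq keeps $fold_uniq line(s)"
```

For fixed-length strings over [ATCG] with no ordering options, as in the question, only the first case applies.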