
This is a follow-up to my previous question [1].

I have a compressed file that I stream into a Python program, e.g.

bzcat data.bz2 | parallel --no-notice -j16 --pipe python parse.py > result.txt

parse.py reads from stdin continuously and prints to stdout.
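For illustration only, here is a minimal sketch of the shape such a script might take. The real parse.py is not shown in this question, so `process_line` and its upper-casing are hypothetical placeholders, and one output line per input line is an assumption.

```python
import sys


def process_line(line: str) -> str:
    # Hypothetical placeholder: the actual parsing logic is not shown here.
    return line.rstrip("\n").upper()


def parse_stream(inp, out) -> int:
    """Read lines lazily from inp and write one result per line to out."""
    count = 0
    for line in inp:  # iterating a file object streams it line by line
        out.write(process_line(line) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    parse_stream(sys.stdin, sys.stdout)
```

With --pipe, each job receives a chunk of whole lines on stdin, so a script of this shape can handle its chunk independently of the others.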

My EC2 instance has 16 cores, but top shows a load average of only 3 to 4.

In the ps output, I see a lot of processes like this:

sh -c 'dd bs=1 count=1 of=/tmp/7D_YxccfY7.chr 2>/dev/null';       

I know I could use -a in.txt to improve performance, but in my case I am streaming from bz2 (I cannot extract it because I don't have enough disk space).

How can I improve the efficiency in my case?

[1] Gnu parallel not utilizing all the CPU

Ryan

1 Answer


Increase the block size:

--block 100m
Ole Tange