I find myself having to compress a number of very large files (80-ish GB), and I am surprised at the (lack of) speed my system is exhibiting. I get about 500 MB / min conversion speed; using top, I seem to be using a single CPU at approximately 100%.
I am pretty sure it's not (just) disk access speed, since creating a tar file (that's how the 80 GB file was created) took just a few minutes (maybe 5 or 10), but after more than 2 hours my simple gzip command is still not done.
In summary:
tar -cvf myStuff.tar myDir/*
Took <5 minutes to create an 87 GB tar file
gzip myStuff.tar
Took two hours and 10 minutes, creating a 55 GB gzipped file.
My question: Is this normal? Are there certain options in gzip to speed things up? Would it be faster to combine the commands and use tar -cvfz? I saw a reference to pigz - Parallel Implementation of GZip - but unfortunately I cannot install software on the machine I am using, so that is not an option for me. See for example this earlier question.
I am intending to try some of these options myself and time them - but it is quite likely that I will not hit "the magic combination" of options. I am hoping that someone on this site knows the right trick to speed things up.
When I have the results of other trials available I will update this question - but if anyone has a particularly good trick, I would really appreciate it. Maybe gzip just takes more processing time than I realized...
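For concreteness, the kind of single-pass invocation I am asking about would look something like the following (the directory and output names are just examples):
# pipe tar straight into gzip; -1 is the fastest level, -9 the smallest/slowest
tar -c myDir | gzip -1 > myStuff.tar.gz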
UPDATE
As promised, I tried the tricks suggested below: changing the amount of compression, and changing the destination of the file. I got the following results for a tar that was about 4.1 GB:
flag   user      system   size     sameDisk
-1     189.77s   13.64s   2.786G   +7.2s
-2     197.20s   12.88s   2.776G   +3.4s
-3     207.03s   10.49s   2.739G   +1.2s
-4     223.28s   13.73s   2.735G   +0.9s
-5     237.79s    9.28s   2.704G   -0.4s
-6     271.69s   14.56s   2.700G   +1.4s
-7     307.70s   10.97s   2.699G   +0.9s
-8     528.66s   10.51s   2.698G   -6.3s
-9     722.61s   12.24s   2.698G   -4.0s
So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change to the size of the compressed file. Whether I'm using the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).
If anyone is interested, I generated these timing benchmarks using the following two scripts:
#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'
logFile='./timerOutput'
rm -f $logFile
for i in {1..9}
do
  # time compression at level $i, writing first to the same disk, then to the other disk
  /usr/bin/time -a --output=$logFile ./compressWith $sourceDir $i $sameDisk $logFile
  /usr/bin/time -a --output=$logFile ./compressWith $sourceDir $i $otherDisk $logFile
done
And the second script (compressWith):
#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz
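For reference, a single invocation of this wrapper, using the directory, destination, and log file defined in the driver script above, looks like:
./compressWith /dirToCompress 3 /tmp/ ./timerOutput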
Three things to note:
- Using /usr/bin/time rather than time, since the built-in command of bash has many fewer options than the GNU command
- I did not bother using the --format option, although that would make the log file easier to read
- I used a script-in-a-script since time seemed to operate only on the first command in a piped sequence (so I made it look like a single command... an alternative is sketched just below)
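As an aside, an alternative I did not test would be to hand the whole pipeline to bash -c, so that /usr/bin/time sees it as a single command (variable names as in the scripts above):
# time the whole tar | gzip pipeline as one command, no wrapper script needed
/usr/bin/time -a --output=$logFile bash -c "tar -c $sourceDir | gzip -$i > ${otherDisk}test-$i.tar.gz"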
With all this learnt, my conclusions are:
- Speed things up with the -1 flag (accepted answer)
- Much more time is spent compressing the data than reading from disk
- Invest in faster compression software (pigz seems like a good choice)
- If you have multiple files to compress, you can put each gzip command in its own process and use more of the available CPU (poor man's pigz; see the sketch after this list)
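To illustrate that last point, a rough sketch of the "poor man's pigz" idea (the file pattern and process count are placeholders; set -P to roughly your number of CPU cores):
# compress each .tar file with its own gzip process, up to 4 at a time
find myDir -name '*.tar' -print0 | xargs -0 -n1 -P4 gzip -1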
Thanks everyone who helped me learn all this!
I came here because I was interested in what the title promised: 100GB. Then the question was about a file with 80GB which took 2 hours to compress down to 50GB. Then the benchmark speeds you gave us were for something that was 4GB. So, you ended up using -1? How long did it take for the 80GB file (or 100GB files), and how much did it compress? – user1271772 – 2017-04-08T00:00:06.223
I'd be interested if gzip experts have a view, but my guess would be that technically the speedup via the fastest compression mode is from it not trying to compress data across the entire file - with very large files, you'd get a giant symbol set and spend a lot of time traversing the target file trying to compress everything. Presumably the fastest mode compresses only a small window (in bytesize) of data at a time and never backtracks to improve the compression on prior windows, allowing it to traverse the entire file just once, more-or-less. – stevemidgley – 2019-06-01T16:15:12.603
@stevemidgley your curiosity is more likely to be satisfied if you turn your comment into a well-formulated question (perhaps with a link to this one)... more likely to be seen by someone who can answer it. – Floris – 2019-06-01T16:19:15.997
tar -cvf doesn't do any compression so it will be quicker – parkydr – 2013-05-03T17:23:13.600
@parkydr - Thanks for your comment. I am aware of the reasons why tar -cvf is quicker: I added the information to show that my disk access on this system is quite fast, and not limiting the speed of the gzip. I understand that I need tar -cvfz in order to get both tar and compression. But I'm looking for any tricks I missed to speed up compression. – Floris – 2013-05-03T17:27:25.367
@Floris: what kind of data are you trying to compress? side-note: $> gzip -c myStuff.tar | pv -r -b > myStuff.tar.gz will show you how fast your machine is compressing the stuff. side-note2: store the result onto a different disc. – akira – 2013-05-03T17:31:45.380
Sorry, I misread your question. gzip has the --fast option to select the fastest compression – parkydr – 2013-05-03T17:32:04.863
@Floris as akira mentioned, try outputting to a different disk. The seeks between read and write operations might be contributing to your problem. Also, tar -cvfz will do the tar and compression in one step, so it will be faster than separately running tar, then gzip. – rob – 2013-05-03T17:42:34.110
@parkydr: The --fast option is one I didn't know about... it's the very last one in the man page, and I didn't read that far (because it's sorted by 'single letter command', which is -#). That will teach me to RTFM! This will be the next thing I try! – Floris – 2013-05-03T17:45:27.960
@akira - the data is a fairly dense "proprietary format" - there are actually three kinds of files, one of which is hardly compressible while the other two are fairly sparse. Unfortunately there is no simple expression that distinguishes them. Is there a way to "compress just some" files, based on an expression? My machine doesn't appear to have the pv command... uname -r returns 2.6.18-164.el5. Pretty sure this is Red Hat. – Floris – 2013-05-03T17:52:12.440
@rob - I will try to find another disk to write to, and time the difference. I really appreciate everyone's suggestions and will be sure to update timing benchmarks when my testing is completed. – Floris – 2013-05-03T18:11:12.197
@Floris: regarding pv: http://rpm.pbone.net/index.php3?stat=3&search=pv&srodzaj=3 . aside from that: proprietary sounds like bad compression ratio. thus i would advise you to use less cpu and to compress just a little bit (see the answer of @user1291332), flags -1 or -2. – akira – 2013-05-03T18:19:52.063
Note that if a suitable compiler is available on the machine, and the filesystem permissions are not set to prohibit executing binaries from the directories you have access to, you can compile pigz and run it from wherever you happened to build it, without installing it. If there is no compiler, you could cross-compile it on another computer, although that's starting to get into more effort than might be worth it. (Depending on just how badly you need this compression to run faster, I guess.) – David Z – 2013-05-03T19:42:27.333
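(For anyone wanting to follow that last suggestion, a rough, untested outline of the steps might look as below; the download URL and version number are assumptions, so check the pigz page at zlib.net for the current release.)
# build pigz in a user-writable directory and run it without installing
wget https://zlib.net/pigz/pigz-2.8.tar.gz   # version/URL assumed - verify first
tar -xzf pigz-2.8.tar.gz
cd pigz-2.8 && make
./pigz -1 -p 8 myStuff.tar                   # -p sets the number of compression threads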