Converting gzip files to bzip2 efficiently

10

3

I have a bunch of gzip files that I have to convert to bzip2 every now and then. Currently, I'm using a shell script that simply 'gunzip's each file and then 'bzip2's it. Though this works, it takes a lot of time to complete.

Is it possible to make this process more efficient? I'm ready to take a dive and look into the gunzip and bzip2 source code if necessary, but I just want to be sure of the payoff first. Is there any hope of improving the efficiency of the process?
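
The script essentially does the following (simplified):

for f in *.gz; do
    gunzip "$f"          # foo.gz becomes foo
    bzip2 "${f%.gz}"     # foo becomes foo.bz2
done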

sundar - Reinstate Monica

Posted 2009-08-17T00:45:54.690

Reputation: 1 289

Answers

1

This question was asked a long time ago, when pbzip2 either wasn't available or wasn't capable of compressing from stdin, but you can now parallelize both the decompression and compression steps using GNU parallel and pbzip2 (instead of bzip2):

ls *.gz | parallel "gunzip -c {} | pbzip2 -c > {.}.bz2"

which is significantly faster than using bzip2.
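
If you also want each original deleted once its replacement is written, you can chain an rm inside the job (a sketch; the pipeline's exit status is pbzip2's, so a corrupt .gz that gunzip rejects mid-stream may slip through, and you may prefer to keep the originals until you've spot-checked the output):

ls *.gz | parallel 'gunzip -c {} | pbzip2 -c > {.}.bz2 && rm {}'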

flyingfinger

Posted 2009-08-17T00:45:54.690

Reputation: 211

Hi, I've changed the accepted answer to this one since this gives the best option for people stumbling upon the question today. Thanks for the pbzip2 mention. In case the link doesn't load for anyone else, here's the project page and the man page.

– sundar - Reinstate Monica – 2018-05-08T10:44:11.260

15

Rather than gunzip in one step and bzip2 in another, I wonder if it would perhaps be more efficient to use pipes. Something like:

gunzip --to-stdout foo.gz | bzip2 > foo.bz2

I'm thinking that with two or more CPUs this would definitely be faster, and perhaps even with only a single core. I shamefully admit to not having tried this out, though.
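
Applied to a whole directory, it might look like this (again untested; pipefail makes a gunzip failure fail the whole pipeline, so the original is only removed when both steps succeed):

set -o pipefail
for f in *.gz; do
    gunzip --to-stdout "$f" | bzip2 > "${f%.gz}.bz2" && rm "$f"
done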

ChrisInEdmonton

Posted 2009-08-17T00:45:54.690

Reputation: 8 110

@sleske, even though you are right in theory, bzip2's CPU usage dwarfs gunzip's, so in practice the parallelism you get here is minimal. Not having to do disk I/O is still nice, though! – Johan Walles – 2017-08-23T13:50:27.043

@JohanWalles: Yes, but that is precisely why the parallelization made possible by the pipe is useful: if you instead decompress to a file first, then bzip2 it, you a) incur extra I/O (as mentioned), and b) the CPU cannot even start working on the bzip2 compression before gunzip is done. The fact that gunzip takes little CPU is one more reason to let bzip2 run in parallel, because there's plenty of idle CPU to use. – sleske – 2017-08-24T08:35:59.837

2

+1 for piping, disk I/O is something you want to avoid. As for compression, unless I'm mistaken, bzip2 isn't parallel. You'd have to use something like pbzip2 to compress in parallel: http://compression.ca/pbzip2/

– gustafc – 2009-08-17T07:01:51.003

... and unfortunately, there doesn't seem to be any parallel gzip decompression utility available. – gustafc – 2009-08-17T07:07:11.367

@gustafc: Thanks for the link to pbzip2, that was very helpful... @OP: I shied away from piping because I want to be able to deal with corrupt gz files, etc., without losing them in the pipe... – sundar - Reinstate Monica – 2009-08-18T05:32:50.217

@gustafc: Even if bzip2 and gzip don't work in parallel internally, by using a pipe you can have them work in parallel, because a pipe implicitly starts two processes, which will run in parallel. So at least decompression and compression will run in parallel. – sleske – 2011-04-17T18:49:53.140

6

GNU parallel (http://www.gnu.org/software/parallel) might be an option if you have multiple cores (or even multiple machines):

ls *.gz | parallel "gunzip -c {} | bzip2 > {.}.bz2"

Read the tutorial / man page for details and options.
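
For example, to limit the number of simultaneous jobs, or to spread the work across machines over ssh (a sketch; server1 and server2 are placeholder hostnames, and --trc transfers each input file, returns the .bz2, and cleans up the remote copies):

ls *.gz | parallel -j4 "gunzip -c {} | bzip2 > {.}.bz2"
ls *.gz | parallel -S server1,server2,: --trc {.}.bz2 "gunzip -c {} | bzip2 > {.}.bz2"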

supervlieg

Posted 2009-08-17T00:45:54.690

Reputation: 61

3

What you're currently doing is your best bet. There is no direct conversion tool, and attempting to bzip2 an already gzipped file is not really an option: the gzip step has already squeezed out most of the redundancy, so bzip2 has almost nothing left to work with. Since the algorithms are different, converting would involve recovering the original data regardless. That could only be avoided if gzip compression were a stage of the bzip2 process, which, unfortunately, it isn't.
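
A quick experiment illustrates the point (sizes will vary with the input):

ls -l foo.gz
bzip2 -kv foo.gz    # keeps foo.gz, writes foo.gz.bz2, typically about the same size or larger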

John T

Posted 2009-08-17T00:45:54.690

Reputation: 149 037

Don't the algorithms have any overlapping steps, such that I could skip a step in gzip decompression and the corresponding step in bzip2 compression? – sundar - Reinstate Monica – 2009-08-19T06:54:55.060

@sundar I wouldn't think so. gzip uses Lempel-Ziv 77 (LZ77), while bzip2 uses the Burrows-Wheeler transform. Different algorithms, I'm afraid. – new123456 – 2011-07-03T14:21:53.720

2

Occasionally, I need to do the same thing with log files. I start with the smallest *.gz files first (ls -rS), then gunzip and bzip2 them individually. I do not know whether it is possible to direct the gunzip output directly to the bzip2 input. The bzip2 command is so much slower at compressing than gunzip is at decompressing that doing so may consume the memory and swap space on the host.

Improvements or suggestions are welcome. Here is my one-liner:

for i in $(ls -rS *.gz | sed 's/\.gz$//'); do gunzip "${i}.gz" && bzip2 -9 "${i}"; done
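
A slightly more robust variant of the same one-liner (a sketch: it keeps the smallest-first order but reads filenames line by line, so names with spaces survive; names containing newlines still won't):

ls -rS *.gz | while IFS= read -r f; do gunzip "$f" && bzip2 -9 "${f%.gz}"; done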

Mike L Swartz

Posted 2009-08-17T00:45:54.690

Reputation: 21

Thanks for the input; the point about the difference in speed between the two processes and its implication is an important one. – sundar - Reinstate Monica – 2012-12-15T16:46:58.383

1

Just had to do this a few minutes ago:

find . -name "*.gz" | perl -pe 's/\.gz$//' | xargs -n1 ./rezip

Where rezip would be defined as:

#!/bin/bash
# Replace $1.gz with $1.bz2; only recompress if gunzip succeeded.
gunzip -v "$1.gz" && bzip2 -9v "$1"

Optionally, you can also run several conversions in parallel by adding a -P option to xargs, but be careful with that one. (Start low!)
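
For example (start low and raise the count as your CPU allows):

find . -name "*.gz" | perl -pe 's/\.gz$//' | xargs -n1 -P2 ./rezip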

Brendan Byrd

Posted 2009-08-17T00:45:54.690

Reputation: 111

1

If you have more than a few files, check out the Linux Gazette article below, which has a nice shell script.

http://linuxgazette.net/123/bechtel.html

7-Zip gets better compression and is multithreaded.
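
If you go that route, the same piping idea works for the conversion (a sketch assuming the p7zip package's 7z binary; -si makes it read the data from stdin):

gunzip -c foo.gz | 7z a -si foo.7z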

Ronald Pottol

Posted 2009-08-17T00:45:54.690

Reputation: 641