.tar.gz: Is there a relation between compression time and decompression time?

1

I am compressing a backup of a MongoDB database (~500 GB) into a .tar.gz archive, which takes on the order of hours. I am trying to bring that database back up on different machines for testing purposes, and I would like an estimate of how long this will take per machine.
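
For context, the compression step is along these lines (paths and file names here are placeholders, not my exact command):

time tar -czf mongo-backup.tar.gz /path/to/mongodb/dump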

My question is: is there any way to estimate how long decompressing the archive will take, based on how long compression took?

Thanks

radschapur

Posted 2017-06-22T17:48:18.470

Reputation: 21

1

Some benchmarks. But differences in hardware between source and target machines can make the result vary widely....

– xenoid – 2017-06-22T19:59:53.713

Interesting results, thanks for the link. Most of the machines I'm dealing with have similar hardware, so I can still get an idea. I'm mostly concerned about decompression, so it seems like gzip is the best option for me, with decompression being about 10 times faster than compression. – radschapur – 2017-06-22T21:03:35.857

I'd expect disk I/O to be the bottleneck in both processes. Writing tends to be faster than reading, because buffering means the writer doesn't have to wait for the disk. – Barmar – 2017-06-23T00:40:05.737

Answers

0

I'm not aware of a standard ratio of compression time to decompression time, since this really depends on your data and server resources. Assuming all other resources are equal, decompression is generally faster because there is less computational work involved. A reasonable worst-case estimate is the initial compression time.
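
If you need a tighter estimate than that worst case, one rough approach (a sketch only, with placeholder paths, not part of the original workflow) is to time a sample round trip and extrapolate:

time tar -czf sample.tar.gz /path/to/data/sample
time tar -xzf sample.tar.gz -C /tmp/restore-test
# Multiply the measured times by (full data size / sample size) for a ballpark figure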

However, for an easy win I would recommend using pigz, a parallel implementation of gzip that takes advantage of multiple processors and cores. Unless you only have a single core available, pigz should substantially reduce compression time; the gains for decompression are more modest, since pigz's decompression is still essentially single-threaded (extra threads are used mainly for reading, writing, and checksums).

Sample usage with tar:

tar -c --use-compress-program=pigz -f data.tgz /path/to/data
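
For extraction, recent GNU tar should invoke the compression program with -d automatically, so decompression would look roughly like this (the destination path is a placeholder):

tar -x --use-compress-program=pigz -f data.tgz -C /path/to/restore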

For more examples, see StackOverflow: Utilizing multi core for tar+gzip/bzip compression/decompression.

Stennie

Posted 2017-06-22T17:48:18.470

Reputation: 204

Thanks for the info. I used pigz for compression. Unfortunately, I'm intending to compress the db only once in order to replicate it on many other servers, so decompression is the main concern. Pigz doesn't seem to offer a lot of improvement there. – radschapur – 2017-06-30T15:56:44.430

@radschapur Perhaps bzip2 and pbzip2 (parallel bzip2) are a better option? The bzip2 format seems more conducive to parallel decompression, per the discussion at https://github.com/madler/pigz/issues/36 (rough usage sketch below).

– Stennie – 2017-07-02T23:50:40.120
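
A rough sketch of how pbzip2 could be slotted into the same tar invocation (untested here; paths are placeholders). Note that parallel decompression generally only helps for archives that were created by pbzip2 itself:

tar -c --use-compress-program=pbzip2 -f data.tar.bz2 /path/to/data
tar -x --use-compress-program=pbzip2 -f data.tar.bz2 -C /path/to/restore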

0

There is no definite ratio on the same machine, and using multiple machines (of different types) can definitely have an impact. Compression and decompression both involve storage (e.g., a hard drive or SSD), the processor, and other components such as memory.

As an over-generalization, decompressing is pretty fast, and may even be faster than simply copying the same amount of uncompressed data. Compressing can sometimes be just as fast; for something simple like RLE compression it often is. For zip and gzip, though, common implementations compress more slowly than they decompress, and you can often squeeze out another 5%-15% of compression effectiveness by choosing more aggressive compression options that take 2-4 times as long.
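
As a hedged illustration of that trade-off, you could time different gzip levels on a representative sample file (file names are placeholders):

time gzip -1 -c sample.bson > sample-fast.gz      # fastest setting, largest output
time gzip -9 -c sample.bson > sample-best.gz      # slowest setting, smallest output
ls -l sample-fast.gz sample-best.gz               # compare the resulting sizes
time gzip -dc sample-best.gz > /dev/null          # decompression, typically much faster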

The difference is largely because compression involves trial work (sometimes thought of as "guessing"), and some of those attempts are fruitless. In contrast, decompression just follows a pre-established process, so it goes relatively quickly.

TOOGAM

Posted 2017-06-22T17:48:18.470

Reputation: 12 651