What is the maximum compression ratio of gzip?

53


What is the largest size a gzip file (say 10 KB, for the sake of an example) can be decompressed to?

Zombies

Posted 2010-05-09T11:47:14.867

Reputation: 3 214

Answers

95

Update 2020-02-06: As mentioned in the comments, I have been unable to reproduce the original result with gzip. Working on the assumption that I accidentally used a different compression format in that original quick test, I've repeated the test with gzip and updated the figures below accordingly. This new result agrees with the theoretical maximum compression stated in other answers/comments.


It very much depends on the data being compressed. A quick test with a 1 GB file full of zeros using a standard version of gzip (with default options or specifying -9) gives a compressed size of ~1018 KB, so your 10 KB file could potentially expand to ~10 MB.
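
If you want to reproduce that sort of test yourself, a minimal sketch (assuming a Unix-like system with GNU coreutils and gzip; exact figures vary slightly between gzip versions) is to pipe zeros straight through gzip and count the output bytes, without ever writing the 1 GB input to disk:

# Compress 1 GiB of zero bytes and print the compressed size in bytes
head -c 1073741824 /dev/zero | gzip -9 | wc -c
# Prints a figure of roughly 1 MB, i.e. a ratio of around 1000:1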

If the data has low redundancy to start with, for instance if the archive contains image files in a format that is already compressed natively (gif, jpg, png, ...), then gzip may add no further compression at all. For binary files like program executables you might see up to 2:1 compression; for plain text, HTML or other markup, 3:1 or 4:1 or more is not unlikely. You might see 10:1 in some cases, but the ~1030:1 seen with a file filled with a single symbol is something you are not going to see outside similarly artificial circumstances.

You can check how much data would result from unpacking a gzip file, without actually writing its uncompressed content to disk, with gunzip -c file.gz | wc --bytes - this uncompresses the file but does not store the result, instead passing it to wc, which counts the bytes as they pass and then discards them. If the compressed content is a tar file containing many small files you might find that noticeably more disk space is required to unpack the full archive, but in most circumstances the count returned from piping gunzip output through wc is going to be as accurate as you need.
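
As a concrete sketch (file.gz is just a placeholder name), you can compare gzip's own listing with the exact count. Note that the uncompressed size stored in the gzip header is only kept modulo 4 GiB, so the pipe through wc is the more reliable figure for very large data:

# Quick listing from the gzip header (uncompressed size is stored modulo 4 GiB)
gzip -l file.gz
# Exact uncompressed byte count, with nothing written to disk
gunzip -c file.gz | wc --bytes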

David Spillett

Posted 2010-05-09T11:47:14.867

Reputation: 22 424

I've seen HTML expand to 10x (of course 3x and 4x were the most common!)... perhaps there was a lot of redundant data in the ones that were expanding 8x or more. I think the page in question that was doing that was a phpinfo page. – Zombies – 2010-05-10T12:10:15.413

Repetitive markup, as seen in the output of phpinfo(), compresses very well. The technical information in that output contains more direct repetition than the average chunk of natural language would too, and the alphabet distribution is probably less smooth which could help the Huffman stage get better results. – David Spillett – 2010-05-10T12:55:56.773

I found a 1 GiB file full of zeroes gives a 1,042,079-byte compressed file, which is much larger than ~120 KB. gzip 1.3.12 used. – WKPlus – 2017-06-27T06:51:06.757

@WKPlus - I can't replicate the result either, with gzip default compression or "--best" or other implementations like 7zip's; in all cases I get just over or just under 1 MB. Even with other formats like 7z I can only get as far as ~150 KB. Even different input sizes like 100 MB don't give the same result. Unfortunately I don't have a note of what I used back then to see where the discrepancy comes from. I'll fully revise the answer when I have a bit more time tomorrow. – David Spillett – 2017-06-27T10:51:49.620

This answer doesn't account for intentionally malicious compressed data. One can craft a malicious zip file of around 10 KB that can expand to a bit over 4 GB.

– David Schwartz – 2013-01-02T02:36:52.887

Zip bombs of that scale rely on nested archives though, so as a human unpacking the file you would notice something odd before long. They can be used as an effective DoS attack against automated scanners (on mail services and so forth), though. – David Spillett – 2013-01-02T11:47:33.687

@DavidSpillett: Nested zip bombs expand into sizes in the petabyte range. That's not what I'm talking about. Look at even just a single layer of a typical zip bomb. – David Schwartz – 2013-01-02T23:48:33.787

10

Usually you don't get more than 95% compression (so 10 KB of gzipped data would decompress to ~200 KB), but there are specially crafted files that expand exponentially. Look for 42.zip; it decompresses to a few petabytes of (meaningless) data.

liori

Posted 2010-05-09T11:47:14.867

Reputation: 3 044

That is zip, not gzip – BeniBela – 2017-01-03T22:39:59.967

4

Wikipedia says 42.zip is "containing five layers of nested zip files in sets of 16", so that is not a valid example for decompression (only for recursive decompression).

– Tgr – 2013-07-10T13:59:31.387

Indeed, 42.zip is specifically a danger to tools that automatically scan zip files recursively, for example virus scanners. – thomasrutter – 2014-02-05T00:42:42.997

8

Quoted verbatim from https://stackoverflow.com/a/16794960/293815

The maximum compression ratio of the deflate format is 1032:1. This is because the longest run that can be encoded is 258 bytes. At least two bits are required for each such run (one bit for the length code and one bit for the distance code), hence 4*258 = 1032 uncompressed bytes can be encoded per one compressed byte.

You can get more compression by gzipping the result of gzip. Normally that doesn't improve compression, but for very long runs it can.

By the way, the LZ77 approach used by deflate is more general than run-length encoding. Instead of just a length, a length/distance pair is used. This allows copying a string from some distance back, or replicating a byte as in run-length for a distance of one, or replicating triples of bytes with a distance of three, etc.
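
A rough way to see both points in practice, sketched with GNU gzip and coreutils (exact sizes will vary slightly between versions):

# 100 MB of a single repeated byte approaches the ~1032:1 deflate limit
head -c 100000000 /dev/zero | gzip -9 | wc -c   # on the order of 100 KB
# Gzipping the already-gzipped stream shrinks it much further, because the
# compressed output of one long run is itself highly repetitive
head -c 100000000 /dev/zero | gzip -9 | gzip -9 | wc -c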

ioquatix

Posted 2010-05-09T11:47:14.867

Reputation: 181

6

The compression ratio of any compression algorithm is a function of the data being compressed (as well as the length of that data).

Here is an analysis from MaximumCompression. Look at one of the samples, such as this one:

Summary of the multiple file compression benchmark tests

File type : Multiple file types (46 in total)
# of files to compress in this test : 510
Total File Size (bytes) : 316,355,757
Average File Size (bytes) : 620,305
Largest File (bytes) : 18,403,071
Smallest File (bytes) : 3,554

nik

Posted 2010-05-09T11:47:14.867

Reputation: 50 788

4

A huge file containing only one symbol will compress very well.

geek

Posted 2010-05-09T11:47:14.867

Reputation: 7 120

4

A file of 10 MB of zeros compresses with gzip -9 to 10,217 bytes, so the maximum ratio looks to be around 1000:1.
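
That figure is easy to reproduce as a quick sketch (assuming GNU coreutils and gzip; zeros.bin is just an example name, and the exact byte count depends on the gzip version):

# Create a 10 MB file of zeros, compress it, and compare the sizes
head -c 10000000 /dev/zero > zeros.bin
gzip -9 -c zeros.bin > zeros.bin.gz
ls -l zeros.bin zeros.bin.gz   # the .gz file should be roughly 10 KB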

nikos

Posted 2010-05-09T11:47:14.867

Reputation: 41

1

The answer to your question depends on the input. To give you an idea of how compression is done, watch this six-minute video:

https://www.youtube.com/watch?v=ZdooBTdW5bM

What you should take from it is that the compression rate depends on the frequency of each character, so there is no general maximum rate; it depends on the input. For English text it is about 65 percent.
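
To get a rough feel for that figure, compress any plain English text file and compare the sizes; sample.txt below is just a placeholder for whatever text you have at hand, and the exact percentage will vary with the material:

# Compare original and gzipped sizes of an English text file
wc -c < sample.txt
gzip -9 -c sample.txt | wc -c
# For typical English prose the compressed size is commonly around a third
# of the original, i.e. roughly 60-70% of the space is saved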

brunsgaard

Posted 2010-05-09T11:47:14.867

Reputation: 111

Welcome to Super User! Please quote the essential parts of the answer from the reference link(s), as the answer can become invalid if the linked page(s) change. – DavidPostill – 2016-10-21T11:03:30.883

It would be more accurate to say "frequency of each string" rather than "frequency of each character" – JoelFan – 2017-02-09T19:33:52.327