Why does a zip file appear larger than the source file, especially when it is text?

I have a text file that is 19 bytes in size, and after compressing it with both zip and 7zip, the resulting archive is larger than the original. I read Why is a 7zipped file larger than the raw file? as well as Why doesn't ZIP Compression compress anything?, but since this file is not already compressed, I expected it to compress further. Attached is a screenshot.

EDIT0

I took the example further by creating a file containing random data with dd if=/dev/urandom of=sample.log bs=1G count=1 and then attempted to compress the file using both zip and 7zip; however, there were no compression gains. Why is that?
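
For reference, here is roughly the full sequence I ran (both archives came out essentially the same size as sample.log):

    dd if=/dev/urandom of=sample.log bs=1G count=1   # 1 GB of random data
    zip sample.zip sample.log                        # default zip settings
    7zr a -mx=9 sample.7z sample.log                 # 7zip at maximum compression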

PeanutsMonkey

Posted 2012-08-29T06:39:59.937

Reputation: 7 780

And that is a plain text 1GB log file? – CyberSkull – 2012-08-29T17:50:06.867

@CyberSkull - Yes it is. – PeanutsMonkey – 2012-08-29T19:21:02.120

Can you please tell us what your zip parameters were? I would have done something like zip -9T "example.zip" sample.log (-T is just to test the integrity of the archive.). – CyberSkull – 2012-08-29T19:31:12.087

@CyberSkull - I only ran the standard command, i.e. zip sample.zip sample.log; however, when I ran 7zip I specified maximum compression, i.e. 7zr a -mx=9 sample.7z sample.log. – PeanutsMonkey – 2012-08-29T19:44:42.050

Random data from /dev/urandom does not generate a true text file; it will not compress well at all. Text bytes are limited in range, with many spaces and repeating patterns (e.g. "th" and "sp") and words. You have in fact generated a random binary file. – Ken – 2012-08-29T19:49:30.607

@Ken - I had no idea that it would create a random binary file. How would you create a random true text file? – PeanutsMonkey – 2012-08-29T20:05:49.113

One option is to just cat all your logs into a single file. Another is to download a collection of text files (like from Gutenberg) and try compressing them or joining them into a single large file to experiment on.

– CyberSkull – 2012-08-29T20:21:30.880
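
A rough sketch of the log-concatenation approach above (paths and file names are illustrative; any directory of plain-text logs will do):

    cat /var/log/*.log > sample.log    # join existing text logs into one large file
    zip -9 sample.zip sample.log       # this should compress far better than the /dev/urandom file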

@CyberSkull - So there is no other way to create a true text file using commands such as dd? – PeanutsMonkey – 2012-08-30T01:18:59.077

Open your favorite text editor. Now get your cat or small child and induce them to play with the keyboard for 5 minutes or so. You now have a large random text file! ;) – CyberSkull – 2012-08-30T05:58:18.963

@CyberSkull: No, you have a random stream of ASCII characters. Which is a bit more compressible than random binary, but still nowhere near as structured as text. – Ben Voigt – 2013-05-28T19:24:46.277

Answers

As @kinokijuf said, there is a file header. But to expand upon that, there are a few other things to understand about file compression.

The zip header contains all the necessary info for identifying the file type (the magic number), the zip version, and finally a listing of all the files included in the archive.
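
As a quick illustration (assuming an archive named example.zip exists; the signature bytes are the same for any ordinary zip), the magic number is visible in the first bytes of the file:

    head -c 4 example.zip | xxd    # shows 504b 0304, i.e. the ASCII letters "PK" followed by 03 04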

Your file probably wasn't compressed anyway. If you run unzip -l example.zip you will probably see that the file size is unchanged. A 19-byte file would generate more overhead than DEFLATE (the main compression method used by zip) could ever save, if it could compress it at all.
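
A minimal sketch of that situation (the file name and contents are just an example of a 19-byte text file):

    printf 'this is a test file' > tiny.txt    # exactly 19 bytes
    zip tiny.zip tiny.txt
    unzip -l tiny.zip                          # the listing still reports the original 19 bytes
    ls -l tiny.txt tiny.zip                    # the archive is larger, purely from header overhead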

In other cases, such as PNG images, the data is already compressed, so zip will simply store it. DEFLATE won't gain anything by compressing data that is already compressed.

If on the other hand you had a lot of text files, and their size was more than a few kilobytes each, you would get great savings by putting them all into a single zip archive.

You will get your best savings when compressing very regular, formatted data, like a text file containing a SQL dump. For example, I once had a dump of a small SQL database at around 13 MB. I ran zip -9 dump.zip dump.sql on it and ended up with an archive of around 1 MB.

Another factor is your compression level. Many archivers by default only compress at a middle level, going for speed over size reduction. When compressing with zip, try the -9 flag for maximum compression (I believe the 3.x manual says that compression levels are only supported by DEFLATE at this time).
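
For example (file name illustrative), you can compare the default level against -9 on the same input:

    zip dump-default.zip dump.sql          # zip's default level is -6
    zip -9 dump-max.zip dump.sql           # maximum DEFLATE effort, slower but usually a bit smaller
    ls -l dump-default.zip dump-max.zip    # compare the resulting sizes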

TL;DR

The overhead for the archive exceeded any gains you might have gotten from compressing the file. Try putting larger text files in there and see what you get. Use the -v flag when zipping to see your savings as you go.
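
For instance (file names illustrative):

    zip -9 -v logs.zip *.log    # verbose mode; zip reports how much each file was deflated as it is added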

CyberSkull

Posted 2012-08-29T06:39:59.937

Reputation: 1 386

When you say the file size is unchanged if I were to unzip it, do you mean the size of the archive? Secondly, if I were to use a different compression method other than DEFLATE, such as PPMd, would it make a difference? And when you say the -v flag, do you mean when I execute the zip command? – PeanutsMonkey – 2012-08-29T07:20:39.077

Also, when you say file type, do you mean the type of the source file, e.g. text, MP3, etc.? – PeanutsMonkey – 2012-08-29T07:49:38.850

Because the overhead of .zip headers is way larger than 19 bytes.
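
You can see that fixed overhead directly; a quick sketch (file names illustrative):

    : > empty.txt            # create an empty file
    zip empty.zip empty.txt
    ls -l empty.zip          # already a couple of hundred bytes of pure header and directory data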

kinokijuf

Posted 2012-08-29T06:39:59.937

Reputation: 7 734

How does this affect larger text files? – PeanutsMonkey – 2012-08-29T07:05:21.770

Compression removes redundant information, which appears when the data is highly structured.

From this it should be apparent that already-compressed files cannot compress further, because the redundancy is already gone, but also that random data won't compress well, because it never had any structure or redundancy.
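
A quick way to see both effects from the shell (gzip uses the same DEFLATE algorithm as zip; sizes are approximate):

    head -c 1000000 /dev/urandom | gzip -9 | wc -c                      # random data: output stays at roughly the input size
    yes 'INFO request handled OK' | head -c 1000000 | gzip -9 | wc -c   # repetitive text: output is a tiny fraction of the input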

There is a whole science, information theory, which deals with measuring the density of information (and mutual information) and uses redundancy and structure to perform compression, attacks on encryption, and error detection and recovery.

Ben Voigt

Posted 2012-08-29T06:39:59.937

Reputation: 6 052