Suppose that I have 10,000 XML files. Now suppose that I want to send them to a friend. Before sending them, I would like to compress them.
Method 1: Don't compress them
Results:
Resulting Size: 62 MB
Percent of initial size: 100%
Method 2: Zip every file individually and send my friend 10,000 zip files
Command:
for x in $(ls -1) ; do echo $x ; zip "$x.zip" $x ; done
Results:
Resulting Size: 13 MB
Percent of initial size: 20%
Method 3: Create a single zip containing 10,000 xml files
Command:
zip all.zip $(ls -1)
Results:
Resulting Size: 12 MB
Percent of initial size: 19%
Method 4: Concatenate the files into a single file & zip it
Command:
cat *.xml > oneFile.txt ; zip oneFile.zip oneFile.txt
Results:
Resulting Size: 2 MB
Percent of initial size: 3%
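For reference (not one of the methods I measured): tar achieves the same single-stream effect as method 4 in one step, because it concatenates the files first and gzip then compresses that one stream. A sketch with a few stand-in files:

```shell
# Demo in a scratch directory with a few sample XML files
# (stand-ins for the 10,000 real ones).
cd "$(mktemp -d)"
for i in 1 2 3; do
  printf '<?xml version="1.0"?>\n<root><item>%s</item></root>\n' "$i" > "$i.xml"
done

# Bundle everything into one stream, then gzip that stream once.
# This is "solid" compression: gzip sees one continuous input, so
# redundancy *between* files is squeezed out, just like method 4.
tar -czf all.tar.gz *.xml
```

Unlike method 4, the receiver gets the original file boundaries and names back with a plain `tar -xzf`.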
Questions:
- Why do I get such dramatically better results when I am just zipping a single file?
- I was expecting method 3 to give drastically better results than method 2, but it doesn't. Why?
- Is this behaviour specific to zip? If I tried using gzip, would I get different results?
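To make the first question concrete, the effect is easy to reproduce with gzip on a handful of similar files (a sketch; file names and counts are arbitrary, not my real data set):

```shell
# Scratch directory with 50 similar "XML" files.
cd "$(mktemp -d)"
for i in $(seq 50); do
  printf '<?xml version="1.0"?>\n<root><value>%s</value></root>\n' "$i" > "$i.xml"
done

# Individually compressed: every gzip stream starts from an empty
# dictionary, so the shared XML boilerplate is paid for 50 times.
for f in *.xml; do gzip -c "$f" > "$f.gz"; done

# Solid: one stream; after the first file, the repeated structure
# compresses down to back-references.
cat ./*.xml | gzip -c > all.gz

echo "individual: $(cat ./*.xml.gz | wc -c) bytes"
echo "solid:      $(wc -c < all.gz) bytes"
```

The total of the individually compressed files comes out several times larger than the single solid stream.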
Additional info:
$ zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
Currently maintained by E. Gordon. Please send bug reports to
the authors using the web page at www.info-zip.org; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
as of above date; see http://www.info-zip.org/ for other sites.
Compiled with gcc 4.4.4 20100525 (Red Hat 4.4.4-5) for Unix (Linux ELF) on Nov 11 2010.
Zip special compilation options:
USE_EF_UT_TIME (store Universal Time)
SYMLINK_SUPPORT (symbolic links supported)
LARGE_FILE_SUPPORT (can read and write large files on file system)
ZIP64_SUPPORT (use Zip64 to store large files in archives)
UNICODE_SUPPORT (store and read UTF-8 Unicode paths)
STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
UIDGID_NOT_16BIT (old Unix 16-bit UID/GID extra field not used)
[encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)
Edit: Metadata
One answer suggests that the difference is the system metadata stored in the zip. I don't think that this can be the case. To test, I did the following:
for x in $(seq 10000) ; do touch $x ; done
zip allZip $(ls -1)
The resulting zip is 1.4MB. This means that there is still ~10 MB of unexplained space.
Comments:
I'm not familiar with the internals of how the zip program works, but my initial guess would be that methods 2 & 3 are essentially doing the same thing, except that zip combines the individual zipped files into a single archive at the end, which would explain why 3 & 4 are so different as well. – heavyd – 2015-12-14T17:38:52.033
If I'm not mistaken, it's this phenomenon that causes people to make .tar.gz as opposed to just zipping the whole directory. – corsiKa – 2015-12-14T20:21:16.737
@corsiKlauseHoHoHo - I bet you are right. Then you are just zipping a single file, which probably has the same effect. Very interesting. – sixtyfootersdude – 2015-12-14T20:24:29.070
A similar question was already asked; tl;dr: use solid 7zip archives. – Dmitry Grigoryev – 2015-12-14T20:37:59.790
@sixtyfootersdude As a test to validate some of the answers, can you try zipping the zip produced in method 3? I suspect this will reduce the file size to something comparable to method 4. – Travis – 2015-12-14T21:52:00.823
Instead of $(ls -1), just use *: for x in * ; zip all.zip * – muru – 2015-12-15T01:21:56.327
@Travis: the compressed representation of two fairly similar XML files might not be very similar to each other, especially if the difference was near the beginning. If you're lucky, you might get down to a size similar to method 4, but it could easily be a lot worse. – Peter Cordes – 2015-12-15T04:33:31.123
An interesting extra test would be to zip the file from method 3 again, i.e. using two zips inside each other. – jpa – 2015-12-15T06:23:59.793
You're using Linux, why not use a tar.[any] format and take advantage of a "solid" archive? tar.xz can use the same format as .7z - or just use .7z, if the reason is "my friend doesn't use Linux or have any good archive programs installed" and you don't want to have to decompress the whole archive just to list the files. PS. Wikipedia mentions for zip "Each file is stored separately... it is possible to extract them, or add new ones, without applying compression or decompression to the entire archive. This contrasts with...compressed tar files [where] random-access is not easily possible." – Xen2050 – 2015-12-15T06:36:32.003
The answer is "solid compression". – JimmyB – 2015-12-15T17:48:30.393
If you want to do solid compression with ZIP, here's a workaround: first, create an uncompressed ZIP containing all your files. Then, put that ZIP inside another compressed ZIP. – user253751 – 2015-12-15T23:17:18.297
zip is very old and far worse than rar or 7z – phuclv – 2015-12-17T03:10:29.510
Out of curiosity, using method 4, how did you plan for your friend to undo the cat step and end up with 10k files again? – kmort – 2015-12-17T16:41:20.433
@kmort: As the source files are XML documents, it is at least still unambiguously possible to extract the single XML documents, thanks to the fact that no well-formed XML document can contain more than one root element. Given the right tools, that is; in particular, an XML reader that does not refuse to read several consecutive XML documents from the same stream. – O. R. Mapper – 2015-12-17T19:13:52.143
@O.R.Mapper You're right, it is certainly possible, I was just wondering if there was an easy canonical way to "uncat" something via a shell command or two. Plus how to end up with the right file names, etc. :-) – kmort – 2015-12-17T19:38:41.040
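One possible "uncat" for this specific case, assuming every document begins with an XML declaration at the start of a line (original file names are lost either way): GNU csplit can cut the stream at each declaration. A sketch:

```shell
# Three concatenated documents standing in for oneFile.txt.
cd "$(mktemp -d)"
for i in 1 2 3; do
  printf '<?xml version="1.0"?>\n<doc n="%s"/>\n' "$i"
done > oneFile.txt

# Split before every line starting with an XML declaration.
# -z drops the empty piece before the first match; '{*}' repeats the
# pattern as often as it matches. Pieces come out as xx00, xx01, ...
csplit -s -z oneFile.txt '/^<?xml/' '{*}'

ls xx*
```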
@corsiKlauseHoHoHo: No, it's not because of smartness; it's because of dumbness -- gzip can't actually zip a directory. – user541686 – 2015-12-18T08:47:29.357
Regarding your question about whether gzip would give different results: gzip has no notion of method 3. – jamesdlin – 2015-12-20T02:28:02.057