Suppose that I have 10,000 XML files. Now suppose that I want to send them to a friend. Before sending them, I would like to compress them.
Method 1: Don't compress them
Results:
Resulting Size: 62 MB
Percent of initial size: 100%
Method 2: Zip every file individually and send my friend 10,000 zip files
Command:
for x in $(ls -1) ; do echo $x ; zip "$x.zip" $x ; done
Results:
Resulting Size: 13 MB
Percent of initial size: 20%
Method 3: Create a single zip containing 10,000 xml files
Command:
zip all.zip $(ls -1)
Results:
Resulting Size: 12 MB
Percent of initial size: 19%
Method 4: Concatenate the files into a single file & zip it
Command:
cat *.xml > oneFile.txt ; zip oneFile.zip oneFile.txt
Results:
Resulting Size: 2 MB
Percent of initial size: 3%
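For reference (not one of the methods I measured): tar achieves the same single-stream effect as method 4 in one step, because it concatenates the files first and gzip then compresses that one stream. A sketch with a few stand-in files:

```shell
# Demo in a scratch directory with a few sample XML files
# (stand-ins for the 10,000 real ones).
cd "$(mktemp -d)"
for i in 1 2 3; do
  printf '<?xml version="1.0"?>\n<root><item>%s</item></root>\n' "$i" > "$i.xml"
done

# Bundle everything into one stream, then gzip that stream once.
# This is "solid" compression: gzip sees one continuous input, so
# redundancy *between* files is squeezed out, just like method 4.
tar -czf all.tar.gz *.xml
```

Unlike method 4, the receiver gets the original file boundaries and names back with a plain `tar -xzf`.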
Questions:
- Why do I get such dramatically better results when I am just zipping a single file?
- I was expecting method 3 to give drastically better results than method 2, but it doesn't. Why?
- Is this behaviour specific to zip? If I tried using gzip, would I get different results?
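To make the first question concrete, the effect is easy to reproduce with gzip on a handful of similar files (a sketch; file names and counts are arbitrary, not my real data set):

```shell
# Scratch directory with 50 similar "XML" files.
cd "$(mktemp -d)"
for i in $(seq 50); do
  printf '<?xml version="1.0"?>\n<root><value>%s</value></root>\n' "$i" > "$i.xml"
done

# Individually compressed: every gzip stream starts from an empty
# dictionary, so the shared XML boilerplate is paid for 50 times.
for f in *.xml; do gzip -c "$f" > "$f.gz"; done

# Solid: one stream; after the first file, the repeated structure
# compresses down to back-references.
cat ./*.xml | gzip -c > all.gz

echo "individual: $(cat ./*.xml.gz | wc -c) bytes"
echo "solid:      $(wc -c < all.gz) bytes"
```

The total of the individually compressed files comes out several times larger than the single solid stream.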
Additional info:
$ zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
Currently maintained by E. Gordon. Please send bug reports to
the authors using the web page at www.info-zip.org; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
as of above date; see http://www.info-zip.org/ for other sites.
Compiled with gcc 4.4.4 20100525 (Red Hat 4.4.4-5) for Unix (Linux ELF) on Nov 11 2010.
Zip special compilation options:
USE_EF_UT_TIME (store Universal Time)
SYMLINK_SUPPORT (symbolic links supported)
LARGE_FILE_SUPPORT (can read and write large files on file system)
ZIP64_SUPPORT (use Zip64 to store large files in archives)
UNICODE_SUPPORT (store and read UTF-8 Unicode paths)
STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
UIDGID_NOT_16BIT (old Unix 16-bit UID/GID extra field not used)
[encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)
Edit: Metadata
One answer suggests that the difference is the system metadata stored in the zip. I don't think that this can be the case. To test, I did the following:
for x in $(seq 10000) ; do touch $x ; done
zip allZip $(ls -1)
The resulting zip is 1.4MB. This means that there is still ~10 MB of unexplained space.
Comments:
I'm not familiar with the internals of how the zip program works, but my initial guess would be that methods 2 & 3 are essentially doing the same thing, except that zip combines the individual zipped files into a single archive at the end, which would explain why 3 & 4 are so different as well. – heavyd – 2015-12-14T17:38:52.033
If I'm not mistaken, it's this phenomenon that causes people to make .tar.gz as opposed to just zipping the whole directory. – corsiKa – 2015-12-14T20:21:16.737
@corsiKlauseHoHoHo - I bet you are right. Then you are just zipping a single file, which probably has the same effect. Very interesting. – sixtyfootersdude – 2015-12-14T20:24:29.070
A similar question was already asked; tl;dr: use solid 7zip archives. – Dmitry Grigoryev – 2015-12-14T20:37:59.790
@sixtyfootersdude As a test to validate some of the answers, can you try zipping the zip produced in method 3? I suspect this will reduce the file size to something comparable to method 4. – Travis – 2015-12-14T21:52:00.823
Instead of $(ls -1), just use *: for x in * ; zip all.zip * – muru – 2015-12-15T01:21:56.327
@Travis: the compressed representation of two fairly similar XML files might not be very similar to each other, especially if the difference was near the beginning. If you're lucky, you might get down to a size similar to method 4, but it could easily be a lot worse. – Peter Cordes – 2015-12-15T04:33:31.123
An interesting extra test would be to zip the file from method 3 again, i.e. using two zips inside each other. – jpa – 2015-12-15T06:23:59.793
You're using Linux, why not use a tar.[any] format and take advantage of a "solid" archive? tar.xz can use the same format as .7z - or just use .7z, if the reason is "my friend doesn't use Linux or have any good archive programs installed" and you don't want to have to decompress the whole archive just to list the files. PS. Wikipedia mentions for zip "Each file is stored separately... it is possible to extract them, or add new ones, without applying compression or decompression to the entire archive. This contrasts with...compressed tar files [where] random-access is not easily possible." – Xen2050 – 2015-12-15T06:36:32.003
The answer is "solid compression". – JimmyB – 2015-12-15T17:48:30.393
If you want to do solid compression with ZIP, here's a workaround: first, create an uncompressed ZIP containing all your files. Then, put that ZIP inside another compressed ZIP. – user253751 – 2015-12-15T23:17:18.297
zip is very old and far worse than rar or 7z – phuclv – 2015-12-17T03:10:29.510
Out of curiosity, using method 4, how did you plan for your friend to undo the cat step and end up with 10k files again? – kmort – 2015-12-17T16:41:20.433
@kmort: As the source files are XML documents, it is at least still unambiguously possible to extract the single XML documents, thanks to the fact that no well-formed XML document can contain more than one root element. Given the right tools, that is; in particular, an XML reader that does not refuse to read several consecutive XML documents from the same stream. – O. R. Mapper – 2015-12-17T19:13:52.143
@O.R.Mapper You're right, it is certainly possible, I was just wondering if there was an easy canonical way to "uncat" something via a shell command or two. Plus how to end up with the right file names, etc. :-) – kmort – 2015-12-17T19:38:41.040
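One possible "uncat" for this specific case, assuming every document begins with an XML declaration at the start of a line (original file names are lost either way): GNU csplit can cut the stream at each declaration. A sketch:

```shell
# Three concatenated documents standing in for oneFile.txt.
cd "$(mktemp -d)"
for i in 1 2 3; do
  printf '<?xml version="1.0"?>\n<doc n="%s"/>\n' "$i"
done > oneFile.txt

# Split before every line starting with an XML declaration.
# -z drops the empty piece before the first match; '{*}' repeats the
# pattern as often as it matches. Pieces come out as xx00, xx01, ...
csplit -s -z oneFile.txt '/^<?xml/' '{*}'

ls xx*
```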
@corsiKlauseHoHoHo: No, it's not because of smartness; it's because of dumbness -- gzip can't actually zip a directory. – user541686 – 2015-12-18T08:47:29.357
Regarding your question about whether gzip would give different results: gzip has no notion of method 3. – jamesdlin – 2015-12-20T02:28:02.057