At least four separate jobs are often conflated because popular tools integrate them:
- Archiving: combining multiple files (including their metadata) into a single file while preserving as much as possible. In the Linux/Unix world, archiving is traditionally done with the TAR file format.
- Compression: losslessly reducing the size of a stream of binary data. In the Linux/Unix world, this is traditionally done by GZip and BZip2.
- Encryption: scrambling data with keys so that only key holders can read it.
- Checksum: detecting (and possibly correcting) errors in the data.
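To make the separation concrete, here is a minimal Python sketch that treats three of these jobs as independent steps using only the standard library. The helper names (`archive`, `compress`, `checksum`) and the sample file contents are made up for illustration, SHA-256 is just one possible checksum, and encryption is left out because the standard library has no cipher for it.

```python
import gzip
import hashlib
import io
import tarfile


def archive(files: dict) -> bytes:
    """Archiving: pack several named files into one uncompressed TAR stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compress(stream: bytes) -> bytes:
    """Compression: shrink a byte stream without knowing what it contains."""
    return gzip.compress(stream)


def checksum(stream: bytes) -> str:
    """Checksum: a digest that changes if the data changes (detection only)."""
    return hashlib.sha256(stream).hexdigest()


# Hypothetical sample data; each step only sees the output of the previous one.
files = {"a.txt": b"hello\n" * 100, "b.txt": b"world\n" * 100}
tar_bytes = archive(files)          # the ".tar" stage
tar_gz_bytes = compress(tar_bytes)  # the ".tar.gz" stage
digest = checksum(tar_gz_bytes)     # published alongside the download
```

None of the three functions needs to know about the others, which is the point of keeping the jobs apart.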
The ubiquity of .tar.gz and .tar.bz2 reflects the Unix philosophy of small tools that each do a single job well, rather than a single tool that does everything. The TAR file format does not support compression or encryption, but a TAR file can be compressed by any compressor (even as .tar.zip or .tar.7z). The job of GZip and BZip2 is simply to compress one file stream into another; the compression layer does not need to care about preserving metadata, encryption, or checksums. Over time, though, several shortcuts have been added to the tar program so that it works with a compressor more conveniently.
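As a rough illustration of that mix-and-match property, the sketch below wraps the same TAR stream with three different stream compressors and then uses the kind of built-in shortcut described above. It uses Python's standard `tarfile`, `gzip`, `bz2` and `lzma` modules; the `make_tar` helper and the file name `src/main.c` are made up for the example.

```python
import bz2
import gzip
import io
import lzma
import tarfile


def make_tar(files: dict) -> bytes:
    """Build an uncompressed TAR stream in memory (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


source = {"src/main.c": b"int main(void) { return 0; }\n" * 50}
tar_bytes = make_tar(source)

# Two-step route: the archive is just bytes, so any compressor can wrap it.
compressors = {
    ".tar.gz": (gzip.compress, gzip.decompress),
    ".tar.bz2": (bz2.compress, bz2.decompress),
    ".tar.xz": (lzma.compress, lzma.decompress),
}
packed = {ext: comp(tar_bytes) for ext, (comp, _) in compressors.items()}

# Shortcut route: the archiver drives the compressor itself ("w:gz"),
# so the intermediate .tar never appears as a separate step.
shortcut = io.BytesIO()
with tarfile.open(fileobj=shortcut, mode="w:gz") as tar:
    for name, data in source.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
```

Swapping the compressor changes only one entry in the table; the archiving code stays untouched.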
In the zip and 7z file formats, these separate jobs are handled by a single program in a single all-in-one file format.
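For comparison, here is a small sketch of the all-in-one approach using Python's standard `zipfile` module: one container does the archiving, the per-file compression, and the per-file CRC-32 checksum. The file names and contents are made up, and encryption is omitted from the sketch.

```python
import io
import zipfile
import zlib

# Hypothetical sample data.
files = {"a.txt": b"hello\n" * 100, "b.txt": b"world\n" * 100}

buf = io.BytesIO()
# One call site covers archiving and compression (DEFLATE per member).
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    # The format also stores a CRC-32 for every member...
    stored_crcs = {info.filename: info.CRC for info in zf.infolist()}
    sizes = {info.filename: (info.file_size, info.compress_size)
             for info in zf.infolist()}
    # ...and testzip() re-reads each member to verify it.
    # It returns the first bad file name, or None if all members are intact.
    first_bad = zf.testzip()

# The stored value is an ordinary CRC-32 of the uncompressed data.
expected_crcs = {name: zlib.crc32(data) for name, data in files.items()}
```

Because each member is compressed on its own, a zip reader can extract one file without decompressing the rest, which a compressed TAR stream cannot do.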
Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?
Mostly tradition: program source code has long been distributed as .tar.gz or .tar.bz2, because preserving file permissions, modification times, etc. is important for various programming tools (e.g. make).
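A short sketch of what "preserving metadata" means in practice, using Python's standard `tarfile` module: the permission bits and modification time are written into the archive and come back unchanged. The file name `configure`, its contents, and the timestamp are made up for illustration.

```python
import io
import tarfile

script = b"#!/bin/sh\necho configuring\n"

# Describe the member explicitly so the metadata is visible in the example.
info = tarfile.TarInfo(name="configure")
info.size = len(script)
info.mode = 0o755          # executable bit, needed to run ./configure
info.mtime = 1309684531    # hypothetical timestamp; make compares these

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    tar.addfile(info, io.BytesIO(script))

# Read the archive back and inspect what survived the round trip.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.getmember("configure")
    restored = (member.mode, member.mtime)
```

Here `restored` holds the same mode and timestamp that went in, so an extracted script stays executable and make does not see spurious "newer" files.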
The separate archiving and compression steps have worked very well for years. The clear advantage is that you can freely mix and match archivers and compressors, and the disadvantage (a two-step process) is easily circumvented by smarter tools (most modern Linux archiving programs can write .tar.gz or .tar.bz2 directly, hiding the intermediate step).
There is no strong reason to move to other file formats: newer compressors do not compress so much better that breaking with tradition is justified, and tar preserves everything well enough.
http://superuser.com/questions/205223/pros-and-cons-of-bzip-vs-gzip – Sathyajith Bhat – 2011-07-03T09:25:31.050
See also: http://stackoverflow.com/questions/2397474/i-need-to-choose-a-compression-algorithm/2397746#2397746 – Thomas Bonini – 2011-07-03T11:01:36.453
@Sathya, @Andreas: Thanks for the links, those are helpful and answer parts of my question. :) – user541686 – 2011-07-03T14:47:32.913
Compression is a pretty complex field, and no one algorithm can produce optimal results for everything. Furthermore, it's a problem you can throw resources at and get better results, but also one that can be done almost as well in much less time. Some algorithms focus on being fast and light on memory, some focus on producing the smallest possible file regardless of how long it takes or whether you need 12GB RAM (not exaggerating) to do it, and so on. – Phoshi – 2011-07-03T16:15:18.703
@Phoshi, this should be an answer. – Yitzchak – 2011-07-03T16:20:34.417
@Yitz; I think @Ruairi's answer covers the specifics pretty well, and it doesn't really answer the question - just answered why the question could be asked at all. – Phoshi – 2011-07-03T16:37:11.587
Two notes / gotchas on Linux systems: remember that .tar by itself doesn't really have compression, it just sticks all the files into one, which is why you usually see .tar.gz types of files. Also, gzip and gunzip behave differently than zip: zip will leave the original files behind after (de)compressing, whereas gzip will sort of "convert" them. In a folder with only test.txt, "gzip test.txt" results in one file, "test.txt.gz", and "gunzip test.txt.gz" also leaves the folder with just one file, test.txt. – cwd – 2011-07-03T16:49:28.737
I think you may have slightly confused tools and formats. Many tools can read/create archive files in several different formats, and some formats (including ZIP) encompass several different substandards (based primarily on which compression algorithm is used). 7-ZIP is fairly popular because it can at least read several different formats, including most if not all of the ZIP formats. It can also be used to examine a compressed file without (permanently) unzipping it, a very convenient feature when you're trying to figure this all out. Probably ZIP is the most widely-recognized format. – Daniel R Hicks – 2011-07-03T18:21:38.057
@phoshi, you're right. – Yitzchak – 2011-07-03T22:46:50.000