How can I evaluate the best choice of archive format for compressing files?

24

1

In general, I've observed the following:

  • Linux-y files or tools use bzip2 or gzip for distributing archives
  • Windows-y files or tools use ZIP for distributing archives
  • Many people use 7-Zip for creating and distributing their own archives

Questions:

  • What are the advantages and disadvantages of these formats, all of which appear to be open formats? When/why should I choose one (say, 7-Zip) over another (say, ZIP)?
  • Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?

user541686

Posted 2011-07-03T09:06:59.207

Reputation: 21 330

http://superuser.com/questions/205223/pros-and-cons-of-bzip-vs-gzip – Sathyajith Bhat – 2011-07-03T09:25:31.050

@Sathya, @Andreas: Thanks for the links, those are helpful and answer parts of my question. :) – user541686 – 2011-07-03T14:47:32.913

Compression is a pretty complex field, and no single algorithm can produce optimal results for everything. Furthermore, it's a problem you can throw resources at to get better results, but also one that can be done almost as well in much less time. Some algorithms focus on being fast and light on memory, some focus on producing the smallest possible file regardless of how long it takes or whether you need 12 GB of RAM (not exaggerating) to do it, and so on. – Phoshi – 2011-07-03T16:15:18.703

@Phoshi, this should be an answer. – Yitzchak – 2011-07-03T16:20:34.417

@Yitz; I think @Ruairi's answer covers the specifics pretty well, and it doesn't really answer the question - just answered why the question could be asked at all. – Phoshi – 2011-07-03T16:37:11.587

Two notes / gotchas on Linux systems: remember that .tar by itself doesn't compress anything, it just sticks all the files into one, which is why you usually see .tar.gz files. Also, gzip and gunzip behave differently from zip: zip leaves the original files behind after (de)compressing, whereas gzip sort of "converts" them. In a folder with only test.txt, "gzip test.txt" results in one file, "test.txt.gz", and "gunzip test.txt.gz" also leaves the folder with just one file, test.txt. – cwd – 2011-07-03T16:49:28.737

I think you may have slightly confused tools and formats. Many tools can read/create archive files in several different formats, and some formats (including ZIP) encompass several different substandards (based primarily on which compression algorithm is used). 7-ZIP is fairly popular because it can at least read several different formats, including most if not all of the ZIP formats. It can also be used to examine a compressed file without (permanently) unzipping it, a very convenient feature when you're trying to figure this all out. Probably ZIP is the most widely-recognized format. – Daniel R Hicks – 2011-07-03T18:21:38.057

@phoshi, you're right. – Yitzchak – 2011-07-03T22:46:50.000

Answers

16

There is a large variety of compression formats and methods available. Some don't compress at all and are designed only to store a number of files in one archive, while other, newer experimental compressors (PAQ-based) are designed to compress as aggressively as possible, regardless of the time it takes.

You need to evaluate the features you require from your compression method choice, and also consider the context in which it will be used.

Different features and considerations include:

  • Compression ability - Does it shrink the file significantly enough?
  • Ease-of-use - If the file is going to another user, will the archive be easy to extract or will it require more software to be installed?
  • Password protection and/or encryption - Are these security measures required?
  • Multiple volumes support - If the target medium requires the file to be split into appropriately sized chunks (for example, 650 MB for a CD), does the format support this elegantly?
  • Repairing and recovery - If the file becomes partially corrupt, does it offer a recovery record to aid restoration of data?
  • Unicode support - Does the archiver support international file names or just standard ASCII?
  • System Requirements - Modern compressors such as 7-Zip do offer the ability to increase compression efficiency by using a larger dictionary (a dictionary is a reference of commonly repeated data in a compressed file), but this in turn increases memory consumption at both compression and decompression time.
  • Self-extraction support - Can the archive be rolled into an executable file that is easy to use for whoever receives it? (Also bear in mind that a self-extractor targets a single platform. Generally speaking, a Windows self-extractor will not work on Linux unless it is run through a compatibility layer like Wine.)
  • File system attributes - Does the compressor store relevant file system metadata and permissions that may be worth preserving at point of extraction?
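
To make the dictionary trade-off from the list above concrete, here is a minimal sketch using Python's standard `lzma` module (the same LZMA family that 7-Zip uses). The helper name `xz_size` and the test data are assumptions made for illustration. The data is 1 MiB of random bytes repeated twice, so the only redundancy lies 1 MiB back in the stream, and only a dictionary larger than that can exploit it.

```python
import lzma
import os


def xz_size(data: bytes, dict_size: int) -> int:
    """Compress with LZMA2 using an explicit dictionary size; return output length."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# 1 MiB of incompressible bytes, repeated once: the repeat is 1 MiB away.
block = os.urandom(1 << 20)
data = block * 2

small = xz_size(data, 64 * 1024)        # 64 KiB dictionary cannot reach back 1 MiB
large = xz_size(data, 4 * 1024 * 1024)  # 4 MiB dictionary can see the first copy

print(f"input:             {len(data)} bytes")
print(f"64 KiB dictionary: {small} bytes")
print(f"4 MiB dictionary:  {large} bytes")
```

The larger dictionary should yield a noticeably smaller file here, at the price of more memory for both the compressor and the decompressor. Real data rarely behaves this cleanly, so treat the numbers as an illustration rather than a benchmark.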

Generally speaking, ZIP is the most ubiquitous format, but it has weaknesses. Files and archives over 4 GB need the ZIP64 extension, which not every tool supports. Security is generally regarded as poor: the standard password protection can be broken with a known-plaintext attack, and stronger encryption is generally implemented as an unofficial derivative of the format by commercial ZIP software vendors.

Apart from that, most other popular formats will have some form of support on all operating systems by installing more software.

My personal choice is 7-Zip, as it has great and flexible compression, despite its peculiar user interface on Windows. There are decompressors for Linux and Mac OS X (although not GUI-based as standard).

Ruairi Fullam

Posted 2011-07-03T09:06:59.207

Reputation: 2 284

Zip is the most future proof solution and is advised by the UK's National Archive because it is non-solid and very stable compared to gzip, tar or 7-zip.

– gaborous – 2016-12-27T22:43:06.347

If the archive is meant for distribution, it's also important to consider your target audience and use a format that's supported by default on their platform. Accessibility may be more important than the other considerations in this case. – hammar – 2011-07-03T14:14:02.553

+1 thanks for the information, though it would've been even better to mention which formats support those bullet points. :) – user541686 – 2011-07-04T03:02:50.607

I was tempted but there are a multitude of formats available, which would take a long time to list. Wikipedia does have a good feature matrix of compression formats which may help: http://en.wikipedia.org/wiki/Comparison_of_archive_formats

– Ruairi Fullam – 2011-07-04T07:23:20.603

History teaches an important lesson when it comes to self-extracting archive files. There are self-extracting archives from two decades ago that people can no longer self-extract because their machines cannot run MS/PC-DOS programs, or because the self-extractor programs crash as the result of processor changes, or because the self-extractors complain that discs are full when they aren't since they don't expect discs to be so large, or … – JdeBP – 2011-07-04T11:44:27.100

That point is certainly debatable. I've not encountered that particular problem, but I can see it occurring; I suppose it all comes down to the end goal of creating the archive and how long the files are expected to remain in use. Certainly, if you have an old archive from the DOS era that's difficult to extract, you could use DOSBox, or even create a VM if needed. – Ruairi Fullam – 2011-07-04T12:02:54.970

8

One thing that comes to mind is a (two-year-old) blog post from Jeff Atwood: File Compression in the Multi-Core Era. In that article he finds that bzip2 outperforms 7-Zip when running on more than two cores.

matpe

Posted 2011-07-03T09:06:59.207

Reputation: 81

+1 omg! I didn't know that. The compression ratio seems to not be worth it, though. :) – user541686 – 2011-07-04T03:03:30.060

That post is more than 2 years old. Doesn't 7-zip work better with more than two cores now? – cregox – 2011-07-04T06:31:37.707

bzip2 scales well over multiple cores because it compresses data in independent blocks of 100-900 KB, which can be spread over separate cores. The cost is some compression efficiency, since each block is compressed without reference to the others. – Ruairi Fullam – 2011-07-04T07:21:18.873
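
The block-independence idea in the comment above can be sketched with Python's standard `bz2` module. This is an illustration of the principle, not how any particular parallel bzip2 tool is implemented: the function name `parallel_bz2`, the chunk size, and the sample data are assumptions. Each chunk is compressed as its own bz2 stream and the streams are concatenated, which a standard decompressor can still read.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the bz2 streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Assumption: CPython's bz2 releases the GIL while compressing,
        # so the worker threads can run on separate cores.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


# Hypothetical sample data: a few megabytes of log-like text.
data = b"".join(b"log line %d: service started\n" % i for i in range(200_000))
packed = parallel_bz2(data)

# bz2.decompress accepts several concatenated streams (Python 3.3+).
restored = bz2.decompress(packed)
single = bz2.compress(data)

print(len(data), "bytes in;", len(packed), "chunked;", len(single), "single stream")
```

Comparing the chunked size with the single-stream size on your own data shows how much ratio is given up in exchange for being able to spread the work over cores.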

4

To your first question, 7-Zip is an archiver that can use many algorithms to compress and decompress data.

To your second question, just make sure that the platform supports tools that support the given format. For example, I would avoid using RAR on a Mac. While it is possible to use, and there are free utilities that support it, they lack the much richer interface that Windows utilities that support RAR have (in my experience).

soandos

Posted 2011-07-03T09:06:59.207

Reputation: 22 744

Whereas I personally hate the graphical rar programs and always use the command line, even on Windows. – CarlF – 2011-07-03T18:54:29.777

4

As others have mentioned, the choice of a particular compression format is heavily dependent on the use and the intended audience.

  • .tar.gz and .tar.bz2 archives are ideal for use on Linux systems (and by extension for sharing files with Linux users) because the tar, gzip and bzip2 tools are largely ubiquitous on the platform, and because the .tar format has full support for Unix permissions and other platform-specific properties. The choice between gzip and bzip2 to compress the tar archive is mainly a decision about speed versus compression ratio, with bzip2 delivering smaller files but at a much slower compression speed. The disadvantages of these formats include less compatibility with Windows and the (potential) need to decompress the entire archive to extract a single file.

  • ZIP archives can be extracted on most platforms using native tools, so ZIP is an ideal choice for sending an archive to a non-technical user who would be uncomfortable installing third-party archive software such as 7-Zip. The compression ratio isn't as good as that of more advanced algorithms, and Unix permissions are not reliably preserved by every ZIP tool, but it is an excellent format if you want to send an archive of holiday photos to your grandmother, for example. ZIP also provides some basic password protection, and can quickly extract a file from anywhere in the archive.

  • 7-Zip is good if you want the best possible compression ratios. It does not store Unix file ownership and, as with ZIP, it should not be relied on to preserve permissions. It is also not installed by default on most platforms, which makes it slightly more work to use, but it may be worth it on Windows if the compression ratio gains are important. In an all-Linux environment it would be better to use the 'xz' or 'lzma' compression tools along with tar; they operate in exactly the same way as 'gzip' and 'bzip2' but use the more advanced LZMA algorithm, like 7-Zip.
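
Since the choice between gzip, bzip2 and xz comes down to speed versus ratio, the most reliable way to decide is to measure on your own data. The following is a minimal sketch using Python's standard library at default compression levels; the function name `benchmark` and the sample data are made up for illustration, and the results will differ with other inputs and settings.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data: bytes) -> dict:
    """Return {name: (compressed_size, seconds)} for three stdlib compressors."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "xz": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - start)
    return results


# Hypothetical sample: repetitive text, similar to a log file.
sample = b"".join(
    b"2011-07-03 09:06:%02d INFO request %d served\n" % (i % 60, i)
    for i in range(100_000)
)

results = benchmark(sample)
for name, (size, seconds) in results.items():
    ratio = 100 * size / len(sample)
    print(f"{name:6} {size:>9} bytes  {ratio:5.1f}%  {seconds:.3f}s")
```

Replace `sample` with the bytes of a representative file to get numbers that matter for your case.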

user89061

Posted 2011-07-03T09:06:59.207

Reputation:

2

Just as an example, I use the mentioned formats in these cases:

  • Text files (logs especially): bz2
  • Collection of files to be distributed (e.g. source code): gz (tar.gz really).
  • Assorted files: 7-Zip. It can compress almost anything very efficiently. Cross-platform, open-source, stable, lightweight, file (header and data) encryption,... Can you ask for anything else? :)

I avoid RAR altogether, and whenever I receive a RAR file from someone I know, I tell him/her to stop using that format since it is proprietary, and that he/she is probably using unlicensed software (most people download WinRAR's trial version and keep using it forever).

PS: I run Ubuntu (primarily) and Windows (both dual boot and VirtualBox).

glarrain

Posted 2011-07-03T09:06:59.207

Reputation: 206

1

There are at least four separate jobs that are often confused with one another because popular tools integrate them:

  1. Archiving: the ability to combine multiple files (including metadata) into a single file, preserving as much as possible. In the Linux/Unix world, archiving is traditionally done in the TAR file format.
  2. Compression: the ability to losslessly minimize the size of a stream of binary data. In the Linux/Unix world, this is traditionally done by GZip and BZip2.
  3. Encryption: the ability to scramble data with keys.
  4. Checksum: the ability to detect (and possibly correct) errors.

The ubiquity of .tar.gz and .tar.bz2 reflects the Unix philosophy of small tools that each do a single job well, rather than a single tool that does everything. The TAR file format does not support compression or encryption, but a tar file can be compressed by any compressor (even as .tar.zip or .tar.7z). The job of GZip and BZip2 is simply to compress one byte stream into another; the compression layer does not need to care about preserving metadata, encryption, or checksums. Over time, though, several shortcuts have been added to the tar program so that it works with a compressor more conveniently.
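
The layering can be sketched with Python's standard `tarfile` and `gzip` modules. The helper `make_tar` and the file names are hypothetical; the point is only that archiving and compression are separate steps, and that a convenience mode such as `"r:gz"` merely combines them.

```python
import gzip
import io
import tarfile


def make_tar(files: dict) -> bytes:
    """Step 1, archiving: pack {name: content} into an uncompressed tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


# Hypothetical file names and contents.
files = {
    "README": b"hello\n" * 1000,
    "src/main.c": b"int main(void) { return 0; }\n",
}

tar_bytes = make_tar(files)           # archive only, no compression
tgz_bytes = gzip.compress(tar_bytes)  # step 2, compression of the byte stream

# gzip restores exactly the tar stream; it knows nothing about the files inside.
roundtrip = gzip.decompress(tgz_bytes)

# The "r:gz" mode is the shortcut that undoes both steps in one call.
with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
    names = tar.getnames()

print(len(tar_bytes), "->", len(tgz_bytes), "bytes;", names)
```

Swapping `gzip.compress` for `bz2.compress` or `lzma.compress` changes only the second step, which is the mix-and-match property described above.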

In the ZIP and 7z file formats, these separate jobs are done by a single program in a single super file format.

Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?

Because that is how it has always been done. Program source code is traditionally distributed as .tar.gz or .tar.bz2 because preserving file permissions, modification times, etc. is important for various programming tools (e.g. make).
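
As a small illustration of that metadata surviving a round trip, the sketch below writes one member into a .tar.xz held in memory and reads its mode and modification time back, using Python's standard `tarfile` module. The file name `build.sh`, its contents, and the timestamp are arbitrary values chosen for the example.

```python
import io
import tarfile

script = b"#!/bin/sh\necho built\n"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    info = tarfile.TarInfo(name="build.sh")  # hypothetical file name
    info.size = len(script)
    info.mode = 0o755          # executable bit set
    info.mtime = 1309684019    # arbitrary fixed modification time (epoch seconds)
    tar.addfile(info, io.BytesIO(script))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    member = tar.getmember("build.sh")
    restored = (member.mode, member.mtime)

print(oct(restored[0]), restored[1])
```

Tools like make compare modification times to decide what to rebuild, which is why an archive format that drops or rewrites them can trigger unnecessary rebuilds.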

The separate archival and compression steps have worked very well for years. The clear advantage is the freedom to mix and match archivers and compressors, and the disadvantage (a two-step process) is easily circumvented by smarter tools: most modern Linux archive programs will write .tar.gz or .tar.bz2 directly, hiding the intermediate step.

There is no strong reason to move to other file formats: newer compressors do not compress so much better that breaking with tradition is justified, and tar preserves everything well enough.

Lie Ryan

Posted 2011-07-03T09:06:59.207

Reputation: 4 101