8

Zip, RAR, 7z, Gzip, BZip2, Tar, etc. I'm hearing 7z is the flavor of the month; why? Is it best for all situations, or are there better choices for specific situations?

Or maybe the actual file archiver, i.e. WinZip, WinRAR, 7-Zip, etc. (as opposed to the format), has a bigger effect?

In your answer, could you describe what sort of speed/compression tradeoff the format you mention makes?

Please provide links to any empirical tests that back up your answer.

Background: I need to back up a custom search index that creates about 3000 relatively small files (less than 10 MB), each containing a lot of repetitive data.

(As usual Wikipedia has a relevant article but the section on performance comparison is brief.)

Thanks

Ash

7 Answers

14

Compress, gzip, bzip and bzip2 are not for archiving multiple files; they only compress a single file. For archiving they are usually combined with TAR. The problem with TAR is that it has no index table, so it's only good if you plan to restore the whole thing. If you expect you'll ever need to restore only a limited number of selected files, forget about TAR: to get the last file from a tar.gz or tar.bz2 archive, you have to decompress and process all of it. With zip, rar or 7-Zip, the tool goes to the index table, skips to the relevant position in the archive and only processes the relevant files.
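
A minimal sketch of that difference with Python's standard zipfile and tarfile modules (the archive and member names below are placeholders):

    import tarfile
    import zipfile

    # Zip keeps a central directory (index) at the end of the archive, so a
    # single member can be pulled out without touching anything else.
    with zipfile.ZipFile("backup.zip") as zf:          # placeholder archive name
        data = zf.read("segment_2999.dat")             # only this member is decompressed

    # tar.gz has no index: finding one member means decompressing and walking
    # the stream from the start until it turns up.
    with tarfile.open("backup.tar.gz", "r:gz") as tf:
        data = tf.extractfile("segment_2999.dat").read()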

OK, TAR's out, so that leaves you with ZIP, RAR and 7-Zip. Of these three, ZIP is the most widespread: almost anything supports it, many applications have built-in support, and it's fast. On the other hand, 7-Zip is also portable, its library is LGPL, and its compression rates are much better than the other two, at the cost of being more CPU-hungry. RAR is the real loser here: neither great compression, nor really portable, nor fast.

EDIT: it seems that the best option would be 7-Zip, but with the bzip2 compression method. That way you avoid the disadvantages of TAR, but can still take advantage of bzip2's multi-core support. See this article.
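
This isn't the 7-Zip container itself, but the same idea (an indexed archive whose members are bzip2-compressed) can be sketched with Python's standard zipfile module, which accepts bzip2 as a compression method. The file names are placeholders, and note that not every zip tool can read bzip2-compressed entries:

    import zipfile

    # Indexed container, bzip2 method: single files can later be restored via
    # the central directory without decompressing the whole archive.
    with zipfile.ZipFile("index_backup.zip", "w", compression=zipfile.ZIP_BZIP2) as zf:
        for name in ("segment_0001.dat", "segment_0002.dat"):  # placeholder file names
            zf.write(name)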

vartec
  • Excellent information, thanks. I'd seen TAR in my playing with Linux but had never looked at it closely. – Ash May 07 '09 at 00:45
  • I would have to disagree about rar. The compression is good (compared to gzip at least) and the speed seems fine in my use cases. One thing I like about rar is that it can handle streaming content or individual files, and allows you to auto-include a timestamp in the filename... – Dscoduc Feb 24 '10 at 18:49
10

Recommended reading:

File Compression in the Multi-Core Era (Jeff Atwood a.k.a. CodingHorror, February 2009)

I've been playing around a bit with file compression again, as we generate some very large backup files daily on Stack Overflow.

We're using the latest 64-bit version of 7zip (4.64) on our database server. I'm not a big fan of more than dual core on the desktop, but it's a no brainer for servers. The more CPU cores the merrier! This server has two quad-core CPUs, a total of 8 cores, and I was a little disheartened to discover that neither RAR nor 7zip seemed to make much use of more than 2.

Still, even if it does only use 2 cores to compress, the 7zip algorithm is amazingly effective, and has evolved over the last few years to be respectably fast. I used to recommend RAR over Zip, but given the increased efficiency of 7zip and the fact that it's free and RAR isn't, it's the logical choice now.

And regarding algorithms:

Why is bzip2 able to work so much faster than 7zip? [...] Bzip2 uses more than 2 CPU cores to parallelize its work.

splattne
  • Thanks for the link. We certainly know Jeff's view is based on real life experience! – Ash May 06 '09 at 11:25
4

It isn't all about efficiency and speed. Sure, they are important, and you can look at the benchmarks for those and choose wisely from the options (though I'd recommend some simple benchmarking of your own, with your own data, on your own server). But archiving inevitably leads at some point to accessing your data again (otherwise why not just delete it?). Or maybe years down the road it won't be you accessing the data at all, but some third party. Pick something that will still be around when you need to access the data and that people recognize. I personally use 7zip, but when I archive files others might need I use zip. They know it, and lots of tools can handle it. It may not be quite as fast or quite as small, but it helps with the human factor.
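
In that spirit, a rough ratio-versus-time check on your own data is easy to sketch with Python's standard zlib, bz2 and lzma modules (default compression levels; the directory name is a placeholder):

    import bz2
    import lzma
    import time
    import zlib
    from pathlib import Path

    # Concatenate (a representative subset of) the files you plan to archive.
    data = b"".join(p.read_bytes() for p in Path("search_index").iterdir() if p.is_file())

    for name, compress in [("zlib (deflate, as in zip)", zlib.compress),
                           ("bzip2", bz2.compress),
                           ("lzma (as in 7z)", lzma.compress)]:
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name:26} {len(data) / len(packed):5.1f}:1 in {elapsed:5.1f}s")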

Joshua Hunter
  • Good point. This search index will be extracting data directly from certain compressed files. That's why I'm interested in whether any formats offer configurable compress/decompress performance. So I am looking more at the compression aspect than at the archiving-for-the-future aspect. – Ash May 07 '09 at 00:59
3

lzma seems to perform very well in both compression ratio and speed.

In the following benchmarks (http://tukaani.org/lzma/benchmarks), the fastest setting for lzma gave compression times considerably faster than the fastest bzip2 option, while still compressing better than the slowest bzip2 option:

    ratio   bzip2   lzmash
    fastest 35.8%   31.7%       
    slowest 34.0%   25.4%

    time    bzip2   lzmash  
    fastest 1m 26s  0m 58s  
    slowest 2m 37s  12m 20s

    *Compressing a full installation of OpenOffice.org 1.1.4 for Linux (203 MB)

It performs especially well with binary data, but I think I read some benchmarks of plain text where bzip2 outperformed it.

The lzma man page is worth reading:

   lzma  provides  notably  better compression ratio than bzip2 especially
   with files having other than plain text content. The other advantage of
   lzma  is fast decompression which is many times quicker than bzip2. The
   major disadvantage is that achieving  the  highest  compression  ratios
   requires  extensive  amount of system resources, both CPU time and RAM.
   Also software to handle LZMA  compressed  files  is  not  installed  by
   default on most distributions.
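
That fastest-versus-slowest knob is exposed as a preset level; a minimal sketch with Python's standard lzma module (the input file name is a placeholder):

    import lzma

    sample = open("segment_0001.dat", "rb").read()  # placeholder: one of your index files

    # preset=0 is the fast/low-ratio end, preset=9 the slow/high-ratio end,
    # mirroring the "fastest" and "slowest" rows in the table above.
    for preset in (0, 6, 9):
        packed = lzma.compress(sample, preset=preset)
        print(f"preset {preset}: {len(sample):,} -> {len(packed):,} bytes")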
Guy C
  • Me neither until last week when someone recommended it on one of my Server Fault questions. It looks impressive, just worth investigating the performance with plain text, if you'll be using it for that. – Guy C May 06 '09 at 11:41
  • LZMA is the name of the algorithm used in 7-Zip. – vartec Jan 19 '11 at 15:47
2

Take a look at this Wikipedia entry. Towards the bottom there is a "Comparison of efficiency" section. It will give you approximate compression percentages and times taken. All those numbers will vary (speed-wise) based on the speed of the machine being used, the amount of memory, and so on.

More compression benchmarks:

  • Thanks, but I have read that and thought it was a bit brief (see my point in the question). Do you know of any more detailed tests in more varying scenarios? – Ash May 06 '09 at 11:27
  • Added a couple of links that I hope provide more information. –  May 06 '09 at 11:30
2

Comparing zip, 7z and rar in two cases

It depends on what exactly you're compressing, but in general 7z makes better use of multiple processors, and the 7z compression format itself yields higher compression than zip, and sometimes higher than rar (rar and 7z are often nearly equivalent, but rar isn't free...).

My tests a few months ago gave these results:

Compressing a single 17 MB Access database file:

Database.mdb 17,240,064 (original)
Database.zip  1,634,794 (Regular zip, 11:1)
Database.rar    262,212 (RAR compression, 66:1)
Database.7z     195,678 (7-zip compression, 88:1)

Compressing a folder containing over nine thousand files of varying types (903,488 KB; a combination of source code and all the tools surrounding it for software being developed) gave the following:

Type   Time  Size (KB)  Compression
ZIP    7:28   247,529   3.7:1
RAR    8:15   222,232   4.1:1
7z    10:49   181,633   5.0:1

For time purposes, this was on a Core2 Duo, 2GHz, 1GB RAM, and a cheap hard drive.

So in the two cases I tested, 7z gave a substantial improvement in compression ratio over zip, and even improved on rar, but 7z was certainly slower. Not dramatically so, but enough to be noted.

-Adam

Adam Davis
  • Nice tests. That MDB compression is huge. I only get 4 to 1 on the 100k binary index word files I tested on. I guess it shows how important it is to test using files/data similar to what your system will use. – Ash May 07 '09 at 00:52
  • Yeah, MDB files are all fluff and no substance. A binary file is going to compress less (because it uses all 8 bits, while text files use little more than 6 bits) and chances are good there's not much duplication going on in it. Always important to test though. You might have better luck playing with the compression settings - sometimes you can optimize it for the usage and get better results than the standard settings provide. – Adam Davis May 07 '09 at 04:47
0

I've just installed dar (but haven't had a chance to play with it yet). It's similar to tar with gzip or bzip2 compression, with the added ability to split the archive into multiple parts and calculate parity, so that if one or more parts are corrupted, the archive can be reconstructed from the parity files.

pgs