Directly answering the specific questions you posed:
Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).
There is a specific performance improvement, in general cases, from using tar, especially with the compression library built in (the `tar xvzf` or `tar xvjf` style command lines, where a compression library is used rather than a second process). This comes from two main causes:

- When processing a large number of relatively small files, especially those commonly used in distributing software, there is high redundancy between files. Compressing across many files gives higher overall compression than compressing each file individually, and the compressor's "dictionary" is computed once for every chunk of input, not once per file (see the sketch after this list).

- tar understands file systems. It is designed to save and restore a working/workable operating system. It deeply grasps exactly what is important on a UNIX file system, and faithfully captures and restores that. Other tools... not always, especially the zip family, which is better designed for sharing files amongst a family of OSs, where the document is the important thing, not a faithful, OS-sensitive copy.
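As a rough illustration of the first point, here is a sketch comparing whole-stream compression against per-file compression. The directory name `src/` is made up, and actual ratios depend entirely on the data:

```sh
# Compress a source tree two ways and compare the results.
# tar+gzip compresses one continuous stream, so redundancy shared
# between files is exploited; zip compresses each member separately.
tar czf src.tar.gz src/
zip -qr src.zip src/

# Compare on-disk sizes; for many small, similar files the .tar.gz
# is typically noticeably smaller.
du -h src.tar.gz src.zip
```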
Are there features of the tar file format that other file formats,
such as .7z and .zip do not have?
Sparse file handling. Some of the direct database libraries rely on sparse files - files whose nominal size may be gigabytes, but where the actual data written and stored is much, much less, so only a few blocks of disk are actually used. If you use an unaware tool, then on decompressing you end up with massive disk block consumption, all containing zeroes. Turning that back into a sparse file is... painful, if you even have the room to do it. You need a tool that grasps what a sparse file is, and respects that.
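A minimal sketch of that difference, assuming GNU tar and GNU coreutils (the file names and sizes are invented):

```sh
# Create a sparse file: 1 GiB apparent size, almost nothing allocated.
truncate -s 1G sparse.img

ls -lh sparse.img   # apparent size: ~1G
du -h  sparse.img   # actual disk usage: a few KB at most

# GNU tar can detect the holes and record them explicitly (-S / --sparse),
# so the archive stays small and extraction recreates a sparse file
# instead of a gigabyte of literal zeroes.
tar -cSf sparse.tar sparse.img
```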
Metadata. Unix has evolved some strange things over the years: 14-character file names, long file names, hard links, symlinks, sticky bits, setuid bits, inherited group access permissions, etc. Tar understands and reproduces these. File-sharing tools... not so much. A lot of people don't use links the way they could... If you've ever worked with software that does use links, and then used a non-aware tool to back up and restore, you now have a lot of independent files instead of a single file with many names. Pain. Your software fails and you have disk bloat.
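A small sketch of the hard-link point, assuming GNU tar (the file names are invented):

```sh
# Two directory entries, one inode, one copy of the data.
echo "shared data" > original
ln original alias
ls -li original alias          # same inode number, link count 2

# tar records the second name as a hard link, not a second copy...
tar -cf links.tar original alias

# ...so after extraction the two names still share one inode.
mkdir restore && tar -xf links.tar -C restore
ls -li restore/original restore/alias
```

A link-unaware archiver would instead produce two independent copies, doubling disk usage and breaking the shared-update behaviour the software relied on.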
Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?
tar works. It does the job it is designed for, well. There have been other touted replacements (cpio, pax, etc., etc.). But tar is installed on pretty much everything, and the compression libraries it uses are also very common for other reasons. Nothing else has come along that substantially beats what tar does. With no clear advantages, and a lot of embedded use and knowledge in the community, there will be no replacement. Tar has had a lot of use over the years. If we get major changes in the way that we think of file systems, or non-text files somehow become the way to transfer code (can't currently imagine how, but ignore that...), then you could find another tool. But then that wouldn't be the type of OS that we now use; it'd be a different thing, organised differently, and it would need its own tools.
The most important question, I think, that you didn't ask, is what jobs 'tar' is ill-suited to.
tar with compression is fragile. You need the entire archive, bit for bit. In my experience, it is not resilient. I've had single-bit errors result in multi-part archives becoming unusable. It does not introduce redundancy to protect against errors (which would defeat one of the questions you asked, about data compression). If there is a possibility of data corruption, then you want error checking with redundancy so you can reconstruct the data. That means, by definition, that you are not maximally compressed. You can't both have every bit of data being required and carrying its maximum value of meaning (maximum compression) and have every bit of data being capable of loss and recovery (redundancy and error correction). So... what's the purpose of your archive? tar is great in high-reliability environments and when the archive can be reproduced from source again. IME, it's actually worse at the original thing its name suggests - tape archiving. Single-bit errors on a tape (or worse, single-bit errors in a tape head, where you lose one bit in every byte of a whole tape or archive) result in the data becoming unusable. With sufficient redundancy and error detection and correction, you can survive either of those problems.
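If corruption is a real risk, one common approach is to keep the archive compressed and generate separate parity data alongside it, e.g. with par2. A sketch, assuming the par2cmdline tool; the path and the 10% redundancy figure are arbitrary, and flags may differ slightly between par2 implementations:

```sh
# Compress as usual...
tar czf backup.tar.gz /some/data

# ...then generate ~10% recovery data in separate .par2 files.
par2 create -r10 backup.tar.gz

# Later: verify the archive, and repair it if blocks were damaged.
par2 verify backup.tar.gz.par2
par2 repair backup.tar.gz.par2
```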
So... how much noise and corruption is there in the environment you're looking at, and can the source be used to regenerate a failed archive? The answer, from the clues that you've provided, is that the system is not noisy, and that the source is capable of regenerating an archive. In which case, tar is adequate.
tar with compression also doesn't play well with pre-compressed files. If you're sending around already-compressed data... just use tar, and don't bother with the compression stage - it just adds CPU cycles to do not much. That means that you do need to know what you're sending around and why. If you care. If you don't care about those special cases, then tar will faithfully copy the data around, and the compressor will faithfully fail to do much useful to make it smaller. No big problem, other than some CPU cycles.
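A quick sketch of that point (the `media/` directory is hypothetical; sizes and timings depend on the data):

```sh
# Archiving a directory of already-compressed media (JPEGs, MP4s, ...).
tar cf  media.tar    media/    # plain tar: fast, size ~ sum of inputs
tar czf media.tar.gz media/    # tar+gzip: extra CPU time...

du -h media.tar media.tar.gz   # ...for little or no size reduction
```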
It's a very good question. I too highly dislike their whole operation of installing software with either odd names or that I can't simply apt-get. Only reason why I can see it getting downvoted is that this is more of a question for Unix/Linux. However SU should accept this. – Griffin – 2013-03-14T14:38:44.850
@Griffin: The question is not about installing software from tarballs. It is about using the Tar format (e.g. over Zip or RAR) – user1686 – 2013-03-14T14:52:02.223
I disagree that it "wastes time". If you mean performance, there is no actual performance penalty for tar as the format is very efficient. If you mean it wastes your time, I don't see how `tar xvzf` is harder than `7z -x`... – allquixotic – 2013-03-14T15:28:08.647
Allquixotic, I mean that you have to extract the archive twice, the first time to extract the tar, and the second to extract from the tar. – MarcusJ – 2013-03-14T15:54:34.850
He seems to be lamenting the fact that tar does not store a catalog at the start, so gui compression tools that want to list the contents prior to extracting have to decompress the whole tar just to list the contents, then they decompress it again when extracting. – psusi – 2013-03-14T16:02:10.760
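A sketch of the difference that comment describes (the archive names are hypothetical):

```sh
# Listing a .tar.gz means decompressing the stream to find the file
# headers scattered through it:
tar tzf archive.tar.gz

# Listing a .zip only reads the central directory at the end of the
# file, so it is nearly instant even for a huge archive:
unzip -l archive.zip
```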
@MarcusJ Usually, the tar.xx formats have a one-line solution. If you have tar.gz, for example, you could use `tar -xzf <file>.tar.gz` and it will decompress and extract all at once. – Kruug – 2013-03-14T16:03:26.867
psusi, no no no, I'm talking about the fact that tar needs a separate compressor and decompressor, so basically when you open a tar.gz, you need to extract BOTH the gz file to get the tar, then have to extract the tar file, instead of merely decompressing something like a 7z - in one step. It takes more cpu power to do it like this, and seems redundant. – MarcusJ – 2013-03-14T16:04:22.320
@MarcusJ, both steps have to be done either way, so it takes no more cpu power. – psusi – 2013-03-14T16:05:30.167
Not to say you're wrong or anything, but how would a 7z require both steps? It would merely load the file, then decompress whatever was selected to be decompressed. :/ – MarcusJ – 2013-03-14T16:06:54.330
@MarcusJ: you think 7z somehow magically knows where each file starts in an archive? Besides, the usual compression algorithms (gzip, bzip2) work with streaming the content: no need to complete 100% the first stage before next. – nperson325681 – 2013-03-14T16:09:19.680
Which step do you think it doesn't have to do? It has to parse the file format, and it has to decompress the content. The difference is really just in the order the two are done. `tar` decompresses the content first, then parses the archive. `7zip` parses the archive, then decompresses the file content (the metadata is uncompressed). – psusi – 2013-03-14T16:17:01.503
Also @MarcusJ you seem to be confusing two different things: when you do `tar xvzf`, the uncompressed data is not written to hard disk in `.tar` format! You're right that if you ran `gunzip blah.tar.gz` and then `tar xf blah.tar`, it would write the data to disk twice (once as a .tar and again as files in the filesystem), but nobody actually does it that way. The `tar xzf` uses a UNIX Pipe (basically a memory copy) to transfer the uncompressed data from `gzip` (or whatever compressor) to `tar`, so the data is not written to disk in `.tar` format. – allquixotic – 2013-03-14T16:41:34.607
@grawity I understand that. I was simply trying to ensure him that it wouldn't be downvoted. Judging by the response I don't think he's in too much fear of that anymore. – Griffin – 2013-03-14T17:02:35.563
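To make the pipe point concrete, a small sketch (the archive name is hypothetical):

```sh
# These two are equivalent; neither writes a .tar file to disk.
tar xzf foo.tar.gz

gzip -dc foo.tar.gz | tar xf -   # explicit decompressor feeding tar via a pipe
```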
One thing I know is that `tar` (especially compressed) behaves awfully when it comes to data corruption. Small redundancy / recovery data added by modern formats is worth gold – PPC – 2013-03-14T19:15:41.523
`tar` is superior for streaming. Unlike `zip`, you don't have to wait for the central directory. For archiving, this can also be a disadvantage (slower to list contents). `tar xvzf` will also automatically use two processes/cores, so it's not inefficient to split the two processes. – user239558 – 2013-03-14T23:33:04.787
@PPC: that's what PAR files are for. Tar is a Unix utility; as such, error correction is best left to dedicated tools. – André Paramés – 2013-03-15T11:22:48.480
Hmm, tar keeps soft links. I can recall back in the day doing: "tar cf - | ( cd /somewhere/else ; tar xf -)" rather a lot, because "cp" didn't have a flag to respect soft links. Don't know if it does today - if I encountered the problem, I'd probably just use 'tar' this way again. – Thomas Andrews – 2013-03-15T23:35:00.303
Why use 1 command when 2 suffice? – user541686 – 2013-03-16T06:42:17.930
@Kruug: GNU tar automatically applies the `z` (or `j`, or `J`) flag: `tar xf foo.tar.gz`. It does this based on the actual content of the file, not its name, so it still works even if a gzipped tar file is named `foo.tar`. – Keith Thompson – 2013-03-16T20:29:14.130
@psusi however, if you want to extract just a single file, AFAIK tar has to decompress the whole archive first, while another format could decompress only the target file instead. – o0'. – 2014-04-30T21:08:49.293
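Two small sketches of what these last comments describe, assuming GNU tar (the archive and member names are hypothetical):

```sh
# GNU tar detects the compression from the file's content, so the
# z/j/J flag can be omitted on extraction:
tar xf foo.tar.gz

# Extracting a single member works, but tar still reads (and
# decompresses) the archive sequentially to find it:
tar xzf foo.tar.gz path/inside/archive/file.txt
```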