Directly answering the specific questions you posed:
Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).
There is a specific performance improvement, in general cases, from using tar, especially with the compression library built in (the `tar xvzf` or `tar xvjf` style command lines, where a compression library is used rather than a second process). This comes from two main causes:

- When processing a large number of relatively small files, especially those commonly used in distributing software, there is high redundancy between files. Compressing across many files gives higher overall compression than compressing each file individually, and the compressor's "dictionary" is computed once for every chunk of input, not once per file (see the sketch after this list).

- tar understands file systems. It is designed to save and restore a working/workable operating system. It deeply grasps exactly what is important on a UNIX file system, and faithfully captures and restores that. Other tools... not always, especially the zip family, which is better designed for sharing files amongst a family of OSs, where the document is the important thing, not a faithful, OS-sensitive copy.
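As a rough illustration of the first point, here is a sketch comparing whole-stream compression against per-file compression. The directory name `src/` is made up, and actual ratios depend entirely on the data:

```sh
# Compress a source tree two ways and compare the results.
# tar+gzip compresses one continuous stream, so redundancy shared
# between files is exploited; zip compresses each member separately.
tar czf src.tar.gz src/
zip -qr src.zip src/

# Compare on-disk sizes; for many small, similar files the .tar.gz
# is typically noticeably smaller.
du -h src.tar.gz src.zip
```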
Are there features of the tar file format that other file formats,
such as .7z and .zip do not have?
Sparse file handling. Some of the direct database libraries rely on sparse files - files whose nominal size may be gigabytes, but where the actual data written and stored is much, much less, so only a few blocks of disk are actually used. If you use an unaware tool, then on decompressing you end up with massive disk block consumption, all containing zeroes. Turning that back into a sparse file is... painful, if you even have the room to do it. You need a tool that grasps what a sparse file is, and respects that.
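A minimal sketch of that difference, assuming GNU tar and GNU coreutils (the file names and sizes are invented):

```sh
# Create a sparse file: 1 GiB apparent size, almost nothing allocated.
truncate -s 1G sparse.img

ls -lh sparse.img   # apparent size: ~1G
du -h  sparse.img   # actual disk usage: a few KB at most

# GNU tar can detect the holes and record them explicitly (-S / --sparse),
# so the archive stays small and extraction recreates a sparse file
# instead of a gigabyte of literal zeroes.
tar -cSf sparse.tar sparse.img
```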
Metadata. Unix has evolved some strange things over the years: 14-character file names, long file names, hard links, symlinks, sticky bits, setuid bits, inherited group access permissions, etc. Tar understands and reproduces these. File-sharing tools... not so much. A lot of people don't use links the way they could... If you've ever worked with software that does use links, and then used a non-aware tool to back up and restore, you now have a lot of independent files instead of a single file with many names. Pain. Your software fails and you have disk bloat.
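A small sketch of the hard-link point, assuming GNU tar (the file names are invented):

```sh
# Two directory entries, one inode, one copy of the data.
echo "shared data" > original
ln original alias
ls -li original alias          # same inode number, link count 2

# tar records the second name as a hard link, not a second copy...
tar -cf links.tar original alias

# ...so after extraction the two names still share one inode.
mkdir restore && tar -xf links.tar -C restore
ls -li restore/original restore/alias
```

A link-unaware archiver would instead produce two independent copies, doubling disk usage and breaking the shared-update behaviour the software relied on.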
Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?
tar works. It does the job it is designed for, well. There have been other touted replacements (cpio, pax, etc., etc.). But tar is installed on pretty much everything, and the compression libraries it uses are also very common for other reasons. Nothing else has come along that substantially beats what tar does. With no clear advantages, and a lot of embedded use and knowledge in the community, there will be no replacement. Tar has had a lot of use over the years. If we get major changes in the way that we think of file systems, or non-text files somehow become the way to transfer code (can't currently imagine how, but ignore that...), then you could find another tool. But then that wouldn't be the type of OS that we now use; it'd be a different thing, organised differently, and it would need its own tools.
The most important question, I think, that you didn't ask, is what jobs 'tar' is ill-suited to.
tar with compression is fragile. You need the entire archive, bit for bit. In my experience, it is not resilient. I've had single-bit errors result in multi-part archives becoming unusable. It does not introduce redundancy to protect against errors (which would defeat one of the questions you asked, about data compression). If there is a possibility of data corruption, then you want error checking with redundancy so you can reconstruct the data. That means, by definition, that you are not maximally compressed. You can't both have every bit of data being required and carrying its maximum value of meaning (maximum compression) and have every bit of data being capable of loss and recovery (redundancy and error correction). So... what's the purpose of your archive? tar is great in high-reliability environments and when the archive can be reproduced from source again. IME, it's actually worse at the original thing its name suggests - tape archiving. Single-bit errors on a tape (or worse, single-bit errors in a tape head, where you lose one bit in every byte of a whole tape or archive) result in the data becoming unusable. With sufficient redundancy and error detection and correction, you can survive either of those problems.
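If corruption is a real risk, one common approach is to keep the archive compressed and generate separate parity data alongside it, e.g. with par2. A sketch, assuming the par2cmdline tool; the path and the 10% redundancy figure are arbitrary, and flags may differ slightly between par2 implementations:

```sh
# Compress as usual...
tar czf backup.tar.gz /some/data

# ...then generate ~10% recovery data in separate .par2 files.
par2 create -r10 backup.tar.gz

# Later: verify the archive, and repair it if blocks were damaged.
par2 verify backup.tar.gz.par2
par2 repair backup.tar.gz.par2
```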
So... how much noise and corruption is there in the environment you're looking at, and can the source be used to regenerate a failed archive? The answer, from the clues that you've provided, is that the system is not noisy, and that the source is capable of regenerating an archive. In which case, tar is adequate.
tar with compression also doesn't play well with pre-compressed files. If you're sending around already-compressed data... just use tar, and don't bother with the compression stage - it just adds CPU cycles to do not much. That means that you do need to know what you're sending around and why. If you care. If you don't care about those special cases, then tar will faithfully copy the data around, and the compressor will faithfully fail to do much useful to make it smaller. No big problem, other than some CPU cycles.
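A quick sketch of that point (the `media/` directory is hypothetical; sizes and timings depend on the data):

```sh
# Archiving a directory of already-compressed media (JPEGs, MP4s, ...).
tar cf  media.tar    media/    # plain tar: fast, size ~ sum of inputs
tar czf media.tar.gz media/    # tar+gzip: extra CPU time...

du -h media.tar media.tar.gz   # ...for little or no size reduction
```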
It's a very good question. I too highly dislike their whole operation of installing software with either odd names or that I can't simply apt-get. Only reason why I can see it getting downvoted is that this is more of a question for Unix/Linux. However SU should accept this. – Griffin – 2013-03-14T14:38:44.850
@Griffin: The question is not about installing software from tarballs. It is about using the Tar format (e.g. over Zip or RAR) – user1686 – 2013-03-14T14:52:02.223
I disagree that it "wastes time". If you mean performance, there is no actual performance penalty for tar as the format is very efficient. If you mean it wastes your time, I don't see how `tar xvzf` is harder than `7z -x`... – allquixotic – 2013-03-14T15:28:08.647
Allquixotic, I mean that you have to extract the archive twice, the first time to extract the tar, and the second to extract from the tar. – MarcusJ – 2013-03-14T15:54:34.850
He seems to be lamenting the fact that tar does not store a catalog at the start, so gui compression tools that want to list the contents prior to extracting have to decompress the whole tar just to list the contents, then they decompress it again when extracting. – psusi – 2013-03-14T16:02:10.760
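A sketch of the difference that comment describes (the archive names are hypothetical):

```sh
# Listing a .tar.gz means decompressing the stream to find the file
# headers scattered through it:
tar tzf archive.tar.gz

# Listing a .zip only reads the central directory at the end of the
# file, so it is nearly instant even for a huge archive:
unzip -l archive.zip
```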
@MarcusJ Usually, the tar.xx formats have a one-line solution. If you have tar.gz, for example, you could use `tar -xzf <file>.tar.gz` and it will decompress and extract all at once. – Kruug – 2013-03-14T16:03:26.867
psusi, no no no, I'm talking about the fact that tar needs a separate compressor and decompressor, so basically when you open a tar.gz, you need to extract BOTH the gz file to get the tar, then have to extract the tar file, instead of merely decompressing something like a 7z - in one step. It takes more cpu power to do it like this, and seems redundant. – MarcusJ – 2013-03-14T16:04:22.320
@MarcusJ, both steps have to be done either way, so it takes no more cpu power. – psusi – 2013-03-14T16:05:30.167
Not to say you're wrong or anything, but how would a 7z require both steps? It would merely load the file, then decompress whatever was selected to be decompressed. :/ – MarcusJ – 2013-03-14T16:06:54.330
@MarcusJ: you think 7z somehow magically knows where each file starts in an archive? Besides, the usual compression algorithms (gzip, bzip2) work with streaming the content: no need to complete 100% the first stage before next. – nperson325681 – 2013-03-14T16:09:19.680
Which step do you think it doesn't have to do? It has to parse the file format, and it has to decompress the content. The difference is really just in the order the two are done. `tar` decompresses the content first, then parses the archive. `7zip` parses the archive, then decompresses the file content (the metadata is uncompressed). – psusi – 2013-03-14T16:17:01.503
Also @MarcusJ you seem to be confusing two different things: when you do `tar xvzf`, the uncompressed data is not written to hard disk in `.tar` format! You're right that if you ran `gunzip blah.tar.gz` and then `tar xf blah.tar`, it would write the data to disk twice (once as a .tar and again as files in the filesystem), but nobody actually does it that way. The `tar xzf` uses a UNIX Pipe (basically a memory copy) to transfer the uncompressed data from `gzip` (or whatever compressor) to `tar`, so the data is not written to disk in `.tar` format. – allquixotic – 2013-03-14T16:41:34.607
@grawity I understand that. I was simply trying to ensure him that it wouldn't be downvoted. Judging by the response I don't think he's in too much fear of that anymore. – Griffin – 2013-03-14T17:02:35.563
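To make the pipe point concrete, a small sketch (the archive name is hypothetical):

```sh
# These two are equivalent; neither writes a .tar file to disk.
tar xzf foo.tar.gz

gzip -dc foo.tar.gz | tar xf -   # explicit decompressor feeding tar via a pipe
```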
One thing I know is that `tar` (especially compressed) behaves awfully when it comes to data corruption. Small redundancy / recovery data added by modern formats is worth gold – PPC – 2013-03-14T19:15:41.523
`tar` is superior for streaming. Unlike `zip`, you don't have to wait for the central directory. For archiving, this can also be a disadvantage (slower to list contents). `tar xvzf` will also automatically use two processes/cores, so it's not inefficient to split the two processes. – user239558 – 2013-03-14T23:33:04.787
@PPC: that's what PAR files are for. Tar is a Unix utility; as such, error correction is best left to dedicated tools. – André Paramés – 2013-03-15T11:22:48.480
Hmm, tar keeps soft links. I can recall back in the day doing: "tar cf - | ( cd /somewhere/else ; tar xf -)" rather a lot, because "cp" didn't have a flag to respect soft links. Don't know if it does today - if I encountered the problem, I'd probably just use 'tar' this way again. – Thomas Andrews – 2013-03-15T23:35:00.303
Why use 1 command when 2 suffice? – user541686 – 2013-03-16T06:42:17.930
@Kruug: GNU tar automatically applies the `z` (or `j`, or `J`) flag: `tar xf foo.tar.gz`. It does this based on the actual content of the file, not its name, so it still works even if a gzipped tar file is named `foo.tar`. – Keith Thompson – 2013-03-16T20:29:14.130
@psusi however, if you want to extract just a single file, AFAIK tar has to decompress the whole archive first, while another format could decompress only the target file instead. – o0'. – 2014-04-30T21:08:49.293
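Two small sketches of what these last comments describe, assuming GNU tar (the archive and member names are hypothetical):

```sh
# GNU tar detects the compression from the file's content, so the
# z/j/J flag can be omitted on extraction:
tar xf foo.tar.gz

# Extracting a single member works, but tar still reads (and
# decompresses) the archive sequentially to find it:
tar xzf foo.tar.gz path/inside/archive/file.txt
```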