Which archival formats efficiently extract a single file from an archive?

4

1

Extracting a single file from a zip archive is a fast operation, so I assumed the same would be true for TAR, but I learned that even though a TAR file is uncompressed, extracting a single file can take a very long time. I had used tar to back up my home folder on OS X, and I then needed a single file back. Since tar doesn't know where in the archive the file is, it had to scan the entire 300 GB file before it could extract it. This makes TAR a terrible format for most backup scenarios, so I'd like to know my options.

So, which archival file formats are suitable for quickly extracting a single file?

Even though this question isn't really about compression, I don't mind answers listing formats that combine archiving and compression (like zip), in which case "solid compression" will matter.
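
To make the slow case concrete, this is roughly what the commands look like (the archive name and path are made up for illustration):

    # tar has no central index, so extracting one member means reading the
    # archive sequentially from the start until that member is found
    tar -xf home-backup.tar "Users/me/Documents/report.txt"

    # listing the contents has the same cost - it scans the whole archive
    tar -tf home-backup.tar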

oligofren

Posted 2018-12-18T22:36:37.387

Reputation: 842

Remember that tar stands for tape archive, so keep in mind it was originally designed (in the 1970s) to work with tapes (and still works with tape drives today). It definitely wasn't meant for random or quick access. – LawrenceC – 2018-12-18T23:40:31.283

It is also targeted at streaming into pipes, which doesn't work that well with indices. GNU tar does add an index, though. – oligofren – 2018-12-19T11:12:07.147

Answers

3

It sounds like speed & efficiency of extraction are your main concerns, and I'm assuming you're using Linux or macOS and so want to preserve special file attributes (the ones zip & 7z ignore). In that case, an excellent archive format would be:

  • An ext[2/3/4] filesystem - Just copy the files onto it, and extracting a single file is as quick & easy as mounting the filesystem & reading the file. You could put the whole archive filesystem inside a single archive file if you wish: just create a file big enough, format it & mount it (you don't even need the -o loop option anymore). There's a sketch of the commands after this list.

    Pros:

    • A nice bonus is you can easily add encryption (LUKS) to the whole archive file too, or any other encryption the filesystem supports (eCryptfs, EncFS, etc).

    • You can also use rsync-based backup solutions easily.

    • It's easy to add/delete files (up to the overall archive file's size).

    Cons:

    • If using a single archive file, you have to pick its size before adding files, and it doesn't dynamically change size.
    • It's still possible to expand or shrink the entire archive even if it's in a single file, but you need tools like resize2fs to shrink the filesystem, then truncate to shrink the file (or vice versa to expand).
  • The same filesystem you're already using, in case you're using macOS and it likes something other than ext. I'm pretty sure macOS's mount command works with a single large archive file too.
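
Here's a minimal sketch of the filesystem-in-a-file approach on Linux (file names and sizes are just examples):

    # create a large sparse file and put an ext4 filesystem in it
    truncate -s 320G home-backup.img
    mkfs.ext4 home-backup.img          # mkfs may ask for confirmation since it's a regular file

    # mount it and copy files in (modern mount sets up the loop device itself)
    sudo mkdir -p /mnt/backup
    sudo mount home-backup.img /mnt/backup
    sudo rsync -a ~/ /mnt/backup/
    sudo umount /mnt/backup

    # later: getting a single file back is just a mount and a copy
    sudo mount -o ro home-backup.img /mnt/backup
    cp /mnt/backup/Documents/report.txt ~/

    # the resize2fs/truncate combo from the cons above, for shrinking afterwards
    e2fsck -f home-backup.img
    resize2fs home-backup.img 200G     # shrink the filesystem first...
    truncate -s 200G home-backup.img   # ...then shrink the file to match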

If you do want some compression as well, that's usually where solid archives & slow reading come in. Some filesystems support compression directly (btrfs, reiserfs/reiser4, planned for ext?), but I'd just go with:

  • SquashFS - It might be the compression king: it saves file attributes and allows quick extraction of a single file (in fact, you can mount it & browse every file). It's great for archives too, and has adjustable levels of compression - use it. A short example follows after this list.

    Or perhaps combine it with incremental backups & overlay mounts for a nice "partial backups but full files" solution.

    A con is that it's impossible to increase or shrink the size of the archive, or to add/delete files.

    Or just use an existing backup product (Time Machine?).
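
A rough sketch of the SquashFS route (paths are examples; mksquashfs/unsquashfs come from the squashfs-tools package):

    # build a compressed, read-only archive of the home directory
    # (write the output somewhere outside the tree being archived)
    mksquashfs ~/ /backups/home-backup.sqsh -comp xz

    # pull a single file out without mounting anything
    unsquashfs -d restored /backups/home-backup.sqsh Documents/report.txt

    # or mount it read-only and browse/copy like a normal filesystem
    sudo mkdir -p /mnt/backup
    sudo mount -o loop,ro /backups/home-backup.sqsh /mnt/backup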

If you really wanted to use an archive format like 7z/zip anyway, but still keep the file attributes, you could tar each file individually (preserving the attributes) and then store the separate tar files in a 7z/zip archive. It adds an extra step and some hassle, but it would let you easily extract a single (tar'd) file, and expand or shrink the archive without re-compressing everything (if it's not a solid archive).
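
One way that could look, purely as an illustration (the staging directory and paths are made up):

    # one small tar per file, so each file's attributes survive inside the zip
    mkdir -p /tmp/tars
    for f in ~/Documents/*; do
        tar -cf "/tmp/tars/$(basename "$f").tar" "$f"
    done
    zip -r -0 docs-backup.zip /tmp/tars        # -0 = store only; omit it to also compress

    # later, restore one file: unzip its tar, then unpack that tar
    unzip docs-backup.zip 'tmp/tars/report.txt.tar' -d restore
    tar -xf restore/tmp/tars/report.txt.tar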

Xen2050

Posted 2018-12-18T22:36:37.387

Reputation: 12 097

-1

The Zip format was made for extracting single files randomly and efficiently. A Zip archive contains a catalog (the central directory) at its end, allowing individual files - compressed or not - to be reached quickly.
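
For example (archive name and member path are placeholders), both of these read only the catalog plus the one member, rather than scanning the whole file:

    unzip -l home-backup.zip                                # fast listing from the catalog at the end
    unzip home-backup.zip "Users/me/Documents/report.txt"   # seeks straight to that member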

Zerte

Posted 2018-12-18T22:36:37.387

Reputation: 142

Cool, but we knew this. Do you know of any other formats doing the same? – oligofren – 2018-12-18T23:21:13.977

OP already said this in his Question. He's looking for other suggestions besides .zip. – Spiff – 2018-12-18T23:38:13.053

-1

Most modern compression and archive formats include a database or catalog of the files and folders stored within them. These include: 7-Zip, ACE, ARC, ARJ, BZIP2, CAB, CPIO, GZIP, IMG, ISO (ISO9660), LHA, RAR, RPM, SFX, SQX, TAR, TBZ (TAR.BZ), TGZ (TAR.GZ), TXZ (TAR.XZ), XZ, ZIP, Zip64, and ZOO. These formats will allow you to extract an individual file or folder, as needed.

ZIP is by far the most common and widely used. Some operating systems, like Windows, have native support for ZIP files, allowing you to use a ZIP file as if it were a standard folder.

As for efficiency of extracting an individual file, I have never seen a test on this. However, I have used ZIP archives in this manner, so I can say it is pretty fast, depending on the size of the file.

Keltari

Posted 2018-12-18T22:36:37.387

Reputation: 57 019

Many of the formats you listed are just compression formats, not archive formats. ZIP is both-in-one, but TAR is just an uncompressed archive format, and GZIP is just a compression format. If you want to take a directory full of files and put them all inside one compressed file, you can't use TAR alone or GZIP alone; you have to use TAR to make the archive, and GZIP to compress it. Also, as OP said, TAR doesn't meet his needs because it does not contain any kind of catalog/database/table-of-contents data structure up front. – Spiff – 2018-12-18T23:47:24.803
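
To illustrate the split between the two roles (file names are arbitrary):

    # tar does the archiving (many files -> one stream), gzip does the compression
    tar -cf - ~/Documents | gzip > documents.tar.gz

    # tar's -z flag is just shorthand for the same pipeline
    tar -czf documents.tar.gz ~/Documents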

@Spiff compression formats are a type of archive format. It doesn't matter if TAR meets his needs; you are capable of extracting a single file. He can determine his needs as necessary. – Keltari – 2018-12-19T00:11:32.037

2

No, not all compression formats are archive formats. Unix has always distinguished between compression (making a single file smaller) and archiving (storing a bunch of files inside a single file). If you come from a DOS/Windows or classic Mac background where formats like PKZIP and StuffIt! always combined both roles in one, you might not have learned that there are archive formats that don't compress, and compression formats that don't archive. Here, Wikipedia is smart enough to keep it straight: https://en.wikipedia.org/wiki/List_of_archive_formats

– Spiff – 2018-12-19T03:14:01.700

1

This is incorrect. Neither tar nor cpio has such an index (in POSIX versions - GNU tar does, but not BSD). When you list the contents, it is done by scanning the entire archive. This is to make it pipe friendly. So listing the files of a 100 GB archive involves reading up to 100 GB. Same goes for extraction of single files. If you are lucky, they might be at the start of the archive. – oligofren – 2018-12-19T10:36:56.887