extract single file from huge tgz file

19

2

I have a huge tar file (about 500G) and I want to extract just a single file from it.
However, when I run tar -xvf file.tgz path/to/file it seems to read through the entire archive, and the extraction takes over an hour. I've also tried --exclude=ignore.txt, where ignore.txt is a list of patterns, in an attempt to stop it from traversing futile paths, but that doesn't seem to work.

Perhaps I don't understand tar... Is there a way to quickly extract the file?

Brian

Posted 2013-10-08T00:28:23.887

Reputation: 293

I am wondering about the same. The file I am looking for is found quickly and extracted - and then I need to wait for an hour for the rest of the archive to be processed :o( – maasha – 2014-09-29T07:55:20.747

Answers

14

Unfortunately, in order to unpack a single member of a .tar.gz archive you have to process the whole archive, and there is not much you can do to fix that.

This is where .zip (and some other formats like .rar) archives work much better: the zip format has a central directory of all the files it contains, with direct offsets pointing into the middle of the zip file, so individual members can be extracted quickly without processing the whole thing.
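To see the difference concretely, here is a small sketch using only Python's standard library (an illustration, not part of the original answer): zipfile can pull out one member directly, because ZipFile reads the central directory at the end of the file and then seeks straight to that member's offset.

```python
import io
import zipfile

# Build a small zip in memory with several members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "first")
    zf.writestr("b.txt", "second")
    zf.writestr("c.txt", "third")

# The central directory lets ZipFile jump straight to one member:
# it never has to scan through a.txt or b.txt to reach c.txt.
with zipfile.ZipFile(buf) as zf:
    data = zf.read("c.txt")

print(data)  # b'third'
```

The same direct access is exactly what a .tar.gz cannot offer, for the reasons explained below.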

You might ask: why is processing a .tar.gz so slow?

.tar.gz (often shortened to .tgz) is simply a .tar archive compressed with the gzip compressor. gzip is a streaming compressor that can only work on one continuous stream. If you want to read any part of a gzip stream, you have to decompress everything before it, and this is what really kills it for .tar.gz (and for .tar.bz2, .tar.xz and other similar formats based on .tar).

The .tar format itself is actually very, very simple. It is just a stream of 512-byte file or directory headers (name, size, etc.), each followed by the file or directory contents (padded with 0 bytes to a multiple of 512 if necessary). An all-zero 512-byte block where a header would be marks the end of the .tar archive.
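To show just how simple the format is, here is a standard-library Python sketch (an illustration, not part of the original answer) that builds a tiny tar in memory and then walks the raw 512-byte headers by hand: the member name sits at offset 0 of each header and the size is octal ASCII at offset 124, so skipping from header to header is pure arithmetic.

```python
import io
import tarfile

# Build a small uncompressed tar in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, payload in [("one.txt", b"x" * 700), ("two.txt", b"hello")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

raw = buf.getvalue()

# Walk the archive by hand: each member is a 512-byte header
# (name at offset 0, size as octal ASCII at offset 124),
# followed by the contents padded up to a 512-byte boundary.
names = []
pos = 0
while True:
    header = raw[pos:pos + 512]
    if header == b"\0" * 512:          # all-zero block: end of archive
        break
    name = header[:100].rstrip(b"\0").decode()
    size = int(header[124:136].rstrip(b"\0 "), 8)
    names.append((name, size))
    pos += 512 + ((size + 511) // 512) * 512   # skip the padded contents

print(names)  # [('one.txt', 700), ('two.txt', 5)]
```

The final line of the loop is the "seek to the next header" step described in the next paragraph: with the size in hand, no byte of the member's contents ever needs to be examined.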

Some people think that even uncompressed .tar archive members cannot be accessed quickly, but this is not quite true. If the .tar archive contains a few big files, you can quickly seek from one header to the next, and thus find the member you need in a few seeks (though in the worst case it takes as many seeks as there are archive members). If your .tar archive consists of lots of tiny files, quick member retrieval becomes effectively impossible even for an uncompressed .tar.

mvp

Posted 2013-10-08T00:28:23.887

Reputation: 3 705

3gzip can stream uncompressed data; it doesn't have to decompress the whole thing first. But since .tar is short for tape archive, you do need to traverse the file until you find the member you are looking for. And tar will keep looking even after finding it, because there might be another, later copy of the file further on in the archive. – kurtm – 2013-10-08T04:35:25.963

9

If you're extracting just one file from a large tar file, you're using GNU tar, and you can guarantee that the tar file has never been appended to, then you can get a significant performance boost by using --occurrence.

This option tells tar to stop as soon as it finds the first occurrence of each file you've requested, so e.g.

tar xf large-backup.tar --occurrence etc/passwd etc/shadow

will not spool through the whole tarball after it finds one copy of each of passwd and shadow; instead it will stop. If those files appear near the end, the performance gain won't be much, but if they appear even halfway through a 500G file you'll save a lot of time.

For people using tar for single shot backups and not using real tape drives this situation is probably the typical case.

Note that you can also pass --occurrence=NUMBER to retrieve the NUMBERth occurrence of each file, which helps if you know that there are multiple versions in the archive. The default NUMBER is 1.
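For the curious, this early-stop behavior can be mimicked with Python's standard tarfile module (a sketch, not a replacement for tar itself): stream mode ("r|gz") reads the archive strictly once, member by member, so you can break out as soon as the target turns up and the rest of the archive is never decompressed.

```python
import io
import tarfile

# Build a small .tar.gz in memory, with the target early in the archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for name, payload in [("etc/passwd", b"root:x:0:0"),
                          ("var/big.bin", b"\0" * 100_000)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
buf.seek(0)

# Stream mode yields members in order; stop at the first match.
data = None
with tarfile.open(fileobj=buf, mode="r|gz") as tf:
    for member in tf:
        if member.name == "etc/passwd":
            data = tf.extractfile(member).read()
            break   # everything after this member is never read

print(data)  # b'root:x:0:0'
```

As with --occurrence, the saving depends entirely on where the member sits in the archive.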

phogg

Posted 2013-10-08T00:28:23.887

Reputation: 899

Is there a way to create the tar so that a specific file would be first to come out? so that --occurrence would kick in immediately on the first file? I'm guessing it's about filenames, so something called aaaaa.jpg would come out first for example? – Jeff – 2019-01-16T17:20:51.747

1@Jeff: Not really. This merely prevents tar from continuing to search the tarball for newer versions of a file it has found. Instead it returns, as the man page says, the Nth occurrence. If you specify one file to extract on the command line and you say --occurrence then tar will exit as soon as it has found that file, and thus effectively stop at the "first file." – phogg – 2019-01-17T21:48:42.103

2

When dealing with a large tarball, use:

--fast-read to extract only the first archive entry that matches each filename operand, path/to/file in this case (member names are usually unique in a tarball anyway)

tar -xvf file.tgz --fast-read path/to/file

The above will search until it finds a match and then exit. Note that --fast-read is a bsdtar option, not a GNU tar one; see the comments below.

ryan

Posted 2013-10-08T00:28:23.887

Reputation: 131

1

I wanted to understand why this is still at 0 points. man tar (GNU tar 1.29) doesn't even mention this option, yet Ubuntu seems to have it available by default. Reading quickly, I'm not sure what --fast-read does differently from --occurrence. But then --occurrence is not even on the Ubuntu page, though it is in man tar. Are --fast-read and --occurrence possibly the same thing?

– Jeff – 2019-01-21T17:44:45.123

Neither of these options is specified by the standard and, as always with non-standard options, care must be taken to be sure the utility on your system supports them. The --occurrence option is supported by GNU tar. The --fast-read option is supported by recent versions of FreeBSD tar, packaged as bsdtar by Ubuntu. See here for more.

– phogg – 2019-09-13T22:46:54.677

1

Unfortunately, the tar file format contains no centralized table of contents, so the archive must be read sequentially to locate a particular file. It was originally designed for tape backups ("tar" is short for tape archive), which wouldn't have supported random access in any case.

So, you'll probably just have to wait.

user55325

Posted 2013-10-08T00:28:23.887

Reputation: 4 693