
I have an incredibly large tarball. I'd like to extract several files out of the many thousands within the archive. I'm on CentOS 6.10 running GPFS 4.2.3. I've seen from this answer that pigz is useful for extracting the entire tarball, but extracting the whole thing is not an option because it would take up terabytes worth of space.

I've tried something like :

$ pigz -dc ../test.tar.gz | tar xf test/analysis/something/dist.txt
tar: test/analysis/something/dist.txt: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

I'm not exactly sure how to pass test/analysis/something/dist.txt as an argument to tar while piping the output of pigz into it. My intuition says to use xargs, but that fails as well.

$ pigz -dc ../test.tar.gz | xargs -I var | tar xf var test/analysis/something/dist.txt
tar: var: Cannot openxargs: Warning: a NUL character occurred in the input.  It cannot be passed through in the argument list.  Did you mean to use the --null option?
: No such file or directory
tar: Error is not recoverable: exiting now
xargs: /bin/echo: terminated by signal 13

QUESTION

  1. How do I quickly extract a single file from a large tarball using pigz?

2 Answers


The problem with your command is that you decompress the file to stdout, but instead of telling tar to read the archive from stdin, you tell it to extract from a nonexistent file.

The correct command would be:

$ pigz -dc ../test.tar.gz | tar xf - test/analysis/something/dist.txt
#                                  ^- this dash tells tar to read from stdin

However, basically you are decompressing the file into your memory, so unless you have terabytes of memory available, it will fill up even faster than decompressing to disk.
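For anyone who wants to verify the streaming extraction end to end, here is a minimal, self-contained sketch; it uses plain `gzip` as a stand-in for `pigz` (they produce the same stream format), and all the paths are made-up examples:

```shell
# Build a throwaway archive, then extract one member from the
# decompressed stream without unpacking anything else.
# gzip stands in for pigz here; swap in `pigz -dc` if it is installed.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p test/analysis/something
echo "hello" > test/analysis/something/dist.txt
echo "junk"  > test/analysis/something/other.txt
tar czf archive.tar.gz test
rm -r test                                   # only the archive remains
gzip -dc archive.tar.gz | tar xf - test/analysis/something/dist.txt
cat test/analysis/something/dist.txt         # prints: hello
ls test/analysis/something                   # only dist.txt was extracted
```

Note that GNU tar can also run the decompressor itself via `--use-compress-program=pigz` (shortened to `-I pigz` on newer versions), which avoids the explicit pipe; whether the tar shipped with CentOS 6 accepts that spelling is worth checking with `tar --help` first.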

Gerald Schneider
  • To make it clear: `However, basically you are decompressing the file into your memory, so unless you have terabytes of memory available, it will fill up even faster than decompressing to disk`. It will not. Both gzip/pigz and tar are perfectly able to operate on streaming data, and neither will consume a lot of memory. pigz will feed the pipe, and if tar can't take it fast enough, pigz will wait for it. Similarly, tar will read the archive and discard everything that is not asked for (here: dist.txt), then write the requested file to disk. That's it. – tansy Jul 08 '22 at 15:50
  • I just tested it with the biggest .tar.gz archive I could come up with, and neither process used more than 4 MB. Not for a second. – tansy Jul 08 '22 at 15:52

I agree with the author above; just a note about navigating files inside the tar archive:

pigz -dc <archive.tar.gz> | tar xf - <file-with-path-inside-archive>

To test/list the archive contents (tar option -t) and find your file:

pigz -dc <archive.tar.gz> | tar tf -

To look up the full file name in the archive:

pigz -dc <archive.tar.gz> | tar tf - | grep <file-name>
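Putting the commands above together, here is a self-contained end-to-end sketch on a throwaway archive (again using `gzip` in place of `pigz`, which produces the same stream format; all file names are invented for the demo):

```shell
# Look up a member's path in a gzipped tar stream, then extract
# just that member. gzip stands in for pigz here.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p proj/data
echo "42" > proj/data/results.csv
tar czf big.tar.gz proj
rm -r proj                                    # keep only the archive
# 1. Look up the member's exact path inside the archive.
member=$(gzip -dc big.tar.gz | tar tf - | grep 'results.csv')
# 2. Extract just that member from the stream.
gzip -dc big.tar.gz | tar xf - "$member"
cat proj/data/results.csv                     # prints: 42
```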
rook