Why are not all files compressed and how to improve the solution

8

I have a folder with about 20K files. The files are named according to the pattern xy_\d{1,5}_\d{4}\.abc, e.g. xy_12345_1234.abc. I wanted to compress the first 10K of them using this command:

ls | sort -n -k1.4,1.9 | head -n10000 | xargs tar -czf xy_0_10000.tar.gz

However, the resulting archive had only about 2K files inside.

ls | sort -n -k1.4,1.9 | head -n10000 | wc -l however returns 10000, as expected.

It seems to me that I am misunderstanding something basic here...

I am using zsh 5.0.2 on Linux Mint 17.1, GNU tar 1.27.1

EDIT:

forking as suggested by @Archemar sounds very plausible, with the latest fork overwriting the resulting file - the file contains the 'tail' of the files - 7773 to 9999.

result of xargs --show-limit:

  Your environment variables take up 3973 bytes
  POSIX upper limit on argument length (this system): 2091131
  POSIX smallest allowable upper limit on argument length (all systems): 4096
  Maximum length of command we could actually use: 2087158
  Size of command buffer we are actually using: 131072

replacing -c with -r or -u did not work in my case. The error message was tar: Cannot update compressed archives

using both -r and -u is invalid and fails with tar: You may not specify more than one '-Acdtrux', '--delete' or '--test-label' option

replacing -c with -a seems to be invalid as well and fails with tar: You must specify one of the '-Acdtrux', '--delete' or '--test-label' options, though I don't see the problem: azf and Acdtrux seem disjoint to me.

EDIT 2:

-T looks like a good way, I have also found an example here.

However when I try

ls | sort -n -k1.4,1.9 | head -n10000 | tar -czf xy_0_10000.tar.gz -T - I get tar: option requires an argument -- 'T'

Well, perhaps the filenames don't reach tar? But it looks like they do, because when I execute

ls | sort -n -k1.4,1.9 | head -n10000 | tar --null -czf xy_0_10000.tar.gz -T - I get tar: xy_0_.ab\nxy_1_...<the rest of filenames separated by literal \n>...998.ab Cannot stat: File name too long

So why is tar not seeing the filenames?

kostja

Posted 2015-09-22T14:22:16.823

Reputation: 467

and if you try a instead of c, in the tar command? – Olivier Dulac – 2015-09-22T16:12:23.313

5

Relevant: Don't parse the output of ls

– 8bittree – 2015-09-22T17:35:25.780

1

OP's files do not have tricky names. – Archemar – 2015-09-23T08:42:10.827

@8bittree - well as a general advice for robust shell scripts, yes. but what do you suggest instead for working with lists of files with the regular one-off oneliners? – kostja – 2015-09-23T11:39:56.443

@Archemar True, but future people coming here for help might have tricky file names, and the OP may do something similar in the future with tricky file names. Might as well learn the safe way now. – 8bittree – 2015-09-23T12:30:36.867

1

@kostja I'd use find, which has a -print0 option to use a null byte as the delimiter instead of a newline. sort can handle that with the -z flag. head, unfortunately, does not understand null byte delimiters, but this answer has a solution using tr to swap \n and \0 before and after head. tar has --null -T - to read null-delimited file names from stdin.

– 8bittree – 2015-09-23T13:08:08.033
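For anyone who wants to try the null-delimited route described in this comment, here is a minimal sketch. It assumes GNU find, sort, tr and tar; the scratch directory, the 30 made-up file names (one containing a space) and the archive name first10.tar.gz are purely illustrative. Because find prints a leading ./, the numeric key moves from 1.4,1.9 to 1.6,1.11.

```shell
# Scratch area with 30 made-up files; one name contains a space on purpose.
workdir=$(mktemp -d)
mkdir "$workdir/data"
i=0
while [ "$i" -lt 30 ]; do
  : > "$workdir/data/xy_${i}_0001.abc"
  i=$((i + 1))
done
mv "$workdir/data/xy_3_0001.abc" "$workdir/data/xy_3_00 01.abc"

(
  cd "$workdir/data" || exit 1
  # find emits NUL-terminated names, sort -z keeps them NUL-terminated.
  # tr swaps NUL and newline so that head can count "lines", then swaps back.
  find . -maxdepth 1 -type f -name 'xy_*' -print0 |
    sort -z -n -k1.6,1.11 |
    tr '\0\n' '\n\0' | head -n 10 | tr '\0\n' '\n\0' |
    tar --null -T - -czf ../first10.tar.gz
)

# Inspect the result: number of members, and whether the spaced name survived.
members=$(tar -tzf "$workdir/first10.tar.gz" | wc -l | tr -d ' ')
has_spaced=$(tar -tzf "$workdir/first10.tar.gz" | grep -c -F -x './xy_3_00 01.abc')
echo "$members members, spaced name found: $has_spaced"
rm -rf "$workdir"
```

Note that --null has to come before -T so that tar knows how to split the list it reads.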

@8bittree - cool, this works as well :) Probably I will ignore your (still valid and reasonable) advice for most of what I do on the command line because I mostly do simple oneliners not meant for sharing and the find/nullbyte solution has some added churn. Until I run into an error because of that and learn the hard way :) Thank you. – kostja – 2015-09-23T17:40:50.770

Answers

12

Have you hit the xargs limit?

xargs --show-limit

try :

  • create a dummy .tgz file tar czf xy_0_10000.tar.gz /hello/world
  • replace -czf by -Azf

When xargs hits its limit, it forks a new command, so the commands you ultimately ran were

  tar czf xy_0_10000.tar.gz file1 file2 .... file666
  tar czf xy_0_10000.tar.gz file667 file668 ... file1203
  tar czf xy_0_10000.tar.gz file1204 ... file2000

As each tar overwrites the previous archive, you should be getting only the files from the last tar c run.
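The splitting is easy to see without touching any archive. A minimal sketch, assuming 20 made-up names and using -n 5 as a stand-in for the size limit (xargs splits the same way once its command buffer is full); echo prints the tar command lines instead of running them, and out.tar.gz is just a placeholder name:

```shell
# Generate 20 hypothetical file names, one per line.
names=$(i=1; while [ "$i" -le 20 ]; do echo "file$i"; i=$((i + 1)); done)

# -n 5 caps each invocation at 5 names; echo shows what would be executed.
invocations=$(printf '%s\n' "$names" | xargs -n 5 echo tar -czf out.tar.gz)
printf '%s\n' "$invocations"

# Number of separate tar runs that xargs would have started.
runs=$(printf '%s\n' "$invocations" | wc -l | tr -d ' ')
echo "tar would run $runs times"
```

Every printed line is a complete tar -czf call on the same output file, so only the names on the last line would end up in the archive.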

Edit:

1) According to man tar on Ubuntu, -a and -r seem equivalent; appending is done by (either) -A, --catenate, --concatenate

2) zip (not gzip) can be used to add files; maybe a gzip option will do the trick. (Use | xargs zip -qr xy_0_0000.zip ; this will result in a zip file, not a .tar.gz, however.)

3) To use @rsanchez's solution
It is important to pass the option to tar in the proper way; try

ls | sort -n -k1.4,1.9 | head -n10000 | tar -czf xy_0_10000.tar.gz -T -

where -T - means: use option -T, with - (standard input) as the argument to -T. (You could also have generated a list of files in /tmp/foo.lst and then used -T /tmp/foo.lst.)

Archemar

Posted 2015-09-22T14:22:16.823

Reputation: 1 547

could a (=add) instead of c (=create/overwrite) work around that limitation? – Olivier Dulac – 2015-09-22T16:13:24.113

@OlivierDulac (Warning: This is a pure guess) It probably won't solve since tar can't create empty files. You may compress an empty folder first and use a (add) to add the files to the tar file. Then, you can open the tar and remove the folder (using 7zip or something) – Ismael Miguel – 2015-09-22T16:20:08.903

@ismaelmiguel: I'm pretty sure it will happily create the file. If not, just: touch xy_0_10000.tar.gz && { _the full command here_ ; } – Olivier Dulac – 2015-09-22T16:23:43.730

1

@OlivierDulac That will be an invalid .gz file. – Ismael Miguel – 2015-09-22T16:54:12.013

All the manpages I see from http://manpages.ubuntu.com/manpages/vivid/en/man1/tar.1.html (15.04) back to precise (12.04) have -r append but -a auto-compress which is not equivalent. And -rz doesn't work: zip can add to an existing archive because the directory is not compressed, but tar with compression compresses the metadata along with the data. You can tar -r piecewise into an uncompressed archive and then gzip the result. Or ...

– dave_thompson_085 – 2015-09-23T00:51:11.340

... You can increase the amount xargs will do in one "chunk" with -s num; here -s 200000 should be plenty. (There is some OS limit and I don't know what it is for Ubuntu, but I'd be surprised if 200k is too much.) Or just use tar -T as @rsanchez said. – dave_thompson_085 – 2015-09-23T00:58:12.900
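The piecewise tar -r route from the comment above could look roughly like this. It is only a sketch under a few assumptions: GNU tar (which creates the archive on the first -r if it does not exist yet), a scratch directory with 30 made-up files, a hypothetical archive name batch.tar, and -n 7 to force xargs into several runs.

```shell
# Scratch area with 30 made-up files following the question's naming pattern.
workdir=$(mktemp -d)
mkdir "$workdir/data"
i=0
while [ "$i" -lt 30 ]; do
  : > "$workdir/data/xy_${i}_0001.abc"
  i=$((i + 1))
done

# -n 7 makes xargs start three tar runs for 20 names; -r appends each batch
# to the same UNCOMPRESSED archive instead of overwriting it.
(
  cd "$workdir/data" || exit 1
  ls | sort -n -k1.4,1.9 | head -n 20 | xargs -n 7 tar -rf ../batch.tar
)

# Compress once, after all batches have been appended.
gzip "$workdir/batch.tar"

members=$(tar -tzf "$workdir/batch.tar.gz" | wc -l | tr -d ' ')
echo "$members members"
rm -rf "$workdir"
```

If a batch.tar from an earlier attempt is still lying around, it should be removed first, since -r would happily append to it.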

@dave_thompson_085 - do you happen to have a complete working example with xargs -s200000? because when I execute ls | sort -n -k1.4,1.9 | head -n10000 | xargs -s200000 tar -czf xy_0_10000.tar.gz, it seems to try to compress the contents of the files and fails with <line_of_content> Cannot stat: No such file or directory. Am I using it wrong? – kostja – 2015-09-23T08:39:07.180

@Archemar - I have tried your advice but seem to be doing it wrong. Mind taking a look at the edit? – kostja – 2015-09-23T08:45:34.637

@Archemar ah, I have been using it wrong :) Have tried it as you advised, only to encounter a different error - the file names seem not to reach tar for some reason. Please see the edit. – kostja – 2015-09-23T11:34:50.560

why --null ? you are not sending null separated string. – Archemar – 2015-09-23T12:01:44.327

@Archemar - not intended for real use, only to test whether the output of head reaches tar, the T option requires an argument error made me a bit paranoid :) still not sure how to get around it – kostja – 2015-09-23T12:48:43.447

@kostja For a test case I just created a directory and 10k files using your name pattern and did ls | xargs -s200000 tar -czf outfile.gz. The only thing I can think of is a race condition between the ls and creating the output file -- or replacing it if you do repeated attempts and a prior one is still there. Maybe try putting the output in another directory like ~/output or /tmp/output. (Although getting the -T- approach to work is probably more useful.) – dave_thompson_085 – 2015-09-24T14:49:33.540

12

There's no need for xargs. If you directly give tar the -T - option it will read the filenames from standard input.

For instance:

... | tar -T - -czf xy_0_10000.tar.gz
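A self-contained way to try this out, as a sketch: it builds 30 made-up files in a scratch directory, keeps the first 10 by the question's sort key, and writes a hypothetical first10.tar.gz outside the directory being listed so the archive cannot show up in the ls output.

```shell
# Scratch area with 30 made-up files following the question's naming pattern.
workdir=$(mktemp -d)
mkdir "$workdir/data"
i=0
while [ "$i" -lt 30 ]; do
  : > "$workdir/data/xy_${i}_0001.abc"
  i=$((i + 1))
done

# tar reads the member names from stdin (-T -), so a single tar process
# sees the whole list no matter how long it is.
(
  cd "$workdir/data" || exit 1
  ls | sort -n -k1.4,1.9 | head -n 10 | tar -T - -czf ../first10.tar.gz
)

members=$(tar -tzf "$workdir/first10.tar.gz" | wc -l | tr -d ' ')
echo "$members members"
rm -rf "$workdir"
```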

rsanchez

Posted 2015-09-22T14:22:16.823

Reputation: 221

I seem to be using the option incorrectly, cannot get it to work with the pipe. Have tried ...| tar Tczf xy_..., ...| tar Tcz -f xy_... ...| tar -czf xy_... -T and several other permutations, but am getting only tar: You must specify one of the '-Acdtrux', '--delete' or '--test-label' options, tar: -f: Cannot stat: No such file or directory if using -f separately from other options and tar: option requires an argument -- 'T'. Could you please add a usage example? – kostja – 2015-09-23T08:34:07.170

@kostja example added. – rsanchez – 2015-09-23T14:26:37.970

Many thanks, rsanchez. Not sure why the variant with -T - at the end of the tar option list did not work, but your example did. Unfortunately, my question actually had two parts - the source of the error and a possible improvement. While you aced the latter, Archemar excelled at the former and almost had the latter right. I am not sure which of your answers to accept since they both were obviously helpful. – kostja – 2015-09-23T17:34:34.467

1

I want to complement the two other answers with a zsh solution, which neither parses ls nor needs xargs. However, I am not sure right now whether it also suffers from the limit on command line length.

  1. Define a function which generates your desired sorting key by modifying $REPLY.

    sortkey() { REPLY=${REPLY[4,9]} }
    

    This is equivalent to your sort -n -k1.4,1.9

  2. Generate an array $files with the filenames sorted with the above function; the n qualifier makes the comparison numeric, like sort -n:

    files=(*(no+sortkey))
    

    This is equivalent to ls | sort -n -k1.4,1.9

  3. Return the first 10 000 files (zsh arrays are indexed from 1) with

    ${files[1,10000]}
    

    This is equivalent to ls | sort -n -k1.4,1.9 | head -n10000

So, all in all this should do the trick:

sortkey() { REPLY=${REPLY[4,9]} }
files=(*(no+sortkey))
tar -czf xy_0_10000.tar.gz ${files[1,10000]}
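On the open question of command line length: handing the names to tar as arguments still goes through a normal exec, so it is bounded by the system limit that xargs --show-limit reports in the question, though not by xargs' own smaller buffer. One way to sidestep the limit entirely, sketched here in plain sh rather than zsh, is to let the shell expand the glob and print the names with printf, which is a builtin in the common shells, then feed them to tar -T -. The scratch directory, the 30 made-up files and the name glob10.tar.gz are assumptions for the demo.

```shell
# Scratch area with 30 made-up files following the question's naming pattern.
workdir=$(mktemp -d)
mkdir "$workdir/data"
i=0
while [ "$i" -lt 30 ]; do
  : > "$workdir/data/xy_${i}_0001.abc"
  i=$((i + 1))
done

(
  cd "$workdir/data" || exit 1
  # The shell expands the glob itself; no ls output is parsed.
  set -- xy_*.abc
  # printf runs inside the shell, so the long list is never passed to exec;
  # tar receives the names on stdin instead of on its command line.
  printf '%s\n' "$@" | sort -n -k1.4,1.9 | head -n 10 |
    tar -czf ../glob10.tar.gz -T -
)

members=$(tar -tzf "$workdir/glob10.tar.gz" | wc -l | tr -d ' ')
echo "$members members"
rm -rf "$workdir"
```

In zsh the same idea should work with the sorted array from above, printing its elements one per line into tar -T - instead of expanding them on the tar command line.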

mpy

Posted 2015-09-22T14:22:16.823

Reputation: 20 866