bash find list of archived files with wildcard in while loop

2

I have a few thousand gzipped text files in different subdirectories and used a subset of these files as input for a project a few years ago. Back then I kept an unzipped copy of the files I actually used in one directory, but I have since deleted the copies and kept only a list of their names in that folder.

This was my initial idea. LIST is the list of files, and PARENTDIR is the top-level directory under which all files reside in various subdirectories. The idea was to find each archive, in whatever subdirectory it is, and gunzip it to NEWDIR.

#!/usr/bin/env bash    

LIST="listfile.txt"
PARENTDIR="/home/user/old/project"
NEWDIR="/home/user/old/project/2016"

while read line;
do
    ARCHIVE="$(find $PARENTDIR -name "$line*")"
    gunzip --stdout $ARCHIVE >$NEWDIR/$line
done <$LIST

I don't seem to get the find command right. It works when I type the values in directly, but not with the variables, even when I call it on the command line without the command substitution. My combination of quotes and wildcards is probably not quite correct, but I can't get it right. Changing the variable expansion doesn't help either, and I guess I'm stuck...

Carambakaracho

Posted 2016-06-29T07:51:50.110

Reputation: 43

Add echo "$ARCHIVE" to see what happened... or put set -x before and set +x after the part to debug. If there are spaces you want to quote the variable, as in "$variable"... What happens if find finds more than one archive matching the key? Better find... -exec gzip {} \; – Hastur – 2016-06-29T08:29:00.443

Thank you for the suggestion, echo "$ARCHIVE" outputs an empty line, echo $line shows what I expect. I included set + and set - but nothing seemed to happen either - but I'm not exactly familiar with set – Carambakaracho – 2016-06-29T08:46:45.583

@Hastur, you're right, my tests show that a few thousand files have duplicates, but not all of them. In principle, the first text file could be overwritten by the second. Would find... -exec gzip {} \; just decompress the archive twice as well? – Carambakaracho – 2016-06-29T08:54:25.870

Answers

1

I thought of using the -exec option for find, but this would not work directly: -exec does not run the command through a shell, so the redirection operator used with the gunzip command would not apply to each file. One solution would be to perform the operation in two steps:

1. Copy the archives into $NEWDIR:

    # IFS= and -r keep leading whitespace and backslashes in each name intact
    while IFS= read -r line
    do
        find "$PARENTDIR" -name "$line*" -exec cp -v {} "$NEWDIR" \;
    done < "$LIST"

This should work with POSIX-compatible versions of find – not only GNU find.

Avoid over-writing of similarly named files

If you have duplicate filenames, they’ll be over-written in $NEWDIR. If you want to avoid over-writing the files, you’d have to recreate the directory tree inside $NEWDIR. This can be done using the install command from GNU coreutils: with the -D option it creates all missing leading directories of the target path, similar to mkdir -p.

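    # Search from the source tree so the relative paths can be reused under $NEWDIR.
    # After cd, a relative $LIST is resolved against $PARENTDIR.
    cd "$PARENTDIR"
    while IFS= read -r line
    do
        # {} embedded in a longer argument is expanded by GNU find.
        # If $NEWDIR lies inside $PARENTDIR, exclude it so copies are not found again.
        find . -name "$line*" -exec install -D {} "$NEWDIR"/{} \;
    done < "$LIST"
    cd -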

2. Decompress the copied files:

    find "$NEWDIR" -type f -name '*.gz' -exec gunzip {} \;
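
The redirection can also be kept inside -exec by handing each match to a small sh -c script, which avoids the intermediate copy. The following is only a minimal sketch, assuming the archives end in .gz and that over-writing duplicates is acceptable. The temporary directories and the sample file name are made up for the demonstration and stand in for the real $PARENTDIR, $NEWDIR and $LIST.

```shell
#!/usr/bin/env bash
# Demo setup (hypothetical paths): one archive in a subdirectory and a list file.
PARENTDIR="$(mktemp -d)"
NEWDIR="$(mktemp -d)"
LIST="$PARENTDIR/listfile.txt"
mkdir -p "$PARENTDIR/sub/dir"
printf 'hello\n' > "$PARENTDIR/sub/dir/sample.txt"
gzip "$PARENTDIR/sub/dir/sample.txt"      # leaves sample.txt.gz behind
printf 'sample.txt\n' > "$LIST"

while IFS= read -r line
do
    # sh -c runs once per match, so the redirection is applied per file.
    # $1 is the archive found by find, $2 is the target directory.
    find "$PARENTDIR" -type f -name "$line*.gz" -exec sh -c '
        gunzip -c "$1" > "$2/$(basename "$1" .gz)"
    ' sh {} "$NEWDIR" \;
done < "$LIST"

cat "$NEWDIR/sample.txt"
```

Passing the file name as a positional parameter, rather than embedding {} in the script text, keeps names with spaces or quotes from being interpreted by the inner shell. The original archives are left untouched because gunzip -c only writes to standard output.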

Anthony Geoghegan

Posted 2016-06-29T07:51:50.110

Reputation: 3 095

Thanks, the actual solution to the initial problem was to change the directory, but I wouldn't have tried it without your suggestion. I executed the script from within $NEWDIR and thought this should do given the absolute paths. When moving it to $PARENTDIR find works, which I don't really understand. I used the -exec cp statement because of course I created duplicates back then – Carambakaracho – 2016-06-29T09:20:10.550

Thanks for the install command, I didn't know this one! In this case I can accept overwriting, it's not critical but I'll keep the install command in mind for future use. – Carambakaracho – 2016-06-29T09:25:42.433

@Carambakaracho I tried to recreate your setup on my system and I didn't need to change directory to $PARENTDIR. Without debugging information, it's hard to say why you had to. BTW, I've edited my answer to include a way to avoid over-writing similarly named files. – Anthony Geoghegan – 2016-06-29T09:26:29.970
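
Since the comments raise both empty results and duplicate archives, it may help to count the matches for every list entry before decompressing anything. This is only a rough diagnostic sketch, assuming the archives end in .gz. The directory layout, the file names and the report.txt file are hypothetical and exist only for the demonstration.

```shell
#!/usr/bin/env bash
# Demo setup (hypothetical names): a.txt is archived twice, b.txt not at all.
PARENTDIR="$(mktemp -d)"
LIST="$PARENTDIR/listfile.txt"
mkdir -p "$PARENTDIR/one" "$PARENTDIR/two"
printf 'x\n' | gzip > "$PARENTDIR/one/a.txt.gz"
printf 'x\n' | gzip > "$PARENTDIR/two/a.txt.gz"
printf 'a.txt\nb.txt\n' > "$LIST"

while IFS= read -r line
do
    # Count the archives that match this list entry.
    count=$(( $(find "$PARENTDIR" -type f -name "$line*.gz" | wc -l) ))
    # Brackets make leading or trailing whitespace in the entry visible.
    printf '[%s] matches: %d\n' "$line" "$count"
done < "$LIST" > "$PARENTDIR/report.txt"

cat "$PARENTDIR/report.txt"
```

Entries with a count of 0 point to a problem with the pattern or the list, and entries with a count above 1 are the ones that would be over-written when everything is decompressed into a single directory.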