Compare list of files from a database using find to locate missing files

I have a list of 2000 files from a database that looks like this:

./aa/0f/unique-string/IMG_0987.JPG
./ab/cf/unique-string/IMG_0987.JPG

I want to compare that list to the actual directory contents in order to identify missing files.

The following command works individually, but not when I script it...

find . -false -samefile ./ab/cf/unique-string/IMG_0987.JPG

The closest I have come is the following:

#!/bin/bash
TEST=`cat ./list.lst`
find . -false -samefile "$TEST"

I am doing it wrong. What is the right way?

jakethedog

Posted 2015-06-26T07:39:53.963

Reputation: 740

Please take a look at my answer for a faster solution. – MariusMatutiae – 2015-06-26T09:15:48.567

Answers

Your find command is being handed the entire contents of list.lst at once, because you aren't feeding it one line at a time.

while IFS= read -r f; do    # -r keeps backslashes literal; empty IFS preserves surrounding whitespace
    find . -false -samefile "$f"
done < ./list.lst

This reads list.lst one line at a time and runs find once per entry.
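
If the goal is simply "does each listed path still exist?", a plain shell test per line avoids launching find 2000 times. A minimal sketch, assuming the paths in list.lst are relative to the current directory (the -e test and the "missing:" label are illustrative choices, not part of the original answer):

#!/bin/bash
# Print each path from list.lst that does not exist on disk.
while IFS= read -r f; do
    [ -e "$f" ] || printf 'missing: %s\n' "$f"
done < ./list.lst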

Jack

Posted 2015-06-26T07:39:53.963

Reputation: 589

This is correct, but much slower than it need be. Please see my answer. – MariusMatutiae – 2015-06-26T09:16:24.043

By following your strategy, you will be making about 2000 × 2000 = 4,000,000 comparisons. You can do better than this.

Suppose the list is in file_t1; we first generate a list of all files on the PC by means of

      find . -type f > file_t2

Then we sort both files (comm requires its inputs in the default lexicographic order, so plain sort, not sort -n):

      sort file_t1 > file1
      sort file_t2 > file2

Now we use comm to generate a list of differences:

      comm -X file1 file2

where:

      X = 12 -> lines that appear in both files
      X = 13 -> lines unique to file2
      X = 23 -> lines unique to file1

This could be done with a one-liner, at the expense of clarity.
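
For the case in the question (database entries with no matching file on disk, i.e. lines unique to file1), that one-liner might look like the following. A sketch assuming a bash shell, whose <( ) process substitution feeds comm without the temporary files:

      # lines unique to the sorted database list = files missing on disk
      comm -23 <(sort file_t1) <(find . -type f | sort)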

If you are interested: this is much faster because comm requires its inputs to be sorted, and comparing two sorted files of N lines each takes on the order of N steps. The sorting itself takes N log N operations, making it the most expensive part of this solution, whereas the approach you proposed requires N^2 operations, which is significantly more at your file sizes.
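
To put rough numbers on it: with N = 2000, sorting costs about N log2 N ≈ 2000 × 11 = 22,000 operations, against the N^2 = 4,000,000 comparisons of the pairwise approach, roughly a 180-fold difference.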

MariusMatutiae

Posted 2015-06-26T07:39:53.963

Reputation: 41 321