
I have a client with 150 Linux servers spread across various cloud services and physical data centres. Much of this infrastructure comes from acquired projects/teams and pre-existing servers/installs.

The client's business is largely image processing, and many of the servers have large SAN or local disk arrays holding millions of JPEG/PNG files.

There is a configuration management agent on each box, so I can see that many disks are 100% full, some are pretty empty, and there is a lot of duplicated data.

The client now has access to a CDN, but at the moment just enumerating what is possible is a daunting task.

Are there any tools to create useful indexes of all this data?

I see tools like GlusterFS and Hadoop HDFS for managing distributed filesystems.

I am wondering whether I can use the indexing tools of these systems without actually implementing the underlying volume management tools.

What should the starting point be for generating an index of potential de-duplication candidates?

Tom

1 Answer


The easiest way I've found to find duplicate files across a bunch of systems is to create a list of files with their MD5 sums on each system, combine those lists into one file, and then use sort plus an AWK script to find the duplicates, as follows:

First, run this on each of the systems, replacing the path as appropriate:

#!/bin/sh
# Hash every file and record host, checksum, and path as tab-separated fields.
HOSTNAME=${HOSTNAME:-$(hostname)}
find /path/to/files -type f -exec md5sum {} \; |
while read -r md5 filename
do
    printf '%s\t%s\t%s\n' "${HOSTNAME}" "${md5}" "${filename}"
done > "/var/tmp/${HOSTNAME}.filelist"

This will produce a file /var/tmp/HOSTNAME.filelist on each host, which you will have to copy to a central location.
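How you collect those lists depends on your environment; purely as a sketch (assuming SSH access from the central box and a hypothetical hosts.txt inventory with one hostname per line, matching each host's hostname output), something like this would pull them in:

#!/bin/sh
# Pull each host's filelist into the current directory over SSH.
# hosts.txt is a hypothetical inventory file, one hostname per line.
while read -r host
do
    scp "${host}:/var/tmp/${host}.filelist" . || echo "fetch from ${host} failed" >&2
done < hosts.txt

Once you have gathered up all these filelists, you can then run the following: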

#!/bin/sh
export LANG=C
TAB=$(printf '\t')
# Sort every filelist by checksum (field 2), then print groups of lines
# that share a checksum, separated by blank lines.
sort -t "$TAB" -k2,2 *.filelist |\
awk '
BEGIN {
    FS = "\t"
    dup_count = 0
    old_md5 = ""
}

{
    if ($2 == old_md5) {
        # First duplicate of this checksum: also print the line seen just before it.
        if (dup_count == 0) {
            printf("\n%s\n", old_inline)
        }
        printf("%s\n", $0)
        dup_count++
    }
    else {
        dup_count = 0
    }
    old_md5 = $2
    old_inline = $0
}'

This should produce output that groups, in blocks, the files whose contents are duplicated, either within the same host or across hosts.
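If the grouped listing is too big to scan by hand, one quick way to rank de-duplication candidates (a sketch built on the same filelists, not part of the scripts above) is to count how many copies of each checksum exist and look at the worst offenders first:

#!/bin/sh
export LANG=C
# Count occurrences of each md5 (field 2 of the tab-separated filelists),
# keep only checksums seen more than once, and show the top 50.
cut -f2 *.filelist | sort | uniq -c | sort -rn | awk '$1 > 1' | head -50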

Oh, and as an alternative to the first script (which gets run on every host), check with the backup system in use to see if you can get something similar from the backup report (something that includes md5 and filename, at least).

Derek Pressnall
  • I've put some thought into this idea, but some initial investigations suggest a converged dataset of at least 500 million rows. At 60 bytes a record that's 30GB of data just to list the files.... ;-) (I think, need a byte calculator...) – Tom May 20 '12 at 12:59
  • Well, the good news is that generating the data sets will be distributed, so all you need to do is beef up wherever you will be doing the processing. You could load this into a distributed Oracle RAC cluster, but I've done this style of sort/compare before on fairly large data sets and it has actually gone faster than I would have thought. The main thing is to watch your temp space -- you may have to pass the -T flag to sort if your default /tmp isn't big enough. Oh, and try it on a small data set first, just to make sure there are no typos in the code. – Derek Pressnall May 20 '12 at 13:22