Find all duplicate files by MD5 hash

7

3

I'm trying to find all duplicate files (based on MD5 hash), ordered by file size. So far I have this:

find . -type f -print0 | xargs -0 -I "{}" sh -c 'md5sum "{}" |  cut -f1 -d " " | tr "\n" " "; du -h "{}"' | sort -h -k2 -r | uniq -w32 --all-repeated=separate

The output of this is:

1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture.sh
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture2.sh
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture2.s

d41d8cd98f00b204e9800998ecf8427e 0      ./test(1).log

Is this the most efficient way?

Jamie Curran

Posted 2012-10-14T21:31:33.160

Reputation: 73

See also http://unix.stackexchange.com/a/71178/23542

– artfulrobot – 2015-05-20T14:39:28.727

Ok, that's a fair point. But looking at this as a learning exercise for the Linux command line, can this be improved? For instance, I originally started off with -exec 'md5sum.....' but some research (via Google) suggested xargs was more efficient. – Jamie Curran – 2012-10-14T22:00:35.560
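
For reference, find's -exec ... + form batches arguments much like xargs does, so -exec is only slow when terminated with \; (one process per file). A minimal sketch, assuming GNU findutils and coreutils:

# Batch as many paths per md5sum invocation as the command line allows,
# then group identical hashes as in the pipeline above.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate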

If you want to learn new techniques, I suggest looking at how these tools solve the problem; you will pick up a lot of clever ideas (the source, Luke, use the source). – Paulo Scardine – 2012-10-14T22:06:42.090

Answers

7

From "man xargs": -I implies -L 1 So this is not most efficient. It would be more efficient, if you just give as many filenames to md5sum as possible, which would be:

find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Then you won't have the file size, of course. If you really need the file size, create a shell script that runs md5sum and du -h and merges the lines with join.
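
A minimal sketch of that idea, assuming GNU coreutils and file names without whitespace (the temp file paths are just examples):

# 1. "<path> <md5>" pairs, sorted by path
find . -type f -exec md5sum {} + | awk '{print $2, $1}' | sort -k1,1 > /tmp/hashes
# 2. "<path> <size>" pairs, sorted by path
find . -type f -exec du -h {} + | awk '{print $2, $1}' | sort -k1,1 > /tmp/sizes
# 3. Merge on the path, reorder to "<md5> <size> <path>", then group duplicates
join /tmp/hashes /tmp/sizes | awk '{print $2, $3, $1}' | sort | uniq -w32 --all-repeated=separate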

Olaf Dietsche

Posted 2012-10-14T21:31:33.160

Reputation: 421

0

Sometimes we are working with a reduced set of Linux commands, such as BusyBox or the other stripped-down environments that come with NAS boxes and other embedded Linux hardware (IoT devices). In these cases we can't use options like -print0, and file names with special characters become troublesome. So we may prefer instead:

find . -type f | while IFS= read -r file; do md5sum "$file"; done > /destination/file

Then our /destination/file is ready for any further processing, such as sort and uniq, as usual.
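
For example, the grouping step from the other answer can then be run on that file (assuming your sort and uniq support these options, which some BusyBox builds may not):

# Group duplicate hashes from the list collected above
sort /destination/file | uniq -w32 --all-repeated=separate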

robo

Posted 2012-10-14T21:31:33.160

Reputation: 1

0

Use either btrfs + duperemove or ZFS with online dedup. These work at the file-system level and will match even equal parts of files, then use the file system's copy-on-write to keep only one copy of each while leaving the files in place. When you modify one of the shared parts in one of the files, the change is written separately. That way you can have things like /media and /backup/media-2017-01-01 consume only the size of each unique piece of information in both trees.
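
For example, a duperemove run on a btrfs mount might look like this; treat it as a sketch and check duperemove(8) on your system for the exact flags:

# Recursively (-r) hash both trees, print sizes human-readable (-h), and
# submit matching extents to the kernel for deduplication (-d)
duperemove -dhr /media /backup/media-2017-01-01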

orange_juice6000

Posted 2012-10-14T21:31:33.160

Reputation: 115