
I have a large and growing set of text files, which are all quite small (less than 100 bytes). I want to diff each possible pair of files and note which are duplicates. I could write a Python script to do this, but I'm wondering if there's an existing Linux command-line tool (or perhaps a simple combination of tools) that would do this?

Update (in response to mfinni comment): The files are all in a single directory, so they all have different filenames. (But they all have a filename extension in common, making it easy to select them all with a wildcard.)

Daryl Spitzer

4 Answers


There's fdupes. But I usually use a combination of find, md5sum, sort, and uniq: `find . -type f -exec md5sum '{}' \; | sort | uniq -d -w 32`
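
Below is a whitespace-safe sketch of the same idea (not the exact command above; it assumes GNU find, xargs, and coreutils; `-w 32` compares only MD5's 32-character hex digest, and `--all-repeated=separate` prints every member of each duplicate group):

    # Hash every regular file, null-delimited so spaces in names are safe,
    # then keep only lines whose first 32 characters (the MD5 digest) repeat.
    find . -type f -print0 \
      | xargs -0 md5sum \
      | sort \
      | uniq -w 32 --all-repeated=separate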

Hubert Kario
  • This variation worked for me: `find . -name "*.csv" | xargs md5sum | sort | uniq -D -w 34` (I used uniq -D, and I prefer xargs to find -exec.) – Daryl Spitzer Oct 07 '10 at 19:12
  • +1 I was not aware of fdupes, that tool looks very useful. – Zoredache Oct 07 '10 at 20:44
  • @Daryl: Using `xargs` like this does not work for filenames containing blanks; using `-exec`, however, does. Adding `-type f` as an argument to `find` (it can be combined with `-name`) restricts the search to files. – fuenfundachtzig Jun 04 '12 at 16:17
  • +1 for fdupes, since it is fast for huge binary files, as well. – Bengt Nov 07 '12 at 22:17
  • On some rare occasions I have had xargs not working (crashing after a certain number of processed files), but never find -exec, which worked all the time. @fuenfundachtzig, one can use xargs -0 --delimiter="\n" to handle these kinds of files. – ychaouche Apr 14 '13 at 22:29

Well, there is FSlint, which I haven't used for this particular case, but it should be able to handle it: http://en.flossmanuals.net/FSlint/Introduction
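
For command-line use, FSlint also ships a findup script; on many distributions it is installed under /usr/share/fslint/fslint/ rather than on the PATH (the exact path may vary), so a sketch of an invocation looks like:

    # Report groups of duplicate files under the given directory.
    /usr/share/fslint/fslint/findup /path/to/dir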

faker

You almost certainly don't want to diff each pair of files. You would probably want to use something like md5sum to get checksums of all the files and pipe that into some other tool that only reports back duplicate checksums.
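
For the original setup (small files in one directory sharing an extension), a minimal sketch of that pipeline, assuming GNU coreutils and using `*.txt` as a stand-in for the common extension:

    # Hash the files, sort by digest, and keep only the lines whose
    # 32-character MD5 digest appears more than once.
    md5sum *.txt | sort | uniq -w 32 --all-repeated=separate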

Zoredache
  • You could reduce the number of md5sums calculated by computing them only for files whose size in bytes is shared by at least one other file; files with a unique size cannot be duplicates of anything, so they need no md5sum (see the sketch below). – tomsv Jun 07 '13 at 09:33
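
A sketch of that size-prefilter idea (it leans on GNU-specific options: find -printf, xargs -d/-r, and uniq --all-repeated; filenames may contain spaces but not newlines):

    # 1. Print "size<TAB>path" for every file.
    # 2. Keep only paths whose size occurs more than once.
    # 3. Hash just those candidates and report duplicate digests.
    find . -type f -printf '%s\t%p\n' \
      | awk -F'\t' '{ count[$1]++; paths[$1] = paths[$1] $2 "\n" }
          END { for (s in count) if (count[s] > 1) printf "%s", paths[s] }' \
      | xargs -d '\n' -r md5sum \
      | sort | uniq -w 32 --all-repeated=separate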

I see fdupes and fslint mentioned in other answers. jdupes is based on fdupes and is significantly faster than either; fdupes ought to be considered deprecated at this point.
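
A sketch of typical usage, assuming jdupes is installed (its basic flags mirror fdupes):

    # List groups of duplicate files in the current directory.
    jdupes .

    # Or recurse into subdirectories as well.
    jdupes -r .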

Mr. T