
I have a large and growing set of text files, which are all quite small (less than 100 bytes). I want to diff each possible pair of files and note which are duplicates. I could write a Python script to do this, but I'm wondering if there's an existing Linux command-line tool (or perhaps a simple combination of tools) that would do this?

Update (in response to mfinni comment): The files are all in a single directory, so they all have different filenames. (But they all have a filename extension in common, making it easy to select them all with a wildcard.)

Daryl Spitzer

4 Answers


There's fdupes. But I usually use a combination of find, md5sum, sort, and uniq: `find . -type f -exec md5sum '{}' \; | sort | uniq -d -w 32`
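
Below is a whitespace-safe sketch of the same idea (not the exact command above; it assumes GNU find, xargs, and coreutils; `-w 32` compares only MD5's 32-character hex digest, and `--all-repeated=separate` prints every member of each duplicate group):

    # Hash every regular file, null-delimited so spaces in names are safe,
    # then keep only lines whose first 32 characters (the MD5 digest) repeat.
    find . -type f -print0 \
      | xargs -0 md5sum \
      | sort \
      | uniq -w 32 --all-repeated=separate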

Hubert Kario
  • This variation worked for me: `find . -name "*.csv" | xargs md5sum | sort | uniq -D -w 34` (I used uniq -D, and I prefer xargs to find -exec.) – Daryl Spitzer Oct 07 '10 at 19:12
  • +1 I was not aware of fdupes, that tool looks very useful. – Zoredache Oct 07 '10 at 20:44
  • @Daryl: Using `xargs` like this does not work for filenames containing blanks; using `-exec`, however, does. Adding `-type f` as an argument to `find` (it can be combined with `-name`) restricts the search to files. – fuenfundachtzig Jun 04 '12 at 16:17
  • +1 for fdupes, since it is fast for huge binary files, as well. – Bengt Nov 07 '12 at 22:17
  • On some rare occasions I have had xargs not working (crashing after a certain number of processed files), but never find -exec, which worked all the time. @fuenfundachtzig, one can use xargs -0 --delimiter="\n" to handle these kinds of files. – ychaouche Apr 14 '13 at 22:29

Well, there is FSlint, which I haven't used for this particular case, but it should be able to handle it: http://en.flossmanuals.net/FSlint/Introduction
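
For command-line use, FSlint also ships a findup script; on many distributions it is installed under /usr/share/fslint/fslint/ rather than on the PATH (the exact path may vary), so a sketch of an invocation looks like:

    # Report groups of duplicate files under the given directory.
    /usr/share/fslint/fslint/findup /path/to/dir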

faker

You almost certainly don't want to diff each pair of files. You would probably want to use something like md5sum to get checksums of all the files and pipe that into some other tool that only reports back duplicate checksums.
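
For the original setup (small files in one directory sharing an extension), a minimal sketch of that pipeline, assuming GNU coreutils and using `*.txt` as a stand-in for the common extension:

    # Hash the files, sort by digest, and keep only the lines whose
    # 32-character MD5 digest appears more than once.
    md5sum *.txt | sort | uniq -w 32 --all-repeated=separate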

Zoredache
  • You could reduce the number of md5sums calculated by computing them only for files whose size in bytes is shared by at least one other file; files with a unique size cannot be duplicates of anything, so they need no md5sum (see the sketch below). – tomsv Jun 07 '13 at 09:33
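
A sketch of that size-prefilter idea (it leans on GNU-specific options: find -printf, xargs -d/-r, and uniq --all-repeated; filenames may contain spaces but not newlines):

    # 1. Print "size<TAB>path" for every file.
    # 2. Keep only paths whose size occurs more than once.
    # 3. Hash just those candidates and report duplicate digests.
    find . -type f -printf '%s\t%p\n' \
      | awk -F'\t' '{ count[$1]++; paths[$1] = paths[$1] $2 "\n" }
          END { for (s in count) if (count[s] > 1) printf "%s", paths[s] }' \
      | xargs -d '\n' -r md5sum \
      | sort | uniq -w 32 --all-repeated=separate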

I see fdupes and fslint mentioned in other answers. jdupes is based on fdupes and is significantly faster than either; fdupes ought to be considered deprecated at this point.
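
A sketch of typical usage, assuming jdupes is installed (its basic flags mirror fdupes):

    # List groups of duplicate files in the current directory.
    jdupes .

    # Or recurse into subdirectories as well.
    jdupes -r .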

Mr. T