How to replace all duplicate files with hard links?

I know of four command-line solutions for Linux. My preferred one is the last one listed here, rdfind, because of all the options it offers.

fdupes

This appears to be the most recommended/most well-known one.
It's the simplest to use, but its only action is to delete duplicates.
To ensure duplicates are actually duplicates (while not taking forever to run), files are compared first by size, then by md5 hash, then byte-by-byte.

Sample output (with options "show size", "recursive"):

$ fdupes -Sr .
17 bytes each:
./Dir1/Some File
./Dir2/SomeFile
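
The size, then hash, then byte-by-byte ordering described above can be sketched as a small shell function. This is only an illustration of the idea, not fdupes' actual code: the function name same_content is made up for this example, and it assumes GNU coreutils (stat -c, md5sum) plus cmp.

```shell
# same_content A B: succeed only if A and B hold identical bytes.
# (Hypothetical helper; cheap checks run first so most non-duplicates
# are rejected before any expensive work is done.)
same_content() {
    local a=$1 b=$2
    # Stage 1: sizes must match (GNU stat; BSD/macOS would need `stat -f %z`).
    [ "$(stat -c %s -- "$a")" = "$(stat -c %s -- "$b")" ] || return 1
    # Stage 2: MD5 digests must match.
    [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] || return 1
    # Stage 3: a byte-by-byte comparison rules out a hash collision.
    cmp -s -- "$a" "$b"
}
```

For a single pair the hash stage adds little, since cmp has the final word anyway; it pays off when many same-sized files are grouped by digest before any pairwise comparison is attempted.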

hardlink

Designed, as the name indicates, to replace duplicate files with hard links.
Has a --dry-run option.
Does not indicate how contents are compared, but unlike the other tools listed here, it does take into account file mode, owner, and modified time.

Sample output (note how my two files have slightly different modified times, so in the second run I tell it to ignore that):

$ stat Dir*/* | grep Modify
Modify: 2015-09-06 23:51:38.784637949 -0500
Modify: 2015-09-06 23:51:47.488638188 -0500

$ hardlink --dry-run -v .
Mode:     dry-run
Files:    5
Linked:   0 files
Compared: 0 files
Saved:    0 bytes
Duration: 0.00 seconds

$ hardlink --dry-run -v -t .
[DryRun] Linking ./Dir2/SomeFile to ./Dir1/Some File (-17 bytes)
Mode:     dry-run
Files:    5
Linked:   1 files
Compared: 1 files
Saved:    17 bytes
Duration: 0.00 seconds
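
As a rough illustration of what a "Linking ... to ..." line stands for once dry-run is dropped, here is a minimal sketch of replacing one duplicate with a hard link to the original. It is not hardlink's implementation: the function name link_duplicate and the temporary-file suffix are made up, and unlike hardlink it ignores mode, owner, and modified time and only checks contents with cmp. Both paths must be on the same filesystem for ln to succeed.

```shell
# link_duplicate ORIGINAL DUPLICATE (hypothetical helper)
# Replace DUPLICATE with a hard link to ORIGINAL if their contents match.
link_duplicate() {
    local original=$1 duplicate=$2
    # Made-up temporary name, placed next to the duplicate.
    local tmp="$duplicate.hardlink-tmp.$$"
    # Already the same inode: nothing to do.
    [ "$original" -ef "$duplicate" ] && return 0
    # Refuse to link files whose contents differ.
    cmp -s -- "$original" "$duplicate" || return 1
    # Create the link under a temporary name, then rename it over the
    # duplicate, so the duplicate's path never disappears in between.
    ln -- "$original" "$tmp" || return 1
    mv -f -- "$tmp" "$duplicate" || { rm -f -- "$tmp"; return 1; }
}
```

Afterwards, stat -c '%i %h' on both paths should show the same inode number and a link count of at least 2.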

duff

Made to find files that the user then acts upon; has no actions available.
Comparisons are done by file size, then sha1 hash.
- Hash can be changed to sha256, sha384, or sha512.
- Hash can be disabled to do a byte-by-byte comparison.

Sample output (with option "recursive"):

$ duff -r .
2 files in cluster 1 (17 bytes, digest 34e744e5268c613316756c679143890df3675cbb)
./Dir2/SomeFile
./Dir1/Some File
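
To make the idea of a "cluster" concrete, here is a minimal sketch that groups files by SHA-1 digest and prints each group with more than one member. It is not duff's code and skips duff's size pre-filter; the function name sha1_clusters is made up, and it assumes GNU sha1sum, find, sort, and awk.

```shell
# sha1_clusters DIR (hypothetical helper)
# Print every group of files under DIR that share a SHA-1 digest.
# File names containing newlines or backslashes are escaped by sha1sum
# and are not handled by this sketch.
sha1_clusters() {
    find "$1" -type f -exec sha1sum {} + |
    LC_ALL=C sort |
    awk '
        {
            digest = $1
            name = substr($0, 43)   # 40 hex digits + 2 separator characters
            if (digest == prev) {
                # Second member of a group: print the header and first member.
                if (!shown) { print "digest " digest; print prev_name; shown = 1 }
                print name
            } else {
                shown = 0
            }
            prev = digest
            prev_name = name
        }
    '
}
```

Running sha1_clusters . in a tree like the one above would be expected to list the two 17-byte files under a single digest line; a byte-by-byte check with cmp can then confirm each group before acting on it.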

rdfind

Options have an unusual syntax (meant to mimic find?).
Several options for actions to take on duplicate files (delete, make symlinks, make hardlinks).
Has a dry-run mode.
Comparisons are done by file size, then first-bytes, then last-bytes, then either md5 (default) or sha1.
Ranking of files found makes it predictable which file is considered the original.

Sample output:

$ rdfind -dryrun true -makehardlinks true .
(DRYRUN MODE) Now scanning ".", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Now removing files with zero size from list...removed 0 files
(DRYRUN MODE) Total size is 13341 bytes or 13 kib
(DRYRUN MODE) Now sorting on size:removed 3 files due to unique sizes from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on md5 checksum:removed 0 files from list.2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 17 b can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now making hard links.
hardlink ./Dir1/Some File to ./Dir2/SomeFile
Making 1 links.

$ cat results.txt 
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 1 1 17 2055 24916405 1 ./Dir2/SomeFile
DUPTYPE_WITHIN_SAME_TREE -1 1 17 2055 24916406 1 ./Dir1/Some File
# end of file
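
The results file from a dry run can be post-processed before anything is changed. The sketch below lists the paths rdfind would treat as duplicates, that is, every entry not marked DUPTYPE_FIRST_OCCURRENCE. It assumes the column layout shown in the sample above (seven fields, then the name, which may contain spaces); the function name list_rdfind_duplicates is made up for this example.

```shell
# list_rdfind_duplicates FILE (hypothetical helper)
# Print the path of every entry in an rdfind results file that is not
# the first occurrence within its group of duplicates.
list_rdfind_duplicates() {
    awk '
        /^#/ || NF < 8 { next }                     # comments, short lines
        $1 == "DUPTYPE_FIRST_OCCURRENCE" { next }   # the kept "original"
        {
            # The name is everything after the 7th field; strip the
            # first seven fields one at a time so spaces in it survive.
            name = $0
            for (i = 1; i <= 7; i++) sub(/^[^ ]+ +/, "", name)
            print name
        }
    ' "$1"
}
```

With the results.txt shown above, list_rdfind_duplicates results.txt prints ./Dir1/Some File, the file that would be replaced by a link.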

Izkata

Posted 2015-05-04T20:13:07.847

Reputation: 342

"then either md5 (default) or sha1." That doesn't mean the files are identical. Since computing a hash requires the program to read the entire file anyway, it should just compare the entire files byte-for-byte. Saves CPU time, too.

– endolith – 2016-01-21T22:05:12.293

@endolith That's why you always start with dry-run, to see what would happen... – Izkata – 2016-01-22T15:02:36.807

But the point of the software is to identify duplicate files for you. If you have to manually double-check that the files are actually duplicates, then it's no good. – endolith – 2016-01-22T15:09:01.803

@endolith You should do that anyway, with all of these

– Izkata – 2016-01-22T15:16:52.393

Yeah I use AllDup in Windows and it graphically shows all the matches, their properties, and lets you look through for potential issues before deleting/hardlinking. I can't find anything like it for Linux. – endolith – 2016-01-22T15:27:31.997

If you have n files with identical size, first-bytes, and end-bytes, but they're all otherwise different, determining that by direct comparison requires n! pair comparisons. Hashing them all then comparing hashes is likely to be much faster, especially for large files and/or large numbers of files. Any that pass that filter can go on to do direct comparisons to verify. (Or just use a better hash to start.) – Alan De Smet – 2017-03-08T21:52:32.660

If you trust all of the people who might write the files involved (easiest if it's just you) and none of those people has reason to intentionally collect files with identical hashes, MD5 or SHA1 are probably more than good enough to identify identical files. The risk of accidental collision is vanishingly small. Not useful for random home directories, but fine for your own specialized collections of, say, audio, video, photos, particle accelerator models, or radiative transfer models. – Alan De Smet – 2017-03-08T21:58:11.237

It is certainly better to search using hashes instead of byte-by-byte comparison, as @AlanDeSmet wrote. It is even better when you can save the hashes in a hash table and then only re-verify files modified after they were inserted into that table. Is there a way to generate such a hash table from the CLI? Something like "hash", "filename", "mtime". – Andrés Morales – 2018-05-04T17:46:34.650

@AndrésMorales As stated in the answer, that's how these work already. Only for ones that have the same hash do some of them double-check with a byte-by-byte comparison. – Izkata – 2018-05-04T19:30:34.450

@Izkata, I understand. But when a command is invoked many times (at different moments, of course), is the hash table preserved, or is it recreated? If the table is preserved, it could be automated (in a cron task or after system startup). – Andrés Morales – 2018-05-04T19:38:39.473

Do you know why running rdfind -deleteduplicates true -makehardlinks true <folders> several times keeps creating links? Why does it not create all the links in the first pass? – Luis A. Florit – 2019-07-17T16:32:46.817

Please provide OS and filesystem. – Steven – 2015-05-04T20:24:02.977

Well, I use ext4 on Ubuntu 15.04, but if someone provides an answer for another OS, I am sure it can be helpful for someone reading this question. – qdii – 2015-05-04T20:34:51.923

Here is a duplicate question on Unix.SE.

– Alexey – 2018-05-04T08:46:26.023
