Duplicate File Scanner

I have a 15 TB storage network, and I am now down to about 2.5 TB of free space (due to a large number of duplicates). I have tried many scanners, but with little success; eventually they all crash under the massive amount of data. Is there any program you know of that can handle loads this large? I don't care what platform it runs on.

Thank you.

Reid

Posted 2012-05-11T22:47:02.140

Reputation: 171

It depends. For example if you have a copy of Windows Server 2008 R2 around (I forget whether you need a specific SKU, sorry!) then it has some file management stuff that can generate exactly these kinds of reports. If I had to kludge one together myself I'd probably do something terrible with Perl and hashes, serializing the hashes to files based on oh I don't know letters of the alphabet or something. It would be fun. – Mark Allen – 2012-05-11T22:50:54.737
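
For reference, here is a rough Python sketch of the bucketing idea Mark describes (Python rather than Perl; the SHA-256 checksum, the buckets/ output directory, and the split by first hex digit of the hash are arbitrary choices, not anything he specified). Streaming each file through the hash keeps memory use flat, and loading one bucket at a time keeps any single in-memory table small:

    # dupbuckets.py -- rough sketch: hash every file, spill "hash<TAB>path"
    # records into small per-prefix bucket files, then scan one bucket at a time.
    import hashlib
    import os
    import sys
    from collections import defaultdict

    def sha256_of(path, chunk=1 << 20):
        """Stream the file through SHA-256 so huge files never sit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()

    def build_buckets(root, bucket_dir):
        """Append 'hash<TAB>path' lines to 16 bucket files keyed by hash prefix."""
        os.makedirs(bucket_dir, exist_ok=True)
        handles = {c: open(os.path.join(bucket_dir, c + ".txt"), "a")
                   for c in "0123456789abcdef"}
        try:
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    try:
                        digest = sha256_of(path)
                    except OSError:
                        continue  # unreadable file; skip it
                    handles[digest[0]].write(f"{digest}\t{path}\n")
        finally:
            for f in handles.values():
                f.close()

    def report_duplicates(bucket_dir):
        """Load one small bucket at a time and print any hash seen more than once."""
        for name in sorted(os.listdir(bucket_dir)):
            groups = defaultdict(list)
            with open(os.path.join(bucket_dir, name)) as f:
                for line in f:
                    digest, path = line.rstrip("\n").split("\t", 1)
                    groups[digest].append(path)
            for digest, paths in groups.items():
                if len(paths) > 1:
                    print(digest, *paths, sep="\n  ")

    if __name__ == "__main__":
        build_buckets(sys.argv[1], "buckets")
        report_duplicates("buckets")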

I do have Windows Server 2008 R2; however, I have not used it in a while, after we switched to Linux servers. Do you have a walkthrough for doing this? – Reid – 2012-05-11T22:53:24.813

What is the program supposed to do with duplicates? – Der Hochstapler – 2012-05-11T22:55:28.717

I would say, let's just start with finding them. After that I will need to figure out some way (a self-written script) to compare all of the metadata, then back up the files onto some backup HDs and delete them off the servers. – Reid – 2012-05-11T23:16:08.750

What type of data and/or metadata are you looking at? Is it just the filenames, or do you also want to check the contents of the files? Do you know any scripting languages? – Emil Vikström – 2012-05-11T23:44:38.687

What programs have you tried without success? – Scott McClenning – 2012-05-12T00:34:09.450

Did you already max out your RAM in the machine running the scanner? – rob – 2012-05-12T03:43:36.407

64GB DDR3. I would put the program at fault more than I would the PC. – Reid – 2012-05-13T03:33:15.747

What programs have you tried? Were they compiled for 64-bit architecture? Do you have a rough idea of the number of files you're scanning? (on the order of millions? hundreds of millions? billions?) – rob – 2012-05-13T05:27:25.723

@Reid: http://technet.microsoft.com/en-us/library/cc771212.aspx talks about running a report to identify duplicate files. – Mark Allen – 2012-05-14T19:54:44.703

Answers

If you haven't done so already, you may be able to work around the problem by cramming more RAM into the machine that's running the duplicate detector (assuming it isn't already maxed out). Another workaround is to split the remaining files into subsets and scan pairs of those subsets until you've tried every combination. In the long run, though, this may not be a problem best tackled with a duplicate detector program that you have to run periodically.
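
If you go the subset route, remember that each subset also has to be scanned against itself, or duplicates that happen to land in the same subset will slip through. A rough Python illustration of the pairing logic (the four-way split and the /mnt/storage path are placeholders, and scan_pair stands in for whatever duplicate finder you actually run):

    # Sketch of the "scan pairs of subsets" workaround (illustrative only).
    import os
    from itertools import combinations_with_replacement

    def list_files(root):
        """Enumerate every file path under root."""
        return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]

    def split_into_subsets(paths, n):
        """Deal the file list into n roughly equal groups."""
        return [paths[i::n] for i in range(n)]

    def scan_pair(subset_a, subset_b):
        # Placeholder: feed subset_a + subset_b to your duplicate scanner here.
        print(f"would scan {len(subset_a) + len(subset_b)} files together")

    subsets = split_into_subsets(list_files("/mnt/storage"), 4)  # 4 is arbitrary

    # combinations_with_replacement includes the (s, s) pairs, so duplicates
    # that fall inside a single subset are still caught.
    for a, b in combinations_with_replacement(subsets, 2):
        scan_pair(a, b)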

You should look into a file server with data deduplication. In a nutshell, this automatically stores only one physical copy of each file, with each "copy" hardlinked to that single physical file. (Some systems actually use block-level deduplication rather than file-level dedup, but the concept is the same.)
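
To make the hardlink idea concrete, here is a manual Python equivalent of a file-level dedup pass on a single pair of paths (the paths are made up; a dedup-capable filesystem does this, or its block-level analogue, transparently):

    # Manual illustration of file-level dedup: if two paths hold identical
    # content, keep one physical copy and hardlink the other name to it.
    import filecmp
    import os

    def dedup_pair(keep, redundant):
        """Replace 'redundant' with a hardlink to 'keep' if contents match."""
        if filecmp.cmp(keep, redundant, shallow=False):  # byte-for-byte compare
            os.remove(redundant)
            os.link(keep, redundant)  # both names now point at one inode
            return True
        return False

    if dedup_pair("/srv/share/presentation.pptx",
                  "/srv/share/backup/presentation.pptx"):
        print("second copy now shares the first copy's storage")

Unlike transparent dedup, a hardlink is visible to users: writing through either name changes the one shared file, and hardlinks only work within a single filesystem, so treat this purely as an illustration of where the space saving comes from.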

Newer advanced filesystems such as ZFS, BTRFS, and lessfs have dedup support, as does the OpenDedup fileserver appliance OS. One or more of those filesystems might already be available on your Linux servers. Windows Storage Server also has dedup. If you have some money to throw at the problem, some commercial SAN/NAS solutions have dedup capability.

Keep in mind, though, that dedup will not necessarily help with small, slightly modified versions of the same files. If people are littering your servers with multiple versions of their files all over the place, you should try to get them to organize their files better and use a version control system--which only saves the original file and the chain of incremental differences.

Update:

64 GB should be sufficient for caching at least 1 billion checksum-file path entries in physical memory, assuming 128-bit checksums and average metadata (filesystem path, file size, date, etc.) no longer than 52 bytes. Of course, the OS will start paging at some point, but the program shouldn't crash--that is, assuming the duplicate file finder itself is a 64-bit application.

If your duplicate file finder is only a 32-bit program (or if it's a script running on a 32-bit interpreter), the number of files you can process could be vastly smaller if PAE is not enabled: more on the order of 63 million (4 GB / (128 bits + 52 bytes)), under the same assumptions as before. If you have more than 63 million files, if you use a larger checksum, or if the average metadata cached by the program is larger than 52 bytes, then you probably just need to find a 64-bit duplicate file finder. In addition to the programs mgorven suggested (which I assume are available in 64-bit, or at least can be easily recompiled), there is a 64-bit version of DupFiles available for Windows.
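
If you want to sanity-check those figures, the arithmetic is just available memory divided by the per-entry footprint (a 16-byte checksum plus 52 bytes of metadata is 68 bytes per file):

    # Back-of-the-envelope check of the entry counts quoted above.
    ENTRY_BYTES = 128 // 8 + 52      # 16-byte checksum + 52 bytes of metadata

    ram_64bit = 64 * 2**30           # 64 GiB of physical memory
    ram_32bit = 4 * 2**30            # 4 GiB address space for a 32-bit process

    print(ram_64bit // ENTRY_BYTES)  # ~1.01 billion entries
    print(ram_32bit // ENTRY_BYTES)  # ~63 million entries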

rob

Posted 2012-05-11T22:47:02.140

Reputation: 13 188

I would have thought that 64GB DDR3 was good enough... We do have our storage servers mirrored to another site, using rsync. My problem is mostly with other people making copies of large presentations, or other files, for backup or otherwise. After space started to become limited, we did train our employees to "clean up better", but in the meantime the damage is already done. – Reid – 2012-05-13T03:25:56.890

Thanks for the info. Setting up a fileserver with deduplication support and simply transferring the files onto that would effectively merge all the duplicates and would automatically address the cases in which users make copies of their files. This might not be practical now, but you should consider it the next time you expand your storage. I thought of another issue that may or may not be related to the duplicate file finders crashing and added it to my answer. – rob – 2012-05-13T05:41:31.167

Have you tried rdfind, fdupes and findup from fslint?
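
All three are command-line tools, so they also slot neatly into the kind of follow-up script Reid mentions. For example, a small Python wrapper around fdupes (assuming it is installed; its default output lists each duplicate set one path per line, with blank lines between sets) that collects the groups for later backup or deletion:

    # Collect duplicate groups from fdupes output for later review.
    import subprocess
    import sys

    def duplicate_groups(root):
        """Run 'fdupes -r root' and return a list of duplicate-path groups."""
        out = subprocess.run(["fdupes", "-r", root],
                             capture_output=True, text=True, check=True).stdout
        return [g.splitlines() for g in out.strip().split("\n\n") if g.strip()]

    if __name__ == "__main__":
        for group in duplicate_groups(sys.argv[1]):
            keep, *extras = group
            print(f"keep: {keep}")
            for path in extras:
                print(f"  duplicate: {path}")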

mgorven

Posted 2012-05-11T22:47:02.140

Reputation: 2 539

This is an ancient post, but please consider expanding the answer. Just pointing to a product isn't considered an answer by current standards because it doesn't indicate anything about why it's a good solution or how to accomplish the solution. Good guidance on recommending software here. Thanks. – fixer1234 – 2016-08-03T16:50:47.317

Findup is the only one on your list I have tried, but I will give those a try with a lightweight install of Linux on a virtual cluster. Thank you. – Reid – 2012-05-13T03:37:01.087