Is there a way to search for files by hash value?

2

3

Is there a way I can have a hash value as input when searching for files and a complete list of files and their locations as output?

This could be helpful when trying to pinpoint file duplicates. I often find myself in situations where I have a bunch of files that I know I have already stored in some location, but I don't know where. They are essentially duplicates.

For instance, I could have a bunch of files on a portable hard drive, and also copies of those files on the internal hard drive of a desktop computer... but I'm not sure of the location! If the files have not been renamed, I could do a file name search to try to locate the copies on the desktop. I could then compare them side by side and, if they are the same, delete the copies I have on the portable hard drive. But if the files have been renamed on either one of the hard drives, this would probably not work (depending on how much the new names differ from the originals).

If a file is renamed but not edited, its hash value does not change, so I could calculate it; e.g. its SHA-1 value might be 74e7432df4a66f246b5214d60b190b67e2f6ce52. I would then like to use this value as input when searching for files, have the operating system search through a given directory or the entire file system for files with this exact SHA-1 hash value, and get a complete list of the locations where these files are stored as output.

I'm using Windows, but I am generally interested in knowing how something like this could be achieved, regardless of operating system.

Samir

Posted 2013-12-24T12:52:11.167

Reputation: 17 919

Unless the file system keeps a table of hashes (most don't), you need to calculate those as part of the search. I would rather use a program that does this for you (it will likely use hashes internally as one mechanism to compare files) than make your own solution. If you do make your own solution, I'd recommend using something like MD5 for the hashing. While not cryptographically secure, it's faster than SHA* and provides good enough entropy for the application, for files not intentionally forged to create collisions. – nitro2k01 – 2013-12-24T13:04:33.263

Hashing a file will rarely be faster than directly comparing the data in the two files (most comparisons will fail fairly quickly). – Bandrami – 2013-12-24T13:15:36.133

If hashing is not a good option, then by what other means can I uniquely identify a file? – Samir – 2013-12-24T13:53:26.527

Approximately how long will it take to hash 60 GiB in 135,000 files? This is the entire content of my Users folder. Is there any upper limit on how big a file I can hash? I know that small files are hashed fairly quickly, but the big ones might take several minutes to hash. – Samir – 2013-12-24T13:59:19.010

Answers

2

Linux example:

echo '74e7432df4a66f246b5214d60b190b67e2f6ce52' | { read hash ; find -type f -exec sh -c 'sha1sum "$1" | cut -f 1 -d " " | sed "s|^\\\\||" | grep -Eqi "$0"' "$hash" "{}" \; -print ; }

This code is more complex than you might expect because:

  • it is intended to correctly handle filenames with spaces, newlines, backslashes, quotations, special characters etc. (change -print to -print0 to parse them further);
  • it is intended to accept hash(es) as regex (compatible with grep -E i.e. egrep),
    e.g. '(^00)|(00$)' will match if the file hash starts or ends with 00.

You can use other *sum tools with a compatible interface (e.g. md5sum).
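
For instance, a rough sketch of the md5sum variant, assuming you replace the placeholder with the actual MD5 value (or regex) you are looking for:

echo '_md5_hash_here_' | { read hash ; find -type f -exec sh -c 'md5sum "$1" | cut -f 1 -d " " | sed "s|^\\\\||" | grep -Eqi "$0"' "$hash" "{}" \; -print ; }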

Kamil Maciorowski

Posted 2013-12-24T12:52:11.167

Reputation: 38 429

1

If you have PowerShell v.4.0 or higher, you can use the command:

Get-ChildItem _search_location_ -Recurse | Get-FileHash | 
Where-Object hash -eq (Get-FileHash _search_file_).hash | Select path

Here _search_location_ is the folder or disk where you want to search for a duplicate, and _search_file_ is a file that has a duplicate somewhere. You can put this command in a loop to search for several files, or add | Remove-Item at the end of the line to automatically delete the duplicates.

Also note that this command is suitable for small search folders only: it will take a lot of time if your search location contains thousands of files (like a whole HDD).
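
As an illustration, one possible sketch of the loop variant, which hashes the search location only once and then checks several candidate files against that list (both paths here are just placeholders):

# Placeholder paths; adjust to your own folders.
$searchLocation = 'C:\Users\Samir\Documents'
$candidates = Get-ChildItem 'D:\ToCheck' -Recurse -File

# Hash everything in the search location once.
$knownHashes = Get-ChildItem $searchLocation -Recurse -File | Get-FileHash

# For each candidate, print the paths of any matching files in the search location.
foreach ($file in $candidates) {
    $hash = (Get-FileHash $file.FullName).Hash
    $knownHashes | Where-Object Hash -eq $hash | Select-Object Path
}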

Alex K

Posted 2013-12-24T12:52:11.167

Reputation: 11

1

I like to use simple tools that I happen to already have, so here is a way to do that with Windows PowerShell (which obviously only works on Windows). It is actually a small edit of Alex K's answer; however, the question was how to search using hashes, whereas his answer searched for a copy of a specific file.

Get-ChildItem "_search_location_" -Recurse | Get-FileHash | Where-Object hash -eq _hash_here_ | Select path

Simply replace _search_location_ with the directory you wish to search and _hash_here_ with the hash of the file you wish to find.
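
For example, to search for the SHA-1 value from the question, you would also have to tell Get-FileHash to use SHA-1, since it defaults to SHA-256 (the search path here is just a placeholder):

Get-ChildItem "C:\Users" -Recurse | Get-FileHash -Algorithm SHA1 |
Where-Object hash -eq '74e7432df4a66f246b5214d60b190b67e2f6ce52' | Select path

The comparison works even though Get-FileHash reports hashes in upper case, because -eq in PowerShell is case-insensitive.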

user746340

Posted 2013-12-24T12:52:11.167

Reputation: 11

Please edit your answer instead of posting a second one. While you do mention it's a slight variation, you don't give any information on what you changed or why it makes it better. – Seth – 2017-07-07T08:22:11.923

1

This is an intriguing question. I have been using a tool called fdupes to accomplish something similar. Fdupes will recursively search through directories and compare every file with every other file. First it compares sizes; if the sizes are identical, it creates hashes of the files and compares those; and if the hashes are the same, it actually goes through each file byte by byte and compares them.

When it finds all the files that are truly identical, you can have it do several things. I have it delete the duplicate and create a hardlink in its place (thus saving me HDD space), although you can have it simply output the locations of the duplicate files and not do anything with them. This is the scenario you are asking about.

Some downsides of fdupes are that, as far as I know, it's Linux only, and since it compares every file to every other file it takes quite a bit of I/O and time to run. It does not "search" for a file per se, but it will list all the files that have an identical hash.

I would highly recommend it. I have set it to run in a cron job every day so that I never have any unnecessary duplicates of my data (it excludes my backups, of course).
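
For reference, a rough sketch of how this could look (the paths and the schedule are just examples, and the exact options can vary between fdupes versions):

fdupes -r /home/samir/data     # list sets of duplicate files, recursing into subdirectories
fdupes -rdN /home/samir/data   # delete duplicates without prompting, keeping the first file in each set

A daily crontab entry could then be something like:

0 3 * * * fdupes -rdN /home/samir/data >> /home/samir/fdupes.log 2>&1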

Fdupes Source Page

tbenz9

Posted 2013-12-24T12:52:11.167

Reputation: 5 868

0

Here's an example for the MD5 algorithm:

Get-ChildItem "_search_location_" -Recurse | Get-FileHash -Algorithm MD5 | Where-Object hash -eq _hash_here_ | Select path

Replace _search_location_ with the directory you wish to search and _hash_here_ with the hash of the file you wish to find.

If you want to search for a hash other than the default SHA-256 one, add -Algorithm _algorithm_ after Get-FileHash, where _algorithm_ is the chosen algorithm.

Beware that this requires PowerShell 4.0 or higher and will recalculate the hash of every file for every search!

user746347

Posted 2013-12-24T12:52:11.167

Reputation: 1

0

There's a tool ($) called FileLocator Pro that can search by file hash (SHA-x or MD5).

Excerpt from this page: http://www.mythicsoft.com/filelocatorpro/help/en/advanced_criteria.htm

Note: If the expression type is set to 'File Hash' then the containing text box can include a comma separated list of hash values or a pointer to a file containing a list of hash values, e.g.

5A9C9B42A16F5E1985B7B0A019114C7A,675C9B42A16F5E1985B7B0A019114C7A

or,

=c:\FileHashTable.txt

The actual algorithms used to calculate the hash, e.g. SHA1, MD5, are specified in the Options tab.

snowdude

Posted 2013-12-24T12:52:11.167

Reputation: 2 560