Is there a way I can have a hash value as input when searching for files and a complete list of files and their locations as output?
This could be helpful when trying to pinpoint duplicate files. I often find myself with a bunch of files that I know I already have stored somewhere, but I don't know where. They are essentially duplicates.
For instance, I could have a bunch of files on a portable hard drive and copies of those files on the internal hard drive of a desktop computer, but I'm not sure of their location. If the files have not been renamed, I could search by file name to locate the copy on the desktop. I could then compare the two side by side and, if they are the same, delete the copy on the portable hard drive. But if the files have been renamed on either drive, this would probably not work (depending on how much the new names differ from the originals).
If a file is renamed but not edited, I could calculate its hash value, e.g. the SHA1 value 74e7432df4a66f246b5214d60b190b67e2f6ce52. I would then like to use this value as input when searching for files, have the operating system search a given directory or the entire file system for files with this exact SHA1 hash, and get a complete list of the locations where these files are stored.
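To make the idea concrete, here is a minimal sketch in Python of what such a search could look like. It is not a built-in operating system feature; the function names `sha1_of_file` and `find_by_sha1` are made up for this example, and it simply hashes every readable file under a starting directory, so it will be slow on large trees.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_by_sha1(root, target_hex):
    """Walk `root` and return paths of files whose SHA-1 equals `target_hex`."""
    target = target_hex.strip().lower()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if sha1_of_file(path) == target:
                    matches.append(path)
            except OSError:
                # Skip files that cannot be opened (locked, no permission).
                continue
    return matches
```

Something like `find_by_sha1(r"C:\Users", "74e7432df4a66f246b5214d60b190b67e2f6ce52")` would then return the list of matching paths, regardless of what the files are called. Since it only uses the standard library, the same sketch should work on other operating systems too.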
I'm using Windows, but I am generally interested in knowing how something like this could be achieved, regardless of operating system.
Unless the file system keeps a table of hashes (most don't), you need to calculate them as part of the search. I would rather use a program that does this for you (it will likely use hashes internally as one mechanism to compare files) than make your own solution. If you do make your own solution, I'd recommend using something like md5 for the hashing. While not cryptographically secure, it's faster than SHA* and provides good enough entropy for this application, as long as the files were not intentionally forged to create collisions. – nitro2k01 – 2013-12-24T13:04:33.263
Hashing a file will rarely be faster than comparing the data in two files (most will fail fairly quickly) – Bandrami – 2013-12-24T13:15:36.133
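As a rough illustration of the comparison approach mentioned above, the following Python sketch looks for copies of one reference file by first comparing file sizes and only then comparing contents. It assumes you still have the reference file at hand (not just its hash); the name `find_copies` is made up for this example.

```python
import filecmp
import os


def find_copies(reference, root):
    """Return paths under `root` whose contents are identical to `reference`."""
    ref_size = os.path.getsize(reference)
    ref_real = os.path.realpath(reference)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.realpath(path) == ref_real:
                    continue  # do not report the reference file itself
                if os.path.getsize(path) != ref_size:
                    continue  # different size, so it cannot be a copy
                # shallow=False forces a comparison of the actual contents.
                if filecmp.cmp(reference, path, shallow=False):
                    matches.append(path)
            except OSError:
                # Skip files that cannot be read.
                continue
    return matches
```

Most candidates are rejected by the size check without being read at all, which is why this tends to be cheaper than hashing every file.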
If hashing is not a good option, then by what other means can I uniquely identify a file? – Samir – 2013-12-24T13:53:26.527
Approximately how long will it take to hash 60 GiB in 135000 files? This is the entire content of my Users folder. Is there any upper limit on the size of files I can hash? I know that small files are hashed fairly quickly, but the big ones might take several minutes. – Samir – 2013-12-24T13:59:19.010
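A back-of-the-envelope estimate can be sketched as below. The throughput and per-file overhead are pure assumptions for illustration (100 MiB/s sustained read speed, 5 ms to open each file), not measurements; in practice the disk is usually the bottleneck rather than the hash function, so real numbers depend heavily on the hardware. The function name `estimate_seconds` is made up for this example.

```python
def estimate_seconds(total_bytes, file_count, bytes_per_second, per_file_overhead):
    """Rough time estimate: streaming time plus a fixed cost per file."""
    streaming = total_bytes / bytes_per_second
    overhead = file_count * per_file_overhead
    return streaming + overhead


# Figures from the question: 60 GiB spread over 135000 files.
total = 60 * 2**30
files = 135000

# ASSUMED values, chosen only to show the arithmetic.
throughput = 100 * 2**20   # 100 MiB/s
overhead = 0.005           # 5 ms per file

seconds = estimate_seconds(total, files, throughput, overhead)
print(f"about {seconds / 60:.0f} minutes under these assumptions")
```

On the size question: when a file is hashed in chunks rather than loaded into memory at once, the file size is not a practical limit; larger files just take proportionally longer.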