In practice, yes, an identical cryptographic hash means the files are the same, as long as the files were not crafted by an attacker or other malicious entity. The odds of a random collision with any well-designed cryptographic hash function are so small as to be negligible in practice, in the absence of an active attacker.
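To put a rough number on "negligible", here is a small sketch using the standard birthday approximation. It assumes the hash behaves like a uniformly random function, which is an idealization; the function name and the example file count of 10^12 are mine, purely for illustration.

```python
import math

def random_collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that any two of num_files distinct inputs share a hash.

    Birthday approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1)),
    assuming the hash output is uniformly random.
    """
    exponent = num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)

# A trillion non-malicious files, hashed with a 128-bit and a 256-bit function.
for bits in (128, 256):
    print(bits, random_collision_probability(10**12, bits))
```

Even for the 128-bit case the result is far below anything you would worry about for accidental collisions; this says nothing about deliberately crafted files.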
In general, however, no, we cannot say that two arbitrary files having the same hash definitely means that they are identical.
The way a cryptographic hash function works is to take an arbitrary-length input, and output a fixed-length value computed from the input. Some hash functions have multiple output lengths to choose from, but the output is still to some degree a fixed-length value. This value will be up to a few dozen bytes long; the hash algorithms with the longest output value in common use today have a 512-bit output, and a 512-bit output is 64 bytes.
If an input to a hash function is longer than the output of the hash function, some fidelity must be removed to make the input fit in the output. Consequently, there must exist multiple inputs of lengths greater than the length of the output, which generate the same output.
Let's take the current workhorse, SHA-256, as an example. It outputs a hash of 256 bits, or 32 bytes. If you have two files which are each exactly 32 bytes long, but different, these should (assuming no flaw in the algorithm) hash to different values, no matter the content of the files; in mathematical terms, the hash is a function mapping a 2^256 input space onto a 2^256 output space, which should be possible to do without collisions. However, if you have two files that are each 33 bytes long, there must exist some combination of inputs that give the same 32-byte output hash value for both files, because we're now mapping a 2^264 input space onto a 2^256 output space; here, we can readily see that there should, on average, exist 2^8 inputs for every single output. Take this further, and with 64-byte files there should exist 2^256 inputs for every single output!
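The same counting argument can be checked on a scaled-down toy. The sketch below is not real SHA-256 usage: it truncates SHA-256 to a single byte (my assumption, to make the space small enough to enumerate) and hashes every possible 2-byte input, so 2^16 inputs are squeezed into 2^8 outputs, mirroring the 33-byte versus 32-byte case.

```python
import hashlib
from collections import Counter

def truncated_sha256(data: bytes, out_bytes: int = 1) -> bytes:
    """Toy hash: the first out_bytes bytes of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[:out_bytes]

def preimage_counts(in_bytes: int = 2, out_bytes: int = 1) -> Counter:
    """Count how many in_bytes-long inputs land on each truncated output."""
    counts = Counter()
    for value in range(256 ** in_bytes):
        message = value.to_bytes(in_bytes, "big")
        counts[truncated_sha256(message, out_bytes)] += 1
    return counts

counts = preimage_counts()
possible_outputs = 256 ** 1
# 2^16 inputs over 2^8 possible outputs: 2^8 inputs per output on average.
print("average inputs per output:", sum(counts.values()) / possible_outputs)
print("most crowded output has:", max(counts.values()), "inputs")
```

By the pigeonhole principle, at least one output must receive 256 or more inputs, whatever the hash function does internally.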
Cryptographic hash functions are designed such that it's computationally difficult to compose an input that gives a particular output, or compose two inputs that give the same output. These properties are known as preimage resistance and collision resistance, respectively. It's not impossible to find these collisions; it's just intended to be really, really, really, really hard. (A bit of a special case of a collision attack is a birthday attack.)
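To get a feel for how a birthday-style collision search works, here is a toy sketch. It attacks SHA-256 truncated to 24 bits, which is my own weakening for illustration; the function name and the choice of counter strings as inputs are likewise hypothetical. Against the full 256-bit output the same approach would need on the order of 2^128 hashes, which is the "really hard" part.

```python
import hashlib
from itertools import count

def find_truncated_collision(out_bytes: int = 3):
    """Return two different messages whose truncated SHA-256 values match.

    Remembers every truncated digest seen so far and stops at the first
    repeat. With out_bytes=3 there are only 2^24 possible values, so a
    repeat is expected after a few thousand attempts.
    """
    seen = {}
    for i in count():
        message = str(i).encode()
        tag = hashlib.sha256(message).digest()[:out_bytes]
        if tag in seen:
            return seen[tag], message
        seen[tag] = message

first, second = find_truncated_collision()
print(first, second)
print(hashlib.sha256(first).hexdigest()[:6], hashlib.sha256(second).hexdigest()[:6])
```

The two messages share the first 24 bits of their digests but differ in the rest, which is why a longer output makes this kind of search infeasible.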
Some algorithms are better than others at resisting attackers. MD5 is generally considered completely broken these days, but last I looked, it still sported pretty good first preimage resistance. SHA-1 is likewise effectively broken; collision attacks have been demonstrated, but require specific conditions. There's no reason to believe that will be the case indefinitely; as the saying goes, attacks always get better, they never get worse. SHA-256/384/512 are currently still believed safe for most purposes. However, if you're just interested in seeing if two non-maliciously-crafted, valid files are the same, then any of these should be sufficient, because the input space is sufficiently constrained already that you'd be mostly interested in random collisions. If you have any reason to believe that the files were crafted maliciously, then you need to at the very least use a cryptographic hash function that is currently believed safe, which puts the lower bar at SHA-256.
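If you go the hash route, a minimal sketch of the comparison could look like the following. It uses Python's standard hashlib; the helper names, the 1 MiB chunk size and the file size shortcut are my own choices, not anything prescribed.

```python
import hashlib
import os

def file_sha256(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def probably_identical(path_a, path_b) -> bool:
    """True if both files have the same size and the same SHA-256 digest."""
    # Files of different sizes cannot be byte-identical, so skip hashing.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_sha256(path_a) == file_sha256(path_b)
```

A call such as `probably_identical("a.xlsx", "b.xlsx")` (hypothetical file names) returning True means the files are byte-identical unless someone deliberately constructed a SHA-256 collision, which nobody is currently known to be able to do.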
First preimage is to find an input that yields a specific output hash value; second preimage is to find one input that gives the same output as another, specified input; collision is to find two inputs that yield the same output, without regard to what that is and sometimes without regard to what the inputs are.
All that said, it's important to keep in mind that the files may have very different data representations and still display exactly the same. So they can appear to be the same even though their cryptographic hashes don't match, but if the hashes match then they are extremely likely to appear the same.
cryptohashes and sometimes even normal hashes can be useful for comparing files on different systems, or searching among large numbers of files, but if two files are on the same system you can easily just compare them with cmp on Unix or fc (file compare) on Windows. – dave_thompson_085 – 2018-05-21T14:00:36.040
https://shattered.io/ - SHA1 is a "stronger" hashing algorithm than md5 and still https://shattered.io/static/shattered-1.pdf and https://shattered.io/static/shattered-2.pdf have the same hash value while being completely different. – styrofoam fly – 2018-05-21T15:38:29.183
Side note: check their sizes first. If they have different sizes, don't bother opening the files, they're different. – Emilio M Bumachar – 2018-05-21T18:48:49.077
Simplistic version: an MD5 hash is good enough to protect against an accident, it is not good enough to protect against maliciousness. Whether that's good enough for you, you have to decide based on your circumstances. – Euro Micelli – 2018-05-21T19:23:36.213
diff -s file1 file2: if it says they are identical, they are identical (it actually compares the files byte-per-byte so even hash collisions are excluded). Checksums are used when you only have one hash and an item that is thought to be identical to the originator of that hash. – Bakuriu – 2018-05-21T21:24:05.950
@EmilioMBumachar depends on the definition of "different". Bytes content may be different, but not semantic content. Example if you just add whitespaces after a final text. Or in some structured format if you have padding, that can be any length without any displayed content. – Patrick Mevzek – 2018-05-21T22:59:27.060
Pigeonhole Principle – technical_difficulty – 2018-05-22T14:47:15.247
Comparing two files takes less computation than hashing them. Where hashes are useful is when you have a large number of files and want to check whether any pair are identical. – Acccumulation – 2018-05-22T15:18:54.297
TL;DR: Probably. – Nonny Moose – 2018-05-23T01:08:27.703
@Bakuriu Or cmp -s, which is probably more efficient. – Konrad Rudolph – 2018-05-23T11:17:31.550
What do you mean by their contents being identical? If I have two files, both with identical cell values but the fonts are different, are they identical? If I have two files where every cell value and styling is the same, but the file stores them in different orders, are they the same? – David Rice – 2018-05-23T14:10:04.143
Don't forget that some operating systems may store more than one data stream in a file. NTFS has alternate streams, *nix has POSIX extended user attributes, the old MacOS had the resource fork. So, if you are afraid of someone adding hidden information to a file, it's not enough to hash the main data stream. – b0fh – 2018-05-23T22:39:19.600
@Acccumulation comparing two files over a network requires much less bandwidth with a hash, though. – Eric Duminil – 2018-05-26T16:16:19.640