Does an identical cryptographic hash or checksum for two files mean they are identical?

59

17

I have two Excel documents and I want to check if they are exactly the same, apart from the file name.

For example, the files are called fileone.xls and filetwo.xls. Apart from the file names, their contents are presumed to be identical, but this is what I want to check.

I've been looking for ways to check this without installing a bunch of plugins, and there doesn't seem to be a straightforward way.

I've tried generating MD5 hashes for both files. When the hashes are identical, does this mean that the file contents are 1:1 the same?
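For reference, a minimal way to generate such hashes (a sketch, assuming Python 3 is available; the file names are the examples from above):

import hashlib

# Read each file in binary mode and print its MD5 hex digest.
for name in ("fileone.xls", "filetwo.xls"):
    with open(name, "rb") as f:
        print(name, hashlib.md5(f.read()).hexdigest())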

sam

Posted 2018-05-21T12:38:52.147

Reputation: 3 411

8 Cryptohashes, and sometimes even normal hashes, can be useful for comparing files on different systems, or searching among large numbers of files, but if two files are on the same system you can easily just compare them with cmp on Unix or fc (file compare) on Windows. – dave_thompson_085 – 2018-05-21T14:00:36.040

10 https://shattered.io/ – SHA-1 is a "stronger" hashing algorithm than MD5, and still https://shattered.io/static/shattered-1.pdf and https://shattered.io/static/shattered-2.pdf have the same hash value while being completely different. – styrofoam fly – 2018-05-21T15:38:29.183

30 Side note: check their sizes first. If they have different sizes, don't bother opening the files; they're different. – Emilio M Bumachar – 2018-05-21T18:48:49.077

42 Simplistic version: an MD5 hash is good enough to protect against an accident; it is not good enough to protect against maliciousness. Whether that's good enough for you, you have to decide based on your circumstances. – Euro Micelli – 2018-05-21T19:23:36.213

9 diff -s file1 file2 – if it says they are identical, they are identical (it actually compares the files byte-per-byte, so even hash collisions are excluded). Checksums are used when you only have one hash and an item that is thought to be identical to the originator of that hash. – Bakuriu – 2018-05-21T21:24:05.950

@EmilioMBumachar It depends on the definition of "different". The byte content may be different but not the semantic content, for example if you just add whitespace after the final text, or, in some structured formats, padding that can have any length without any displayed content. – Patrick Mevzek – 2018-05-21T22:59:27.060

4 Pigeonhole Principle. – technical_difficulty – 2018-05-22T14:47:15.247

6 Comparing two files takes less computation than hashing them. Where hashes are useful is when you have a large number of files and want to check whether any pair are identical. – Acccumulation – 2018-05-22T15:18:54.297

2 TL;DR: Probably. – Nonny Moose – 2018-05-23T01:08:27.703

@Bakuriu Or cmp -s, which is probably more efficient. – Konrad Rudolph – 2018-05-23T11:17:31.550

2 What do you mean by their contents being identical? If I have two files, both with identical cell values but different fonts, are they identical? If I have two files where every cell value and styling is the same, but the file stores them in different orders, are they the same? – David Rice – 2018-05-23T14:10:04.143

2 Don't forget that some operating systems may store more than one data stream in a file. NTFS has alternate streams, *nix has POSIX extended user attributes, and the old MacOS had the resource fork. So, if you are afraid of someone adding hidden information to a file, it's not enough to hash the main data stream. – b0fh – 2018-05-23T22:39:19.600

@Acccumulation Comparing two files over a network requires much less bandwidth with a hash, though. – Eric Duminil – 2018-05-26T16:16:19.640

Answers

92

When the hashes are identical, does this mean that the file contents are 1:1 the same?

All files are a collection of bytes (values 0-255). If two files' MD5 hashes match, it is extremely likely that those two collections of bytes are exactly the same (same order, same values).

There's a very small chance that two different files can generate the same MD5, which is a 128-bit hash. The probability is:

Probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456. (From an answer on Stack Overflow.)
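A quick sanity check of that figure (Python integers are arbitrary-precision, so this is exact):

print(2 ** 128)
# 340282366920938463463374607431768211456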

Hashes are meant to work in "one direction only" - i.e. you take a collection of bytes and get a hash, but you can't take a hash and get back a collection of bytes.

Cryptography depends on this (it's one way two things can be compared without knowing what those things are.)

Around the year 2005, methods were discovered to create two documents that have the same MD5 hash (a collision attack; see @user2357112's comment below). This means an attacker can create two executables, for example, that have the same MD5, and if you are depending on MD5 to determine which to trust, you'll be fooled.

Thus MD5 should not be used for cryptography or security. Publishing an MD5 on a download site to ensure download integrity, for example, is bad practice. Depending on an MD5 hash you did not generate yourself to verify file or data contents is exactly what you want to avoid.

If you generate your own, you know you're not being malicious to yourself (hopefully). So for your use, it's OK, but if you want someone else to be able to reproduce it, and you want to publicly publish the MD5 hash, a better hash should be used.


Note that it's possible for two Excel files to contain the same values in the same rows and columns, but for the bytestream of the file to be completely different due to different formatting, styles, settings, etc.

If you want to compare the data in the files, export them to CSV with the same rows and columns first, to strip out all formatting, and then hash or compare the CSVs.
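A minimal sketch of the comparison step in Python (assuming both spreadsheets have already been exported to CSV by hand; the file names are just placeholders):

import filecmp

# shallow=False forces a byte-by-byte comparison rather than
# a comparison of os.stat() metadata only.
same = filecmp.cmp("fileone.csv", "filetwo.csv", shallow=False)
print("identical" if same else "different")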

LawrenceC

Posted 2018-05-21T12:38:52.147

Reputation: 63 487

107 Excel files and other Office documents can also have different hashes because they have been opened and re-saved without changing anything, due to the metadata in the file having a new value stored for the last-saved datetime. – BeowulfNode42 – 2018-05-21T13:39:24.340

29 Bonus: if you've exported to CSV, you can use the venerable diff or a similar utility to actually confirm the files are byte-for-byte identical, rather than just having the same hash. – Monty Harder – 2018-05-21T15:51:08.137

18 Taking a hash and creating data that matches the hash is a preimage attack. I believe MD5 is currently vulnerable to collision attacks, but I don't think preimage or second-preimage attacks are currently viable. – user2357112 supports Monica – 2018-05-21T20:54:34.850

2 @Tim What are you saying? He said: export them to CSV and use diff -s to check whether the CSVs are identical. In fact you can diff -s even the Excel files: if diff says they are identical you don't need to go to CSV comparison. – Bakuriu – 2018-05-21T21:25:59.697

2 @Bakuriu Clearly my comment was very poorly worded – I meant that exporting to CSV will lose lots of information, notably formulae, charts, and conditional and standard formatting. – Tim – 2018-05-21T21:27:14.810

2 How do you write about the insecurity of MD5 without suggesting a better hash like SHA-256... – user541686 – 2018-05-23T01:18:27.753

1 Honestly, the question is about comparison and not security. – LawrenceC – 2018-05-23T01:22:14.443

You could emphasize more that it is a logical necessity that several blobs of data share the same hash. That is independent of the hashing algorithm and the difficulty of producing collisions: there are simply many more variations of data (namely, an infinite amount) than hashes of a given size (e.g. 2^128 for a 128-bit hash). The relation data -> hash is a true function (which is probably, but not necessarily, surjective). The relation hash -> data is not a function. – Peter - Reinstate Monica – 2018-05-23T07:04:19.730

@MontyHarder No need for CSV export, you can diff the Excel files directly. In fact, I’d strongly recommend doing exactly that (or, rather, using cmp, which is much more efficient in case the files mismatch). – Konrad Rudolph – 2018-05-23T11:14:42.243

1 @KonradRudolph But the two files could be identical in all meaningful ways and still have some metadata differences that cause diff to report that they are different binaries. The whole point of exporting to CSV was to remove that possibility. – Monty Harder – 2018-05-23T19:19:11.187

@LawrenceC Is the probability of two hashes colliding constant? I think it will depend on the data size: the larger the data, the higher the probability, isn't it? – JITHIN JOSE – 2018-05-24T08:28:33.227

@MontyHarder But you can get diff to tell you where the Excel files differ. If it's just some metadata garbage at the beginning or end of the file, you can then safely assume that the spreadsheet contents and formulae are otherwise reasonably identical. – Doktor J – 2018-05-24T15:17:20.287

1 If it helps any, Excel files (and other Office files whose filename extensions are all 4-character .???x) are just XML trees stored in ZIP format. You can rename myletter.docx or mycharts.xlsx to mywhatever.zip and then expand the file to see how the format is arranged. So if you plan to diff two such files, you should use a skip-past-leading-offset count appropriate to ZIP data, along with all the other caveats that go along with diff'ing ZIP files. – Ti Strga – 2018-05-24T21:21:32.493

1 In industrial applications where we need to be quite certain that data or application files are the same as expected, we compare the hash and the file size with expected values. I don't know whether comparing the file sizes adds a layer of security, but it's standard practice. – Jason – 2018-05-28T01:38:35.387

1 @TiStrga Correct. And remember that XML files may differ in whitespace, attribute order, namespace prefixes, encoding, etc. but still be the same. – Thomas Weller – 2018-05-28T11:42:47.397

If you wanted to check a range of cells, you could accumulate a tree hash, but you would have to be keen. – mckenzm – 2018-09-20T23:38:17.143

37

In practice, yes, an identical cryptographic hash means the files are the same, as long as the files were not crafted by an attacker or other malicious entity. The odds of a random collision with any well-designed cryptographic hash function are so small as to be negligible in practice and in the absence of an active attacker.

In general, however, no, we cannot say that two arbitrary files having the same hash definitely means that they are identical.

The way a cryptographic hash function works is to take an arbitrary-length input and output a fixed-length value computed from that input. Some hash functions have multiple output lengths to choose from, but for any given choice the output is still a fixed-length value. This value will be at most a few dozen bytes long; the hash algorithms with the longest output value in common use today produce 512 bits, which is 64 bytes.

If an input to a hash function is longer than the output of the hash function, some fidelity must be removed to make the input fit in the output. Consequently, there must exist multiple inputs of lengths greater than the length of the output, which generate the same output.

Let's take the current workhorse, SHA-256, as an example. It outputs a hash of 256 bits, or 32 bytes. If you have two files which are each exactly 32 bytes long, but different, these should (assuming no flaw in the algorithm) hash to different values, no matter the content of the files; in mathematical terms, the hash is a function mapping a 2^256 input space onto a 2^256 output space, which should be possible to do without collisions. However, if you have two files that are each 33 bytes long, there must exist some combinations of inputs that give the same 32-byte output hash value for both files, because we're now mapping a 2^264 input space onto a 2^256 output space; here, we can readily see that there should, on average, exist 2^8 inputs for every single output. Take this further, and with 64-byte files there should exist 2^256 inputs for every single output!
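To make the fixed-length property concrete, here is a tiny illustration in Python (a sketch, not part of the answer itself): SHA-256 emits 32 bytes whether the input is empty, 33 bytes, or a megabyte.

import hashlib

# The digest length never changes, so by the pigeonhole principle
# inputs longer than 32 bytes must share digests with other inputs.
for data in (b"", b"x" * 33, b"x" * 1_000_000):
    digest = hashlib.sha256(data).digest()
    print(f"{len(data):>9} input bytes -> {len(digest)} digest bytes")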

Cryptographic hash functions are designed such that it's computationally difficult to compose an input that gives a particular output, or compose two inputs that give the same output. This is known as preimage attack resistance or collision attack resistance. It's not impossible to find these collisions; it's just intended to be really, really, really, really hard. (A bit of a special case of a collision attack is a birthday attack.)

Some algorithms are better than others at resisting attackers. MD5 is generally considered completely broken these days, but last I looked, it still sported pretty good first preimage resistance. SHA-1 is likewise effectively broken; collision attacks have been demonstrated, though they still require specific conditions and enormous computation, and there's no reason to believe that will remain the case indefinitely; as the saying goes, attacks always get better, they never get worse. SHA-256/384/512 are currently still believed safe for most purposes. However, if you're just interested in seeing whether two non-maliciously-crafted, valid files are the same, then any of these should be sufficient, because the input space is sufficiently constrained already that you'd mostly be concerned with random collisions. If you have any reason to believe that the files were crafted maliciously, then you need at the very least to use a cryptographic hash function that is currently believed safe, which puts the lower bar at SHA-256.

First preimage is to find an input that yields a specific output hash value; second preimage is to find one input that gives the same output as another, specified input; collision is to find two inputs that yield the same output, without regard to what that is and sometimes without regard to what the inputs are.

All that said, it's important to keep in mind that the files may have very different data representations and still display exactly the same. So they can appear to be the same even though their cryptographic hashes don't match, but if the hashes match then they are extremely likely to appear the same.

a CVn

Posted 2018-05-21T12:38:52.147

Reputation: 26 553

2 If the hashes match then either the files are a result of a deliberate collision, or they aren't, and then they are guaranteed to be the same. The probability of an accidental collision is purely theoretical. Saying that "if the hashes match then they are highly likely to appear the same" is misleading: if there's malice afoot and it's a collision situation then they aren't likely to be the same, and otherwise the probability is effectively zero; it isn't some low-probability event that needs to be defended against. – Gilles 'SO- stop being evil' – 2018-05-21T21:43:52.217

9 @Gilles: On the contrary. Michael's wording is exactly right, and "guaranteed" is misleading (or, well, factually wrong). The likelihood of two files with identical hashes not matching (notwithstanding malicious modification) is extremely low, and can be neglected in practice. It is, however, not zero. There is generally a chance that, for whatever reason, different inputs will produce the same hash, and possibly even with a likelihood much higher than 2^-128 (cryptographic algorithms are black art; the algorithm may be flawed in a subtle, unknown way and we have no way of being 100% sure). – Damon – 2018-05-22T10:15:04.310

I tweaked the answer slightly to try to address the comments above. – a CVn – 2018-05-22T11:06:05.917

5 @Gilles "Effectively zero" is still not zero, which means there is still some (admittedly small) probability that two different sets of data will result in the same hash. You can't argue against that. – Attie – 2018-05-22T11:18:43.730

1 "we cannot say that" – We can say it with more certainty than pretty much every mundane belief that any of us holds true about the world. Saying it is true is a perfectly legitimate use of that word; or else nothing is. – smls – 2018-05-22T16:50:01.130

5 @Attie: The probability of two unrelated files hashing to the same value is so far below the probability of many other things that can go wrong (e.g. random bit errors corrupting files on disk) that it's not worth guarding against coincidental matches. Guarding against deliberately-engineered matches may be worthwhile, but accidental matches are so improbable that any effort spent guarding against them could likely be spent better elsewhere. – supercat – 2018-05-22T18:53:35.963

@Damon The likelihood of an accidental MD5 collision is too small to matter. It would be meaningless to take it into account without taking more likely events into account, such as the probability of a RAM error leading to the computer reporting the wrong result. An accidental collision is guaranteed not to happen because it will not happen. If it were false that identical hashes guarantee there are no accidental collisions, then a counterexample would exist. – Gilles 'SO- stop being evil' – 2018-05-22T19:47:51.127

3 @Gilles Wrong. You can't in one breath tell me that there's a chance, however small you rate it, that an accidental collision may occur, and then in the very next breath guarantee that no collision can occur. Saying that is highly misleading, as it implies a property of the hashing algorithm that is already known to be completely false. – iheanyi – 2018-05-22T22:34:27.347

1 @Gilles: I am well aware that that likelihood is so low it can be ignored. However, "does not matter" and "does not exist" are not the same thing. It is possible for a collision to occur, and indeed collisions are guaranteed to happen (with extremely low likelihood for any given pair). Stating that something is guaranteed not to happen means nothing less than that it is impossible (with absolute certitude). That's really what "guaranteed" means, but it is verifiably not what is the case. – Damon – 2018-05-23T13:01:02.763

As a not-quite-right analogy (because there are about 10 orders of magnitude between the two cases), you can state that you are guaranteed not to win the lottery. If that's guaranteed, why do people play the lottery then? Because, you know, every other week or so, some fool does win. Think about it: who is the bigger fool, the fool who got rich by winning what couldn't be won, or the fool who was smart enough not to play? The odds may be ridiculously, unreasonably unlikely, but as long as there is a way, "guaranteed" is the wrong word. – Damon – 2018-05-23T13:07:57.853

@Damon Where that analogy fails is that there are many orders of magnitude between the two cases. Winning the lottery is very unlikely, but it's still more likely than, say, getting hit by a meteorite. And that is still much more likely than finding an MD5 collision. (IIRC there exists a human who has been hit by a meteorite, so it can't be that unlikely.) – Gilles 'SO- stop being evil' – 2018-05-23T18:15:17.487

Finding two inputs that give the same hash is a collision attack, not a second preimage attack. A second preimage attack takes one input and finds a second input that gives the same hash; a collision attack is much easier, because you don't have to match a specific target. – user2357112 supports Monica – 2018-05-23T21:40:30.857

If the file lengths are the same, and the files have the same hashcodes for more than one hash (using different hashing algorithms, of course), then the odds are even better that the files are identical. – David R Tribble – 2018-05-24T16:53:24.897

@Damon: Every week or two someone wins the lottery, but it is not the case that every week or two someone randomly generates an MD5 collision. It just depends whether Michael is talking about the abstract mathematical hash function (in which case we can talk about the probability), or actual real hash comparisons on real computers (in which case it doesn't really make sense to insist on the probability of a random hash collision, but not insist on the probability of a false match due to a far more likely hardware error in computing the hashes). – Steve Jessop – 2018-05-24T17:17:43.050

That said, I agree that "negligible" is a better way of putting it than what Gilles proposes. If someone is too dumb to know what the words negligible and/or probability mean, or is considering what to actually do in practice, then fall back to telling them that if the hashes match then the two cases to consider are that the files are the same or that it's an intentionally-generated collision. There is no point adding "or a random MD5 collision; or a cosmic ray hit your RAM; or there was a previously undetected bug in sshlib", even though they technically are all candidates. – Steve Jessop – 2018-05-24T17:23:14.427

@user2357112 Fixed. – a CVn – 2018-05-24T17:48:00.360

10

It's a probability game... hashes are able to represent a finite number of values.

If we consider a hypothetical (and very weak) 8-bit hashing algorithm, then this can represent 256 distinct values. As you start to run files through the algorithm, you will start to get hashes out... but before long you will start to see "hash collisions". This means that two different files were fed into the algorithm, and it produced the same hash value as its output. Clearly here, the hash is not strong enough, and we cannot assert that "files with matching hashes have the same content".
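A quick way to see this in practice (a hypothetical sketch; it fakes an 8-bit algorithm by truncating MD5 to its first byte):

import hashlib

def hash8(data: bytes) -> int:
    # Toy 8-bit "hash": just the first byte of the MD5 digest.
    return hashlib.md5(data).digest()[0]

seen = {}
for i in range(10_000):
    data = str(i).encode()
    h = hash8(data)
    if h in seen:
        print(f"collision: {seen[h]!r} and {data!r} both hash to {h}")
        break
    seen[h] = data

With only 256 possible outputs, the pigeonhole principle guarantees a collision within the first 257 distinct inputs.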

Extending the size of the hash, and using stronger cryptographic hashing algorithms can significantly help to reduce collisions, and raise our confidence that two files with the same hash have the same content.

This said, we can never reach 100% certainty - we can never claim for sure that two files with the same hash truly have the same content.

In most situations this is fine, and comparing hashes is "good enough", but it depends on your threat model.

Ultimately, if you need to raise the certainty levels, then I would recommend that you do the following:

  1. Use strong hashing algorithms (MD5 is no longer considered adequate if you need to protect against potentially malicious users)
  2. Use multiple hashing algorithms
  3. Compare the sizes of the files - an extra data point can help to identify potential collisions, but note that the demonstrated MD5 collisions did not need to alter the data's length.

If you need to be 100% sure, then by all means start with a hash, but if the hashes match, follow it up with a byte-by-byte comparison of the two files (a minimal sketch of this follows below).
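Something like this, assuming Python 3 is acceptable (the file names are placeholders, and SHA-256 stands in for "a strong hash"):

import filecmp
import hashlib
import os

def sha256_of(path):
    # Hash in chunks so large files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.digest()

def definitely_identical(a, b):
    if os.path.getsize(a) != os.path.getsize(b):
        return False                             # different sizes: different files
    if sha256_of(a) != sha256_of(b):
        return False                             # different hashes: different files
    return filecmp.cmp(a, b, shallow=False)      # settle it byte-by-byte

print(definitely_identical("fileone.xls", "filetwo.xls"))

The size check and the hash are cheap filters; the byte-by-byte comparison is what actually delivers the 100% certainty.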


Additionally, as pointed out by others, the complexity of documents produced by applications such as Word and Excel means that the text, numbers, and visible layout can be the same while the data stored in the file is different.

Excel is particularly bad at this - simply opening a spreadsheet and saving it (having changed nothing) can produce a new file with different content.

Attie

Posted 2018-05-21T12:38:52.147

Reputation: 14 841

6 "MD5 is no longer considered adequate" is very true cryptographically, but for uniqueness checking (in the absence of malice, e.g. if you control the input) it's nice and fast (and 128 bits should be plenty). – Chris H – 2018-05-21T14:04:02.527

"a hash collision with identical file sizes is less likely." – Actually, the collisions built for MD5 usually have the same file size. – Paŭlo Ebermann – 2018-05-21T20:20:16.850

4 "follow it up with a byte-by-byte comparison of the two files." If you're going to do a file comparison, you may as well do it first... no point reading all of each file to compute their hashes only to re-read both files to compare them! – TripeHound – 2018-05-21T21:21:27.590

3 @TripeHound It depends if the files are both local or not... if you already have a hash of one and are introducing a new file to the system, if the new file needs a hash stored in a database anyway, etc... Make the call that suits your situation. – Attie – 2018-05-21T21:26:31.227

@PaŭloEbermann I'd misremembered - thanks & fixed. – Attie – 2018-05-21T21:28:35.603

@ChrisH fair point, thanks - I've updated my answer. – Attie – 2018-05-21T21:29:28.727

5 No, it is not a probability game. You're misestimating how unlikely an accidental collision is. It just won't happen. Flipping a bit during the comparison is more likely. On the other hand, in some scenarios, a deliberate collision might happen, and that is not a probability game at all. – Gilles 'SO- stop being evil' – 2018-05-21T21:45:11.520

2 @Gilles No, as Attie says, it's literally a probability game, based on how many bits are in the hash and how many files you are expecting to work with. A 32-bit hash would probably work just fine (barring maliciousness) for an average desktop user (CRC32 is still popular for some kinds of downloaded video files), but not for a Google-scale big-data filesystem. – mbrig – 2018-05-22T15:22:23.143

3 @mbrig: A 32-bit hash would have a significant risk of accidental mismatch. Going to 128 or 256 bits, however, makes a huge difference. With 128 bits, a billion monkeys each typing a billion decently-sized genuinely-random documents would have about a 0.3% chance of creating two documents with the same hash. With 256 bits, even if billions of monkeys could type a billion decently-sized random documents per second for a billion years, the likelihood of any of those nonillions of documents having coincidentally-matching hash values would be vanishingly small. – supercat – 2018-05-22T19:12:24.180

@mbrig With a CRC, it's partly a probability game. With a cryptographic hash, even a broken one, probability is irrelevant: you will never win that game. Conversely, no matter what the hash is, you need to figure out whether deliberate collisions are an issue, and that is not about probability at all. – Gilles 'SO- stop being evil' – 2018-05-22T19:41:22.693

1 @Gilles: The only thing that would make it more of a probability game with CRC than some other hash is that CRCs are generally shorter. Otherwise, a CRC would often be less of a probability game because certain factors may create correlations. As a simple example, if Acme Spreadsheet ensured file integrity by storing a 32-bit CRC with every file, other tools to compute a file's CRC might report that all Acme Spreadsheet files have a CRC of zero, since the appended CRC would be the pattern of bits that, when appended to the preceding content, makes the CRC of the combined file zero. – supercat – 2018-05-24T19:39:09.227

6

Short answer: A cryptographic hash is supposed to help you be reasonably confident that files with matching hashes are the same. Unless the files were deliberately crafted, the chance of two different files having the same hash value is ridiculously small. But when it comes to comparing and verifying files that could be deliberately tampered with, MD5 is a poor choice. (Use another hash function like SHA-3 or BLAKE2.)

Long answer: An ideal hash function is one that creates an almost-unique cryptographic hash for every unique piece of data. In other words, while we know for certain that there are files in this universe whose hash values collide, the chance of two such files naturally coming together is ridiculously small.

Ten years ago, I decided I must stay as far as I can from MD5. (Of course, until yesterday, I remembered the wrong reason for doing so; ten years is a long time, you see. I revisited my past memos to remember why and edited this answer.) You see, in 1996, MD5 was found to be susceptible to collision attacks. 9 years later, researchers were able to create pairs of PostScript documents and (ouch!) X.509 certificates with the same hash! MD5 was clearly broken. (Megaupload.com was also using MD5, and there was a lot of hanky-panky around hash collisions that gave me trouble at the time.)

So, I concluded that while MD5 was (and still is) reliable for comparing benign files, one must stop using it altogether. I reasoned that reliance on it has the risk of turning into indulgence and false confidence: once you start comparing files using their MD5 hashes, one day you forget the security fine print and compare two files that are deliberately crafted to have the same hash. In addition, CPUs and cryptoprocessors were unlikely to add support for it.

The original poster, however, has even less reason to use MD5, because:

  1. As long as one is comparing two files only, byte-for-byte comparison is actually faster than generating one's own MD5 hashes. For comparing three or more files... well, now you have a legitimate cause.
  2. The OP specified "ways to review this and without installing a bunch of plugins". Windows PowerShell's Get-FileHash command can generate SHA1, SHA256, SHA384, SHA512, and MD5 hashes. On modern computers with hardware support for SHA hash functions, generating them is faster. (A rough cross-platform alternative is sketched below.)
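If PowerShell isn't an option, here is a rough cross-platform equivalent in Python (a sketch only; hashlib.file_digest requires Python 3.11 or newer, and the script name in the usage line is hypothetical):

import hashlib
import sys

# Print several digests for each file given on the command line,
# similar in spirit to PowerShell's Get-FileHash.
# Usage: python hashes.py fileone.xls filetwo.xls
for path in sys.argv[1:]:
    with open(path, "rb") as f:
        for algo in ("md5", "sha1", "sha256", "sha512"):
            f.seek(0)  # rewind before each pass over the file
            print(algo, hashlib.file_digest(f, algo).hexdigest(), path)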

user477799

Posted 2018-05-21T12:38:52.147

Reputation:

6 You can create your own cryptographic hash function of any length you choose, true; but then it has a fixed length and the pigeonhole principle applies anyway. The general answer is: "by comparing their hashes only, you cannot be sure the two files are identical". – Kamil Maciorowski – 2018-05-21T13:09:56.667

2 @KamilMaciorowski In theory, yes I can. My custom-made hash function can simply generate a copy of the largest file. But I have no interest in discussing this further; the truth is, you downvoted for a reason that amounts to nitpicking just to prove you are smarter, and it backfired on you. Now you can't take the vote back. – None – 2018-05-21T13:13:15.767

I agree with @KamilMaciorowski... It's a probability game... using a single hash, you can be "reasonably confident" that files with matching hashes are the same, but there is no 100% guarantee. Using better algorithms, or using multiple algorithms, can improve your confidence - even comparing file sizes can help... but you can never be 100% confident without checking byte-for-byte. – Attie – 2018-05-21T13:20:32.143

1 @Attie Huh! That's what I originally meant. Thanks. Only I am not familiar with chic phrases like "you can be reasonably confident". Sorry. Still, that's why we have an edit button. I personally would never trash a good answer just because one word in it is wrong. I edit it. – None – 2018-05-21T13:25:58.833

1 About "trashing a good answer": please note I ensured first that it's not a typo and you really meant it; then I downvoted, and at the same time I gave you feedback and disclosed my reason in the hope your answer would get better. It did, so my downvote is no more. Basically I told you what I thought was wrong with your answer, Attie helped to clarify, and you improved the answer. From my point of view we all handled this situation properly and the whole story turned out very well. Thank you. – Kamil Maciorowski – 2018-05-21T14:08:54.103

@KamilMaciorowski Very well; I take that as an act of good faith. – None – 2018-05-22T04:58:39.010

6

If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. How hard it is to craft files with the same MD5 hash depends on the file format; I don't know how easy it is with Excel files.

So if you have files of your own that are just lying around and want to find duplicates, MD5 is safe. If you wrote one of the files, and the other file is of dubious origin, MD5 is still safe (the only known way to get different files with the same MD5 checksum is to craft both files). If someone you don't trust sends you a budget proposal, and later sends another file which they claim is the same, then MD5 may not be enough.

To avoid any risk, use SHA-256 or SHA-512 instead of MD5. If two files have the same SHA-256 hash, then they're identical. The same goes for SHA-512. (There's a theoretical possibility that they could be different, but the probability of this happening accidentally is so much less than the probability of your computer flipping a bit during the verification that it just isn't relevant. As for someone deliberately crafting two files with the same hash, nobody knows how to do this for SHA-256 or SHA-512.)

If two Excel files have different hashes, then they're different, but there's no way to know by how much they differ. They could have identical data but different formatting, or they could just differ in the properties, or they might have been saved by different versions. In fact if Excel is anything like Word then merely saving a file updates its metadata. If you only want to compare the numerical and text data and ignore formatting and properties, you can export the spreadsheets to CSV to compare them.

If you have Unix/Linux tools available, then you can use cmp to compare two files. To compare two files on the same machine, checksums only make things more complicated.

Gilles 'SO- stop being evil'

Posted 2018-05-21T12:38:52.147

Reputation: 58 319

If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. That is incorrect. There are an infinity of possible messages, yet there are only 2^64 possible 64-bit hashes. It's called the "pigeonhole principle": " the pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item." If you create more than 2^64 messages you will have collisions without any "special crafting". And you might with just 2. – Andrew Henle – 2018-05-26T20:45:34.897

@AndrewHenle, MD5 is not 64 bits, it's 128. If generating an accidental collision gets us into heat-death-of-the-universe timescales, it's "possible" only for an extremely academic (hence useless) definition thereof. – Charles Duffy – 2018-05-26T21:20:07.750

@CharlesDuffy You're assuming the hash is randomly distributed. It's not. – Andrew Henle – 2018-05-26T22:47:59.937

Being effectively equivalent to random distribution is part of the definition of what constitutes a good cryptographic hash -- you have a lot of rounds of mixing for a reason. Certainly, there are weak hash algorithms, but focusing on those weaknesses gets us into the previously-stated caveats around intentional attacks. (Or are you saying that MD5 has been shown to only have 64 bits that are effectively random? I'll admit that I haven't been keeping up, so that's plausible -- link please?) – Charles Duffy – 2018-05-27T00:14:29.463

@AndrewHenle I do not state that a collision is mathematically impossible, which would be wrong, but not relevant here. I state that it has not happened, which is true. Your comment is incorrect in a way that completely changes the deal. There are 2^128 possible MD5 hashes, not 2^64. This means you would need to generate 2^128 hashes to be certain to generate a collision. Actually, by the birthday paradox, 2^64 would give you a macroscopic chance of a collision between the hashes you generated (not with a previously-generated hash). But this is moot since we know how to craft collisions. – Gilles 'SO- stop being evil' – 2018-05-27T09:46:22.073

If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. That is a statement claiming the only way to get a collision is to "specially craft" one. That is wrong. Collisions are possible, no matter how unlikely. I state that it has not happened, which is true. Again, you're wrong. Collisions can happen. I've been witness to one. – Andrew Henle – 2018-05-27T11:36:28.110

@CharlesDuffy Or are you saying that MD5 has been shown to only have 64 bits that are effectively random? I changed 128 to 64 in order to fit the post limit - it was that tight. The principle remains the same, though. – Andrew Henle – 2018-05-27T11:42:38.093

@AndrewHenle If you've been a witness to an accidental MD5 collision, please publish it. You'd be the first. ”I changed 128 to 64 in order to fit the post limit - it was that tight. The principle remains the same, though.” You what? No, 2^128 is so many orders of magnitude larger than 2^64 that it makes a qualitative difference. It's the difference between ”run your computer for a while“ and “not in your lifetime”. – Gilles 'SO- stop being evil' – 2018-05-27T11:57:04.467

5

Hashes such as MD5 or SHA have a fixed length; let's say it's 300 alphanumeric characters (in reality they are shorter and don't use the whole set of alphanumeric characters).

Let's say that files are made of alphanumeric characters and are up to 2 GB in size.

You can easily see that there are way more files (with size of up to 2GB) than possible hash values. The pigeonhole principle says that some (different) files must have the same hash values.

Also, as demonstrated on shattered.io [1], you can have two different files, shattered.io/static/shattered-1.pdf and shattered.io/static/shattered-2.pdf, that have the same SHA-1 hash value while being completely different.

[1] SHA-1 is a "stronger" hashing algorithm than MD5.

styrofoam fly

Posted 2018-05-21T12:38:52.147

Reputation: 1 746

The probability of accidental collisions is too low to take into account. The risk of a deliberate collision exists for MD5 as well and is worse than for SHA-1 which is not terribly relevant here. – Gilles 'SO- stop being evil' – 2018-05-21T21:48:12.823

5

I have 2 excel documents and I want to check if they are exactly the same, apart from the file name.

From a practical perspective, directly comparing the files to find out if they're different will be faster than computing a hash for each file and then comparing that hash.

To compute the hashes you have to read the entirety of the contents of both files.

To determine if they're identical through a direct comparison, you just need to read the contents of both files until they don't match. Once you find a difference, you know the files aren't identical and you don't have to read any more data from either file.

And before you do either, you can simply compare the sizes of the two files: if the sizes differ, the contents can't be the same. (A sketch of the whole early-exit comparison follows below.)
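For illustration, a minimal sketch in Python (hypothetical file names; not from the original answer):

import os

def files_identical(a: str, b: str, chunk_size: int = 65536) -> bool:
    # Different sizes: no need to read any content at all.
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca = fa.read(chunk_size)
            cb = fb.read(chunk_size)
            if ca != cb:
                return False      # first mismatch: stop reading immediately
            if not ca:            # both files exhausted at the same point
                return True

print(files_identical("fileone.xls", "filetwo.xls"))

Because the size check has already passed, the two reads stay in lockstep, and the loop can stop at the first differing chunk.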

Andrew Henle

Posted 2018-05-21T12:38:52.147

Reputation: 200

When using two files on one physical drive, using a hash function that can keep up with the I/O speed on each file separately may be slightly faster than comparing the files, since there would be no need to switch between reading the two files. The place hashes really shine, though, is when trying to do comparisons involving many files that are too large to fit in memory. Even if you merely want to find out if they all match, comparing file 1 to file 2, then file 1 to file 3, then file 1 to file 4, etc. may be almost twice as slow as computing all their hashes. – supercat – 2018-05-25T15:45:34.123

@supercat If the files are read in chunks larger than a MB or so, the switching between files won't be noticeable. And if a work flow involves comparing a bunch of files to find duplicates, the hash might as well be computed as each file is written - since doing it then can pretty much be done for free. – Andrew Henle – 2018-05-26T20:39:03.447

If one has enough space to buffer large chunks of the files, the switching times need not be a problem, but otherwise they might be. As for computing the hashes when files are written, that may be fine if one could guarantee that files could not be modified without changing or at least invalidating stored hashes. If one is trying to avoid backing up files redundantly, looking only at stored hash values may cause one to back up an accidentally-corrupted file but not bother to back up the non-corrupted files which the corrupted file should match but doesn't. – supercat – 2018-05-27T18:01:14.900

"Once you find a difference, you know the files aren't identical" - not necessarily. XLSX files are ZIP files which potentially could store the content in different order still havng the same content. But even if you decompress them and compare each individual file, the XLSX file contains XML documents which might have e.g. different line endings without affecting the content. – Thomas Weller – 2018-05-28T11:38:03.530

4

NO. Different values guarantee the files are different. The same values are not a guarantee that the files are the same. It is relatively easy to find examples using CRC16.

On the balance of probability, with contemporary hashing schemes, they are the same.

mckenzm

Posted 2018-05-21T12:38:52.147

Reputation: 829

1 The question is about MD5, which has no risk of accidental collisions. It does have a risk of deliberate collisions, but that's not a matter of probabilities. – Gilles 'SO- stop being evil' – 2018-05-21T21:46:00.760

1 It's also about Excel spreadsheets with different names; how large can they be that a byte-for-byte comparison can't be an option? Two hashing schemes together would provide certainty. – mckenzm – 2018-05-22T04:51:01.403

2 @Gilles All hashcodes have a risk of accidental collisions, by definition. The only way out of that is to use the entire file as the hashcode. Your comment makes no sense. – user207421 – 2018-05-24T10:35:28.527

3

Your question is backwards, though. Let's assume that matching hashes mean the files have the same data (which isn't 100% guaranteed, but is good enough for a lifetime of comparing files every second without hitting a collision). It doesn't necessarily follow that having the same data means the files will have the same hash.

So no, you can't compare the data in one Excel file with the data in another by hashing the files, because there are a lot of ways two files can differ without the underlying data being different. One obvious way: the data is stored as XML, and each cell has its own XML node. If those nodes are stored in different orders, the data is the same but the file is different.

David Rice

Posted 2018-05-21T12:38:52.147

Reputation: 131

3

To add to the other answers, here are many examples of pairs of files with the same MD5 hash and different content.

Giulio Muscarello

Posted 2018-05-21T12:38:52.147

Reputation: 161

Quite a link-only answer, but interesting. – Thomas Weller – 2018-05-28T11:33:15.650

2

The answer for this OP has been given but might benefit from a summary.

If you want to check whether two files are the same, a lot depends on whether or not the files and hashes are under your control.

If you generate the hashes yourself from the files, and you are pretty sure nobody else had the opportunity/skill/motivation to deliberately try to make you reach the wrong conclusion, then almost any hash, even "known broken" hashes like MD5 and SHA-1, is almost certain to be sufficient. By that, I mean you could generate files at high speed for millions of years and you'd still be unlikely to end up with any two files that are actually different but have the same hash. It's almost certainly safe.

This is the scenario you have when you want to quickly check if two directories on your PC or file server have the same content, if any files in a directory are exact duplicates, etc., and you're pretty sure the files haven't been engineered or illicitly modified, and you trust your hashing app/utility to give correct results.

If you are in a scenario where one of the files - or a precalculated hash - might have been manipulated or engineered to fool you into a wrong conclusion, then you need a stronger (unbroken) hash, and/or other security. For example, if you download a file and check if it's valid by examining a hash, then an attacker might be able to engineer a bad file with the correct hash, or attack the website to place an incorrect hash when you look for the "right" (expected) value. This comes down to wider security issues.

Stilez

Posted 2018-05-21T12:38:52.147

Reputation: 1 183

2

On the Windows command line, you can use the comp utility to determine whether two files are exactly the same. For example:

comp fileone.xls filetwo.xls

Chad

Posted 2018-05-21T12:38:52.147

Reputation: 1 343

1

When the hashes are identical, does this mean that the file contents are 1:1 the same?

No. If the hashes are different, it does mean that the contents are different. Equal hashcodes do not imply equal content. A hashcode is a reduction of a large domain to a smaller range, by definition: the implication is that hashcodes over unequal content can be equal. Otherwise there would be no point in computing them.

user207421

Posted 2018-05-21T12:38:52.147

Reputation: 214

Otherwise there would be no point in computing them. If you've broken the laws of mathematics and invented a lossless compression function that can compress random data, violating the pigeonhole principle, it would be very valuable to use it! It would be highly convenient if a 128-bit hash did uniquely represent the whole contents of a file. Even if there was no decompression function to turn the hash back into the file a mathematically-impossible collision-free hash would be nice to have, e.g. to speed up dup-finding in untrusted data like in VM images. – Peter Cordes – 2018-05-25T00:10:39.653

"If the hashes are different, it does mean that the contents are different." Not necessarily. XLSX files are ZIP files and it would be possible to have the same content stored in different file order. – Thomas Weller – 2018-05-28T11:30:55.663

1

This answer is intended to be a handy map of scenarios that can or cannot happen, and reasonings you can apply. Refer to other answers to learn why hash functions work this way.


After you choose a hash function and stick to it, these are all combinations to consider:

          |    identical   |   different    |
          |   hash values  |  hash values   |
----------+----------------+----------------+
identical |   can happen,  | cannot happen, |
  files   |     common     |   impossible   |
----------+----------------+----------------+
different |   can happen,  |   can happen,  |
  files   |      rare*     |     common     |
----------+----------------+----------------+

* rare, unless whoever generates (at least one of) the files
  purposely aims at this scenario

The scenario in which identical files generate different hash values is the only one that is strictly impossible.


Two reasonings that always apply:

  • If files are identical then hash values are identical for sure.
  • If hash values are different then files are different for sure.

Two reasonings that are not strict:

  • If files are different then hash values are probably different.
  • If hash values are identical then files are probably identical.

Kamil Maciorowski

Posted 2018-05-21T12:38:52.147

Reputation: 38 429

0

For your purposes, yes, identical hashes means identical files.

As other answers make clear, it's possible to construct two different files which result in the same hash, and MD5 is not particularly robust in this regard.

So use a stronger hashing algorithm if you plan on comparing a large number of Excel documents, or if you think someone might want to manipulate the comparison. SHA-1 is better than MD5; SHA-256 is better again and should give you complete confidence for your particular usage.

jah

Posted 2018-05-21T12:38:52.147

Reputation: 243

-1

The files are probably identical if their hashes are identical. You can increase confidence by modifying both files in an identical way (e.g. put the same value in the same unused cell) then comparing hashes of the modified files. It is hard to create a deliberate collision for a file which is changed in a way not known in advance.

ibft2

Posted 2018-05-21T12:38:52.147

Reputation: 7

This won't work due to additional data stored in Office files. You would need to e.g. put the cursor in the same cell before saving, save at the exact same time, etc. But even then, XLSX files are ZIP files internally, so if that algorithm stores the individual files in a different order (for whatever purpose), the file is identical but the hash isn't. – Thomas Weller – 2018-05-28T11:29:49.533

-2

Let's look at this in a practical way. Instead of saying "the hashes are identical" I'll say "I wrote a computer program that calculates the hashes of two files and prints out whether they are the same or not", and I run the program with two files, and it says "identical". There are several reasons why it might do that:

  • The files may be identical.
  • My code may have bugs. (One that has actually happened in practice was comparing two long (256-byte) hashes not with memcmp but with strcmp: the comparison will return "same" if the first byte in each hash is zero, and the chance of that is 1 in 65536.)
  • There may be a hardware fault (a cosmic ray hitting a memory cell and switching it).
  • Or you may have the rare case of two different files with an identical hash (a hash collision).

I would say that for non-identical files, by far the most likely cause is programmer error, then comes the cosmic ray that changed a boolean variable with the result of comparing the hashes from "false" to "true", and much later comes the coincidence of a hash collision.

There are enterprise backup systems that avoid backing up identical files from 10,000 users by hashing each file and checking whether a file with an identical hash is already stored on the server. So in case of a collision a file won't get backed up, possibly leading to data loss. Someone calculated that it is much more likely that a meteorite hits your server and destroys all backups than that you lose a file because its checksum matched a different file's.

gnasher729

Posted 2018-05-21T12:38:52.147

Reputation: 277