9

Many sites these days offer MD5 and SHA256 hashes to check the integrity of downloaded files or archives.

I wonder how much safer is the use of the SHA256 hashes for integrity checks?

Note: Consider the file content as random input (no attacks)

Note: This seems to be a simple question (and I read about collisions on Wikipedia), yet I have not found an answer on this site.

Marcel
  • What is your attack scenario? An attacker can change the file but not the hash? Aren't they usually put in the same directory, making this unlikely? Aren't these hashes meant for integrity and not for security? – Sjoerd Feb 11 '19 at 10:34
  • @Sjoerd No attack, as mentioned in the first note and title. Just for the typical download integrity check. – Marcel Feb 11 '19 at 12:48

7 Answers

14

There seems to be some confusion about the capabilities of a collision attack.

Two of the properties a cryptographic hash must have are collision resistance and preimage resistance.

If a hash is collision resistant, it means that an attacker will be unable to find any two inputs that result in the same output. If a hash is preimage resistant, it means an attacker will be unable to find an input that has a specific output. MD5 has been vulnerable to collisions for a great while now, but it is still preimage resistant.
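To make the difference between the two properties concrete, here is a minimal sketch using a deliberately tiny toy hash (the top 16 bits of SHA-256; `tiny_hash`, `find_collision`, and `find_preimage` are hypothetical names made up for this illustration). It only shows the generic brute-force costs: finding any colliding pair takes roughly 2^(n/2) tries, while hitting one specific output takes roughly 2^n tries. It is not how the real MD5 collision attacks work; those exploit structural weaknesses in MD5 rather than brute force.

```python
import hashlib
from itertools import count


def tiny_hash(data: bytes, bits: int = 16) -> int:
    """Toy hash for illustration only: the top `bits` bits of SHA-256."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int = 16):
    """Find ANY two distinct inputs with the same toy hash.

    Returns (first_input, second_input, number_of_tries).
    """
    seen = {}  # toy hash value -> input that produced it
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg, bits)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg


def find_preimage(target: int, bits: int = 16):
    """Find an input whose toy hash equals one SPECIFIC target value.

    Returns (input, number_of_tries).
    """
    for i in count():
        msg = b"p" + str(i).encode()
        if tiny_hash(msg, bits) == target:
            return msg, i + 1


if __name__ == "__main__":
    a, b, tries = find_collision()
    print(f"collision: {a!r} and {b!r} after {tries} tries")

    target = tiny_hash(b"original file contents")
    m, tries = find_preimage(target)
    print(f"preimage:  {m!r} after {tries} tries")
```

With a 16-bit toy hash the collision search typically finishes after a few hundred tries, while the preimage search typically needs tens of thousands. Scaled up to a full-size digest, that gap is the reason a function can lose collision resistance long before preimages become practical.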


What does this mean for integrity?

If you trust that the party who originally hashed the data is not malicious, and that they did not let anyone else modify the data beforehand (any part of it; two images, videos, or PDFs that look identical can still differ vastly at the byte level), then MD5 should be sufficient to verify integrity, and SHA-256 shouldn't offer much more security (barring any future attacks on MD5's preimage resistance).

If an attacker may have been able to make any modifications to the data (even seemingly benign modifications), then SHA-256 will be more secure, as with MD5 the attacker could have crafted a malicious file with the same hash.


Are these integrity checks useful?

In many cases, not really. If you're downloading the file over HTTPS from the same website that provides the hash value, you already benefit from the MAC that TLS uses for authenticity checking, so a MitM cannot change the file in transit. If someone is able to maliciously modify the file on the site, they can also modify the hash.

One case where it does make sense to verify an MD5 or SHA-256 hash for a file is if you download the file from a mirror and check the hash against one provided by the original trusted site.
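A minimal sketch of that mirror check in Python, using only the standard `hashlib` module. The helper names `file_digest` and `matches_published` are made up for this example, and the published hash is assumed to be a plain hex string copied from the trusted site.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Compare the local file's digest with the hex string from the trusted site."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

Usage would look like `matches_published("download.iso", "<hash copied from the original site>")`, where the file name is a placeholder. The check is only as trustworthy as the channel the published hash came from.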

AndrolGenhald
  • thanks for this insightful answer. As I read your answer (and others) the consensus seems to be "SHA256 is somewhat safer, but not really". Originally, though, I was looking for something like "SHA256 is x times safer because of y". – Marcel Feb 11 '19 at 15:32
  • No, he’s saying they’re roughly comparable in some circumstances, but MD5 has an exploitable weakness under certain conditions, and SHA256 doesn’t share that weakness. He isn’t computing “256 bits is X effort vs 128 bits is Y effort.” This answer is “MD5 is broken and has a vulnerability; here’s how it could be exploited in the use case you describe.” – John Deters Feb 13 '19 at 19:10
11

I wonder how much safer is the use of the SHA256 hashes for integrity checks?

Note: Consider the file content as random input (no attacks)

Based on your note of "no attacks" it seems to me that you are asking:

"What is the probability that a random change (e.g., bit flip during download) to a file will result in creating a new/different file with the same checksum as the original file?"

For the case of MD5, this probability is: 1/(2^128) = 2.94e-39 = 0.00000000000000000000000000000000000000294

For the case of SHA256, this probability is: 1/(2^256) = 8.64e-78 = 0.00000000000000000000000000000000000000000000000000000000000000000000000000000864
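These figures can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ideal hash, where a corrupted file is equally likely to map to any of the 2^n possible digests; the function name `accidental_match_probability` is made up for the example.

```python
from fractions import Fraction


def accidental_match_probability(digest_bits: int) -> Fraction:
    """Chance that one random, different file has the same n-bit digest,
    assuming every digest value is equally likely (an idealization)."""
    return Fraction(1, 2 ** digest_bits)


p_md5 = accidental_match_probability(128)
p_sha256 = accidental_match_probability(256)

print(f"MD5:    1/2^128 = {float(p_md5):.2e}")
print(f"SHA256: 1/2^256 = {float(p_sha256):.2e}")

# Ratio between the two probabilities: exactly 2^128.
print(f"ratio:  {p_md5 / p_sha256} (= 2^128: {p_md5 / p_sha256 == 2 ** 128})")
```

Under this model, an accidental match is 2^128 times less likely with SHA256 than with MD5, although both probabilities are far too small to matter in practice.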


Important Caveat: In the above-mentioned hypothetical case of random changes, both MD5 and SHA256 are fine choices. However, in real life, the MD5 hash function is frowned upon because it has been broken (collisions have been found). So, the real life advice is: use SHA256 not MD5 for file integrity.


Update based on comments: I'm referring to MD5 as "broken" to mean (basically) that collisions have been found. One of the main conjectured properties of MD5 was "that it is computationally infeasible to produce two messages having the same message digest..." (RFC 1321) Because it is possible to violate this property, I've called MD5 "broken," which is perhaps a little harsh. I still see MD5 used all the time, and I still use it myself all the time. It is fine to use MD5 in certain circumstances, especially when there is no other option.

forest
hft
  • I converted your 2.93/10^39 into standard scientific notation. You can always roll back my edit if you don't approve. However, the way you had it written is a bit non-standard and therefore potentially confusing. As long as you're writing down large numbers like that, I figure you might as well use proper scientific notation. Also, you had a slight rounding mistake. – Conor Mancone Feb 14 '19 at 20:51
  • This is actually the only answer that directly compares the collision probabilities under the question's conditions. – Marcel Feb 15 '19 at 06:57
  • @Marcel I'm glad you found what you're looking for. Just because, I think the reason the other answers didn't bother including the actual numbers is because in both cases the odds are so low as to be completely negligible in real-world circumstances. – Conor Mancone Feb 15 '19 at 13:51
  • You've still got a rounding error @hft. The number should be 2.94e-39 for MD5. See https://www.google.com/search?q=1%2F(2%5E128)&oq=1%2F(2%5E128) – Conor Mancone Feb 15 '19 at 13:57
  • I agree with the first part of your answer, but I don't believe that md5 is broken for file integrity constraints. The vulnerabilities in md5 allow collisions to be found, but md5 has no known pre-image attacks required to violate file integrity. https://en.wikipedia.org/wiki/Preimage_attack. Though in designing file-integrity in 2019, I'd still (almost always) choose sha256. – Steve Sether Feb 15 '19 at 22:06
  • I'm using the word "broken" here to mean (basically) that collisions have been found. One of the main conjectured properties of MD5 was "that it is computationally infeasible to produce two messages having the same message digest..." (RFC 1321) Because it is possible to violate this property, I've called MD5 "broken," which is perhaps a little harsh. I still see MD5 used all the time, and I still use it myself all the time. I agree with you that it is fine to use MD5 in certain circumstances, especially when there is no other option. – hft Feb 16 '19 at 20:28
  • Those are not collisions. The "different" inputs (one hex, one base64) decode to the same binary input to the sha256 function. – hft May 12 '22 at 16:42
  • Not really, I've seen most of these tricks – hft May 12 '22 at 20:35
4

MD5 creates a 128-bit hash, whereas SHA256 creates a 256-bit hash.

You could say that SHA256 is "twice as secure" as MD5, but really the chance of a random collision is negligible with either. I would say MD5 provides sufficient integrity protection.

There are attacks to create MD5 collisions on purpose, but the chance of finding a collision by accident is still determined by the size of the hash, so it is approximately 2/2^128.

There are currently no two known distinct files in the world that have the same SHA256 hash. There are distinct files that have the same MD5 hash, but only because they have been purposely created that way.
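For a sense of scale across a whole collection of files rather than a single pair, the standard birthday approximation can be sketched as follows. This assumes an ideal hash and random, non-adversarial inputs; the function name `any_collision_probability` and the file count of one trillion are made up for illustration.

```python
import math


def any_collision_probability(num_files: int, digest_bits: int) -> float:
    """Birthday approximation: chance that at least two of `num_files`
    random files share an n-bit digest, p ~= 1 - exp(-pairs / 2^n)."""
    pairs = num_files * (num_files - 1) // 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2 ** digest_bits)


if __name__ == "__main__":
    n = 10 ** 12  # hypothetical collection of one trillion files
    print(f"MD5    (128 bits): {any_collision_probability(n, 128):.2e}")
    print(f"SHA256 (256 bits): {any_collision_probability(n, 256):.2e}")
```

Even for a trillion files, the MD5 figure stays on the order of 1e-15, which is why accidental collisions are not the practical concern with either function.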

Sjoerd
  • I may be misunderstanding, but I find these two statements of yours contradictory: "I would say MD5 provides sufficient integrity protection." and "There are attacks to create MD5 collisions on purpose". Certainly accidental collisions are very unlikely, but is it really safe for integrity protection if it is possible for an attacker to create a collision on purpose? – Conor Mancone Feb 11 '19 at 14:58
  • @ConorMancone the only issue collisions have for integrity is if the attacker is able to modify the file before you create the initial hash (which is a valid concern in some cases, but not all). – AndrolGenhald Feb 11 '19 at 15:05
  • As the OP, note that the question is about "how much more safe", not about absolute safety and not regarding a purposeful change. – Marcel Feb 11 '19 at 15:25
  • Doubling the digest size from 128 does not double the "security". It increases it by a factor of 2^128. – forest Feb 12 '19 at 03:54
  • I downvoted due to a misunderstanding. I can't change my vote to an upvote, so bounty. – Conor Mancone Feb 14 '19 at 15:28
  • @ConorMancone You should be able to change your vote, why not? – Marcel Feb 14 '19 at 16:09
  • @Marcel the vote gets locked after a bit and can only be reversed if the author edits their post – Conor Mancone Feb 14 '19 at 17:18
3

MD5 collision vulnerabilities exist and it's feasible to intentionally generate 2 files with identical MD5 sums.

No SHA256 collisions are known, and unless a serious weakness exists in the algorithm, it's extremely unlikely one will be found.

For verifying a file was not accidentally corrupted, MD5 is probably sufficient. If it's possible it was intentionally altered, MD5 isn't safe and you should stick with SHA256.

Alexander O'Mara
  • So, you say, there are currently no distinct 2 files in the world, that have the same SHA256 hash, and most likely never will? – Marcel Feb 11 '19 at 07:52
  • @Marcel (correct me if I'm wrong alexander). "Most likely never will" is probably not true. Right now there are no known weaknesses in SHA256, and therefore generating collisions is effectively impossible. However, this was also true of MD5 once. Researchers continue to develop new algorithms partly because continued growth of processing power makes it easier to break old ones, but also because finding weaknesses seems inevitable. Certainly there are currently no known weaknesses, but it's a safe bet that that will change eventually (although hopefully not for a long time). – Conor Mancone Feb 11 '19 at 14:57
  • @ConorMancone I think you have underestimated just how many SHA256 hashes are possible (which is understandable). The real reason MD5 and now SHA1 is broken is because it has weaknesses such that a match could be found. Also, Moore's law is broken. – Alexander O'Mara Feb 11 '19 at 16:37
  • @AlexanderO'Mara It doesn't matter how large the hash is and how many possibilities there are if critical weaknesses are later discovered in the function itself. This is what happened with MD5. It also has a large space (although substantially smaller than SHA256), but because of severe weaknesses there are ways to create files with collisions almost instantaneously in some circumstances. If similarly critical weaknesses are later found in SHA256 then the hash size won't matter. – Conor Mancone Feb 11 '19 at 17:05
  • @ConorMancone See the *"unless a serious weakness exists in the algorithm"* quantifier then. – Alexander O'Mara Feb 11 '19 at 17:06
  • Of course, I missed the fact that the OP was specifically talking about accidental collisions, in which case the chances of that ever happening remain astronomically small – Conor Mancone Feb 11 '19 at 17:08
  • At such a point as we collectively have astronomically large collections of files, it seems like MD5 will have natural collisions more frequently than a hash that's twice as long if they're both cryptographically secure. I would use the word much except I have a difficult time conceiving of that many files. I just know that either life as we know it will end or it will happen eventually. – Ed Grimm Feb 12 '19 at 03:40
3

Both MD5 and SHA256 resist a preimage attack, nowadays.

This means that it would be nearly impossible for someone to replace the file with a different one that has the same {MD5|SHA256} hash.

However, you should note

  • MD5 is a broken hash function. Attacks will only get better (there was a theoretical attack 10 years ago with a computational complexity of 2^123.4), so there's little reason to start using this hash on a new project in 2019.
  • It is frowned upon to use MD5. Your actual usage of MD5 may not be exploitable, but it looks bad on you to be using this hash (only).
  • Your inputs may not be as random as you expect. CAs were using MD5 for certificate signatures, over contents they created, thinking it was safe. Then in December 2008 a real-world proof of concept of the attack was published.
  • You can always use both. If you are targeting end users, you can simply provide multiple hashes (MD5, SHA1, SHA256...) and leave the decision to the final users.
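If you go the multiple-hashes route, a minimal sketch like the following computes all of them in a single pass over the file. The function name `multi_digest` and its default algorithm list are assumptions made for this example, and it relies on the local Python build providing MD5 and SHA1 through `hashlib`.

```python
import hashlib


def multi_digest(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Read the file once and feed every chunk to each hash object."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

The returned dictionary maps each algorithm name to its hex digest, ready to be published next to the download.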

So, if you have to make a decision between using MD5 and SHA256, go for SHA256.

Ángel
  • I think this misses some of OP's question -that defense against deliberate attacks are to be ignored- in all the examples, there was a significant effort and computation time put into finding these issues. In the absence of targeted behaviour the "safety increase", even multiplicatively is tiny. Other factors are far more significant. – drjpizzle Feb 15 '19 at 16:50
  • The question lacks a threat model. No attacks but it asks about the safety level. If there are no 'attacks' whatsoever, why does it use a hash function? – Ángel Feb 17 '19 at 00:03
  • Accident, like data corruption. CRC is widely used for this kind of thing though it's completely 'broken' as a hash. – drjpizzle Feb 18 '19 at 08:34
  • Then your 'attacker' would be data corruption / cosmic rays :P – Ángel Feb 24 '19 at 23:38
  • talk about the universe being out to get you.. – drjpizzle Feb 25 '19 at 22:09
2

If you exclude malice or other intentional/MD5-aware behaviour, MD5 really is fine.

There is of course a chance of an accidental collision with either MD5 or SHA256, and the odds for SHA256 are a lot lower. However, for some context: the odds of an accidental MD5 collision are far lower than the chance that the check flag gets accidentally flipped by a cosmic ray, making it look like the hashes were the same when they weren't (see here, with some caveats).

If you are interested in non-random inputs (like malice), SHA256 might be a better choice, but it depends on what you think the attacker could control.

If you're just interested in making sure the system is accident proof, there are better places to spend your time than which hashing algorithm you use.

drjpizzle
  • Malice is not the only situation that may result in collisions. What if you are performing a deduplication task based on MD5 on a file hosting service, and someone uploads two files with intentionally colliding MD5 digests? It need not be done maliciously (they might just want to upload two PoC files to show to someone), but treating MD5 as fine would result in their files being mangled. This even happened with GitHub where some data was corrupted when the colliding SHA-1 PDFs were uploaded. – forest Feb 15 '19 at 12:13
  • @forest You're right, malice may not be the best choice of words. Perhaps intentional is a better measure. I will think if there is a non-waffley way of making that change, but I think from the point of view of 'random input' the "No" remains. – drjpizzle Feb 15 '19 at 16:40
0

Chosen-prefix collision attacks on md5 are fairly easy to pull off. This is an attack where the attacker can choose two arbitrary files, then append different calculated bytes to each, so that both files produce the same md5 hash.

In 2012, the authors of the Flame malware took advantage of this attack to make it appear as if the malware was signed by a legitimate Microsoft code signing certificate. The certificate had been used to sign a legitimate file, where the signature was done over the (weak) md5 hash of the file. The authors of the Flame malware used a chosen-prefix attack to make their malware file produce the same md5 hash as the legitimate file, thus making it appear as if the malware file was signed by the Microsoft certificate. For more info, see https://en.wikipedia.org/wiki/Flame_%28malware%29#Operation.

There have been no known attacks of this kind on SHA256.

mti2935