57

I know that collision for MD5 has been documented since the 90s and that digital certificates based off of MD5 has been demonstrated to be completely compromised back in 2010 but how effective is MD5 in ensuring that small amounts of data have not been tampered with?

I have some small text files that are a few pages in size (let's say 15kb in size). I've been using SHA-256 on them but it would be much more convenient to be able to use MD5 instead.

How secure would MD5 be as a hash digest for these small 15kb text files? Would a malicious party be able to produce collisions for such a small amount of data or does the small size make this a difficult endeavor?

forest
  • 64,616
  • 20
  • 206
  • 257
thebunnyrules
  • 673
  • 1
  • 5
  • 10
  • 54
    Why? Why would you want that? Why would MD5 be "much more convenient"? – ilkkachu May 29 '18 at 10:29
  • 2
    See also [Does an identical cryptographic hash or checksum for two files mean they are identical?](https://superuser.com/q/1324629/53590) on [su]. – user May 29 '18 at 14:17
  • Is secure the word you're looking for? Or do you mean reliable? There is is probably no need to focus on security for this task. – Anthony May 29 '18 at 19:15
  • 5
    2004 is not exactly the 90s. – kasperd May 29 '18 at 21:40
  • 4
    Small files aren't safe. The [2008 rogue CA certificate](https://www.win.tue.nl/hashclash/rogue-ca/) used an MD5 collision on a "to-be-signed" part of only 927 bytes, and only needed control of 204 of those bytes (the "collision block") to create the collision. – Gordon Davisson May 30 '18 at 06:44
  • 4
    Aside from anything else, bear in mind that if MD5 is more convenient than SHA-256, then you already have a problem. Maybe not a problem you can do anything about yourself (e.g. maybe you have to comminicate with something that uses some old protocol with MD5 inherently bound into it, or some piece of hardware with accelerated MD5 but not SHA-256). But as well as considering whether to surrender and use MD5, it's also worth considering whether you can attack whatever constraints it is that are trying to impose MD5 on you :-) – Steve Jessop May 30 '18 at 08:54
  • 6
    The answers below are quite technical and illuminating, but they don't change the bottom line. MD5 is obsolete and should not be used in any new applications needing a cryptographic hash. This has been true for nearly 20 years, so long that the same thing can now be said about it's successor SHA1. If by "convenient" you are referring to the 128-bit output, you are better off truncating a modern hash to 128-bits than using MD5. Not defining what you mean be "convenient" makes this question likely an instance of an [X-Y problem](https://meta.stackexchange.com/q/66377/141193) – President James K. Polk May 30 '18 at 16:51
  • @James K Polk Using a truncated hash sha2, sha3 may be interesting but wouldn't the truncating make them as weak as the md5? – thebunnyrules May 31 '18 at 21:28
  • @kaspered from wiki entry on md5 "In 1996, Dobbertin announced a collision of the compression function of MD5". That's not even close to 2004. – thebunnyrules May 31 '18 at 21:42
  • @Gordon Davidson: From what I understood from Forest's answer below: a collision is only a threat if the file wasn't made by me and I wasn't the person to take the initial md5 hash. If someone is to take my file and replace with a file with the same hash (a preimage attack), it would be much more difficult. Is there something that I'm missing or is there something wrong with the answer itself? If so please explain, maybe directly in the comments of the answer so that Forest can respond. – thebunnyrules May 31 '18 at 21:55
  • No, truncating the hash won't make it *as weak* as MD5. It will be vulnerable to collision attacks with a work factor on the order of 2^64 steps, but that's the price you have to pay for having a 128-bit hash. And it doesn't seem like your application is vulnerable to collision attacks. More importantly, it helps keep MD5 from further polluting the crypto ecosystem. – President James K. Polk May 31 '18 at 23:15
  • Well, since you're putting so poetically, how can I say no... – thebunnyrules Jun 01 '18 at 00:23
  • 1
    @thebunnyrules You're still in danger if an attacker can control *part* of a file (and predict what comes before the part they control). That's essentially the situation with the cert collision: a legit certificate authority created the certificate, but the "attacker" was able to control parts of the generated certificate and predict the rest; that was enough. In your case, if no attacker can control *any part of* any of the files, attackers are stuck with a second-preimage attack, which MD5 isn't known to be significantly weak to. But I'd still rather truncate a better hash. – Gordon Davisson Jun 01 '18 at 00:38
  • Just as @GordonDavisson says, an attacker only needs to influence both of the files, not necessarily be the sole author. While MD5 is technically safe for a limited threat model, it's still a good idea to upgrade to something better. After all, MD5 has a 2^18 collision, whereas something like SHA-256 truncated to 128 bits would require 2^64 for a collision. – forest Jun 01 '18 at 03:40

5 Answers5

95

The size of the input is irrelevant. In fact, because of the birthday paradox, collisions are guaranteed as soon as the size of the message exceeds that of the hash. The best way to avoid collisions is to use a stronger hash which is not vulnerable to them, such as SHA-2. However, you are describing a more difficult attack than a collision attack, called a preimage attack, which MD5 is safe from.

There are three types of attacks* that result in having two files with the same digest:

  • 1st preimage - Find an input that resolves to a specific hash.

  • 2nd preimage - Modify an input without changing the resulting hash.

  • Collision - Find any two distinct inputs that have the same hash.

These are vulnerabilities when they can be carried out more efficiently than by brute force search. Collisions can still occur naturally, and in fact they are guaranteed with any non-trivial amount of input due to the pigeonhole principle, but hashes are designed to make it difficult to intentionally perform. For a hash with an output the size of MD5's, the chance of a random, accidental collision is extremely low. Even if you hash 6 billion random files per second, it would take 100 years before you get a 50% chance of two hashes colliding. MD5 is great for detecting accidental corruption.

A strong n-bit hash function is designed to have a security level of 2n against both 1st and 2nd preimage attacks, and a security level of 2n/2 against collision attacks. For a 128-bit hash like MD5, this means it was designed to have a security level of 2128 against preimages and 264 against collisions. As attacks improve, the actual security level it can provide is slowly chipped away.

MD5 is vulnerable to a collision attack requiring the equivalent of only 218 hash invocations instead of the intended 264 to exploit. Unless the attacker generates both files, it is not a collision attack. An attacker who has a file and wants to maliciously modify it without the hash changing would need to mount a 2nd preimage attack, which is completely infeasible against MD5 with modern technology (the best attack has a complexity of 2123.4, compared to MD5's theoretical maximum of 2128). Collision attacks are relevant in different situations. For example, if you are given an executable made by an attacker without a backdoor, you may hash it and save the hash. That executable could then later be replaced with a backdoored version, yet the hash would be the same as the benign one! This is also a problem for certificates where someone could submit a certificate for a domain they do own, but the certificate would intentionally collide with one for a domain they do not own.

It is safe to use MD5 to verify files as long as the stored hash is not subject to tampering and can be trusted to be correct, and as long as the files being verified were not created (or influenced!) by an attacker. It may still be a good idea to use a stronger hash however, simply to prevent a potential practical preimage attack against MD5 in the future from putting your data at risk. If you want a modern hash that is very fast but still cryptographically secure, you may want to look at BLAKE2.

* While there are other attacks against MD5 such as length extension attacks that affect all Merkle–Damgård hashes as mentioned by @LieRyan, these are not relevant for verifying the integrity of a file against a known-correct hash.

A variant of the collision attack called a chosen-prefix collision attack is able to take two arbitrary messages (prefixes) and find two values that, when appended to each message, results in a colliding digest. This attack is more difficult to pull off than a classic collision attack. Like the length extension attack, this only applies to Merkle–Damgård hashes.

forest
  • 64,616
  • 20
  • 206
  • 257
  • It should be added that in addition to a second preimage attack being unfeasible in itself, a _meaningful_ second preimage attack is even harder. You need not only find _any_ blob that produces the same hash (that's a 2^123 complex task), but one that is plausible (i.e. in this case a readable, meaningful, non-gibberish text file). Now add the file length to your hash to strengthen it further (making e.g. adding non-print characters impossible), and I'd really want to see someone pulling that one off. – Damon May 29 '18 at 08:32
  • 1
    @Damon You don't need to add the file length to a hash. In fact, Merkle–Damgård hashes (of which MD5 is one) encode the length of the data to be hashed in the final block. Also, restricting the input types does not make a preimage attack that much harder. It just means you are more limited in the bits you can toggle, which really isn't a limitation as long as you are able to toggle a few dozen bits. – forest May 29 '18 at 08:38
  • What exactly do you mean by saying that SHA-2 is "not vulnerable" to collisions? – chrylis -cautiouslyoptimistic- May 29 '18 at 08:57
  • @forest: I'm not sure you are 100% correct there. While it is true that M-D hashes encode the length of the data, you can nevertheless hash "foo" and "foo2" just fine, and they will both produce a 128-bit blob which is agnostic of the original message's length. If "foo" and "foo2", or maybe "foo1234" happen to collide, and that's all you want in the end, you have a working attack. If, however, it is _explicitly_ known that "foo" has 3 letters (not 4, nor 6, nor 8), then that's hard to pull off. – Damon May 29 '18 at 11:54
  • @Damon It is explicitly known that "foo" has 3 letters (or rather, 3 × 8 = 24 bits). See [RFC 1321](https://www.ietf.org/rfc/rfc1321.txt) § 3.2. This means it is not agnostic to the original length. _H(m || length(m))_ gives you no benefit over _H(m)_. – forest May 29 '18 at 13:16
  • @forest: I guess then the guys who hacked Flickr back in 2009 were all liars, since no such thing as a length extension attack on MD5 exists. My bad, I misunderstood the approach. – Damon May 29 '18 at 13:38
  • 2
    @Damon I think you misunderstand how a length extension attack works. There is a [useful answer](https://crypto.stackexchange.com/a/3979/54184) over on the Cryptography site which explains why it is possible despite the padding. You can prevent it by _prepending_ a length value however (or by using a construction designed to resist such attacks such as HMAC), but simply "explicitly encoding the length" is not nearly enough. – forest May 29 '18 at 13:42
  • 1
    @chrylis I meant that all hashes in the SHA-2 family are not subject to any known attacks that make collisions significantly more easy to construct than via exhaustive search. – forest May 29 '18 at 13:51
  • It's not quite enough that the files are not "created by an attacker" -- even if an attacker only controls part of the files, they may still be able to force a collision. – Gordon Davisson May 30 '18 at 06:48
  • 2
    FYI, the alleged preimage attack on MD5 has far worse price/performance ratio than generic attacks; it is at best a theoretically interesting observation, not even a theoretical attack. But I strongly advise _against_ quibbling over the details of different types of attacks on MD5 to a general audience without the experience to recognize how difficult protocol design is or to recognize how collisions or length extensions can be exploited for forgery. Naive thinking about this causes people to make irresponsible choices like using MD5 in rsync, and makes them resistant to fixing the problems. – Squeamish Ossifrage May 30 '18 at 13:46
  • For example, the attacker doesn't have to produce both files. MD5 certificates were forged _in practice_ by an attacker who could _predict_ parts of the files not under their control, and control _some of_ the files. Absent a much more specific model of what the legitimate user's actions and the adversary's powers are, this advice is likely to lead someone into trouble if they don't take the time to study protocol design and work out the consequences for themselves. – Squeamish Ossifrage May 30 '18 at 13:54
  • ```you don't need any more than the size of the hash to make collisions guaranteed``` << that's not strictly true, you'll need hash size + 1 bit to guarantee collisions exists – user1067003 Mar 08 '21 at 18:41
13

It depends on what you want to defend yourself against

Security is never a one-size-fits-all game. If it were, then there would not be 12941 different hash algorithms. Instead, you need to understand that every security measure defends you against a specific sort of attack. You put a password in your computer to defend against random people accessing it, not because it's so fun to type whereD1DweG0sowron6 whenever you log in.

As for hash algorithms, you can grossly classify them as "cryptographic hashes" and "non-cryptographic hashes". Cryptographic hash algorithms are designed to withstand a number of attacks, while non-cryptographic hashes are designed to be as fast as possible.1 MD5, for example, is considered a cryptographic hash, but so broken that it's only usable as a non-cryptographic hash.

When to use a non-cryptographic hash

If your goal is to detect bit-flips when copying a file from one location to another (say, a thumb drive to a laptop), then MD5 is absolutely the right choice. I would even go as far as saying any fast, non-cryptographic hash is good. When you copy files, you realistically do not need to fear attacker interference. If you are paranoid about hackers being able to modify your kernel, then adding hashes will not solve your problems.

Verifying file integrity with attacker interference

If you intend to sign and publish those files, then an attacker might have the ability to craft a possibly legitimate file with the same hash - meaning that your signature is just as valid on the malicious file.

An example

Let's say your original message m1 looks like this:

I hereby declare that the bunny rules!

You use your hash function h(m1) and gain the digest d1. Afterwards, you sign the digest d1 and get a signature s1.

You then publish your message m1, your signature s1 and your hash function h().

I might be the attacker in the scenario and craft a message m2 that has the exact same hash in your chosen hash function:

It is publicly known that dogs are better than bunnies in every regard...

Since h(m1) = h(m2) = d1, the signature s1 is valid for both your original m1 and my malicious m2.

In order to defend yourself against such attacks, it is vital to choose a strong hash algorithm with high resistance to collisions. This means that it becomes very hard for me to find an m2 where h(m2) = h(m1).

Good choices would include SHA256 and SHA512, as well as tons of others. It seems everyone has some favourite non-mainstream hash functions, but SHA256 and SHA512 have very widespread support and it will be hard for you to find a system that does not support these hashes. And since your files are very small, calculating the hash should be almost instant.

For example, on my 800MHz machine, calculating the SHA512 hash of a 16k random file took 3ms, so even on a toaster it should be relatively quick.


1 You can see the same thing with random number generators. Cryptographic PRNGs aim to deliver random numbers that are really hard to guess, while non-crypto PRNGs aim to just give numbers that look random at first glance and do that fast.

  • 1
    While nothing you're saying is incorrect, you don't mention MD5 anywhere in the answer. This question is about MD5 and verifying file integrity. MD5 is AFAIK perfectly suitable for this purpose (with the caveat that sometimes cracks in algorithm design suggest other possible flaws). – Steve Sether May 29 '18 at 20:19
  • The attack you're describing is a first-preimage attack. The scenario where a collision attack would be relevant would be if someone else gave you a program and asked you to verify there was nothing evil in it and publish or sign a message saying "the program with hash __ is safe.". An attacker could (using a collision attack) contrive two programs that were identical except for the contents of a string literal, and contrive things so that (1) one program was benign and the other was malicious, and (2) both programs would hash to the same value. – supercat May 29 '18 at 21:54
  • 1
    @SteveSether I do mention MD5 when talking about non-cryptographic hashes. If it's solely used to verify file integrity without attacker interference, MD5 is safe to be used, but not the best choice. –  May 30 '18 at 08:34
  • @supercat Indeed, but it was merely an example as to what attacker interference could look like. In either case, it is a good idea to use a strong cryptographic hash algorith, as the ones I have suggested –  May 30 '18 at 09:52
  • I'm downvoting this because the answer implies that MD5 isn't suitable for cryptographic hashes. As forest points out above, it's perfectly acceptable for SOME cryptographic purposes, just not the collision attack that @supercat and forest mention. – Steve Sether May 30 '18 at 14:06
  • @DavidStockinger: I haven't profiled the performance of MD5 versus other hashes, but in cases where performance matters, a faster hash that wouldn't be secure in all use cases but is secure in the one that matters, may be better than a slower hash which would also be secure in irrelevant use cases. – supercat May 30 '18 at 14:51
1

Short answer: No, it is not secure to use MD5 to verify the integrity of files, short or long.

The full answer depends on how confident you are in the distribution of errors.

Is there an independent random chance of bit flips in each position in the file owing to transmission on a slightly lossy channel like a serial port? If so, you could use MD5, but it's much cheaper to use a CRC, which is guaranteed to detect a single bit flip, and can be guaranteed by standard choices of CRC polynomial to detect all odd numbers of bit flips.

But you asked about secure, which suggests you are considering slightly more intelligent adversaries than a lossy serial port. If you're not confident that the errors are independent random bit flips, then don't use MD5, or a CRC. It is very easy for intelligent adversaries to find pairs of distinct files that share a common MD5 hash, or CRC checksum, and in many scenarios this can enable an adversary to forge documents that your MD5 system will not detect. The size of the file is not relevant: it is easy to find MD5 collisions in files as short as 64 bytes, with no limit on how long they can be.

There is a place to discuss the technical differences between collision attacks, preimage attacks, and second-preimage attacks. An answer to a general question about whether it is secure to verify the integrity of files is not such a place. When you have a specific protocol in mind where you can articulate the precise powers of the adversary and exactly how the legitimate users will behave in the protocol, and you have implementation constraints that limit your choice of hash functions so that you must consider MD5, then we can discuss (perhaps on crypto.SE) whether it is safe to use MD5 in that protocol to attain the security you hope to achieve against such an adversary.

But it would be much simpler and safer for you to just use SHA-2, or SHA-3, or BLAKE2.

Squeamish Ossifrage
  • 2,636
  • 8
  • 17
0

The size of the file doesn't make a difference. MD5 is based on Merkle–Damgård construction, which is vulnerable to the length extension attack. 15kb is plenty to do the length extension attack. There are plenty of known collisions and methods to generate MD5 collisions that are just a few hundreds bytes in length, and once a base collision is found, being vulnerable to length extension means that they can be used to generate an arbitrary number of further collisions.

Wai Ha Lee
  • 113
  • 1
  • 7
Lie Ryan
  • 31,089
  • 6
  • 68
  • 93
  • 1
    How is the length extension attack relevant to ensuring files have not been tampered with? All the arbitrary collisions require files actually start out with a collision, after which it can be further extended. It doesn't allow an arbitrary, benign file to be replaced with a malicious file with the same hash. – forest May 29 '18 at 05:23
  • 1
    The explanation of length extension attack that you link to on Wikipedia does not match this situation at all, since it is unlikely to produce a compromised file that would have a /matching/ hash. – mlibby May 29 '18 at 18:12
  • The length-extension attack is designed to guard against the scenario that someone who knows the hash of a message but not its contents can determine what the hash of the message would be if some specific additional content were appended. Giving the attacker a way of figuring out what a file's hash value would be doesn't seem like much of a vulnerability in the OP's scenario. – supercat May 29 '18 at 22:07
0

The size per-se isn't hugely relavent, the actually collision data can be as small as a single block.

However you are much safer with a collection of text files than with a collection of pdfs or similar.

Why? because the results of a collision attack generally result in both files of the pair containing some "random looking garbage". In a rich format this random-looking garbage can be hidden from sight so the attacker can trick the collection administrator into accepting one of their pair of colliding files.

In a text file though, the content is plain for everyone to see.

Peter Green
  • 4,918
  • 1
  • 21
  • 26
  • This is partially incorrect. An MD5 collision attack does not need to involve flipping random bits. You can easily restrict the bits you intend to flip to those that merely change printable ASCII characters to unprintable ones (such as ENQ or NAK). For any non-trivial ASCII file, there will be _plenty_ of bits that you can freely change without corrupting the text visibly. – forest May 30 '18 at 01:36