
For my work I'll need to provide my customer a specific file which will be the result of the work I have done for them.

To protect the integrity of the work I have done and to guarantee it has never been modified, I intend to add a checksum to my documentation which will be provided with the file.

Since MD5 and SHA-1 have not been considered secure for a long time now, I was wondering whether they are still used for this purpose or whether there are better algorithms that do the same job more safely.

I'm looking for the most reliable solution. I'm aware that 100% proof will never be possible, but I was wondering whether MD5 is still rated "good" for this purpose or whether there are newer and safer tools.

zahypeti
gouaille
  • SHA256 is the recommendation these days, but your question makes it unclear this actually accomplishes anything. If the hash is provided in the same communication channel as the file, what prevents an attacker from replacing the hash along with the file? What is the hash trying to accomplish here? – AndrolGenhald Nov 28 '18 at 16:06
  • Hi @AndrolGenhald, thanks for your answer. My work can vary a lot, but here is the theory with an example: if my work is a video, I'll hand it over on a USB key with paper documentation. The paper documentation will mention the hash of my file so that it can be compared in case someone says my video was edited after delivery – gouaille Nov 28 '18 at 20:03
  • For an off-the-shelf solution: consider using a service such as [Notarizer](https://notarizer.app/). Disclosure: I made this – kemp Nov 29 '18 at 21:01

2 Answers


Choice of hash algorithm

Use SHA-256 or SHA-512: either of the two “main” members of the SHA-2 family. SHA-2 is the successor of SHA-1 and is considered secure. It's the hash to choose unless you have a good reason to choose otherwise. In your case the choice between SHA-256 and SHA-512 does not matter. There is a SHA-3, but it isn't very widely supported yet and it isn't more secure (or less secure) than SHA-2; it's just a different design.

Do not use MD5 or SHA-1. They are not obviously unsuitable in your scenario, but they could be exploited with a bit of extra work. Furthermore, the fact that these algorithms are already partially broken makes it more likely that further breaks will be found over time.

More precisely, for both of these hashes, it is possible to find collisions: it is possible to find two documents D1 and D2 such that MD5(D1) = MD5(D2) (or SHA-1(D1) = SHA-1(D2)), and such that D1 and D2 each end with a small bit that needs to be calculated and optionally a common chosen suffix. The bit that needs to be calculated will look like garbage, but it can be hidden in a comment, in an image that's shifted off-page, etc. Producing such collisions is trivial on a PC for MD5 and is doable but expensive for SHA-1 (unless you want it for two PDF files, in which case researchers have already spent the money on the calculation to find one and published it).

In your scenario, you mostly don't care about collisions, because you'll be producing D1. You aren't going to craft this bit in the middle. However, there's a risk that somebody could trick you into injecting this bit, for example by supplying an image to include in the document. It would be pretty tricky to achieve a collision that way, but it's doable in principle.

Since there's risk in using MD5, and zero benefit compared to using SHA-256, use SHA-256.
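As a minimal sketch of producing the value that would go into the delivery documentation, the following uses Python's standard `hashlib` module. The function name `sha256_of_file` and the chunk size are choices made for this illustration, not anything prescribed above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 digest of a file as a lowercase hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that a large deliverable (a video, say)
        # does not have to fit in memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("deliverable.bin")` (a placeholder file name) returns a 64-character hex string that can be printed on the paper documentation. Swapping `hashlib.sha256` for `hashlib.sha512` would give the SHA-512 variant.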

What to do with a hash

With a non-broken cryptographic hash like SHA-256, what you know is that if two files have the same hash then they're identical. Conversely, this means that if two files have different hashes, then they're different. This means that if you keep a trusted copy of the hash (for example you print it out and store it, or notarize it), then you can tell later “yes, this file you're showing me is the same file” or “no, this file you're showing me is different”.
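The “same file or different file” check described here could be sketched as follows, assuming the trusted hash was retyped from a printout (hence the whitespace and case normalisation). The function name `file_matches_hash` is hypothetical.

```python
import hashlib


def file_matches_hash(path, trusted_hex):
    """Return True if the file's SHA-256 equals the trusted hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    # A value copied from paper may contain spaces or upper-case letters,
    # so normalise it before comparing.
    expected = "".join(trusted_hex.split()).lower()
    return digest.hexdigest() == expected
```

A result of `True` means the file presented is the one that was hashed at delivery; `False` only says that it differs, not how.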

Knowing the hash of the file doesn't prove that you wrote it. There's no cryptographic way to prove authorship. The best you can do is to prove that you had the file earlier than anyone else who can prove it. You can do that without revealing the file by communicating the hash to a third party who everyone trusts to correctly remember the date at which you showed them the hash; this third party could be a public notary, or the Wayback Machine if you put the hash on a web page that it indexes. (If you publish the hash, then in theory someone could figure out the file from it, but there's no better way to do that than to try all plausible files until they find the right one. If you are concerned about this then use a signature of the file instead of a hash, and notarize the signature and the public key but keep the private key to yourself.)
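To illustrate the parenthetical about figuring out a file from its published hash, here is a small sketch of the “try all plausible files” approach. The candidate documents and the function name `find_matching_candidate` are invented for the example; the point is only that a hash hides nothing when the set of plausible contents is small.

```python
import hashlib


def find_matching_candidate(published_hex, candidates):
    """Return the first candidate whose SHA-256 equals published_hex, else None."""
    for candidate in candidates:
        if hashlib.sha256(candidate).hexdigest() == published_hex:
            return candidate
    return None


# Hypothetical scenario: the published hash covers a short, guessable document.
published = hashlib.sha256(b"Offer accepted: 1500 EUR").hexdigest()
guesses = [
    f"Offer accepted: {amount} EUR".encode()
    for amount in range(1000, 2001, 100)
]
print(find_matching_candidate(published, guesses))
```

For a large file such as a video, the space of plausible contents is far too big for this kind of guessing to work, which is why the concern only applies to short or highly predictable documents.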

Example of something a hash is good for: your customer wants support, but you're only prepared to support your original product and not a modified product. So you get them to calculate the hash of what they want you to support. If the hash value is not what you provided, you refuse to provide support. Note that you need to trust the customer to calculate the hash of the product, and not calculate the hash of some copy of the original or read it off the delivery slip.

Example of something a hash is not good for: somebody else claims that they're the author of the document. You say “no, look, I know its hash, it's 1234…”. That doesn't help: anybody can calculate the hash.

Example of something a hash is good for if used appropriately: somebody else claims that they just wrote the document. You say “no, look, I notarized the hash last year, so you can't have written it last week”.

Example of something a hash is not good for: somebody makes a slight modification of the document. It'll then have a different hash. All you can say is that the document is now different, but that doesn't convey any information about how different they are. The hash of a completely different document is just as different as the hash of a version with a typo fix, or a version that's encoded differently.
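A short sketch of why the hash says nothing about how different two documents are: it counts how many of the 256 output bits differ between two SHA-256 digests. The sample strings and the function name `differing_bits` are made up for this illustration.

```python
import hashlib


def differing_bits(data_a, data_b):
    """Count the bits that differ between the SHA-256 digests of two inputs."""
    hash_a = int.from_bytes(hashlib.sha256(data_a).digest(), "big")
    hash_b = int.from_bytes(hashlib.sha256(data_b).digest(), "big")
    # XOR leaves a 1 in every position where the two digests disagree.
    return bin(hash_a ^ hash_b).count("1")


original = b"Insert the plug into the socket."
typo_fix = b"Insert the plug into the socket!"
unrelated = b"A completely different document."

print(differing_bits(original, typo_fix), "of 256 bits differ after a one-character change")
print(differing_bits(original, unrelated), "of 256 bits differ for an unrelated document")
```

With a well-behaved hash, both counts are expected to land somewhere around half of the 256 bits, so the size of the difference between two hashes gives no indication of the size of the edit.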

Gilles 'SO- stop being evil'
  • Signature alone doesn't prove that you wrote it, only that you saw it. (Assuming the signature scheme is unbroken and your key uncompromised, of course.) If the signature is (reliably) _timestamped_ at an earlier time than anyone else can establish, or close enough to minimum time of creation based on the contents (e.g. it includes lottery results for $date), _that_ proves authorship. – dave_thompson_085 Nov 29 '18 at 12:29
  • As mentioned on the [shattered.io](https://shattered.io/) website, the SHA-1 collision is not exactly a collision; it is called an *identical-prefix collision attack*. You need some freedom to execute it, see page 3 of the article. – kelalaka Nov 29 '18 at 18:28
  • @kelalaka An identical-prefix collision attack is a collision. – Gilles 'SO- stop being evil' Nov 29 '18 at 19:47
  • If you must use older hash algorithms for compatibility (for example MD5 and SHA-1) then consider using both and verifying multiple checksums. – David Nov 29 '18 at 22:51
  • SHA-3 is more secure than the SHA-2 family in some circumstances, such as when length extension attacks are possible. Of course, that won't necessarily matter for OP's threat model. – forest Nov 30 '18 at 06:59
  • @forest SHA-3 is more secure than SHA-2 when you use it as something other than a hash. (Namely, SHA-3 is also good as a pseudorandom function, whereas SHA-2 isn't, because of the length extension property.) In the OP's case, he's only using it as a hash, so SHA-3 has no security benefit. – Gilles 'SO- stop being evil' Nov 30 '18 at 08:24
  • You wrote "With a non-broken cryptographic hash like SHA-256, what you know is that if two files have the same hash then they're identical.". This is wrong. The SHA256 hash is only 32 bytes, so there is more than one input data longer than 32 bytes which results in the same hash. It is just very unlikely to get such a collision by coincidence, and there is no known way to compute such a collision at the moment. – Frank Buss Feb 20 '21 at 15:35
  • @FrankBuss No, my statement is correct. Sure, there exist two distinct files with the same hash, but you are not going to find any such pair of files. If you do find two files that have the same hash, you can be sure that they are identical. The probability of accidentally finding a collision is less than the probability of a bit flip in the hardware, and nobody knows how to improve on these odds. Fearing an accidental collision is falling prey to a misjudged risk fallacy. – Gilles 'SO- stop being evil' Feb 20 '21 at 17:04
  • @Gilles'SO-stopbeingevil' As you confirm, there exist two different files with the same hash. I was specifically referring to the part "if two files have the same hash then they're identical". This is clearly wrong. You are right that in practice you can be sure they are identical, but this doesn't change the fact that the part of the sentence I cited is wrong. – Frank Buss Feb 21 '21 at 19:13

For ensuring that work product is unchanged, even MD5 is reasonable.

The ability of an attacker to engineer a collision is dangerous when they may, for example, generate an executable. That executable may take 500 Kb to do something bad, and spend another 50,000 Kb spinning out unused bits just to get the collision. That's okay if those bits are unused; you simply see an executable with the right hash, and you're fooled.

To engineer a collision that both matches the MD5 hash -and- represents credibly incorrect documentation is not feasible. You're more likely to end up with documentation that reads "Take the plug and insert it into the $#WG%ga 940[2aj2'rj09[3j59g;qa1j; socket" - anybody who looks at that will realize the documentation has been tampered with. Even a phased array of Shakespearean monkeys can't spin an MD5 collision that still looks like documentation.

Looking more closely, I see it's not the documentation you're protecting; you'd include the hash of the "specific file which will be the result of the work I do for them". Again, without knowing what that file is (executable? source code?), it is computationally infeasible for them to modify it in such a way as to credibly claim it is what you gave them and engineer a hash collision at the same time.

See also this answer on Crypto.SE which summarizes:

MD5 is currently considered too weak to work as a cryptographic hash. However, for all traditional (i.e. non-cryptographic) hash uses MD5 is often perfectly fine.

You're not looking at a cryptographic use of a hash, so MD5 is fine for you. It will prevent the work product you've provided them from being swapped for a modified or credibly forged version.

gowenfawr
  • This answer is quite misleading about how hash collisions behave. A collision doesn't take anywhere near 500 Kb to produce. That estimate is off by three orders of magnitude. There are collision attacks which need just two blocks, which in case of MD5 and SHA1 is just 128 bytes. And most file formats have chunks of data that aren't immediately visible to the end-user and thus are suitable for a collision. The first meaningful collision against MD5 with two postscript files was demonstrated in 2005. – kasperd Nov 28 '18 at 16:50
  • @kasperd The numbers were not meant to be literally meaningful, merely to illustrate that collisions are only easy when there is sufficient 'fluff' space. That said, what you say is definitely interesting, I'd love to see a reference for the 2005 collision and see what they managed? Would also be useful to know what the "work product" really is in this case, C has less fluff space than PDF. – gowenfawr Nov 28 '18 at 16:54
  • It was in the rump session of Eurocrypt 2005. The Wikipedia article on MD5 has a link to the presentation: https://web.archive.org/web/20100327141611/http://th.informatik.uni-mannheim.de/people/lucks/HashCollisions/ – kasperd Nov 28 '18 at 17:01
  • @kasperd fascinating - but I wonder if the fact that they altered both "official" and "forged" document metadata makes the attack easier than one where they can only alter the "forged" document. (e.g., they didn't collide with Caesar's document; they collided with Alice's re-rendering of Caesar's document). – gowenfawr Nov 28 '18 at 17:06
  • That's the difference between collision attacks and second preimage attacks. Collision attacks have been demonstrated against both MD5 (in 2004) and SHA1 (in 2017). No feasible second preimage attack has yet been demonstrated against either. However it's easier to switch to a stronger hash (such as SHA2 or SHA3) than to prove that your threat model only involves second preimage and not collision attacks. – kasperd Nov 28 '18 at 17:13
  • @kasperd I still believe MD5 is sufficient for this user's threat model ("keeping honest clients honest"), but I strongly encourage you to write up a competing answer - I think you have good points that people should be considering, and clearly more expert knowledge of hashing nuances than I do. – gowenfawr Nov 28 '18 at 17:27
  • Preventing forgery *is* a cryptographic use. So is any attack by an intelligent adversary. Non-cryptographic usages involve protection against random (not necessarily independent -- burstiness is very common -- but not intelligently chosen) errors. – Ben Voigt Nov 29 '18 at 20:15