For common hashes, are their collision risks similar when considering only the first N bytes?

Question

Question in short: If I consider just the first 16 bytes of SHA-1 and SHA-256 hashes, do they have substantially the same collision risk?

Background: I have an application where I need a 16-byte (exactly) hash of a short string (a few bytes to tens of bytes). My plan is to use SHA-1 and simply truncate to 16 bytes. A colleague will say "Don't use SHA-1 -- it's BROKEN!"
My belief is that for SHA-1 and SHA-256 (for example), the first 16 bytes will have very similar collision probabilities. (Furthermore, since the input message is short, any collision you can find would probably NOT be of a similar length. Is that true? I don't care if an attacker can find a 200 byte message that gives a hash collision.)

This question addresses the actual collision probability for the first N bytes for MD5 in particular, making the rather strong assumption that the hashes would be uniformly distributed in the first N bytes. If that is a good assumption, wouldn't it be true that any well-designed hash algorithm would have the same collision risk when considering just the first N bytes? (Obviously, I am quite capable of designing a very bad hash algorithm that gives lots of collisions in any number of bytes. But I'm considering only well-known and well-tested hash algorithms.)

My claim: If I'm going to use only the first 16 bytes, it doesn't matter if I choose SHA-1, SHA-256, or SHA-OneZillion.

(This does seem like a question that would have been asked a hundred times, but I could not find it. I apologize if I missed it, and will be grateful if someone points me to previous postings.)

I think the search term you want is "truncate": https://security.stackexchange.com/questions/72673/how-bad-is-it-to-truncate-a-hash — schroeder, Feb 27 '22 at 19:51
What is your application? Do you really need collision resistance or do you just need pre-image resistance? SHA-1 is out of the league of collision resistance - truncated or not. It seems you need a fast one then use BLAKE2b which is even faster than MD5.. — kelalaka, Feb 27 '22 at 21:49
Also, truncated to 16-byte SHA-2, SHA-3 is not secure against standard collision attack that requires 2^64 tries with 50% probability. And actually, 50% is too high for the advantage of an adversary. They even will try for 0.01.% probability that requies way lower tries. — kelalaka, Feb 27 '22 at 21:57
@schroeder --Thank you! You're exactly correct. But I'm actually glad I did not find that question, because the answers here answered questions that I didn't even think to ask. — Still_Learning, Feb 28 '22 at 18:19
Please do not add editorials to questions. You ask to be pointed to previous postings. That's all I did. — schroeder, Feb 28 '22 at 18:56

score 3 · Accepted Answer · answered Feb 28 '22 at 02:47

A cryptographically secure hash function is typically expected to be indistinguishable from random. That means, effectively, that the outputs are going to be uniformly distributed, and therefore that every set of bytes is equally likely. For such a hash function, a truncation should provide the normal strength (2^n for preimage attacks and 2^(n/2) for collision attacks), unless the hash function states otherwise in its design.

However, we know that SHA-1 is not a cryptographically secure hash function. With a 128-bit (16-byte) output, we'd expect a collision probability of 2^-64, but instead, we know that we can simply perform a collision on the entire output for 2^63.4. SHA-1 is not going to become stronger, and it is likely that this attack could be improved in the future. Hence, using SHA-1 would be irresponsible. In general, using SHA-1 in any new system is irresponsible when there are better alternatives.

Because the cost of performing a 2^63.4 attack is approximately USD 45,000, you should also know that for any hash function with a 128-bit output, a brute-force collision attack on a secure hash function is not substantially harder than that, and therefore if your application requires collision resistance, it is vulnerable to attack by any well-paid software developer, as well as any corporation or government of non-trivial size. Thus, you cannot use this size output if collision resistance is important, since your code will contain a security vulnerability.

If you don't need collision resistance, only preimage resistance, then a 128-bit output is fine. However, if you're not sure, you should adopt an output length of 32 bytes (256 bits).

I should note that in most cases, a collision will in fact be about the same size. Collisions typically operate on an identical number of blocks with small differences between them, since the way these attacks tend to be developed lends itself to this approach.

While you can truncate SHA-2, my recommendation is to use BLAKE2b. It can be configured with various different output lengths up to 512 bits, is cryptographically secure, and is extremely fast even in plain C. In fact, its specific design goal was to be a replacement for MD5 and SHA-1 which was both faster and cryptographically secure, which it has succeeded at. The only time you would want to necessarily avoid it is if you had regulatory requirements mandating certain hash algorithms, but those requirements would also instruct you not to use SHA-1, so I expect that's not the case here.

score 2 · Answer 2 · answered Feb 27 '22 at 21:44

The probability of an accidental collision will be the same, but there are known (non-accidental) ways to find collisions in SHA-1, which will also apply to any truncated version of it. Whether this is a risk in your application would require a detailed analysis of how your application uses the hash, what the relevant threat models are, etc.

But even if that analysis shows your application isn't vulnerable, your application may shift or expand over time, and so may the relevant threat models, and attacks on cryptographic primitives (like hashes) only get stronger over time, so even if it's safe to use a truncated SHA-1 hash now, it may not be later. It's better to use a primitive like SHA-256 that just doesn't have the weakness than to use a weak primitive and hope the weakness doesn't turn into a problem down the road.

So: don't use known-weak cryptography unless you need to for compatibility. Just don't.

For common hashes, are their collision risks similar when considering only the first N bytes?

2 Answers2