Reducing hash output length

Question

I am using output of (keyed) hash function to make sure the input was not modified, I store it along with input. Now output of SHA-1 is 20 bytes. Say I want to reduce the output. And have it output 10 bytes - I know I can truncate it (e.g., just take first 10 bytes of 20 bytes).

But my question is: how much can I truncate output so that it is still reasonably secure? (assuming like I said I use this for verification, e.g., make sure string that I hashed didn't change).

Can I reduce it to 8 bytes say? Is it still reasonably secure? (for my needs)

Edit: like I said I want to make sure attacker does not change the string- hash of which i store. In other words I am interested do collision attacks apply in my case? These strings I need to protect are meaningful english sentences. This is how my question is different from so called duplicate- there is no such information.

Ps Existing answer does not fully address this

@M'vy: I've seen similar questions just I am not sure how much truncating affects my case, e.g., when I am using it for verifying if its input was modified. I guess collisions is not an issue for me? — , Mar 20 '15 at 15:02
@user30024 Of course collisions are an issue. If there are collisions, you have a different message which is accepted as correct. Why do you want to reduce it, anyway? 20 bytes isn't exactly a lot to store or transmit. — Polynomial, Mar 20 '15 at 15:09
@Polynomial: depends in some cases 20 bytes is a lot. the answer here says collisions is not an issue: http://security.stackexchange.com/questions/47445/how-secure-is-a-partial-64bit-hash-of-a-sha1-160bit-hash. But I don't understand why. Anyway that is why I was interested for my specific scenario - where I use hash for checking if input was modified, whether reducing it to 8 bytes is OK? — , Mar 20 '15 at 15:11
@user30024 Thomas didn't say collisions don't matter. He's saying there's a difference between a preimage and a collision. In your case you're talking about checking for modified input (collisions matter, preimages are irrelevant) whereas that question concerned hashing passwords (preimages matter, collisions are mostly irrelevant). — Polynomial, Mar 20 '15 at 15:13
@Polynomial: That is why I wanted explanation which matters in my case - collisions? Or preimage? I had intuition it was collision that didn't matter in my case. So if someone can explain that would be nice. PS. Like I said-I store some string and hash of it along -with the goal to make sure someone didn't change the string later. — , Mar 20 '15 at 15:15
I still don't follow why you need to shrink the size of the hash down to 10 bytes. What's the reasoning behind this? You've asked whether it's "reasonably" secure "for your needs", but you haven't defined your needs, your threat model, your application's function, or the reason for reducing the hash size in the first place. — Polynomial, Mar 20 '15 at 15:15
@Polynomial: It will increase size too much I need to store many such hash codes. ps. Again I think the question I linked applies to my case? I am also using it for verification like in that question - I am verifying if someone changed the string — , Mar 20 '15 at 15:16
Please go back and edit your question to include the details I mentioned above. At the moment this is a vague question which cannot be answered, and as it stands it's a direct duplicate of the one M'vy linked. — Polynomial, Mar 20 '15 at 15:18
@Polynomial: Then from that question I have impression collisions is not important for me(why you say opposite?). For attacker it makes sense to take some existing hash which I produced and then try make another sentence which creates similar hash - this is 2nd preimage attack right? Not collision. — , Mar 20 '15 at 15:23
@user30024 Collisions are only irrelevant if the attacker has no control over the "legitimate" data. This is true with passwords; an attacker with control over your password already has your password. The term "verification" in the linked question is if the hash on the real data was already computed, and the attacker didn't get to control the real data. If the attacker can send legitimate data as well, you need collision resistance. — cpast, Mar 20 '15 at 18:48
Here's a few questions, which need to be answered to know what properties you need: Can an untrusted party also send legitimate data that's hashed normally, or tamper with the data from a trusted user before it gets hashed? If not, can they influence the data a legitimate party will send? If the attacker can send real data, you need collision resistance. — cpast, Mar 20 '15 at 18:50
I will use keyed hash so attacker should not be able to directly generate hash — , Mar 20 '15 at 18:56

score 3 · Answer 1 · answered Mar 20 '15 at 15:27

Answer depends on your security model.

Classically, a cryptographic hash function has three properties:

It resists preimages: given y, it is infeasible to find x such that h(x) = y.
It resists second preimages: given x, it is infeasible to find x' such that x ≠ x' and h(x) = h(x').
It resists collisions: it is infeasible to find x and x' such that x ≠ x' and h(x) = h(x').

For a "perfect" hash function with an output of n bits, the resistance is up to effort, respectively, 2ⁿ, 2ⁿ and 2^n/2 (regardless of how strong the function is, "luck" still works with a small probability, and that gives these average costs for finding a preimage, a second preimage, or a collision).

When you truncate the output of an existing hash function, you are in fact defining a new hash function, and since that hash function has a smaller output, its resistance is correspondingly smaller.

The good question is then: which of these properties are you relying upon ? This depends a lot on the context.

For instance, suppose that the attacker's goal is to alter an existing piece of data, without being detected through a hash function mismatch. This is a second preimage situation: attacker sees a message m and tries to find a modified message m' that hashes to the same value. In that case, truncating to 10 bytes means that you still have resistance 2⁸⁰, a huge number that will deter all but the most determined or irrational attackers. Note, though, that an attacker may have a choice of targets: if the attacker sees 100 messages, and wants to modify one of them (with no preference over which can be modified), then "luck" works 100 times better. Thus you may want some extra security margin and keep, say, 12 or 13 bytes of output.

Now if the attacker can get to inject his own innocent-looking message m, that you validate and hash, and will later on replace it with a distinct message m' with the same hash value, then this is a question of collisions, thus working with resistance 2^n/2 for a n-bit output. In that case, truncating the hash function output is a quite bad idea.

The safe course is not to truncate anything. After all, as I write above, truncating the output is equivalent to designing your own hash function, and, in all generality, this is very hard to do properly. SHA-1 has a 160-bit output precisely so that you get "at least 2⁸⁰ security" in all cases. Determining whether collisions apply to your situation or not can be hard.

Also, remember that a hash value does not create integrity; it just concentrates the issue. When you hash a piece of data, you get a hash value, and that hash value will guarantee that the original data is unchanged only insofar as you can be sure, through some other means, that the hash value itself was not altered. If you store the hash value along with the data that is hashed (your "input"), and the attacker is assumed to be able to alter the stored data, then what exactly would prevent him from modifying the hash value as well ? Whenever he wants to put m' instead of m in some database cell, he may also put SHA-1(m') into the neighbouring cell, overwriting the SHA-1(m) that was there, and you will be none the wiser.

If you want to use hash values to protect data integrity, then you must ensure the integrity of the hash values themselves in some other way. Hash functions reduce the problem: instead of protected a 1-gigabyte file, you "just" need to protect a 20-byte hash value. But you still have to do something.

Cryptography can help you further concentrate things by using a MAC instead of a hash function. With a MAC, you have a secret key, and you can use one key for millions of MAC computed over millions of "inputs". There again, you still have to do something, but that something is: keep confidentiality of a single key (of, say, 128 bits) for the whole of the server.

Thank you I know I need to protect hash itself that is why I mentioned keyed hash. Anyway I am concentrating on attacker model now. If I only care about 2nd preimage attack is 8 byte Ok? In my case attacker may get hash which I generated and try to create some other sentence of his liking which has similar hash value. I don't see how collisions is useful for attacker in my case - because if he generates some random message (m,m` which have same hash - e.g., collision attack) that might not even make sense in my application — , Mar 20 '15 at 15:37
how would he carry out collision attack in practice in my case? so that it also benefits him? (maybe you can concentrate on that) ps. I know I need to secure hash value itself. Thanks. — , Mar 20 '15 at 15:41
For example the messages which I am trying to protect using hash functions need to be meaningful english sentences - does not that eliminate collision threat? How come random message will be meaningful English sentence? — , Mar 20 '15 at 15:54
Regarding meaningful text sentences, those are even more enabling of manipulation because humans can easily be fooled. An attacker might attempt to generate collisions by appending or inserting whitespace, non-printable characters, or by inserting Unicode glyphs that look like English characters. — John Deters, Mar 20 '15 at 17:58
@John Deters: are you sure? What is the chance to produce two distinct messages that give same hash AND which are both meaningful? Further some sentrnce may make sensr only if it fits application context — , Mar 20 '15 at 18:04
@user30024 That depends on the collision-resistance of your system. In general, it is a terrible, terrible, terrible idea to assume there's a big gap between "collision" and "useful collision." — cpast, Mar 20 '15 at 18:38
Yes but does not it reduce attack surface? Now not only must attacker find two messages giving same hash but these messages must be meaningful english . If collisions make sense at all in my case. Attacker should try to generate some new message of his desire for given hash — , Mar 20 '15 at 18:58
I used that as an example because some of the broken hash algorithms had collisions that required just a few bytes of permutation at the end. (The attack that forged certificate signatures by exploiting this weakness in MD5 a few years back took advantage of this.) — John Deters, Mar 20 '15 at 21:13
@John Deters: that doesnt answer my question please read my above comments carefully — , Mar 21 '15 at 09:55

score -1 · Answer 2 · answered Mar 20 '15 at 15:13

-1

Why not just use a CRC-token? or alike. without knowing what it is you are securing against with the hash no-one can tell you what to do (and the least amount of data added for verification the better for speed).

truncating a hash does not help in general. because all you do is expose more of it to collision attacks.

answered Mar 20 '15 at 15:13

LvB

8,217
1
26
43

CRC is not a cryptographically secure measure against modification. It's a checksum designed to detect corrupted data, not forging. – Polynomial Mar 20 '15 at 15:14
If your truncating a hash, your not interested in cryptographic security at all (its what you sacrifice with the truncation). so basic data-corruption is enough to detect it (especially if you mix it in with your data value and use more than 1 bit, that was just for pointing out the minimum). Making it more complex without cause is called bad programming and makes it less secure. If you just want to protect a small data set I owuld use a scheme like 3-2-3 (first 3 bits + 1 CRC and 1 CRC + last 3 bits) its harder to see than just the end bit. and not worse than sha1 for small data sets. – LvB Mar 20 '15 at 21:57

Reducing hash output length

2 Answers2

Linked