
I am wondering whether there is a known percentage of uniqueness lost with every repeated application of various hash routines, like MD5 and SHA-1, and how that might compare with other algorithms.

If, theoretically, I could store 256^16 × 99% different values and each had a unique MD5 equivalent, hashing 100 times would give me 0.99^100 ≈ 36.6% of the namespace. That is not bad given the vastness of the namespace, but what is the real percentage? Is it more like 90%, or worse? Do you have an alternate recommendation for a slow, general-purpose hash?

I want to ensure that brute force is expensive, and since there may be values with less-than-ideal entropy, I will need to slow the thing down; the obvious solution is to repeat the process an absurd number of times.
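To make that concrete, here is roughly what I have in mind, as a minimal sketch (the round count of 100,000 is just an arbitrary placeholder):

```python
import hashlib

def slow_hash(value: bytes, rounds: int = 100_000) -> bytes:
    """Apply MD5 over and over, simply to make each guess expensive (illustrative only)."""
    digest = value
    for _ in range(rounds):
        digest = hashlib.md5(digest).digest()
    return digest
```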

I am also considering SHA-1, or an XOR combination, but I want to hear your thoughts on this.

700 Software

1 Answer


Space reduction does occur, but not like that.

Secure hash functions are supposed to behave, on average, like a random function (i.e., a function chosen uniformly among the set of possible functions with the same input and output lengths). MD5 and SHA-1 are known not to be ultimately secure (we can find collisions for them more efficiently than we could for a random function), but they are still close enough approximations here (still, you should not use them except for interoperability with legacy implementations; you really should use a more modern and secure function like SHA-256).

So, assuming an n-bit output, if you hash all 2^n strings of n bits, you can expect the outputs to cover about 63.21% of the space of 2^n possible output values (that's 1 − 1/e). You lose a bit more than a third of the space. However, if you rehash all of these obtained values again, you will lose much less than a third of them. The reduction factor is not constant from round to round; after a few rounds, the extra reduction per round is very small.
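You can see this on a toy scale with a quick simulation (SHA-256 truncated to 16 bits stands in for a "random" n-bit function; the exact figures vary a little from the asymptotic ones because the space is small):

```python
import hashlib

def h16(x: int) -> int:
    """SHA-256 truncated to 16 bits, standing in for a 'random' 16-bit function."""
    digest = hashlib.sha256(x.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")

space = 1 << 16
values = set(range(space))             # start from all 2^16 possible inputs
for round_no in range(1, 6):
    values = {h16(v) for v in values}  # one round of hashing the whole surviving set
    print(f"round {round_no}: {len(values) / space:.4f} of the space remains")
```

The first round lands near 1 − 1/e ≈ 0.632, and each later round removes noticeably less than the one before.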

When you hash a given value many times successively, you are actually walking a “rho structure”, as in the figure below:

[Figure: rho structure when walking in a random function graph]

This structure is so named because it looks roughly like the Greek letter ρ, “rho”. Each dot is an n-bit value, and each arrow represents the application of your hash function. The blue dot is your starting point. So the idea is that by applying the function sufficiently many times, you end up in a cycle. Once on the cycle, successive applications of the hash function no longer reduce the space of possible values: you just walk around the cycle, indefinitely. The cycle length is then the minimal space size that you can ever achieve.

The lengths of both the “tail” (the values you go through before entering the cycle) and the “cycle” itself are, on average, about 2^(n/2). So, for a hash function with a 128-bit output, successive applications will not reduce the space below 2^64, a quite large value; and even reaching that minimal space will take, on average, 2^64 hash function evaluations, which is expensive.


Finding the cycle by walking the rho structure is actually the basis for Floyd’s cycle-finding algorithm, one of the generic algorithms that can be used to build collisions for a hash function: in the rho structure, at the point where the tail attaches to the cycle, there is a collision (two distinct values which hash to the same value). For a cryptographically secure hash function, finding collisions is supposed to be hard; in particular, Floyd’s algorithm should be computationally infeasible, which is achieved by making n big enough (e.g., n = 256 for SHA-256: this raises the cost of Floyd’s algorithm to 2^128, i.e., not realistically feasible on Earth).
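As a toy illustration of that rho walk, here is a sketch of Floyd’s algorithm run against SHA-256 deliberately truncated to 32 bits, so the expected tail and cycle of roughly 2^16 steps finish in a couple of seconds at most; against a full-width output, the same walk would cost around 2^128 steps:

```python
import hashlib

def h32(x: bytes) -> bytes:
    """SHA-256 truncated to 32 bits -- deliberately weak so the rho walk is short."""
    return hashlib.sha256(x).digest()[:4]

def floyd_collision(start: bytes):
    """Walk the rho structure with a slow and a fast pointer until they meet,
    then back up to the junction of tail and cycle, where two distinct values
    hash to the same output. (Ignores the unlikely case where the start point
    already lies on the cycle.)"""
    tortoise, hare = h32(start), h32(h32(start))
    while tortoise != hare:                       # phase 1: meet somewhere on the cycle
        tortoise, hare = h32(tortoise), h32(h32(hare))

    tortoise = start                              # phase 2: step both to the cycle entry
    while h32(tortoise) != h32(hare):
        tortoise, hare = h32(tortoise), h32(hare)
    return tortoise, hare                         # distinct inputs with the same hash

a, b = floyd_collision(b"blue dot")
assert a != b and h32(a) == h32(b)
print(a.hex(), b.hex(), "->", h32(a).hex())
```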

In other words: if you select a secure hash function, with a wide enough output (a prerequisite for being secure), then you will not be able to reach the cycle and/or walk it, and your security will not suffer from any “space reduction”. Corollary: if you have a problem of space reduction, then you are not using a secure hash function, and that is the problem you need to fix. So, use SHA-256.
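If the goal is simply slow, repeated hashing built on SHA-256, a minimal sketch could look like the following (the round count is an arbitrary placeholder; folding a round counter into each step, as one of the comments below also points out, keeps the walk from ever settling into a cycle of the plain function graph):

```python
import hashlib

def iterated_sha256(value: bytes, rounds: int = 100_000) -> bytes:
    """Repeatedly apply SHA-256, mixing the round counter into every step."""
    digest = value
    for i in range(rounds):
        digest = hashlib.sha256(i.to_bytes(8, "big") + digest).digest()
    return digest
```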

Tom Leek
  • Even if you use a variant that avoids cycles (say `h_i = hash(i || h_{i-1})`), the entropy loss becomes really small once only a small part of the hash space is used. – CodesInChaos Mar 11 '13 at 08:13
  • Fantastic answer! – Max Jun 25 '14 at 20:24
  • IMHO, it's a little bit misleading to call the structure a rho structure. It's usually more like hairy furballs. The strength of a hash function under very large iteration is basically determined by the number of separate furballs and the size of the furballs. The length of the hairs is only relevant when the number of iterations is very low. – Lie Ryan Jun 25 '14 at 22:53