34

There are many different hash functions, md5, sha, and others. They take a value V and produce a H via transformation Function(V) = H, where Function is md5, sha, etc.

My question is: Does every hash value H have a value V?

For example, given md5 hash value f2c057ed1807c5e4227737e32cdb8669 (totally random), can we find what it was from?

In other words, if we list all hashes:

00000000000000000000000000000000
00000000000000000000000000000001
...
fffffffffffffffffffffffffffffffe
ffffffffffffffffffffffffffffffff

Can we find a value for each one of them?

Edit (from OP's comment): I want to know if there exists an input for each possible output. I'm not interested in finding inverse

user13267
  • 359
  • 6
  • 16
bodacydo
  • 829
  • 9
  • 14
  • 3
    This answer might be different depending on which hash algorithm you are referring to. – Philipp Dec 02 '15 at 18:26
  • 5
    Based on the current answers to your question most understand that you want to find the inverse. I read your question more that you want to know if there exists an input for each possible output of the hash but don't want to know the exact input (i.e. exists vs. value). Maybe you should rephrase your question to make it more clear. – Steffen Ullrich Dec 02 '15 at 18:46
  • You cannot list all possible hash values in the first place. At least not for MD5, to say nothing of algorithms with a larger output. – Xander Dec 02 '15 at 18:52
  • 14
    Related question: 2010-04-17: SO: [*Do cryptographic hash functions reach each possible value, e.g. are they surjective?*](http://stackoverflow.com/questions/2658601/do-cryptographic-hash-functions-reach-each-possible-value-e-g-are-they-surject) – StackzOfZtuff Dec 02 '15 at 20:58
  • @Xander I just listed them 0..0 to f..f. – bodacydo Dec 03 '15 at 00:12
  • 2
    @SteffenUllrich Your're right - I want to know if there exists an input for each possible output. I'm not interested in finding inverse. – bodacydo Dec 03 '15 at 00:14
  • Not individually, you didn't. You only listed 4 out of 2^128 of them. As a percentage, that's not a lot. – Xander Dec 03 '15 at 00:36
  • @Xander I apologize this whole question has become a huge misundersanding and mess. I didnt ask my question right. I don't want to reverse anything or compress anything. I only want to find if space of H's is full, or as they say in mathematics is hash function surjective. (we can find V's so that all possible H's are generated) In MD5 case it's H's from `00000000000000000000000000000000` to `ffffffffffffffffffffffffffffffff`. Only purely theoretical question about H's. Since hash functions do a lot of bit shifting, reversing, bits can get lost, so I wonder if all H values can be generated. – bodacydo Dec 03 '15 at 01:33
  • 7
    Another duplicate, this time on crypto: [Is every output of a hash function possible?](http://crypto.stackexchange.com/questions/2670/is-every-output-of-a-hash-function-possible) – CodesInChaos Dec 03 '15 at 07:08
  • 1
    Crossposted on http://stackoverflow.com/questions/34050166/does-every-hash-value-have-an-inverse-value and http://math.stackexchange.com/questions/1556897/does-every-hash-value-have-an-inverse-value – Reeno Dec 03 '15 at 13:45
  • FWIW, there is a name for the list of hashes + values. It's called a Rainbow Table. Though, a Rainbow Table is not a list of ALL hashes since for example for MD5 that list would need at least around 300000000000000000000000 petabytes of memory. I think that's more than all data on the internet. – slebetman Dec 03 '15 at 21:35
  • @Reeno: I just found the other duplicates, with your comments, by searching for the given MD5 in the question. – A.L Dec 03 '15 at 23:35
  • I know this wouldn't be a mathematical proof, but it would be suggesting to know if this was true for for weak/old/small hash functions. E.g. if you could generalize SHA to some small n (32?) and spend a few hours brute-forcing, do you eventually find all hashes? – Carl Walsh Dec 04 '15 at 01:40

4 Answers4

62

My question is: Does every hash value H have a value V?

For example, given md5 hash value f2c057ed1807c5e4227737e32cdb8669 (totally random), can we find what it was from?

These are actually two very different questions: whether there is an input for each output, and whether we can find an input for each output.

For your first question: we do not know. For a given hash function, the number of possible inputs is a lot bigger than the number of possible outputs, so we would find it very surprising if the function was not surjective (i.e. if there was an output without any matching input). For instance, with MD5, there are 2128 possible outputs, and 218446744073709551616-1 possible inputs, so we expect that each output has, on average, about 218446744073709551488 corresponding inputs. It is rather implausible that there is an output with no corresponding input.

However, we do not know how to prove it. To a large extent, we expect that property to be very hard to prove for any concrete hash function (such a proof of surjectivity would not, per se, be a weakness of the function, but, handwavingly, the security of a hash function relies on the intractability of its structure with regards to that kind of analysis).

For your second question: we strongly hope that it is infeasible. This is what is called preimage resistance: for a given output y, it should not be computationally feasible to find a x such that h(x) = y. Even if it was mathematically proven that such an input must exist (it is not proven, but it is strongly suspected, as explained above), it should still be atrociously expensive to actually find it.

If the hash function is "ideal" then the best possible attack is luck: you try out possible inputs until a match is found. If the output has size n bits, then "luck" should work with an average effort of 2n; with n large enough (and n = 128 is large enough), this is way out of what can be done with existing computers. A hash function is said to be resistant to preimages if "luck" is still the best known attack.

It is known that MD5 is not ideally resistant to preimages, because an attack with effort 2123.4 has been found -- about 10 times faster than 2128, but still a long way into the realm of infeasibility, so that attack is theoretical only.

I would like to point out two things, though:

  • If I give you h(x) for an unknown x, but you have some idea about the x that I used (e.g. you know that x is a "password" that a human user could choose and remember), then you can try possible x values that match that idea. The actual average effort for finding a preimage to h(x) is the smaller of 2n and M/2, where M is the size of the space of possible inputs x. For a "raw" preimage attack, all sequences of bits are possible inputs, so M is huge and the cost is 2n. However, if, in a specific context, x is known to be taken out of a relatively small space, then finding the "right" x can be substantially faster.

  • As pointed out above, preimages are not supposed to be unique: for a given y, a large number of values x should exist, that hash to y. Thus, for a given output, you do not find "the" corresponding input, but "a" corresponding input, which may or may not be the "right" one.


And the third question ? Indeed, trying to make a complete map of inputs-to-outputs is yet something different.

As @Xander pointed out, the number of possible MD5 outputs (2128) is too large to be stored anywhere on Earth; it exceeds the cumulated size of all hard disks that have ever been built. However, if you could solve that storage issue, then it is possible to think about the cost of making a complete map.

If you use the "luck" attack on all 128-bit outputs independently, you may expect an overall cost of 2256 (2128 times a cost of 2128). However, you can do much better by handling all outputs simultaneously, i.e. trying out random (or sequential) inputs and simply populating your big table as you obtain the outputs. With effort 2128, you should get about 63% of your complete map, while the "independent luck" method would need an effort of a bit more than 2255 to achieve the same result.


Edit: as was pointed out by @Owen and @kasperd, the arguments on the number of inputs are not actually sufficient to induce surjectivity; the internal function structure matters. MD5 and SHA-1 are Merkle–Damgård functions, meaning that they are built as follows:

  • There is an inner pseudorandom permutation P: for an input block b of a given size (512 bits in the case of MD5 and SHA-1), Pb is a permutation of the space of sequences of exactly n bits (n = 128 for MD5, n = 160 for SHA-1).

  • A compression function is defined, as:

        f(b, x) = Pb(x) + x

    That is, for block b, we apply the permutation corresponding to b on the second input x, and then we "add" x to the output of that permutation. (In the case of MD5 and SHA-1, that addition is done on a 32-bit word basis, but the details do not matter here.)

  • To process a complete input message m, the message is first padded with extra bits so that the total size becomes a multiple of the block size, and also encodes the original message length. The padded input is then split into successive blocks b0, b1, and so on. A register r is initialized to a conventional value of size n bits (the "IV", specified in the MD5 and SHA-1 standards). Blocks are then processed one by one: to inject block bi, we compute f(bi, r), and the output is the new value of r. When all blocks have been processed, r contains the complete hash function output.

The addition step in the compression function turns the pseudorandom permutation Pb into a pseudorandom function. A relevant consequence is that, for a given value b, f(b, x) is very unlikely to be surjective. In fact, we expect values f(b, x), for all 2n inputs x, to cover only about 63% of all possible 2n sequences of n bits.

This processing has interesting consequences. First, consider all inputs of size exactly 1 GB (the "traditional" gigabyte of exactly 1073741824 bytes): there are 28589934592 such sequences, i.e. a lot more than 2128. However, when applying MD5 on all these messages, they will all be padded with exactly one extra block (8589934592 = 16777216×512, so an extra block of size 512 will be appended), and, furthermore, this final block will be the same for all 1 GB inputs (it encodes the input length but is otherwise deterministic, with no randomness and no dependence on the values of the input bits). Let's call bz that final block. MD5 on a 1 GB input message m thus begins with a lot of processing on the first 16777216 blocks, resulting in a 128-bit value x, and the hash output MD5(m) is equal to f(bz, x).

Therefore, all 1 GB messages ultimately reduce to a single of the same compression function f(bz, x). Thus we expect the hash outputs to cover only about 63% of all 128-bit sequences. This example demonstrates that the argument on the number of inputs to the hash function is incomplete (though it gives the right idea).

Conversely, if we consider all messages of exactly 300 bits in length, then they will all be hashed as f(b, IV), with 2300 distinct blocks b. We thus have 2300 pseudorandom permutations Pb, all applied on the same 128-bit input (the IV), yielding 2300 128-bit results which are all, then, supposed to be uniformly distributed over the space of 128-bit values. Adding IV to all of them does not change that uniformity. In that case, the counting argument works and thus sujectivity becomes highly probable.


Edit 2: About the "63%". When you generate a random value, uniformly, in a space of size N, then the probability of hitting a given value x is 1/N; thus, the probability of not hitting a given value x is (N-1)/N.

Now try it N times: you generate N values randomly, uniformly and independently (in particular, you may generate several times the same value). For a given x, the probability of not being part of these N values is the probability of having been missed N times, i.e.:

    P = ((N-1)/N)N.

This can be approximated as follows:

    P = eN ln (1-1/N) = eN(-1/N + o(1)) = e-1+o(1)

Thus, with big values of N, the probability of any given value being missed approaches 1/e. Therefore, the expected coverage of the space of size N, with N random values, would be close to 1-(1/e). This is approximately 63.21%.

Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
  • 1
    I find this wording a bit strange: "we expect that each output has, on average, about $2^{18446744073709551488}$ corresponding inputs". The average number of inputs per output can be computed exactly, so there is nothing *expected* about that. The expectation is that there are no outputs with a significantly higher or lower number of inputs than the average. – kasperd Dec 02 '15 at 23:28
  • Thanks for answer! I only asked 1 question. The first question. :) Some people misunderstood and thought I asked two questions. I'll edit my question. – bodacydo Dec 03 '15 at 00:16
  • I only want to find if space of H's is full, or as they say in mathematics is hash function surjective. (we can find V's so that all possible H's are generated) In MD5 case it's H's from `00000000000000000000000000000000` to `ffffffffffffffffffffffffffffffff`. Only purely theoretical question about H's. Since hash functions do a lot of bit shifting, reversing, bits can get lost, so I wonder if all H values can be generated. – bodacydo Dec 03 '15 at 01:33
  • I'm wondering if maybe more should be said about why we should expect MD5 to be surjective. For example, suppose I make MD5² by MD5-hashing the input and then MD5-hashing that. MD5² is almost certainly *not* surjective, but it has the same ratio of inputs to outputs as MD5. – Owen Dec 03 '15 at 04:02
  • 4
    @kasperd the average number of inputs per output is what, in statistics, is called an *expected value*. – hobbs Dec 03 '15 at 04:58
  • @hobbs I know, but then it should have said expected value **or** average. Not the expected average. – kasperd Dec 03 '15 at 09:45
  • @Owen The difference is that your MD5² construction has a bottleneck in its structure between the first and second MD5 invocation which you can only get 128 bits of entropy through. And in the second invocation of MD5 you can only lose entropy. If you consider simply MD5 there is also only 128 bits of entropy passed between invocations of the compression function, but in addition to that you have up to 447 bits of payload data fed into the last invocation of the compression function. – kasperd Dec 03 '15 at 09:53
  • 1
    @Owen But if we pay careful attention to the "up to 447" bits of payload part, we realize that for certain input lengths there will be 0 bits of payload data at this stage. And that actually makes it unlikely that all outputs will be possible using just inputs of a certain length. For example most likely there exist a 128 bit value which is not the MD5 hash of any input of exactly 1GB in size. – kasperd Dec 03 '15 at 09:56
  • 1
    @kasperd: indeed the internal structure matters, so I have added some discussion on that subject in my answer. – Thomas Pornin Dec 03 '15 at 13:40
  • 3
    I don't understand how you can say that there are a limited number of inputs (no matter how large). Isn't the list of possible inputs to a hash function infinite (since there is no limit to the length of the input) – SplashHit Dec 04 '15 at 02:35
  • 1
    How did you calculate the 63%? – Neil Smithline Dec 04 '15 at 05:53
  • 2
    @SplashHit The Merkle–Damgård construction uses padding designed such that the length of the message is encoding in the last block being hashed. This length field has a constant size and thus limits how long an input can be hashed. MD5, SHA1, and SHA2 all use two words for the length field leading to a 2EB limit for most of them. Some of the SHA2 variants use 64 bit words and thus in theory can support inputs as large as 35184372088832YB in practice it is unlikely to be physically possible to process such a large input. – kasperd Dec 04 '15 at 11:43
  • @NeilSmithline: I have added some explanations on my answer about the "63%". – Thomas Pornin Dec 04 '15 at 15:58
15

These are actually very common questions that people ask about hash functions. I will give a more mathematical answer, including some terminology to help you with googling.

My question is: Does every hash value H have a value V?

The mathematical way to phrase this question is

For a hash function H = Function(V), does every output H have a pre-image V that maps to it?

or more compactly,

Is a hash function Function() surjective?

Whether MD5, SHA-1, SHA-256, SHA-3, etc, are surjective is a good question and one that has been asked many times on the internet (Google can find you some good discussions). The short answer is: we don't know. We strongly suspect that they are, but this is not something we've been able to prove either mathematically, or through experiments.

To give you an idea of why this is a hard question; this answer on CS.se talks about MD5 and points out that hash functions are designed to be very close to random and avoid any patterns, making any kind of mathematical analysis very difficult. You could always write a program to guess inputs until you have seen every possible output. MD5 has a 128-bit output, meaning there are 2128 hashes you need to hit. Assuming you got them all on the first try, and you could check 1-per-second, it would take you about 1031 years and at least 1028 gb of hard drive space to do the exhaustive check (note that the universe is estimated to be ~ 1010 years old, and the total hard drive capacity on earth is around 1012 gb).

Hash functions in the SHA-family have larger output spaces, and are mathematically more complex, meaning that this kind of analysis is even less tractable for them.

Note also that since a hash function takes in an input of any size, and gives an output of fixed size, there could be (in theory) an infinite number of inputs that will map to any given hash. So a hash value H likely has large set of values {V1, V2, ...} that map to it.


For the second part of your question:

given md5 hash value f2c057ed1807c5e4227737e32cdb8669 (totally random), can we find [the set of inputs that would produce it]?

Finding a string that will produce a given hash is a form of crypanalysis called a Preimage Attack. As others have pointed out, for a hash function to be considered a Cryptographic Hash Function it must be resistant to preimage attacks - or in other words, it must be impossible to invert other than taking the list of all possible strings, hashing them, and seeing if they match.

If somebody finds a shortcut to invert a hash faster than checking all possible strings, this would be considered a vulnerability in that hash function and that hash function would be considered "broken". MD5 is considered broken and is no longer recommended for cryptographic use, so you may be able to find known preimage attacks against it. If you look those up, you may be able to find tools to help you invert your MD5 hashes. The SHA-family of hashes have not yet been broken, so you're out of luck for those.

Mike Ounsworth
  • 57,707
  • 21
  • 150
  • 207
  • Hi :) I only had one question.... That was only example to show what I mean. I only wanted to know about surjectiveness of hash functions. :) Thanks! – bodacydo Dec 03 '15 at 01:28
  • Hi - sorry - I didnt ask my question right. I don't want to reverse anything. I only want to find if space of H's is full, or as they say in mathematics is hash function surjective. (we can find V's so that all possible H's are generated) In MD5 case it's H's from `00000000000000000000000000000000` to `ffffffffffffffffffffffffffffffff`. Only purely theoretical question about H's. Since hash functions do a lot of bit shifting, reversing, bits can get lost, so I wonder if all H values can be generated. – bodacydo Dec 03 '15 at 01:32
2

That is the very purpose of one way hashes (ie md5, sha1, sha2, etc). They aren't supposed to be reversible. If you could reverse a hash lots of security would immediately become insecure. The hash doesn't contain the information from which it was hashed. The process of hashing is easy in one direction and really really difficult in the other direction is why.

Now if you had a lot of content and hashed it, then stored that content with it's hash in a big hashmap then you could quickly reverse things by looking it up by the hash, and find the content that you produced that hash with. This is what is called rainbow tables which has been a viable way to crack passwords in the past, but not so much anymore.

Even if you could, consider this. Say I produced an MD5 hash of a movie that is 100MB. If I could reverse this hash value and get my 100MB movie from it then I'd have an extremely powerful compression algorithm! Because that would mean I could take any content 1Mb, 100MB, 1Gb, 1Tb, etc. and turn that into a hash which is 32 bytes to represent anything I wished. Now could I really represent every content imaginable at any size in 32 bytes? That would be impossible as there's not enough information density in 32 bytes to represent every possible content we could ever dream up as 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456. That would be the upper limit of unique content I could create too.

I think this answer explains it better than I can as to how the math makes this impossible (or at least very very difficult):

Why are hash functions one way? If I know the algorithm, why can't I calculate the input from it?

chubbsondubs
  • 205
  • 1
  • 7
  • Nice pointing the awesome compression algorithm, I never thought of that... – ThoriumBR Dec 02 '15 at 21:33
  • Hi - sorry - I didnt ask my question right. I don't want to reverse anything or compress anything. I only want to find if space of H's is full, or as they say in mathematics is hash function surjective. (we can find V's so that all possible H's are generated) In MD5 case it's H's from `00000000000000000000000000000000` to `ffffffffffffffffffffffffffffffff`. Only purely theoretical question about H's. Since hash functions do a lot of bit shifting, reversing, bits can get lost, so I wonder if all H values can be generated. – bodacydo Dec 03 '15 at 01:29
  • Well as others pointed out we suspect hashes are surjective, but we really don't know. My mention of compression was to propose a straw man argument for why you can't easily reverse a hash because if you could you'd also have an amazing compression algorithm that defies logic that it could exist...and destabilize modern cryptography. I knew you weren't trying to compress anything. – chubbsondubs Dec 04 '15 at 03:38
1

If you are talking about cryptohgraphic hashes, there is no way of recovering back the "V" value used to generate "H" hash. They are engineered in a way to prevent that.

What people do to discover the original "V" value is to generate various "V"s, calculate their respective "H" and compare these "H"s to see which one is equal to the original "H" hash. Yeah, just by bruteforcing it.

DarkLighting
  • 1,523
  • 11
  • 16
  • Sorry. My question is only about surjectiveness of hash functions. – bodacydo Dec 03 '15 at 01:19
  • I don't want to really recover anything or discover V's. My question strictly to find if space of H's is full, or as they say in mathematics is hash function surjective. (we can find V's so that all possible H's are generated) In MD5 case it's H's from `00000000000000000000000000000000` to `ffffffffffffffffffffffffffffffff`). That is all - I've no plan or plans to find V's. Only purely theoretical question about H's. Since hash functions do a lot of bit shifting, reversing, bits can get lost, so I wonder if all H values can be generated. – bodacydo Dec 03 '15 at 01:31