Is my understanding of SHA1 correct?

Question

My understanding is that SHA1 is a 'shared' hashing algorithm. As in, if somebody on the other side of the world on an entirely different system runs a string through SHA1, it will result in the same output as if I did on my machine, every time.

In which case, is the lifetime of SHA1, generally, very finite? In that, eventually all outputs can be attributed to particular strings with a 'brute force' type of approach?

I appreciate that, if this is correct, the above approach would take massive effort/processing time (similar to attempting to hit a matching pair of guids), but I'm just asking conceptually.

Yes, that is why passwords should be [salted](http://security.stackexchange.com/a/31846/28626). — grc, Oct 14 '16 at 09:31

score 8 · Accepted Answer · answered Oct 14 '16 at 09:40

You are correct that SHA1 is deterministic - i.e. if you give the algorithm the same input, it will produce the same output not matter what computer you run on, where in the world you are, what year it is, the color or your skin or the content of your character.

There are an infinite number of inputs (any string), but only a finite number of outputs (160-bit string). Since the outputs are finite (just finite - there is no such thing as "very finite"), you are correct that in theory it would be possible to generate a "dictionary" with a matching inputs for each output.

You are also correct in your intutition that this might be practically impossible even if it is theoretically possible, given that that the dictionary would need to have 2^160 different entries. Here are two of the problems:

To make such a dictionary, you need to generate about 10^48 hashes. The universe is around 10^17 seconds old. So if you started at the big bang and generated 10^31 hashes per second you would be done by now.
To store such a dictionary, you would need around 10^50 bits of storage (10^2 per entry). That is roughly the number of atoms on earth.

You could off course make a dictionary for a subest - say the most common passwords - but doing one for the full output space is not realistic, even in science fiction. And if you did, it would just map the output to one of many possible inputs.

Excellent answer, and some fascinating facts to go with it! Thank you very much. I'm assuming, then, that sites that successfully disclose the raw text behind and SHA1 value do so simply because they've seen that string before? e.g.: http://sha1.gromweb.com/ Thanks again, great reply. — , Oct 14 '16 at 13:29
@JᴀʏMᴇᴇ Glad it helped. Yes, they do it because they have seen that particular string before. Common passwords and natural language words (or even sentences) are a tiny, tiny, tiny fraction of the whole space of hashes. — Anders, Oct 14 '16 at 13:31

score 2 · Answer 2 · answered Oct 14 '16 at 11:09

First off, yes, SHA1 (as well as every other hash algorithm) is deterministic; in other words, given the same input, it will always yield the same output. (That output may be formatted differently, such as pure binary representation, hexadecimal or Base64, but formatting of the output is irrelevant to the value of the output.)

What you describe is what is known as a preimage attack.

There are two variants of preimage attacks:

In a preimage attack, an attacker has a hash H(x) and wishes to deduce the hash input x. In other words, given y and H(x) = y, find x.
In a second preimage attack, an attacker has an input x and wishes to find a different input x', such that H(x) = H(x').

Note that x (and x') can be of arbitrary size, but the output of the hash function is of fixed size. Some hash functions are defined for different output sizes, but within the space of a single hash, the output size is still fixed (and thus the hash function can be treated as having a fixed size output).

A hash function (such as SHA1) works, in principle, by iterating over the input and using that input to modify internal state, then exposing either the full or part of the internal state (possibly after further, final processing) as the hash value. This is what is known as the Merkle-Damgård construction.

There are two consequences of this:

Yes, given sufficient computing resources, it is possible to find an input that hashes to a given output
No, given infinite computing resources, it is not possible to determine exactly which input was used to get a given output (though, as a consequence of the previous point, you can get candidates)

The latter point might not be obvious, but it follows from the fact that if H(x) = y, and x is longer than y, there can exist an arbitrary number of values for x which give the same value for y. For an ideal hash function, the expected number of such collisions for an arbitrary input size X and an arbitrary output size Y (both in bits) will be 2^(X-Y); hence, by hashing 192 bits into a 160 bit hash, if you had the ability to enumerate the entire 192-bit input space, you'd expect to find about 2^32 (which is 2^(192-160)) other values giving the same hash value.

The extreme case is a hash algorithm defined as H(x) = 0. It is trivial in this case to determine that the hash of a given value is 0, but given only the definition of the hash function and its output value 0, it is not possible to determine which value yielded 0 as the output.

As discussed already by Anders, modern hash algorithms have large enough output spaces that generating a dictionary of all possible hash outputs is not feasible. Also, even if it was feasible to generate and store such a dictionary, it would only get you a candidate input. Depending on the purpose of the attack, this may or may not be sufficient.

Suppose the hash function is defined as H(x) = (x mod 2) for an integer input space, thus yielding 0 if x is even and 1 if x is odd. In this case, I can tell you that H(65535) = 1, but given the hash value H(x) = 1, all you can tell about the input x is that it is odd, and that one such candidate would be (for example) H(3) = 1.

Is my understanding of SHA1 correct?

2 Answers2

Linked