4

I've been looking into what hashing functions are, what they do, &c. I was just wondering: how does a hashing function always return the same length hash?

How does hashing work? does not explain why hashes for a certain algorithm return the same length every time.

ZuluDeltaNiner
  • 143
  • 1
  • 4

4 Answers4

4

Most cryptographic hash functions operate on a fixed-size internal state and process the message to be hashed block by block, where a block is a number of bits. MD5, SHA-1 and SHA-2 follow the Merkle-Damgård construction which operates as follows:

  • Let S be an array of N bits, initialized to a predefined value. N is the size of the internal state and may be different from the final size of the hash.
  • Update the state based on each message block successively. More precisely, for each message block Mi, set S := C(Mi, S). The function C is known as the compression function; it takes a K-bit block and an N-bit state value and produces a new N-bit state value.
  • For the last few bits of input, which may not form a complete block (if the length of the message is not a multiple of the block length), apply a padding function to S that takes the length of the message into account (the Merkle-Damgård construction requires a specific padding method which I am not describing here).
  • Return F(S) where F is called the finalization function. F takes an N-bit state as input and returns an n-bit hash.

The compression function is so-called because it takes K + N bits of input and returns only N bits of output. This of course means that there are a lot of collisions: distinct (Mi,S) pairs producing the same resulting state. However, if done right, i.e. if it is computationally infeasable to find collisions for the compression function, then the hash itself is also collision-resistant.

There are non-Merkle-Damgård cryptographic hash functions, for example SHA-3, which uses a sponge construction. The operating method is a little different, but the core principle is the same:

  • Work with a fixed-size state.
  • Transform the state based on one message block at a time.
  • Apply padding to the last block.
  • Apply a finalization function to the internal state to return the hash value.

Non-cryptographic hashes often rely on arithmetic (but this approach doesn't tend to produce fast cryptographic hash functions). They use modular arithmetic as the compression function to keep the current state to a fixed size.

Gilles 'SO- stop being evil'
  • 50,912
  • 13
  • 120
  • 179
1

Hash functions are defined to map inputs of arbitrary length to outputs of fixed lengths. From this it is clear that hash functions are not bijective maps, i.e. there must be collisions of input values.

As of how a hash function manages to produce constant length output algorithmically: most modern hash functions operate block-wise on their input, padding the final block as necessary. Simplifying things a lot, one can say that a typical hash function design goes like that: For an input that can be split into n blocks it computes a series of block-sized values v_{k+1} = f(v_k, b_k) for k = 0,...,n-1 where b_k is the k-th block of the input, v_0 is some arbitrary value, f is some function that for two block-sized input computes a block-sized output and v_n is the final output.

In order to give you an idea, suppose a hypothetical hash function where f is XOR. Hypothetical, because this design is insecure. A realistic f would be a bit more complex in order to achieve the typical hash function properties.

countermode
  • 684
  • 1
  • 7
  • 22
0

Here is an the essential reference to the algorithm description from which it can be seen why a hash gets always the same length:

3.1 Step 1. Append Padding Bits The message is "padded" (extended) so that its length (in bits) is
congruent to 448, modulo 512. That is, the message is extended so
that it is just 64 bits shy of being a multiple of 512 bits long.
Padding is always performed, even if the length of the message is
already congruent to 448, modulo 512.

Padding is performed as follows: a single "1" bit is appended to the message, and then "0" bits are appended so that the length in bits of the padded message becomes congruent to 448, modulo 512. In all, at least one bit and at most 512 bits are appended.

Quoting from the algorithm RFC 1321 description: further reading of the algorithm you can find here

As a quick example:

echo 1|md5sum
b026324c6904b2a9cb4b88d6d61c81d1 

echo 1234567890|md5sum
7c12772809c1c0c3deda6103b10fdfa0
-3

Result of hash-function is a number padded to the string of a fixed length. Number has a maximum (it is different for different functions).

Hash-functions, roughly speaking, are calculated in cycle where each cycle modifies the number. If result of operation is larger than max then number overflows.

You can start from wikipedia: https://en.wikipedia.org/wiki/Hash_function

JimiDini
  • 172
  • 7
  • This is a really bad description. “Number padded to the string of a fixed length” is not an accurate description of typical hash functions. Standard cryptographic hash functions operate on an internal state that is not treated as a number but as an array of bits. There is no overflow. – Gilles 'SO- stop being evil' Aug 11 '14 at 07:44