
When you run `sha256sum filename` or `md5sum filename`, does it generate the hash based on the file size or on the whole contents of the file?

Is it different from a password hash? Given a string, the program uses its algorithm to create a hash - and is decrypting it done in a similar way?

UndercoverDog
John Doe
    You've got two different questions here, and the second one is based on a misconception. By definition, a one-way hash (such as sha256 and md5) can't be "decrypted" or reversed - all you can do is find a value which hashes to the expected value. – Matthew Aug 30 '17 at 15:34
  • I think this question is answered in multiple places: [How to securely hash passwords](https://security.stackexchange.com/a/1164/52676), [How can I check the integrity of downloaded files](https://security.stackexchange.com/a/43337/52676), [How is a file checksum calculated](http://www.geeksengine.com/article/checksum.html) – RoraΖ Aug 30 '17 at 16:08
  • Both SHA-256 and MD5 use the contents of the input data as well as the length of the data to calculate the final hash – Richie Frame Sep 01 '17 at 00:58

2 Answers


A hash is a one-way digest function. It takes a number of input bytes and computes a fixed-length value from them. If you hash the same input again, you get the same result. Generally the numeric value of the length of the input is not considered on its own, as the data is inherently changed if you change the length.
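As a quick illustration, here is a minimal Python sketch (using the standard hashlib module; the inputs are arbitrary examples):

```python
import hashlib

# Hashing the same input always yields the same fixed-length digest.
data = b"hello world"
print(hashlib.sha256(data).hexdigest())  # 64 hex characters (256 bits)
print(hashlib.md5(data).hexdigest())     # 32 hex characters (128 bits)

# The digest length stays the same no matter how large the input is.
print(len(hashlib.sha256(b"x" * 1_000_000).hexdigest()))  # still 64
```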

Hashes cannot be decrypted. They are lossy compression functions by nature and in most cases you cannot recover the full original data from them (this may not be the case for non-cryptographic indexing hash functions and small input values).

There are three main types of hash function: those used for indexing, integrity, and cryptography.

An indexing hash (e.g. MurmurHash or Java's hashCode) can be used to divide equal or very similar values into groups for improved performance in collection operations. For example, by precomputing a hash of each object and keeping the internal array for a collection sorted by the objects' hashes, the matching hash can be found with a binary search in O(log n) time rather than O(n); scanning the small bucket of objects that share that hash is then usually O(1). The general requirements for indexing hashes are as follows (a toy sketch follows the list):

  • Extremely high performance
  • Generally specialised to certain input data types (e.g. ASCII strings or pointers)
  • In some cases, designed to produce output values which represent the magnitude of the input value. For example, in string searching, you might choose to convert the first four characters to their ASCII integer values and interpret those together as a 32-bit integer, since sorting that integer also pre-sorts the list alphabetically.
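To make the idea concrete, here is a toy Python sketch; the hash and bucketing scheme are purely illustrative, not how real hash tables are implemented:

```python
# A toy indexing hash in the spirit of the string-searching example:
# interpret the first four characters as a 32-bit integer.
def toy_hash(s):
    padded = (s + "\0\0\0\0")[:4]              # pad short strings
    return int.from_bytes(padded.encode("ascii"), "big")

# Group values into buckets keyed by their hash.
buckets = {}
for word in ["apple", "apply", "banana", "cherry"]:
    buckets.setdefault(toy_hash(word) % 8, []).append(word)

# Lookup: hash the key once, then search only the matching bucket.
key = "apple"
print(key in buckets.get(toy_hash(key) % 8, []))  # True
```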

An integrity hash (e.g. CRC32), sometimes called a "redundancy hash", has the properties necessary to provide a fairly good indication of accidental corruption of a file. These hashes will generally produce a different value when a bit of the input is changed, but will not withstand someone purposefully trying to generate a collision. For example, the strings "Maria has nine red beds." and "Steven has fifteen white tables." are different but generate the same CRC32 hash of 0ECB65F5. The general requirements for integrity hashes are as follows (a short CRC32 sketch follows the list):

  • A high statistical likelihood of producing a different value when an input bit is changed
  • Works with any type of input data (universal)
  • Very high performance
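Here is a short Python sketch of that behaviour, using the standard-library zlib.crc32 (the message is an arbitrary example):

```python
import zlib

# CRC32 reliably catches accidental corruption such as a flipped bit,
# even though it offers no protection against deliberate collisions.
msg = b"The quick brown fox jumps over the lazy dog"
print(f"{zlib.crc32(msg):08X}")

flipped = bytes([msg[0] ^ 0x01]) + msg[1:]   # flip a single bit
print(f"{zlib.crc32(flipped):08X}")          # a different checksum
```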

These two hash function types generally make little effort to prevent someone from learning about the input data or finding collisions. This is why they are considered non-cryptographic.

A cryptographic hash, such as MD5, SHA1, SHA256, or Keccak, has many more requirements. These hashes are designed to be highly resistant against attempts to discover any information about the original input data, or to find collisions in the hash function. The general requirements for a cryptographic hash function are as follows (a sketch of the avalanche property follows the list):

  • It should be very difficult to determine any information about the input value from the output value, even if an attacker can select parts of the input.
  • Each output bit changes with a probability of roughly 50% when any one bit in the input changes.
  • Resistance to first order preimage attacks, i.e. if someone gives you a hash value y, it is very difficult to find some value x such that h(x) = y. Or, put more simply, it is computationally infeasible to find an input value that produces a particular output value.
  • Resistance to second order preimage attacks, i.e. for any given input a it is very difficult to find another value b such that h(a) = h(b). Or, put more simply, given an input (and thus its hash), it is hard to find a second input that produces the same hash.
  • General collision resistance, i.e. it is hard to find any two values a and b such that h(a) = h(b), even if the attacker selects both a and b.
  • Works with any type of input data (universal)
  • High performance
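The second point, the avalanche effect, is easy to observe in a few lines of Python (the two inputs are arbitrary strings chosen to differ in exactly one bit):

```python
import hashlib

def differing_bits(a, b):
    # Count how many bits differ between two equal-length digests.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# "input@" and "inputA" differ in exactly one bit (0x40 vs 0x41).
d1 = hashlib.sha256(b"input@").digest()
d2 = hashlib.sha256(b"inputA").digest()
print(differing_bits(d1, d2), "of 256 bits changed")  # roughly 128
```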

As you noted, these functions turn out to be quite useful for obscuring passwords and turning passwords into cryptographic keys, although password storage and key derivation are not their main purpose at all. In fact, using a plain hash function for either of these purposes can be quite insecure. You can read about the arms race between attackers and defenders of passwords here, and more about good practice here.
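As a rough sketch of what a deliberately slow password-storage function looks like, here is PBKDF2 from Python's standard library; the salt size and iteration count are illustrative, not a recommendation:

```python
import hashlib
import os

# Derive a key from a password with a per-password random salt and
# many iterations, making brute-force attempts expensive.
password = b"correct horse battery staple"
salt = os.urandom(16)                        # unique random salt per password
key = hashlib.pbkdf2_hmac("sha256", password, salt, 600_000)
print(salt.hex(), key.hex())                 # store both, never the password
```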

When a hash function or password storage function is used, the verification is not done by decrypting anything. Instead, the password given by a user is hashed in the same way and the resulting hash is compared against the one in the database. If the password is correct then the hash will be the same, otherwise the hashes will not be equal and the server can reject the login attempt. This is generally flawed with basic hashes like SHA256, though, so I suggest reading the two links I posted above to better understand why and what good practice options you have.
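The verification flow looks roughly like this in Python; plain SHA-256 is used purely to illustrate the compare-the-hashes mechanism, not as a recommendation:

```python
import hashlib
import hmac

stored = hashlib.sha256(b"hunter2").digest()   # what the database holds

def check_login(attempt):
    candidate = hashlib.sha256(attempt).digest()
    # Constant-time comparison avoids leaking timing information.
    return hmac.compare_digest(candidate, stored)

print(check_login(b"hunter2"))   # True  - hashes match
print(check_login(b"hunter3"))   # False - hashes differ
```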

Polynomial
    For a cryptographic hash, _configurable_ performance is more useful than high performance, so that you can tune the algorithm to give you acceptable performance when users log in, but make it very resource-intensive to try a brute-force attack on it. Or you could add "password hash" as a fourth kind of hash with different requirements again. – Mike Scott Aug 31 '17 at 07:07
    @MikeScott That's the case for KDFs and password storage, but not for cryptographic hashes. KDFs and password storage functions may *use* cryptographic hashes, but their design goals are not the same. – Polynomial Aug 31 '17 at 09:52
  • You have bullet points for "Resistance to first order preimage attacks" and "Resistance to second order preimage attacks", but a 3rd bullet point should be added: "Collision resistance: it is very difficult to find two values a and b such that h(a) = h(b)." Then move the "Or, put more simply" part of your 2nd bullet point to the 3rd bullet point. The 2nd preimage attack should state: "Or, put more simply, given an input value (and thus its hash) it is hard to find a second input that produces the same hash" – mgr326639 Aug 31 '17 at 12:40
  • @mgr326639 I don't quite agree with the way you've written that, but I'll add an additional point regarding collisions for arbitrary *a* and *b* values. – Polynomial Aug 31 '17 at 12:43
  • "lookups can be performed in worst case O(log n)" - Wouldn't this still be O(n) in the case that all objects have the same hash? A little nitpicky, but worst case means worst case. – AndrolGenhald Aug 31 '17 at 14:14
  • @AndrolGenhald Sorry, yes, it's worst case *O(log n)* to find the correct hash in the list, after that it's *O(n)* worst case to search that bucket, although usually it's *O(1)*. – Polynomial Aug 31 '17 at 14:46

Technically, you could hash just the file size, but by default these tools hash the contents. You can test that easily: make two files of the same size but with different contents (e.g. each containing a single, different character) and you'll find they have different hashes.
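The same experiment in a couple of lines of Python, equivalent in spirit to running sha256sum on two one-byte files:

```python
import hashlib

# Two "files" of identical size (one byte each) but different contents
# produce completely different digests.
print(hashlib.sha256(b"a").hexdigest())
print(hashlib.sha256(b"b").hexdigest())
```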

Matthew
  • Hashing the filesize wouldn't do you very much good. One point of hashes is to verify file integrity: if even a single bit of a file changes, then the hash will change. That wouldn't happen if you only hashed the filesize, and people could then maliciously corrupt files; as long as the corrupted file had the same size, you wouldn't be able to detect the corruption via its hash. – Conor Mancone Aug 31 '17 at 15:08
  • 'but by default...' - hashing the filesize is theoretically and practically just nonsense. – trinity420 Apr 08 '19 at 17:34