27

What are similarities and differences between a "checksum" algorithm and a "hash" function?

Can they be used instead of each other? Or their usage are different?

For example, for verifying the integrity of a text, which one is better to be used?

And if they are different, what are specific algorithms for each one? Meaning that which algorithm is appropriate for "checksum" and which one for "hash" function?

Questioner
  • 1,277
  • 2
  • 10
  • 14
  • If you define hash function as something along "maps some data to other data", every checksum function is a hash function. Wikipedia lists checksum algorithms as [subcategory of hash functions](https://en.wikipedia.org/wiki/List_of_hash_functions). – bluenote10 Apr 02 '21 at 14:22
  • Cross-posted: https://security.stackexchange.com/q/194600/971, https://bitcoin.stackexchange.com/q/79511/14915, https://ethereum.stackexchange.com/q/59275/19041. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068). – D.W. Aug 04 '22 at 02:58

4 Answers4

32

A checksum (such as CRC32) is to prevent accidental changes. If one byte changes, the checksum changes. The checksum is not safe to protect against malicious changes: it is pretty easy to create a file with a particular checksum.

A hash function maps some data to other data. It is often used to speed up comparisons or create a hash table. Not all hash functions are secure and the hash does not necessarily changes when the data changes.

A cryptographic hash function (such as SHA1) is a checksum that is secure against malicious changes. It is pretty hard to create a file with a specific cryptographic hash.

To make things more complicated, cryptographic hash functions are sometimes simply referred to as hash functions.

Sjoerd
  • 28,707
  • 12
  • 74
  • 102
  • 1
    A cryptographic hash function does not protect against malicious changes; it can be recalculated and changed just as easily as a checksum. – Beurtschipper Sep 27 '18 at 14:54
  • re: "_the hash does not necessarily changes when the data changes_": example please – dandavis Sep 27 '18 at 20:33
  • 4
    @Thierry: some _different_ changes to the original (something like byte #101 and byte #99872) will produce the same 'wrong' CRC, but those files _must_ differ in _two_ bytes. Every _single_ change of less than 32 bits will produce a different CRC32, by design. – dave_thompson_085 Sep 28 '18 at 02:16
  • 1
    @dave_thompson_085 That's correct, but only for a CRC, not for just any checksum (e.g. Fletcher's). – forest Sep 28 '18 at 12:43
  • @dandavis This seems to refer to a hash function that is not secure. – Tom K. Sep 28 '18 at 12:48
  • Interestingly, CRC is not a checksum by classical definition (like Fletcher is). CRC32 is widely used as a hash function. But the mathematical properties of CRC, as well as its performance, has made it popular for many protocols and embedded systems. Theoretically, any fast hash function can serve as a checksum. In many cases an additive/XOR checksum will perform worse, but would be much more efficient on highly constrained systems. – bryc Jul 09 '19 at 20:57
  • 1
    @Thierry Your statement assumes that all these 6 billion variations map to _unique different_ values. The statement "if one byte changes, the checksum changes" is still satisfied though if some of the 6 billion variations map to the _same different_ checksum. – bluenote10 Apr 02 '21 at 14:28
5

What are similarities and differences between a "checksum" algorithm and a "hash" function?

A checksum is used to determine if something is the same.

If you have download a file, you can never be sure if it got corrupted on the way to your machine. You can use cksum to calculate a checksum (based on CRC-32) of the copy you now have and can then compare it to the checksum the file should have. This is how you check for file integrity.

A hash function is used to map data to other data of fixed size. A perfect hash function is injective, so there are no collisions. Every input has one fixed output.

A cryptographic hash function is used for verification. With a cryptographic hash function you should to not be able to compute the original input.

A very common use case is password hashing. This allows the verification of a password without having to save the password itself. A service provider only saves a hash of a password and is not able to compute the original password. If the database of password hashes gets compromised, an attacker should not be able to compute these passwords as well. This is not the case, because there are strong and weak algorithms for password hashing. You can find more on that on this very site.

TL;DR:

Checksums are used to compare two pieces of information to check if two parties have exactly the same thing.

Hashes are used (in cryptography) to verify something, but this time, deliberately only one party has access to the data that has to be verified, while the other party only has access to the hash.

schroeder
  • 123,438
  • 55
  • 284
  • 319
Tom K.
  • 7,913
  • 3
  • 30
  • 53
  • Tom, besides the fact that your TL;DR; is pretty narrow, IMHO I think that the paragraph about the perfect hash function might be misleading when you say _Every input has one fixed output_. What do you think about a function like F(x) = 1? Does it have one fixed output for every input? – user2553863 Apr 08 '21 at 13:47
3

They are basically the same thing, but checksums tend to be smaller (a few bytes).

Integrity

Both hash functions and checksums are used to verify the integrity of data. Cryptographic hash functions are hash functions for which a collision is unknown. This is why cryptographic hash functions are used to construct things like a MAC (see below).

Information loss

Another property of hash functions and checksums is that information gets lost during computation. This must be true if you convert some data to a checksum/hash with less bits. This is also why you can't go back to the original data with just a checksum or a hash.

HMAC

What I think you are looking for is a MAC (Message Authentication Code). Such a code is used to detect the tampering of data. Most of the time it's just a combination of a hash function and some secret value, like a password. See also:

https://en.wikipedia.org/wiki/Message_authentication_code

Passwords

Passwords are sometimes stored as a hash. To verify the password, a hash is calculated of the password you enter, and it is compared to the stored password hash. Checksums are not used for such things because they are generally shorter and more prone to collisions, meaning that you can try random passwords and have a chance that your input has the same checksum as the original password.

But note that using normal (digest) hash functions is not the right way to store passwords. Because they are created for quickly digesting data, attackers can crack those hashes at high speeds. Programmers should use a hash function designed for storing passwords, like bcrypt or Argon2.

Edit: examples of algorithms

To answer your final question about specific algorithms: Please have a look at the Wikipedia page that lists hash functions. Like I mentioned above, they are basically the same. On Wikipedia, checksums are listed as a subset of hash functions.

https://en.wikipedia.org/wiki/List_of_hash_functions

Beurtschipper
  • 693
  • 4
  • 10
  • 1
    To quote from [Wikipedia](https://en.wikipedia.org/wiki/Hash_function): "Hash functions are related to (and often **confused with**) checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers. Although the concepts overlap to some extent, each one has its own uses and requirements and is designed and optimized differently." – Tom K. Sep 28 '18 at 13:37
1

Let me add two practical examples to the above.

Checksums: these are designed to detect accidental changes in data. A great example of this is the checksums used in TCP/IP. They are simple and very fast (each packet is checked). It is relatively simple to maliciously craft a message which's checksum equals to that of another.

Hashes: these are one-way functions used to map data to a fixed length. They are often used to compare large sets of data without the need to send it over the network. A great example of this is BitTorrent. Once your download finishes, it compares your version of the data to the original by computing the hashes and matching those.

What are similarities and differences between a "checksum" algorithm and a "hash" function?

Similar, they are both deterministic and can map a variable size data to a fixed size. They can both be used to check for sameness, although with different guarantees. See above for differences.

Can they be used instead of each other? Or their usage are different?

They shouldn't be. See above.

For example, for verifying the integrity of a text, which one is better to be used?

It depends on what you would like to do. The industry uses both to check for "integrity" (see examples above).

  • For random error detection, go with checksums
  • To compare large data sets, use hashes
  • For integrity, use HMACs with a cryptographic hash function, like SHA256

For a gentle introduction to data integrity, read this article.

Daniel Szpisjak
  • 1,825
  • 10
  • 19