54

I recently read the Rust language documentation and saw this:

By default, HashMap uses a cryptographically secure hashing function that can provide resistance to Denial of Service (DoS) attacks. This is not the fastest hashing algorithm available, but the trade-off for better security that comes with the drop in performance is worth it.

As someone with no background in systems languages, I've never heard of in-memory attacks based on the choice of hashing algorithm. So I have some questions:

How does the secure hashing algorithm prevent a DoS or any other attacks?

When should I opt for a more secure hash over a faster one?

Greaka
  • 3
    You can find more information about that kind of attack on the [@hashDoS twitter account](https://www.twitter.com/hashdos) - note that the DoS is not about memory, but CPU. – Jean Hominal Oct 05 '18 at 16:20
  • Related, Rust could use Bernstein and Aumasson's `SipHash` in the 128-bit flavor for *some* data. For some data it will maintain the security properties and provide the speedup. Also see [SipHash in the kernel](https://lwn.net/Articles/711167/) on LWN and [Issue 11783, Implement a fast HashMap / hash function](https://github.com/rust-lang/rust/issues/11783) in the Rust Bug Tracker. –  Oct 08 '18 at 02:28
  • 1
    Perhaps worth reading: https://medium.com/booking-com-development/hardening-perls-hash-function-d642601f4e54 https://lwn.net/Articles/474912/ – leonbloy Oct 08 '18 at 12:50

3 Answers

48

Sometimes applications use untrusted data as the key in a hash map. A simple implementation can allow the untrusted data to cause a denial of service attack.

Hash maps are fast - O(1) - in the best case, but slow - O(n) - in the worst case. This is because keys normally end up in separate buckets, but different keys can produce the same hash - a collision - and colliding keys are handled by a slower linked list. With random data, collisions will be uncommon. However, some implementations have a vulnerability where malicious data can cause many collisions, which makes the hash map slow. Some years ago there was a Linux kernel DoS due to this.

The root cause of the Linux vulnerability was that hashing was predictable. It was fixed by introducing a key into the hash function that a remote user would not know. I don't know exactly how Rust hash maps work, but I expect they use a similar kind of keyed hash.
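
As a rough illustration of that kind of keyed hashing in Rust: the standard library's `RandomState` (the default hasher source for `HashMap`) is seeded with per-instance random data, so the same key hashes differently for every map or program run. This is just a sketch; the exact hash algorithm behind `RandomState` is an implementation detail.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hash one key with a given randomly seeded state.
fn hash_key(state: &RandomState, key: &str) -> u64 {
    let mut hasher = state.build_hasher();
    key.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Each RandomState carries fresh random seed material, so two states
    // (or two runs of the program) hash the same key differently.
    let state_a = RandomState::new();
    let state_b = RandomState::new();

    println!("{:#018x}", hash_key(&state_a, "Firstname.Lastname"));
    println!("{:#018x}", hash_key(&state_b, "Firstname.Lastname"));

    // Without knowing the seed, a remote attacker cannot predict which
    // bucket a chosen key will fall into, so they cannot mass-produce
    // colliding keys ahead of time.
}
```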

You should opt for a more secure hash any time you're using untrusted data as a key.

paj28
  • 3
    It's worth mentioning that some hashmap implementations put binary trees instead of linked lists in the buckets, which ensures that the performance doesn't degrade *too* much even if there are lots of collisions. Then what hash you use matters less. – Ben Millwood Oct 05 '18 at 14:11
  • 14
    @BenMillwood: The downside of that approach is that you need not just a hashing function for your elements, but also an ordering function. – MSalters Oct 05 '18 at 14:27
  • 3
    Would it make sense for an implementation to start by using a fast and simple hash function, and then, for buckets which exceed e.g. 25 elements, switch to a hash table based on a better (but slower) hash function? If 90% of the items in your table end up well distributed through N-1 buckets, but the other 10% end up in the remaining bucket, such an approach would offer better retrieval performance for 90% of the buckets than would using the better hash function, but offer the advantages of the slower hash function when processing the problematic 10%. – supercat Oct 05 '18 at 19:13
  • @supercat Maybe I'm missing something, but I don't see how that could work. If you change the hash function for a given key when you *add* the item, how do you remember that you changed the function for that particular key, when you *read* the item? (You can't just look at the number of items in a bucket again when you do the read, because that will change as you add and delete items.) – alephzero Oct 06 '18 at 11:14
  • 3
    Details: the Rust HashMap implementation is generic over a "HashBuilder" argument, which defaults to a builder that is (1) seeded randomly and (2) uses SipHash (2-4?) to prevent an attacker from building collisions off-line. – Matthieu M. Oct 06 '18 at 17:42
  • 1
    @alephzero: Compute the cheap hash for the key, and find the appropriate bucket. If the bucket is an "ordinary" bucket, check each item in the bucket against the key. If the bucket is a reference to a second hash table, then compute the expensive hash for the key and use it to find the item in that table. – supercat Oct 06 '18 at 22:52
  • 1
    @supercat That seems to be a lot more work than just using a hashing algorithm that can resist such attacks in the first place, especially since such attacks still do quite a bit of harm with that strategy. – David Schwartz Oct 07 '18 at 23:48
  • 1
    @supercat: Interestingly, the Rust community has been looking into an adaptive hashing strategy => start with a fast hash like xxHash and if there are too many collisions switch to a more collision-resistant hash (like SipHash) on a per hash-map basis. The main advantage over your strategy is that once the switch occurs, the cheap hash + linear search phase is eliminated entirely. – Matthieu M. Oct 08 '18 at 06:58
  • 1
    @DavidSchwartz: If a good hash computation takes twice as long as a quick one (hardly implausible), and capriciously-chosen keys represent 10% or less of the total (likewise), then 100 single-cost hashes and 10 double-cost hashes would represent a total cost of 120, compared with a cost of 200 for using the slower hash for everything. Whether the 1.6:1 performance difference would be worth the extra effort required to use the two-stage approach would be a judgment call, but on the flip side, the two-stage storage class would avoid the speed penalty in cases where there was no attack... – supercat Oct 08 '18 at 14:49
  • ...and thus no need to worry about whether defending against the attack is necessary. The value of such approaches will probably depend a lot upon whether objects cache their hash values. If e.g. one is using strings that don't cache hash values, a preliminary hash that ignores all but the first and last 8 characters might offer significant performance advantages in cases where such strings yield largely-unique hashes. On the other hand, if each string can cache a single kind of hash value, computing a good hash once might not be too expensive. – supercat Oct 08 '18 at 14:56
  • @supercat You're ignoring the possibility that the attacker also has control over the data queried for. But in many cases that won't change the logic very much and I do think this approach might make sense in many cases. xxHash, if salted/tagged is actually pretty resistant to algorithmic complexity attacks. – David Schwartz Oct 09 '18 at 02:18
37

Insert, search, and remove operations on hash tables have O(n) worst case behavior. If an attacker can choose keys to be inserted into a hash table and they can compute the hash function themselves then that creates an opportunity for denial of service. All they need to do is choose keys that map to the same bucket.

The quote suggests that the use of a cryptographic hash algorithm (SHA, MD5, Blake, Skein, etc.) solves the problem. That interpretation is totally incorrect. The algorithm that Rust's HashMap uses is called SipHash. It is a hash algorithm. And it is a cryptographic algorithm. But it is not a cryptographic-hash-function. The correct term for SipHash in the world of cryptography is PRF.

The key difference is that (in cryptography) all the details of a hash function may be public knowledge. A PRF, on the other hand, requires a secret key. Without the secret information there is no way to anticipate, for any input, what the output will be. (All the other details are public.)

Something like SHA-2 will not prevent denial of service. It will be totally unbiased for non-adversarial inputs. (Because cryptographic hash functions can be modelled as random oracles.) However anyone can evaluate plain SHA-2, so someone can find hash table collisions by brute force.

The fact that a cryptographic hash function is collision resistant (with at least 256-bit output) doesn't translate to a lack of collisions in the case of hash tables. Ultimately your hash function, for a table with n buckets, will be reduced to one of n possible values. By trial and error you can find an input that maps to a specific bucket about once every n tries. No hash tables use enough buckets to make this impossible.
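
As a rough sketch of that trial-and-error search, here is the idea in Rust, using FNV-1a as a stand-in for any public, unkeyed hash (the bucket count and key format are just illustrative):

```rust
// FNV-1a: a public, unkeyed 64-bit hash. Anyone can evaluate it offline.
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

fn main() {
    let buckets = 1024; // hypothetical table size
    let target = 0;     // the bucket we want to flood
    let mut colliding = Vec::new();

    // Trial and error: roughly one hit per `buckets` attempts.
    let mut i: u64 = 0;
    while colliding.len() < 10 {
        let key = format!("user{}", i);
        if fnv1a(key.as_bytes()) as usize % buckets == target {
            colliding.push(key);
        }
        i += 1;
    }

    // All of these keys land in the same bucket of the hypothetical table.
    println!("{:?}", colliding);
}
```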

Using an un-keyed hash function is inherently vulnerable to denial of service, no matter how good the hash function is. The fact that the attacker and server-with-a-hash-map both query the same oracle enables a DOSer to use specifically chosen inputs to tie up your CPU.

PRFs like SipHash don't have this vulnerability if used correctly. The server uses an oracle/function chosen from a pool of 2^128 possible functions. To exploit a PRF based (hash-table-)hash-function the attacker either needs to guess which of the 2^128 functions he should be using (a "key recovery") or to find a bias in the PRF independent of the key (a way to distinguish the PRF from a random oracle).

Finally, there exist more confusing nuances involving hash algorithms. But summarized simply:

  • Cryptographic hash functions are a subset of all ordinary hash functions
  • Under the classical definition of a cryptographic hash function, randomness isn't required. However, randomness is a feature of all big-name cryptographic hash functions anyway.
  • Not all PRFs are cryptographic hash functions
  • Not all cryptographic hash functions are PRFs
  • An algorithm can have the properties of a PRF and a cryptographic hash function.
    • Blake2, Skein, and KMAC have both sets of properties
    • The SHA-2 and SHA-3 families are examples of (unkeyed) cryptographic hash functions
    • SipHash is only a PRF (and an ordinary, but not cryptographic, hash function)
  • A PRF can be constructed using typical cryptographic hash functions, but the hash function itself is not necessarily a PRF.
  • "Randomized hashing" and "universal hashing" are similar to PRFs in some ways, but they don't have the same security requirements.
Future Security
  • 2
    You got my upvote, even though you totally neglected Java's HashMap that uses a balanced tree instead of a linked list: http://openjdk.java.net/jeps/180 ...in fact, my opinion is that it's better to use as fast a hash function as possible (i.e. MurmurHash3) and handle collisions using a balanced tree than it is to use a slow hash function (i.e. SipHash) and handle collisions using a linked list. – juhist Oct 06 '18 at 12:26
18

I agree that's a little vague and it's going to heavily depend on how the hashmaps are used.

Here's my guess: say you're taking some input from users, say [Firstname.Lastname], and using that as the lookup value in your hash table. Let's say you build your hash table using a simple hash function that takes the initials, so that [Firstname.Lastname] --> FL. Then it would be easy for an attacker to submit loads of values that all hash to the same thing. That would essentially turn your hash table into a list, which negates all the performance gains of using a hash table. Slow lookups = denial of service.

AA -> [ ]
AB -> [ ]
...
FK -> [ ]
FL -> [First.Last, F1.F2, F1.F2, Fanotheu.Lonteuh, ...]
FM -> [ ]
...
ZZ -> [ ]
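
To make the initials idea concrete, here is a rough Rust sketch with a deliberately terrible hasher that only looks at the first byte of the key; the key format and counts are just illustrative:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

// Deliberately bad hasher: the hash is just the first byte it sees,
// roughly the "initials" idea. Every key starting with 'F' collides.
#[derive(Default)]
struct InitialHasher(u64);

impl Hasher for InitialHasher {
    fn write(&mut self, bytes: &[u8]) {
        if self.0 == 0 {
            if let Some(&first) = bytes.first() {
                self.0 = first as u64;
            }
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

#[derive(Default)]
struct BuildInitialHasher;

impl BuildHasher for BuildInitialHasher {
    type Hasher = InitialHasher;
    fn build_hasher(&self) -> InitialHasher {
        InitialHasher::default()
    }
}

fn main() {
    let mut map: HashMap<String, u32, BuildInitialHasher> = HashMap::default();

    // An attacker who knows the hash depends only on the first letter can
    // submit lots of keys that all land in the same bucket...
    for i in 0..10_000u32 {
        map.insert(format!("First{}.Last", i), i);
    }

    // ...and every lookup now degrades towards a linear scan of that bucket.
    println!("{:?}", map.get("First9999.Last"));
}
```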

Cryptographic hash functions are specifically designed to prevent this because it's very hard to construct two different inputs that have the same hash value (called collisions).


When should I opt for a more secure hash over a faster one?

The answer is simple: opt for a cryptographic hash any time the lookup value is supplied by users and could be maliciously crafted. If the lookup values are coming from some internal source that you trust to be non-malicious and evenly distributed, then you can use a faster hash.
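
In Rust terms that choice is just the map's hasher type parameter. Here is a rough sketch, assuming the third-party `fxhash` crate as the fast, non-keyed hasher (any similar crate would do):

```rust
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

use fxhash::FxHasher; // assumed third-party crate providing a fast, unkeyed hasher

// Fast map type for trusted, internally generated keys.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;

fn main() {
    // Internal, non-malicious keys: speed wins, so a cheap hash is fine.
    let mut counts: FastMap<u32, u64> = FastMap::default();
    for id in 0..1_000u32 {
        *counts.entry(id).or_insert(0) += 1;
    }

    // User-supplied keys: keep the default, randomly seeded hasher
    // so attackers cannot precompute colliding inputs.
    let mut sessions: HashMap<String, u64> = HashMap::new();
    sessions.insert("Firstname.Lastname".to_string(), 42);

    println!("{} trusted keys, {} untrusted keys", counts.len(), sessions.len());
}
```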

Mike Ounsworth