7

According to the Wikipedia page for key derivation functions, a KDF's purpose is to derive a secret key for cryptography:

In cryptography, a key derivation function (KDF) derives one or more secret keys from a secret value such as a master key, a password, or a passphrase using a pseudorandom function.[1][2] KDFs can be used to stretch keys into longer keys or to obtain keys of a required format, such as converting a group element that is the result of a Diffie–Hellman key exchange into a symmetric key for use with AES. Keyed cryptographic hash functions are popular examples of pseudorandom functions used for key derivation.[3]

Let's assume we just completed a Curve25519 key exchange, and we want to use the key for a symmetric algorithm, e.g. AES.

  • If the raw shared secret can be used as a key for the symmetric cipher, does using a KDF provide any security benefits? (Assuming that the KDF's output can also be used in the symmetric cipher.
  • If the raw secret cannot be used as a key for the cipher, we apply a KDF to it. In this case, why does the KDF have to be slow? (Or is this property of a KDF just for specific cases, but not this one?)
  • 1
    They're slow to reduce the effectiveness of brute-force attempts. If they can only guess once per second, per CPU that they're using, it'll take them longer to guess the correct secret than if they could guess hundreds of thousands of times per second per CPU. – Ghedipunk Oct 04 '19 at 18:23
  • 1
    Does that really matter if the key is unknown? E.g. if we perform a key exchange, the adversary has a range of possible values of the ECC secret and a range of PBKDF values. If we're using a KDF dervied key _K_, then finding the preimage to the KDF-ed doesn't matter right? The only thing that matters is finding the key used. –  Oct 04 '19 at 18:26
  • The secrets are typically low entropy, such as passwords. – Ghedipunk Oct 04 '19 at 18:36
  • 1
    @mngxyuiso Yes, you could try to brute-force the derived key instead. But that key should be so large that brute-forcing is fruitless – Hagen von Eitzen Oct 05 '19 at 12:47

3 Answers3

17

The confusion here is that there are two distinct kinds of key generation function, and people often say "key derivation function" without being explicit which one they mean (or even understanding there are two):

  • Key-based key derivation functions
  • Password-based key derivation functions

A key-based derivation function like HKDF presupposes that the inputs may be biased or partially predictable, but that otherwise it has enough min-entropy to be reliably unguessable. The shared secret produced by a Diffie-Hellman exchange is one of the textbook examples.

Password-based functions, on the other hand, assume that the inputs have low entropy, and thus they're designed to impose as high a cost as reasonable to a guessing attacks (without becoming unbearably costly to the honest parties). Slowing down the computation by having a high number of iterations is the classic technique, but newer functions like scrypt and Argon2 go beyond this and aim to be memory-hard:

  • The canonical algorithm for computing them uses a large, tunable amount of memory;
  • Any algorithm for computing the function that uses less memory than the canonical one should pay a very high time penalty (adverse time-memory tradeoff).
Luis Casillas
  • 10,181
  • 2
  • 27
  • 42
11

Not all KDFs are slow! Something like HKDF is extremely fast, and only involves a handful of invocations to the underlying PRF.

KDFs are only slow when they're intended to convert a potentially low-entropy input—like a password—to a high-entropy output such as an encryption key or a password verifier. In this scenario, such functions are designed to be slow in order to add computation time as if the attacker were trying to brute force a secret with higher entropy than the one actually used.

For something like a shared secret after a Curve25519 key exchange, you would generally prefer a fast KDF. For instance, the Noise protocol framework uses HDKF to generate encryption keys from a shared secret derived from curve multiplication. While you can use a raw shared secret as a key directly, most protocols in practice use some form of a KDF to allow for features like forward secrecy.

Stephen Touset
  • 5,736
  • 1
  • 23
  • 38
  • Is converting a low-entropy input to a high-entropy output actually possible? You are referring to an output format that would allow for high entropy, or to a *potentially* high-entropy output, right? – caw Nov 03 '19 at 22:57
  • "It depends". In a very strict mathematical sense, no. In a *practical* sense with physical computers built from matter and existing in spacetime, yes. The idea is that even some input might have `n` bits of entropy, key stretching increases the amount of actual work needed to enumerate and test each guess. If the amount of work required for each guess is `2^m` times the amount of work that would have been needed without key stretching, the effective entropy is `n + m` bits. – Stephen Touset Nov 04 '19 at 19:49
3

The reason why you use a KDF, or a secure hash for that matter, on a curve25519 shared key is that the bits are not distributed randomly. You have 32 bytes of "point data" which contain roughly 126 bits of "security".

So... which bits do you choose? Take the first 126 bits and leave the remaining 2 bits of your 128-bit key zero? Or take the last 126 bits? Or just take 128 bits out of the middle? Some other strategy? How do you know you picked the right bits? How do you know there are no exploitable patterns?
Using a secure hash or KDF solves all these issues. Something-something-input gives 128-bits of pretty-much-perfectly-random output (or rather, random-looking). Or, any other amount of bits that you want. You do not waste entropy, you need not worry whether you picked the "good" bits, and you do not risk having possibly exploitable, obvious patterns that come from the ECC's calculations (which are not perfectly "random looking"). Of course, entropy will not "magically" be added if you stretch, but the point is, an external observer cannot tell where it is. The KDF or hash does not need to be slow (and most of the time should not be).

The reason why you use a slow hash or KDF on passwords or any other user input is that anything that comes from a human has embarrassingly poor entropy, and is subject to being brute-forced with the aid of a dictionary (plus obvious permutations). Modern computers can literally do hundreds of millions of simple hashes per second, so that's a problem if your password database gets stolen. The attacker may not break the complete database, but getting a few users' passwords is a matter of a fraction of a second if no deliberately slow function is used. The longer it takes a prospective attacker, the better. More work needed to break a password means you have more of a time window to react and inform users in case of a breach.

The same goes for e.g. access to your encrypted disk or your Keepass file. If an attacker can try 100-200 million passwords per second, you may as well not encrypt at all, it doesn't matter how much care you put into choosing a good password.
If an attacker can try 3-4 passwords per second because that's just how long it takes to run the KDF, your password is basically "unbreakable" because it takes forever to find a match at that rate.

Certainly, this also makes it more expensive for you to unlock your volume. However, you only do it once, an attacker must do it many times.

Damon
  • 5,001
  • 1
  • 19
  • 26