From the code example provided, it is possible to determine the correct password by timing the code when given various inputs.
First, you shouldn't actually examine the password directly! At the very least, you should hash the password with a password hash like Argon2id first, and compare the password hash of the input with the password hash you stored during user registration (or when the user last changed their password).
Even better, you should use a password-authenticated key agreement protocol like OPAQUE, but these may be beyond your pay grade at the moment until they see more widespread adoption and implementation.
What can I do to ensure that my code is not vulnerable to such timing attacks?
The best way to start is to use a library routine or primitive that someone else has already written and has a reason to maintain. For example, in NaCl/libsodium, you can use crypto_verify_32 to compare two 32-byte strings, like two Argon2id hashes, or two HMAC-SHA256 message authentication codes. Then the effort to answer this question can be focused on a single place that will get a lot of attention and scrutiny and will keep up with advances.
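For illustration, a minimal sketch of how that call might be wrapped (the macs_match name and the assumption that both MACs are already computed are mine; crypto_verify_32 itself is the real libsodium routine, which returns 0 on equality):

#include <sodium.h>

/* Returns 1 if the two 32-byte MACs match, 0 otherwise.  crypto_verify_32
   compares its arguments in constant time and returns 0 on equality.
   (A real program would call sodium_init() once at startup.) */
int macs_match(const unsigned char expected[32], const unsigned char received[32])
{
    return crypto_verify_32(expected, received) == 0;
}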
But let's say you don't have crypto_verify_32, or you want to implement it yourself. What can you do?
To start, you need to understand what operations have side channels. It's tempting to say—as other answers did—that the side channel arises only because of an early abort. But that's not the whole story. In general, there are many operations (here written in C for illustration) that may take an amount of time which depends on the values of the inputs—we call these operations variable-time operations, in contrast to constant-time*:
An early abort like for (i = 0; i < n; i++) if (x[i] != y[i]) return EFAIL; obviously takes fewer loop iterations when it aborts early, so it is practically guaranteed to run in variable time depending on the secret values of x[i] and y[i].
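Written out as a complete comparison routine (the leaky_equal name and signature are made up for illustration), the leak looks like this:

#include <stddef.h>

/* The classic leaky comparison: it returns at the first mismatching byte,
   so the running time reveals how long a prefix of the attacker's guess is
   correct. */
int leaky_equal(const unsigned char *x, const unsigned char *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (x[i] != y[i])
            return 0;   /* early abort: time depends on secret data */
    return 1;
}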
A mere secret-dependent conditional, like for (i = 0; i < n; i++) if (x[i]) bad++; where x[i] is secret, may run in variable time too even if the loop doesn't abort early. Why?
Here's a coarse approximation. The machine instructions that the CPU may execute look something like this:
0: tmp := x[i]
branch to 1 if tmp is zero
bad := bad + 1
1: i := i + 1
branch to 0 if i < n
The number of instructions executed depends on what the value of x[i] is at each iteration: we skip bad := bad + 1 on some iterations. This is a good model for how early timing attacks on, e.g., RSA worked, as in Kocher's seminal paper on timing attacks: the main modular exponentiation loop computes a (say) 2048-bit modular squaring unconditionally, but computes a 2048-bit modular multiplication conditionally depending on the value of the secret exponent. Skipping the multiplication substantially changes the time taken by the whole operation.
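To make the pattern concrete, here is a toy square-and-multiply loop over 64-bit words; it is only a sketch (real RSA uses multi-thousand-bit integers), and it relies on the GCC/Clang unsigned __int128 extension for the toy modular multiplication:

#include <stdint.h>

/* Toy modular multiplication; real RSA would use multi-word big integers. */
static uint64_t modmul(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)(((unsigned __int128)a * b) % n);
}

/* Schematic square-and-multiply: the squaring happens on every iteration,
   but the multiplication happens only when the current bit of the secret
   exponent d is 1, so the total running time depends on the bits of d. */
uint64_t leaky_modexp(uint64_t y, uint64_t d, uint64_t n)
{
    uint64_t acc = 1 % n;
    for (int i = 63; i >= 0; i--) {
        acc = modmul(acc, acc, n);      /* square: always done          */
        if ((d >> i) & 1)               /* secret-dependent branch!     */
            acc = modmul(acc, y, n);    /* multiply: only for 1 bits    */
    }
    return acc;
}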
There's another reason, though, and it has to do with branch prediction, a key design element in what makes modern CPUs run so fast on many workloads—even if you write the same amount of code (say, same number of machine instructions, and somehow you guarantee that they take the same number of cycles to compute) in each branch of a conditional, the time it takes to execute may depend on which way the condition went.
In general, CPUs are bad at keeping which instructions got executed secret, so don't make the choice of instructions depend on secrets.
Table/array lookups might take a different amount of time depending on what memory has been cached in the CPU cache. Consequently, if the location in the array that you're reading from depends on a secret, the time it takes might depend on the secret, which has been exploited to recover AES keys by cache timing.
(This makes AES a rather questionable design in retrospect, with its intentional use of key-dependent table lookups! NIST's published rationale (§3.6.2, Attacks on Implementations: The Role of Operations) curiously claims table lookups are ‘not vulnerable to timing attacks’ despite the numerous such attacks that have been reported since then.)
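Concretely, the problematic pattern is nothing more than this (the sbox argument and function name are placeholders):

#include <stdint.h>

/* A secret-indexed table lookup of the kind classic AES software uses.
   Which cache lines get touched depends on the secret byte, so both the
   time taken and the cache state left behind can leak information about it. */
uint8_t leaky_sbox_lookup(const uint8_t sbox[256], uint8_t secret_byte)
{
    return sbox[secret_byte];   /* memory address depends on a secret */
}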
Variable-distance shifting like x = y << z may take more time on some CPUs if z is larger and less time if it is smaller.
(This makes RC5 and the AES finalist RC6 rather questionable designs in retrospect, with their intentional use of key-dependent rotation distances!)
On some CPUs, multiplication may run faster or slower depending on whether the upper half of the inputs is zero or not.
64-bit integer addition on 32-bit CPUs in principle might take more time depending on whether there is a carry. This is because, when x, y, and z are 64-bit integers, the logic x = y + z might look something more like:
int carry = 0;
x[0] = y[0] + z[0];
if (x[0] < y[0])    /* the previous addition overflowed */
    carry = 1;
x[1] = y[1] + z[1] + carry;
Consequently, the time it takes may depend on whether there is a carry from the sum of the low 32-bit halves to the sum of the high 32-bit halves. (In practice, this is usually only a concern on exotic CPUs or for other types of side channels like power analysis that are relevant more to smart cards than to laptops and phones.)
This might sound a little overwhelming. What can we do?
There are some operations that generally do run in constant time on most CPUs. They are:
- Bitwise operations: x & y, x | y, x ^ y, ~x, and other ones that don't appear in C like AND-with-complement.
- Constant-distance shifts and rotations like the shift x << 3 or the rotation x <<< 3 (not standard C but common in cryptography; it means (x << 3) | (x >> (32 - 3)), if x is 32-bit). A C version is sketched just after this list.
- Often integer addition and subtraction: x + y, x - y, when x and y are (say) unsigned 32-bit integers on a 32-bit CPU, and often even 64-bit integers on a 32-bit CPU with the help of ADD-with-carry instructions.
- Sometimes integer multiplication, but the story about multiplication is complicated, which is unfortunate for cryptography because multiplication mixes bits around quite nicely and has useful algebraic properties.
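For example, the rotation mentioned above might be written as follows (the rotl32_by3 name is just for the sketch):

#include <stdint.h>

/* Constant-distance 32-bit left rotation by 3, i.e. the x <<< 3 above.
   The distance is a compile-time constant, so the work done does not
   depend on the secret value of x. */
static inline uint32_t rotl32_by3(uint32_t x)
{
    return (x << 3) | (x >> (32 - 3));
}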
To be clear: I do not mean that a C compiler guarantees these operations run in constant time if you use them in a C program; I am merely using C notation for the operations that CPUs generally execute in constant time. (More on this caveat in a moment.)
‘But wait,’ you protest, ‘how can I possibly write a useful program out of these operations? No conditionals? No loops? No arrays?’
First, you don't have to eschew conditionals, loops, or arrays altogether. They just can't depend on secrets. For example, for (i = 0; i < 32; i++) ... x[i] ... is fine. But for (i = 0; i < m[0]; i++) ... is not fine if m[0] is supposed to be secret, and for (i = 0; i < n; i++) ... tab[x[i]] ... is not fine if x[i] is supposed to be secret.
Second, you can still build these things! It's just a little trickier. For example, suppose b is a uint32_t that is either 0 or 1. Then b - 1 is either -1 = 0xffffffff or 0, respectively, so

x = ((b - 1) & z) | (~(b - 1) & y);

causes x = y if b is 1, or x = z if b is 0—much like x = (b ? y : z), but without a branch. Obviously, this requires computing both y and z, so there is some performance impact! Similarly, you can look up an element of a table by looking up all elements of the table and selecting the one you want with bitwise operations like this. Not as fast as x[i], but not as leaky either.
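Here is a sketch of such a table lookup (the ct_lookup name and signature are made up; it assumes the table length n is public while the index is secret):

#include <stddef.h>
#include <stdint.h>

/* Constant-time table lookup: read every entry and use a bitwise mask to
   keep only the entry at the secret index.  Slower than tab[secret_index],
   but the memory access pattern no longer depends on the secret. */
uint32_t ct_lookup(const uint32_t *tab, size_t n, size_t secret_index)
{
    uint32_t result = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t diff = (uint64_t)i ^ (uint64_t)secret_index;
        /* is_equal is 1 when i == secret_index and 0 otherwise,
           computed without a branch */
        uint64_t is_equal = 1 ^ ((diff | (0 - diff)) >> 63);
        uint32_t mask = (uint32_t)(0 - is_equal);   /* all ones or all zeros */
        result |= tab[i] & mask;
    }
    return result;
}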
In general, you can convert a program with conditionals into a logic circuit with no conditionals, even if you don't want to for performance reasons. There are various other similar tricks you can do. You might draft a constant-time memory equality routine such as crypto_verify_32 like this, assuming x and y are uint8_t arrays:
uint32_t result = 0;
for (i = 0; i < 32; i++)
    result |= x[i] ^ y[i];
return ((result - 1) >> 8) & 1;
Exercise: Does this return 0 for equal and 1 for inequal, or 0 for inequal and 1 for equal?
Writing programs like this—and adopting cryptosystems such as X25519 that encourage implementations that look like this, instead of cryptosystems such as RSA or AES that encourage implementations involving secret-dependent branches or secret-dependent table lookups—is a good start for plugging timing side channels.
But, there's a catch! Remember when I said that the C compiler doesn't guarantee constant time? A smart C compiler like Clang/LLVM might recognize that the clever crypto_verify_32 loop above can be executed more efficiently by making it abort early, and might undo the hard work you did to rewrite it as a logic circuit that runs in constant time. (In other circumstances, it might help you out, for example by converting x = (b ? y : z); into a conditional move instruction, CMOV, without branches, but usually you can't rely on the C compiler's good will.)
There are some tricks you can do to thwart this, like an inline assembly fragment that causes the compiler to drop roughly all assumptions for optimization:
uint32_t result = 0;
for (i = 0; i < 32; i++)
    result |= x[i] ^ y[i];
asm volatile ("" ::: "memory");
return ((result - 1) >> 8) & 1;
This may or may not work with your compiler. To be confident, you really have to examine the compiler's generated machine code—and even then, a compiler might perform just-in-time optimizations that rewrite the machine code according to profiling analysis, especially in higher-level languages like Java. So you might really want to write the logic in assembly (or in a programming language like qhasm that can generate the fine-tuned assembly more reliably than a C compiler), and just call it from C.
Maybe some day C compilers will adopt a secret qualifier, like const or volatile, which forces the compiler to generate only machine instructions that are known—in some model of the CPU!—to run in constant time when operating on the object, and prevents the compiler from taking shortcuts like secret-dependent early aborts from a loop. But that day is not here yet.
There's also a matter of which machine instructions actually run in constant time on a CPU, which is sometimes documented and sometimes reliable. So in addition to doing engineering to build your programs out of logic circuits, you also need to do science to figure out which operations are actually safe to use on the CPU.
This brings us back to the original point: You really want to focus the effort of maintaining this into a library routine, so that each programmer doesn't have to keep track of vagaries of compilers (and CPU designs!) in generated code and timing on their own, and can instead leave it to our friendly neighborhood bear.
Are there other countermeasures than constant-time logic? Sometimes, yes.
You could inject random noise into your logic, in the hope that it will confuse the attacker's measurements. But there's already noise in their measurements, like scheduling in the operating system, so they just have to take more samples—and it turns out noise is not a very effective side channel countermeasure.
Specifically, artificial noise raises the attacker's costs by at most about the square of the ratio of artificial noise to true noise, which is far below what is usually considered an acceptable gap for security in cryptography. So it mostly costs you a lot of time doing nothing.
You could use algebraic properties of the cryptosystem to randomize it, sometimes called ‘blinding’. For example, instead of computing y^d mod n where d is a secret exponent in RSA, you could pick r at random, compute s := r^e mod n where e*d ≡ 1 (mod λ(n)), multiply y by s to get (y * r^e) mod n, compute (y * r^e)^d mod n = (r * y^d) mod n, and then divide off r.
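As a toy illustration over 64-bit words (real RSA uses big integers; the helper names, and the idea of passing in a precomputed r_inv with r * r_inv ≡ 1 (mod n), are assumptions of the sketch):

#include <stdint.h>

/* Toy modular arithmetic helpers (GCC/Clang unsigned __int128 extension). */
static uint64_t toy_modmul(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)(((unsigned __int128)a * b) % n);
}

static uint64_t toy_modexp(uint64_t base, uint64_t exp, uint64_t n)
{
    uint64_t acc = 1 % n;
    for (int i = 63; i >= 0; i--) {
        acc = toy_modmul(acc, acc, n);
        if ((exp >> i) & 1)
            acc = toy_modmul(acc, base, n);
    }
    return acc;
}

/* Blinded RSA-style exponentiation: randomize the input with r^e,
   exponentiate, then divide off r.  The caller supplies r and its
   inverse r_inv mod n (e.g. from an extended-gcd routine). */
uint64_t blinded_modexp(uint64_t y, uint64_t d, uint64_t e, uint64_t n,
                        uint64_t r, uint64_t r_inv)
{
    uint64_t s = toy_modexp(r, e, n);         /* s = r^e mod n                  */
    uint64_t blinded = toy_modmul(y, s, n);   /* y * r^e mod n                  */
    uint64_t z = toy_modexp(blinded, d, n);   /* (y * r^e)^d = r * y^d (mod n)  */
    return toy_modmul(z, r_inv, n);           /* divide off r                   */
}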
Many implementations, such as OpenSSL, use this approach because it's an easy way to retrofit an existing implementation of a cryptosystem like RSA that has the necessary algebraic structure. It's not a bad idea like random noise is, but it does have costs: you have to do the extra work for randomization, you have to have modular division or inversion logic—and side channels might still leak information about r and d. For example, even blinded modular exponentiation will leak the Hamming weight of d unless you take additional countermeasures like adding a random multiple of λ(n) to d first—which may expose additional side channels, etc.
For the specific case of comparing two byte strings for equality (for example, two message authentication codes), one reasonable option is to hash them with a pseudorandom function family like HMAC-SHA256 under a one-time secret key k, and check whether HMAC-SHA256_k(x) == HMAC-SHA256_k(y).
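A sketch of that with libsodium's crypto_auth_hmacsha256 (the wrapper name and the assumption that both inputs are 32-byte MACs are mine; a real program would call sodium_init() once at startup):

#include <string.h>
#include <sodium.h>

/* Rerandomized comparison: instead of comparing x and y directly, compare
   HMAC-SHA256_k(x) and HMAC-SHA256_k(y) under a fresh one-time key k. */
int macs_match_rerandomized(const unsigned char x[32], const unsigned char y[32])
{
    unsigned char k[crypto_auth_hmacsha256_KEYBYTES];
    unsigned char hx[crypto_auth_hmacsha256_BYTES];
    unsigned char hy[crypto_auth_hmacsha256_BYTES];

    randombytes_buf(k, sizeof k);            /* one-time secret key */
    crypto_auth_hmacsha256(hx, x, 32, k);
    crypto_auth_hmacsha256(hy, y, 32, k);

    /* A variable-time comparison is now safe: the attacker does not know k,
       so the timing tells them nothing useful about x or y. */
    return memcmp(hx, hy, sizeof hx) == 0;
}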
The probability of a false positive is 1/2^256, which is a smaller probability than you ever have to worry about. You can safely use variable-time equality for the HMAC because if x is not equal to y, then the amount of time even in the naivest byte string equality routine (assuming it doesn't bail out at the first zero byte or something stupid like that!) will be independent of the values of x and y: there's a 255/256 probability it'll stop after one iteration, a 65535/65536 probability after two iterations, etc.
Of course, this really helps only if you can implement HMAC-SHA256 in constant time! Fortunately SHA-256 is designed to be easily implemented as a constant-time logic circuit, so C implementations tend to be reasonably resistant to side channels—but, say, Python will get you into trouble because of the small integer cache if nothing else.
* The terminology is unfortunately a little confusing. Here constant-time means that the amount of time does not vary depending on inputs, and is not the same as the asymptotic notion of ‘constant-time’ in computer science, often written O(1), which just means the amount of time may vary depending on inputs but is bounded by a constant. I'm sorry. I didn't invent the terminology. I might have chosen ‘fixed-time’ vs. ‘variable-time’ but it's too late now—‘constant-time’ is entrenched in the literature.