7

I found the comparison function below (slightly modified) from a crypto library I was using. I was curious about the potential vulnerability to side channel attacks. Specifically, the character comparison is only done if the character being compared is within the bounds of the two strings. I suspected this might allow an attacker to determine string length.

Perhaps this difference is simply too small to be subject to a timing attack, but I played with an attempt below. I basically create strings of increasing lengths and compare to a given initial string. I was expecting to perhaps see linear growth in comparison time both before and after the point where the second string becomes longer, but perhaps with a different slope since the operations performed are different.

Instead, I see the data below (note the string being compared is 27 characters in length). Any explanation as to why I have no clue what I'm talking about would be greatly appreciated :)

A small note, I did try with -O0 in case some strange optimization was at fault. The only thing I can think to do from here is start digging into the generated assembly.

[Graph: string comparison time vs. length of string]

#include <string.h>
#include <sys/time.h>
#include <stdio.h>


int CompareStrings(const char* s1, const char* s2) {
    int eq = 1;
    int s1_len = strlen(s1);
    int s2_len = strlen(s2);

    if (s1_len != s2_len) {
        eq = 0;
    }

    const int max_len = (s2_len < s1_len) ? s1_len : s2_len;

    // to prevent timing attacks, should check entire string
    // don't exit after found to be false
    int i;
    for (i = 0; i < max_len; ++i) {
      if (s1_len >= i && s2_len >= i && s1[i] != s2[i]) {
        eq = 1;
      }
    }

    return eq;
}

double time_diff(struct timeval x, struct timeval y) {
    double x_us, y_us, diff;

    x_us = (double)x.tv_sec * 1000000 + (double)x.tv_usec;
    y_us = (double)y.tv_sec * 1000000 + (double)y.tv_usec;

    diff = y_us - x_us;

    return diff;
}

void test_with_length(char* str1, int n) {
    char str2[n + 1];
    struct timeval tp1;
    struct timeval tp2;
    int i;

    for (i = 0; i < n; i++) {
        str2[i] = 'a';
    }
    str2[n] = '\0';

    gettimeofday(&tp1, NULL);
    for (i = 0; i < 20000000; i++) {
        CompareStrings(str1, str2);
    }
    gettimeofday(&tp2, NULL);
    printf("%d %.01f\n", n, time_diff(tp1, tp2));
}

int main() {
    char *str1 = "XXXXXXXXXXXXXXXXXXXXXXXXXXX";

    int i = 0;
    for (i = 1; i <= 100; i++) {
        test_with_length(str1, i);
    }
}
Michael Mior

2 Answers

6

First, some comments on the code:

  • Your calls to strlen are a black box - you cannot possibly know how they react to timing attacks.
  • Your compare loop exits after you exhaust the shortest string, leaking the shorter length.
  • You did not remove bias generated by the loop in test_with_length.
  • Array accesses are not O(1) time complexity, especially if each element is smaller than the word width of your CPU.

A better way to compensate for such attacks:

#include <stdbool.h>

bool CompareStr(const char* strA, const char* strB)
{
    bool result = true;
    int lenA = 0, lenB = 0;

    while(strA[lenA] != 0)
        lenA++;
    while(strB[lenB] != 0)
        lenB++;

    int maxLen = (lenA > lenB) ? lenA : lenB;

    // compensate for array access to some extent
    int diff = (lenA > lenB) ? lenA - lenB : lenB - lenA;
    char dummy = '\0';
    for(int i = 0; i < diff; i++) { dummy = strA[0]; }
    (void)dummy;

    // better way to avoid timing attacks
    for(int i = 0; i < maxLen; i++)
    {
        if ( ((i >= lenA) ? strA[0] : strA[i]) != ((i >= lenB) ? strB[0] : strB[i]) )
            result = false;
    }

    return result;
}

// compute loop overhead
time_pre_loop = getTime();
for(int i = 0; i < LOOP_COUNT; i++) { }
time_post_loop = getTime();
double diff = time_diff(time_pre_loop, time_post_loop);

time_pre_test = getTime();
for(int i = 0; i < LOOP_COUNT; i++)
    CompareStr(strA, strB);
time_post_test = getTime();

double result = time_diff(time_pre_test, time_post_test) - diff;

Anyway, here's where the fun part starts. Consider the following assembly code that could be used to compute whether two strings are equal, with early-out optimisation removed as a basic timing attack defense:

 xor ecx, ecx                         ; clear counter to 0
 mov dword ptr ds:[0040AAAA], 1       ; set result to true
loopStart:
 mov ah, byte ptr ds:[ecx+00401234]   ; read char from string A
 mov bh, byte ptr ds:[ecx+00402468]   ; read char from string B
 cmp ah, bh                           ; compare the chars
 jz match                             ; are they equal?
 mov dword ptr ds:[0040AAAA], 0       ; strings don't match! set result to false
match:
 cmp ah, 0                            ; are we at the end of string A?
 jz done
 cmp bh, 0                            ; are we at the end of string B?
 jz done
 inc ecx                              ; increment counter and continue loop
 jmp loopStart
done:
 ret

There are actually a few timing attacks here. The first two instructions are O(1). However, the two single-byte reads that follow are not.

Whilst memory reads are usually treated as O(1), in practice they are not. In fact, they don't fit into the big-O model at all. On a 32-bit system, every 4th byte read will be faster than the others. On a 64-bit system, every 4th byte will be faster, and every 8th byte will be faster still. This is due to non-aligned memory reads. The CPU is great at fetching blocks of data that are aligned to its bus size, but not so great at fetching random individual bytes. The CPU will attempt to optimise this by pipelining instructions and loading the block into its L2 cache, which makes this issue more subtle, but it's still there.

The next trick is for strings that are larger than the CPU's cache. If you have a string that exceeds this size (usually 128KB or higher for the L2 cache) you'll see slight delays in memory fetches every time a read exceeds that boundary. This is usually only a small delay, since the L3 cache will be used to store the next block, but we can see an even bigger difference for larger strings (8MB+) as the memory fetches have to be done from system memory.

If we consider a block memory copy as O(n), where n is the memory length, this makes some sense, but does not properly depict the small variances in timing caused by implementation idiosyncrasies.

Finally, we can see that the mov dword ptr ds:[0040AAAA], 0 instruction is executed whenever one string does not match another. This leaks information in two ways:

  • Allows us to estimate the relative length of the shorter string by identifying the extra time used when repeatedly setting the result.
  • If we can measure accurately, allows us to identify which characters are different.

As we can see, it's quite difficult to prevent these issues. Here's my attempt at a better system, assuming fixed-length strings padded to alignment with null bytes:

int r = 0, d = 0;
int* bufferA = (int*)stringA;
int* bufferB = (int*)stringB;
for(int i = 0; i < BUFFER_SIZE / (int)sizeof(int); i++)
{
    d = bufferA[i] ^ bufferB[i]; // if equal, d = 0
    r |= d; // if all cases are 0, r will be 0
}
return r; // result is 0 if equal, non-zero if not equal

This is almost perfectly O(n) for the following reasons:

  • Bitwise xor of aligned ints is O(1)
  • Bitwise or of aligned ints is O(1)
  • Memory reads are aligned.
  • Neither length is leaked, due to BUFFER_SIZE padding.
  • No branches, excluding the loop.
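If the alignment assumption cannot be guaranteed, a byte-wise variant of the same accumulate-the-differences idea avoids unaligned word reads at the cost of the word-width speedup. This is a sketch, not part of the original code; `ct_memcmp` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Constant-time comparison of two fixed-size buffers. Reads only
 * single bytes, so it works at any alignment; every byte is always
 * visited and there are no data-dependent branches. */
int ct_memcmp(const void *a, const void *b, size_t len)
{
    const uint8_t *pa = a;
    const uint8_t *pb = b;
    uint8_t r = 0;

    for (size_t i = 0; i < len; i++)
        r |= pa[i] ^ pb[i];   /* accumulate all differing bits */

    return r;                 /* 0 if equal, non-zero otherwise */
}
```

As with the word-wise version, both buffers must be padded to the same fixed length so that neither operand's true length is leaked.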
Polynomial
    I'll check out this answer more this evening. A couple quick things I noticed. First, the compare loop doesn't exit until going through the longer string. (The loop goes up to `max_len`, which is set as the maximum of the two string lengths.) As for your suggested solution, I would have written almost the same thing; I was just curious to see what exploits were possible with the code I had. Anyway, thanks for such a detailed answer! :) – Michael Mior Sep 13 '12 at 15:14
  • 1
    @MichaelMior The problem with your loop is that the `if` isn't one `if`, it's 3. Depending on the order it gets compiled to, and how it handles the `eq` variable, you might find out all sorts of juicy information from the timings. – Polynomial Sep 13 '12 at 15:17
  • That was my assumption, which was why I tried this exercise. I'm not sure what you mean about `strlen` being a black box. The implementation is easily looked up. But I agree it's much easier to reason using the simple loop in your answer. As for the loop overhead, since it's constant, I assume it doesn't really matter. Removing it would certainly make the measurements more accurate, but I don't think it's necessary for correctness. – Michael Mior Sep 14 '12 at 06:37
  • Thanks for your point about array access times. I'm not sure what you mean when you say it's not "linear." That seems to imply that array access is O(n), which of course isn't the case. Anyway, accepting this answer because it explains a lot :) – Michael Mior Sep 14 '12 at 06:40
  • It's neither O(1) nor O(n), since the time "glitches" occur on alignment and memory boundaries (L1 cache, L2 cache, etc). It's predictable, but it doesn't fit into a big-O model. – Polynomial Sep 14 '12 at 08:07
  • Made a few edits to reflect what's been said in the comments. – Polynomial Sep 16 '12 at 08:30
  • @Polynomial: I am sorry, but there are a number of wrong assertions in your answer. In the code in the question, the for loop does NOT exit when the shorter of the two strings is exhausted (it _could_ if the compiler's optimizer is smart, but `gcc -O0` is not smart). Byte accesses are NOT faster when at addresses multiple of 4 or 8; what is slower (or even forbidden, depending on processor architecture) is unaligned accesses for words longer than a byte. In the code given, everything will be in L1 cache after the first call, so there will be no measurable cache effect. – Tom Leek Sep 16 '12 at 14:51
  • @Polynomial: also, your "attempt at a better system" has bugs. In particular, it assumes that the two strings begin at an address which is a multiple of `sizeof(int)`, which is not guaranteed. x86 and PowerPC architectures will forgive you your unaligned accesses; on a Sparc, you would get a well-deserved SIGBUS. – Tom Leek Sep 16 '12 at 14:53
  • @TomLeek I do state the caveats of the last code sample. I'm not sure I agree on the byte accesses front - last time I checked the JEDEC standard it specified only multiples of the bus width for memory fetch operations. This means that for the first byte of each block _should_ be slower to access. I'm unsure as to the specifics of individual CPUs, but most (if not all) microcontrollers and volatile memory ICs that I've worked with have overheads for non-word-aligned single-byte memory reads. – Polynomial Sep 16 '12 at 22:52
  • @Polynomial: that's a strange assertion. If the CPU wants byte 64783, with a 32-bit bus, it will fetch the whole word at address 64780 and then extract the fourth byte of that word. It should take the same time than accessing byte 64780 or byte 64781. What's slower is accessing an unaligned _word_ because it would translate to two fetches. (Besides, the Intel manuals and the PowerPC ISA agree with me.) – Tom Leek Sep 17 '12 at 11:01
  • @Polynomial: ah, I get it. What is _faster_ is reading a word-aligned single byte on a machine which does not have an opcode for that, _as long as the compiler is aware that the access is aligned_. Example would be pre-21164 Alpha processor (about 15 years ago...). Generic byte accesses need a word read then a shift; if the byte is already aligned and this is known at compile-time, then the shift can be skipped. But this does not really apply to an indexed access in an array, because the index is a runtime value. – Tom Leek Sep 17 '12 at 11:05
2

First, there is a bug in your comparison function; the "eq = 1;" in the loop should read "eq = 0;". It should not change the timing, though, assuming that the compiler does not play games with the optimizer. To make your code more resistant to optimization games, you should put the benchmarked function (your CompareStrings()) and the benchmark code (test_with_length()) in distinct modules (separately compiled C source files, and pray for the compiler not to perform inter-module optimizations -- gcc should be safe for that).

What you see looks correct. The two strlen() calls should have a cost which rises roughly linearly with the sum of the lengths of the two strings; but strlen() plays some nifty games in order to be faster. In particular, strlen() will try to read data by whole aligned chunks (4, 8, maybe 16 bytes), and this works because the implementation knows that memory access control by the MMU operates at page granularity, so "safe" out-of-bounds reads within a page are possible in many cases. Small strings are handled by specialized code paths, and the lengths here (at most 100 bytes) are too small to really exhibit asymptotic behaviours.
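The word-at-a-time trick can be sketched as below. This is a simplified illustration of the idea, not glibc's actual code (real implementations also use SIMD and handle page boundaries more carefully); it relies on the classic "does this word contain a zero byte" bit test.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified word-at-a-time strlen for 64-bit words. Once the pointer
 * is aligned, each iteration inspects 8 bytes at once; the expression
 * (w - 0x0101...) & ~w & 0x8080... is non-zero iff some byte of w is 0. */
size_t wordwise_strlen(const char *s)
{
    const char *p = s;

    /* advance byte-by-byte until the pointer is 8-byte aligned */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* scan one aligned 64-bit word per iteration */
    const uint64_t *wp = (const uint64_t *)(const void *)p;
    for (;;) {
        uint64_t w = *wp;
        if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) {
            /* a zero byte is somewhere in this word; locate it */
            const char *q = (const char *)wp;
            while (*q != '\0')
                q++;
            return (size_t)(q - s);
        }
        wp++;
    }
}
```

The aligned word read may extend past the terminating null byte, which is exactly the kind of "safe" out-of-bounds access described above: an aligned word never crosses a page boundary, so the read cannot fault.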

Chances are that the cost of the two strlen() is small with regards to the for loop. It will introduce some "noise" in the graph, though, so you should not look too closely at the dots.

The for loop will run for max_len steps, which is the length of the longer of the two strings. Therefore, as a first approximation, we expect the cost of that loop to be linear in the length of the larger of the two strings, so you should see a roughly constant time when the second operand length grows from 1 to 27, then an increase -- which is what you see in the graph.

Tom Leek
  • Thanks for pointing out the bug. I translated from another code base that used some boolean macros and obviously made a typo there. I agree with what you have in your last paragraph. I was just expecting the possibility of some difference due to the fact that the character comparisons are no longer performed after the length of the second string exceeds that of the first. – Michael Mior Sep 16 '12 at 21:52