13

I'm trying to wrap my head around "meltdown", but to first understand it, I've been trying to understand memory accesses.

From what I understand, the CPU first looks up the virtual address in the translation lookaside buffer (TLB), which caches recent virtual-to-physical translations. If the translation is there, we can fetch the data immediately (ideally from the CPU cache). If not, we have to look up the address in the page table.
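
To make sure I have the order of operations straight, here's a toy software model of that flow. The names (`translate`, `page_table_walk`), the 16-entry TLB, and the mapping itself are purely illustrative, nothing like real hardware:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12                    /* 4 KiB pages */

/* One cached translation: virtual page number -> physical frame number. */
struct tlb_entry {
    uint64_t vpn, pfn;
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the multi-level walk; a real one reads page-table memory. */
static uint64_t page_table_walk(uint64_t vpn)
{
    return vpn + 0x100;                   /* toy mapping, not a real translation */
}

/* Consult the TLB first; only walk the page table on a miss. */
static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)      /* TLB hit: no walk needed */
            return (tlb[i].pfn << PAGE_SHIFT) | offset;

    uint64_t pfn = page_table_walk(vpn);            /* TLB miss: slow path */
    tlb[vpn % TLB_ENTRIES] = (struct tlb_entry){ vpn, pfn, true };
    return (pfn << PAGE_SHIFT) | offset;
}

int main(void)
{
    printf("%#llx\n", (unsigned long long)translate(0x7f0000001234ULL)); /* miss */
    printf("%#llx\n", (unsigned long long)translate(0x7f0000001234ULL)); /* hit */
    return 0;
}
```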

Now from my reading, every process has its own page table. However, every process also has the kernel mapped into its page table.

Presumably the page table also has access bits; obviously we can't allow reads directly into kernel space while the CPU is in user mode (I think this is called "ring 3").

From what I understand of page tables, these access bits are stored in the low-order bits of each page table entry. Since pages are 4 KiB-aligned, the bottom 12 bits of an entry aren't needed for the frame address, leaving plenty of bits left over to store access bits.
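
If I've got this right, on x86-64 the layout looks something like this (the sample entry value below is made up):

```c
#include <stdint.h>
#include <stdio.h>

/* Low-order flag bits of an x86-64 page-table entry. Because frames are
 * 4 KiB-aligned, bits 0-11 of the entry are free for metadata. */
#define PTE_PRESENT (1ULL << 0)   /* page is mapped */
#define PTE_WRITE   (1ULL << 1)   /* writable */
#define PTE_USER    (1ULL << 2)   /* accessible from ring 3 */

#define PTE_FRAME_MASK 0x000FFFFFFFFFF000ULL  /* bits 12-51: frame address */

int main(void)
{
    /* A made-up entry: frame 0x1234000, present, writable, kernel-only. */
    uint64_t pte = 0x1234000ULL | PTE_PRESENT | PTE_WRITE;

    printf("frame:   %#llx\n", (unsigned long long)(pte & PTE_FRAME_MASK));
    printf("present: %d\n", !!(pte & PTE_PRESENT));
    printf("user:    %d\n", !!(pte & PTE_USER));  /* 0 -> a ring-3 access must fault */
    return 0;
}
```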

From what I've read about the exploit, the issue is that the access check is done after the data is retrieved. The reason is efficiency: we want to get the data to the CPU quickly, and we can catch the permission fault before making any permanent changes. But unfortunately we've already affected the CPU cache by doing a dependent memory fetch, which is detectable using timing attacks.
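
As I understand it, the transient core of the exploit looks roughly like this. This is a simplified sketch of the pattern described in the Meltdown paper; `kernel_addr` is a hypothetical unreadable address, the load architecturally faults, and the fault suppression/recovery a real proof of concept needs (signal handler, TSX, etc.) is omitted:

```c
#include <stdint.h>

static uint8_t probe[256 * 4096];   /* one page per possible byte value */

void transient_read(const volatile uint8_t *kernel_addr)
{
    /* Architecturally this load faults (user mode, kernel page), but the
     * CPU may already have executed the dependent load below speculatively. */
    uint8_t secret = *kernel_addr;

    /* The dependent load pulls exactly one probe page into the cache;
     * which page is now fast to read encodes the secret byte.
     * (volatile forces the compiler to actually emit the load.) */
    *(volatile uint8_t *)&probe[secret * 4096];
}
```

The attacker then times a read of each of the 256 probe pages; the one that comes back fast reveals the byte.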

This check-after-fetch scheme might make sense if the page lookup were cheap but the access check expensive. But from my understanding, that doesn't seem to be the case.

I've read that the page table on a 64-bit machine has at least three levels, which means at least three dependent memory lookups. Hopefully these are in the cache, but if they aren't, that means going out to memory for the page table's own pages.
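
For what it's worth, on x86-64 it's actually four levels: the virtual address splits into four 9-bit indices plus a 12-bit page offset, so a cold walk is four dependent memory reads before the data itself:

```c
#include <stdint.h>
#include <stdio.h>

/* Split a 48-bit x86-64 virtual address into its four page-table indices
 * (9 bits each) plus the 12-bit page offset. Each index selects an entry
 * at one level of the walk. */
int main(void)
{
    uint64_t vaddr = 0x00007f1234567abcULL;   /* example user address */

    unsigned pml4 = (vaddr >> 39) & 0x1FF;    /* level 4 */
    unsigned pdpt = (vaddr >> 30) & 0x1FF;    /* level 3 */
    unsigned pd   = (vaddr >> 21) & 0x1FF;    /* level 2 */
    unsigned pt   = (vaddr >> 12) & 0x1FF;    /* level 1 */
    unsigned off  =  vaddr        & 0xFFF;    /* byte within the page */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=%#x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```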

After we've done all this work and finally found the page table entry, when we load the physical address from it we also load the access bits. Why not just check them there? It seems far more trivial to check access bits we've already loaded than to muck around with circuitry to deal with it later on.

I'm obviously missing something about how the CPU is working, but I can't work out what. We have to do the page table lookup to even work out what to fetch, and once we've gone to that trouble why not just check the access bit?

Clinton

2 Answers

2

A couple things you missed:

  1. The kernel memory in question might already be in cache, making fetching the data just as fast as fetching the access bits; it's a content-addressed lookup for both (see the timing sketch below). Even if it's not cached, there's the TLB. A full multi-level page-table walk is not the common case to optimize for when performance is the goal.

  2. The privilege level itself might be changed by other half-executed instructions in the pipeline. Until those instructions are fully retired, the privilege level itself is only a speculation.

I wonder if #2 will lead to the discovery of even more vulnerabilities...
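
Regarding #1, here's a minimal sketch of the cached-vs-uncached gap that both the "just as fast" claim and the question's timing attack rest on. It assumes an x86 CPU with GCC/Clang intrinsics, and the exact tick counts vary by machine:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_lfence, __rdtscp */

/* Time a single load in TSC ticks; lfence keeps the timestamp reads
 * from reordering around the load. */
static uint64_t time_load(const volatile uint8_t *p)
{
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                        /* the load being measured */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

int main(void)
{
    static uint8_t buf[64];

    (void)time_load(buf);            /* warm-up: pulls buf into cache */
    printf("cached:   %llu ticks\n", (unsigned long long)time_load(buf));

    _mm_clflush(buf);                /* evict: the next load must go to DRAM */
    _mm_lfence();
    printf("uncached: %llu ticks\n", (unsigned long long)time_load(buf));
    return 0;
}
```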

Ben Voigt
  • I would actually be pretty damn surprised if the CPL (or IOPL, etc) itself were speculated. A lot of low-level CPU logic is changed by certain privileged instructions that affect everything (such as `OUT`), so these are likely never speculated. How could you speculate an `OUT` and a subsequent `IN` where the latter is dependent on the former? Or take for example `INT` (as in `int 0x80`) or `SYSCALL` which go to CPL0. Those instructions take a long time to retire, at least as long as the syscall itself is running. You couldn't speculate based on what these syscalls return. – forest Mar 07 '18 at 06:18
  • @forest: It's more likely to be the return-from-trap (whether using software interrupts, syscall enter, or whatever) that gets speculated, because that just resumes already decoded instructions from the primary instruction stream. That direction is also more dangerous (because if the change in privilege isn't applied to speculated instructions, they get run with higher privilege than they should) – Ben Voigt Mar 07 '18 at 06:56
  • I'm not sure how that would be more dangerous. If an instruction like `OUT` is speculated and not actually retired, the effects it triggers will not be visible because they will not be fully executed. Only the contents of the cache will be modified. – forest Mar 07 '18 at 07:06
  • @forest: What if they are speculated and succeed under the current privilege level, then the prediction is correct so the speculated execution path does get retired? Now, instead of the effect being a violation trap, the action goes through. – Ben Voigt Mar 07 '18 at 07:10
  • If the prediction was correct, then there is no issue. That would mean the instructions were supposed to run privileged. – forest Mar 07 '18 at 07:13
  • @forest: No... correct branch prediction means the instructions were actually reached, it does not mean they succeed. In the scenario I'm outlining, they are reached after a return-from-interrupt, so they should execute unprivileged and cause a trap. But the RFI wasn't retired yet when speculation started... – Ben Voigt Mar 07 '18 at 15:39
  • Oh I think I get what you mean. That would be an interesting thing to test. I imagine that specific issue would not be present in a modern x86 core, but they are so complex... Who knows? – forest Mar 08 '18 at 05:23
1

The real reason is that no-one realised the security issues during design. There's no fundamental reason that CPUs can't be implemented securely, and your question outlines one way to do that.

Speculative execution is for performance, so the CPU tries to do as much work as it can before actual execution, but it can't modify the "architectural state" - what software can see. Loading memory into an internal buffer doesn't affect the state, so it can do that when it likes. But causing an access violation does affect it, so current designs wait until actual execution. (This is some speculation on my part).

Despite AMD not being vulnerable to Meltdown, I suspect that's more luck than planning. The Meltdown paper says that both Intel and AMD perform speculative lookups after an access violation; the authors just couldn't craft a practical exploit on AMD.

A reasonable fix is to check permissions before doing speculative lookups, and stall speculative execution if the check fails. When actual execution catches up, it can raise an interrupt and avoid leaving any side effects.
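
In toy form, that ordering might look like this. The names and return convention are mine, purely illustrative; real hardware does this in parallel pipelines, not sequential C:

```c
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_USER    (1ULL << 2)   /* set: ring 3 may access */

/* Toy model of the proposed ordering: check the entry's permission bits
 * *before* issuing any (speculative) data fetch. */
static int speculative_load(uint64_t pte, int cpl, uint64_t *out)
{
    if (!(pte & PTE_PRESENT))
        return -1;                    /* stall / fault: not mapped */
    if (cpl == 3 && !(pte & PTE_USER))
        return -1;                    /* stall: user code, kernel-only page */

    *out = 0xdeadbeef;                /* only now touch memory and the cache */
    return 0;
}

int main(void)
{
    uint64_t v;
    uint64_t kernel_pte = PTE_PRESENT;        /* no PTE_USER bit */
    printf("ring-3 read of kernel page: %s\n",
           speculative_load(kernel_pte, 3, &v) ? "stalled" : "allowed");
    (void)v;
    return 0;
}
```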

paj28
  • The only design I can see that would make sense would be if the L1 cache operates on virtual addresses, and assumes that if the last operation on some virtual address yielded some value, the next one (whether by the same or different process) is "likely" to do so as well. Any instructions that would be affected by the fetched data would have to remain speculative while the system fetched data "for real", and would need to be discarded if the real data didn't match the speculation, but if real data matched speculation all operations that depended upon it could be retired simultaneously. – supercat Jan 08 '18 at 23:44
  • @supercat - I really don't think that's necessary (for Meltdown at least) - I expect the TLB already stores access flags, but if it doesn't, that's the obvious place to cache them. – paj28 Jan 09 '18 at 10:39
  • There's no reason the TLB shouldn't hold access flags, nor is there any reason the system wouldn't know whether any access that used the TLB was authorized by the time data retrieved using a TLB-supplied address was available. On the other hand, it could make sense to start fetching data from a logical-address L1 cache at the same time as it starts trying to fetch the TLB entry. If there's an L1 hit and a TLB miss, the system couldn't use the L1 data "for real" until the TLB lookup was complete, but using it speculatively could be advantageous. – supercat Jan 09 '18 at 17:36
  • Further, if there is a TLB miss and data later gets fetched, having the L1 cache receive the data without regard for the privilege level of the code that had initiated the request would be simpler than having to track such permissions along that pipeline. – supercat Jan 09 '18 at 17:40
  • @supercat - Gotcha. I didn't know about virtually addressed cache, that is another spanner in the works! – paj28 Jan 09 '18 at 19:08