Why are AMD processors not/less vulnerable to Meltdown and Spectre?

Question

I’ve read up on Meltdown and Spectre and it’s not obvious to me why AMD would be less vulnerable. Do AMD processors simply not have speculative execution? Or do they have some way of not exposing the same side channels?

Update: I ask because AMD’s press releases claim their products are less vulnerable.

I think that this question would be better suited for the Information Security site on this network since it's not a programming question and is actually just about an information security topic. — Tophandour, Jan 05 '18 at 22:55
I thought about that, but I figured SO users would know more about processor architecture — Ethan Reesor, Jan 05 '18 at 22:56
Meltdown: Intel does more aggressive speculation that alters CPU state even for blocked accesses. Explained here: https://arstechnica.com/gadgets/2018/01/meltdown-and-spectre-every-modern-processor-has-unfixable-security-flaws/ — , Jan 05 '18 at 22:59
Does ISSE cover hardware security issues? IS has always seemed software-centric to me and my question is definitely about processor architecture and not about software. — Ethan Reesor, Jan 05 '18 at 23:12

score 28 · Accepted Answer · edited Jun 16 '20 at 09:49

Only Meltdown is specifically an Intel vulnerability / design flaw.

update: it seems AMD is mostly resilient to Spectre. It's not clear why that would be the case. But according to AMD:

(from early January, now replaced, see update 2 below)

Differences in AMD architecture mean there is a near zero risk of exploitation of this variant. Vulnerability to Variant 2 (Branch Target Injection) has not been demonstrated on AMD processors to date.

Note that it's only a "near zero", unlike the "zero" for Meltdown. (And that the "bounds check bypass" Spectre v1 attack (conditional direct branch before an array access) still needs to be fixed in software.)

update 2: AMD may have underestimated Spectre, because they are still releasing microcode updates, and their web site now says:

GPZ Variant 2 (Branch Target Injection or Spectre) is applicable to AMD processors.

While we believe that AMD’s processor architectures make it difficult to exploit Variant 2, we continue to work closely with the industry on this threat.

It's still not clear what microarchitectural design feature / difference makes AMD CPUs any more resilient than Intel; they do still have a large out-of-order window and AFAIK are not significantly more limited in what can be in flight at once. But their branch predictors are designed differently, and are maybe harder to mis-train or are somehow insulated across privilege boundaries? But the latter would make them immune to Spectre v1 as well, and AMD CPUs are just as vulnerable to the bounds-check bypass version of Spectre.

Spectre (v2) affects any "normal" CPU design with branch prediction + speculative execution for indirect branches. In Spectre, the attacker primes / trains the branch predictor to mispredict the target of an indirect branch in kernel code or in another process. Until the mis-predict is discovered, the privileged process will speculatively execute some instructions that change the micro-architectural state of the machine in a way that depends on secret data.

(Then the attacker uses a cache-timing attack to read that microarchitectural state, the same as in Meltdown. When attacking the kernel, you'd ideally choose a "gadget" that uses the secret data to index an array your attacking user-space code also has read privilege for. If you can't, you might have to use a cache-eviction timing attack to look for which set of cache had a line evicted instead of looking for a line that became hot.)

Defeating Spectre in HW might take something as costly as flushing the branch-predictor caches on every transition across trust boundaries, or at least the indirect-branch target prediction buffers. (The kernel / user boundary is one the HW is aware of, but a transition into sandboxed JIT-compiled code (especially JavaScript) is a problem, because the hardware can't see that boundary.)
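
To make the mis-training idea concrete, here is a minimal toy model in Python. It is not how any real predictor is built: the class name `ToyBTB`, the 8-bit index, and all addresses are assumptions chosen for illustration. It only shows the principle that a predictor indexed by partial address bits, with no tag for privilege level or process, lets one context plant a target that another context then predicts, and that flushing on a trust-boundary crossing removes it.

```python
class ToyBTB:
    """Toy indirect-branch target buffer, indexed by low address bits only."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.entries = {}  # index -> last observed target

    def _index(self, branch_addr):
        # No tag and no privilege/context ID, so different branches can alias.
        return branch_addr & self.mask

    def predict(self, branch_addr):
        return self.entries.get(self._index(branch_addr))

    def update(self, branch_addr, actual_target):
        self.entries[self._index(branch_addr)] = actual_target

    def flush(self):
        # Models the costly mitigation: drop every prediction when crossing
        # a trust boundary.
        self.entries.clear()


# All addresses below are hypothetical.
attacker_branch = 0x0000000000401234   # user-space indirect branch
victim_branch = 0xFFFFFFFF81005234     # kernel indirect branch, same low 8 bits
gadget = 0xFFFFFFFF81ABC000            # code the attacker wants run speculatively

btb = ToyBTB()
btb.update(attacker_branch, gadget)     # attacker trains the shared predictor
poisoned = btb.predict(victim_branch)   # victim's branch aliases to that entry

btb.flush()                             # flush on the user -> kernel transition
after_flush = btb.predict(victim_branch)
```

In this model `poisoned` equals `gadget`, while `after_flush` is `None`, i.e. the victim has no attacker-chosen prediction to follow. Real predictors also hash in branch history, which is part of why the details differ so much between microarchitectures.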

The key to Meltdown is that Intel CPUs (but apparently not ARM or AMD) don't squash under-privileged TLB hits. The load executes, and so do following instructions, and it only actually faults when the faulting load tries to retire. This allows unprivileged code to directly cause changes in the microarchitectural state itself, based on data it doesn't have permission to read, instead of (Spectre) tricking a privileged process into doing it.

(Note that you only get actual data if the line was already hot in L1d cache; it seems that on vulnerable CPUs even repeated Meltdown attacks on the same line won't cause the CPU to bring a line into cache if it wasn't there already.)

;; run this in user-space
;; (and suppress or catch the fault somehow so you can do it quickly/repeatedly)

;; clflush all the cache lines in local_array (which you have permission to read)

;; create a long delay before following instructions can retire, but with few uops so OoO exec can see past it
times 30  sqrtpd  xmm0, xmm0     ; high latency per uop

movzx  eax, byte [kernel_byte]    ; eventually faults, but OoO exec continues first
shl    eax, 12
mov    eax, [local_array + rax]   ; the cache-line this touches depends on the secret data
;; after the movzx faults, the cache-line touched by the mov will still be hot

; then check which cache line of local_array is already hot.
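
The encoding step in that sequence (`shl eax, 12`, then a load from `local_array + rax`) can be modelled without any hardware at all. The Python sketch below is only a toy: the "cache" is a plain set of line numbers, and the helper names `probe_offset`, `transient_touch` and `recover_byte` are made up for this illustration. It shows why 256 pages of `local_array` are enough to signal one byte, and how the receiver maps a hot line back to a value.

```python
PAGE_SHIFT = 12   # matches `shl eax, 12` in the sequence above
LINE_SIZE = 64    # bytes per cache line


def probe_offset(secret_byte):
    """Byte offset into local_array that the dependent load touches."""
    return secret_byte << PAGE_SHIFT


def transient_touch(cache, secret_byte):
    """Model the dependent load: one line of local_array becomes cached."""
    cache.add(probe_offset(secret_byte) // LINE_SIZE)


def recover_byte(cache):
    """Receiver side: check the first line of each of the 256 pages."""
    hot = [b for b in range(256) if probe_offset(b) // LINE_SIZE in cache]
    return hot[0] if len(hot) == 1 else None


cache = set()                  # models the state right after the clflush loop
transient_touch(cache, 0x5A)   # 0x5A stands in for the secret byte
recovered = recover_byte(cache)
```

On real hardware the membership test is replaced by timing a load from each candidate line. The one-page stride is usually explained as a way to keep hardware prefetchers from making neighbouring candidates look hot.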

(And BTW, I emailed the authors of the Meltdown paper about this. They were in a rush to get it published by the public disclosure date which apparently moved up by 1 week unexpectedly. They are planning to clarify their section on the microarchitectural details to make it clear that this is the real design flaw that enables Meltdown.)

(Meltdown depends on the kernel keeping its pages mapped into the virtual address space of the process, but with a bit set in the page table entries that flag them as kernel-only mappings. This makes system calls and interrupts cheaper, because they don't have to modify the page tables to allow access to kernel memory: the CPU just starts allowing those kernel-only mappings to work when running in ring 0. See this Q&A for a diagram of the x86-64 page-table entry format. The U/S bit (user/supervisor) is the one that controls whether a mapping is kernel-only or not.)
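
As a small illustration of that bit, here is a Python sketch that decodes the low flag bits of an x86-64 page-table entry. The bit positions (P = bit 0, R/W = bit 1, U/S = bit 2, NX = bit 63) are the architectural ones; the two example entries and the function names are hypothetical. The real permission check also looks at the U/S bit in the higher-level page-table entries, which this sketch ignores.

```python
PTE_PRESENT = 1 << 0    # P: mapping is valid
PTE_WRITABLE = 1 << 1   # R/W: writes allowed
PTE_USER = 1 << 2       # U/S: 1 = user-accessible, 0 = supervisor (kernel) only
PTE_NX = 1 << 63        # NX/XD: no instruction fetch from this page


def describe_pte(pte):
    """Decode the flag bits relevant to this discussion."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user": bool(pte & PTE_USER),
        "no_execute": bool(pte & PTE_NX),
    }


def user_load_allowed(pte):
    """Simplified check for a ring-3 load (last-level entry only)."""
    return bool(pte & PTE_PRESENT) and bool(pte & PTE_USER)


# Hypothetical entries, not taken from any real system.
kernel_pte = 0x8000000012345003   # P | R/W, U/S = 0, NX = 1
user_pte = 0x000000000ABCD007     # P | R/W | U/S

kernel_flags = describe_pte(kernel_pte)
```

Both entries are present and so can sit in the TLB; only the U/S bit tells the hardware that the first one must fault for user-space loads.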

The fault that's eventually raised is either handled (catch SIGSEGV), or suppressed (run the transient sequence inside a TSX (transactional memory) transaction, or put it in the shadow of a mispredicted branch so it only ever runs as a result of mis-speculation). So in Meltdown, branch mispredicts are only relevant as part of making the attack efficient and reliable by suppressing the fault on the load from a kernel virtual address. Speculative execution past a faulting load is key, though, even for dependent instructions. (Not just for independent instructions executed before a load-address is ready, or anything that simple.)

Presumably AMD's load execution units / TLB are designed differently, with privilege checking for loads applied earlier or differently. Either a load from a virtual address with a kernel-only mapping is treated the same as a load from an unmapped page, or the physical page bits used for speculative execution are set to all-zero or all-one or something. Or maybe it just squashes the load without triggering a page-walk (like a load from an unmapped page would).

Note that x86 doesn't require TLB invalidation when changing a page table entry from invalid to valid, so the address-translation hardware isn't allowed to do "negative" caching. Or if it does, it would have to be coherent with page table writes that make previously-invalid entries valid. Intel CPUs do a kind of TLB shootdown to maintain coherence for TLB entries that were only speculatively loaded, going beyond what the x86 manuals require to avoid breaking old code that existed before the current TLB-invalidation rules were published (e.g. Win95 through ME).

The point is, a different microarchitectural design choice could totally block the thing that Meltdown depends on. And such design choices are plausible on their own, not specifically to avoid Meltdown.

A related question: why does Intel's delayed permission check design choice make sense?

Until Meltdown / Spectre, CPU designers were only worried about making sure memory protection applied to the architectural state (non-speculative values in architectural registers, not physical registers used by out-of-order execution). i.e. this side-channel wasn't on anyone's radar. The results of instruction execution don't become architectural state until retirement, so that's the point when everything has to be correct (in pre-Meltdown thinking).

As a CPU designer, you want everything to execute as efficiently as possible, with as few special cases as possible. Or especially, with special cases in as few components as possible. It's simpler for the overall design if an execution unit can never stall, so that a pipelined section of logic doesn't need any flow-control and can simply always accept one new input per clock.

(Update: an alternative to stalling is to report failure. Load uops in Intel CPUs can already report failure back to the OoO scheduler and need to be replayed. This happens on L1d cache misses, on detection (during address calculation) of a cache-line split (to get data from the other cache line), or if the load port tried to use the 4c latency special case for simple addressing modes but the actual address was in a different page than the base register. Other uops dependent on a load can also need a replay, if they were dispatched in anticipation of the load producing its result in a certain cycle; cache misses cause that, too. So in theory under-privileged loads could use this mechanism to effectively not produce a value.)

It looks like Intel's current load execution units don't have a way to squash the load on a TLB hit. The TLB itself is a CAM (Content-Addressable Memory). Managing TLB entries with speculative (prefetch) page-walks involves some more active hardware, but the TLB itself has to be fast to support 3 lookups per clock (from ports 2, 3, and 7).

Most code doesn't page-fault by trying to load from kernel addresses, so optimizing for that case wasn't a consideration. If such loads are seen by out-of-order execution, they usually happen as a result of mis-speculation (running a load instruction with the wrong data in a register). (Not necessarily maliciously induced mis-speculation (Spectre), just the regular kind from imperfect branch prediction.) Not doing anything about a TLB-hit faulting load until retirement is a good design decision if most of the time it never reaches retirement because a branch mispredict or other mis-speculation is detected earlier in in-order retirement. Wasting load bandwidth and causing cache pollution is questionable, but (other than Meltdown attacks) this is probably a pretty rare case so keeping the hardware simple was the most valuable thing.

Usually page-faults happen because of access to an address which isn't mapped at all: the page table entry is marked Invalid, not just kernel-only. Or even a higher level of the 4-level nested page tables is invalid, e.g. the page-directory entry. As mentioned earlier, negative caching isn't architecturally allowed (and AFAIK isn't done with snooping for correctness either), so such PTEs will never appear in the TLB. A page-fault for an unmapped page will be raised only after a page-walk (which finds the mapping for that page doesn't exist). x86 has dedicated page-walk hardware so the page-table loads can happen in the background while the execution units run other uops. (Skylake even has two HW page-walk units.)

So a page-fault from trying to read a kernel-only mapping is a special case, very different microarchitecturally from the unmapped-page case. (It's actually similar to trying to store into a read-only mapping, which presumably also has delayed faulting. Stores don't become globally visible right away; the store buffer makes speculative store execution possible by keeping them private until after retirement, at which point they can commit to L1D).
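
The difference between those two fault cases can be sketched with a toy MMU in Python. Everything here is an assumption made for illustration (the `ToyMMU` class, the page numbers, the outcome strings); it models only the rule that valid translations get cached in the TLB while invalid ones never do.

```python
class ToyMMU:
    """Toy model: a TLB that caches only valid translations (no negative caching)."""

    def __init__(self, page_table):
        self.page_table = page_table   # vpn -> (valid, user_ok, pfn)
        self.tlb = {}                  # vpn -> (user_ok, pfn)
        self.page_walks = 0

    def user_load(self, vpn):
        """Return (outcome, walked) for a user-mode load from virtual page vpn."""
        walked = False
        if vpn not in self.tlb:
            walked = True
            self.page_walks += 1
            valid, user_ok, pfn = self.page_table.get(vpn, (False, False, None))
            if not valid:
                # Invalid entries are never cached, so the next access walks again.
                return ("not-present-fault", walked)
            self.tlb[vpn] = (user_ok, pfn)
        user_ok, _pfn = self.tlb[vpn]
        return ("ok" if user_ok else "privilege-fault", walked)


# Hypothetical page numbers: 0x10 is a user page, 0x20 is kernel-only.
mmu = ToyMMU({0x10: (True, True, 0xAAA), 0x20: (True, False, 0xBBB)})

first_kernel = mmu.user_load(0x20)    # page walk, then privilege fault
second_kernel = mmu.user_load(0x20)   # TLB hit: privilege fault with no walk
unmapped_1 = mmu.user_load(0x30)      # page walk finds nothing
unmapped_2 = mmu.user_load(0x30)      # walks again: no negative caching

mmu.page_table[0x30] = (True, True, 0xCCC)   # OS maps the page, no TLB invalidation
after_map = mmu.user_load(0x30)              # works because nothing stale was cached
```

In this model the kernel-only page is the only case where a fault is known on a TLB hit, which is the case the Meltdown discussion here is about. The unmapped page always costs a page walk before its fault.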

How could Intel fix Meltdown in future hardware?

Fixing Meltdown is relatively easy (compared to Spectre), although it probably can't be done with a microcode update. As well as setting a fault-if/when-this-reaches-retirement bit on the uop, a TLB lookup could gate the page-address bits (to all ones) with the privilege-check. e.g. a load in user-space from any kernel page could micro-architecturally execute as a load from the very top physical page. (And systems with less than the max amount of RAM wouldn't have any physical RAM at that physical address.)

Or a failed privilege check could maybe still allow the load to happen microarchitecturally, but mask the result to all-zero in the load port. (Remember, the Meltdown problem isn't that an unprivileged load can bring kernel data into cache, it's that the secret data load result can be used to make another load with a data-dependent address. Continuing speculative execution with a zero result for any under-privileged load that hits in the TLB wouldn't allow any data-dependent microarchitectural effects).
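
A toy Python model of that masking idea, under the assumption that the load port either forwards the real value or forwards zero when the privilege check fails. The function names and the `mask_on_fault` switch are hypothetical; the point is only to show that with masking, the address of the dependent load no longer varies with the secret.

```python
PAGE_SHIFT = 12


def load_result(value, permitted, mask_on_fault):
    """Value a toy load port forwards to dependent uops."""
    if permitted or not mask_on_fault:
        # Without masking, a should-fault load still forwards the real data
        # (the behavior this answer attributes to affected Intel CPUs).
        return value
    # Hypothetical fix: forward zero now, raise the fault at retirement.
    return 0


def dependent_probe_offset(value, permitted, mask_on_fault):
    """Offset touched by a later load whose address depends on the result."""
    return load_result(value, permitted, mask_on_fault) << PAGE_SHIFT


# Offsets touched for every possible secret byte, for an under-privileged load.
leaky = {dependent_probe_offset(s, False, False) for s in range(256)}
masked = {dependent_probe_offset(s, False, True) for s in range(256)}

# A permitted load is unaffected by the masking logic.
normal = dependent_probe_offset(7, True, True)
```

Here `leaky` contains 256 distinct offsets, one per possible secret, while `masked` collapses to the single offset 0, so the cache state after the transient sequence carries no information about the byte.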

The TLB lookup happens in parallel with indexing the VIPT L1D cache, but the TLB result is needed for the tag-check part of an L1D cache load. So the TLB lookup result is already needed to select the right way from the set indexed by address bits 6 to 11 (L1D is 8-way set associative). Requiring the TLB permission check to be ready this early therefore shouldn't introduce any extra latency. Masking with the permission result would introduce one extra gate-delay, but one clock cycle has time for many gate delays. (e.g. 64-bit add latency is only 1 cycle, 64-bit imul latency is only 3 cycles on Intel Sandybridge-family. http://agner.org/optimize/).
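
The arithmetic behind "indexed by address bits 6 to 11" can be checked with a few lines of Python. This assumes the geometry described above (64-byte lines, 64 sets, 8 ways, 4 KiB pages); the two example addresses are hypothetical.

```python
LINE_BITS = 6    # 64-byte lines: bits 0..5 are the offset within a line
SET_BITS = 6     # bits 6..11 select one of 64 sets
WAYS = 8         # 8-way set associative
PAGE_BITS = 12   # 4 KiB pages: bits 0..11 are untranslated


def l1d_set_index(addr):
    """Set index taken from address bits 6..11."""
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)


capacity = (1 << LINE_BITS) * (1 << SET_BITS) * WAYS   # bytes

# A virtual address and a physical address it could map to share the low
# 12 bits, because translation only replaces the bits above the page offset.
virt = 0x7FFFDEADB6C0
phys = 0x0000012346C0

virt_set = l1d_set_index(virt)
phys_set = l1d_set_index(phys)
```

Because `LINE_BITS + SET_BITS` equals `PAGE_BITS`, the index comes entirely from the page offset, so the set can be selected from the virtual address while the TLB lookup is still in flight. Only the way selection (tag compare) has to wait for the translation, and that is the point where the permission bits would also be available.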

You could even design it so a load that would fault (if it reaches retirement) doesn't complete execution at all, and no load result is forwarded to dependent instructions. Maybe even squash it so it doesn't send a memory request down the hierarchy towards outer caches if it wasn't in L1D cache. (i.e. don't even check L1D for it). This might be a more tricky design.

(But some CPUs do work that way, according to Henry Wong's tests: e.g. AMD CPUs don't produce a result at all for Meltdown tests, vs. some non-vulnerable CPUs like Via Nano producing zero or Pentium Pro producing some random internal value.)

This design could make it impossible(?) for a user-space process to even detect whether a kernel address was mapped or not by any kind of timing attack, because the micro-architectural effect of a load to a kernel-only mapping would be the same as to an unmapped page. (But actually only the same if it also triggered a page-walk. An under-privileged TLB hit wouldn't trigger a page walk, and you could probably detect that directly.)

This might be valuable to stop processes defeating KASLR, if the kernel uses anything smaller than 1G hugepages to map any of its own memory.

Further reading about CPU internals:

Modern Microprocessors A 90-Minute Guide!. Why out-of-order (and speculative) execution is a thing in the first place, and the difference between CPUs with / without it.
David Kanter's Haswell microarchitecture write-up
What Every Programmer Should Know About Memory by Ulrich Drepper. And a 2017 review on minor changes since it was published in 2007.
Agner Fog's optimization guide and microarchitecture guide

Meltdown-specific details:

The Microarchitecture Behind Meltdown - Henry Wong's detailed tests.
Encouraging the CPU to perform out of order execution for a Meltdown test - I think data is only vulnerable to Meltdown when something that is allowed to read it has brought it into L1d. (Directly or via HW prefetch.)
https://stackoverflow.com/questions/50191769/does-the-meltdown-mitigation-in-combination-with-callocs-cow-lazy-allocati - speculative execution into kernel code from user-space isn't possible. i.e. a page-fault handler can't speculatively execute, because CPUs don't rename the privilege level. See that for more about how CPUs take exceptions.
Out-of-order execution vs. speculative execution

This could probably use a re-edit to put some of the thoughts / statements into a more coherent order. Sorry it's a bit messy. — Peter Cordes, Jan 08 '18 at 12:14
This question might just be the result of me not actually reading your answer, but is it possible AMD is just marketing its CPUs as hard-to-exploit for Spectre when they actually aren't? Both seem to be x86_64, so I'm unable to understand where the difference is arising from. — zombiesauce, Dec 17 '21 at 10:52
@zombiesauce: x86_64 is the architecture. Their internal *micro*-architecture is different from Intel's, although there are broad similarities. http://www.lighterra.com/papers/modernmicroprocessors/. (Although Zen1 used a perceptron-style branch predictor, unlike the IT-TAGE branch prediction Intel has used since Haswell, but I think Zen2 also uses an IT-TAGE predictor). It turns out that AMD CPUs aren't immune to Spectre; they are affected by many variants of it. https://en.wikipedia.org/wiki/Spectre_(security_vulnerability). — Peter Cordes, Dec 17 '21 at 11:11
But Meltdown is a much more specific attack. It's not at all surprising that the details of how out-of-order exec works for uops dependent on a should-fault load would be different. Same for MDS attacks in general, like Intel L1TF and others vs. AMD's recently discovered https://www.extremetech.com/computing/326558-all-amd-cpus-found-harboring-meltdown-like-security-flaw — Peter Cordes, Dec 17 '21 at 11:12
