4

Recently, many motherboards supporting Skylake or Kaby Lake received a UEFI update described as a CPU microcode security update for a specific Intel erratum, which Intel describes as:

Short Loops Which Use AH/BH/CH/DH Registers May Cause Unpredictable System Behavior.

Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH) may cause unpredictable system behavior. This can only happen when both logical processors on the same physical processor are active.

Intel frequently publishes CPU errata that can cause denial of service, but motherboard manufacturers have not released a dedicated UEFI update for each of those.
Of course, I tried the following code on several logical cores, and it doesn't crash anything (I don't understand whether all 8 registers must be involved to trigger the bug or whether one of them is enough):

48 ba ff 00 00 00 04    movabs $0x4000000ff,%rdx
00 00 00
.L5:
48 89 d0                movq    %rdx,%rax
48 2d fe 00 00 00       subq    $0xfe,%rax
08 f4                   orb     %dh,%ah
48 89 c3                movq    %rax,%rbx
48 81 eb fe 00 00 00    subq    $0xfe,%rbx
08 e7                   orb     %ah,%bh
48 89 d9                movq    %rbx,%rcx
48 81 e9 fe 00 00 00    subq    $0xfe,%rcx
08 fd                   orb     %bh,%ch
48 89 ca                movq    %rcx,%rdx
48 81 ea fe 00 00 00    subq    $0xfe,%rdx
08 ee                   orb     %ch,%dh
48 85 cb                test   %rcx,%rbx
75 cc                   jne    .L5
movq    %rcx, %rdx
movq    %rbx, %rax
movq    %rax, %rsi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

So does "unpredictable system behavior" imply remote code execution (e.g. because such loops would propagate register changes to the other thread running on the same core)?

Also, what kind of loops can trigger the bug? Does simply modifying some of the involved registers within fewer than 64 instructions trigger it? Do the loops need to be different (I mean, must the two threads run different code)?
At the very least, is there example OCaml code that can trigger the bug?
How can I tell whether a vulnerable microcode is in use while running qemu-kvm? (qemu -cpu host hides the microcode revision number.)

user2284570
  • It is also worth noting that I have an affected system, so I can test any code path that might trigger the bug… – user2284570 Oct 08 '17 at 23:05
  • Intel won't disclose this kind of information. You'll have to just make your own tests. – Overmind Oct 09 '17 at 05:03
  • @Overmind in fact, a lot of people discovered the bug before Intel® because it caused application crashes in OCaml https://lists.debian.org/debian-devel/2017/06/msg00308.html. So although I couldn't find it, **I think the code path is well known**… – user2284570 Oct 09 '17 at 11:15

3 Answers

3

The OCaml bug tracker still has the original bug report: https://caml.inria.fr/mantis/view.php?id=7452

Using the same OCaml version (4.03) and the steps to reproduce from the report, namely:

    while ocamlfind opt -c -g -bin-annot -ccopt -g -ccopt -O2 -ccopt -Wextra -ccopt '-Wstrict-overflow=5' -thread -w +a-4-40..42-44-45-48-58 -w -27-32 -package extprot test.ml -o test.cmx; do echo "ok"; done

the report states that the bug can be reproduced within about 30 minutes on an unpatched machine.
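For anyone who prefers a bounded run over an open-ended shell loop, the same "repeat until it fails" idea can be sketched in Python. This is only an illustration: the function name `run_until_failure` and its parameters are made up here, and the command to repeat is whatever the bug report gives for your setup.

```python
import subprocess
import time


def run_until_failure(cmd, budget_seconds, max_runs=None):
    """Run cmd repeatedly within a time budget.

    Returns the 1-based index of the first run that exits with a
    non-zero status (including death by signal), or None if every
    run succeeded before the budget or run limit was reached.
    """
    deadline = time.monotonic() + budget_seconds
    runs = 0
    while time.monotonic() < deadline:
        if max_runs is not None and runs >= max_runs:
            break
        runs += 1
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return runs
    return None
```

With a 30 minute budget this would be called as `run_until_failure(cmd, budget_seconds=30 * 60)`, where `cmd` is the argument list of the compile command. A failure only says that the command exited abnormally; it does not by itself prove the erratum was hit.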

I could not find any internal Intel information concerning the erratum. I assume it is kept secret due to the sensitive nature of the issue.

The Debian mailing list post (https://lists.debian.org/debian-devel/2017/06/msg00308.html) has more detail on the issue itself and on the types of processors affected. This might be relevant to your research.

  • It is impossible to install the extprot library with the latest opam packages because of hundreds of compiler errors (at least on Debian jessie and Fedora 26), so I can't use the example from the OCaml bug report. The erratum isn't secret: https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf – user2284570 Oct 31 '17 at 09:04
1

You asked a lot of things that are explicitly answered in the Debian report and in the Intel documentation. Anyway, for the erratum to trigger, both hardware threads must be running a tight loop (one that meets the conditions to activate the loop stream detector in both threads at the same time) that touches those registers and hits other, unknown internal details of the microprocessor.
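Since the condition involves both hardware threads of one physical core, any experiment first has to know which logical CPUs are siblings. On Linux this can be read from sysfs; the sketch below assumes the usual `thread_siblings_list` file and its comma/range list format, and the helper names are illustrative only.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-1,8-9" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def sibling_threads(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return the logical CPUs that share a physical core with `cpu` (Linux only)."""
    path = os.path.join(sysfs_root, f"cpu{cpu}", "topology", "thread_siblings_list")
    with open(path) as handle:
        return parse_cpu_list(handle.read())
```

Two worker processes could then each call `os.sched_setaffinity(0, {cpu})` with one of the sibling numbers so that the test loops really do share a core. Pinning alone does not guarantee the other micro-architectural conditions are met.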

One of the OCaml people publicly reported a single case of page table corruption, so on this single piece of evidence we should not rule out that the erratum could cause damage that escapes the process context and causes unpredictable behavior in an unrelated process.

But statistically relevant testing would have to be done to know for sure.

Triggering the bug is difficult, but the OCaml garbage collector manages to do it relatively easily (though not "on demand"). Look for the Hacker News coverage for details. Basically, it is most likely the loop stream detector (a power management optimization) that triggers the erratum, and getting it to run on both threads in the way needed to trigger the erratum on purpose isn't trivial. Nobody has published anything about that yet.

The OCaml garbage collector manages to do it often enough, and it is the only known reproducer. Some very good security researchers were reportedly interested, but so far nothing has surfaced.

Meanwhile, patch that microcode. We cannot be really sure it is not security-exploitable on demand at this time, but even if we were sure it was not exploitable for privilege escalation, it would still be a pocket of unpredictable nastiness.
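To see which microcode revision a Linux system is currently running, the `microcode` field of `/proc/cpuinfo` can be read together with the family, model and stepping. The sketch below only parses that text; the function name is made up, and the revision that carries the fix for a given processor signature is deliberately not hard-coded here because it has to be taken from the Debian post or from Intel's documentation.

```python
def cpu_signatures(cpuinfo_text):
    """Return one dict per logical CPU with family, model, stepping and microcode."""
    wanted = {
        "cpu family": "family",
        "model": "model",
        "stepping": "stepping",
        "microcode": "microcode",
    }
    cpus, current = [], {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            # A blank line ends the block describing one logical CPU.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in wanted:
            # base 0 accepts both decimal ("94") and hex ("0xba") values.
            current[wanted[key]] = int(value.strip(), 0)
    if current:
        cpus.append(current)
    return cpus
```

Typical use is `cpu_signatures(open("/proc/cpuinfo").read())`. As the question points out, inside a qemu-kvm guest the reported revision may not reflect what the host is running, so the check is only meaningful on the host itself.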

As for Linux distros, even the most conservative ones have already issued updated packages, since there has been enough testing and no real issues have surfaced with this round of updates [that were not present in previous updates, anyway].

As for motherboard vendors, by now you should know which ones you should avoid in the future.

anonymous
  • `One of the OCAML people publicly reported a single case of page table corruption` **Please link the reference…** I mean, a link to the stated report… – user2284570 Oct 10 '17 at 13:42
  • Sorry, I don't have it on hand right now. Two of the researchers published details on their blogs; it is one of them. Read the last post in the Debian ML thread, the one with the final update; it should be in the updated references. – anonymous Oct 10 '17 at 13:47
  • I read the Debian mailing list and couldn’t find that info… – user2284570 Oct 10 '17 at 13:48
  • At least, do you have example OCaml code that can trigger the bug? – user2284570 Oct 29 '17 at 18:23
0

From this article, a switch case seemed to cause the issue:

# The main loop condition
.L108:
    .loc 3 542 0
    testq   %r13, %r13
    jg  .L111
[...]

the if condition checking to load next memory chunk on chunk boundary

.L103:
    .loc 3 567 0
    movq    chunk(%rip), %rax
    movq    -8(%rax), %rax
    movq    %rax, chunk(%rip)
    .loc 3 568 0

the exit condition. This code is only taken once at exit and gcc inserted a shortcut jump to function exit

    testq   %rax, %rax
    je  .L115
    .loc 3 575 0
    movq    %rax, caml_gc_sweep_hp(%rip)
    .loc 3 576 0
    addq    -16(%rax), %rax
    movq    %rax, limit(%rip)
.L111:
    .loc 3 543 0

The if branch. This is likely the direct effect of value range propagation, jumping directly inside the loop instead of running the check again

    movq    caml_gc_sweep_hp(%rip), %rbx
    cmpq    limit(%rip), %rbx
    jnb .L103
    .loc 3 545 0

The switch entrypoint decoding header word

    movq    (%rbx), %rax
    .loc 3 546 0
    movq    %rax, %rdx
    shrq    $10, %rdx
    movq    %rdx, %r13
    notq    %r13
    addq    %r12, %r13
    movq    %r13, %r12
    .loc 3 547 0
    leaq    8(%rbx,%rdx,8), %rdx
    movq    %rdx, caml_gc_sweep_hp(%rip)
    .loc 3 548 0

the branch handling White color, aka unreachable blocks (left out of this sample)

    movq    %rax, %rdx
    andl    $768, %edx
    je  .L105

the branch handling Blue color (left out of this sample)

    cmpq    $512, %rdx
    je  .L106

The default branch aka reachable blocks

    .loc 3 562 0

update of the color part of the header

    andb    $252, %ah
    movq    %rax, (%rbx)
    .loc 3 563 0
    jmp .L108 # jumping back to the loop condition to scan next block
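The header arithmetic in the listing above can be modelled in a few lines of Python, which also makes it clear why the compiler ends up touching both `%ah` and `%rax` in the same short loop. This is a sketch derived only from the instructions shown; the function and constant names are made up here, and reading 0 as White and 512 as Blue follows the annotations in the listing.

```python
COLOR_MASK = 768  # 0x300, from `andl $768, %edx` (bits 8 and 9 of the header)
WHITE = 0         # the `je .L105` case right after the mask
BLUE = 512        # 0x200, from `cmpq $512, %rdx`


def wosize(header):
    """Block size in words: `shrq $10, %rdx`."""
    return header >> 10


def color(header):
    """Color bits of the header: `andl $768, %edx`."""
    return header & COLOR_MASK


def clear_color(header):
    """Model `andb $252, %ah`.

    AH is bits 8..15 of RAX, so the instruction ANDs only that byte
    with 0xFC and leaves every other bit of the 64-bit word untouched.
    """
    ah = (header >> 8) & 0xFF
    return (header & ~0xFF00) | ((ah & 252) << 8)


def next_header_address(hp, header):
    """Address of the next header: `leaq 8(%rbx,%rdx,8), %rdx`."""
    return hp + 8 + 8 * wosize(header)
```

For example, a header with size 3, both color bits set and tag 0x12 is 3858; `clear_color(3858)` gives 3090, which is the same word with the color field reset to 0. The single-byte `andb` on `%ah` followed by a full-width store of `%rax` is exactly the mix of high-byte and wide register use that the erratum text mentions.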

Specifically, this block of code caused the issue:

.L111:
   .loc 3 562 0
   movq    -16(%rbp), %rax
   andb    $252, %ah
   movq    %rax, %rdx
   movq    -8(%rbp), %rax
   movq    %rdx, (%rax)
   .loc 3 563 0
   nop
   jmp     .L102
jtillman
  • Thank you. But you misunderstood the article: the code that your answer says caused the issue is described there as the unoptimized version, so it can't be `.L111`. – user2284570 Nov 04 '17 at 11:02