21

I had recently thought of an extreme way of implementing security by obscurity and wanted to ask you guys if it's possible.

Would a person with no access to special processor documentation be able to change the CPU's microcode in order to obfuscate the machine's instruction set?

What else would need to be changed for a machine to boot with such a processor - would BIOS manipulation be enough?

Rory Alsop
  • 61,367
  • 12
  • 115
  • 320
d33tah
  • 6,524
  • 8
  • 38
  • 60
  • 1
    Related: http://stackoverflow.com/questions/4366837/what-is-intel-microcode || http://stackoverflow.com/questions/5806589/why-does-intel-hide-internal-risc-core-in-their-processors – Ciro Santilli OurBigBook.com Jun 19 '15 at 20:15
  • @d33tah On Intel the microcode is entirely RSA encrypted with the key built-in in transistors. – user2284570 Aug 08 '15 at 15:21
  • @user2284570 Use a microscope to find it. – Suici Doga Sep 04 '16 at 04:22
  • 2
    Many of these answers seem to be conflating microcode with micro-ops, such as the one mentioning RISC architectures. Microcode is a lookup table for modifying the behavior of instructions. Micro-ops are the mini-instructions that make up larger instructions and which is the main difference between RISC and CISC, as with RISC, you tend to work closer to the actual hardware, and the instructions you use are closer to micro-ops. – forgetful Oct 19 '17 at 12:03
  • Might have some luck with AMD processors https://hackaday.com/2017/12/28/34c3-hacking-into-a-cpus-microcode/ – Steven Spark Oct 06 '20 at 21:01

4 Answers

25

Although modern x86 processors allow for runtime microcode upload, the format is model-specific, undocumented, and controlled by checksums and possibly signatures. Also, the scope of microcode is somewhat limited nowadays, because most instructions are hardwired. See this answer for some details. Modern operating systems upload microcode blocks upon boot, but these blocks are provided by the CPU vendors themselves for bugfixing purposes.

(Note that microcode which is uploaded is kept in an internal dedicated RAM block, which is not Flash or EEPROM; it is lost when power is cut.)


Update: there seem to be some misconceptions and terminology confusion about what microcode is and what it can do, so here are some longer explanations.

In the days of the first microprocessors, transistors were expensive: they used a lot of silicon area, which is the scarce resource in chip foundries (the larger a chip is, the higher the failure rate, because each dust particle at the wrong place makes the whole chip inoperative). So the chip designers had to resort to many tricks, one of them being microcode. The architecture of a chip of that era would look like this:

Z80 architecture

(this image was shamelessly plundered from this site). The CPU is segmented into many individual units linked together through data buses. Let's see what a fictional "add B, C" instruction would entail (addition of the contents of register B and register C, result to be stored back in B):

  1. The register bank must put the contents of the B register on the internal data bus. At the end of the same cycle, the "TEMP" storage unit should read the value from the data bus and store it.
  2. The register bank must put the contents of the C register on the internal data bus. At the end of the same cycle, the "A" storage unit should read the value from the data bus and store it.
  3. The Arithmetic and Logic Unit (ALU) should read its two inputs (which are TEMP and A) and compute an addition. The result will be available on its output, and thus on the bus, at the next cycle.
  4. The register bank must read the byte on the internal data bus, and store it into the B register.

The whole process would take a whopping four clock cycles. Each unit in the CPU must receive its specific orders in due sequence. The control unit, which dispatches the activation signals to each CPU unit, must "know" all the sequences for all the instructions. This is where microcode intervenes. Microcode is a representation, as bit words, of the elementary steps in this process. Each CPU unit would have a few reserved bits in each microcode word. For instance, bits 0 to 3 in each word would be for the register bank, encoding which register is to be operated on and whether the operation is a read or a write; bits 4 to 6 would be for the ALU, telling it which arithmetic or logic operation it must perform.

With microcode, the control logic becomes a rather simple circuit: it consists of a pointer in the microcode (which is a ROM block); at each cycle, the control unit reads the next microcode word, and sends to each CPU unit, on dedicated wires, its microcode bits. The instruction decoder is then a map from opcodes (the "machine code instructions" which the programmer sees, and are stored in RAM) into offsets in the microcode block: the decoder sets the microcode pointer to the first microcode word for the sequence which implements the opcode.
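The sequencer described above can be sketched in a few lines of Python. This is a simplified variant of the bit-field idea (each word here packs a source selector and a destination selector rather than one field per unit), and every encoding — field layout, unit codes, the opcode value — is invented for illustration; no real CPU uses them:

```python
# Toy microcode sequencer for the fictional "add B, C" above. Each
# microcode word packs two 4-bit fields: bits 0-3 name the unit that
# drives the internal data bus this cycle, bits 4-7 name the unit that
# latches from it. All codes and the opcode value are invented.

REG_B, REG_C, TEMP, ACC, ALU_OUT, NONE = 1, 2, 3, 4, 5, 0

def word(src, dst):
    return src | (dst << 4)

MICROCODE = {
    0x80: [                    # opcode 0x80 -> this sequence in the "ROM"
        word(REG_B, TEMP),     # cycle 1: B -> bus -> TEMP
        word(REG_C, ACC),      # cycle 2: C -> bus -> A
        word(NONE, NONE),      # cycle 3: ALU computes TEMP + A
        word(ALU_OUT, REG_B),  # cycle 4: ALU output -> bus -> B
    ],
}

def run(opcode, state):
    """The control unit: step through the microcode, decode each word's
    fields, and route one value over the single internal bus per cycle."""
    for w in MICROCODE[opcode]:
        src, dst = w & 0xF, w >> 4
        state[ALU_OUT] = (state[TEMP] + state[ACC]) & 0xFF  # 8-bit adder
        bus = state.get(src, 0)
        if dst != NONE:
            state[dst] = bus

state = {REG_B: 5, REG_C: 7, TEMP: 0, ACC: 0, ALU_OUT: 0}
run(0x80, state)
print(state[REG_B])  # 12
```

Four microcode words, four cycles: exactly the per-instruction overhead described above, with the control logic reduced to a pointer walking a table.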

One description of this process is that the CPU really processes microcode; and the microcode implements an emulator for the actual opcodes which the programmer thinks of as "machine code".

ROM is compact: each ROM bit takes about the same size as, or even slightly less than, one transistor. This allowed CPU designers to store a lot of complex distinct behaviours in a small silicon space. Thus, the highly venerable Motorola 68000 CPU, core processor of the Atari ST, Amiga and Sega Megadrive, could fit in about 40000 transistor-equivalents, about a third of which consisted of microcode; in that tiny area, it could host fifteen 32-bit registers and implement the whole paraphernalia of addressing modes for which it was famous. The opcodes were reasonably compact (thus saving RAM); microcode words were larger, but invisible from the outside.


All this changed with the advent of the RISC processors. RISC comes from the realization that while microcode allows for opcodes with complex behaviour, it also implies a lot of overhead in instruction decoding. As we saw above, a simple addition would take several clock cycles. On the other hand, programmers of that time (late 1980s) would increasingly shun assembly, preferring the use of compilers. A compiler translates some programming language into a sequence of opcodes. It so happens that compilers use relatively simple opcodes; opcodes with complex behaviour are hard to integrate into the logic of a compiler. So the net result was that microcode implied overhead, thus execution inefficiency, for the sake of complex opcodes which the programmers did not use!

RISC is, simply put, the suppression of the microcode-in-CPU. The opcodes which the programmer (or the compiler) sees are the microcode, or close enough. This means that RISC opcodes are larger (typically 32 bits per opcode, as in the original ARM, Sparc, Mips, Alpha and PowerPC processors) with a more regular encoding. A RISC CPU can then process one instruction per cycle. Of course, instructions do fewer things than their CISC counterparts ("CISC" is what non-RISC processors do, like the 68000).

Therefore, if you want to program in microcode, use a RISC processor. In a true RISC processor, there is no microcode stricto sensu; there are opcodes which are translated with a 1-to-1 correspondence into the activation bits for all the CPU units. This gives the compiler more options to optimize code, while saving space in the CPU. The first ARM used only 30000 transistors, less than the 68000, while providing substantially more computing power for the same clock frequency. The price to pay was larger code, but RAM was increasingly cheaper at that time (that's when computer RAM size began to be counted in megabytes instead of mere kilobytes).
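The near-1-to-1 decoding that makes RISC cheap can be illustrated with a toy fixed-format decoder: with a regular 32-bit instruction word, the control fields fall out with simple shifts and masks. The field layout below is loosely MIPS-like but invented here, not taken from any real ISA document:

```python
# Toy fixed-format RISC decode: every field is at a constant bit position,
# so "decoding" is just masking. The layout is invented (MIPS-flavoured).

def decode(insn_word):
    return {
        "opcode": (insn_word >> 26) & 0x3F,  # bits 31-26: operation class
        "rs":     (insn_word >> 21) & 0x1F,  # bits 25-21: first source reg
        "rt":     (insn_word >> 16) & 0x1F,  # bits 20-16: second source reg
        "rd":     (insn_word >> 11) & 0x1F,  # bits 15-11: destination reg
        "funct":  insn_word & 0x3F,          # bits 5-0: ALU function select
    }

# A hypothetical "add r3, r1, r2" in this invented encoding:
insn = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0b100000
print(decode(insn))  # the fields drive the register file and ALU directly
```

No table lookup, no multi-word sequence: each decoded field can be wired straight to the unit it controls, which is why such a CPU can retire one instruction per cycle.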


Then things changed again by becoming more confused. RISC did not kill off the CISC processors. It turned out that backward compatibility is an extremely strong force in the computing industry. This is why modern x86 processors (like Intel i7 or even newer) are still able to run code designed for the 8086 of the late 1970s. So x86 processors have to implement opcodes with complex behaviours. The result is that modern processors have an instruction decoder which segregates opcodes into two categories:

  • The usual, simple opcodes which compilers use are executed RISC-like, "hardwired" into fixed behaviours. Additions, multiplications, memory accesses, control flow opcodes... are all handled that way.
  • The unusual, complex opcodes kept around for compatibility are interpreted with microcode, which is limited to a subset of the units in the CPU so as not to interfere and induce latency in the processing of the simple opcodes. An example of a microcoded instruction in a modern x86 is fsin, which computes the sine function on a floating-point operand.

Since transistors have shrunk a lot (a quad-core i7 from 2008 uses 731 million transistors), it became quite tolerable to replace the ROM block for microcode with a RAM block. That RAM block is still internal to the CPU, inaccessible from user code, but it can be updated. After all, microcode is a kind of software, so it has bugs. CPU vendors publish updates for the microcode of their CPUs. Such an update can be uploaded by the operating system using some specific opcodes (this requires kernel-level privileges). Since we are talking about RAM, the update is not permanent, and must be performed again after each boot.
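As a small aside, on Linux you can see which microcode revision has been loaded by looking at the `microcode` field of /proc/cpuinfo. Here is a sketch that parses a captured sample (the sample values are made up) so it runs anywhere; on a real machine you would read the file itself:

```python
# Extract the loaded microcode revision from /proc/cpuinfo-style text.
# SAMPLE is a fabricated snippet for illustration; on a real Linux box,
# replace it with open("/proc/cpuinfo").read().

SAMPLE = """\
processor\t: 0
vendor_id\t: GenuineIntel
model name\t: Intel(R) Core(TM) i7 CPU
microcode\t: 0xea
cpu MHz\t\t: 2800.000
"""

def microcode_revision(cpuinfo_text):
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "microcode":
            return int(value.strip(), 16)  # revision is reported in hex
    return None                            # field absent (e.g. non-x86)

print(hex(microcode_revision(SAMPLE)))  # 0xea
```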

The contents of these microcode updates are not documented at all; they are very specific to the exact CPU model, and there is no standard. Moreover, there are checksums, which are believed to be MACs or possibly even digital signatures: vendors want to keep tight control of what enters the microcode area. It is conceivable that maliciously crafted microcode could damage the CPU by triggering "short circuits" within it.


Summary: microcode is not as awesome as it is often cracked up to be. Right now, microcode hacking is a closed area; CPU vendors reserve it for themselves. But even if you could write your own microcode, you would probably be disappointed: in modern CPUs, microcode impacts only peripheral units of the CPU.

As for the initial question, an "obscure opcode behaviour" implemented in microcode would not be practically different from a custom virtual machine emulator, like what @Christian links to. It would be "security through obscurity" at its finest, i.e. not very fine. Such things are vulnerable to reverse engineering.

If the fabled microcode could implement a complete decryption engine with a tamper resistant storage area for keys, then you could have a really robust anti-reverse-engineering solution. But microcode cannot do that. This requires some more hardware. The Cell CPU can do that; it has been used in the Sony PS3 (Sony botched it in other areas, though -- the CPU is not alone in the system and cannot ensure total security by itself).

Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
  • I agree it would be susceptible to reverse engineering, but it would also make it harder to exploit such a system - payloads wouldn't work because the attacker wouldn't know the instruction set. And it would be faster than a virtual machine solution. – d33tah Jan 26 '13 at 19:58
21

You're very much getting into the realm of "Here be dragons" when you look into hardware manipulation like this. I don't know of any research or in-the-wild attack that has done any practical experimentation with this, so my answer will be purely academic.

First, it's probably best if I explain a bit about how microcode works. If you're already clued up on this stuff, feel free to skip ahead, but I'd rather include the details for those who don't know.

A microprocessor consists of a huge array of transistors on a silicon die that interconnect in a way that provides a set of useful basic functions. These transistors alter their states based on internal changes in voltage, or on transitions between voltage levels. These transitions are triggered by a clock signal, which is a square wave that switches between high and low voltage at a high frequency - this is where we get "speed" measurements for CPUs, e.g. 2GHz. Every time the clock signal switches between low and high voltage, a single internal change is made. This is called a clock tick. In the simplest devices a single clock tick might constitute a whole programmed operation, but such devices are extremely limited in terms of what they're capable of doing.

As processors have gotten more complex, the amount of work that needs to be done at the hardware level to provide even the most basic operations (e.g. an addition of two 32-bit integers) has increased. A single native assembly instruction (e.g. add eax, ebx) might involve quite a lot of internal work, and microcode is what defines that work. Each clock tick performs a single microcode instruction, and a single native instruction might involve hundreds of microcode instructions.

Let's look at an extremely simplistic version of a memory read, for the instruction mov eax, [01234000], i.e. move a 32-bit integer from memory at address 01234000 into an internal register. First, the processor has to read the instruction from its internal instruction cache, which is a complicated task in itself. Let's ignore this for now, but it involves a lot of various operations inside the control unit (CU) that parse the instruction and prime various other internal units.

Once the control unit has parsed the instruction, it then has to execute a group of microinstructions to perform the operation. First, it needs to check that the system memory pipeline is ready for a new instruction (remember that memory chips take commands too) so that it can do a read. Next, it needs to send a read command to the pipeline and wait for it to be serviced. DDR is asynchronous, so it must wait for an interrupt to say that the operation has completed. Once the interrupt is raised, the CPU continues with the instruction.

The next operation is to move the new value from memory into an internal register. This isn't as simple as it sounds - the registers you would normally recognise (eax, ebx, ecx, edx, ebp, etc.) aren't fixed to a particular physical set of transistors in the chip. In fact, a CPU has a lot more physical internal registers than it exposes, and it uses a technique called register renaming to optimise the translation of incoming, outgoing and processed data. So the actual data from the memory bus has to be moved into a physical register, then that register has to be mapped to an exposed register name. In this case we'd be mapping it to eax.
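The register-renaming step above can be sketched as a toy rename table. Real rename hardware also tracks in-flight instructions and reclaims registers at retirement; the pool size and API here are invented purely to show the mapping idea:

```python
# Toy register renaming: architectural names like "eax" are just labels
# over a larger pool of physical registers. Each write claims a fresh
# physical register and retires the old mapping.

class RenameFile:
    def __init__(self, n_physical=16):
        self.physical = [0] * n_physical     # the real storage cells
        self.free = list(range(n_physical))  # unallocated physical regs
        self.map = {}                        # architectural -> physical

    def write(self, arch_reg, value):
        """A new value lands in a fresh physical register; the previous
        one backing this architectural name is returned to the pool."""
        new = self.free.pop(0)
        if arch_reg in self.map:
            self.free.append(self.map[arch_reg])
        self.map[arch_reg] = new
        self.physical[new] = value

    def read(self, arch_reg):
        return self.physical[self.map[arch_reg]]

rf = RenameFile()
rf.write("eax", 0x1234)    # e.g. the value just fetched from memory
rf.write("eax", 0x5678)    # a later write uses a different physical reg
print(hex(rf.read("eax")))  # 0x5678
```

Because consecutive writes to "eax" live in different physical registers, instructions that only appear to conflict on the name can execute in parallel - the point of the technique.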

All of the above is a simplification - the real operation might involve a lot more work, or might be handled by a dedicated internal device. As such, you might be looking at a large sequence of microinstructions that do very little on their own but add up to a single instruction. In some cases special microinstructions are used to trigger asynchronous internal hardware operations that handle a particular operation, designed to improve performance.

As you can see, microcode is immensely complicated. Not only would it wildly vary between CPU types, but also between release versions and revisions. This makes it a difficult thing to target - you can't really tell what microcode is programmed into the device. Not only that, but the way the microcode is programmed into the chip is also specific to each processor. On top of that, it's undocumented and checksummed, and potentially requires some signature checks too. You'd need some serious hardware to reverse engineer the mechanisms and checks.

Let's assume for a moment that you could overwrite microcode in a useful way. How would you make it do anything useful? Keep in mind that each microinstruction simply shifts some values around in the internals of the hardware, rather than performing a complete operation. Obfuscating opcodes by juggling microcode around would require a complete custom OS and bootloader, but the BIOS would (likely) still work. Unfortunately, more modern systems use UEFI rather than the old BIOS spec, which involves executing code on the CPU in real mode. This means you'd need an entirely new BIOS and OS, all written from scratch. Hardly a useful obfuscation method. On top of that, you may not even be able to remap instructions, because the seemingly arbitrary byte values aren't so arbitrary - the individual bits map to codes that select different areas of the CPU internals. Changing them might break the CPU's ability to even parse the instruction data.

A more interesting exercise would be to implement a new instruction that transitions you from ring3 to ring0 and another that switches back, all without performing any checks. This would allow you to do some fun stuff with privilege escalation without ever needing OS-specific backdoors.

Polynomial
  • 132,208
  • 43
  • 298
  • 379
  • Awesome answer! BTW, with Linux I guess that preparing such a custom OS would be pretty easy, perhaps hacking the assembler would be enough? – d33tah Jan 26 '13 at 14:29
  • 2
    Even remapping the instructions would be difficult. An x86 instruction can have prefixes, multi-byte opcodes, etc. Simply shifting bytes around for the opcodes wouldn't work, because they internally map to something useful, e.g. the first few bits might tell the CPU what class of instruction it is (jump, privilege, ALU calc, x86 FPU, MMX, SSE, etc.). Changing these might completely break the control unit, whose job it is to take codes and execute the corresponding microprograms. – Polynomial Jan 26 '13 at 14:35
  • 1
    Linux will not help you here @d33tah - Polynomial is talking about the on-chip microcodes that make up the op-codes that your OS can use, whether it is windows, Linux or another – Rory Alsop Jan 26 '13 at 15:58
  • 1
    Why would the BIOS work when opcodes are shuffled? BIOS is only a program on a memory-mapped chip that gets executed by the CPU when it powers up. – user10008 Aug 20 '14 at 16:06
  • 1
    @user10008 You're correct - my statements above are a little misleading in hindsight. What I was trying to say is that the BIOS only uses a small subset of the instructions that a modern processor exposes, so you could modify certain extensions (e.g. x87 FPU, MMX, SSE, AES-NI, etc.) without affecting the functionality or execution of the BIOS itself. – Polynomial Aug 21 '14 at 10:13
  • 1
    Your description of how microcode is used surprises me (I'd expect most of the core functionality to be hardwired for efficiency reasons, and only more complex instructions and particular circumstances to be handled via microcode). Do you have links with further details? – Stefan Jan 19 '16 at 17:39
  • @Polynomial: Good answer but 1) On linux the ucode is loaded into volatile memory (e.g. https://wiki.archlinux.org/index.php/microcode) 2) I understand that `intel-ucode.img` is crypto signed (no details). Lastly, you can be sure someone like China or NSA could very well have their own `intel-ucode.img` equivalent for the ultimate 'ghost' processor (e.g. ghosting AES-NI). Internal mode, tapeout mask leaks etc. But yeah, here be dragons. – DeepSpace101 Feb 10 '16 at 05:34
  • Unless I am misunderstanding, this answer is very incorrect. It is describing micro-_ops_, not microcode. You say for example that `add` uses microcode, which is untrue. Due to the performance impact of a microcode table lookup, instructions like `add` do _not_ use microcode, and using microcode for it would be disastrous. It does, however, use micro-ops, but that is not what the OP was asking about. – forest Jan 18 '18 at 17:27
5

Yes, it's possible although not quite the way some think. I've proposed a few ideas on Schneier's blog along these lines. There are a few ways of doing this:

  1. Your own microcode, starting from a processor design that will not change. This can be accomplished using an open core, for instance, and freezing the internal design. Then you (and other users) write custom microcode for it. This is a lot of work, as others noted. However, you can build high-level-language-to-microcode/micro-instruction compilers (google those key words). The combination of these is the heavyweight approach. An easier version of the concept is Alpha's PALcode, which lets you create new instructions composed of existing instructions and executed atomically. Not sure if this feature exists in any processors still in production.

  2. The other approach, mine, was to come up with a microcode and simply change the identifiers for the machine-code instructions. The compiler and microcode signer sit on a protected machine, either non-networked or behind a highly assured guard. Incoming shellcode has a random effect, which in similar academic research almost never leads to code execution. (Google "instruction set randomization"; there are even CPU prototypes along these lines.) The scheme would also produce a toolchain with compiler, debugger, etc. Tensilica's Xtensa processor IP already generates CPUs and toolchains for specific applications. This is... much simpler than that. ;)

  3. The best approach is to modify the architecture to only allow sensible operations on data. These are called 'tagged', 'capability', etc. architectures. Tagged architectures add tags to pieces of memory representing a data type (e.g. integer, array, code); the processor type-checks individual operations for sanity before allowing them. Crash-safe.org's secure design does this. Capability systems are about compartmentalizing systems with secure pointers to pieces of code and data; Cambridge's CHERI project does this. Both styles were used in the past to develop practical systems with excellent security properties and/or track records. The definitive book on them is below. My current designs leverage these as a solid foundation on which to build a secure OS in the vein of GEMSOS, KeyKOS, or JX systems.

http://homes.cs.washington.edu/~levy/capabook/

Just giving this answer because, despite the negative answers, things in the spirit of what you describe have been done, and a few tested against common attacks. They just don't do a totally new microcode, especially by hand: they use shortcuts like the ones I mentioned, or design a processor to simulate the effect. It might be beaten if it becomes mainstream, yet it will stop most code injections in the meantime. I recommended using Linux on PPC (with identifying info removed) for business apps a long time ago for this reason, to a small crowd of users. They're still malware- and hack-free after 5+ years with a steady supply of cheap hardware. So, I expect a random-ISA or microcoding approach to work even better, a tagged/capability architecture better than that (even against pros), and a combination of them better still.
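The instruction-set randomization idea in point 2 can be sketched as a toy interpreter whose opcodes are XOR-encoded under a per-machine key: legitimate code is encoded off-line with the key, while injected code encoded without it decodes to junk and traps. Real ISR schemes are considerably more elaborate; the two-opcode "ISA" and the keying here are invented for illustration:

```python
# Toy instruction-set randomization: opcodes are stored XOR-encoded
# under a per-machine secret and decoded just before dispatch.

import os

KEY = os.urandom(1)[0] or 1     # per-machine secret byte, never zero

def encode(program):
    """Done once, off-line, on the trusted build machine."""
    return bytes(b ^ KEY for b in program)

def run(encoded):
    """The 'CPU': decode each byte with the key, then dispatch."""
    stack = []
    for b in encoded:
        op = b ^ KEY
        if op == 0x01:                      # PUSH1: push the constant 1
            stack.append(1)
        elif op == 0x02:                    # ADD: pop two, push the sum
            stack.append(stack.pop() + stack.pop())
        else:
            raise RuntimeError("illegal opcode: trap")
    return stack

legit = encode(bytes([0x01, 0x01, 0x02]))   # push 1, push 1, add
print(run(legit))                           # [2]

try:                                        # attacker injects raw shellcode,
    run(bytes([0x01, 0x01, 0x02]))          # encoded without the key...
except Exception as exc:
    print("injected code crashed:", type(exc).__name__)
```

The attacker's payload is byte-for-byte valid under the public encoding, yet under any non-zero key it decodes to an illegal or nonsensical sequence - the "random effect" mentioned above.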

Nick P
  • 667
  • 4
  • 4
2

I don't think that changing the microcode of x86 is possible, but running an emulator on top with a different instruction encoding is possible and is being used. This emulator can be built to start at boot time, similar to CPU bootstrapping (yes, CPUs need to bootstrap too).

Obfuscating opcodes is used in PE protectors, which generate a unique set of opcodes and a virtual machine that can interpret those opcodes. This method makes static analysis hard and is used both for anti-piracy purposes and by malware writers. An example of this technology is Themida.
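The per-build opcode shuffle such protectors rely on can be sketched like this. Real products such as Themida are vastly more complex (they also mutate the VM itself); the four-operation ISA and the build API below are invented for illustration:

```python
# Toy PE-protector VM: each "protected build" gets a freshly shuffled
# opcode numbering plus the matching interpreter, so opcode byte values
# mean nothing from one sample to the next.

import random

OPS = ["push", "add", "mul", "halt"]

def make_vm(seed):
    """Return an (opcode table, interpreter) pair unique to this build."""
    rng = random.Random(seed)
    table = dict(zip(OPS, rng.sample(range(256), len(OPS))))
    rev = {code: name for name, code in table.items()}

    def interp(bytecode):
        stack, i = [], 0
        while i < len(bytecode):
            op = rev.get(bytecode[i])
            if op == "push":
                stack.append(bytecode[i + 1]); i += 2  # next byte = operand
            elif op == "add":
                stack.append(stack.pop() + stack.pop()); i += 1
            elif op == "mul":
                stack.append(stack.pop() * stack.pop()); i += 1
            else:                      # "halt", or a byte this build ignores
                break
        return stack

    return table, interp

table, interp = make_vm(seed=42)
prog = bytes([table["push"], 6, table["push"], 7, table["mul"], table["halt"]])
print(interp(prog))  # [42]
```

A static analyst looking at `prog` alone sees meaningless bytes; the semantics only exist in the bundled interpreter, which is exactly what makes such samples tedious to reverse.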

Cristian Dobre
  • 9,797
  • 1
  • 30
  • 50