Use special-case short-form encodings for AL/AX/EAX, and other short forms and single-byte instructions
Examples assume 32 / 64-bit mode, where the default operand size is 32 bits. An operand-size prefix changes the instruction to AX instead of EAX (or the reverse in 16-bit mode).
inc/dec a register (other than 8-bit): inc eax / dec ebp are 1 byte each. (Not x86-64: the 0x4x opcode bytes were repurposed as REX prefixes, so inc r/m32 is the only encoding.)
8-bit inc bl is 2 bytes, using the inc r/m8 opcode + ModR/M operand encoding. So use inc ebx to increment bl, if it's safe (e.g. if you don't need the ZF result in cases where the upper bytes might be non-zero).
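To make the sizes concrete, here is a small NASM-syntax listing with the machine-code bytes in the comments. It is an illustrative sketch rather than code from the original answer; the register choices are arbitrary.

```nasm
bits 32
    inc  eax        ; 40       1 byte: short-form inc r32 (0x40 + register number)
    dec  ebp        ; 4D       1 byte: short-form dec r32 (0x48 + register number)
    inc  bl         ; FE C3    2 bytes: inc r/m8 opcode + ModR/M
    inc  ebx        ; 43       1 byte, but a BL of 0xFF carries into the upper bytes,
                    ;          and FLAGS reflect the full 32-bit result

bits 64
    inc  eax        ; FF C0    2 bytes: 0x40..0x4F are REX prefixes in 64-bit mode
```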
scasd: e/rdi+=4, requires that the register points to readable memory. Sometimes useful even if you don't care about the FLAGS result (like cmp eax,[rdi] / rdi+=4). And in 64-bit mode, scasb can work as a 1-byte inc rdi, if lodsb or stosb aren't useful.
xchg eax, r32: this is where 0x90 NOP came from: xchg eax,eax. Example: re-arrange 3 registers with two xchg instructions in a cdq / idiv loop for GCD in 8 bytes where most of the instructions are single-byte, including an abuse of inc ecx / loop instead of test ecx,ecx / jnz.
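A loop of that shape might look like the following. This is a reconstruction from the description above, not necessarily the exact code in the linked answer; the register assignment (a in EAX, b in ECX, both non-negative, b non-zero) is an assumption made for this sketch.

```nasm
bits 32
; Assumed interface for this sketch:
;   input:  EAX = a, ECX = b, both non-negative, b != 0
;   output: EAX = gcd(a, b); clobbers ECX and EDX
gcd:
.loop:
    cdq                 ; 99       EDX = 0, because EAX is non-negative
    idiv  ecx           ; F7 F9    EAX = quotient, EDX = remainder
    xchg  eax, edx      ; 92       EAX = remainder, EDX = quotient (discarded)
    xchg  eax, ecx      ; 91       EAX = old divisor, ECX = remainder
    inc   ecx           ; 41       pre-compensate for the decrement done by loop
    loop  .loop         ; E2 F8    ECX -= 1; repeat while the remainder is non-zero
    ret                 ; C3       not counted in the 8-byte loop body
```

The inc ecx / loop pair costs 3 bytes, versus 4 for test ecx,ecx / jnz, at the price of being slow on many CPUs.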
cdq: sign-extend EAX into EDX:EAX, i.e. copying the high bit of EAX to all bits of EDX. Use it to zero EDX when EAX is known to be non-negative, or to get a 0/-1 to add/sub or mask with. x86 history lesson: cltq vs. movslq, and also AT&T vs. Intel mnemonics for this and the related cdqe.
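Two small uses of cdq, sketched in NASM syntax with encodings in the comments. The absolute-value sequence is a well-known idiom added here for illustration; it is not taken from the original answer.

```nasm
bits 32
    ; Zero EDX when EAX is known to be non-negative
    cdq                 ; 99       1 byte, vs. 2 bytes for xor edx, edx (31 D2)

    ; Branchless absolute value using the 0 / -1 mask
    cdq                 ; 99       EDX = 0 if EAX >= 0, else -1
    xor   eax, edx      ; 31 D0    flip all bits of EAX if it was negative
    sub   eax, edx      ; 29 D0    then subtract -1 to finish the two's complement negate
```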
lodsb/d: lodsd is like mov eax, [rsi] / rsi += 4 (lodsb: mov al, [rsi] / rsi += 1) without clobbering flags. (Assuming DF is clear, which standard calling conventions require on function entry.) Also stosb/d, sometimes scas, and more rarely movs / cmps.
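As an illustration, a dword-summing loop built around lodsd could look like this. The interface (pointer in ESI, non-zero count in ECX, sum returned in EDX) is hypothetical and chosen only for the sketch.

```nasm
bits 32
; Assumed for this sketch: ESI = pointer to dwords, ECX = element count (non-zero), DF = 0
    xor   edx, edx      ; 31 D2    sum = 0
.next:
    lodsd               ; AD       EAX = [ESI], ESI += 4
    add   edx, eax      ; 01 C2    accumulate
    loop  .next         ; E2 FB    ECX -= 1; repeat while non-zero
```

The load plus pointer increment is a single byte here; mov eax, [esi] followed by add esi, 4 would take 5 bytes.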
push / pop reg: e.g. in 64-bit mode, push rsp / pop rdi is 2 bytes, but mov rdi, rsp needs a REX prefix and is 3 bytes.
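The encodings behind that comparison, as a NASM-syntax sketch with bytes in the comments:

```nasm
bits 64
    mov   rdi, rsp      ; 48 89 E7   3 bytes: REX.W + opcode + ModR/M

    push  rsp           ; 54         1 byte: pushes the value RSP had before the push
    pop   rdi           ; 5F         1 byte: RDI = old RSP, 2 bytes total
```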
xlatb exists, but is rarely useful. A large lookup table is something to avoid. I've also never found a use for AAA / DAA or other packed-BCD or 2-ASCII-digit instructions.
1-byte lahf / sahf are rarely useful. You could lahf / and ah, 1 as an alternative to setc ah, but it's typically not useful.
And for CF specifically, there's sbb eax,eax to get a 0/-1, or even the undocumented but universally supported 1-byte salc (set AL from Carry), which effectively does sbb al,al without affecting flags. (Removed in x86-64.) I used SALC in User Appreciation Challenge #1: Dennis ♦.
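For comparison, here are the options for getting CF into a register, sketched in NASM syntax with encodings in the comments (32-bit mode, since salc is not available in 64-bit mode):

```nasm
bits 32
    ; CF -> 0 / 1 in AH
    setc  ah            ; 0F 92 C4   3 bytes
    lahf                ; 9F         AH = SF:ZF:0:AF:0:PF:1:CF
    and   ah, 1         ; 80 E4 01   4 bytes for the pair, and it clobbers FLAGS

    ; CF -> 0 / -1
    sbb   eax, eax      ; 19 C0      2 bytes: EAX = CF ? -1 : 0
    salc                ; D6         1 byte, undocumented: AL = CF ? 0xFF : 0, FLAGS unchanged
```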
1-byte cmc / clc / stc (flip ("complement"), clear, or set CF) are rarely useful, although I did find a use for cmc in extended-precision addition with base 10^9 chunks. To unconditionally set/clear CF, usually arrange for that to happen as part of another instruction, e.g. xor eax,eax clears CF as well as EAX. There are no equivalent instructions for other condition flags, just DF (string direction) and IF (interrupts). The carry flag is special for a lot of instructions; shifts set it, adc al, 0 can add it to AL in 2 bytes, and I mentioned earlier the undocumented SALC.
std / cld rarely seem worth it. Especially in 32-bit code, it's better to just use dec on a pointer and a mov or memory source operand to an ALU instruction instead of setting DF so lodsb / stosb go downward instead of up. Usually if you need downward at all, you still have another pointer going up, so you'd need more than one std and cld in the whole function to use lods / stos for both. Instead, just use the string instructions for the upward direction. (The standard calling conventions guarantee DF=0 on function entry, so you can assume that for free without using cld.)
8086 history: why these encodings exist
In the original 8086, AX was very special: instructions like lodsb / stosb, cbw, mul / div and others use it implicitly. That's still the case of course; current x86 hasn't dropped any of 8086's opcodes (at least not any of the officially documented ones). But later CPUs added new instructions that gave better / more efficient ways to do things without copying or swapping operands to AX first. (Or to EAX in 32-bit mode.)
e.g. 8086 lacked later additions like movsx / movzx to load or move + sign- or zero-extend, or 2 and 3-operand imul (e.g. imul cx, bx, 1234) that don't produce a high-half result and don't have any implicit operands.
Also, 8086's main bottleneck was instruction-fetch, so optimizing for code-size was important for performance back then. 8086's ISA designer (Stephen Morse) spent a lot of opcode coding space on special cases for AX / AL, including special (E)AX/AL-destination opcodes for all the basic immediate-src ALU instructions, just opcode + immediate with no ModR/M byte: 2-byte add/sub/and/or/xor/cmp/test/... AL,imm8, or the same kind of opcode followed by a full-size immediate for AX,imm16 or (in 32-bit mode) EAX,imm32.
But there's no special case for EAX,imm8, so the regular ModR/M encoding of add eax,4 is shorter (3 bytes with a sign-extended imm8, vs. 5 bytes for the EAX,imm32 short form).
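The size differences between the accumulator short forms and the generic ModR/M forms, as a NASM-syntax sketch with encodings in the comments (the immediates and the choice of EBX / BL are arbitrary):

```nasm
bits 32
    add   al, 4             ; 04 04               2 bytes: AL,imm8 short form
    add   bl, 4             ; 80 C3 04            3 bytes: generic r/m8,imm8
    add   eax, 0x12345678   ; 05 78 56 34 12      5 bytes: EAX,imm32 short form
    add   ebx, 0x12345678   ; 81 C3 78 56 34 12   6 bytes: generic r/m32,imm32
    add   eax, 4            ; 83 C0 04            3 bytes: r/m32 with sign-extended imm8
```

An assembler such as NASM normally picks the shortest of these encodings on its own; what matters for golfing is which register the value lives in.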
The assumption is that if you're going to work on some data, you'll want it in AX / AL, so swapping a register with AX was something you might want to do, maybe even more often than copying a register to AX with mov.
Everything about 8086 instruction encoding supports this paradigm, from instructions like lodsb/w to all the special-case encodings for immediates with EAX to its implicit use even for multiply/divide.
Don't get carried away; it's not automatically a win to swap everything to EAX, especially if you need to use immediates with 32-bit registers instead of 8-bit. Or if you need to interleave operations on multiple variables in registers at once. Or if you're using instructions with 2 registers, not immediates at all.
But always keep in mind: am I doing anything that would be shorter in EAX/AL? Can I rearrange so I have this in AL, or am I currently taking better advantage of AL with what I'm already using it for?
Mix 8-bit and 32-bit operations freely to take advantage whenever it's safe to do so (you don't need carry-out into the full register or whatever).
Also, to initialize a register with a small (8-bit) value other than 0: use e.g. push 200; pop edx - 3 bytes for initialization. – anatolyg – 2017-07-18T10:27:57.733
BTW to initialize a register to -1, use dec, e.g. xor eax, eax; dec eax – anatolyg – 2017-07-18T10:28:32.513
@anatolyg: 200 is a poor example, it doesn't fit in a sign-extended-imm8. But yes, push imm8 / pop reg is 3 bytes, and is fantastic for 64-bit constants on x86-64, where dec / inc is 2 bytes. And push r64 / pop r64 (2 bytes) can even replace a 3 byte mov r64, r64 (3 bytes with REX). See also Set all bits in CPU register to 1 efficiently for stuff like lea eax, [rcx-1] given a known value in eax (e.g. if need a zeroed register and another constant, just use LEA instead of push/pop – Peter Cordes – 2018-03-29T13:35:00.373