I noticed that there's no such question, so here it is:

Do you have general tips for golfing in machine code? If the tip only applies to a certain environment or calling convention, please specify that in your answer.

Please only one tip per answer (see here).

ბიმო

Posted 2017-07-17T21:43:42.150

Reputation: 15 345

Answers

`mov`-immediate is expensive for constants

This might be obvious, but I'll still put it here. In general it pays off to think about the bit-level representation of a number when you need to initialize a value.

Initializing `eax` with `0`:

b8 00 00 00 00          mov    $0x0,%eax

should be shortened (for performance as well as code-size) to

31 c0                   xor    %eax,%eax

Initializing `eax` with `-1`:

b8 ff ff ff ff          mov    $-1,%eax

can be shortened to

31 c0                   xor    %eax,%eax
48                      dec    %eax

83 c8 ff                or     $-1,%eax

Or more generally, any 8-bit sign-extended value can be created in 3 bytes with push -12 (2 bytes) / pop %eax (1 byte). This even works for 64-bit registers with no extra REX prefix; push/pop default operand-size = 64.

6a f3                   pushq  $0xfffffffffffffff3
5d                      pop    %rbp

Or given a known constant in a register, you can create another nearby constant using lea 123(%eax), %ecx (3 bytes). This is handy if you need a zeroed register and a constant; xor-zero (2 bytes) + lea-disp8 (3 bytes).

31 c0                   xor    %eax,%eax
8d 48 0c                lea    0xc(%eax),%ecx

ბიმო

Posted 2017-07-17T21:43:42.150

Reputation: 15 345

Also, to initialize a register with a small (8-bit) value other than 0: use e.g. push 200; pop edx - 3 bytes for initialization. – anatolyg – 2017-07-18T10:27:57.733

2BTW to initialize a register to -1, use dec, e.g. xor eax, eax; dec eax – anatolyg – 2017-07-18T10:28:32.513

@anatolyg: 200 is a poor example, it doesn't fit in a sign-extended-imm8. But yes, push imm8 / pop reg is 3 bytes, and is fantastic for 64-bit constants on x86-64, where dec / inc is 2 bytes. And push r64 / pop 64 (2 bytes) can even replace a 3 byte mov r64, r64 (3 bytes with REX). See also Set all bits in CPU register to 1 efficiently for stuff like lea eax, [rcx-1] given a known value in eax (e.g. if need a zeroed register and another constant, just use LEA instead of push/pop

– Peter Cordes – 2018-03-29T13:35:00.373

In a lot of cases, accumulator-based instructions (i.e. those that take (R|E)AX as the destination operand) are 1 byte shorter than general-case instructions; see this question on StackOverflow.

Govind Parmar

Posted 2017-07-17T21:43:42.150

Reputation: 828

Normally the most useful ones are the al, imm8 special cases, like or al, 0x20 / sub al, 'a' / cmp al, 'z'-'a' / ja .non_alphabetic being 2 bytes each, instead of 3. Using al for character data also allows lodsb and/or stosb. Or use al to test something about the low byte of EAX, like lodsd / test al, 1 / setnz cl makes cl=1 or 0 for odd/even. But in the rare case where you need a 32-bit immediate, then sure op eax, imm32, like in my chroma-key answer

– Peter Cordes – 2018-03-29T13:31:26.463

Choose your calling convention to put args where you want them.

The language of your answer is asm (actually machine code), so treat it as part of a program written in asm, not C-compiled-for-x86. Your function doesn't have to be easily callable from C with any standard calling convention. That's a nice bonus if it doesn't cost you any extra bytes, though.

In a pure asm program, it's normal for some helper functions to use a calling convention that's convenient for them and for their caller. Such functions document their calling convention (inputs / outputs / clobbers) with comments.

In real life, even asm programs do (I think) tend to use consistent calling conventions for most functions (especially across different source files), but any given important function could do something special. In code-golf, you're optimizing the crap out of one single function, so obviously it's important / special.

To test your function from a C program, can write a wrapper that puts args in the right places, saves/restores any extra registers you clobber, and puts the return value into e/rax if it wasn't there already.

The limits of what's reasonable: anything that doesn't impose an unreasonable burden on the caller:

ESP/RSP must be call-preserved; other integer regs are fair game. (RBP and RBX are usually call-preserved in normal conventions, but you could clobber both.)
Any arg in any register (except RSP) is reasonable, but asking the caller to copy the same arg to multiple registers is not.
Requiring DF (string direction flag for lods / stos / etc.) to be clear (upward) on call/ret is normal. Letting it be undefined on call/ret would be ok. Requiring it to be cleared or set on entry but then leaving it modified when you return would be weird.
Returning FP values in x87 st0 is reasonable, but returning in st3 with garbage in other x87 register isn't. The caller would have to clean up the x87 stack. Even returning in st0 with non-empty higher stack registers would also be questionable (unless you're returning multiple values).
Your function will be called with call, so [rsp] is your return address. You can avoid call/ret on x86 using link register like lea rbx, [ret_addr] / jmp function and return with jmp rbx, but that's not "reasonable". That's not as efficient as call/ret, so it's not something you'd plausibly find in real code.
Clobbering unlimited memory above RSP is not reasonable, but clobbering your function args on the stack is allowed in normal calling conventions. x64 Windows requires 32 bytes of shadow space above the return address, while x86-64 System V gives you a 128 byte red-zone below RSP, so either of those are reasonable. (Or even a much larger red-zone, especially in a stand-alone program rather than function.)

Borderline cases: write a function that produces a sequence in an array, given the first 2 elements as function args. I chose to have the caller store the start of the sequence into the array and just pass a pointer to the array. This is definitely bending the question's requirements. I considered taking the args packed into xmm0 for movlps [rdi], xmm0, which would also be a weird calling convention.

Return a boolean in FLAGS (condition codes)

OS X system calls do this (CF=0 means no error): Is it considered bad practice to use the flags register as a boolean return value?.

Any condition that can be checked with one JCC is perfectly reasonable, especially if you can pick one that has any semantic relevance to the problem. (e.g. a compare function might set flags so jne will be taken if they weren't equal).

Require narrow args (like a `char`) to be sign or zero extended to 32 or 64 bits.

This is not unreasonable; using movzx or movsx to avoid partial-register slowdowns is normal in modern x86 asm. In fact clang/LLVM already makes code that depends on an undocumented extension to the x86-64 System V calling convention: args narrower than 32 bits are sign or zero extended to 32 bits by the caller.

You can document/describe extension to 64 bits by writing uint64_t or int64_t in your prototype if you want. e.g. so you can use a loop instruction, which uses the whole 64 bits of RCX unless you use an address-size prefix to override the size down to 32 bit ECX (yes really, address-size not operand-size).

Note that long is only a 32-bit type in the Windows 64-bit ABI, and the Linux x32 ABI; uint64_t is unambiguous and shorter to type than unsigned long long.

Existing calling conventions:

Windows 32-bit __fastcall, already suggested by another answer: integer args in ecx and edx.
x86-64 System V: passes lots of args in registers, and has lots of call-clobbered registers you can use without REX prefixes. More importantly, it was actually chosen to allow compilers to inline memcpy or memset as rep movsb easily: the first 6 integer/pointer args are passed in RDI, RSI, RDX, RCX, R8, R9.

If your function uses lodsd / stosd inside a loop that runs rcx times (with the loop instruction), you can say "callable from C as int foo(int *rdi, const int *rsi, int dummy, uint64_t len) with the x86-64 System V calling convention". example: chromakey.
32-bit GCC regparm: Integer args in EAX, ECX, EDX, return in EAX (or EDX:EAX). Having the first arg in the same register as the return value allows some optimizations, like this case with an example caller and a prototype with a function attribute. And of course AL/EAX is special for some instructions.
The Linux x32 ABI uses 32-bit pointers in long mode, so you can save a REX prefix when modifying a pointer (example use-case). You can still use 64-bit address-size, unless you have a 32-bit negative integer zero-extended in a register (so it would be a large unsigned value if you did [rdi + rdx]).

Note that push rsp / pop rax is 2 bytes, and equivalent to mov rax,rsp, so you can still copy full 64-bit registers in 2 bytes.

Peter Cordes

Posted 2017-07-17T21:43:42.150

Reputation: 2 810

When challenges ask to return an array, do you think returning on the stack is reasonable? I think that's what compilers will do when returning a struct by value. – qwr – 2018-05-18T18:52:19.970

@qwr: no, the mainstream calling conventions pass a hidden pointer to the return value. (Some conventions pass/return small structs in registers). C/C++ returning struct by value under the hood, and see the end of How do objects work in x86 at the assembly level?. Note that passing arrays (inside structs) does copy them onto the stack for x86-64 SysV: What kind of C11 data type is an array according to the AMD64 ABI, but Windows x64 passes a non-const pointer.

– Peter Cordes – 2018-05-18T20:14:04.087

so what do you think about reasonable or not? Do you count x86 under this rule https://codegolf.meta.stackexchange.com/a/8507/17360

– qwr – 2018-05-18T23:37:40.003

@qwr: x86 isn't a "stack based language". x86 is a register machine with RAM, not a stack machine. A stack machine is like reverse-polish notation, like x87 registers. fld / fld / faddp. x86's call-stack doesn't fit that model: all normal calling conventions leave RSP unmodified, or pop the args with ret 16; they don't pop the return address, push an array, then push rcx / ret. The caller would have to know the array size or have saved RSP somewhere outside the stack to find itself.

– Peter Cordes – 2018-05-18T23:56:42.557

Call push the address of instruction after the call in the stack jmp to function called; ret pop the address from the stack and jmp to that address – RosLuP – 2019-03-12T18:56:55.187

It is possible return one array if 1) that array is given in the arguments or 2) use malloc() and who call has the responsibility of free that array with free() function – RosLuP – 2019-03-12T19:01:56.753

Do language implements with full tail-call optimization all preserve ESP/RSP? – dfeuer – 2019-05-05T06:31:38.780

@dfeuer Yes. jmp instead of call / ret is only possible if the number of stack args for the target function is less than or equal to the stack arg space for this function. (Or for callee-pops with ret 12 for example, if it matches exactly). Many compilers have missed optimizations about re-using their own stack-arg space for tailcall args (of different may matter). https://godbolt.org/z/Nty30a. So register-arg calling conventions help a lot, both to avoid missed optimations, and to make tailcalls possible when the target has more args than this function.

– Peter Cordes – 2019-05-05T06:50:56.830

Use special-case short-form encodings for AL/AX/EAX, and other short forms and single-byte instructions

Examples assume 32 / 64-bit mode, where the default operand size is 32 bits. An operand-size prefix changes the instruction to AX instead of EAX (or the reverse in 16-bit mode).

inc/dec a register (other than 8-bit): inc eax / dec ebp. (Not x86-64: the 0x4x opcode bytes were repurposed as REX prefixes, so inc r/m32 is the only encoding.)

8-bit inc bl is 2 bytes, using the inc r/m8 opcode + ModR/M operand encoding. So use inc ebx to increment bl, if it's safe. (e.g. if you don't need the ZF result in cases where the upper bytes might be non-zero).
scasd: e/rdi+=4, requires that the register points to readable memory. Sometimes useful even if you don't care about the FLAGS result (like cmp eax,[rdi] / rdi+=4). And in 64-bit mode, scasb can work as a 1-byte inc rdi, if lodsb or stosb aren't useful.
xchg eax, r32: this is where 0x90 NOP came from: xchg eax,eax. Example: re-arrange 3 registers with two xchg instructions in a cdq / idiv loop for GCD in 8 bytes where most of the instructions are single-byte, including an abuse of inc ecx/loop instead of test ecx,ecx/jnz
cdq: sign-extend EAX into EDX:EAX, i.e. copying the high bit of EAX to all bits of EDX. To create a zero with known non-negative, or to get a 0/-1 to add/sub or mask with. x86 history lesson: cltq vs. movslq, and also AT&T vs. Intel mnemonics for this and the related cdqe.
lodsb/d: like mov eax, [rsi] / rsi += 4 without clobbering flags. (Assuming DF is clear, which standard calling conventions require on function entry.) Also stosb/d, sometimes scas, and more rarely movs / cmps.
push/pop reg. e.g. in 64-bit mode, push rsp / pop rdi is 2 bytes, but mov rdi, rsp needs a REX prefix and is 3 bytes.

xlatb exists, but is rarely useful. A large lookup table is something to avoid. I've also never found a use for AAA / DAA or other packed-BCD or 2-ASCII-digit instructions.

1-byte lahf / sahf are rarely useful. You could lahf / and ah, 1 as an alternative to setc ah, but it's typically not useful.

And for CF specifically, there's sbb eax,eax to get a 0/-1, or even un-documented but universally supported 1-byte salc (set AL from Carry) which effectively does sbb al,al without affecting flags. (Removed in x86-64). I used SALC in User Appreciation Challenge #1: Dennis ♦.

1-byte cmc / clc / stc (flip ("complement"), clear, or set CF) are rarely useful, although I did find a use for cmc in extended-precision addition with base 10^9 chunks. To unconditionally set/clear CF, usually arrange for that to happen as part of another instruction, e.g. xor eax,eax clears CF as well as EAX. There are no equivalent instructions for other condition flags, just DF (string direction) and IF (interrupts). The carry flag is special for a lot of instructions; shifts set it, adc al, 0 can add it to AL in 2 byte, and I mentioned earlier the undocumented SALC.

std / cld rarely seem worth it. Especially in 32-bit code, it's better to just use dec on a pointer and a mov or memory source operand to an ALU instruction instead of setting DF so lodsb / stosb go downward instead of up. Usually if you need downward at all, you still have another pointer going up, so you'd need more than one std and cld in the whole function to use lods / stos for both. Instead, just use the string instructions for the upward direction. (The standard calling conventions guarantee DF=0 on function entry, so you can assume that for free without using cld.)

8086 history: why these encodings exist

In original 8086, AX was very special: instructions like lodsb / stosb, cbw, mul / div and others use it implicitly. That's still the case of course; current x86 hasn't dropped any of 8086's opcodes (at least not any of the officially documented ones). But later CPUs added new instructions that gave better / more efficient ways to do things without copying or swapping them to AX first. (Or to EAX in 32-bit mode.)

e.g. 8086 lacked later additions like movsx / movzx to load or move + sign-extend, or 2 and 3-operand imul cx, bx, 1234 that don't produce a high-half result and don't have any implicit operands.

Also, 8086's main bottleneck was instruction-fetch, so optimizing for code-size was important for performance back then. 8086's ISA designer (Stephen Morse) spent a lot of opcode coding space on special cases for AX / AL, including special (E)AX/AL-destination opcodes for all the basic immediate-src ALU- instructions, just opcode + immediate with no ModR/M byte. 2-byte add/sub/and/or/xor/cmp/test/... AL,imm8 or AX,imm16 or (in 32-bit mode) EAX,imm32.

But there's no special case for EAX,imm8, so the regular ModR/M encoding of add eax,4 is shorter.

The assumption is that if you're going to work on some data, you'll want it in AX / AL, so swapping a register with AX was something you might want to do, maybe even more often than copying a register to AX with mov.

Everything about 8086 instruction encoding supports this paradigm, from instructions like lodsb/w to all the special-case encodings for immediates with EAX to its implicit use even for multiply/divide.

Don't get carried away; it's not automatically a win to swap everything to EAX, especially if you need to use immediates with 32-bit registers instead of 8-bit. Or if you need to interleave operations on multiple variables in registers at once. Or if you're using instructions with 2 registers, not immediates at all.

But always keep in mind: am I doing anything that would be shorter in EAX/AL? Can I rearrange so I have this in AL, or am I currently taking better advantage of AL with what I'm already using it for.

Mix 8-bit and 32-bit operations freely to take advantage whenever it's safe to do so (you don't need carry-out into the full register or whatever).

Peter Cordes

Posted 2017-07-17T21:43:42.150

Reputation: 2 810

cdq is useful for div which needs zeroed edx in many cases. – qwr – 2018-04-14T16:50:23.800

@qwr: right, you can abuse cdq before unsigned div if you know your dividend is below 2^31 (i.e. non-negative when treated as signed), or if you use it before setting eax to a potentially-large value. Normally (outside code-golf) you'd use cdq as setup for idiv, and xor edx,edx before div

– Peter Cordes – 2018-04-14T21:17:05.730

Use `fastcall` conventions

x86 platform has many calling conventions. You should use those that pass parameters in registers. On x86_64, the first few parameters are passed in registers anyway, so no problem there. On 32-bit platforms, the default calling convention (cdecl) passes parameters in stack, which is no good for golfing - accessing parameters on stack requires long instructions.

When using fastcall on 32-bit platforms, 2 first parameters are usually passed in ecx and edx. If your function has 3 parameters, you might consider implementing it on a 64-bit platform.

C function prototypes for fastcall convention (taken from this example answer):

extern int __fastcall SwapParity(int value);                 // MSVC
extern int __attribute__((fastcall)) SwapParity(int value);  // GNU

anatolyg

Posted 2017-07-17T21:43:42.150

Reputation: 10 719

Or use a fully custom calling convention, because you're writing in pure asm, not necessarily writing code to be called from C. Returning booleans in FLAGS is often convenient.

– Peter Cordes – 2018-08-04T21:32:02.360

Subtract -128 instead of add 128

0100 81C38000      ADD     BX,0080
0104 83EB80        SUB     BX,-80

Samely, add -128 instead of subtract 128

l4m2

Posted 2017-07-17T21:43:42.150

Reputation: 5 985

This also works the other direction, of course: add -128 instead of sub 128. Fun fact: compilers know this optimization, and also do a related optimization of turning < 128 into <= 127 to reduce the magnitude of an immediate operand for cmp, or gcc always prefers rearranging compares to reduce the magnitude even if it's not -129 vs. -128.

– Peter Cordes – 2018-05-18T20:22:04.100

Create 3 zeroes with `mul` (then `inc`/`dec` to get +1 / -1 as well as zero)

You can zero eax and edx by multiplying by zero in a third register.

xor   ebx, ebx      ; 2B  ebx = 0
mul   ebx           ; 2B  eax=edx = 0

inc   ebx           ; 1B  ebx=1

will result in EAX, EDX, and EBX all being zero in just four bytes. You can zero EAX and EDX in three bytes:

xor eax, eax
cdq

But from that starting point you can't get a 3rd zeroed register in one more byte, or a +1 or -1 register in another 2 bytes. Instead, use the mul technique.

Example use-case: concatenating the Fibonacci numbers in binary.

Note that after a LOOP loop finishes, ECX will be zero and can be used to zero EDX and EAX; you don't always have to create the first zero with xor.

peter ferrie

Posted 2017-07-17T21:43:42.150

Reputation: 804

1This is a bit confusing. Could you expand? – NoOneIsHere – 2017-11-14T01:29:14.740

@NoOneIsHere I believe he wants to set three registers to 0, including EAX and EDX. – NieDzejkob – 2017-11-27T15:11:29.573

CPU registers and flags are in known startup states

We can assume that the CPU is in a known and documented default state based on platform and OS.

For example:

DOS http://www.fysnet.net/yourhelp.htm

Linux x86 ELF http://asm.sourceforge.net/articles/startup.html

640KB

Posted 2017-07-17T21:43:42.150

Reputation: 7 149

Code Golf rules say your code has to work on at least one implementation. Linux chooses to zero all the regs (except RSP) and stack before entering a fresh user-space process, even though the i386 and x86-64 System V ABI docs say they're "undefined" on entry to _start. So yes it's fair game to take advantage of that if you're writing a program instead of a function. I did so in Extreme Fibonacci. (In a dynamically-linked executable, ld.so runs before jumping to your _start, and does leave garbage in registers, but static is just your code.)

– Peter Cordes – 2019-05-03T01:50:05.033

A couple others: eflags is set to 0x202, mxcsr is set to 0x1f80, though you can't directly access them. – Calculuswhiz – 2020-01-10T05:04:31.347

To add or subtract 1, use the one byte inc or dec instructions which are smaller than the multibyte add and sub instructions.

user230118

Posted 2017-07-17T21:43:42.150

Reputation: 239

Note that 32-bit mode has 1-byte inc/dec r32 with the register number encoded in the opcode. So inc ebx is 1 byte, but inc bl is 2. Still smaller than add bl, 1 of course, for registers other than al. Also note that inc/dec leave CF unmodified, but update the other flags.

– Peter Cordes – 2018-03-29T18:56:15.863

12 for +2 & -2 in x86 – l4m2 – 2018-03-31T04:39:59.103

The loop and string instructions are smaller than alternative instruction sequences. Most useful is loop <label> which is smaller than the two instruction sequence dec ECX and jnz <label>, and lodsb is smaller than mov al,[esi] and inc si.

user230118

Posted 2017-07-17T21:43:42.150

Reputation: 239

`lea` for math

This is probably one of the first things one learns about x86, but I leave it here as a reminder. lea can be used to do multiplication by 2, 3, 4, 5, 8, or 9, and adding an offset.

For example, to calculate ebx = 9*eax + 3 in one instruction (in 32-bit mode):

8d 5c c0 03             lea    0x3(%eax,%eax,8),%ebx

Here it is without an offset:

8d 1c c0                lea    (%eax,%eax,8),%ebx

Wow! Of course, lea can be used to also do math like ebx = edx + 8*eax + 3 for calculating array indexing.

qwr

Posted 2017-07-17T21:43:42.150

Reputation: 8 929

1Maybe worth mentioning that lea eax, [rcx + 13] is the no-extra-prefixes version for 64-bit mode. 32-bit operand-size (for the result) and 64-bit address size (for the inputs). – Peter Cordes – 2018-03-30T17:39:09.453

The FLAGS are set after many instructions

After many arithmetic instructions, the Carry Flag (unsigned) and Overflow Flag (signed) are set automatically (more info). The Sign Flag and Zero Flag are set after many arithmetic and logical operations. This can be used for conditional branching.

Example:

d1 f8                   sar    %eax

ZF is set by this instruction, so we can use it for condtional branching.

qwr

Posted 2017-07-17T21:43:42.150

Reputation: 8 929

When have you ever used the parity flag? You know it's the horizontal xor of the low 8 bits of the result, right? (Regardless of operand-size, PF is set only from the low 8 bits; see also). Not even-number / odd-number; for that check ZF after test al,1; you usually don't get that for free. (Or and al,1 to create an integer 0/1 depending on odd/even.)

– Peter Cordes – 2018-03-29T18:46:20.400

Anyway, if this answer said "use flags already set by other instructions to avoid test / cmp", then that would be pretty basic beginner x86, but still worth an upvote. – Peter Cordes – 2018-03-29T18:49:10.487

@PeterCordes Huh, I seemed to have misunderstood the parity flag. I am still working on my other answer. I'll edit the answer. And as you can probably tell, I am a beginner so basic tips help. – qwr – 2018-03-29T18:55:55.587

`mov` small immediates into lower registers when applicable

If you already know the upper bits of a register are 0, you can use a shorter instruction to move an immediate into the lower registers.

b8 0a 00 00 00          mov    $0xa,%eax

versus

b0 0a                   mov    $0xa,%al

Use `push`/`pop` for imm8 to zero upper bits

Credit to Peter Cordes. xor/mov is 4 bytes, but push/pop is only 3!

6a 0a                   push   $0xa
58                      pop    %eax

qwr

Posted 2017-07-17T21:43:42.150

Reputation: 8 929

mov al, 0xa is good if you don't need it zero-extended to the full reg. But if you do, xor/mov is 4 bytes vs. 3 for push imm8/pop or lea from another known constant. This could be useful in combination with mul to zero 3 registers in 4 bytes, or cdq, if you need a lot of constants, though. – Peter Cordes – 2018-03-29T18:13:49.997

The other use-case would be for constants from [0x80..0xFF], which are not representable as a sign-extended imm8. Or if you already know the upper bytes, e.g. mov cl, 0x10 after a loop instruction, because the only way for loop to not jump is when it made rcx=0. (I guess you said this, but your example uses an xor). You can even use the low byte of a register for something else, as long as the something else puts it back to zero (or whatever) when you're done. e.g. my Fibonacci program keeps -1024 in ebx, and uses bl.

– Peter Cordes – 2018-03-29T18:17:28.297

@PeterCordes I've added your push/pop technique – qwr – 2018-03-29T18:30:14.333

Should probably go into the existing answer about constants, where anatolyg already suggested it in a comment. I'll edit that answer. IMO you should rework this one to suggest using 8-bit operand-size for more stuff (except xchg eax, r32) e.g. mov bl, 10 / dec bl / jnz so your code doesn't care about the high bytes of RBX.

– Peter Cordes – 2018-03-29T18:37:11.633

@PeterCordes hmm. I'm still not sure about when to use 8-bit operands so I'm not sure what to put in that answer. – qwr – 2018-03-29T18:41:26.423

If you can solve the problem with uint8_t / int8_t, then you can do everything with 8-bit registers. e.g. if ecx is busy for something else so you can't use loop, and your trip-count is less than 255, then you can mov bl, 123 / dec bl / jnz. Oh, but note that dec bl is 2 bytes while dec ebx is 1 byte in 32-bit mode. dec ebx is 2 bytes in x86-64, so you aren't missing out on inc/dec by using 8-bit operand size. – Peter Cordes – 2018-03-29T18:52:21.373

Caveat is that PUSH immediate is not supported on the 8086/8088. – 640KB – 2019-02-18T19:16:44.270

Use whatever calling conventions are convenient

System V x86 uses the stack and System V x86-64 uses rdi, rsi, rdx, rcx, etc. for input parameters, and rax as the return value, but it is perfectly reasonable to use your own calling convention. __fastcall uses ecx and edx as input parameters, and other compilers/OSes use their own conventions. Use the stack and whatever registers as input/output when convenient.

Example: The repetitive byte counter, using a clever calling convention for a 1 byte solution.

Meta: Writing input to registers, Writing output to registers

Other resources: Agner Fog's notes on calling conventions

qwr

Posted 2017-07-17T21:43:42.150

Reputation: 8 929

I finally got around to posting my own answer on this question about making up calling conventions, and what's reasonable vs unreasonable.

– Peter Cordes – 2018-05-18T05:05:10.250

@PeterCordes unrelated, what is the best way to print in x86? So far I've avoided challenges that require printing. DOS looks like it has useful interrupts for I/O but I am only planning on writing 32/64 bit answers. The only way I know of is int 0x80 which requires a bunch of setup. – qwr – 2018-05-18T19:23:00.117

Yeah, int 0x80 in 32-bit code, or syscall in 64-bit code, to invoke sys_write, is the only good way. It's what I used for Extreme Fibonacci. In 64-bit code, __NR_write = 1 = STDOUT_FILENO, so you can mov eax, edi. Or if the upper bytes of EAX are zero, mov al, 4 in 32-bit code. You could also call printf or puts, I guess, and write a "x86 asm for Linux+glibc" answer. I think it's reasonable to not count the PLT or GOT entry space, or the library code itself.

– Peter Cordes – 2018-05-18T20:02:36.557

I'd be more inclined to have the caller pass a char*buf and produce the string in that, with manual formatting. e.g. like this (awkwardly optimized for speed) asm FizzBuzz, where I got string data into register and then stored it with mov, because the strings were short and fixed-length.

– Peter Cordes – 2018-05-18T20:04:35.510

Use do-while loops instead of while loops

This is not x86 specific but is a widely applicable beginner assembly tip. If you know a while loop will run at least once, rewriting the loop as a do-while loop, with loop condition checking at the end, often saves a 2 byte jump instruction. In a special case you might even be able to use loop.

qwr

Posted 2017-07-17T21:43:42.150

Reputation: 8 929

Related: Why are loops always compiled like this? explains why do{}while() is the natural looping idiom in assembly (especially for efficiency). Note also that 2-byte jecxz/jrcxz before a loop works very well with loop to handle the "needs to run zero times" case "efficiently" (on the rare CPUs where loop isn't slow). jecxz is also usable inside the loop to implement a while(ecx){}, with jmp at the bottom.

– Peter Cordes – 2018-04-15T00:02:18.057

@PeterCordes that is a very well written answer. I'd like to find a use for jumping into the middle of a loop in a code golf program. – qwr – 2018-04-15T02:56:33.927

Use goto jmp and indentation... Loop follow – RosLuP – 2019-03-12T19:04:21.387

Use conditional moves `CMOVcc` and sets `SETcc`

This is more a reminder to myself, but conditional set instructions exist and conditional move instructions exist on processors P6 (Pentium Pro) or newer. There are many instructions that are based on one or more of the flags set in EFLAGS.

qwr

Posted 2017-07-17T21:43:42.150

Reputation: 8 929

1I've found branching is usually smaller. There are some cases where it's a natural fit, but cmov has a 2-byte opcode (0F 4x +ModR/M) so it's 3 bytes minimum. But the source is r/m32, so you can conditionally load in 3 bytes. Other than branching, setcc is useful in more cases than cmovcc. Still, consider the entire instruction set, not just baseline 386 instructions. (Although SSE2 and BMI/BMI2 instruction are so large that they're rarely useful. rorx eax, ecx, 32 is 6 bytes, longer than mov + ror. Nice for performance, not golf unless POPCNT or PDEP saves many isns) – Peter Cordes – 2018-03-29T18:33:32.200

@PeterCordes thanks, I've added setcc. – qwr – 2018-03-29T18:36:30.063

Save on `jmp` bytes by arranging into if/then rather than if/then/else

This is certainly very basic, just thought I would post this as something to think about when golfing. As an example, consider the following straightforward code to decode a hexadecimal digit character:

    cmp $'A', %al
    jae .Lletter
    sub $'0', %al
    jmp .Lprocess
.Lletter:
    sub $('A'-10), %al
.Lprocess:
    movzbl %al, %eax
    ...

This can be shortened by two bytes by letting a "then" case fall into an "else" case:

    cmp $'A', %al
    jb .digit
    sub $('A'-'0'-10), %eax
.digit:
    sub $'0', %eax
    movzbl %al, %eax
    ...

Daniel Schepler

Posted 2017-07-17T21:43:42.150

Reputation: 1 001

You'd often do this normally when optimizing for performance, especially when the extra sub latency on the critical path for one case isn't part of a loop-carried dependency chain (like here where each input digit is independent until merging 4-bit chunks). But I guess +1 anyway. BTW, your example has a separate missed optimization: if you're going to need a movzx at the end anyway then use sub $imm, %al not EAX to take advantage of the no-modrm 2-byte encoding of op $imm, %al. – Peter Cordes – 2019-08-24T05:05:09.823

Also, you can eliminate the cmp by doing sub $'A'-10, %al ; jae .was_alpha ; add $('A'-10)-'0'. (I think I got the logic right). Note that 'A'-10 > '9' so there's no ambiguity. Subtracting the correction for a letter will wrap a decimal digit. So this is safe if we're assuming our input is valid hex, just like yours does. – Peter Cordes – 2019-08-24T05:10:52.983

You can fetch sequential objects from the stack by setting esi to esp, and performing a sequence of lodsd/xchg reg, eax.

peter ferrie

Posted 2017-07-17T21:43:42.150

Reputation: 804

Why is this better than pop eax / pop edx / ...? If you need to leave them on the stack, you can push them all back after to restore ESP, still 2 bytes per object with no need to mov esi,esp. Or did you mean for 4-byte objects in 64-bit code where pop would get 8 bytes? BTW, you can even use pop to loop over a buffer with better performance than lodsd, e.g. for extended-precision addition in Extreme Fibonacci

– Peter Cordes – 2018-03-29T15:06:55.657

it's more correctly useful after an "lea esi,[esp+size of ret address]", which would preclude using pop unless you have a spare register. – peter ferrie – 2018-04-03T12:46:39.937

Oh, for function args? Pretty rare you want more args than there are registers, or that you'd want the caller to leave one in memory instead of passing them all in registers. (I have a half-finished answer about using custom calling conventions, in case one of the standard register-call conventions doesn't fit perfectly.) – Peter Cordes – 2018-04-03T19:18:01.370

cdecl instead of fastcall will leave the parameters on the stack, and it's easy to have lots of parameters. See github.com/peterferrie/tinycrypt, for example. – peter ferrie – 2018-04-08T21:53:25.177

For codegolf and ASM: Use instructions use only registers, push pop, minimize register memory or memory immediate

RosLuP

Posted 2017-07-17T21:43:42.150

Reputation: 3 036

To copy a 64-bit register, use push rcx ; pop rdx instead of a 3-byte mov.
The default operand-size of push/pop is 64-bit without needing a REX prefix.

  51                      push   rcx
  5a                      pop    rdx
                vs.
  48 89 ca                mov    rdx,rcx

(An operand-size prefix can override the push/pop size to 16-bit, but 32-bit push/pop operand-size is not encodeable in 64-bit mode even with REX.W=0.)

If either or both registers are r8..r15, use mov because push and/or pop will need a REX prefix. Worst case this actually loses if both need REX prefixes. Obviously you should usually avoid r8..r15 anyway in code golf.

You can keep your source more readable while developing with this NASM macro. Just remember that it steps on the 8 bytes below RSP. (In the red-zone in x86-64 System V). But under normal conditions it's a drop-in replacement for 64-bit mov r64,r64 or mov r64, -128..127

    ; mov  %1, %2       ; use this macro to copy 64-bit registers in 2 bytes (no REX prefix)
%macro MOVE 2
    push  %2
    pop   %1
%endmacro

Examples:

   MOVE  rax, rsi            ; 2 bytes  (push + pop)
   MOVE  rbp, rdx            ; 2 bytes  (push + pop)
   mov   ecx, edi            ; 2 bytes.  32-bit operand size doesn't need REX prefixes

   MOVE  r8, r10             ; 4 bytes, don't use
   mov   r8, r10             ; 3 bytes, REX prefix has W=1 and the bits for reg and r/m being high

   xchg  eax, edi            ; 1 byte  (special xchg-with-accumulator opcodes)
   xchg  rax, rdi            ; 2 bytes (REX.W + that)

   xchg  ecx, edx            ; 2 bytes (normal xchg + modrm)
   xchg  rcx, rdx            ; 3 bytes (normal REX + xchg + modrm)

The xchg part of the example is because sometimes you need to get a value into EAX or RAX and don't care about preserving the old copy. push/pop doesn't help you actually exchange, though.

Peter Cordes

Posted 2017-07-17T21:43:42.150

Reputation: 2 810

Try `AAM` or `AAD` for byte division operations

If you are working with only 8 bit values, using the AAM instruction can sometimes save several bytes over DIV reg8 since it will take an imm8 and returns remainder and quotient in opposite AH/AL registers as DIV.

D4 0A    AAM        ; AH = AL / 10, AL = AL % 10

It can also accept any byte value as the divisor as well by altering the second byte.

D4 XX    AAM  XX    ; AH = AL / XX, AL = AL % XX

And AAD is the inverse of this, which is two operations in one.

D5 XX    AAD  XX    ; AL = AH * XX + AL

640KB

Posted 2017-07-17T21:43:42.150

Reputation: 7 149

Try `XLAT` for byte memory access

XLAT is a one byte instruction that is equivalent to AL = [BX+AL]. Yes, that's right, it lets you use AL as an index register for memory access.

640KB

Posted 2017-07-17T21:43:42.150

Reputation: 7 149

Tips for golfing in x86/x64 machine code

Answers

mov-immediate is expensive for constants

Initializing eax with 0:

Initializing eax with -1:

Choose your calling convention to put args where you want them.

The limits of what's reasonable: anything that doesn't impose an unreasonable burden on the caller:

Return a boolean in FLAGS (condition codes)

Require narrow args (like a char) to be sign or zero extended to 32 or 64 bits.

Existing calling conventions:

8086 history: why these encodings exist

Use fastcall conventions

Subtract -128 instead of add 128

Create 3 zeroes with mul (then inc/dec to get +1 / -1 as well as zero)

CPU registers and flags are in known startup states

lea for math

The FLAGS are set after many instructions

mov small immediates into lower registers when applicable

Use push/pop for imm8 to zero upper bits

Use whatever calling conventions are convenient

Use do-while loops instead of while loops

Use conditional moves CMOVcc and sets SETcc

Save on jmp bytes by arranging into if/then rather than if/then/else

For codegolf and ASM: Use instructions use only registers, push pop, minimize register memory or memory immediate

Try AAM or AAD for byte division operations

Try XLAT for byte memory access

`mov`-immediate is expensive for constants

Initializing `eax` with `0`:

Initializing `eax` with `-1`:

Require narrow args (like a `char`) to be sign or zero extended to 32 or 64 bits.

Use `fastcall` conventions

Create 3 zeroes with `mul` (then `inc`/`dec` to get +1 / -1 as well as zero)

`lea` for math

`mov` small immediates into lower registers when applicable

Use `push`/`pop` for imm8 to zero upper bits

Use conditional moves `CMOVcc` and sets `SETcc`

Save on `jmp` bytes by arranging into if/then rather than if/then/else

Try `AAM` or `AAD` for byte division operations

Try `XLAT` for byte memory access