i386 (x86-32) machine code, 8 bytes (9B for unsigned)
+1B if we need to handle b = 0
on input.
amd64 (x86-64) machine code, 9 bytes (10B for unsigned, or 14B 13B for 64b integers signed or unsigned)
10 9B for unsigned on amd64 that breaks with either input = 0
Inputs are 32bit non-zero signed integers in eax
and ecx
. Output in eax
.
## 32bit code, signed integers: eax, ecx
08048420 <gcd0>:
8048420: 99 cdq ; shorter than xor edx,edx
8048421: f7 f9 idiv ecx
8048423: 92 xchg edx,eax ; there's a one-byte encoding for xchg eax,r32. So this is shorter but slower than a mov
8048424: 91 xchg ecx,eax ; eax = divisor(from ecx), ecx = remainder(from edx), edx = quotient(from eax) which we discard
; loop entry point if we need to handle ecx = 0
8048425: 41 inc ecx ; saves 1B vs. test/jnz in 32bit mode
8048426: e2 f8 loop 8048420 <gcd0>
08048428 <gcd0_end>:
; 8B total
; result in eax: gcd(a,0) = a
This loop structure fails the test-case where ecx = 0
. (div
causes a #DE
hardware execption on divide by zero. (On Linux, the kernel delivers a SIGFPE
(floating point exception)). If the loop entry point was right before the inc
, we'd avoid the problem. The x86-64 version can handle it for free, see below.
Mike Shlanta's answer was the starting point for this. My loop does the same thing as his, but for signed integers because cdq
is one byter shorter than xor edx,edx
. And yes, it does work correctly with one or both inputs negative. Mike's version will run faster and take less space in the uop cache (xchg
is 3 uops on Intel CPUs, and loop
is really slow on most CPUs), but this version wins at machine-code size.
I didn't notice at first that the question required unsigned 32bit. Going back to xor edx,edx
instead of cdq
would cost one byte. div
is the same size as idiv
, and everything else can stay the same (xchg
for data movement and inc/loop
still work.)
Interestingly, for 64bit operand-size (rax
and rcx
), signed and unsigned versions are the same size. The signed version needs a REX prefix for cqo
(2B), but the unsigned version can still use 2B xor edx,edx
.
In 64bit code, inc ecx
is 2B: the single-byte inc r32
and dec r32
opcodes were repurposed as REX prefixes. inc/loop
doesn't save any code-size in 64bit mode, so you might as well test/jnz
. Operating on 64bit integers adds another one byte per instruction in REX prefixes, except for loop
or jnz
. It's possible for the remainder to have all zeros in the low 32b (e.g. gcd((2^32), (2^32 + 1))
), so we need to test the whole rcx and can't save a byte with test ecx,ecx
. However, the slower jrcxz
insn is only 2B, and we can put it at the top of the loop to handle ecx=0
on entry:
## 64bit code, unsigned 64 integers: rax, rcx
0000000000400630 <gcd_u64>:
400630: e3 0b jrcxz 40063d <gcd_u64_end> ; handles rcx=0 on input, and smaller than test rcx,rcx/jnz
400632: 31 d2 xor edx,edx ; same length as cqo
400634: 48 f7 f1 div rcx ; REX prefixes needed on three insns
400637: 48 92 xchg rdx,rax
400639: 48 91 xchg rcx,rax
40063b: eb f3 jmp 400630 <gcd_u64>
000000000040063d <gcd_u64_end>:
## 0xD = 13 bytes of code
## result in rax: gcd(a,0) = a
Full runnable test program including a main
that runs printf("...", gcd(atoi(argv[1]), atoi(argv[2])) );
source and asm output on the Godbolt Compiler Explorer, for the 32 and 64b versions. Tested and working for 32bit (-m32
), 64bit (-m64
), and the x32 ABI (-mx32
).
Also included: a version using repeated subtraction only, which is 9B for unsigned, even for x86-64 mode, and can take one of its inputs in an arbitrary register. However, it can't handle either input being 0 on entry (it detect when sub
produces a zero, which x - 0 never does).
GNU C inline asm source for the 32bit version (compile with gcc -m32 -masm=intel
)
int gcd(int a, int b) {
asm (// ".intel_syntax noprefix\n"
// "jmp .Lentry%=\n" // Uncomment to handle div-by-zero, by entering the loop in the middle. Better: `jecxz / jmp` loop structure like the 64b version
".p2align 4\n" // align to make size-counting easier
"gcd0: cdq\n\t" // sign extend eax into edx:eax. One byte shorter than xor edx,edx
" idiv ecx\n"
" xchg eax, edx\n" // there's a one-byte encoding for xchg eax,r32. So this is shorter but slower than a mov
" xchg eax, ecx\n" // eax = divisor(ecx), ecx = remainder(edx), edx = garbage that we will clear later
".Lentry%=:\n"
" inc ecx\n" // saves 1B vs. test/jnz in 32bit mode, none in 64b mode
" loop gcd0\n"
"gcd0_end:\n"
: /* outputs */ "+a" (a), "+c"(b)
: /* inputs */ // given as read-write outputs
: /* clobbers */ "edx"
);
return a;
}
Normally I'd write a whole function in asm, but GNU C inline asm seems to be the best way to include a snippet which can have in/outputs in whatever regs we choose. As you can see, GNU C inline asm syntax makes asm ugly and noisy. It's also a really difficult way to learn asm.
It would actually compile and work in .att_syntax noprefix
mode, because all the insns used are either single/no operand or xchg
. Not really a useful observation.
1If we're allowing asm to have inputs in whatever registers are convenient, and the result in whatever reg is convenient, we should definitely be allowing functions, or even code fragments (i.e. just a function body). Making my answer a complete function would add about 4B with a register calling convention like MS's 32bit vectorcall (one xchg eax, one mov, and a ret), or more with a stack calling convention. – Peter Cordes – 2016-04-08T23:05:59.480
@PeterCordes Sorry, I should have been more specific. You can totally just write the bear necessary code but if you would be so kind as to include a way to run said code it would be nice. – Mike Shlanta – 2016-04-09T18:59:08.530
So count just the gcd code, but provide the surrounding code so people can verify / experiment / improve? BTW, your test-cases with zero as one of the two inputs break our x86 machine code answers. div by zero raises a hardware exception. On Linux, your process gets a
SIGFPE
. – Peter Cordes – 2016-04-09T19:30:59.903@Martin Doesn't your edit make it impossible to implement this in languages where the default integer type is immutable and only limited by the available memory? – CodesInChaos – 2016-04-10T12:24:16.317
3@CodesInChaos Memory and time limitations are usually ignored as long as the algorithm itself can in principle handle all inputs. The rule is just meant to avoid people hardcoding arbitrary limits for loops that artificially limits the algorithm to a smaller range of inputs. I don't quite see how immutability comes into this? – Martin Ender – 2016-04-10T14:10:35.350
1gcd(0,n) is error not n – RosLuP – 2019-03-18T12:33:49.453