x86 32-bit machine code (32-bit integers): 17 bytes.
(also see other versions below, including 16 bytes for 32-bit or 64-bit, with a DF=1 calling convention.)
Caller passes args in registers, including a pointer to the end of an output buffer (like my C answer; see it for justification and explanation of the algorithm.) glibc's internal _itoa
does this, so it's not just contrived for code-golf. The arg-passing registers are close to x86-64 System V, except we have an arg in EAX instead of EDX.
On return, EDI points to the first byte of a 0-terminated C string in the output buffer. The usual return-value register is EAX/RAX, but in assembly language you can use whatever calling convention is convenient for a function. (xchg eax,edi
at the end would add 1 byte).
The caller can calculate an explicit length if it wants, from buffer_end - edi
. But I don't think we can justify omitting the terminator unless the function actually returns both start+end pointers or pointer+length. That would save 3 bytes in this version, but I don't think it's justifiable.
- EAX = n = number to decode. (For
idiv
. The other args aren't implicit operands.)
- EDI = end of output buffer (64-bit version still uses
dec edi
, so must be in the low 4GiB)
- ESI/RSI = lookup table, aka LUT. not clobbered.
- ECX = length of table = base. not clobbered.
nasm -felf32 ascii-compress-base.asm -l /dev/stdout | cut -b -30,$((30+10))-
(Hand edited to shrink comments, line numbering is weird.)
32-bit: 17 bytes ; 64-bit: 18 bytes
; same source assembles as 32 or 64-bit
3 %ifidn __OUTPUT_FORMAT__, elf32
5 %define rdi edi
6 address %define rsi esi
11 machine %endif
14 code %define DEF(funcname) funcname: global funcname
16 bytes
22 ;;; returns: pointer in RDI to the start of a 0-terminated string
24 ;;; clobbers:; EDX (tmp remainder)
25 DEF(ascii_compress_nostring)
27 00000000 C60700 mov BYTE [rdi], 0
28 .loop: ; do{
29 00000003 99 cdq ; 1 byte shorter than xor edx,edx / div
30 00000004 F7F9 idiv ecx ; edx=n%B eax=n/B
31
32 00000006 8A1416 mov dl, [rsi + rdx] ; dl = LUT[n%B]
33 00000009 4F dec edi ; --output ; 2B in x86-64
34 0000000A 8817 mov [rdi], dl ; *output = dl
35
36 0000000C 85C0 test eax,eax ; div/idiv don't write flags in practice, and the manual says they're undefined.
37 0000000E 75F3 jnz .loop ; }while(n);
38
39 00000010 C3 ret
0x11 bytes = 17
40 00000011 11 .size: db $ - .start
It's surprising that the simplest version with basically no speed/size tradeoffs is the smallest, but std
/cld
cost 2 bytes to use stosb
to go in descending order and still follow the common DF=0 calling convention. (And STOS decrements after storing, leaving the pointer pointing one byte too low on loop exit, costing us extra bytes to work around.)
Versions:
I came up with 4 significantly different implementation tricks (using simple mov
load/store (above), using lea
/movsb
(neat but not optimal), using xchg
/xlatb
/stosb
/xchg
, and one that enters the loop with an overlapping-instruction hack. See code below). The last needs a trailing 0
in the lookup table to copy as the output string terminator, so I'm counting that as +1 byte. Depending on 32/64-bit (1-byte inc
or not), and whether we can assume the caller sets DF=1 (stosb
descending) or whatever, different versions are (tied for) shortest.
DF=1 to store in descending order makes it a win to xchg / stosb / xchg, but the caller often won't want that; It feels like offloading work to the caller in a hard-to-justify way. (Unlike custom arg-passing and return-value registers, which typically don't cost an asm caller any extra work.) But in 64-bit code, cld
/ scasb
works as inc rdi
, avoiding truncating the output pointer to 32-bit, so it's sometimes inconvenient to preserve DF=1 in 64-bit-clean functions. . (Pointers to static code/data are 32-bit in x86-64 non-PIE executables on Linux, and always in the Linux x32 ABI, so an x86-64 version using 32-bit pointers is usable in some cases.) Anyway, this interaction makes it interesting to look at different combinations of requirements.
- IA32 with a DF=0 on entry/exit calling convention: 17B (
nostring
).
- IA32: 16B (with a DF=1 convention:
stosb_edx_arg
or skew
); or with incoming DF=dontcare, leaving it set: 16+1B stosb_decode_overlap
or 17B stosb_edx_arg
- x86-64 with 64-bit pointers, and a DF=0 on entry/exit calling convention: 17+1 bytes (
stosb_decode_overlap
), 18B (stosb_edx_arg
or skew
)
x86-64 with 64-bit pointers, other DF handling: 16B (DF=1 skew
), 17B (nostring
with DF=1, using scasb
instead of dec
). 18B (stosb_edx_arg
preserving DF=1 with 3-byte inc rdi
).
Or if we allow returning a pointer to 1 byte before the string, 15B (stosb_edx_arg
without the inc
at the end). All set to call again and expand another string into the buffer with different base / table... But that would make more sense if we didn't store a terminating 0
either, and you might put the function body inside a loop so that's really a separate problem.
x86-64 with 32-bit output pointer, DF=0 calling convention: no improvement over 64-bit output pointer, but 18B (nostring
) ties now.
- x86-64 with 32-bit output pointer: no improvement over the best 64-bit pointer versions, so 16B (DF=1
skew
). Or to set DF=1 and leave it, 17B for skew
with std
but not cld
. Or 17+1B for stosb_decode_overlap
with inc edi
at the end instead of cld
/scasb
.
With a DF=1 calling convention: 16 bytes (IA32 or x86-64)
Requires DF=1 on input, leaves it set. Barely plausible, at least on a per-function basis. Does the same thing as the above version, but with xchg to get the remainder in/out of AL before/after XLATB (table lookup with R/EBX as base) and STOSB (*output-- = al
).
With a normal DF=0 on entry/exit convention, the std
/ cld
/scasb
version is 18 bytes for 32 and 64-bit code, and is 64-bit clean (works with a 64-bit output pointer).
Note that the input args are in different registers, including RBX for the table (for xlatb
). Also note that this loop starts by storing AL, and ends with the last char not stored yet (hence the mov
at the end). So the loop is "skewed" relative to the others, hence the name.
;DF=1 version. Uncomment std/cld for DF=0
;32-bit and 64-bit: 16B
157 DEF(ascii_compress_skew)
158 ;;; inputs
159 ;; O in RDI = end of output buffer
160 ;; I in RBX = lookup table for xlatb
161 ;; n in EDX = number to decode
162 ;; B in ECX = length of table = modulus
163 ;;; returns: pointer in RDI to the start of a 0-terminated string
164 ;;; clobbers:; EDX=0, EAX=last char
165 .start:
166 ; std
167 00000060 31C0 xor eax,eax
168 .loop: ; do{
169 00000062 AA stosb
170 00000063 92 xchg eax, edx
171
172 00000064 99 cdq ; 1 byte shorter than xor edx,edx / div
173 00000065 F7F9 idiv ecx ; edx=n%B eax=n/B
174
175 00000067 92 xchg eax, edx ; eax=n%B edx=n/B
176 00000068 D7 xlatb ; al = byte [rbx + al]
177
178 00000069 85D2 test edx,edx
179 0000006B 75F5 jnz .loop ; }while(n = n/B);
180
181 0000006D 8807 mov [rdi], al ; stosb would move RDI away
182 ; cld
183 0000006F C3 ret
184 00000070 10 .size: db $ - .start
A similar non-skewed version overshoots EDI/RDI and then fixes it up.
; 32-bit DF=1: 16B 64-bit: 17B (or 18B for DF=0)
70 DEF(ascii_compress_stosb_edx_arg) ; x86-64 SysV arg passing, but returns in RDI
71 ;; O in RDI = end of output buffer
72 ;; I in RBX = lookup table for xlatb
73 ;; n in EDX = number to decode
74 ;; B in ECX = length of table
75 ;;; clobbers EAX,EDX, preserves DF
76 ; 32-bit mode: a DF=1 convention would save 2B (use inc edi instead of cld/scasb)
77 ; 32-bit mode: call-clobbered DF would save 1B (still need STD, but INC EDI saves 1)
79 .start:
80 00000040 31C0 xor eax,eax
81 ; std
82 00000042 AA stosb
83 .loop:
84 00000043 92 xchg eax, edx
85 00000044 99 cdq
86 00000045 F7F9 idiv ecx ; edx=n%B eax=n/B
87
88 00000047 92 xchg eax, edx ; eax=n%B edx=n/B
89 00000048 D7 xlatb ; al = byte [rbx + al]
90 00000049 AA stosb ; *output-- = al
91
92 0000004A 85D2 test edx,edx
93 0000004C 75F5 jnz .loop
94
95 0000004E 47 inc edi
96 ;; cld
97 ;; scasb ; rdi++
98 0000004F C3 ret
99 00000050 10 .size: db $ - .start
16 bytes for the 32-bit DF=1 version
I tried an alternate version of this with lea esi, [rbx+rdx]
/ movsb
as the inner loop body. (RSI is reset every iteration, but RDI decrements). But it can't use xor-zero / stos for the terminator, so it's 1 byte larger. (And it's not 64-bit clean for the lookup table without a REX prefix on the LEA.)
LUT with explicit length and a 0 terminator: 16+1 bytes (32-bit)
This version sets DF=1 and leaves it that way. I'm counting the extra LUT byte required as part of the total byte count.
The cool trick here is having the same bytes decode two different ways. We fall into the middle of the loop with remainder = base and quotient= input number, and copy the 0 terminator into place.
On the first time through the function, the first 3 bytes of the loop are consumed as the high bytes of a disp32 for an LEA. That LEA copies the base (modulus) to EDX, idiv
produces the remainder for later iterations.
The 2nd byte of idiv ebp
is FD
, which is the opcode for the std
instruction that this function needs to work. (This was a lucky discovery. I had been looking at this with div
earlier, which distinguishes itself from idiv
using the /r
bits in ModRM. The 2nd byte of div epb
decodes as cmc
, which is harmless but not helpful. But with idiv ebp
can we actually remove the std
from the top of the function.)
Note the input registers are difference once again: EBP for the base.
103 DEF(ascii_compress_stosb_decode_overlap)
104 ;;; inputs
105 ;; n in EAX = number to decode
106 ;; O in RDI = end of output buffer
107 ;; I in RBX = lookup table, 0-terminated. (first iter copies LUT[base] as output terminator)
108 ;; B in EBP = base = length of table
109 ;;; returns: pointer in RDI to the start of a 0-terminated string
110 ;;; clobbers: EDX (=0), EAX, DF
111 ;; Or a DF=1 convention allows idiv ecx (STC). Or we could put xchg after stos and not run IDIV's modRM
112 .start:
117 ;2nd byte of div ebx = repz. edx=repnz.
118 ; div ebp = cmc. ecx=int1 = icebp (hardware-debug trap)
119 ;2nd byte of idiv ebp = std = 0xfd. ecx=stc
125
126 ;lea edx, [dword 0 + ebp]
127 00000040 8D9500 db 0x8d, 0x95, 0 ; opcode, modrm, 0 for lea edx, [rbp+disp32]. low byte = 0 so DL = BPL+0 = base
128 ; skips xchg, cdq, and idiv.
129 ; decode starts with the 2nd byte of idiv ebp, which decodes as the STD we need
130 .loop:
131 00000043 92 xchg eax, edx
132 00000044 99 cdq
133 00000045 F7FD idiv ebp ; edx=n%B eax=n/B;
134 ;; on loop entry, 2nd byte of idiv ebp runs as STD. n in EAX, like after idiv. base in edx (fake remainder)
135
136 00000047 92 xchg eax, edx ; eax=n%B edx=n/B
137 00000048 D7 xlatb ; al = byte [rbx + al]
138 .do_stos:
139 00000049 AA stosb ; *output-- = al
140
141 0000004A 85D2 test edx,edx
142 0000004C 75F5 jnz .loop
143
144 %ifidn __OUTPUT_FORMAT__, elf32
145 0000004E 47 inc edi ; saves a byte in 32-bit. Makes DF call-clobbered instead of normal DF=0
146 %else
147 cld
148 scasb ; rdi++
149 %endif
150
151 0000004F C3 ret
152 00000050 10 .size: db $ - .start
153 00000051 01 db 1 ; +1 because we require an extra LUT byte
# 16+1 bytes for a 32-bit version.
# 17+1 bytes for a 64-bit version that ends with DF=0
This overlapping decode trick can also be used with cmp eax, imm32
: it only takes 1 byte to effectively jump forward 4 bytes, only clobbering flags. (This is terrible for performance on CPUs that mark instruction boundaries in L1i cache, BTW.)
But here, we're using 3 bytes to copy a register and jump into the loop. That would normally take 2+2 (mov + jmp), and would let us jump into the loop right before the STOS instead of before the XLATB. But then we'd need a separate STD, and it wouldn't be very interesting.
Try it online! (with a _start
caller that uses sys_write
on the result)
It's best for debugging to run it under strace
, or hexdump the output, so you can see verify that there's a \0
terminator in the right place and so on. But you can see this actually work, and produce AAAAAACHOO
for an input of
num equ 698911
table: db "CHAO"
%endif
tablen equ $ - table
db 0 ; "terminator" needed by ascii_compress_stosb_decode_overlap
(Actually xxAAAAAACHOO\0x\0\0...
, because we're dumping from 2 bytes earlier in the buffer out to a fixed length. So we can see that the function wrote the bytes it was supposed to and didn't step on any bytes it shouldn't have. The start-pointer passed to the function was the 2nd-last x
character, which was followed by zeros.)
Sandbox (deleted) – Jo King – 2018-06-12T11:39:39.893
3D'awwh, I feel honored. 05AB1E [tag:ascii-art] was my favorite awhile back. – Magic Octopus Urn – 2018-06-12T12:51:32.530
you can create a new challenge: find the permutation of characters in the array to minimize the number :) – mazzy – 2018-06-12T19:09:41.907