Best case 8 cycles, Worst case 12 cycles
Since it is not clear in the question, I am basing this off the Ivy Bridge latencies.
The approach here is to use the bsr
(bit scan reverse) instruction as a poor-man's log2(). The result is used as an index into a jump table which contains entries for bits 0 to 42. I am assuming that given that operation on 64bit data is implicitly required, then use of the bsr
instruction is OK.
In the best case inputs, the jumptable entry can map the bsr
result directly to the magnitude. e.g. For inputs in the range 32-63, the bsr
result will be 5, which is mapped to a magnitude of 1. In this case, the instruction path is:
Instruction Latency
bsrq 3
jmp 2
movl 1
jmp 2
total 8
In the worst case inputs, the bsr
result will map to two possible magnitudes, so the jumptable entry does one additional cmp
to check if the input is > 10n. E.g. for inputs in the range 64-127, the bsr
result will be 6. The corresponding jumptable entry then checks if the input > 100 and sets the output magnitude accordingly.
In addition for the worst case path, we have an additional mov instruction to load a 64bit immediate value for use in the cmp
, so the worst case instruction path is:
Instruction Latency
bsrq 3
jmp 2
movabsq 1
cmpq 1
ja 2
movl 1
jmp 2
total 12
Here is the code:
/* Input is loaded in %rdi */
bsrq %rdi, %rax
jmp *jumptable(,%rax,8)
.m0:
movl $0, %ecx
jmp .end
.m0_1:
cmpq $9, %rdi
ja .m1
movl $0, %ecx
jmp .end
.m1:
movl $1, %ecx
jmp .end
.m1_2:
cmpq $99, %rdi
ja .m2
movl $1, %ecx
jmp .end
.m2:
movl $2, %ecx
jmp .end
.m2_3:
cmpq $999, %rdi
ja .m3
movl $2, %ecx
jmp .end
.m3:
movl $3, %ecx
jmp .end
.m3_4:
cmpq $9999, %rdi
ja .m4
movl $3, %ecx
jmp .end
.m4:
movl $4, %ecx
jmp .end
.m4_5:
cmpq $99999, %rdi
ja .m5
movl $4, %ecx
jmp .end
.m5:
movl $5, %ecx
jmp .end
.m5_6:
cmpq $999999, %rdi
ja .m6
movl $5, %ecx
jmp .end
.m6:
movl $6, %ecx
jmp .end
.m6_7:
cmpq $9999999, %rdi
ja .m7
movl $6, %ecx
jmp .end
.m7:
movl $7, %ecx
jmp .end
.m7_8:
cmpq $99999999, %rdi
ja .m8
movl $7, %ecx
jmp .end
.m8:
movl $8, %ecx
jmp .end
.m8_9:
cmpq $999999999, %rdi
ja .m9
movl $8, %ecx
jmp .end
.m9:
movl $9, %ecx
jmp .end
.m9_10:
movabsq $9999999999, %rax
cmpq %rax, %rdi
ja .m10
movl $9, %ecx
jmp .end
.m10:
movl $10, %ecx
jmp .end
.m10_11:
movabsq $99999999999, %rax
cmpq %rax, %rdi
ja .m11
movl $10, %ecx
jmp .end
.m11:
movl $11, %ecx
jmp .end
.m11_12:
movabsq $999999999999, %rax
cmpq %rax, %rdi
ja .m12
movl $11, %ecx
jmp .end
.m12:
movl $12, %ecx
jmp .end
jumptable:
.quad .m0
.quad .m0
.quad .m0
.quad .m0_1
.quad .m1
.quad .m1
.quad .m1_2
.quad .m2
.quad .m2
.quad .m2_3
.quad .m3
.quad .m3
.quad .m3
.quad .m3_4
.quad .m4
.quad .m4
.quad .m4_5
.quad .m5
.quad .m5
.quad .m5_6
.quad .m6
.quad .m6
.quad .m6
.quad .m6_7
.quad .m7
.quad .m7
.quad .m7_8
.quad .m8
.quad .m8
.quad .m8_9
.quad .m9
.quad .m9
.quad .m9
.quad .m9_10
.quad .m10
.quad .m10
.quad .m10_11
.quad .m11
.quad .m11
.quad .m11_12
.quad .m12
.quad .m12
.quad .m12
.end:
/* output is given in %ecx */
This was mostly generated from gcc assembler output for proof-of-concept C code I wrote. Note the C code uses a computable goto to implement the jump table. It also uses the __builtin_clzll()
gcc builtin, which compiles to the bsr
instruction (plus an xor
).
I considered several solutions before arriving at this one:
FYL2X
to calculate the natural log, then FMUL
by the necessary constant. This would presumably win this if it was an [tag:instruction:golf] contest. But FYL2X
has as latency of 90-106 for Ivy bridge.
Hard-coded binary search. This may actually be competitive - I'll leave it to someone else to implement :).
Complete lookup table of results. This I'm sure is theoretically fastest, but would require a 1TB lookup table - not practical yet - maybe in a few years if Moore's Law continues to hold.
Are 'FYL2X' and other FPU instructions allowed? – Digital Trauma – 2015-03-03T03:54:54.900
@DigitalTrauma yes – Lance Pollard – 2015-03-03T04:06:00.443
Log10(0) is NaN. What should the output be for input 0? – Digital Trauma – 2015-03-03T04:06:53.323
@FryAmTheEggman The winner is the one with the fewest clock cycles. – Lance Pollard – 2015-03-03T04:07:11.247
@DigitalTrauma Magnitude for Log10(0) should be
0
. – Lance Pollard – 2015-03-03T04:07:58.0701Should the result be an integer? If so, how should it be rounded? – Digital Trauma – 2015-03-03T05:26:14.837
@DigitalTrauma yes the result should be an integer. It should be as if you took a string and counted the number of characters (I don't know if that is a feasible solution or not). So order of magnitude for
1023
would be3
. For9999
,3
. For10000
,4
. For5
,0
. For12
,1
, etc. – Lance Pollard – 2015-03-03T05:30:24.5273So the input and output are both integers, yes ? But the input can be as great as 10^12, so it's too big for a 32 bit int. So do we assume 64 bit integer input ? – Paul R – 2015-03-03T07:15:06.350
3Is the winning criterion based on the maximum or average number of cycles? And if it's average, what is the distribution of inputs? – Runer112 – 2015-03-03T14:15:28.960
Should 999 be magnitude 2? For all the other examples the magnitude is number of digits - 1. Please clarify. – Digital Trauma – 2015-03-03T14:44:43.803
2Which processor is targeted? The linked document lists more than 20 different processes (AMD, Intel, others) and there is considerable variation in the latencies. – Digital Trauma – 2015-03-03T17:16:43.573
@DigitalTrauma Any of the ~7
Intel Core
ones (listed on page 10). IdeallyNehalem
(Intel Core i7). – Lance Pollard – 2015-03-04T00:05:07.833