Fastest way to compute order of magnitude in x86 assembly

12

The task is simple: write assembly that computes the order of magnitude of an integer using as few clock cycles as possible.

  • Order of magnitude is defined as log10, not log2.
  • The range of valid input is 0 to 10^12, inclusive. Behaviour for input outside that range is undefined.
  • Values should be rounded down to the nearest integer, with the exception that given input 0 the output should be 0. (You can consider this to be the best representation of negative infinity that's possible in unsigned integers).
  • Must be x86 assembly.
  • The integer must be a runtime value, not a static/inline integer. So we don't know what it is at compile time.
  • Assume that you have an integer loaded into a register already. (But include setting the value in the register in the answer for clarity).
  • Can't call any external libraries or functions.
  • Free to use any of the available instructions in the Intel docs.
  • No C.
  • Any of the ~7 Intel Core architectures listed on page 10 of the instruction tables linked below are acceptable. Ideally Nehalem (Intel Core i7).

The winning answer is the one that uses the fewest clock cycles; that is, the one that can perform the most operations per second. Approximate clock cycle summaries are here: http://www.agner.org/optimize/instruction_tables.pdf. Calculating clock cycles can happen after the answer is posted.
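
To pin down the expected results, they match this naive C reference (not a valid submission, since answers must be assembly; the function name is just for illustration):

    #include <stdint.h>

    /* Reference only: floor(log10(n)), with the special case 0 -> 0. */
    unsigned order_of_magnitude(uint64_t n)
    {
        unsigned m = 0;
        while (n >= 10) {   /* one step per decimal digit beyond the first */
            n /= 10;
            m++;
        }
        return m;           /* 0 -> 0, 5 -> 0, 12 -> 1, 9999 -> 3, 10000 -> 4 */
    }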

Lance Pollard

Posted 2015-03-03T03:30:27.703

Reputation: 223

Are 'FYL2X' and other FPU instructions allowed? – Digital Trauma – 2015-03-03T03:54:54.900

@DigitalTrauma yes – Lance Pollard – 2015-03-03T04:06:00.443

Log10(0) is NaN. What should the output be for input 0? – Digital Trauma – 2015-03-03T04:06:53.323

@FryAmTheEggman The winner is the one with the fewest clock cycles. – Lance Pollard – 2015-03-03T04:07:11.247

@DigitalTrauma Magnitude for Log10(0) should be 0. – Lance Pollard – 2015-03-03T04:07:58.070

Should the result be an integer? If so, how should it be rounded? – Digital Trauma – 2015-03-03T05:26:14.837

@DigitalTrauma yes the result should be an integer. It should be as if you took a string and counted the number of characters (I don't know if that is a feasible solution or not). So order of magnitude for 1023 would be 3. For 9999, 3. For 10000, 4. For 5, 0. For 12, 1, etc. – Lance Pollard – 2015-03-03T05:30:24.527

So the input and output are both integers, yes? But the input can be as great as 10^12, so it's too big for a 32 bit int. So do we assume 64 bit integer input? – Paul R – 2015-03-03T07:15:06.350

Is the winning criterion based on the maximum or average number of cycles? And if it's average, what is the distribution of inputs? – Runer112 – 2015-03-03T14:15:28.960

Should 999 be magnitude 2? For all the other examples the magnitude is number of digits - 1. Please clarify. – Digital Trauma – 2015-03-03T14:44:43.803

Which processor is targeted? The linked document lists more than 20 different processors (AMD, Intel, others) and there is considerable variation in the latencies. – Digital Trauma – 2015-03-03T17:16:43.573

@DigitalTrauma Any of the ~7 Intel Core ones (listed on page 10). Ideally Nehalem (Intel Core i7). – Lance Pollard – 2015-03-04T00:05:07.833

Answers

8

7 cycles, constant time

Here's a solution based on my answer to this SO question. It uses BSR to count how many bits are needed to hold the number. A table then gives the order of magnitude of the largest number that many bits can hold, and the CMP/SBB pair subtracts 1 if the actual number is less than the corresponding power of 10.

    .intel_syntax noprefix
    .globl  main
main:
    mov rdi, 1000000000000              #;your value here
    bsr rax, rdi
    movzx   eax, BYTE PTR maxdigits[1+rax]
    cmp rdi, QWORD PTR powers[0+eax*8]
    sbb al, 0
    ret
maxdigits:
    .byte   0
    .byte   0
    .byte   0
    .byte   0
    .byte   1
    .byte   1
    .byte   1
    .byte   2
    .byte   2
    .byte   2
    .byte   3
    .byte   3
    .byte   3
    .byte   3
    .byte   4
    .byte   4
    .byte   4
    .byte   5
    .byte   5
    .byte   5
    .byte   6
    .byte   6
    .byte   6
    .byte   6
    .byte   7
    .byte   7
    .byte   7
    .byte   8
    .byte   8
    .byte   8
    .byte   9
    .byte   9
    .byte   9
    .byte   9
    .byte   10
    .byte   10
    .byte   10
    .byte   11
    .byte   11
    .byte   11
    .byte   12
powers:
    .quad   0
    .quad   10
    .quad   100
    .quad   1000
    .quad   10000
    .quad   100000
    .quad   1000000
    .quad   10000000
    .quad   100000000
    .quad   1000000000
    .quad   10000000000
    .quad   100000000000
    .quad   1000000000000

Compiles with GCC 4.6.3 on Ubuntu and returns the value in the exit code.
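
In C terms the trick boils down to roughly this (a sketch using GCC's __builtin_clzll() in place of bsr; the names are mine and this is not the source the assembly came from, but the tables are the same as above):

    #include <stdint.h>

    /* Sketch: bit length -> magnitude of the largest value with that many bits,
       then subtract 1 if the input is below the matching power of 10. */
    unsigned ilog10_sketch(uint64_t v)
    {
        static const uint8_t maxdigits[41] = {
            0,0,0,0,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,
            6,6,6,6,7,7,7,8,8,8,9,9,9,9,10,10,10,11,11,11,12 };
        static const uint64_t powers[13] = {
            0, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000,
            1000000000, 10000000000ULL, 100000000000ULL, 1000000000000ULL };
        unsigned bits  = 64 - __builtin_clzll(v | 1);  /* like bsr + 1; |1 keeps v==0 defined */
        unsigned guess = maxdigits[bits];
        return guess - (v < powers[guess]);            /* the CMP/SBB step */
    }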

I'm not confident interpreting that cycles table for any modern processor, but using @DigitalTrauma's method, on the Nehalem processor, I get 7?

Instruction    uOps  Latency
-----------    ----  -------
BSR r,r          1      3
MOVZX r,m        1      -
CMP m,r/i        1      1
SBB r,r/i        2      2

AShelly

Posted 2015-03-03T03:30:27.703

Reputation: 4 281

Oh - saw DigitalTrauma's after I started writing mine. Similar ideas. Using his counting methodology I think I get 3,1,1,1=6 for BSR,MOV,CMP,SBB – AShelly – 2015-03-03T20:16:28.613

Yep, I think yours beats mine. Just goes to show a) I'm not an assembly programmer and b) assembly is best left alone by us mere mortals ;-) – Digital Trauma – 2015-03-03T20:22:45.507

The integer must be a runtime value, not a static/inline integer. So we don't know what it is at compile time. – cat – 2016-06-28T01:30:33.270

Right, and the next line says: "Assume that you have an integer loaded into a register already. (But include setting the value in the register in the answer for clarity)." Which is what I have done. – AShelly – 2016-06-28T01:43:10.840

replace the movzx eax with mov al. The top 24 bits of eax will already be zero, so the zx is redundant (and it's expensive). – peter ferrie – 2017-11-24T17:16:31.990

BTW, clang will generate your code now; here's a 64-bit version:

    uint32_t ilog10(uint64_t v) {
        uint32_t lz = 63 ^ __builtin_clzll(v);
        constexpr char digits[64] = {
            0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
            5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9,
            9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 14, 14,
            14, 15, 15, 15, 15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19 };
        uint32_t guess = digits[lz];
        constexpr uint64_t powers[21] = { 1ULL, 10ULL, 100ULL, 1000ULL, etc };
        return guess - (v < powers[guess]);
    }

See https://godbolt.org/g/55Bd5N

– jorgbrown – 2018-02-13T21:38:57.503

@peterferrie: movzx eax, byte mem is cheaper than mov al, mem on Intel CPUs. mov al,mem is a micro-fused ALU+load instruction that merges into the old value (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent). movzx is a pure load that zero-extends in the load port. See Any way to move 2 bytes in 32-bit x86 using MOV without causing a mode switch or cpu stall? for a summary of movzx costs on recent CPUs.

– Peter Cordes – 2019-03-06T22:42:12.077

In this case EAX already has to be ready (because it's an input to the addressing mode) so the false dependency doesn't hurt. But the partial-register stall when the caller reads EAX on Nehalem or earlier does hurt. And the merge uop will un-laminate from the load on SnB/IvB because of the indexed addressing mode. – Peter Cordes – 2019-03-06T23:20:32.177

6

Best case 8 cycles, Worst case 12 cycles

Since it is not clear in the question, I am basing this off the Ivy Bridge latencies.

The approach here is to use the bsr (bit scan reverse) instruction as a poor man's log2(). The result is used as an index into a jump table which contains entries for bits 0 to 42. I am assuming that since operating on 64-bit data is implicitly required, use of the bsr instruction is OK.

For best-case inputs, the jumptable entry maps the bsr result directly to the magnitude. E.g. for inputs in the range 32-63, the bsr result is 5, which maps to a magnitude of 1. In this case, the instruction path is:

Instruction    Latency

bsrq                 3
jmp                  2
movl                 1
jmp                  2

total                8

For worst-case inputs, the bsr result maps to two possible magnitudes, so the jumptable entry does one additional cmp to check whether the input is >= 10^n. E.g. for inputs in the range 64-127, the bsr result is 6. The corresponding jumptable entry then checks whether the input is >= 100 and sets the output magnitude accordingly.

In addition, for the worst-case path we have an extra mov instruction to load a 64-bit immediate value for use in the cmp, so the worst-case instruction path is:

Instruction    Latency

bsrq                 3
jmp                  2
movabsq              1
cmpq                 1
ja                   2
movl                 1
jmp                  2

total               12

Here is the code:

    /* Input is loaded in %rdi */
    bsrq    %rdi, %rax
    jmp *jumptable(,%rax,8)
.m0:
    movl    $0, %ecx
    jmp .end
.m0_1:
    cmpq    $9, %rdi
    ja  .m1
    movl    $0, %ecx
    jmp .end
.m1:
    movl    $1, %ecx
    jmp .end
.m1_2:
    cmpq    $99, %rdi
    ja  .m2
    movl    $1, %ecx
    jmp .end
.m2:
    movl    $2, %ecx
    jmp .end
.m2_3:
    cmpq    $999, %rdi
    ja  .m3
    movl    $2, %ecx
    jmp .end
.m3:
    movl    $3, %ecx
    jmp .end
.m3_4:
    cmpq    $9999, %rdi
    ja  .m4
    movl    $3, %ecx
    jmp .end
.m4:
    movl    $4, %ecx
    jmp .end
.m4_5:
    cmpq    $99999, %rdi
    ja  .m5
    movl    $4, %ecx
    jmp .end
.m5:
    movl    $5, %ecx
    jmp .end
.m5_6:
    cmpq    $999999, %rdi
    ja  .m6
    movl    $5, %ecx
    jmp .end
.m6:
    movl    $6, %ecx
    jmp .end
.m6_7:
    cmpq    $9999999, %rdi
    ja  .m7
    movl    $6, %ecx
    jmp .end
.m7:
    movl    $7, %ecx
    jmp .end
.m7_8:
    cmpq    $99999999, %rdi
    ja  .m8
    movl    $7, %ecx
    jmp .end
.m8:
    movl    $8, %ecx
    jmp .end
.m8_9:
    cmpq    $999999999, %rdi
    ja  .m9
    movl    $8, %ecx
    jmp .end
.m9:
    movl    $9, %ecx
    jmp .end
.m9_10:
    movabsq $9999999999, %rax
    cmpq    %rax, %rdi
    ja  .m10
    movl    $9, %ecx
    jmp .end
.m10:
    movl    $10, %ecx
    jmp .end
.m10_11:
    movabsq $99999999999, %rax
    cmpq    %rax, %rdi
    ja  .m11
    movl    $10, %ecx
    jmp .end
.m11:
    movl    $11, %ecx
    jmp .end
.m11_12:
    movabsq $999999999999, %rax
    cmpq    %rax, %rdi
    ja  .m12
    movl    $11, %ecx
    jmp .end
.m12:
    movl    $12, %ecx
    jmp .end

jumptable:
    .quad   .m0
    .quad   .m0
    .quad   .m0
    .quad   .m0_1
    .quad   .m1
    .quad   .m1
    .quad   .m1_2
    .quad   .m2
    .quad   .m2
    .quad   .m2_3
    .quad   .m3
    .quad   .m3
    .quad   .m3
    .quad   .m3_4
    .quad   .m4
    .quad   .m4
    .quad   .m4_5
    .quad   .m5
    .quad   .m5
    .quad   .m5_6
    .quad   .m6
    .quad   .m6
    .quad   .m6
    .quad   .m6_7
    .quad   .m7
    .quad   .m7
    .quad   .m7_8
    .quad   .m8
    .quad   .m8
    .quad   .m8_9
    .quad   .m9
    .quad   .m9
    .quad   .m9
    .quad   .m9_10
    .quad   .m10
    .quad   .m10
    .quad   .m10_11
    .quad   .m11
    .quad   .m11
    .quad   .m11_12
    .quad   .m12
    .quad   .m12
    .quad   .m12

.end:
/* output is given in %ecx */

This was mostly generated from GCC assembler output for proof-of-concept C code I wrote. Note the C code uses a computed goto to implement the jump table. It also uses the __builtin_clzll() GCC builtin, which compiles to the bsr instruction (plus an xor).
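
For illustration, here is a truncated sketch of the shape that kind of C takes (not the exact proof-of-concept code; this version only covers inputs below 1024, and relies on the GCC computed-goto and __builtin_clzll() extensions):

    #include <stdint.h>

    unsigned magnitude_sketch(uint64_t v)           /* sketch: valid for 1 <= v <= 1023 */
    {
        /* One entry per possible bsr result; boundary bit counts (3, 6, 9)
           straddle a power of 10, so they dispatch to an extra compare. */
        static void *jumptable[10] = {
            &&m0, &&m0, &&m0, &&m0_1,               /* bits 0-3 */
            &&m1, &&m1, &&m1_2,                     /* bits 4-6 */
            &&m2, &&m2, &&m2_3                      /* bits 7-9 */
        };
        unsigned bit = 63 - __builtin_clzll(v);     /* what bsr computes */
        goto *jumptable[bit];
    m0:   return 0;
    m0_1: return v > 9 ? 1 : 0;                     /* 8..15 straddles 10 */
    m1:   return 1;
    m1_2: return v > 99 ? 2 : 1;                    /* 64..127 straddles 100 */
    m2:   return 2;
    m2_3: return v > 999 ? 3 : 2;                   /* 512..1023 straddles 1000 */
    }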


I considered several solutions before arriving at this one:

  • FYL2X to calculate the natural log, then FMUL by the necessary constant. This would presumably win if this were an instruction-golf contest. But FYL2X has a latency of 90-106 on Ivy Bridge.

  • Hard-coded binary search. This may actually be competitive - I'll leave it to someone else to implement :) (a rough C sketch of the idea follows this list).

  • Complete lookup table of results. This I'm sure is theoretically fastest, but would require a 1TB lookup table - not practical yet - maybe in a few years if Moore's Law continues to hold.
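
For reference, a rough C sketch of the hard-coded binary search idea (not tuned; at most 4 compares to pick among the 13 possible results for inputs up to 10^12):

    #include <stdint.h>

    unsigned magnitude_bsearch(uint64_t v)   /* sketch: floor(log10(v)), 0 -> 0 */
    {
        if (v < 1000000) {                            /* 0 .. 10^6-1 */
            if (v < 1000)
                return v < 10 ? 0 : (v < 100 ? 1 : 2);
            return v < 10000 ? 3 : (v < 100000 ? 4 : 5);
        }
        if (v < 1000000000)                           /* 10^6 .. 10^9-1 */
            return v < 10000000 ? 6 : (v < 100000000 ? 7 : 8);
        if (v < 100000000000ULL)                      /* 10^9 .. 10^11-1 */
            return v < 10000000000ULL ? 9 : 10;
        return v < 1000000000000ULL ? 11 : 12;        /* 10^11 .. 10^12 */
    }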

Digital Trauma

Posted 2015-03-03T03:30:27.703

Reputation: 64 644

If necessary I can calculate an average latency for all allowed input. – Digital Trauma – 2015-03-03T19:54:31.970

jmp and jcc don't have latency, only throughput costs. Branch-prediction + speculative execution mean control dependencies aren't part of data dependency chains. – Peter Cordes – 2019-03-06T22:37:36.033