GNU C, POSIX C, or MSVC: usually 2 instructions with low latency (see below for perf)
You can do the same thing even more portably in Rust, where foo.trailing_zeros()
is a built-in method on all the integer primitive types.
Performance varies a lot with hardware, but this inlines to whatever the hardware is capable of. This works on any architecture that has a GNU or POSIX compiler.
Most CPU architectures can count leading zeros or trailing zeros in one instruction. (See Wikipedia's Find First Set article). We can use either, since there's only one set bit. Knowing that our input is always non-zero also simplifies things. (Especially for x86 where BSF and BSR leave their output undefined when the input is zero).
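A quick sanity check of that interchangeability using the GNU builtins (a sketch assuming GCC or Clang; the function name same_position_both_ways is mine):

```c
#include <stdint.h>

// For v with exactly one bit set at index i (0-based from the LSB):
// trailing-zero count = i, and leading-zero count = 31 - i, so either
// direction recovers the same bit position.
static inline int same_position_both_ways(uint32_t v) {
    int from_lsb = __builtin_ctz(v);       // count trailing zeros
    int from_msb = 31 - __builtin_clz(v);  // count leading zeros, then convert
    return (from_lsb == from_msb) ? from_lsb : -1;
}
```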
The trick is to write something that will compile to that for the target architecture. Since pure C unfortunately doesn't provide a portable way to express this, we use macros to detect language extensions.
#define _GNU_SOURCE // request that string.h define ffsl
#include <stdint.h>
#include <stdlib.h>
#ifdef __unix__
// https://stackoverflow.com/questions/11350878/how-can-i-determine-if-the-operating-system-is-posix-in-c
#include <unistd.h> // for _POSIX_VERSION
#endif
#ifdef _POSIX_VERSION
#include <strings.h>
#include <string.h> // for GNU-extension ffsl in case we want that
int bit_position_posix(uint32_t v) {
#ifdef __GNUC__
    // tell the compiler about the assumption that v != 0.
    // Unfortunately doesn't help gcc or clang make better ffs() code :( And ICC makes worse code
    if (v == 0)
        __builtin_unreachable();
#endif
    return ffs(v);
}
#endif
#ifdef __GNUC__
int bit_position_gnu(uint32_t v) {
    // v is guaranteed to be non-zero, so we can use a builtin that's undefined in that case.
    // We can count from the front or back, since there's only one bit set,
    // but the compiler doesn't know this, so we use macros to select a method
    // that doesn't e.g. require an RBIT instruction on ARM.
    // TODO: tweak this for more architectures
#if (defined(__i386__) || defined(__x86_64__))
    return __builtin_ctz(v) + 1;
#else
    return 32 - __builtin_clz(v);
#endif
}
#endif
#ifdef _MSC_VER
// https://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx
#include <intrin.h>
int bit_position_msvc(uint32_t v) {
    unsigned long idx;
    (void)_BitScanForward(&idx, v); // found/not-found return value ignored: v is known non-zero
    return idx + 1;
}
#endif
static inline
int bit_position(uint32_t v)
{
#ifdef __GNUC__
    return bit_position_gnu(v);
#elif defined(_MSC_VER)
    return bit_position_msvc(v);
#elif defined(_POSIX_VERSION)
    return bit_position_posix(v);
#else
#error no good bitscan detected for this platform
#endif
}
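A minimal usage sketch of the GNU path, duplicated here under a hypothetical name (bit_position_demo) so the block stands alone; it assumes a GCC/Clang-style compiler:

```c
#include <stdint.h>

// Standalone copy of the GNU path: v must have exactly one bit set.
static inline int bit_position_demo(uint32_t v) {
#if (defined(__i386__) || defined(__x86_64__))
    return __builtin_ctz(v) + 1;   // bit-index from the LSB, plus 1
#else
    return 32 - __builtin_clz(v);  // same result, counting from the MSB
#endif
}
```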
Sample asm output from the Godbolt compiler explorer:

# gcc6.2 for x86-64 -O3
bit_position_gnu:
    xor     eax, eax    # break BSF/TZCNT's false dependency on the destination
    rep bsf eax, edi    # 3 cycle latency; runs as plain BSF on CPUs that don't support TZCNT
    add     eax, 1      # 1 cycle latency
    ret
MSVC: same thing but without the xor-zeroing (yes, beta Godbolt has cl.exe installed):

bit_position_msvc PROC
    bsf     eax, ecx
    inc     eax
    ret     0
bit_position_msvc ENDP
Using count-leading instead of count-trailing zeros shouldn't hurt the code on x86, but it does: gcc doesn't fold the 32 - clz() part into the 31 - bsr() that it does as part of implementing the builtin (since BSR produces the bit-index of the high bit, rather than counting leading zeros). See comments on this question and my answer there for more discussion about using reg_width - __builtin_clzl vs. using __builtin_ctz on a power-of-2 integer.
Perf analysis: (see Agner Fog's instruction tables and microarch pdf, and the SO x86 tag wiki).
ADD is always 1c latency on all x86 CPUs worth mentioning. On Intel Core2, BSF is 2c latency with one per clock throughput. On Nehalem and later (up to Skylake at least), it's 3 cycle latency.
BSF is slow on Atom (16c) and Silvermont (10c), and on P4 Prescott (16c). Some of the pure-C sequences may be faster on those CPUs, since gcc still uses BSF even with -march=atom (which implies -mtune=atom).
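For reference, one branch-free pure-C sequence of that sort is the classic de Bruijn multiply from the Bit Twiddling Hacks page; it's exact for power-of-2 inputs, which is all we need here (the function name bit_position_debruijn is mine):

```c
#include <stdint.h>

// Classic de Bruijn lookup: for v a power of two, (v * 0x077CB531) >> 27
// maps each power of two to a unique 5-bit index; the table entry at that
// index is the set bit's 0-based position.
static const int debruijn_tab[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

static inline int bit_position_debruijn(uint32_t v) {
    return debruijn_tab[(v * UINT32_C(0x077CB531)) >> 27] + 1; // 1-based, like ffs()
}
```

No hardware bit-scan needed, so this is a reasonable fallback on CPUs where BSF is microcoded and slow.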
BSF is fairly slow on all AMD CPUs, but TZCNT is fast (2c latency). This is why gcc uses rep bsf (which decodes as TZCNT on CPUs that support it, but as plain BSF on CPUs that don't, since prefixes that don't apply are ignored). So on AMD Piledriver and later, this sequence has 3c total latency. On AMD Bulldozer (which only has LZCNT, not TZCNT), we should have used 32 - lzcnt(v) instead (but BSR and LZCNT produce different results even for non-zero inputs, so that binary would only be usable on CPUs that support LZCNT).
This is not limited to x86, either:
bit_position_gnu:   # ARM gcc5.4 -O3 -mcpu=cortex-a53
    clz     r0, r0
    rsb     r0, r0, #32
    bx      lr

bit_position_gnu:   # ARM64 gcc5.4 -O3
    mov     w1, 32
    clz     w0, w0
    sub     w0, w1, w0
    ret
Performance: CLZ has single-cycle latency on ARM9E-S (the first google hit for clz cycles), and AFAICT it's single-cycle on most CPUs that support it (ARMv5T and later), so this whole sequence is about 2 cycles of latency.
Reverse-subtract with an immediate, and a regular sub with a register, are both almost certainly single-cycle latency. (The mov of the constant in the AArch64 version is off the critical path, but on in-order CPUs it's still not entirely free.)
When an architecture doesn't have a bit-scan instruction that gcc knows how to use, it emits a call to a library function, e.g.:

# gcc5.4 for MIPS (little-endian) with -O3. Maybe there's an option to enable using clz if it's not baseline?
bit_position_gnu:
    lui     $28,%hi(__gnu_local_gp)
    addiu   $sp,$sp,-32
    addiu   $28,$28,%lo(__gnu_local_gp)
    sw      $31,28($sp)
    lw      $25,%call16(__clzsi2)($28)   ## function address of __clzsi2
    jalr    $25
    nop
    li      $3,32                        # 0x20
    lw      $31,28($sp)
    addiu   $sp,$sp,32
    j       $31
    subu    $2,$3,$2                     # 32 - __clzsi2(v), in the branch-delay slot
I don't have any compilers that support POSIX but not GNU C, so I could only see how the ffs() version compiles by making it non-inline. It basically sucks because it has to account for the input = 0 case, even though we know that can't happen here.
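That zero case is baked into the interface: POSIX defines ffs() to return 0 for a zero input, so the generated code always pays for a check we never need. A small wrapper to illustrate (ffs_position is my name; assumes a POSIX system):

```c
#include <strings.h>

// ffs() returns the 1-based index of the least-significant set bit,
// or 0 when the input is 0 -- the extra case our single-bit inputs
// never hit, but which the library code must still handle.
static int ffs_position(int v) {
    return ffs(v);
}
```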
Related: an SO C++ question, but that doesn't have the simplifying factor that only a single bit is set. – Peter Cordes – 2016-12-18T08:12:43.863

Is it expected that the number fits into a numeric format? Or should we process a decimal string? – Qwertiy – 2016-12-18T08:24:52.943

This is tagged [tag:fastest-code] but it's not at all clear how the speed of answers is going to be measured. – Martin Ender – 2016-12-18T11:33:33.340

Seems that almost all answers assume this is [tag:code-golf]. – Adám – 2016-12-18T11:48:47.357
@Adám: Look at how old the question is. [fastest-code] was probably really new then, and probably people missed the tag. – Peter Cordes – 2016-12-18T21:43:59.407
@MartinEnder: I think latency in clock cycles is a pretty obvious metric here, and probably the most likely to be relevant. Throughput would be the other main possibility. The big problem here is that there can't be an overall winner, since no single implementation strategy is optimal across all architectures. Even within x86, there are big differences across microarchitectures (e.g. AMD Bulldozer vs. Intel Haswell vs. Prescott). My micro-optimization answer would fit a lot better on SO than here on ppcg, so that's probably a sign that this question is off-topic for this site. – Peter Cordes – 2016-12-18T21:51:50.397
@PeterCordes Normally (these days, anyway), [tag:fastest-code] challenges are scored by being measured on a single machine (the OP's, usually), or they are held on a very specific architecture to ensure comparability. Alternatively, the OP could provide a benchmark script to assess the participant's machine's speed, which can be used to scale a timing they obtain themselves, but that requires a certain level of trust (and doesn't account for architecture-specific optimisations). – Martin Ender – 2016-12-18T22:14:52.047

@MartinEnder: Scaling timing results according to some other benchmark is not a good enough way to normalize for most uses. Different memory / cache / CPU speed ratios are another error source. I actually answered an SO question about that fairly recently. Thanks for the GOLF architecture link. Interesting. I suggested using Knuth's MMIX (for which compilers and cycle-accurate simulators already exist) in my SO answer about creating a platform for speed comparisons.
– Peter Cordes – 2016-12-18T22:26:55.703