position of the only 1 in a number in binary format

2

Every number of the form 2^i (where i is a non-negative integer) has a single 1 in its binary representation. Given such a number, output the 1-based position of that 1.

Examples:

DECIMAL   BINARY      POSITION
------------------------------
1         0001        1
2         0010        2
4         0100        3
8         1000        4

N         0...1...0   ?

mohammad shamsi

Posted 2011-12-04T16:05:35.923

Reputation: 131

Question was closed 2016-12-20T04:10:44.093

Related: an SO C++ question, but that doesn't have the simplifying factor that only a single bit is set.

– Peter Cordes – 2016-12-18T08:12:43.863

Is it expected that number fits into numeric format? Or we should process decimal string? – Qwertiy – 2016-12-18T08:24:52.943

This is tagged [tag:fastest-code] but it's not at all clear how the speed of answers is going to be measured. – Martin Ender – 2016-12-18T11:33:33.340

Seems that almost all answers assume this is [tag:code-golf]. – Adám – 2016-12-18T11:48:47.357

@Adám: Look at how old the question is. [fastest-code] was probably really new then, and probably people missed the tag. – Peter Cordes – 2016-12-18T21:43:59.407

@MartinEnder: I think latency in clock cycles is a pretty obvious metric here, and probably the most likely to be relevant. Throughput would be the other main possibility. The big problem here is that there can't be an overall winner since no single implementation strategy is optimal across all architectures. Even within x86, there are big differences across microarchitectures (e.g. AMD Bulldozer vs. Intel Haswell vs. Prescott). My micro-optimization answer would fit a lot better on SO than here on ppcg, so that's probably a sign that this question is off-topic for this site. – Peter Cordes – 2016-12-18T21:51:50.397

6

@PeterCordes Normally (these days, anyway), [tag:fastest-code] challenges are scored by being measured on a single machine (the OP's usually) or they are held on a very specific architecture to ensure comparability. Alternatively, the OP could provide a benchmark script to assess the participant's machine's speed which can be used to scale a timing they obtain themselves, but that requires a certain level of trust (and doesn't account for architecture-specific optimisations).

– Martin Ender – 2016-12-18T22:14:52.047

@MartinEnder: Scaling timing results according to some other benchmark is not a good enough way to normalize for most uses. Different memory / cache / CPU speed ratios are another error source. I actually answered an SO question about that fairly recently. Thanks for the GOLF architecture link. Interesting. I suggested using Knuth's MMIX (for which compilers and cycle-accurate simulators already exist) in my SO answer about creating a platform for speed comparisons.

– Peter Cordes – 2016-12-18T22:26:55.703

Answers

15

Stolen wholesale from Bit twiddling hacks:

uint8_t f(uint32_t v)
{
    static const uint8_t MultiplyDeBruijnBitPosition2[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    return MultiplyDeBruijnBitPosition2[(uint32_t)(v * 0x077CB531U) >> 27];
}

Hasturkun

Posted 2011-12-04T16:05:35.923

Reputation: 1 206

@IlmariKaronen I'll help you. – Lucas – 2012-09-22T23:17:56.777

+1 Excellent. Given that this is a [tag:fastest-code] challenge, I'd give this a +2 if I could. – Ilmari Karonen – 2011-12-04T19:06:22.933

6

R, 14 or 22 characters

p=function(x)log2(x)+1

It remains fully vectorized, so you can apply it not just to a single number but to any vector of numbers:

> n <- 2^seq(0,10)
> n
 [1]    1    2    4    8   16   32   64  128  256  512 1024
> p(n)
 [1]    1    2    3    4    5    6    7    8    9   10   11

Doesn't seem terribly slow either:

Unit: microseconds
   expr   min    lq median    uq    max
1 p(16) 4.749 4.749   4.75 5.029 33.524

As an expression, it can be cut down to 14 characters: log2(scan())+1

> log2(scan())+1
1: 1
2: 2
3: 4
4: 8
5: 16
6: 
Read 5 items
[1] 1 2 3 4 5

Ari B. Friedman

Posted 2011-12-04T16:05:35.923

Reputation: 1 013

1

GNU C, POSIX C, or MSVC: usually 2 instructions with low latency (see below for perf)

You can do the same thing even more portably with Rust, where foo.trailing_zeros() is a language built-in for all integer primitive types.

Performance varies a lot with hardware, but this inlines to whatever the hardware is capable of. This works on any architecture that has a GNU or POSIX compiler.

Most CPU architectures can count leading zeros or trailing zeros in one instruction. (See Wikipedia's Find First Set article). We can use either, since there's only one set bit. Knowing that our input is always non-zero also simplifies things. (Especially for x86 where BSF and BSR leave their output undefined when the input is zero).

The trick is to write something that will compile to that for the target architecture. Since pure C unfortunately doesn't provide a portable way to express this, we use macros to detect language extensions.

#define _GNU_SOURCE  // request that string.h define ffsl
#include <stdint.h>
#include <stdlib.h>

#ifdef __unix__
// https://stackoverflow.com/questions/11350878/how-can-i-determine-if-the-operating-system-is-posix-in-c
#include <unistd.h>  // for _POSIX_VERSION
#endif

#ifdef _POSIX_VERSION
#include <strings.h>
#include <string.h>  // for GNU-extension ffsl in case we want that
int bit_position_posix(uint32_t v) {
#ifdef __GNUC__
  // tell the compiler about the assumption that v!=0
  // unfortunately doesn't help gcc or clang make better ffs() code :(  And ICC makes worse code
  if (v == 0)
    __builtin_unreachable();
#endif

  return ffs(v);
}
#endif

#ifdef __GNUC__
int bit_position_gnu(uint32_t v) {
  // v is guaranteed to be non-zero, so we can use a function that's undefined in that case

  // we can count from the front or back, since there's only one bit set
  // but the compiler doesn't know this so we use macros to select a method that doesn't e.g. require an RBIT instruction on ARM

  // TODO: tweak this for more architectures
#if (defined(__i386__) || defined(__x86_64__))
  return __builtin_ctz(v) + 1;
#else
  return 32 - __builtin_clz(v);   // 32 - clz is already the 1-based position
#endif
}
#endif

#ifdef _MSC_VER
// https://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx
#include <intrin.h>
int bit_position_msvc(uint32_t v) {
  unsigned long idx;
  int nonzero = _BitScanForward(&idx, v);
  return idx + 1;
}
#endif

static inline
  int bit_position(uint32_t v)
{
#ifdef __GNUC__
  return bit_position_gnu(v);
#elif defined(_MSC_VER)
  return bit_position_msvc(v);
#elif defined(_POSIX_VERSION)
  return bit_position_posix(v);
#else
  #error no good bitscan detected for this platform
#endif
}

sample asm output from the Godbolt compiler explorer:

# gcc6.2 for x86-64 -O3
bit_position_gnu:
    xor     eax, eax    # break BSF/TZCNT's false dependency on destination
    rep bsf eax, edi    # 3 cycle latency   # runs as BSF on CPUs that don't support TZCNT
    add     eax, 1      # 1 cycle latency
    ret

#MSVC: same thing but without the xor-zeroing
# yes, beta Godbolt has cl.exe installed
bit_position_msvc PROC
    bsf      eax, ecx
    inc      eax
    ret      0
bit_position_msvc ENDP

Using count-leading vs. count-trailing zeros shouldn't hurt the code on x86, but it does. With count-leading, gcc doesn't fold the 32 - ... part into the bsr-based sequence it emits to implement the builtin (since BSR produces the bit-index of the high bit, rather than counting leading zeros). See comments on this question and my answer there for more discussion about using reg_width - __builtin_clzl vs. using __builtin_ctz on a power-of-2 integer.


Perf analysis: (see Agner Fog's instruction tables and microarch pdf, and the SO x86 tag wiki).

ADD is always 1c latency on all x86 CPUs worth mentioning. On Intel Core2, BSF is 2c latency with one per clock throughput. On Nehalem and later (up to Skylake at least), it's 3 cycle latency.

BSF is slow on Atom (16c) and Silvermont (10c), and on P4 Prescott (16c). Some of the pure-C sequences may be faster on those CPUs, since gcc still uses BSF even with -march=atom (which implies -mtune=atom).

BSF is fairly slow on all AMD CPUs, but TZCNT is fast (2c latency). This is why gcc uses rep bsf (which will decode as TZCNT on CPUs which support it, but as just plain BSF on CPUs which don't, ignoring prefixes which don't apply). So on AMD Piledriver and later, this sequence has 3c total latency. But on AMD Bulldozer (which only has LZCNT, not TZCNT), we should have used 32 - lzcnt(v) + 1 (but BSR vs. LZCNT produce different results even for non-zero inputs, so that binary would only be usable on CPUs that support LZCNT).

This is not limited to x86, either:

bit_position_gnu:  # ARM gcc5.4 -O3 -mcpu=cortex-a53
    clz     r0, r0
    rsb     r0, r0, #32
    bx      lr


bit_position_gnu:  # ARM64 gcc5.4 -O3
    mov     w1, 32
    clz     w0, w0
    sub     w0, w1, w0
    ret

Performance: the clz/rsb pair is listed at 2-cycle total latency on ARM9E-S (the first Google hit for clz cycle counts), so AFAICT clz itself has single-cycle latency. This implies CLZ is always single-cycle on CPUs that support it (ARMv5T and later).

Reverse-subtract with an immediate, and regular sub with a register, are both almost certainly single-cycle latency. (The mov of the constant in AArch64 is off the critical path, but on in-order CPUs it's further from being free.)


When an architecture doesn't have a bit-scan instruction that gcc knows how to use, it emits a call to a library function, e.g.:

# gcc5.4 for MIPS el with -O3.  Maybe there's an option to enable using clz if it's not baseline?
bit_position_gnu:
    lui     $28,%hi(__gnu_local_gp)
    addiu   $sp,$sp,-32
    addiu   $28,$28,%lo(__gnu_local_gp)
    sw      $31,28($sp)

    lw      $25,%call16(__clzsi2)($28)   ## function address of __clzsi2
    jalr        $25
    nop

    li      $3,32                 # 0x20
    lw      $31,28($sp)
    addiu   $sp,$sp,32
    j       $31
    subu    $2,$3,$2

I don't have any compilers that support POSIX but not GNU, so I was only able to see if / how the ffs() version compiled by making it non-inline. It basically sucks because it accounts for the input=0 case even though we know that can't happen.


Related: an SO C++ question, but that doesn't have the simplifying factor that only a single bit is set.

Peter Cordes

Posted 2011-12-04T16:05:35.923

Reputation: 2 810

1

J - 9 characters

>:@(2&^.)

(2&^. alone gives the 0-based log, so the increment >: is needed for the question's 1-based positions.)

Test:

i=: 2^ i. 10
>:@(2&^.) i
1 2 3 4 5 6 7 8 9 10

jpjacobs

Posted 2011-12-04T16:05:35.923

Reputation: 3 440

0

JavaScript (ES6), 17 bytes

a=>Math.log2(a)+1

n4melyh4xor

Posted 2011-12-04T16:05:35.923

Reputation: 507

@MartinEnder, got it. – n4melyh4xor – 2016-12-18T11:31:58.897

Actually, I just noticed that this isn't code golf at all. – Martin Ender – 2016-12-18T11:32:18.053

0

Python, 38 bytes

import math
print int(math.log(input(),2)+1)

Coding man

Posted 2011-12-04T16:05:35.923

Reputation: 1 193

0

Ruby, 20

No input or output is necessary because N is given and the question asks for "code":

Math::log2(N).to_i+1

daniero

Posted 2011-12-04T16:05:35.923

Reputation: 17 193

Math.log2 will spare you a character. (This is also the convention for calling class/module methods.) Although this is a fastest-code question, anyway. :) – Adam Prescott – 2013-10-23T20:06:18.513

0

K, 8

1+&|0b\:

Test:

k)(1+&|0b\:)'(2*)\[10;1]
1
2
3
4
5
6
7
8
9
10
11

tmartin

Posted 2011-12-04T16:05:35.923

Reputation: 3 917

0

Mathematica

I was surprised to find that Method #2 is much faster than Method #1.

Some data

 t = Table[2^k, {k, 0, 10^5}];

Method #1: log base 2

 Log[2, t]; // AbsoluteTiming

{0.203758, Null}


Method #2: where is the 1 bit?

 Length[IntegerDigits[#, 2]] & /@ t; // AbsoluteTiming

Break up the base 2 representation of the number into bits.

Find how many bits there are. (The 1 will always be farthest left.)

{0.100730, Null}

DavidC

Posted 2011-12-04T16:05:35.923

Reputation: 24 524

0

Python, 39 chars

f=lambda n:len(bin(n))-bin(n).find('1')

Daniel Lubarov

Posted 2011-12-04T16:05:35.923

Reputation: 301

btw, this isn't code-golf... – FlipTack – 2016-12-18T09:10:06.917

0

Haskell, 73 characters

import Data.List
main=do n<-readLn;print$findIndices(>n)[2^k|k<-[0..]]!!0

Thomas Eding

Posted 2011-12-04T16:05:35.923

Reputation: 796

0

If you want to run performance tests, I guess you need really big values of 2^i, so I use BigInt:

@annotation.tailrec
def binexp (n: BigInt, sofar: Int = 0): BigInt = 
  if (n < 2) 1 + sofar else binexp (n/2, sofar + 1)

The annotation is only there to check whether the compiler can perform tail-call optimization.

I'm curious how you'll measure this challenge.

user unknown

Posted 2011-12-04T16:05:35.923

Reputation: 4 210

0

n = n - 1; //n is the input no.
count = 0; //count of set bits
while(n = (n & (n-1)))
  count++;
return count + 1; 

Shwetank Gupta

Posted 2011-12-04T16:05:35.923

Reputation: 1

Did some characters get lost to markup? At present you have a while loop which either never executes or runs forever. – Peter Taylor – 2011-12-10T19:02:44.703

@PeterTaylor Looks good to me. Language is unspecified, but assuming a rather classical = is assignment not equality test, coercing to boolean is testing for nonzero, then the while loop clears the rightmost bit at each iteration, and stops after it cleared the whole of it. – J B – 2011-12-15T23:52:18.630

@JB, you're right, I was probably reading it as == without realising. – Peter Taylor – 2011-12-16T08:31:07.103