C++ bit magic
0.84ms with simple RNG, 1.67ms with c++11 std::knuth
0.16ms with slight algorithmic modification (see edit below)
The python implementation runs in 7.97 seconds on my rig. So this is 9488 to 4772 times faster depending on what RNG you choose.
#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <tuple>
#if 0
// C++11 random
std::random_device rd;
std::knuth_b gen(rd());
uint32_t genRandom()
{
return gen();
}
#else
// bad, fast, random.
uint32_t genRandom()
{
static uint32_t seed = std::random_device()();
auto oldSeed = seed;
seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit
return oldSeed;
}
#endif
#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif
std::pair<unsigned, unsigned> convolve()
{
const uint32_t n = 6;
const uint32_t iters = 1000;
unsigned firstZero = 0;
unsigned bothZero = 0;
uint32_t S = (1 << (n+1));
// generate all possible N+1 bit strings
// 1 = +1
// 0 = -1
while ( S-- )
{
uint32_t s1 = S % ( 1 << n );
uint32_t s2 = (S >> 1) % ( 1 << n );
uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
static_assert( n < 16, "packing of F fails when n > 16.");
for( unsigned i = 0; i < iters; i++ )
{
// generate random bit mess
uint32_t F;
do {
F = genRandom() & fmask;
} while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );
// Assume F is an array with interleaved elements such that F[0] || F[16] is one element
// here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
// and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
// this results in the distribution ( -1, 0, 0, 1 )
// to ease calculations we generate r = LSB(F) and l = MSB(F)
uint32_t r = F % ( 1 << n );
// modulo is required because the behaviour of the leftmost bit is implementation defined
uint32_t l = ( F >> 16 ) % ( 1 << n );
uint32_t posBits = l & ~r;
uint32_t negBits = ~l & r;
assert( (posBits & negBits) == 0 );
// calculate which bits in the expression S * F evaluate to +1
unsigned firstPosBits = ((s1 & posBits) | (~s1 & negBits));
// idem for -1
unsigned firstNegBits = ((~s1 & posBits) | (s1 & negBits));
if ( popcnt( firstPosBits ) == popcnt( firstNegBits ) )
{
firstZero++;
unsigned secondPosBits = ((s2 & posBits) | (~s2 & negBits));
unsigned secondNegBits = ((~s2 & posBits) | (s2 & negBits));
if ( popcnt( secondPosBits ) == popcnt( secondNegBits ) )
{
bothZero++;
}
}
}
}
return std::make_pair(firstZero, bothZero);
}
int main()
{
typedef std::chrono::high_resolution_clock clock;
int rounds = 1000;
std::vector< std::pair<unsigned, unsigned> > out(rounds);
// do 100 rounds to get the cpu up to speed..
for( int i = 0; i < 10000; i++ )
{
convolve();
}
auto start = clock::now();
for( int i = 0; i < rounds; i++ )
{
out[i] = convolve();
}
auto end = clock::now();
double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;
#if 0
for( auto pair : out )
std::cout << pair.first << ", " << pair.second << std::endl;
#endif
std::cout << seconds/rounds*1000 << " msec/round" << std::endl;
return 0;
}
Compile in 64-bit for extra registers. When using the simple random generator the loops in convolve() run without any memory access, all variables are stored in the registers.
How it works: rather than storing S
and F
as in-memory arrays, it is stored as bits in an uint32_t.
For S
, the n
least significant bits are used where an set bit denotes an +1 and an unset bit denotes an -1.
F
requires at least 2 bits to create an distribution of [-1, 0, 0, 1]. This is done by generating random bits and examining the 16 least significant (called r
) and 16 most significant bits (called l
). If l & ~r
we assume that F is +1, if ~l & r
we assume that F
is -1. Otherwise F
is 0. This generates the distribution we're looking for.
Now we have S
, posBits
with an set bit on every location where F == 1 and negBits
with an set bit on every location where F == -1.
We can prove that F * S
(where * denotes multiplication) evaluates to +1 under the condition (S & posBits) | (~S & negBits)
. We can also generate similar logic for all cases where F * S
evaluates to -1. And finally, we know that sum(F * S)
evaluates to 0 if and only if there is an equal amount of -1's and +1's in the result. This is very easy to calculate by simply comparing the number of +1 bits and -1 bits.
This implementation uses 32 bit ints, and the maximum n
accepted is 16. It is possible to scale the implementation to 31 bits by modifying the random generate code, and to 63 bits by using uint64_t instead of uint32_t.
edit
The folowing convolve function:
std::pair<unsigned, unsigned> convolve()
{
const uint32_t n = 6;
const uint32_t iters = 1000;
unsigned firstZero = 0;
unsigned bothZero = 0;
uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
static_assert( n < 16, "packing of F fails when n > 16.");
for( unsigned i = 0; i < iters; i++ )
{
// generate random bit mess
uint32_t F;
do {
F = genRandom() & fmask;
} while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );
// Assume F is an array with interleaved elements such that F[0] || F[16] is one element
// here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
// and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
// this results in the distribution ( -1, 0, 0, 1 )
// to ease calculations we generate r = LSB(F) and l = MSB(F)
uint32_t r = F % ( 1 << n );
// modulo is required because the behaviour of the leftmost bit is implementation defined
uint32_t l = ( F >> 16 ) % ( 1 << n );
uint32_t posBits = l & ~r;
uint32_t negBits = ~l & r;
assert( (posBits & negBits) == 0 );
uint32_t mask = posBits | negBits;
uint32_t totalBits = popcnt( mask );
// if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
if ( totalBits & 1 )
continue;
uint32_t adjF = posBits & ~negBits;
uint32_t desiredBits = totalBits / 2;
uint32_t S = (1 << (n+1));
// generate all possible N+1 bit strings
// 1 = +1
// 0 = -1
while ( S-- )
{
// calculate which bits in the expression S * F evaluate to +1
auto firstBits = (S & mask) ^ adjF;
auto secondBits = (S & ( mask << 1 ) ) ^ ( adjF << 1 );
bool a = desiredBits == popcnt( firstBits );
bool b = desiredBits == popcnt( secondBits );
firstZero += a;
bothZero += a & b;
}
}
return std::make_pair(firstZero, bothZero);
}
cuts the runtime to 0.160-0.161ms. Manual loop unroll (not pictured above) makes that 0.150. The less trivial n=10, iter=100000 case runs under 250ms. I'm sure i can get it under 50ms by leveraging additional cores but that's too easy.
This is done by making the inner loop branch free and swapping the F and S loop.
If bothZero
is not required i can cut down the run time to 0.02ms by sparsely looping over all possible S arrays.
You may be able to gain a speedup by precomputing
1<<n
. – None – 2018-11-01T10:35:43.6403Could you provide a gcc friendly version and also what your command line would be please? I am not sure I can test it currently. – None – 2014-05-01T17:49:03.590
I know nothing about this but google tells me that __builtin_popcount might be a replacement for _mm_popcnt_u32() . – None – 2014-05-01T17:58:37.827
3Code updated, uses #ifdef switch to select the correct popcnt command. It compiles with
-std=c++0x -mpopcnt -O2
and takes 1.01ms to run in 32 bit mode (i don't have a 64-bit GCC version at hand). – Stefan – 2014-05-01T18:06:25.007Could you make it print the output? I am not sure if it is actually doing anything currently :) – None – 2014-05-01T18:26:07.663
output: http://pastebin.com/Xg7xVst2 (you'll have to change the
– Stefan – 2014-05-01T18:31:05.283#if 0
on line 123 for it to print the output of convolve())Thanks. It seems there is no randomness in your code, right? I mean you don't select a random seed. It does appear to be terrifyingly fast. Try increasing the number of iterations to 100,000. – None – 2014-05-01T18:43:49.213
Could you explain your solution by any chance? It really is amazingly fast and I think people would be interested. – None – 2014-05-01T18:53:19.073
@Lembik There is no randomness, only pseudo-randomness in the simple rng case because the seed is fixed. This is not too hard to fix by changing the seed. I've updated the answer with an explanation. – Stefan – 2014-05-01T19:57:33.713
Runs in 1.02 ms on my laptop when compiled with 64-bit gcc. – ceased to turn counterclockwis – 2014-05-01T20:17:21.217
@Stefan This is impressive work! I would hate to have to be the one to fix any bugs in it though :p bit shifting magic. – oligofren – 2014-05-02T09:27:02.837
8You are clearly a wizard. + 1 – BurntPizza – 2014-05-02T15:33:22.537
Hmm, I've got my work cut out for me. Glad that this epic code got me the populist badge. – Kyle Kanos – 2014-05-06T12:44:51.050