How Slow Is Python Really (Part II)?


This is a follow up to How slow is Python really? (Or how fast is your language?).

It turns out it was a bit too easy to get a x100 speedup for my last question. For those who have enjoyed the challenge but want something harder where they can really use their low-level skills, here is part II. The challenge is to get a x100 speedup for the following Python code as tested on my computer.

To make it more difficult I am using pypy this time. The current timing for me is 1 minute and 7 seconds using pypy 2.2.1.

Rules

  1. The first person to submit code which I can run, which is correct, and which is x100 faster on my machine will be awarded a bounty of 50 points.
  2. I will award the win to the fastest code after a week.
import itertools 
import operator 
import random

n = 8 
m  = 8 
iters = 1000  

# creates an array of 0s with length m
# [0, 0, 0, 0, 0, 0, 0, 0]
leadingzerocounts = [0]*m

# itertools.product creates an array of all possible combinations of the 
# args passed to it.
#
# Ex:
#   itertools.product("ABCD", "xy") --> Ax Ay Bx By Cx Cy Dx Dy
#   itertools.product("AB", repeat=5) --> [
#    ('A', 'A', 'A', 'A', 'A'),
#    ('A', 'A', 'A', 'A', 'B'),
#    ('A', 'A', 'A', 'B', 'A'),
#    ('A', 'A', 'A', 'B', 'B'),
#    etc.
#   ]
for S in itertools.product([-1,1], repeat = n+m-1):
    for i in xrange(iters):
        F = [random.choice([-1,0,0,1]) for j in xrange(n)]

        # if the array is made up of only zeros keep recreating it until
        # there is at least one nonzero value.
        while not any(F):
            F = [random.choice([-1,0,0,1]) for j in xrange(n)]

        j = 0
        while (j < m and sum(map(operator.mul, F, S[j:j+n])) == 0):
            leadingzerocounts[j] +=1
            j += 1
print leadingzerocounts

The output should be similar to

[6335185, 2526840, 1041967, 439735, 193391, 87083, 40635, 19694]

You must use a random seed in your code and any random number generator that is good enough to give answers close to the above will be accepted.

My Machine

The timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.

Explanation of code

This code iterates over all arrays S of length n+m-1 that are made up of -1s and 1s. For each array S it samples 1000 non-zero random arrays F of length n made up of -1, 0 or 1, taking these values with probabilities 1/4, 1/2 and 1/4 respectively. It then computes the inner products between F and each window of S of length n until it finds a non-zero inner product. It adds 1 to leadingzerocounts at each position where it found a zero inner product.
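
For concreteness, here is a single pass of the inner loop as a standalone sketch (hand-picked example values, not from the original post; the logic mirrors the code above):

    import operator

    n, m = 8, 8
    S = (1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, 1, 1)  # length n+m-1
    F = [1, 0, -1, 0, 0, 1, 0, 0]                            # length n, not all zero

    # slide a window of length n over S; every leading zero inner product
    # would increment leadingzerocounts[j], stopping at the first non-zero one
    j = 0
    while j < m and sum(map(operator.mul, F, S[j:j+n])) == 0:
        print "zero inner product at offset", j
        j += 1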

Status

  • Perl. 2.7 times slowdown by @tobyink. (Compared to pypy not cpython.)
  • J. 39 times speedup by @Eelvex.
  • C. 59 times speedup by @ace.
  • Julia. 197 times speedup not including start-up time by @one-more-minute. 8.5 times speedup including start-up time (it's faster using 4 processors in this case than 8).
  • Fortran. 438 times speedup by @semi-extrinsic.
  • Rpython. 258 times speedup by @primo.
  • C++. 508 times speedup by @ilmale.

(I stopped timing the new improvements because they are too fast and iters was too small.)


It was pointed out that timings below a second are unreliable and also some languages have a start-up cost. The argument is that if you are to include that you should also include the compilation time of C/C++ etc. Here are the timings for the fastest code with the number of iterations increased to 100,000.

  • Julia. 42 seconds by @one-more-minute.
  • C++. 14 seconds by @GuySirton.
  • Fortran. 14s by @semi-extrinsic.
  • C++. 12s by @ilmale.
  • Rpython. 18s by @primo.
  • C++. 5s by @Stefan.

The winner is... Stefan!

Follow-up challenge posted: How high can you go? (A coding+algorithms challenge). This one is harder.

user9206

Posted 2014-04-28T16:47:41.103

Reputation:

An explanation of what the code is supposed to achieve would be nice, so we can rewrite it and not simply port it – Einacio – 2014-04-28T16:55:43.733

@Einacio Does the explanation I just added help? – None – 2014-04-28T17:12:13.600

Are you sure that the while loop is right? I'm not proficient in python, but it looks to me that the code would not increment i if the inner product is non-zero. – Kyle Kanos – 2014-04-28T17:14:09.427

@KyleKanos That's right. It stops at the first non-zero inner product. Oh I should change the variable name! – None – 2014-04-28T17:18:55.807

6"The first person to submit code which I can run, is correct and is x100 times faster on my machine wins immediately and the competition closes." What is the purpose of closing the competition like that? Why not use a date deadline like most others, so we can see it furthered reduced in other languages? – grovesNL – 2014-04-28T17:44:17.743

@grovesNL It was just to try to make it more interesting! I also think it's going to be very hard so there is a good chance no one will get over that particular hurdle. – None – 2014-04-28T17:49:16.317

@Lembik to make it interesting you could set up a bounty to be granted at the end of the first week, but still change the accepted answers after that if it is beaten – Einacio – 2014-04-28T17:50:14.977

@Einacio That is a nice idea. I changed the rules, which I hope no one will mind. – None – 2014-04-28T17:50:41.060

Perhaps I'm misunderstanding what is supposed to happen when it finds a non-zero inner product, but why are the values in the array decreasing? Shouldn't each value in leadingzerocounts be similar? – grovesNL – 2014-04-28T18:20:31.973

@grovesNL leadingzerocounts counts the number of times you can get that far before seeing a non-zero inner product basically. So you expect it to be less likely to get further. – None – 2014-04-28T18:25:51.230

Come to think of it... I think there might be significant mileage in noticing that it is very wasteful to produce all these arrays of length n+m-1 when most of the time the right hand ends of them are never used. – None – 2014-04-28T18:40:18.457

Are "slight" algorithmic changes allowed? – Eelvex – 2014-04-29T06:42:19.680

@Eelvex If it's computing exactly the same thing mathematically you can change the algorithm any way you like. – None – 2014-04-29T07:57:23.400

1. Take the fastest C compatible solution, 2. wrap it as a CPython C extension, 3. profit? – Lie Ryan – 2014-04-29T11:21:33.837

    I think the OP should clarify two things:

    1. Should timings include everything for JIT languages? Compare the Julia answers below.
    2. I think "any RNG that is good enough to give answers close to the above" is a very poor metric for specifying RNG quality. Could you be more specific? Is the xorshift one, used in some answers, allowed? Wikipedia notes "The xorshift generators have been shown to be fast but not reliable."

    – semi-extrinsic – 2014-04-30T09:06:20.340

    I will also add these two comments from Nick Maclaren at the Cambridge Centre for Scientific Computing:
    "Nobody should EVER use 32-bit random number generators for more than about ten million numbers in a row without careful analysis" "No Xorshift generator is good enough to use on its own, because they have some evil properties."
    – semi-extrinsic – 2014-04-30T09:21:36.280

    @semi-extrinsic I think it is too late for me to change the rules wrt the RNG. I will compare the outputs for different values of n to what the python code gives. If it is very close I will accept the code. The timings should include everything for JIT languages. I will post another competition in a couple of days where the timings won't be <1 second which will help Julia people. – None – 2014-04-30T09:37:13.230

    Yes, timings below 1 second are another issue; they are usually not very reliable. Random numbers are hard to get right though, it's a common problem in computational physics that people use RNGs with poor properties. The "trick" of using one 32 bit random int to give many random bits is also very much frowned upon. – semi-extrinsic – 2014-04-30T09:59:47.923

    @semi-extrinsic I can assure you my next puzzle will not be solved in <1 second :) – None – 2014-04-30T10:42:43.557

    @Lembik Startup time and performance are separate issues, and should be considered separately. I encourage you to either reconsider excluding boot time, or rename this question to "How quickly does your language boot up?". For the faster solutions that's all you're measuring. – one-more-minute – 2014-04-30T12:28:36.130

    @one-more-minute There is no boot-up time with compiled languages. The performance difference between the C++ solution and the one I just posted in Fortran is not due to start-up time (zero in both cases) but because in C++ he can easily compute just one random integer and convert this to 15 instances of (-1, 0 or +1). I have to create 15 random integers. I tried doing it his way, but couldn't find a solution with low enough overhead :( – semi-extrinsic – 2014-04-30T13:00:49.047

    @semi-extrinsic Exactly, so including boot time means restricting the competition to precompiled languages for no real reason. I'm not saying startup time is never important, but in this case it's obscuring what's actually interesting – run time performance and techniques to improve it. – one-more-minute – 2014-04-30T13:23:36.970

    Why do you declare firstzero and bothzero? They don't seem to be used – mdesantis – 2014-04-30T13:39:09.427

    @one-more-minute I completely agree that a problem should be set where even the fastest solutions take more than a minute. Even C++ and Fortran timings are starting to be a little dubious at below 1 second. – semi-extrinsic – 2014-04-30T13:41:26.197

    @Lembik perhaps faster solutions should use 10,000 iterations, so that times are back in the minute range? Then you'd just divide them by 10. – one-more-minute – 2014-04-30T13:45:44.980

    @Lembik I gave up, I was trying this in Go but the missing bits and pieces were just too many and things became confusing when I tried to implement them myself:S. Having done work in python NLTK I honestly miss the existing tools in the language :( – ymg – 2014-04-30T23:32:12.550

    @one-more-minute Basically having timings under a second makes everything unsatisfactory. This is something I have learned. You can't just repeat 10,000 times as a smart compiler will just eliminate those repeats. You have to actually output something different in each iteration and that needs a carefully thought through spec. – None – 2014-05-01T06:47:50.203

    @Lembik I'm not sure I understand – if you set iters=10,000 in your original code, that will fairly straightforwardly make it run 10x longer, no? There's no compiler that could or would optimise those loops away. – one-more-minute – 2014-05-01T09:23:58.970

    Oh I misunderstood! Yes that would work. – None – 2014-05-01T09:57:47.877

    @Lembik Would you mind timing mine again? Make sure you're compiling with optimizations on. There was also an additional flag needed to get the threads going on Linux (and check to see your 8 cores aren't busy)... – Guy Sirton – 2014-05-01T21:35:36.703

    @GuySirton Done. I must be losing track of which code has which compiler flags. Thanks. – None – 2014-05-02T06:07:43.457

    @Lembik I've improved my Fortran version, making it 2x faster again on my machine. Could you time it again? :) – semi-extrinsic – 2014-05-02T12:16:48.747

    @semi-extrinsic Done. – None – 2014-05-02T19:20:15.183

    If you are counting C / C++ compilation times, you should also count Java compilation times. Also this is about the fastest code, not about the fastest compiler... – Jeroen – 2014-05-06T18:23:35.723

    @JeroenBollen Actually I didn't count either, but you would be right had I done so. – None – 2014-05-06T18:40:15.380

    It has been very interesting to follow. It is a pity that this coding challenge included random numbers (PRNGs), and moreover without any rules about their quality. This is more a library issue than a programming-language one, and it has made the results much less comparable and interesting. – Philm – 2014-06-22T20:17:14.047

    Answers


    C++ bit magic

    ~16 ms multithreaded, 56 ms single-threaded. ~x4000 speedup.

    (The speedup is based on the multithreaded code on my i7-2820QM and the 1 min 9 seconds mentioned in the question. Since the OP's system has worse single-threaded performance than my CPU but better multi-threaded performance, I expect this number to be accurate.)

    The multithreaded part is quite inefficient due to the spawning of threads. I could probably do better by leveraging my custom job library, but that one has bugs under unix systems. For an explanation and almost identical code without threading refer to https://codegolf.stackexchange.com/a/26485/20965.

    edit

    I gave each thread its own RNG and cut the bit length down to 32, which reduced the runtime by a few ms.

    #include <iostream>
    #include <bitset>
    #include <random>
    #include <chrono>
    #include <stdint.h>
    #include <cassert>
    #include <array>
    #include <tuple>
    #include <memory>
    #include <thread>
    #include <future>
    #include <string.h>
    
    
    #ifdef _MSC_VER
    uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
    #else
    uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
    #endif
    
    
    
    void convolve()
    {
        static const unsigned threadCount = 32;
        static const unsigned n = 8;
        static const unsigned m = 8;
        static const unsigned totalIters = 1000;
        static_assert( n <= 16, "packing of F fails when n > 16.");
        static uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
    
        std::array< uint32_t, m * threadCount > out;
        std::vector< std::future<void> > threads;
    
        for( int threadId = 0; threadId < threadCount; threadId++)
        {
            threads.emplace_back( std::async( [&, threadId]
            {
                std::random_device rd;
                std::knuth_b gen(rd());
                uint32_t nextRandomNumber = gen();
    
                const unsigned iters = totalIters / threadCount;
    
                std::array< uint32_t, m > leadingZeros;
                for( auto& x : leadingZeros )
                    x = 0;
    
                for( unsigned i = 0; i < iters; i++ )
                {
                    // generate random bit mess
                    uint32_t F;
                    do {
                        // this funky looking construction shortens the dependency chain of F
                        F = nextRandomNumber & fmask;
                        nextRandomNumber = gen();
                    } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );
    
                    // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
                    // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
                    // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
                    // this results in the distribution ( -1, 0, 0, 1 )
                    // to ease calculations we generate r = LSB(F) and l = MSB(F)
    
                    uint32_t r = F % ( 1 << n );
                    // modulo is required because the behaviour of the leftmost bit is implementation defined
                    uint32_t l = ( F >> 16 ) % ( 1 << n );
    
                    uint32_t posBits = l & ~r;
                    uint32_t negBits = ~l & r;
                    assert( (posBits & negBits) == 0 );
    
                    uint32_t mask = posBits | negBits;
                    uint32_t totalBits = popcnt( mask );
                    // if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
                    if ( totalBits & 1 )
                        continue;
    
                    uint32_t adjF = posBits & ~negBits;
                    uint32_t desiredBits = totalBits / 2;
    
                    uint32_t S = (1 << (n + m -1));
                    // generate all possible N+1 bit strings
                    // 1 = +1
                    // 0 = -1
                    while ( S-- )
                    {
                        for( int shift = 0; shift < m; shift++ )
                        {
                            uint32_t s = (S >> shift) % ( 1 << n );
                            auto firstBits = (s & mask) ^ adjF;
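                        // a set bit in firstBits marks a masked position where S and
                        // F have opposite signs (a -1 product), so the window's inner
                        // product is zero exactly when half of the totalBits products
                        // are -1 -- which is what the popcnt test below checks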
    
                            if ( desiredBits == popcnt( firstBits ) )
                            {
                                leadingZeros[shift] = leadingZeros[shift] + 1;
                            }
                            else
                            {
                                break;
                            }
                        }
                    }
                }
    
                memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m );
            } ));
    
        };
    
        std::array< uint32_t, m > leadingZeros;
        for( auto& x : leadingZeros )
            x = 0;
    
        for( auto& thread : threads )
        {
            thread.wait();
        }
    
        for( int i = 0; i < (threadCount * m); i++ )
        {
            leadingZeros[i % m] += out[i];
        }
    
    
        for( auto x : leadingZeros )
            std::cout << x << ", ";
    
        std::cout << std::endl;
    }
    
    int main()
    {
        typedef std::chrono::high_resolution_clock clock;
        int rounds = 100;
    
        // do some rounds to get the cpu up to speed..
        for( int i = 0; i < rounds / 10; i++ )
        {
            convolve();
        }
    
    
        auto start = clock::now();
    
        for( int i = 0; i < rounds; i++ )
            convolve();
    
        auto end = clock::now();
        double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;
    
        std::cout << seconds/rounds*1000 << " msec/round" << std::endl;
    
        return 0;
    }
    

    Sample output:

       6317312, 2515072, 1034368, 434048, 190144, 85200, 39804, 19168,
       6226944, 2481408, 1031168, 438080, 192896, 86816, 40484, 19490,
       6321152, 2524672, 1045376, 442880, 195680, 88464, 41656, 20212,
       6330624, 2517504, 1031104, 430208, 187696, 83976, 38976, 18708,
       6304768, 2510336, 1030720, 433056, 190880, 86824, 40940, 19840,
       6272512, 2494720, 1028160, 432352, 189168, 84752, 39540, 19052,
       6233600, 2507520, 1046912, 447008, 198224, 89984, 42092, 20292,
    

    Stefan

    Posted 2014-04-28T16:47:41.103

    Reputation: 891

    "

    The multithreaded part is quite inefficient due to the spawning of threads. I could probably do better by leveraging my custom just library but that one has bugs under unix systems" -- Can you post some info about your library? I'd love to see some threading stuff for Windows – Robert Fraser – 2015-01-31T11:15:12.070

    @over I meant to say job library... I used https://github.com/id-Software/DOOM-3-BFG/blob/master/neo/idlib/ParallelJobList.h as an example. It's possible to make it almost lock free, which leads to good performance even when the individual jobs are small. – Stefan – 2015-01-31T13:27:23.283

    The output isn't right I think, maybe there is a bug? Compare to what is in the question. In particular the last column should be a number close to 19180. – None – 2014-05-02T06:10:31.047

    @Lembik I can see what you mean. I think the random output is not random enough, which creates funky output sometimes. With the C++11 random generator it works fine. I'll fix the code later today. – Stefan – 2014-05-02T12:08:47.687

    I get Stefan.cpp:104:101: error: ‘memcpy’ was not declared in this scope memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m ); – None – 2014-05-05T08:23:37.180

    I think that I need to include string.h. Try it again. – Stefan – 2014-05-05T09:41:08.650

    You compile this with g++ -O3 -std=c++0x -pthread -Wl,--no-as-needed Stefan.cpp -o Stefan – None – 2014-05-05T18:55:51.753

    I tried it with totalIters=100,000 and surprisingly it took 13s. What do you get? – None – 2014-05-05T19:03:14.357

    That is because without the -mpopcnt flag GCC generates an __popcountdi2 library call instead of the popcntl SSE4 instruction, which does exactly the same thing except in 1 clock cycle instead of 20. On my (not too fast) linux server I get around 4 seconds with one core as soon as I add -mpopcnt with 100,000 iters, and 60ms with 1000 iters. I'm not sure about the crash, it works fine on all my systems. On a side note, I find your previous challenge much more interesting than this and the next one. Try to think of something new next time :) – Stefan – 2014-05-05T21:02:58.753

    g++ -O3 -std=c++0x -pthread -Wl,--no-as-needed -mpopcnt Stefan.cpp -o Stefan now gives 5 seconds. Thanks. The crash was before I understood the right compile options. – None – 2014-05-05T21:05:06.443


    C++ x150 x450 x530

    Instead of arrays I used bits (and dark magic).
    Thanks @ace for the faster random function.

    How does it work: the first 15 bits of the integer s represent the array S of length 15; the zeroes represent -1, the ones represent +1. The array F is built in a similar way, but with two bits for each symbol.

    • 00 represents -1
    • 01 and 10 represent 0
    • 11 represents +1

    Because S and F have different representations, I have to interleave S with itself to make it comparable with F.

    • 0 (-1) becomes 00 (-1 in the representation of F)
    • 1 (+1) becomes 11 (+1 in the representation of F)

    Now we can simply use a Karnaugh map (truth table) to compute the inner product. Remember that a variable can only assume the value 00 or 11:

    0 . 00 = 11 (-1 * -1 = +1)
    0 . 01 = 10 (-1 * 0 = 0)
    0 . 10 = 01 (-1 * 0 = 0)
    0 . 11 = 00 (-1 * +1 = -1)
    1 . 00 = 00 (+1 * -1 = -1)
    1 . 10 = 10 (+1 * 0 = 0)
    1 . 01 = 01 (+1 * 0 = 0)
    1 . 11 = 11 (+1 * +1 = +1)

    Looks like a not xor to me. :)

    Summing the ones is just a game of shifts and masks, nothing really complex.
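
    As a quick sanity check of the table above, the NOT XOR rule can be verified in Python (the encodings are written out by hand here; this is not part of the answer's code):

        # 2-bit encodings used for F: 00 -> -1, 01/10 -> 0, 11 -> +1
        decode = {0b00: -1, 0b01: 0, 0b10: 0, 0b11: 1}

        for s_bit, s_val in ((0, -1), (1, 1)):   # S: 0 encodes -1, 1 encodes +1
            s_dup = 0b11 if s_bit else 0b00      # S interleaved with itself
            for f_bits, f_val in decode.items():
                product = ~(s_dup ^ f_bits) & 0b11   # the "not xor"
                assert decode[product] == s_val * f_val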

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <ctime>
    
    // From standford bithacks
    // http://graphics.stanford.edu/~seander/bithacks.html
    inline int32_t interleaveBit(int32_t x)
    {
       static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
       x = (x | ( x << 8)) & B[3];
       x = (x | ( x << 4)) & B[2];
       x = (x | ( x << 2)) & B[1];
       x = (x | ( x << 1)) & B[0];
       return x | (x << 1);
    }
    
    inline int32_t sumOnes(int32_t v)
    {
       static int b[] = { 1, 0, 0, 1};
       int s = 0;
       for( int i = 0; i < 8; ++i)
       {
          const int a = 3&(v>>(i*2));
          s += b[a];
       }
       return s;
    }
    
    inline int32_t sumArray(int32_t v)
    {
       static int b[] = { -1, 0, 0, 1};
       int s = 0;
       for( int i = 0; i < 8; ++i)
       {
          const int a = 3&(v>>(i*2));
          s += b[a];
       }
       return s;
    }
    
    uint32_t x, y = 24252, z=57768, w=1564; //PRNG seeds
    
    int32_t myRand()
    {
       uint32_t t;
       t = x ^ (x<<1);
       x = y;
       y = z;
       z = w;
       w = w ^ ( w >> 19) ^ t ^ (t >> 8);
       return w;
    }
    
    int main()
    {
       std::array<int, 8> leadingZero{0};
       x = static_cast<int32_t>(time(nullptr)); // seed PRNG
       const int maxS = 1 << 15;
       for(int s = 0; s < maxS; ++s)
       {
          const int32_t x = interleaveBit(s);
          for(int i = 0; i < 1000; ++i)
          {
             int32_t random;
             do
             {
                random = 0xFFFF & myRand();
             }while(sumOnes(random) == 0);
             int j = 7;
             while( j >= 0 )
             {
                const int32_t h = (x >> (j*2));
                const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
                if(sumArray(l) == 0)
                {
                   leadingZero[j]++;
                } else
                {
                   break;
                }
                j--;
             }
    
          }
       }
       for(int i = 7; i >= 0; --i)
       {
          printf("%d ", leadingZero[i]);
       }
       printf("\n");
       return 0;
    }
    

    Here is a sample output:

    6332350 2525218 1041716 438741 192917 87159 41023 19908 
    
    real 0m0.372s
    user 0m0.371s
    sys  0m0.001s
    

    The program has been compiled with:

    gcc -std=c++11 -O3 -msse4.2 -Wall -lstdc++ 26371.cpp -o fastPy
    

    on Fedora 20 with gcc 4.8.2. The CPU is an 8-core i7.

    I can probably gain some ms by tweaking the compiler parameters.

    While this is the OP's solution time on my machine:

    time pypy 26371.py
    [6330609, 2523914, 1040885, 439303, 192708, 86987, 40710, 19498]
    
    real 0m53.061s
    user 0m53.016s
    sys  0m0.022s
    

    Edit:

    Just by adding OpenMP and changing the order of the for loops I get a x3 gain, leading to a x450 performance improvement over the OP's code. :D In this case the leadingZero array must be atomic. The random globals... are random, they will be more random.

     #pragma omp parallel for
     for(int i = 0; i < 1000; ++i)
     {
        int32_t random;
        do
        {
           random = 0xFFFF & myRand();
        }while(sumOnes(random) == 0);
        for(int s = 0; s < maxS; ++s)
        {
           const int32_t x = interleaveBit(s);
           int j = 7;
           while( j >= 0 )
           {
              const int32_t h = (x >> (j*2));
              const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
              if( sumArray(l) == 0 )
              {
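                 // note: as stated above, with the parallel for this increment
                 // must be atomic (e.g. #pragma omp atomic) for correct counts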
                 leadingZero[j]++;
              } else
              {
                 break;
              }
              j--;
           }
        }
     }
    

    You need to add -fopenmp to the compiler flags.


    Edit 2: As suggested by user71404, I changed the sumOnes and sumArray functions and now it's uber fast.

    real  0m0.101s
    user  0m0.101s
    sys   0m0.000s
    

    With OpenMP it's slower, because the atomics add too much overhead.

    real  0m0.253s
    user  0m1.870s
    sys   0m0.001s
    

    Without atomics it's even faster, but I get the wrong result.

    2137992 1147218 619297 321243 155815 70946 32919 15579

    real   0m0.048s
    user   0m0.338s
    sys    0m0.001s
    

    To understand sumArray, consider that 16 bits represent an array of 8 numbers.
    00 has no 1s and represents -1
    01 and 10 have one 1 and represent 0
    11 has two 1s and represents 1
    So the built-in counts the number of bits set to 1 [http://en.wikipedia.org/wiki/Hamming_weight] and from each group we subtract 1. Cool.
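
    In other words, popcount(v) - 8 works because each 2-bit group contributes its encoded value plus one to the popcount. A small Python check of this identity (hand-rolled encoding, independent of the answer's code):

        import random

        codes = {-1: (0b00,), 0: (0b01, 0b10), 1: (0b11,)}  # value -> 2-bit codes

        def popcount(v):
            return bin(v).count('1')

        for _ in xrange(1000):
            values = [random.choice((-1, 0, 0, 1)) for _ in xrange(8)]
            v = 0
            for i, x in enumerate(values):
                v |= random.choice(codes[x]) << (2 * i)
            assert popcount(v) - 8 == sum(values)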

    sumOnes is just black magic.

    Here are the latest compile flags and code.

    gcc -std=c++11 -mfpmath=sse -O3 -flto -march=native -funroll-loops -Wall -lstdc++

    #include <cstdint>
    #include <cstdio>
    #include <ctime>
    
    inline int32_t interleaveBit(int32_t x)
    {
       static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
       x = (x | ( x << 8)) & B[3];
       x = (x | ( x << 4)) & B[2];
       x = (x | ( x << 2)) & B[1];
       x = (x | ( x << 1)) & B[0];
       return x | (x << 1);
    }
    
    inline int32_t sumOnes(int32_t v)
    {
       /* 0xAAAA == 0b1010 1010 1010 1010 */
       return !!(0xAAAA & (v ^ ~(v << 1)));
    }
    
    inline int32_t sumArray(int32_t v)
    {
       return __builtin_popcount(v) - 8;
    }
    
    uint32_t x, y = 24252, z = 57768, w = 1564; //PRNG seeds
    
    int32_t myRand()
    {
       uint32_t t;
       t = x ^ (x << 1);
       x = y;
       y = z;
       z = w;
       w = w ^ ( w >> 19) ^ t ^ (t >> 8);
       return w;
    }
    
    int main()
    {
       int leadingZero[8] = { 0 };
       x = static_cast<int32_t>(time(nullptr)); // seed PRNG
       const int maxS = 1 << 15;
       for( int i = 0; i < 1000; ++i )
       {
          int32_t random;
          do
          {
             random = 0xFFFF & myRand();
          } while(sumOnes(random) == 0 );
          for( int s = 0; s < maxS; ++s )
          {
             const int32_t x = interleaveBit(s);
             int j = 7;
             while( j >= 0 )
             {
                const int32_t h = (x >> (j * 2));
                const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
                if( sumArray(l) == 0 )
                {
                   leadingZero[j]++;
                } else
                {
                   break;
                }
                j--;
             }
          }
       }
       printf("[%d, %d, %d, %d, %d, %d, %d, %d]\n",
          leadingZero[7], leadingZero[6],
          leadingZero[5], leadingZero[4],
          leadingZero[3], leadingZero[2],
          leadingZero[1], leadingZero[0]);
       return 0;
    }
    

    ilmale

    Posted 2014-04-28T16:47:41.103

    Reputation: 1 147

    Now I can't wait to test this! Sadly this won't be for a few hours. – None – 2014-04-30T09:38:03.607

    The following was in a suggested edit, but I think that it may fit better as a comment. You can replace your sumOnes, sumArray with the following (seems to give me a 2x speed up over the openmp version).

    `inline int32_t sumOnes(int32_t v) { /* 0xAAAA == 0b1010 1010 1010 1010 */ return !! (0xAAAA & (v ^ ~(v << 1))); }

    inline int32_t sumArray(int32_t v) { return __builtin_popcount(v) - 8; }` this was suggested by @user71404 – user12205 – 2014-04-30T14:33:41.933

    @user71404: the profiler said that the program spent all its time in those two functions, but I was really tired yesterday and couldn't think of anything better than that. I'll try this evening (UTC). Thanks. – ilmale – 2014-04-30T14:40:57.730

    Would you mind changing the second code snippet to be the full, copy-and-pastable code? I must be doing something wrong in my attempts to get your openmp code working, so this would help a lot. – None – 2014-04-30T18:14:18.717

    Nice. I thought this could be done with bit operations. – Guy Sirton – 2014-04-30T20:09:41.723

    I understand sumOnes now. If a group has the same bits (11 or 00) it will be set to zero. – ilmale – 2014-04-30T20:21:57.840

    Just to add a note: if I remove the final printf it takes 0.001s to run. :D – ilmale – 2014-04-30T20:27:49.660

    @ilmale Is that because everything is optimized away? That won't count ;-) – Guy Sirton – 2014-04-30T20:54:34.207

    Maybe this should be submitted in the C category? ;-) – Guy Sirton – 2014-05-01T07:02:19.447

    I started with C++ in mind... then I found that I could solve it with bit tricks. bitset didn't seem like a good solution for really high speed. That's why I like C++: you can use really low-level stuff when you need it. – ilmale – 2014-05-01T14:14:42.280


    Julia: 0.7s, 120x faster

    As user20768 demonstrated, a straightforward port of the code to Julia is about twice as fast as PyPy. But we can do a lot better than that.

    function pleadingzerocounts(; n = 8,
                                  m = 8,
                                  iters = 1000)
      @parallel (+) for S = 1:2^(8+8-1)
        leading_counts = zeros(Int, m)
        F = Array(Int, n)
        for i = 1:iters
          flag = 0
          while flag == 0
            for i = 1:n
              v = (1-(rand(Int8)&3))%2
              @inbounds F[i] = v
              flag += v & 1
            end
          end
          for j = 1:m
            sum = 0
            for i = 1:n
              @inbounds sum += S & (1 << (j + i - 2)) > 0 ? F[i] : -F[i]
            end
            sum == 0 ?
              (leading_counts[j] += 1) :
              break
          end
        end
        leading_counts
      end
    end
    
    function main()
      # Warm up the JIT
      pleadingzerocounts()
      # Then go for real
      println(@time pleadingzerocounts())
    end
    

    You can run this using julia -p 8 -e 'require("golf.jl");main()' (the 8 is the number of processes, you might want to play around with it). On the latest Julia prerelease this takes 0.7s vs. 1m22s for PyPy.

    If you have enough cores on your computer, and perhaps spin up a few AWS instances, you should be able to shave off some more :)

    one-more-minute

    Posted 2014-04-28T16:47:41.103

    Reputation: 220

    I've tried it on Julia 1.0 with the proper changes: with distributed instead of parallel, and with F = Array{Int}(undef, n). Using addprocs(8) on an old i7 930 computer it needs 613ms to complete, measured with btime. – skan – 2018-11-26T20:33:54.657

    I'm quite sure you're measuring the timing wrong. Python with Pypy is also a JIT-based language, but the timings made by the OP include the JIT compilation time. You are excluding it.

    I installed the latest Julia git version and tested your code, and on my machine your command with 8 processes takes 6.6 seconds to finish, but it prints "elapsed time 0.588.. seconds". – semi-extrinsic – 2014-04-30T08:53:55.517

    The Python timing does include PyPy startup and JIT warmup, but that takes seconds at most – the difference over a minute of runtime is negligible.

    I'm happy if the OP changes the way the Python is timed (it won't make any difference), but including Julia's startup time wouldn't be reasonable. – one-more-minute – 2014-04-30T09:04:24.120

    I asked the OP in the comments to the original question, and he said "The timings should include everything for JIT languages." He also stated he will create a new challenge where the solutions will take much longer than 1 second, leaving Julia in the competition. – semi-extrinsic – 2014-04-30T12:17:01.197

    In that case the optimal solution is to use a serial algorithm – that takes about 2s. I'd publish the code but this competition is now somewhat redundant, since everyone already knows that C++ boots faster than everything else. – one-more-minute – 2014-04-30T12:33:56.640

    I just posted my Fortran solution, so I don't see why you shouldn't post the fastest Julia one (if you already have the code). – semi-extrinsic – 2014-04-30T13:04:09.150

    I get julia -p 8 -e 'require("one-more-minute.jl");main()' ERROR: @simd not defined in include at boot.jl:238 at /home/user/one-more-minute.jl:0 . This is on julia version 0.2.1 – None – 2014-04-30T18:28:52.993

    Ah, that's a new feature then. It will work on a prerelease, otherwise you can just remove the @simd completely.

    – one-more-minute – 2014-04-30T19:29:23.860

    I've updated the code to work on 0.2.1, but note that 0.3 will boot much more quickly (and will probably run faster overall, too). – one-more-minute – 2014-04-30T19:45:05.790

    And why is the time reported so inaccurate? I increased the number of iterations to 100,000 and posted the result in the question. – None – 2014-05-01T10:33:27.767

    I was running the code twice to make the timing more reliable – if you get rid of the first pleadingzerocounts() line in main() the total time should halve. Thanks! – one-more-minute – 2014-05-01T10:40:13.407

    As you will see from my latest tests, it is very impressively fast. Thank you! – None – 2014-05-01T12:25:17.487


    C, 1.210s

    For comparison, the OP's code runs in 1m45.729s on my machine.

    Compilation:

    gcc -O3 -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize -Wall ./test2.c -o ./test2
    

    Special thanks: @dyp for compilation flags and ideas for optimisations.

    #include <stdio.h>
    #include <time.h>
    
    #define n (8)
    #define m (8)
    #define iters (1000)
    int leadingzerocounts[m]; // declared as global so initialised to 0
    unsigned int x,y=34353,z=57768,w=1564; //PRNG seeds
    
    /* xorshift PRNG
     * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
     * Used under CC-By-SA */
    int myRand() {
        unsigned int t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return w = w ^ (w >> 19) ^ t ^ (t >> 8);
    }
    
    int dotproduct(int*F, int*S) {
        unsigned int i;
        int sum=0;
        for(i=0; i<n; i++) {
            sum+=F[i]*S[i];
        }
        return sum;
    }
    
    int main() {
        unsigned int i, j, tmp;
        x=(int)time(NULL); //seed PRNG
    
        int S[n+m-1];
        for(i=0; i<(1<<(n+m-1)); i++) {
            tmp=i;
            for(j=0; j<n+m-1; j++) {
                S[j]=(tmp&1)*(-2)+1;
                tmp>>=1;
            }
            for(j=0; j<iters; j++) {
                int F[n];
                unsigned int k, flag=0;
                do {
                    for(k=0; k<n; k++) {
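                    /* (1-(myRand()&3))%2 maps 0,1,2,3 -> 1,0,-1,0, i.e. the
                       required distribution (-1 and 1 each with probability 1/4,
                       0 with probability 1/2) without any branching */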
                        F[k]=(1-(myRand()&3))%2;
                        flag+=(F[k]&1);
                    }
                } while(!flag);
                for(k=0; k<m&&!dotproduct(F, S+k); k++) {
                    leadingzerocounts[k]++;
                }
            }
        }
        for(i=0; i<m; i++) printf("%d ", leadingzerocounts[i]);
        return 0;
    }
    

    Sample output:

    6334411 2527506 1042239 439328 192914 87005 40847 19728
    

    user12205

    Posted 2014-04-28T16:47:41.103

    Reputation: 8 752

    That's a very nice starting point. 5.2s on my computer so a speedup of around 13. – None – 2014-04-28T17:54:03.370

    It seems you can speed up your program by using rand()&3 instead of rand()%4 (although I'm not quite sure it produces the same distribution); replacing the branching for F[k] by using something like F[k] -= F[k]&2; also helped a lot. Using some additional optimizations switches, runtime went from 9.3 s down to 6.4 s on my VM. I suspect calling rand() less often could further improve performance. – dyp – 2014-04-28T18:50:17.477

    @dyp interestingly, on my machine, with the gcc -O3 flag, rand()%4 is faster than rand()&3 by about 0.025s, but using no optimization flags, rand()&3 is faster by ~0.45s. – user12205 – 2014-04-28T18:58:24.927

    Interesting indeed, I can make similar observations when dropping all those optimization flags. Try -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize Btw. I got down to < 4 s by using some C++11 including a MT19937 plus a uniform_int_distribution. – dyp – 2014-04-28T19:12:57.613

    @dyp thanks for your flags. I managed to get it down to < 4 without C++11, thanks to you suggesting to reduce the conditional branching after the rand() call. – user12205 – 2014-04-28T19:18:40.767

    3.365s or a speedup of about 20 :) – None – 2014-04-28T19:23:05.793

    Yeah, but you started with 5 s; I started with 9 s. IIRC, the generator classes can store unused "randomness" and therefore doesn't need to invoke the generator algorithm every time. – dyp – 2014-04-28T19:24:40.697

    At the risk of trying to solve my own problem... :) I think the real benefit will come from not actually creating all 2^(n+m-1) * 1000 pairs. The vast majority of those pairs will never look at the right hand end of S, so you should only need to create something much closer to 2^n * 1000 pairs. How exactly to get this to work is another question :) – None – 2014-04-28T19:44:36.480

    @dyp Since I am not very familiar with C++ and my knowledge of C++11 is almost non-existent, I don't think I will plan on writing a C++11 version. I did however switch to a PRNG using xorshift, which greatly reduced my runtime. – user12205 – 2014-04-28T20:09:39.893

    1.119s making a speedup of roughly 59! – None – 2014-04-28T20:11:10.130

    @ace Yes, I just wanted to point this out. It was simpler for me just to try some of the standard library PRNGs in C++. Note that you can use one 32-bit integer result from a PRNG to produce 8 entries for F. – dyp – 2014-04-28T20:17:11.377

    Since n is equal to 8, you can probably use AVX (or 2*SSE) to compute the dotproduct with a proper S storage. – Michael M. – 2014-04-29T05:50:47.950

    SSE version, small speed-up: https://gist.github.com/anonymous/11394210 (don't forget to include smmintrin.h) – Michael M. – 2014-04-29T08:41:33.117

    @Michael I am no SSE expert, but my processor seems to support all kinds of SSE versions: http://www.cpu-world.com/CPUs/Bulldozer/AMD-FX-Series%20FX-8350.html – None – 2014-04-29T09:00:56.817

    @Lembik, yes it also supports AVX (sadly, not mine) – Michael M. – 2014-04-29T09:09:11.403

    @Michael Is http://stackoverflow.com/questions/10454150/intel-avx-256-bits-version-of-dot-product-for-double-precision-floating-point relevant? Sorry, I really don't know anything about AVX/SSE so I may be way off. – None – 2014-04-29T09:11:15.537

    1@Lembik, no that's floating-point dot product, not usable for integers. Anyway, I think we can use improve SSE using short instead of int and do math on 8 numbers simultaneously. – Michael M. – 2014-04-29T09:21:49.653

    Yes, even better using short and 16-bit intrinsics! https://gist.github.com/anonymous/11395262 – Michael M. – 2014-04-29T09:34:14.920

    Ah OK, great. Did you get your original suggestion from http://stackoverflow.com/questions/23186348/integer-dot-product-using-sse-avx ? – None – 2014-04-29T09:34:38.213

    @Lembik, absolutely. – Michael M. – 2014-04-29T09:36:39.540

    @Lembik My original algorithm did not depend on n. Is it acceptable if it does? I.e. it will only work if n is 8, it won't work if it is some other number, even if you redefine n. Since Michael's SSE suggestion and dyp's PRNG suggestion both actually depend on this. – user12205 – 2014-04-29T10:19:42.323

    @ace I would prefer it if it worked for any even n to be honest. – None – 2014-04-29T10:21:04.940

    @ace Are you available in chat? I would like to ask about your code. – None – 2014-04-29T10:25:47.157

    @Lembik okay I'm on chat – user12205 – 2014-04-29T10:31:57.863

    You can still do it for any n with SSE, you just have to treat the array elements 8 by 8. You can also use char instead of short and treat elements 16 by 16, since the numbers are very small. – Michael M. – 2014-04-29T11:58:38.113


    Perl

    This is nowhere near as fast as the C solution, but it is pretty fast for a high-level interpreted language, I think. It shaves about 40% off the running time of the Python implementation.

    #!/usr/bin/env perl
    
    use v5.10;
    use strict;
    use warnings;
    use Algorithm::Combinatorics qw( variations_with_repetition );
    use List::Util qw( any sum );
    
    use constant {
      N        => 8,
      M        => 8,
      ITERS    => 1000,
    };
    
    my @leadingzerocounts;
    
    my $variations = variations_with_repetition([-1, 1], N + M - 1);
    while (my $S = $variations->next)
    {
      for my $i (1 .. ITERS)
      {
        my @F;
        until (@F and any { $_ } @F)
        {
          @F = map +((-1,0,0,1)[rand 4]), 1..N;
        }
    
        my $j = 0;
        while ($j < M)
        {
          last if sum map $F[$_]*$S->[$j+$_], 0..N-1;
          $leadingzerocounts[$j++]++;
        }
      }
    }
    
    say join ", ", @leadingzerocounts;
    

    The Algorithm::Combinatorics module is available in Ubuntu (sudo apt-get install libalgorithm-combinatorics-perl). The other modules used are core Perl modules, so they should already be installed as part of the base Ubuntu installation.

    tobyink

    Posted 2014-04-28T16:47:41.103

    Reputation: 1 233

    It won't affect speed, but it's the 0..N-1 range in the last map, right? Did you forget to use warnings? :-) Though the logic in the OP is confusing, the sliding window never gets to the last element of S. – user2846289 – 2014-04-29T11:21:09.963

    Ah. I just figured that the arrays were of mismatched sizes, so I disabled warnings allowing the missing elements to be treated as zero. N-1 improves this. And it does actually improve the speed very slightly — it's now about 40% faster than the Python implementation. – tobyink – 2014-04-29T13:22:19.600

    I think your code requires a very modern version of List::Util. On ubuntu 14.04 I get "any" is not exported by the List::Util module – None – 2014-04-30T18:26:00.223

    Oh yes, that's true - you'll probably need to install List::Util off CPAN. any can alternatively be found in List::MoreUtils, which although not a core module is one of the most commonly used CPAN modules. – tobyink – 2014-04-30T18:36:35.527


    Julia: 4.66x slower!

    I am really beginning to doubt the statistics on their website...

    Note that the following Julia code is effectively a direct transcription of the OP's Python code without any optimisations. I use the time() function to exclude Julia's slow startup time...

    srand(27182818284590)
    t = time()
    
    require("Iterators")
    
    n = 8
    m = 8
    iters = 1000
    bothzero = 0
    leadingzerocounts = zeros(m)
    
    for S in Iterators.product(fill([-1,1], n+m-1)...)
        for i = 1:iters
            F = [[-1 0 0 1][rand(1:4)] for j = 1:n]
            while all((x) -> x == 0, F)
                F = [[-1 0 0 1][rand(1:4)] for j = 1:n]
            end
            j = 1
            while j <= m && sum(map(*, F, S[j:j+n-1])) == 0
                leadingzerocounts[j] += 1
                j += 1
            end
        end
    end
    
    println(leadingzerocounts)
    
    t = time() - t
    println("$t seconds")
    

    Julia: 5 m 32.912 s

    OP's code in PyPy: 1 m 11.506 s

    Julia output:

    6332170
    2525472
    1041522
    438761
    193119
    86873
    40705
    19662
    

    nimble agar

    Posted 2014-04-28T16:47:41.103

    Reputation: 167

    Global variables, imports, and array comprehensions are slow. This isn't how one would typically write a Julia program for performance. – Alex A. – 2015-04-12T04:54:16.773

    +1 for your <s>shamelessness</s> sportsmanship. – user12205 – 2014-04-29T14:20:42.400


    Julia: 1 min 21.4s (2.2x faster) (modification of Arman's code)

    OP's code in PyPy: 3 min 1.4 s

    Both done in the REPL, not including time to load packages.

    function foo()
        n = 8
        m = 8
        iters = 1000
        bothzero = 0
        leadingzerocounts = zeros(Int,m)
        P=[-1,0,0,1]

        for S in Iterators.product(fill([-1,1], n+m-1)...)
            Sm=[S...]
            for i = 1:iters
                F = P[rand(1:4,n)]
                while all(F==0)
                    F = P[rand(1:4,n)]
                end
                j = 1

                while j <= m && dot(F,Sm[j:j+n-1]) == 0
                    leadingzerocounts[j] += 1
                    j += 1
                end
            end
        end

        println(leadingzerocounts)
    end
    

    There are some problems with Arman's code making it very slow: it uses a lot of anonymous functions and higher-order functions unnecessarily. To test if all of a vector F is zero, why not just write all(F==0) instead of all(x->x==0,F)? It is shorter, and literally a thousand times faster.

    It also uses sum(map(*,x,y)) as the dot product instead of simply dot(x,y). The first version is 650 times slower for a vector of 10k doubles. And the dot product function is implemented as a for loop in pure Julia.

    Also, array comprehensions are slow. It is better to write [0,1,0,-1][rand(1:4,n)] instead of [[-1 0 0 1][rand(1:4)] for j = 1:n].

    Finally, global variables are bad juju in Julia. Julia is only fast if you write code in such a way that allows the JIT and type inference to work. A big part of this is type stability: The compiler must be able to be sure that a variable's type will not change while inside a loop, for example.

    user20768

    Posted 2014-04-28T16:47:41.103

    Reputation: 111

    Thanks! I see that I still have quite a bit to learn about the Julia language before I can go making claims about its speed :) Really glad to see that a few trivial fixes to my code improve its execution time several fold. – nimble agar – 2014-05-01T06:01:38.690


    RPython 0.187s (258x faster)

    Original Source w/ PyPy2.2.1: 1m 6.718s

    Now with threading, backwards support for standard Python has been dropped. The number of worker threads can be specified as a command line parameter; the default is two.

    from time import time, sleep
    from math import fmod
    
    from rpython.rlib.rthread import start_new_thread, allocate_lock, get_ident
    class Random:
      __slots__ = ['s']
    
      def __init__(self, s=1):
        self.s = s
    
      def init_genrand(self, seed):
        self.s = seed
    
      def genrand32(self):
        # xorshift PRNG with period 2^32-1
        # see http://core.kmi.open.ac.uk/download/pdf/6250138.pdf
        self.s ^= (self.s << 13)
        self.s ^= (self.s >> 17)
        self.s ^= (self.s << 5)
        return self.s
    
    class ThreadEnv:
      __slots__ = ['n', 'm', 'iters', 'counts', 'running', 'lock']
    
      def __init__(self):
        self.n = 8
        self.m = 8
        self.iters = 1000
        self.counts = [0]*8
        self.running = 0
        self.lock = None
    
    env = ThreadEnv()
    truth = [-1,0,0,1]
    
    def main(argv):
      argc = len(argv)
      if argc < 4 or argc > 5:
        print 'Usage: %s N M ITERS [NUM_THREADS=2]'%argv[0]
        return 1
    
      if argc == 5:
        num_threads = int(argv[4])
      else:
        num_threads = 2
    
      env.n = int(argv[1])
      env.m = int(argv[2])
      env.iters = int(argv[3]) // num_threads
      env.counts = [0]*env.m
      env.lock = allocate_lock()
    
      for i in xrange(num_threads-1):
        start_new_thread(run,())
        env.running += 1
    
      env.running += 1
    
      # use the main process as a worker
      run()
    
      # wait for any laggers
      while env.running:
        sleep(0.01)
    
      print env.counts
      return 0
    
    def run():
      n, m, iters = env.n, env.m, env.iters
      counts = [0]*m
      sbits = [0]*(n+m-1)
    
      random = Random()
      seed = int(fmod(time(), 2147483.648)*1000) ^ get_ident()
      random.init_genrand(seed)
    
      for S in xrange(1<<n+m-1):
        i = 0
        sbit = 0
        while not sbit:
          sbits[i] ^= 3
          sbit = sbits[i]
          i += 1
    
        for i in xrange(iters):
          f = 0
          while not f:
            F = random.genrand32()
    
            G, x = F, 0
            for k in xrange(n):
              x += truth[(G&3)^sbits[k]]
              f |= x
              G >>= 2
    
          if not x:
            counts[0] += 1
            for j in xrange(1, m):
              G, x = F, 0
              for k in xrange(j, n+j):
                x += truth[(G&3)^sbits[k]]
                G >>= 2
              if x: break
              counts[j] += 1
    
      # passing True stalls until a lock can be obtained
      env.lock.acquire(True)
    
      for i in xrange(m):
        env.counts[i] += counts[i]
      env.running -= 1
    
      env.lock.release()
    
    def target(*args):
      return main, None
    

    RPython is a restricted subset of Python, which can be translated to C and then compiled using the RPython Toolchain. Its express purpose is to aid in the creation of language interpreters, but it can also be used to compile simple programs like the one above. Most of the 'fancier' features of Python, such as itertools or even map, are not available.

    To compile, make a local clone of the current pypy repository, and run the following:

    $ pypy %PYPY_REPO%/rpython/bin/rpython --thread convolution.py
    

    The resulting executable will be named convolution-c or similar in the current working directory.

    I've parameterized the input variables, so the program should be run as:

    convolution-c 8 8 1000
    

    to match the sample code.


    Implementation Notes

    S in itertools.product([-1,1], repeat = n+m-1) becomes S in xrange(1<<n+m-1), interpreting S as a bit map: [0, 1] → [-1, 1]

    Likewise, F is also a bit map, with each two bits representing a single value:
    [00, 01, 10, 11] → [-1, 0, 0, 1]

    A truth table is used to look up the product, rather than performing a multiplication.
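
    The lookup works because XORing a 2-bit code with 11 negates the value it encodes (00 <-> 11, 01 <-> 10), while XORing with 00 leaves it unchanged, so each element of S can be stored as 00 or 11 and the product read straight out of the table. A tiny Python illustration (mirroring the truth list above):

        truth = [-1, 0, 0, 1]   # value encoded by each 2-bit pattern

        for g in xrange(4):     # all 2-bit codes for an F element
            assert truth[g ^ 0b11] == -truth[g]  # XOR with 11 negates
            assert truth[g ^ 0b00] == truth[g]   # XOR with 00 is identity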

    Because 32-bit signed integers are used, n may be no larger than 15, and n+m no larger than 31. Arbitrary integer support can be achieved with the rpython.rlib.rbigint module, if necessary.

    The first iteration of the dot-product loop is unrolled, and combined with the nullity test of F.

    A homebrew PRNG is used, source listed. The author of the paper demonstrates a period of 2^32-1, and claims that it passes all Diehard tests save one, although I haven't personally confirmed this.

    The random seed changes every millisecond, which is about as good as using a timestamp will allow. Additionally, each worker thread XORs its thread id with this value, to ensure that each has a different seed.


    Sample Timings

    2 worker threads:

    $ timeit convolution-c 8 8 1000 2
    [6331845, 2526161, 1042330, 440018, 193724, 87147, 40943, 19603]
    
    Elapsed Time:     0:00:00.375
    Process Time:     0:00:00.687
    System Calls:     6927
    

    4 worker threads:

    $ timeit convolution-c 8 8 1000 4
    [6334565, 2527684, 1043502, 440216, 193225, 87398, 40799, 19338]
    
    Elapsed Time:     0:00:00.218
    Process Time:     0:00:00.796
    System Calls:     3417
    

    8 worker threads:

    $ timeit convolution-c 8 8 1000 8
    [6327639, 2522483, 1039869, 437884, 192460, 86771, 40420, 19403]
    
    Elapsed Time:     0:00:00.187
    Process Time:     0:00:00.734
    System Calls:     3165
    

    OP's original source:

    $ timeit pypy convolution-orig.py
    [6330610, 2525644, 1041481, 438980, 193001, 86622, 40598, 19449]
    
    Elapsed Time:     0:01:06.718
    Process Time:     0:01:06.718
    System Calls:     11599808
    

    Timing for 100000 iterations:

    $ timeit convolution-c 8 8 100000 8
    [633156171, 252540679, 104129386, 43903716, 19307215, 8709157, 4072133, 1959124]
    
    Elapsed Time:     0:00:16.625
    Process Time:     0:01:02.406
    System Calls:     171341
    

    primo

    Posted 2014-04-28T16:47:41.103

    Reputation: 30 891

    I have never seen an rpython program before. This is great. Now is there an equivalent pure python program that pypy can run in 1.03s? – None – 2014-04-30T11:41:22.903

    @Lembik I'd like to see one. I thought 4.7s was pretty good, considering my first attempt at pure python was ~15s. – primo – 2014-04-30T13:01:29.127

    Yes, sorry for the delay. I haven't got the code running yet but will as soon as possible. – None – 2014-05-02T11:51:49.160

    You should try adding a JIT. Now THAT would be fast! – kirbyfan64sos – 2014-05-02T22:21:51.253

    @Lembik thanks for the mention ;) Out of curiosity, did it run fastest with 4 worker threads, or 8? – primo – 2014-05-02T23:28:55.897

    @kirbyfan64sos I'm not sure what you'd want to JIT compile. The only thing that could possibly define an execution frame would be [S, F], but the code shouldn't ever see the same combination twice. – primo – 2014-05-02T23:31:01.230

    You can just use a value that you know stays the same for the red and all the rest as greens. Then, put a can_enter_loop at the end of every iteration. I wrote an RPython version of the "tree" program with a JIT for fun using that technique. It actually worked! – kirbyfan64sos – 2014-05-03T19:43:36.343

    2

    Nimrod

    import times, locks, strutils, unsigned
    
    const
      N = 8
      M = 8
      iters = 1000
      numThreads = 8
    
    type
      SVec = array[0..N+M-1, int]
      FVec = array[0..N-1, int]
      ComputeThread = TThread[int]
    
    var
      rngSeed = int(epochTime()*1000)
      totalLeadingZeros: array[0..M-1, int]
      lock: TLock
    
    type
      RNGState = object
        x, y, z, w: uint32
    
    proc newRNG(seed: int): RNGState =
      result.x = uint32(seed)
    
    proc random(rng: var RNGState): int =
      let t = rng.x xor (rng.x shl 11)
      rng.x = rng.y; rng.y = rng.z; rng.z = rng.w
      rng.w = rng.w xor (rng.w shr 19) xor t xor (t shr 8)
      result = int(rng.w)
    
    proc initVecRand(v: var FVec, rng: var RNGState) =
      const values = [ -1, 0, 0, 1 ]
      var rnd = rng.random
      var bitAcc = 0
      for i in 0 .. <len(v):
        let val = values[rnd and 3]
        rnd = rnd shr 2
        v[i] = val
        bitAcc = bitAcc or val
      if bitAcc == 0:
        initVecRand(v, rng)
    
    proc convolve(s: SVec, f: FVec, offset: int): int =
      for i in 0 .. <len(f):
        result += s[i+offset]*f[i]
    
    proc iterate(v: var SVec) =
      for i in 0 .. <len(v):
        if v[i] == -1:
          v[i] = 1
          return
        v[i] = -1
    
    proc mainThread(id: int) {.thread.} =
      const numS = 1 shl (N+M-1)
      var
        s: SVec
        f: FVec
        leadingZeros: array[0..M-1, int]
        rng = newRNG(rngSeed + id)
      for k in 0 .. <len(s):
        s[k] = -1
      for i in 1..numS:
        for j in countUp(id, iters, numThreads):
          initVecRand(f, rng)
          if convolve(s, f, 0) == 0:
            leadingZeros[0] += 1
            for k in 1 .. <M:
              if convolve(s, f, k) == 0:
                leadingZeros[k] += 1
              else:
                break
        iterate(s)
      acquire(lock)
      for i in 0 .. <M:
        totalLeadingZeros[i] += leadingZeros[i]
      release(lock)
    
    proc main =
      let startTime = epochTime()
      var threads: array[1..numThreads, ComputeThread]
      initLock(lock)
      for i in 1..numThreads:
        createThread(threads[i], mainThread, i)
      for i in 1..numThreads:
        joinThread(threads[i])
      echo("Leading zeros: ", @totalLeadingZeros)
      let endTime = epochTime()
      echo("Time taken:    ", formatFloat(endTime - startTime, ffDecimal, 3),
           " seconds")
    
    main()
    

    Example output:

    Leading zeros: @[6333025, 2525808, 1042466, 439138, 192391, 86751, 40671, 19525]
    Time taken:    0.145 seconds
    

    Nimrod compiles to C, so the choice of C compiler for the backend matters, too.

    Using clang, compile with:

    nimrod cc --threads:on --cc=clang --passc:-flto -d:release conv.nim
    

    Using gcc, compile with:

    nimrod cc --threads:on --cc=gcc --passc:-flto -d:release conv.nim
    

    Omit --passc:-flto if you have an older C compiler that doesn't support LTO. Omit the --cc=... option if you are fine with the default choice for the C compiler. The code requires Nimrod 0.9.4 or 0.9.5.

    On my quad-core iMac (2.66 GHz Core i5), the code runs in about .15 seconds with gcc 4.9 and .16 seconds with clang, compared to 88 seconds for PyPy 2.2.1 (i.e. a 500+ times speedup). Unfortunately, I don't have access to a machine with more than four cores that also has PyPy installed or where I could easily install PyPy, though I get about .1 seconds (with a lot of measurement noise) on a 64-core AMD Opteron 6376 at 1.4 GHz (according to /proc/cpuinfo) with gcc 4.4.6.

    The implementation tries to be faithful to the original rather than optimizing the code at the cost of readability, while not forgoing obvious optimizations. Interestingly enough, tail recursion in initVecRand() is a bit faster than a loop with a break instruction, with both gcc and clang. Manually unrolling one iteration of the convolve test loop inside the main loop also produced a speedup, presumably due to better branch prediction.
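
    For readers who don't speak Nimrod, here is a rough plain-Python equivalent of initVecRand (my sketch, written as a loop rather than the tail recursion used above): one RNG draw supplies two bits per element, and the whole vector is redrawn whenever it comes out all zero.

    import random

    values = [-1, 0, 0, 1]

    def init_vec_rand(n, rng):
        while True:
            rnd = rng.getrandbits(2 * n)  # two bits per element
            v = [values[(rnd >> (2 * i)) & 3] for i in range(n)]
            if any(v):                    # reject the all-zero vector
                return v

    print(init_vec_rand(8, random.Random(42)))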

    Reimer Behrends

    Posted 2014-04-28T16:47:41.103

    Reputation: 899

    How do you get nimrod for ubuntu? – None – 2014-05-01T13:42:10.667

    @Lembik A quick Google search would give you http://nimrod-lang.org/download.html

    – user12205 – 2014-05-01T13:54:39.403

    @ace I had also included the link in my post (though it's hard to see with blue on black now that I look at it). – Reimer Behrends – 2014-05-01T14:02:32.430

    You could speed this up even more by increasing the seed size to 128 bits: http://xorshift.di.unimi.it/

    – user60561 – 2014-05-02T22:36:49.163

    2

    Java

    I translated the above C++ solution to Java:

    import java.util.Random;
    import java.util.Arrays;
    
    public class Bench2 {
      public static int[] bits = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
      public static int[] oneValues = { 1, 0, 0, 1 };
      public static int[] values = { -1, 0, 0, 1 };
      public static int n = 8;
      public static int m = 8;
      public static int iters = 1000;
    
      private static int x,y=34353,z=57768,w=1564;
    
      public static void main( String[] args ) {
        x = (int) (System.currentTimeMillis()/1000l);
    
        int[] leadingzerocounts = new int[ m ];
        Arrays.fill( leadingzerocounts, 0 );
    
        int maxS = 1 << 15;
    
        for( int s = 0; s < maxS; s++ ) {
          int x = interleaveBit( s );
    
          for( int i=0; i<iters; i++ ) {
            int random;
    
            do {
              random = 0xFFFF & fastRandom( );
            } while( sumOnes( random ) == 0 );
    
            int j = 7;
    
            while( j >= 0 ) {
              int h = ( x >> (j*2) );
              int l = 0xFFFF & (~(random ^ h));
    
              if( sumArray( l ) == 0 ) {
                leadingzerocounts[ j ]++;
              } else {
                break;
              }
    
              j--;
            }
          }
        }
    
        for( int i = 7; i >= 0; --i ) {
          System.out.print( leadingzerocounts[ i ] + " " );
        }
    
        System.out.println( );
      }
    
      public static int interleaveBit( int x ) {
        x = (x | ( x << 8)) & bits[3];
        x = (x | ( x << 4)) & bits[2];
        x = (x | ( x << 2)) & bits[1];
        x = (x | ( x << 1)) & bits[0];
        return x | (x << 1);
      }
    
      public static int sumOnes( int v ) {
        return (0xAAAA & (v ^ ~(v << 1)));
        // int s = 0;
    
        // for( int i = 0; i < 8; ++i ) {
        //   int a = 3 & ( v >> (i*2) );
        //   s += oneValues[ a ];
        // }
    
        // return s;
      }
    
      public static int sumArray( int v ) {
        return Integer.bitCount( v ) - 8;
        // int s = 0;
    
        // for( int i=0; i<8; ++i ) {
        //   int a = 3 & ( v >> (i*2) );
        //   s += values[ a ];
        // }
    
        // return s;
      }
    
      public static int fastRandom( ) {
        long t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return w = (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
      }
    }
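
    The commented-out loops above document what the two bit tricks compute. As a plain-Python sanity check of sumArray (my illustration, not part of the original answer): after l = 0xFFFF & ~(random ^ h), every 2-bit pair of l encodes one product F[k]*S[k] (00 -> -1, 01/10 -> 0, 11 -> +1), and each pair contributes value + 1 set bits, so popcount(l) - 8 recovers the dot product over 8 elements.

    def sum_array(l):
        # equivalent of Integer.bitCount(v) - 8 in the Java code
        return bin(l & 0xFFFF).count("1") - 8

    assert sum_array(0xFFFF) == 8   # all pairs 11 -> eight +1s
    assert sum_array(0x0000) == -8  # all pairs 00 -> eight -1s
    assert sum_array(0x6666) == 0   # pairs 01/10 -> eight zeros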
    

    On my machine I get the following output for the Java program:

    time java Bench2
    6330616 2524569 1040372 439615 193290 87131 40651 19607 
    java Bench2  0.36s user 0.02s system 102% cpu 0.371 total
    

    The OP's program runs for about 53 seconds on my machine:

    time pypy start.py
    [6330944, 2524897, 1040621, 439317, 192731, 86850, 40830, 19555]
    pypy start.py  52.96s user 0.06s system 99% cpu 53.271 total
    

    The C++ program executes in only about 0.15 seconds:

    time ./benchcc
    [6112256, 2461184, 1025152, 435584, 193376, 87400, 40924, 19700]
    ./benchcc  0.15s user 0.00s system 99% cpu 0.151 total
    

    That is about 2.5x faster than the corresponding Java solution (I didn't exclude VM startup). The Java solution is about 142x faster than the program executed with PyPy.

    Since I was personally interested, I set iters to 100_000 for Java and C++, but the factor of 2.5 didn't decrease in favor of Java; if anything, it got bigger.

    EDIT: I ran the programs on a 64-bit Arch Linux PC.

    EDIT2: I want to add that I started with a rough translation of the python code:

    import java.util.Random;
    import java.util.Arrays;
    
    public class Bench {
        public static int[] values = { -1, 0, 0, 1 };
        public static int n = 8;
        public static int m = 8;
        public static int iters = 1000;
    
        private static int x,y=34353,z=57768,w=1564; 
    
        public static void main( String[] args ) {
            x = (int) (System.currentTimeMillis()/1000l);
    
            int[] leadingzerocounts = new int[ m ];
            Arrays.fill( leadingzerocounts, 0 );
    
            int[] S = new int[ n+m-1 ];
            Arrays.fill( S, -1 );
    
            do {
                for( int i=0; i<iters; i++ ) {
                    int[] F = new int[ n ];
    
                    do {
                        randomArray( F );
                    } while( containsOnlyZeros( F ) );
    
                    for( int j=0; j < m && check( F, S, j ); j++ ) {
                        leadingzerocounts[ j ] += 1;
                    }
                }
            } while( next( S ) );
    
            System.out.println( Arrays.toString( leadingzerocounts ) );
        }
    
        public static void randomArray( int[] F ) {
            for( int i = 0; i<F.length; i++ ) {
                F[ i ] = (1-(fastRandom()&3))%2;
            }
        }
    
        public static boolean containsOnlyZeros( int[] F ) {
            for( int x : F ) {
                if( x != 0 ) {
                    return false;
                }
            }
    
            return true;
        }
    
        public static boolean next( int[] S ) {
            for( int i=0; i<S.length; i++ ) {
                if( ( S[ i ] = -S[ i ] ) == 1 ) {
                    return true;    
                }
            }
    
            return false;
        }
    
        public static boolean check( int[] F, int[] S, int j ) {
          int sum = 0;
    
          for( int i=0; i<n; i++ ) {
              sum += F[ i ] * S[ j + i ];
          }
    
          return sum == 0;
        }
    
        public static int fastRandom( ) {
            long t;
            t = x ^ (x << 11);
            x = y; y = z; z = w;
            return w = (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
        }
    }
    

    This program ran for about 3.6 seconds:

    time java Bench   
    [6330034, 2524369, 1040723, 439261, 193673, 87338, 40840, 19567]
    java Bench  3.64s user 0.01s system 101% cpu 3.600 total
    

    This is about 14 times faster than the PyPy solution. (Choosing the standard random function over the fastRandom function leads to an execution time of 5 seconds.)

    dinfuehr

    Posted 2014-04-28T16:47:41.103

    Reputation: 121

    2

    Python 3.5 + numpy 1.10.1, 3.76 seconds

    The tests were run on my MacBook Pro. OP's code took ~6 mins on the same machine.

    In fact, the reason I'm answering this question is that I don't have 10 reputation and so can't answer Part I :-p

    For the past few days, I have been trying to figure out how to perform massive convolutions efficiently with numpy (without relying on a third-party package, even scipy). As I came across this series of challenges during my research, I decided to give it a try. I may have come to this game too late, but here is my attempt using Python 3.5 and numpy 1.10.1.

    import itertools
    import numpy as np

    def test_convolv():
        n = 8 
        m  = 8 
        iters = 1000
        ilow = np.ceil(0+n/2).astype(int)
        ihigh = np.ceil(m+n/2).astype(int)
    
        leadingzerocounts = np.zeros(m)
    
        # Pre-compute S & F
        S = np.array(list(itertools.product([-1,1], repeat = n+m-1)))
        choicesF = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n*iters).reshape(iters,n)
        imask = ~np.any(choicesF, axis=1)
        while np.any(imask):
            imasksize = np.count_nonzero(imask)
            choicesF[imask,:] = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n*imasksize).reshape(imasksize, n)
            imask = ~np.any(choicesF, axis=1)
    
        for i in np.arange(iters):
            F = choicesF[i, :]
            # This is where the magic is: by flattening the S array, 
            # I try to take advantage of speed of the np.convolve 
            # (really numpy.multiarray.correlate). 
            FS = (np.convolve(S.reshape(-1), F, 'same').reshape(S.shape))[:, ilow:ihigh]
            jmask_not = (FS[:, 0] != 0)
            leadingzerocounts[0] = leadingzerocounts[0]+np.count_nonzero(~jmask_not)
            for j in np.arange(n-1)+1:
                jmask = (FS[jmask_not, j] != 0)
                leadingzerocounts[j] = leadingzerocounts[j] + np.count_nonzero(~jmask)
                jmask_not[(jmask_not.nonzero()[0])[jmask]] = False
    
        print(leadingzerocounts)
    

    I pre-computed the S and F arrays, and flattened the S array while performing the convolution, which (based on my experiments) could take advantage of the speed of np.convolve. In other words, since I didn't find a vectorized convolution routine, I fake-vectorized the code by flattening the whole array and hoping np.convolve would do the vectorization under the hood for me, which seemed to work. Note that I used mode='same' and trimmed the leading and trailing elements that were useless.
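
    A minimal sketch (mine, not from the answer) of why a 1-D convolution yields sliding-window dot products; the explicit reversal below accounts for np.convolve flipping its second argument, which the answer can ignore because the entries of F are i.i.d., so a reversed F has the same distribution:

    import numpy as np

    x = np.array([1, -1, 1, 1, -1])
    F = np.array([-1, 0, 1])
    windows = np.convolve(x, F[::-1], mode='valid')      # sliding dot products
    manual = np.array([x[i:i+3] @ F for i in range(3)])  # computed directly
    assert np.array_equal(windows, manual)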

    On my Macbook Pro, the test results give 3.76 seconds. When I ran OP's code (modified to Python 3.5), I got about 6 minutes. The speedup is about 100 times.

    One drawback is that, because the S and F arrays must be stored, the memory requirement can be a problem if the sizes are too big.

    I used the same method for Part I and got a ~60-100x speedup on my laptop.

    Since I did everything on my MacBook Pro, if someone could test my code and let me know how it goes on their machine, I would appreciate it very much!

    Trying.Too.Hard

    Posted 2014-04-28T16:47:41.103

    Reputation: 21

    1

    J, 130x~50x speedup?

    n =: m =: 8
    len =: 1000
    S =: (] - 0 = ])S0=: #:i.2^<:+/n,m
    k =: (n#0) -.~ (_1 0 0 1) {~ (n#4) #: i.4^n
    sn =: (]-0=])#:i.2^n
    ku =: ~. k
    M =: 0=+/"1 sn *"1/ ku
    fs =: (ku&i.)"1 k
    snum =: n #.\"1 S0
    
    run =: 3 : 0
     r =: n#0
     for_t. (snum) do.
       rn =: fs{~? len # #k
       r =: r + +/"1*/\rn{"1 t{M
     end.
     r
    )
    echo run 0
    exit''
    

    Times on a random Debian box:

    u#>time j slowpy.ijs
    6334123 2526955 1041600 440039 193567 87321 40754 19714
    
    real    0m2.453s
    user    0m2.368s
    sys     0m0.084s
    
    
    u#>time python slow_pyth.py
    [6331017, 2524166, 1041731, 438731, 193599, 87578, 40919, 19705]
    
    real    5m25.541s
    user    5m25.548s
    sys     0m0.012s
    

    I think there is room for improvement.

    Eelvex

    Posted 2014-04-28T16:47:41.103

    Reputation: 5 204

    The Python script is supposed to be executed using pypy, not python, which is why your script seems to be giving 130x speed-up. – user12205 – 2014-04-29T12:11:46.857

    @ace yes I noticed but I can't install pypy :-/ I think the order of magnitude will remain though. – Eelvex – 2014-04-29T12:25:59.000

    Not necessarily... http://i.imgur.com/n566hzw.png

    – user12205 – 2014-04-29T12:45:55.583

    Indeed, not necessarily. – Eelvex – 2014-04-29T12:48:36.230

    What problem do you have installing pypy? – None – 2014-04-29T13:16:03.290

    @Lembik I'm not sure; something is not quite right with this debian installation. 'sid/main puppy ... 404 Not Found' – Eelvex – 2014-04-29T13:20:38.427

    @ace, it seems on par with the C version, no? – Eelvex – 2014-04-29T13:21:57.277

    Just grab https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-2.2.1-linux_x86_64-portable.tar.bz2 if you are on 64 bit debian.

    – None – 2014-04-29T13:22:33.143

    @Lembik why don't you just grab a J interpreter and time it on your machine? I'm sure Eelvex can provide a download link or something like that... – user12205 – 2014-04-29T14:23:15.490

    @Lembik thanks I'll try that. You can get J here if you like.

    – Eelvex – 2014-04-29T17:30:31.270

    1

    Fortran: 316x

    Okay, Fortran: I've got it up to a 316x speedup (previously 106x, then 155x, then 160x) when using an Xorshift RNG and OpenMP on a 4-core i7 CPU. Other than that, there are no big tricks. For the iterator to construct S, I just use the binary representation of the 16-bit integer i. You'll note that, apart from the inline RNG and the "iterator"/mapping from i to S, the code is just as high-level as the Python code.

    Edit: removed the "if" in the Xorshift, now using "r = abs(w/...)" instead of "r=w/...". Goes from 106x to 155x.

    Edit2: This generates 15x as many random numbers as the C++ solution. If someone has a zero-overhead solution for converting a random int into an array of 0s and 1s in Fortran, I'm all ears. Then we could beat C++ :)

    Edit3: The first edit introduced a bug, as Lembik pointed out. This is fixed now, with a tiny improvement in speedup. I will try to use the suggestion by Eelvex to get more speedup.

    Edit4: Profiling indicated that converting to real and back to integer with nint() was slow. I replaced this with one integer division doing both scaling and rounding, going from 160x to 316x speedup.
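
    A quick plain-Python illustration of that Edit4 trick (my sketch; int() truncates toward zero, mimicking Fortran's integer division): dividing a signed 32-bit value by 2**30 lands in {-1, 0, 1} (up to the single edge case -2**31) with probabilities close to 1/4, 1/2, 1/4, matching random.choice([-1, 0, 0, 1]).

    import random
    from collections import Counter

    draws = (int(random.randint(-2**31, 2**31 - 1) / 2**30) for _ in range(10**6))
    print(Counter(draws))  # roughly {0: 500000, 1: 250000, -1: 250000}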

    Compile with:

    gfortran -O3 -march=native -fopenmp golf.f90

    program golf
    implicit none
    integer, parameter :: m=8, n=8
    integer :: F(n), S(m+n-1), leadingzerocounts(m)
    integer :: j,k,bindec,enc,tmp,x=123456789,y=362436069,z=521288629,w=88675123
    integer*2 :: i
    real :: r
    
    leadingzerocounts=0
    
    !$OMP parallel do private(i,enc,j,bindec,S,F,k,tmp,x,y,z,w,r) reduction(+:leadingzerocounts) schedule(dynamic)
    do i=0,32766
      enc=i
      ! Short loop to convert i into the array S with -1s and 1s
      do j=16,2,-1
        bindec=2**(j-1)
        if (enc-bindec .ge. 0) then
          S(j-1)=1
          enc=enc-bindec
        else
          S(j-1)=-1
        endif
      end do
      do j=1,1000
        F=0
        do while (.not. any(F /= 0))
          do k=1,n
            ! Start Xorshift RNG
            tmp = ieor(x,ishft(x,11))
            x = y
            y = z
            z = w
            w = ieor(ieor(w,ishft(w,-19)),ieor(tmp,ishft(tmp,-8)))
            ! End Xorshift RNG
            ! Just scale it inside the nint:
            !F(k)=nint(w/2147483648.0)
            ! Scaling by integer division is faster, but then we need the random 
            ! number to be in (-2,2) instead of [-1,1]:
            F(k)=w/1073741824
    
          end do
        end do
        do k=1,m
          if (dot_product(F,S(k:k+n-1)) /= 0) exit
          leadingzerocounts(k)=leadingzerocounts(k)+1
        end do
      end do
    end do
    !$OMP end parallel do
    
    print *, leadingzerocounts
    
    end
    

    Example output:

    $ time ./a.out
    6329624 2524831 1039787 438809 193044 6860 40486 19517
    ./a.out 1.45s user 0.00s system 746% cpu 0.192 total

    OP's code:

    $ time pypy golf.py
    pypy golf.py 60.68s user 0.04s system 99% cpu 1:00.74 total

    semi-extrinsic

    Posted 2014-04-28T16:47:41.103

    Reputation: 641

    What I used in J was a prebuild list of 4^n numbers in base-4, then converted to triadic and excluding 0. The RNG just chooses from this list. – Eelvex – 2014-04-30T15:23:38.223

    I am not sure your code is correct. For 100,000 iterations I get 633140285 271390368 118307997 52751245 23725837 10744292 4944464 2388125 but I think it should be closer to 633170604 252560981 104156146 43911426 19316309 8713324 4073378 1959440. This is a consistent difference between runs. – None – 2014-05-01T10:36:53.463

    1Ah, thanks, @Lembik, my edit to speedup (removing the if-statement) was indeed a bug. I've updated my code so it should be correct now. I will try to post a version using the suggestion by Eelvex later. – semi-extrinsic – 2014-05-01T18:29:44.980

    That also sped it up it seems! – None – 2014-05-01T18:34:55.027

    Yes, slight speedup I guess. I realized I was adding 1.0 and then subtracting 1.0 inside a tight loop. – semi-extrinsic – 2014-05-01T20:17:33.463

    1

    C++: x200 (4-core i7, should scale to x400 on 8-core)

    Trying for a more straightforward C++11 (Tested with VS 2012, gcc and clang) solution with parallelization.

    To get this to compile and run under Linux with gcc 4.8.1:

    g++ -O3 -msse -msse2 -msse3 -march=native -std=c++11 -pthread -Wl,--no-as-needed golf.cpp

    Under Linux we also need std::launch::async to force multiple threads. I was missing that in an earlier version.

    In Visual Studio (2012+) this should just work, but make a release build for timing...

    On my oldish dual-core i3 this runs in ~0.9 seconds. On my quad-core i7 it runs in 0.319s vs. 66 seconds for pypy.

    On an 8-core i7 this should be in the x400 speedup range. Switching to C-style arrays would speed it up, but I was interested in staying with C++ containers. For me it's interesting to see the speedup you can get while staying relatively close to the problem domain and at a relatively high level, something I think C++ is really good at. Also of note is the relative ease of parallelization using C++11 constructs.

    @ilmale's bit solution is very cool and works for -1/1/0. One could also throw SSE at this and maybe get a significant speedup.

    Beyond the parallelization, there's another "trick" in there, which is reducing the number of summations (sketched below). Sample results: 6332947 2525357 1041957 438353 193024 87331 40902 19649
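
    Here is that trick as a plain-Python sketch (my paraphrase of the code below): each trial records only the index of the first non-zero window, and the cumulative leadingzerocounts are recovered with a single suffix sum at the end, instead of incrementing one counter per zero window.

    m = 8
    hist = [0] * (m + 1)  # hist[k]: trials whose first non-zero window is k
                          # (hist[m]: all m windows were zero)
    # ... per trial: hist[first_nonzero_index] += 1 ...
    for i in range(m - 1, 0, -1):
        hist[i] += hist[i + 1]    # suffix sum, as in convolve_random_arrays
    leadingzerocounts = hist[1:]  # entry j: trials with at least j+1 zero windows

    The full C++ source follows: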

    #include <vector>
    #include <iostream>
    #include <thread>
    #include <future>
    #include <time.h>
    #include <ctime>
    #include <algorithm>
    
    using namespace std;
    
    // Bring some of these constants out to share
    const size_t m = 8;
    const int nthreads = 16;
    const size_t cn = 15;
    const int two_to_cn = 32768;
    
    static unsigned int seed = 35;
    
    int my_random() // not thread safe but let's call that more random!
    {
       seed = seed*1664525UL + 1013904223UL; // Numerical Recipes, 32 bit
       return ((seed>>30)&1)-!!((seed>>30)&2); // Credit to Dave!
    }
    
    bool allzero(const vector<int>& T)
    {
       for(auto x : T)
       {
          if(x!=0)
          {
             return false;
          }
       }
       return true;
    }
    
    // Return the position of the first non-zero element
    size_t convolve_until_nonzero(size_t max_n, const vector<int>& v1, const vector<int>& v2)
    {
       for(size_t i = 0; i<max_n; ++i)
       {
          int result = 0;
          for(size_t j = 0; j<v2.size(); ++j)
          {
             result += v1[i+j]*v2[j];
          }
          if(result!=0)
          {
             return i;
          }
       }
       return max_n;
    }
    
    void advance(vector<int>& v)
    {
       for(auto &x : v)
       {
          if(x==-1)
          {
             x = 1;
             return;
          }
          x = -1;
       }
    }
    
    vector<int> convolve_random_arrays(vector<int> S, int range)
    {
       const int iters = 1000;
       int bothzero = 0;
       int firstzero = 0;
    
       time_t current_time;
       time(&current_time);
       seed = current_time;
    
    
       vector<int> F(m);
       vector<int> leadingzerocounts(m+1);
    
       for(auto &x: leadingzerocounts)
       {
          x = 0;
       }
    
       for(int i=0; i<range; ++i)
       {
          for(int j=0; j<iters; ++j)
          {
             do
             {
                for(auto &x : F)
                {
                   x = my_random();
                }
             } while(allzero(F));
             leadingzerocounts[convolve_until_nonzero(m, S, F)]++;
          }
          advance(S);
       }
    
       // Finish adding things up...
       for(int i=m-1; i>0; --i)
       {
          leadingzerocounts[i] += leadingzerocounts[i+1];
       }
    
       vector<int> withoutfirst(leadingzerocounts.begin()+1, leadingzerocounts.end());
       return withoutfirst;
    }
    
    int main(int argc, char* argv[])
    {
    
       vector<int> leadingzerocounts(m);
    
       for(auto &x: leadingzerocounts)
       {
          x = 0;
       }
    
       clock_t start = clock();
    
       vector<int> S(cn);
       for(auto &x : S)
       {
          x = -1;
       }
    
       vector< future< vector< int > > > fs; // The future results of the threads
    
       // Go make threads to work on parts of the problem
       for(int i=0; i<nthreads; ++i)
       {
          vector<int> S_reversed = S; // S counts using LSBs but we want the thread start to be in MSBs
          reverse(S_reversed.begin(), S_reversed.end());
          fs.push_back(async(std::launch::async, convolve_random_arrays, S_reversed, two_to_cn/nthreads));
          advance(S);
       }
       // And now collect the data
       for(auto &f : fs)
       {
          vector<int> result = f.get();
          for(int i=0; i<result.size(); ++i)
          {
             leadingzerocounts[i] += result[i];
          }
       }
    
       for(auto count : leadingzerocounts)
       {
          cout << count << endl;
       }
    
       return 0;
    }
    

    Guy Sirton

    Posted 2014-04-28T16:47:41.103

    Reputation: 551