150

I have this code which I have written in Python/NumPy

from __future__ import division
import numpy as np
import itertools

n = 6
iters = 1000
firstzero = 0
bothzero = 0
""" The next line iterates over arrays of length n+1 which contain only -1s and 1s """
for S in itertools.product([-1, 1], repeat=n+1):
    """For i from 0 to iters -1 """
    for i in xrange(iters):
        """ Choose a random array of length n.
            Prob 1/4 of being -1, prob 1/4 of being 1 and prob 1/2 of being 0. """
        F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)
        """The next loop just makes sure that F is not all zeros."""
        while np.all(F == 0):
            F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)
        """np.convolve(F, S, 'valid') computes two inner products between
        F and the two successive windows of S of length n."""
        FS = np.convolve(F, S, 'valid')
        if FS[0] == 0:
            firstzero += 1
        if np.all(FS == 0):
            bothzero += 1

print("firstzero: %i" % firstzero)
print("bothzero: %i" % bothzero)

It is counting the number of times the convolution of two random arrays, one which is one longer than the other, with a particular probability distribution, has a 0 at the first position or a 0 in both positions.

I had a bet with a friend who says Python is a terrible language to write code in that needs to be fast. It takes 9s on my computer. He says it could be made 100 times faster if written in a "proper language".

The challenge is to see if this code can indeed by made 100 times faster in any language of your choice. I will test your code and the fastest one week from now will win. If anyone gets below 0.09s then they automatically win and I lose.

Status

Python. 30 times speed up by Alistair Buxon! Although not the fastest solution it is in fact my favourite.
Octave. 100 times speed up by @Thethos.
Rust. 500 times speed up by @dbaupp.
C++. 570 times speed up by Guy Sirton.
C. 727 times speed up by @ace.
C++. Unbelievably fast by @Stefan.

The fastest solutions are now too fast to sensibly time. I have therefore increased n to 10 and set iters = 100000 to compare the best ones. Under this measure the fastest are.

C. 7.5s by @ace.
C++. 1s by @Stefan.

My Machine The timings will be run on my machine. This is a standard ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.

Follow up posted As this competition was rather too easy to get a x100 speedup, I have posted a followup for those who want to exercise their speed guru expertise. See How Slow Is Python Really (Part II)?

fastest-code

user9206

Posted 2014-04-26T18:26:17.890

Reputation:

Answers

C++ bit magic

0.84ms with simple RNG, 1.67ms with c++11 std::knuth

0.16ms with slight algorithmic modification (see edit below)

The python implementation runs in 7.97 seconds on my rig. So this is 9488 to 4772 times faster depending on what RNG you choose.

#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <tuple>

#if 0
// C++11 random
std::random_device rd;
std::knuth_b gen(rd());

uint32_t genRandom()
{
    return gen();
}
#else
// bad, fast, random.

uint32_t genRandom()
{
    static uint32_t seed = std::random_device()();
    auto oldSeed = seed;
    seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit
    return oldSeed;
}
#endif

#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif



std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;

    uint32_t S = (1 << (n+1));
    // generate all possible N+1 bit strings
    // 1 = +1
    // 0 = -1
    while ( S-- )
    {
        uint32_t s1 = S % ( 1 << n );
        uint32_t s2 = (S >> 1) % ( 1 << n );
        uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
        static_assert( n < 16, "packing of F fails when n > 16.");


        for( unsigned i = 0; i < iters; i++ )
        {
            // generate random bit mess
            uint32_t F;
            do {
                F = genRandom() & fmask;
            } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

            // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
            // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
            // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
            // this results in the distribution ( -1, 0, 0, 1 )
            // to ease calculations we generate r = LSB(F) and l = MSB(F)

            uint32_t r = F % ( 1 << n );
            // modulo is required because the behaviour of the leftmost bit is implementation defined
            uint32_t l = ( F >> 16 ) % ( 1 << n );

            uint32_t posBits = l & ~r;
            uint32_t negBits = ~l & r;
            assert( (posBits & negBits) == 0 );

            // calculate which bits in the expression S * F evaluate to +1
            unsigned firstPosBits = ((s1 & posBits) | (~s1 & negBits));
            // idem for -1
            unsigned firstNegBits = ((~s1 & posBits) | (s1 & negBits));

            if ( popcnt( firstPosBits ) == popcnt( firstNegBits ) )
            {
                firstZero++;

                unsigned secondPosBits = ((s2 & posBits) | (~s2 & negBits));
                unsigned secondNegBits = ((~s2 & posBits) | (s2 & negBits));

                if ( popcnt( secondPosBits ) == popcnt( secondNegBits ) )
                {
                    bothZero++;
                }
            }
        }
    }

    return std::make_pair(firstZero, bothZero);
}

int main()
{
    typedef std::chrono::high_resolution_clock clock;
    int rounds = 1000;
    std::vector< std::pair<unsigned, unsigned> > out(rounds);

    // do 100 rounds to get the cpu up to speed..
    for( int i = 0; i < 10000; i++ )
    {
        convolve();
    }


    auto start = clock::now();

    for( int i = 0; i < rounds; i++ )
    {
        out[i] = convolve();
    }

    auto end = clock::now();
    double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;

#if 0
    for( auto pair : out )
        std::cout << pair.first << ", " << pair.second << std::endl;
#endif

    std::cout << seconds/rounds*1000 << " msec/round" << std::endl;

    return 0;
}

Compile in 64-bit for extra registers. When using the simple random generator the loops in convolve() run without any memory access, all variables are stored in the registers.

How it works: rather than storing S and F as in-memory arrays, it is stored as bits in an uint32_t.
For S, the n least significant bits are used where an set bit denotes an +1 and an unset bit denotes an -1.
F requires at least 2 bits to create an distribution of [-1, 0, 0, 1]. This is done by generating random bits and examining the 16 least significant (called r) and 16 most significant bits (called l). If l & ~r we assume that F is +1, if ~l & r we assume that F is -1. Otherwise F is 0. This generates the distribution we're looking for.

Now we have S, posBits with an set bit on every location where F == 1 and negBits with an set bit on every location where F == -1.

We can prove that F * S (where * denotes multiplication) evaluates to +1 under the condition (S & posBits) | (~S & negBits). We can also generate similar logic for all cases where F * S evaluates to -1. And finally, we know that sum(F * S) evaluates to 0 if and only if there is an equal amount of -1's and +1's in the result. This is very easy to calculate by simply comparing the number of +1 bits and -1 bits.

This implementation uses 32 bit ints, and the maximum n accepted is 16. It is possible to scale the implementation to 31 bits by modifying the random generate code, and to 63 bits by using uint64_t instead of uint32_t.

edit

The folowing convolve function:

std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;
    uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
    static_assert( n < 16, "packing of F fails when n > 16.");


    for( unsigned i = 0; i < iters; i++ )
    {
        // generate random bit mess
        uint32_t F;
        do {
            F = genRandom() & fmask;
        } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

        // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
        // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
        // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
        // this results in the distribution ( -1, 0, 0, 1 )
        // to ease calculations we generate r = LSB(F) and l = MSB(F)

        uint32_t r = F % ( 1 << n );
        // modulo is required because the behaviour of the leftmost bit is implementation defined
        uint32_t l = ( F >> 16 ) % ( 1 << n );

        uint32_t posBits = l & ~r;
        uint32_t negBits = ~l & r;
        assert( (posBits & negBits) == 0 );

        uint32_t mask = posBits | negBits;
        uint32_t totalBits = popcnt( mask );
        // if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
        if ( totalBits & 1 )
            continue;

        uint32_t adjF = posBits & ~negBits;
        uint32_t desiredBits = totalBits / 2;

        uint32_t S = (1 << (n+1));
        // generate all possible N+1 bit strings
        // 1 = +1
        // 0 = -1
        while ( S-- )
        {
            // calculate which bits in the expression S * F evaluate to +1
            auto firstBits = (S & mask) ^ adjF;
            auto secondBits = (S & ( mask << 1 ) ) ^ ( adjF << 1 );

            bool a = desiredBits == popcnt( firstBits );
            bool b = desiredBits == popcnt( secondBits );
            firstZero += a;
            bothZero += a & b;
        }
    }

    return std::make_pair(firstZero, bothZero);
}

cuts the runtime to 0.160-0.161ms. Manual loop unroll (not pictured above) makes that 0.150. The less trivial n=10, iter=100000 case runs under 250ms. I'm sure i can get it under 50ms by leveraging additional cores but that's too easy.

This is done by making the inner loop branch free and swapping the F and S loop.
If bothZero is not required i can cut down the run time to 0.02ms by sparsely looping over all possible S arrays.

Stefan

Posted 2014-04-26T18:26:17.890

Reputation: 891

You may be able to gain a speedup by precomputing 1<<n. – None – 2018-11-01T10:35:43.640

3Could you provide a gcc friendly version and also what your command line would be please? I am not sure I can test it currently. – None – 2014-05-01T17:49:03.590

I know nothing about this but google tells me that __builtin_popcount might be a replacement for _mm_popcnt_u32() . – None – 2014-05-01T17:58:37.827

3Code updated, uses #ifdef switch to select the correct popcnt command. It compiles with -std=c++0x -mpopcnt -O2 and takes 1.01ms to run in 32 bit mode (i don't have a 64-bit GCC version at hand). – Stefan – 2014-05-01T18:06:25.007

Could you make it print the output? I am not sure if it is actually doing anything currently :) – None – 2014-05-01T18:26:07.663

output: http://pastebin.com/Xg7xVst2 (you'll have to change the #if 0 on line 123 for it to print the output of convolve())

– Stefan – 2014-05-01T18:31:05.283

Thanks. It seems there is no randomness in your code, right? I mean you don't select a random seed. It does appear to be terrifyingly fast. Try increasing the number of iterations to 100,000. – None – 2014-05-01T18:43:49.213

Could you explain your solution by any chance? It really is amazingly fast and I think people would be interested. – None – 2014-05-01T18:53:19.073

@Lembik There is no randomness, only pseudo-randomness in the simple rng case because the seed is fixed. This is not too hard to fix by changing the seed. I've updated the answer with an explanation. – Stefan – 2014-05-01T19:57:33.713

Runs in 1.02 ms on my laptop when compiled with 64-bit gcc. – ceased to turn counterclockwis – 2014-05-01T20:17:21.217

@Stefan This is impressive work! I would hate to have to be the one to fix any bugs in it though :p bit shifting magic. – oligofren – 2014-05-02T09:27:02.837

8You are clearly a wizard. + 1 – BurntPizza – 2014-05-02T15:33:22.537

Hmm, I've got my work cut out for me. Glad that this epic code got me the populist badge. – Kyle Kanos – 2014-05-06T12:44:51.050

Python2.7 + Numpy 1.8.1: 10.242 s

Fortran 90+: 0.029 s 0.003 s 0.022 s 0.010 s

Damn straight you lost your bet! Not a drop of parallelization here too, just straight Fortran 90+.

EDIT I've taken Guy Sirton's algorithm for permuting the array S (good find :D). I apparently also had the -g -traceback compiler flags active which were slowing this code down to about 0.017s. Currently, I am compiling this as

ifort -fast -o convolve convolve_random_arrays.f90

For those who don't have ifort, you can use

gfortran -O3 -ffast-math -o convolve convolve_random_arrays.f90

EDIT 2: The decrease in run-time is because I was doing something wrong previously and got an incorrect answer. Doing it the right way is apparently slower. I still can't believe that C++ is faster than mine, so I'm probably going to spend some time this week trying to tweak the crap out of this to speed it up.

EDIT 3: By simply changing the RNG section using one based on BSD's RNG (as suggested by Sampo Smolander) and eliminating the constant divide by m1, I cut the run-time to the same as the C++ answer by Guy Sirton. Using static arrays (as suggested by Sharpie) drops the run-time to under the C++ run-time! Yay Fortran! :D

EDIT 4 Apparently this doesn't compile (with gfortran) and run correctly (incorrect values) because the integers are overstepping their limits. I've made corrections to ensure it works, but this requires one to have either ifort 11+ or gfortran 4.7+ (or another compiler that allows iso_fortran_env and the F2008 int64 kind).

Here's the code:

program convolve_random_arrays
   use iso_fortran_env
   implicit none
   integer(int64), parameter :: a1 = 1103515245
   integer(int64), parameter :: c1 = 12345
   integer(int64), parameter :: m1 = 2147483648
   real, parameter ::    mi = 4.656612873e-10 ! 1/m1
   integer, parameter :: n = 6
   integer :: p, pmax, iters, i, nil(0:1), seed
   !integer, allocatable ::  F(:), S(:), FS(:)
   integer :: F(n), S(n+1), FS(2)

   !n = 6
   !allocate(F(n), S(n+1), FS(2))
   iters = 1000
   nil = 0

   !call init_random_seed()

   S = -1
   pmax = 2**(n+1)
   do p=1,pmax
      do i=1,iters
         F = rand_int_array(n)
         if(all(F==0)) then
            do while(all(F==0))
               F = rand_int_array(n)
            enddo
         endif

         FS = convolve(F,S)

         if(FS(1) == 0) then
            nil(0) = nil(0) + 1
            if(FS(2) == 0) nil(1) = nil(1) + 1
         endif

      enddo
      call permute(S)
   enddo

   print *,"first zero:",nil(0)
   print *," both zero:",nil(1)

 contains
   pure function convolve(x, h) result(y)
!x is the signal array
!h is the noise/impulse array
      integer, dimension(:), intent(in) :: x, h
      integer, dimension(abs(size(x)-size(h))+1) :: y
      integer:: i, j, r
      y(1) = dot_product(x,h(1:n-1))
      y(2) = dot_product(x,h(2:n  ))
   end function convolve

   pure subroutine permute(x)
      integer, intent(inout) :: x(:)
      integer :: i

      do i=1,size(x)
         if(x(i)==-1) then
            x(i) = 1
            return
         endif
         x(i) = -1
      enddo
   end subroutine permute

   function rand_int_array(i) result(x)
     integer, intent(in) :: i
     integer :: x(i), j
     real :: y
     do j=1,i
        y = bsd_rng()
        if(y <= 0.25) then
           x(j) = -1
        else if (y >= 0.75) then
           x(j) = +1
        else
           x(j) = 0
        endif
     enddo
   end function rand_int_array

   function bsd_rng() result(x)
      real :: x
      integer(int64) :: b=3141592653
      b = mod(a1*b + c1, m1)
      x = real(b)*mi
   end function bsd_rng
end program convolve_random_arrays

I suppose the question now is will you stop using slow-as-molasses Python and use fast-as-electrons-can-move Fortran ;).

Kyle Kanos

Posted 2014-04-26T18:26:17.890

Reputation: 4 270

Excellent +1 :) – Timtech – 2014-04-26T19:35:25.540

1Wouldn't the case statement be faster than a generator function anyway? Unless you're expecting some kind of branch-prediction/cache-line/etc speedup? – OrangeDog – 2014-04-26T20:48:09.600

17Speed should be compared on the same machine. What runtime did you get for the OP's code? – nbubis – 2014-04-26T21:59:05.363

Though not specified in the question, I think the point is that generating the S array should not depend on what n is, i.e. your algorithm should work for all n, it's just that in this test case n happens to be 6, so I don't think that hard-coding the arrays is a good idea – user12205 – 2014-04-26T23:06:49.803

This also doesn't generate the random array correctly. OP's code checks that at least one value in the random array is none-zero, and if it isn't, it generates a new random array. Don't ask me why it does this, but it does. – Alistair Buxton – 2014-04-27T02:49:52.997

@nbubis: I've only got Python2.6 and Numpy 1.4 installed (running SL6.5), so OPs code gives me an error about choice not existing in numpy.random. I've got an Ubuntu livedisk lying at the office, I'll update my answer with python timing tomorrow. – Kyle Kanos – 2014-04-27T03:12:00.133

@ace: I concur that hard-coding isn't the best solution, but I'm not sure how itertools.product does what it does, so I'm kinda stuck there. If I figure it out, I'll certainly update this. – Kyle Kanos – 2014-04-27T03:13:30.180

@AlistairBuxton: That one is easily fixed and, not surprisingly, doesn't change the runtime. – Kyle Kanos – 2014-04-27T03:13:50.313

@nbubis: Scratch my previous comment, I'd forgotten I did install Python2.7.6 a while back. I'm updating the python code time now. – Kyle Kanos – 2014-04-27T03:22:45.910

Your answer looks great but I can't reproduce it on my machine because of the problem with the hard coding. SO has a number of answers about how to implement itertools.product I see. E.g. http://stackoverflow.com/questions/1681269/how-code-a-function-similar-to-itertools-product-in-python-2-5

– None – 2014-04-27T05:51:41.807

@Lembik: I've updated my answer, you should be able to compile it now (instructions included). – Kyle Kanos – 2014-04-27T15:16:20.503

Thanks. I have gfortran as you guessed but haven't managed to get it to compile yet. Of course I have never compiled any fortran program before :) – None – 2014-04-27T16:54:28.153

@Lembik: Join me in chat so I can help you compile it.

– Kyle Kanos – 2014-04-27T16:57:03.323

Is your fortran compiler inlining the functions? – Guy Sirton – 2014-04-27T19:56:03.820

@GuySirton: -fast does inline the functions for ifort – Kyle Kanos – 2014-04-27T20:01:00.410

3The C++ answer implements its own, very lightweight random number generator. Your answer used the default that comes with the compiler, which could be slower? – Sampo Smolander – 2014-04-30T14:26:22.880

3Also, the C++ example appears to be using statically allocated arrays. Try using fixed-length arrays that are set at compile time and see if it shaves any time off. – Sharpie – 2014-04-30T14:45:47.470

@SampoSmolander: Thanks for the suggestion, it certainly helped! – Kyle Kanos – 2014-04-30T15:09:09.313

@Sharpie: Thanks for the suggestion, it certainly helped! – Kyle Kanos – 2014-04-30T15:09:26.150

Thanks for this. I never knew Fortran could be so elegant. – Rick Minerich – 2014-04-30T16:29:56.800

@RickMinerich: Haha, that's only true of Fortran 90+ as F77 is still amazingly hideous! – Kyle Kanos – 2014-04-30T16:35:30.353

What are your compile options? gfortran -O3 -ffast-math -o KyleKanos KyleKanos.f90 gives me integer, parameter :: m1 = 2147483648 1 Error: Integer too big for its kind at (1). This check can be disabled with the option -fno-range-check – None – 2014-04-30T18:53:48.887

@Lembik: I did not get such an error with ifort, but I do get that error on gfortran. Adding -fno-range-check will disable the check and will still work (it did for me). – Kyle Kanos – 2014-04-30T19:00:25.490

@Lembik: Actually, I'm getting wrong answers doing that. I'll update with corrections for ensuring it works. – Kyle Kanos – 2014-04-30T19:02:35.443

I have gfortran 4.8.2 but your new code still gives me that error and when I add -fno-range-check I get the wrong answers. – None – 2014-04-30T19:28:35.893

@Lembik: Do you have libquadmath installed? – Kyle Kanos – 2014-04-30T19:48:47.363

I have the package libquadmath0 installed. – None – 2014-04-30T19:53:01.053

@Lembik: Hmm, it should work then. I don't have 4.8 (only 4.4 which doesn't support f08 additions), so I can't test this. My other suggestion would be to get Intel's non-commericial free compilers, as it definitely works there.

– Kyle Kanos – 2014-04-30T19:55:18.107

I had heard that fortran has really fast compilers... kinda of a apples-to-oranges comparison to python, isn't? :) – unknownprotocol – 2014-04-30T21:50:12.080

@cesardv...that's kinda the point here. – Kyle Kanos – 2014-04-30T21:51:34.180

1@KyleKanos @Lembik the problem is that the integer assignment in fortran is not using implicitly the int64 specification, hence the numbers are int32 before any conversion is made. The code should be: integer(int64) :: b = 3141592653_int64 for all int64. This is part of the fortran standard and is expected by the programmer in a type-declared programming language. (notice that default settings of course can override this) – zeroth – 2014-05-01T10:35:33.970

@zeroth: Intel's compiler doesn't complain any about not adding the _int64 (or any of the other iso_fortran_env and iso_c_binding types). If gfortran requires it, then it is another reason why one should not use that compiler (the primary, of course, being that it's horrendously slow). – Kyle Kanos – 2014-05-05T20:27:32.073

@KyleKanos I think that your basis for "another" reason of discarding gfortran is, at best, wrong. Intel's compiler maybe is just converting for you due to its default settings. But that is a devious thing as it is not in the standard. Hence, you shouldn't expect other compilers to behave the same. Publishing a program which does not rely on intel's compiler then becomes exceedingly difficult if you rely on non-standard extensions. Try the same thing for real and real(8). You will see that doing: real(8) :: a ; a = 0.002 is not the same as a = 0.002_8, even using intels compiler... – zeroth – 2014-05-06T22:22:57.740

... provided that 8 is the double kind (remark that said example only works for numbers that are rationally expressed in the floating point precision, using 1 as example would behave expectedly). As for the use statements, that is also just a design-choice of intel. Again, for cross-compiler compatibility you are required to add those statements either way. And as the standard does state this, I would consider not adding it, bad practice. Of course the intel compiler is largely favored, and I also favor it :) Even though gfortran has with the latest releases gained some momentum! :) – zeroth – 2014-05-06T22:28:07.700

Python 2.7 - ~~0.882s~~ 0.283s

(OP's original: 6.404s)

Edit: Steven Rumbalski's optimization by precomputing F values. With this optimization cpython beats pypy's 0.365s.

import itertools
import operator
import random

n=6
iters = 1000
firstzero = 0
bothzero = 0

choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n))

for S in itertools.product([-1,1], repeat = n+1):
    for i in xrange(iters):
        F = random.choice(choicesF)
        if not sum(map(operator.mul, F, S[:-1])):
            firstzero += 1
            if not sum(map(operator.mul, F, S[1:])):
                bothzero += 1

print "firstzero", firstzero
print "bothzero", bothzero

OP's original code uses such tiny arrays there is no benefit to using Numpy, as this pure python implementation demonstrates. But see also this numpy implementation which is three times faster again than my code.

I also optimize by skipping the rest of the convolution if the first result isn't zero.

Alistair Buxton

Posted 2014-04-26T18:26:17.890

Reputation: 991

I wonder whether such is faster in Nim.

– Cees Timmerman – 2015-06-09T17:01:32.307

1This approach also allows you to use pypy! – None – 2014-04-27T05:52:53.120

11With pypy this runs in about 0.5 seconds. – Alistair Buxton – 2014-04-27T12:41:42.383

2You get a much more convincing speedup if you set n = 10. I get 19s versus 4.6s for cpython versus pypy. – None – 2014-04-27T16:34:09.690

3Another optimization would be to precompute the possiblities for F because there are only 4032 of them. Define choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n)) outside of the loops. Then in the innerloop define F = random.choice(choicesF). I get a 3x speedup with such an approach. – Steven Rumbalski – 2014-04-28T13:36:40.910

3How about compiling this in Cython? Then adding a few tactful static types? – Thane Brimhall – 2014-04-28T20:10:29.013

You should update your PyPy interpreter; every version runs faster than the last. PyPy2.2.1 out does CPython2.7.6 even at n=6 (~0.25 vs. ~0.28), despite the overhead of JIT compiling. For larger n, the difference becomes very noticable. – primo – 2014-04-29T05:41:15.637

@primo I used PyPy 2.2.1 – Alistair Buxton – 2014-04-29T06:46:04.810

@AlistairBuxton results on my machine: http://i.stack.imgur.com/iQeCn.png The modified code I used here: http://codepad.org/K3F4ath6

– primo – 2014-04-29T09:21:55.430

According to the pypy devs you are supposed to write your python in as close a style to C as possible to make it go fast. That means not using random, itertools, filter etc. However it also has a warm up cost so you would need to test it on n=10 for example. – None – 2014-04-29T15:24:23.090

1isn't hoisting the lookups out of the loops a standard python optimization? i.e. declare mul for operator.mul and choice for random.choice outside of the loops. Maybe you can shave some more millis :) – riffraff – 2014-04-30T12:22:09.700

@riffraff that doesn't seem to make any difference here. – Alistair Buxton – 2014-05-01T19:48:39.140

2Put everything in a function and call it at the end. That localizes the names, which also makes the optimzation proposed by @riffraff work. Also, move the creation of range(iters) out of the loop. Altogether, I get a speedup of about 7% over your very nice answer. – Reinstate Monica – 2014-05-02T22:27:08.950

Rust: 0.011s

Original Python: 8.3

A straight translation of the original Python.

extern crate rand;

use rand::Rng;

static N: uint = 6;
static ITERS: uint = 1000;

fn convolve<T: Num>(into: &mut [T], a: &[T], b: &[T]) {
    // we want `a` to be the longest array
    if a.len() < b.len() {
        convolve(into, b, a);
        return
    }

    assert_eq!(into.len(), a.len() - b.len() + 1);

    for (n,place) in into.mut_iter().enumerate() {
        for (x, y) in a.slice_from(n).iter().zip(b.iter()) {
            *place = *place + *x * *y
        }
    }
}

fn main() {
    let mut first_zero = 0;
    let mut both_zero = 0;
    let mut rng = rand::XorShiftRng::new().unwrap();

    for s in PlusMinus::new() {
        for _ in range(0, ITERS) {
            let mut f = [0, .. N];
            while f.iter().all(|x| *x == 0) {
                for p in f.mut_iter() {
                    match rng.gen::<u32>() % 4 {
                        0 => *p = -1,
                        1 | 2 => *p = 0,
                        _ => *p = 1
                    }
                }
            }

            let mut fs = [0, .. 2];
            convolve(fs, s, f);

            if fs[0] == 0 { first_zero += 1 }
            if fs.iter().all(|&x| x == 0) { both_zero += 1 }
        }
    }

    println!("{}\n{}", first_zero, both_zero);
}



/// An iterator over [+-]1 arrays of the appropriate length
struct PlusMinus {
    done: bool,
    current: [i32, .. N + 1]
}
impl PlusMinus {
    fn new() -> PlusMinus {
        PlusMinus { done: false, current: [-1, .. N + 1] }
    }
}

impl Iterator<[i32, .. N + 1]> for PlusMinus {
    fn next(&mut self) -> Option<[i32, .. N+1]> {
        if self.done {
            return None
        }

        let ret = self.current;

        // a binary "adder", that just adds one to a bit vector (where
        // -1 is the zero, and 1 is the one).
        for (i, place) in self.current.mut_iter().enumerate() {
            *place = -*place;
            if *place == 1 {
                break
            } else if i == N {
                // we've wrapped, so we want to stop after this one
                self.done = true
            }
        }

        Some(ret)
    }
}

Compiled with --opt-level=3
My rust compiler is a recent nightly: (rustc 0.11-pre-nightly (eea4909 2014-04-24 23:41:15 -0700) to be precise)

huon

Posted 2014-04-26T18:26:17.890

Reputation: 551

I got it to compile using the nightly version of rust. However I think the code is wrong. The output should be something close to firstzero 27215 bothzero 12086. Instead it gives 27367 6481 – None – 2014-04-27T18:55:34.297

@Lembik, whoops, got my as and bs mixed up in the convolution; fixed (doesn't change the runtime noticably). – huon – 2014-04-28T01:24:05.980

4It's a very nice demonstration of rust's speed. – None – 2014-05-01T06:57:15.070

C++ (VS 2012) - ~~0.026s~~ 0.015s

Python 2.7.6/Numpy 1.8.1 - 12s

Speedup ~x800.

The gap would be a lot smaller if the convolved arrays were very large...

#include <vector>
#include <iostream>
#include <ctime>

using namespace std;

static unsigned int seed = 35;

int my_random()
{
   seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit

   switch((seed>>30) & 3)
   {
   case 0: return 0;
   case 1: return -1;
   case 2: return 1;
   case 3: return 0;
   }
   return 0;
}

bool allzero(const vector<int>& T)
{
   for(auto x : T)
   {
      if(x!=0)
      {
         return false;
      }
   }
   return true;
}

void convolve(vector<int>& out, const vector<int>& v1, const vector<int>& v2)
{
   for(size_t i = 0; i<out.size(); ++i)
   {
      int result = 0;
      for(size_t j = 0; j<v2.size(); ++j)
      {
         result += v1[i+j]*v2[j];
      }
      out[i] = result;
   }
}

void advance(vector<int>& v)
{
   for(auto &x : v)
   {
      if(x==-1)
      {
         x = 1;
         return;
      }
      x = -1;
   }
}

void convolve_random_arrays(void)
{
   const size_t n = 6;
   const int two_to_n_plus_one = 128;
   const int iters = 1000;
   int bothzero = 0;
   int firstzero = 0;

   vector<int> S(n+1);
   vector<int> F(n);
   vector<int> FS(2);

   time_t current_time;
   time(&current_time);
   seed = current_time;

   for(auto &x : S)
   {
      x = -1;
   }
   for(int i=0; i<two_to_n_plus_one; ++i)
   {
      for(int j=0; j<iters; ++j)
      {
         do
         {
            for(auto &x : F)
            {
               x = my_random();
            }
         } while(allzero(F));
         convolve(FS, S, F);
         if(FS[0] == 0)
         {
            firstzero++;
            if(FS[1] == 0)
            {
               bothzero++;
            }
         }
      }
      advance(S);
   }
   cout << firstzero << endl; // This output can slow things down
   cout << bothzero << endl; // comment out for timing the algorithm
}

A few notes:

The random function is being called in the loop so I went for a very light weight linear congruential generator (but generously looked at the MSBs).
This is really just the starting point for an optimized solution.
Didn't take that long to write...
I iterate through all the values of S taking S[0] to be the "least significant" digit.

Add this main function for a self contained example:

int main(int argc, char** argv)
{
  for(int i=0; i<1000; ++i) // run 1000 times for stop-watch
  {
      convolve_random_arrays();
  }
}

Guy Sirton

Posted 2014-04-26T18:26:17.890

Reputation: 551

1Indeed. The tiny size of the arrays in OP's code means using numpy is actually an order of magnitude slower than straight python. – Alistair Buxton – 2014-04-27T04:29:23.127

2Now x800 is what I am talking about! – None – 2014-04-27T05:53:46.517

Very nice! I've increased the speed-up on my code because of your advance function, so my code is now faster than yours :P (but very good competition!) – Kyle Kanos – 2014-04-27T15:18:47.530

I can't get this to compile. It says "error: ‘x’ does not name a type for(auto x : T) " and many other errors. How do you compile it? – None – 2014-04-27T16:39:31.813

Can anyone else compile this code? I can't compile it at all. – None – 2014-04-27T18:43:22.550

1@lembik yes as Mat says. You need C++11 supprt and a main function. Let me know if you need more help to get this to run... – Guy Sirton – 2014-04-27T19:54:31.807

@GuySirton Help getting it to compile would be great. Could you maybe just make it completely self contained to start with and I will try -std=c++11 . I am using g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 – None – 2014-04-27T20:00:47.557

You need a main function... Let me add one to the answer... – Guy Sirton – 2014-04-27T20:00:55.700

@Lembik: I'm not on my Linux machine right now but let me know if adding main worked. Also pass -O3 to gcc to enable the optimizer. – Guy Sirton – 2014-04-27T20:05:59.733

@GuySirton It did work. It is just missing two print statements now :) – None – 2014-04-27T20:08:49.070

@Lembik: That would slow it down! :-) Let me throw those in for you. – Guy Sirton – 2014-04-27T20:09:50.700

@Lembik: ok. note two lines added to the convolve function... – Guy Sirton – 2014-04-27T20:12:30.563

Oh I just noticed there is no randomness in your code at all! – None – 2014-04-27T20:13:14.527

@Lembik: There is randomness but every run would be the same pseudo-random sequence. If you want a new random sequence every time you just need to initialize the seed, e.g. to the clock. I can do that for you if you want, it won't affect the run-time... Python does this seed init for you. – Guy Sirton – 2014-04-27T20:14:50.250

@GuySirton Yes please. It seems only fair to compare code that has the same functionality. The fortran code for example does this seeding as does the C code. – None – 2014-04-27T20:18:07.957

@Lembik: All the programs here use a psuedo-random generator. Since the impact of the generator will influence the run-time I made my own since we want to benchmark the algorithm, not the random number generator... Does this make sense? (See more here: http://en.wikipedia.org/wiki/Linear_congruential_generator )

– Guy Sirton – 2014-04-27T20:18:27.117

If I had thought about it before I would have specific the random generator that had to be used. But I didn't. So as long as it gives the correct output (that is the random number is "good enough") and is seeded randomly I will accept it. – None – 2014-04-27T20:19:27.683

@Lembik There you are. Random seeding as requested, every run will show randomness. – Guy Sirton – 2014-04-27T20:24:18.350

let us continue this discussion in chat

– Guy Sirton – 2014-04-27T20:46:49.677

2I just tested this and could shave of another 20% by using plain arrays instead of std::vector.. – PlasmaHH – 2014-04-28T13:00:46.633

your my_random is considerably slower than it needs to be; given res=(seed>>30)&3, replacing the switch with (res&1)-!!(res&2) is nearly thrice as fast on my machine – Dave – 2014-05-01T02:09:04.627

@Dave Good idea. The intention here was really to put together quickly something that is a reasonable approach in C++ without too many tricks. Couldn't resist a few though ;-) As I mention this is just a starting point and at the time I wrote this answer it was the fastest random in town (and I was the first to identify that as a factor). I may attempt a more optimized solution to Lembik's new question, we'll see, I might use your variation there ;-) Would be nice if the compiler generated better code for that switch... – Guy Sirton – 2014-05-01T02:44:50.090

The used random generator here fails lots of tests for random output I wager (linear congruity method just isn't particularly good). There's a reason that the standard c++ random generator take as long as they do - if we don't care about the randomness of the result, may I propose ++x & 2? – Voo – 2014-05-04T17:03:23.137

@Voo The question is does ++x & 2 give you answers very close to the code in the question? I suspect not. – None – 2014-05-04T19:11:16.903

That just depends on your definition of "close". One simple example: If you use Dave's improvement (which seems perfectly valid on first glance) you end up using only the lower order bits, which for your particular parameters of the LCG happen to alternatively produce odd and even results (not particularly random that). Humans are horrible at noticing stuff like this, so without some extensive testing that show that the results are not impacted by the low quality PNRG it's rather suspect. Or as a prof of mine used to say: If it does not have to be correct I can make it arbitrarily fast :-) – Voo – 2014-05-04T20:14:23.313

@Voo (++x & 2) probably won't make much of a difference in run time. Give it a try. The main advantage of my providing my own random function is that it is inlined. Different languages will have different implementations of random and any non-secure generator will fail some randomness tests. I don't believe the C++ standard specifies any statistical requirement on random() and the contribution of how fast my specific random runs vs. the other solutions that have provided their own implementation of random is negligible. As Dave noted my random can still be optimized. – Guy Sirton – 2014-05-05T20:12:54.187

@Guy No I'd use C++'s standard mersenne twister implementation and not some horrible C function, because that's clearly the idiomatic solution, gives excellent results (best widely used PNRG by far - there's a reason large numbers of languages use it) and I don't have to make a proof showing that using an inferior PNRG won't change the result (which is generally really hard to do). If C++'s implementation causes a performance degradation when using its standard library compared to other languages, it's only honest to show that in a "what language is faster" anyhow. – Voo – 2014-05-05T23:06:29.643

@Voo C++11 does have random but its quality afaik is implementation specific. I.e. does not need to be mersenne twister. Using standard library degrades to what compiler has a faster standard library implementation and/or can inline the code or take advantage of special instructions so I'm not sure it makes it more meaningful in terms of C++ vs. other languages. I think my specific generator sucks because seed has to 32 bit, not 64 bit, if you try with uint_32 you should see a little better randomness but yeah LCGs aren't that great but they are used in lots of standard implementations. – Guy Sirton – 2014-05-06T05:53:28.623

C

Takes 0.015s on my machine, with OP's original code taking ~ 7.7s. Tried to optimize by generating the random array and convolving in the same loop, but it doesn't seem to make a lot of difference.

The first array is generated by taking an integer, write it out in binary, and change all 1 to -1 and all 0 to 1. The rest should be very straightforward.

Edit: instead of having n as an int, now we have n as a macro-defined constant, so we can use int arr[n]; instead of malloc.

Edit2: Instead of built-in rand() function, this now implements an xorshift PRNG. Also, a lot of conditional statements are removed when generating the random array.

Compile instructions:

gcc -O3 -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize -Wall ./test.c -o ./test

Code:

#include <stdio.h>
#include <time.h>

#define n (6)
#define iters (1000)
unsigned int x,y=34353,z=57768,w=1564; //PRNG seeds

/* xorshift PRNG
 * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
 * Used under CC-By-SA */
int myRand() {
    unsigned int t;
    t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}

int main() {
    int firstzero=0, bothzero=0;
    int arr[n+1];
    unsigned int i, j;
    x=(int)time(NULL);

    for(i=0; i< 1<<(n+1) ; i++) {
        unsigned int tmp=i;
        for(j=0; j<n+1; j++) {
            arr[j]=(tmp&1)*(-2)+1;
            tmp>>=1;
        }
        for(j=0; j<iters; j++) {
            int randArr[n];
            unsigned int k, flag=0;
            int first=0, second=0;
            do {
                for(k=0; k<n; k++) {
                    randArr[k]=(1-(myRand()&3))%2;
                    flag+=(randArr[k]&1);
                    first+=arr[k]*randArr[k];
                    second+=arr[k+1]*randArr[k];
                }
            } while(!flag);
            firstzero+=(!first);
            bothzero+=(!first&&!second);
        }
    }
    printf("firstzero %d\nbothzero %d\n", firstzero, bothzero);
    return 0;
}

user12205

Posted 2014-04-26T18:26:17.890

Reputation: 8 752

1I tested this. It is very fast (try n = 10) and gives correct looking output. Thank you. – None – 2014-04-27T16:43:27.797

This implementation doesn't follow the original because if the random vector is all zeros only the last element would be re-generated. In the original the entire vector would be. You need to enclose that loop in do{}while(!flag) or something to that effect. I don't expect it will change the run-time much (may make it faster). – Guy Sirton – 2014-04-27T22:28:36.647

@Guy Sirton Notice that before the continue; statement I assigned -1 to k, so k will loop from 0 again. – user12205 – 2014-04-27T22:55:20.127

1@ace ah! you're right. I was scanning too quickly and it looked like that was -= rather than =- :-) A while loop would be more readable. – Guy Sirton – 2014-04-27T23:01:09.677

J

I don't expect to beat out any compiled languages, and something tells me it'd take a miraculous machine to get less than 0.09 s with this, but I'd like to submit this J anyway, because it's pretty slick.

NB. constants
num =: 6
iters =: 1000

NB. convolve
NB. take the multiplication table                */
NB. then sum along the NE-SW diagonals           +//.
NB. and keep the longest ones                    #~ [: (= >./) #/.
NB. operate on rows of higher dimensional lists  " 1
conv =: (+//. #~ [: (= >./) #/.) @: (*/) " 1

NB. main program
S  =: > , { (num+1) # < _1 1                NB. all {-1,1}^(num+1)
F  =: (3&= - 0&=) (iters , num) ?@$ 4       NB. iters random arrays of length num
FS =: ,/ S conv/ F                          NB. make a convolution table
FB =: +/ ({. , *./)"1 ] 0 = FS              NB. first and both zero
('first zero ',:'both zero ') ,. ":"0 FB    NB. output results

This takes about 0.5 s on a laptop from the previous decade, only about 20x as fast as the Python in the answer. Most of the time is spent in conv because we write it lazily (we compute the entire convolution) and in full generality.

Since we know things about S and F, we can speed things up by making specific optimizations for this program. The best I've been able to come up with is conv =: ((num, num+1) { +//.)@:(*/)"1—select specifically the two numbers that correspond from the diagonal sums to the longest elements of the convolution—which approximately halves the time.

algorithmshark

Posted 2014-04-26T18:26:17.890

Reputation: 8 144

6J is always worth submitting, man :) – Vitaly Dyatlov – 2014-04-30T17:50:12.250

Perl - 9.3X faster...830% improvement

On my ancient netbook, the OP's code takes 53 seconds to run; Alistair Buxton's version takes about 6.5 seconds, and the following Perl version takes about 5.7 seconds.

use v5.10;
use strict;
use warnings;

use Algorithm::Combinatorics qw( variations_with_repetition );
use List::Util qw( any sum );
use List::MoreUtils qw( pairwise );

my $n         = 6;
my $iters     = 1000;
my $firstzero = 0;
my $bothzero  = 0;

my $variations = variations_with_repetition([-1, 1], $n+1);
while (my $S = $variations->next)
{
  for my $i (1 .. $iters)
  {
    my @F;
    until (@F and any { $_ } @F)
    {
      @F = map +((-1,0,0,1)[rand 4]), 1..$n;
    }

    # The pairwise function doesn't accept array slices,
    # so need to copy into a temp array @S0
    my @S0 = @$S[0..$n-1];

    unless (sum pairwise { $a * $b } @F, @S0)
    {
      $firstzero++;
      my @S1 = @$S[1..$n];  # copy again :-(
      $bothzero++ unless sum pairwise { $a * $b } @F, @S1;
    }
  }
}

say "firstzero ", $firstzero;
say "bothzero ", $bothzero;

tobyink

Posted 2014-04-26T18:26:17.890

Reputation: 1 233

Python 2.7 - numpy 1.8.1 with mkl bindings - 0.086s

(OP's original: 6.404s) (Buxton's pure python: 0.270s)

import numpy as np
import itertools

n=6
iters = 1000

#Pack all of the Ses into a single array
S = np.array( list(itertools.product([-1,1], repeat=n+1)) )

# Create a whole array of test arrays, oversample a bit to ensure we 
# have at least (iters) of them
F = np.random.rand(int(iters*1.1),n)
F = ( F < 0.25 )*-1 + ( F > 0.75 )*1
goodrows = (np.abs(F).sum(1)!=0)
assert goodrows.sum() > iters, "Got very unlucky"
# get 1000 cases that aren't all zero
F = F[goodrows][:iters]

# Do the convolution explicitly for the two 
# slots, but on all of the Ses and Fes at the 
# same time
firstzeros = (F[:,None,:]*S[None,:,:-1]).sum(-1)==0
secondzeros = (F[:,None,:]*S[None,:,1:]).sum(-1)==0

firstzero_count = firstzeros.sum()
bothzero_count = (firstzeros * secondzeros).sum()
print "firstzero", firstzero_count
print "bothzero", bothzero_count

As Buxton points out, OP's original code uses such tiny arrays there is no benefit to using Numpy. This implementation leverages numpy by doing all of the F and S cases at once in an array oriented way. This combined with mkl bindings for python leads to a very fast implementation.

Note also that just loading the libraries and starting the interpreter takes 0.076s so the actual computation is taking ~ 0.01 seconds, similar to the C++ solution.

alemi

Posted 2014-04-26T18:26:17.890

Reputation: 221

What are mkl bindings and how do I get them on ubuntu? – None – 2014-04-30T17:53:57.880

Running python -c "import numpy; numpy.show_config()" will show you if your version of numpy is compiled against blas/atlas/mkl, etc. ATLAS is a free accelerated math package that numpy can be linked against, Intel MKL you usually have to pay for (unless you're an academic) and can be linked to numpy/scipy.

– alemi – 2014-04-30T18:57:49.690

For an easy way, use the anaconda python distribution and use the accelerate package. Or use the enthought distribution.

– alemi – 2014-04-30T19:00:32.303

If you're on windows, just download numpy from here. Pre-compiled numpy installers linked against MKL.

– Fake Name – 2014-05-02T10:20:46.637

MATLAB 0.024s

Computer 1

Original Code: ~ 3.3 s
Alistar Buxton's Code: ~ 0.51 s
Alistar Buxton's new Code: ~0.25 s
Matlab Code: ~ 0.024 s (Matlab already running)

Computer 2

Original Code: ~ 6.66 s
Alistar Buxton's Code: ~ 0.64 s
Alistar Buxton's new Code: ?
Matlab: ~ 0.07 s (Matlab already running)
Octave: ~ 0.07 s

I decided to give the oh so slow Matlab a try. If you know how, you can get rid of most of the loops (in Matlab), which makes it quite fast. However, the memory requirements are higher than for looped solutions but this will not be an issue if you don't have very large arrays...

function call_convolve_random_arrays
tic
convolve_random_arrays
toc
end

function convolve_random_arrays

n = 6;
iters = 1000;
firstzero = 0;
bothzero = 0;

rnd = [-1, 0, 0, 1];

S = -1 *ones(1, n + 1);

IDX1 = 1:n;
IDX2 = IDX1 + 1;

for i = 1:2^(n + 1)
    F = rnd(randi(4, [iters, n]));
    sel = ~any(F,2);
    while any(sel)
        F(sel, :) = rnd(randi(4, [sum(sel), n]));
        sel = ~any(F,2);
    end

    sum1 = F * S(IDX1)';
    sel = sum1 == 0;
    firstzero = firstzero + sum(sel);

    sum2 = F(sel, :) * S(IDX2)';
    sel = sum2 == 0;
    bothzero = bothzero + sum(sel);

    S = permute(S); 
end

fprintf('firstzero %i \nbothzero %i \n', firstzero, bothzero);

end

function x = permute(x)

for i=1:length(x)
    if(x(i)==-1)
        x(i) = 1;
            return
    end
        x(i) = -1;
end

end

Here is what I do:

use Kyle Kanos function to permute through S
calculate all n * iters random numbers at once
map 1 to 4 to [-1 0 0 1]
use Matrix multiplication (elementwise sum(F * S(1:5)) is equal to matrix multiplication of F * S(1:5)'
for bothzero: only calculate members that fullfill the first condition

I assume you don't have matlab, which is too bad as I really would have liked to see how it compares...

(The function can be slower the first time you run it.)

mathause

Posted 2014-04-26T18:26:17.890

Reputation: 341

Well I have octave if you can make it work for that...? – None – 2014-04-28T12:12:06.117

I can give it a try - I never worked with octave, though. – mathause – 2014-04-28T12:43:33.223

Ok, I can run it as is in octave if i put the code in a file named call_convolve_random_arrays.m and then call it from octave. – mathause – 2014-04-28T18:29:04.563

Does it need some more code to actually get it to do anything? When I do "octave call_convolve_random_arrays.m" it doesn't output anything. See http://bpaste.net/show/JPtLOCeI3aP3wc3F3aGf/

– None – 2014-04-28T18:33:08.407

sorry, try to open octave and run it then. It should display firstzero, bothzero and execution time. – mathause – 2014-04-28T18:34:16.203

Thanks. That worked. I get Elapsed time is 0.0785012 seconds. octave -eval all_convolve_random_arrays.m sort of works .. well it gives an error message as well as running. – None – 2014-04-28T18:36:02.733

How do Alistar Buxton's new optimizations perform for you? – Steven Rumbalski – 2014-04-29T15:57:58.500

@StevenRumbalski I added the time for the new code above. I think there is nothing similar to itertools.product in Matlab, so I would have to think about how to create this matrix. But I think that could gain you some time. – mathause – 2014-04-30T14:39:55.337

@Lembik I did not get any errors. What error do you get and what version are you using? – mathause – 2014-04-30T14:42:11.533

Julia: 0.30 s

Op's Python: 21.36 s (Core2 duo)

71x speedup

function countconv()                                                                                                                                                           
    n = 6                                                                                                                                                                      
    iters = 1000                                                                                                                                                               
    firstzero = 0                                                                                                                                                              
    bothzero = 0                                                                                                                                                               
    cprod= Iterators.product(fill([-1,1], n+1)...)                                                                                                                             
    F=Array(Float64,n);                                                                                                                                                        
    P=[-1. 0. 0. 1.]                                                                                                                                                                                                                                                                                                             

    for S in cprod                                                                                                                                                             
        Sm=[S...]                                                                                                                                                              
        for i = 1:iters                                                                                                                                                        
            F=P[rand(1:4,n)]                                                                                                                                                  
            while all(F==0)                                                                                                                                                   
                F=P[rand(1:4,n)]                                                                                                                                              
            end                                                                                                                                                               
            if  dot(reverse!(F),Sm[1:end-1]) == 0                                                                                                                           
                firstzero += 1                                                                                                                                                 
                if dot(F,Sm[2:end]) == 0                                                                                                                              
                    bothzero += 1                                                                                                                                              
                end                                                                                                                                                            
            end                                                                                                                                                                
        end                                                                                                                                                                    
    end
    return firstzero,bothzero
end

I did some modifications of Arman's Julia answer: First of all, I wrapped it in a function, as global variables make it hard for Julia's type inference and JIT: A global variable can change its type at any time, and must be checked every operation. Then, I got rid of the anonymous functions and array comprehensions. They aren't really necessary, and are still pretty slow. Julia is faster with lower-level abstractions right now.

There's lots more ways to make it faster, but this does a decent job.

user20768

Posted 2014-04-26T18:26:17.890

Reputation: 111

Are you measuring the time in the REPL or running the whole file from command line? – Aditya – 2014-04-29T05:30:44.113

both from the REPL. – user20768 – 2014-04-29T17:35:45.983

Ok I am posting this just because I feel Java needs to be represented here. I am terrible with other languages and I confess to not understand the problem exactly, so I will need some help to fix this code. I stole most of the code ace's C example, and then borrowed some snippets from others. I hope that isn't a faux pas...

One thing I would like to point out is that languages that optimize at run time need to be run several/many times to get up to full speed. I think it is justified to take the fully optimized speed (or at least the average speed) because most things you are concerned with running fast will be run a bunch of times.

The code still needs to be fixed, but I ran it anyways to see what times I would get.

Here are the results on an Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz on Ubuntu running it 1000 times:

server:/tmp# time java8 -cp . Tester

firstzero 40000

bothzero 20000

first run time: 41 ms last run time: 4 ms

real 0m5.014s user 0m4.664s sys 0m0.268s

Here is my crappy code:

public class Tester 
{
    public static void main( String[] args )
    {
        long firstRunTime = 0;
        long lastRunTime = 0;
        String testResults = null;
        for( int i=0 ; i<1000 ; i++ )
        {
            long timer = System.currentTimeMillis();
            testResults = new Tester().runtest();
            lastRunTime = System.currentTimeMillis() - timer;
            if( i ==0 )
            {
                firstRunTime = lastRunTime;
            }
        }
        System.err.println( testResults );
        System.err.println( "first run time: " + firstRunTime + " ms" );
        System.err.println( "last run time: " + lastRunTime + " ms" );
    }

    private int x,y=34353,z=57768,w=1564; 

    public String runtest()
    {
        int n = 6;
        int iters = 1000;
        //#define iters (1000)
        //PRNG seeds

        /* xorshift PRNG
         * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
         * Used under CC-By-SA */

            int firstzero=0, bothzero=0;
            int[] arr = new int[n+1];
            int i=0, j=0;
            x=(int)(System.currentTimeMillis()/1000l);

            for(i=0; i< 1<<(n+1) ; i++) {
                int tmp=i;
                for(j=0; j<n+1; j++) {
                    arr[j]=(tmp&1)*(-2)+1;
                    tmp>>=1;
                }
                for(j=0; j<iters; j++) {
                    int[] randArr = new int[n];
                    int k=0;
                    long flag = 0;
                    int first=0, second=0;
                    do {
                        for(k=0; k<n; k++) {
                            randArr[k]=(1-(myRand()&3))%2;
                            flag+=(randArr[k]&1);
                            first+=arr[k]*randArr[k];
                            second+=arr[k+1]*randArr[k];
                        }
                    } while(allzero(randArr));
                    if( first == 0 )
                    {
                        firstzero+=1;
                        if( second == 0 )
                        {
                            bothzero++;
                        }
                    }
                }
            }
         return ( "firstzero " + firstzero + "\nbothzero " + bothzero + "\n" );
    }

    private boolean allzero(int[] arr)
    {
       for(int x : arr)
       {
          if(x!=0)
          {
             return false;
          }
       }
       return true;
    }

    public int myRand() 
    {
        long t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
    }
}

And I tried running the python code after upgrading python and installing python-numpy but I get this:

server:/tmp# python tester.py
Traceback (most recent call last):
  File "peepee.py", line 15, in <module>
    F = np.random.choice(np.array([-1,0,0,1], dtype=np.int8), size = n)
AttributeError: 'module' object has no attribute 'choice'

Chris Seline

Posted 2014-04-26T18:26:17.890

Reputation: 161

Comments: Never use currentTimeMillis for benchmarking (use the nano version in System) and 1k runs may not be enough to get the JIT involved (1.5k for client and 10k for server would be the defaults, although you call myRand often enough that that will be JITed which should cause some functions up the callstack to be compiled which may work here).Last but not least the weak PNRG is cheating, but so does the C++ solution and others, so I guess that's not too unfair. – Voo – 2014-05-04T20:25:06.887

On windows you need to avoid currentTimeMillis, but for linux for all but very fine granularity measurements you don't need nano time, and the call to get nano time is much more expensive than millis. So I very much disagree that you should NEVER use it. – Chris Seline – 2014-05-05T20:58:22.893

So you're writing Java code for one particular OS and JVM implementation? Actually I'm not sure which OS you're using, because I just checked in my HotSpot dev tree and Linux uses gettimeofday(&time, NULL) for milliSeconds which isn't monotonical and doesn't give any accuracy guarantees (so on some platforms/kernels exactly the same problems as the currentTimeMillis Windows implementation - so either that one's fine too or neither is). nanoTime on the other hand uses clock_gettime(CLOCK_MONOTONIC, &tp) which clearly is also the right thing to use when benchmarking on Linux. – Voo – 2014-05-05T23:38:56.297

It has never caused an issue for me since I have been coding java on any Linux distro or kernel. – Chris Seline – 2014-05-07T00:35:29.430

Golang version 45X of python on my machine on below Golang codes:

package main

import (
"fmt"
"time"
)

const (
n     = 6
iters = 1000
)

var (
x, y, z, w = 34353, 34353, 57768, 1564 //PRNG seeds
)

/* xorshift PRNG
 * Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
 * Used under CC-By-SA */
func myRand() int {
var t uint
t = uint(x ^ (x << 11))
x, y, z = y, z, w
w = int(uint(w^w>>19) ^ t ^ (t >> 8))
return w
}

func main() {
var firstzero, bothzero int
var arr [n + 1]int
var i, j int
x = int(time.Now().Unix())

for i = 0; i < 1<<(n+1); i = i + 1 {
    tmp := i
    for j = 0; j < n+1; j = j + 1 {
        arr[j] = (tmp&1)*(-2) + 1
        tmp >>= 1
    }
    for j = 0; j < iters; j = j + 1 {
        var randArr [n]int
        var flag uint
        var k, first, second int
        for {
            for k = 0; k < n; k = k + 1 {
                randArr[k] = (1 - (myRand() & 3)) % 2
                flag += uint(randArr[k] & 1)
                first += arr[k] * randArr[k]
                second += arr[k+1] * randArr[k]
            }
            if flag != 0 {
                break
            }
        }
        if first == 0 {
            firstzero += 1
            if second == 0 {
                bothzero += 1
            }
        }
    }
}
println("firstzero", firstzero, "bothzero", bothzero)
}

and the below python codes copyed from above:

import itertools
import operator
import random

n=6
iters = 1000
firstzero = 0
bothzero = 0

choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n))

for S in itertools.product([-1,1], repeat = n+1):
    for i in xrange(iters):
        F = random.choice(choicesF)
        if not sum(map(operator.mul, F, S[:-1])):
            firstzero += 1
            if not sum(map(operator.mul, F, S[1:])):
                bothzero += 1

print "firstzero", firstzero
print "bothzero", bothzero

and the time below:

$time python test.py
firstzero 27349
bothzero 12125

real    0m0.477s
user    0m0.461s
sys 0m0.014s

$time ./hf
firstzero 27253 bothzero 12142

real    0m0.011s
user    0m0.008s
sys 0m0.002s

lunny

Posted 2014-04-26T18:26:17.890

Reputation: 61

1have you thought about using "github.com/yanatan16/itertools" ? also would you say this would work nice in multiple goroutines ? – ymg – 2014-04-30T17:02:45.570

Haskell: ~2000x speedup per core

Compile with 'ghc -O3 -funbox-strict-fields -threaded -fllvm', and run with '+RTS -Nk' where k is the number of cores on your machine.

import Control.Parallel.Strategies
import Data.Bits
import Data.List
import Data.Word
import System.Random

n = 6 :: Int
iters = 1000 :: Int

data G = G !Word !Word !Word !Word deriving (Eq, Show)

gen :: G -> (Word, G)
gen (G x y z w) = let t  = x `xor` (x `shiftL` 11)
                      w' = w `xor` (w `shiftR` 19) `xor` t `xor` (t `shiftR` 8)
                  in (w', G y z w w')  

mask :: Word -> Word
mask = (.&.) $ (2 ^ n) - 1

gen_nonzero :: G -> (Word, G)
gen_nonzero g = let (x, g') = gen g 
                    a = mask x
                in if a == 0 then gen_nonzero g' else (a, g')


data F = F {zeros  :: !Word, 
            posneg :: !Word} deriving (Eq, Show)

gen_f :: G -> (F, G)       
gen_f g = let (a, g')  = gen_nonzero g
              (b, g'') = gen g'
          in  (F a $ mask b, g'')

inner :: Word -> F -> Int
inner s (F zs pn) = let s' = complement $ s `xor` pn
                        ones = s' .&. zs
                        negs = (complement s') .&. zs
                    in popCount ones - popCount negs

specialised_convolve :: Word -> F -> (Int, Int)
specialised_convolve s f@(F zs pn) = (inner s f', inner s f) 
    where f' = F (zs `shiftL` 1) (pn `shiftL` 1)

ss :: [Word]
ss = [0..2 ^ (n + 1) - 1]

main_loop :: [G] -> (Int, Int)
main_loop gs = foldl1' (\(fz, bz) (fz', bz') -> (fz + fz', bz + bz')) . parMap rdeepseq helper $ zip ss gs
    where helper (s, g) = go 0 (0, 0) g
                where go k u@(fz, bz) g = if k == iters 
                                              then u 
                                              else let (f, g') = gen_f g
                                                       v = case specialised_convolve s f
                                                               of (0, 0) -> (fz + 1, bz + 1)
                                                                  (0, _) -> (fz + 1, bz)
                                                                  _      -> (fz, bz)
                                                   in go (k + 1) v g'

seed :: IO G                                        
seed = do std_g <- newStdGen
          let [x, y, z, w] = map fromIntegral $ take 4 (randoms std_g :: [Int])
          return $ G x y z w

main :: IO ()
main = (sequence $ map (const seed) ss) >>= print . main_loop

user1502040

Posted 2014-04-26T18:26:17.890

Reputation: 2 196

Amdahl's law states the parallelization speedup is not linear to the number of parallel processing units. instead they only provide dimishing returns – xaedes – 2019-07-11T07:39:01.867

@xaedes The speedup seems essentially linear for low numbers of cores – user1502040 – 2019-07-12T09:18:17.370

So with 4 cores it's over 9000?! There's no way that could be right.

– Cees Timmerman – 2014-05-06T15:03:56.017

C# 0.135s

C# based on Alistair Buxton's plain python: 0.278s
Parallelised C#: 0.135s
Python from the question: 5.907s
Alistair's plain python: 0.853s

I'm not actually certain this implementation is correct - its output is different, if you look at the results down at the bottom.

There's certainly more optimal algorithms. I just decided to use a very similar algorithm to the Python one.

Single-threaded C

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConvolvingArrays
{
    static class Program
    {
        static void Main(string[] args)
        {
            int n=6;
            int iters = 1000;
            int firstzero = 0;
            int bothzero = 0;

            int[] arraySeed = new int[] {-1, 1};
            int[] randomSource = new int[] {-1, 0, 0, 1};
            Random rand = new Random();

            foreach (var S in Enumerable.Repeat(arraySeed, n+1).CartesianProduct())
            {
                for (int i = 0; i < iters; i++)
                {
                    var F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    while (!F.Any(f => f != 0))
                    {
                        F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    }
                    if (Enumerable.Zip(F, S.Take(n), (f, s) => f * s).Sum() == 0)
                    {
                        firstzero++;
                        if (Enumerable.Zip(F, S.Skip(1), (f, s) => f * s).Sum() == 0)
                        {
                            bothzero++;
                        }
                    }
                }
            }

            Console.WriteLine("firstzero {0}", firstzero);
            Console.WriteLine("bothzero {0}", bothzero);
        }

        // itertools.product?
        // http://ericlippert.com/2010/06/28/computing-a-cartesian-product-with-linq/
        static IEnumerable<IEnumerable<T>> CartesianProduct<T>
            (this IEnumerable<IEnumerable<T>> sequences)
        {
            IEnumerable<IEnumerable<T>> emptyProduct =
              new[] { Enumerable.Empty<T>() };
            return sequences.Aggregate(
              emptyProduct,
              (accumulator, sequence) =>
                from accseq in accumulator
                from item in sequence
                select accseq.Concat(new[] { item }));
        }
    }
}

Parallel C#:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ConvolvingArrays
{
    static class Program
    {
        static void Main(string[] args)
        {
            int n=6;
            int iters = 1000;
            int firstzero = 0;
            int bothzero = 0;

            int[] arraySeed = new int[] {-1, 1};
            int[] randomSource = new int[] {-1, 0, 0, 1};

            ConcurrentBag<int[]> results = new ConcurrentBag<int[]>();

            // The next line iterates over arrays of length n+1 which contain only -1s and 1s
            Parallel.ForEach(Enumerable.Repeat(arraySeed, n + 1).CartesianProduct(), (S) =>
            {
                int fz = 0;
                int bz = 0;
                ThreadSafeRandom rand = new ThreadSafeRandom();
                for (int i = 0; i < iters; i++)
                {
                    var F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    while (!F.Any(f => f != 0))
                    {
                        F = Enumerable.Range(0, n).Select(_ => randomSource[rand.Next(randomSource.Length)]);
                    }
                    if (Enumerable.Zip(F, S.Take(n), (f, s) => f * s).Sum() == 0)
                    {
                        fz++;
                        if (Enumerable.Zip(F, S.Skip(1), (f, s) => f * s).Sum() == 0)
                        {
                            bz++;
                        }
                    }
                }

                results.Add(new int[] { fz, bz });
            });

            foreach (int[] res in results)
            {
                firstzero += res[0];
                bothzero += res[1];
            }

            Console.WriteLine("firstzero {0}", firstzero);
            Console.WriteLine("bothzero {0}", bothzero);
        }

        // itertools.product?
        // http://ericlippert.com/2010/06/28/computing-a-cartesian-product-with-linq/
        static IEnumerable<IEnumerable<T>> CartesianProduct<T>
            (this IEnumerable<IEnumerable<T>> sequences)
        {
            IEnumerable<IEnumerable<T>> emptyProduct =
              new[] { Enumerable.Empty<T>() };
            return sequences.Aggregate(
              emptyProduct,
              (accumulator, sequence) =>
                from accseq in accumulator
                from item in sequence
                select accseq.Concat(new[] { item }));
        }
    }

    // http://stackoverflow.com/a/11109361/1030702
    public class ThreadSafeRandom
    {
        private static readonly Random _global = new Random();
        [ThreadStatic]
        private static Random _local;

        public ThreadSafeRandom()
        {
            if (_local == null)
            {
                int seed;
                lock (_global)
                {
                    seed = _global.Next();
                }
                _local = new Random(seed);
            }
        }
        public int Next()
        {
            return _local.Next();
        }
        public int Next(int maxValue)
        {
            return _local.Next(maxValue);
        }
    }
}

Test output:

Windows (.NET)

The C# is much faster on Windows. Probably because .NET is faster than mono.

User and sys timing doesn't seem to work (used git bash for timing).

$ time /c/Python27/python.exe numpypython.py
firstzero 27413
bothzero 12073

real    0m5.907s
user    0m0.000s
sys     0m0.000s
$ time /c/Python27/python.exe plainpython.py
firstzero 26983
bothzero 12033

real    0m0.853s
user    0m0.000s
sys     0m0.000s
$ time ConvolvingArrays.exe
firstzero 28526
bothzero 6453

real    0m0.278s
user    0m0.000s
sys     0m0.000s
$ time ConvolvingArraysParallel.exe
firstzero 28857
bothzero 6485

real    0m0.135s
user    0m0.000s
sys     0m0.000s

Linux (mono)

bob@phoebe:~/convolvingarrays$ time python program.py
firstzero 27059
bothzero 12131

real    0m11.932s
user    0m11.912s
sys     0m0.012s
bob@phoebe:~/convolvingarrays$ mcs -optimize+ -debug- program.cs
bob@phoebe:~/convolvingarrays$ time mono program.exe
firstzero 28982
bothzero 6512

real    0m1.360s
user    0m1.532s
sys     0m0.872s
bob@phoebe:~/convolvingarrays$ mcs -optimize+ -debug- parallelprogram.cs
bob@phoebe:~/convolvingarrays$ time mono parallelprogram.exe
firstzero 28857
bothzero 6496

real    0m0.851s
user    0m2.708s
sys     0m3.028s

Bob

Posted 2014-04-26T18:26:17.890

Reputation: 844

1I don't think the code is correct as you say. The outputs are not right. – None – 2014-04-27T16:43:56.740

@Lembik Yea. I'd appreciate it if someone could tell me where it's wrong, though - I can't figure it out (only having a minimal understanding of what it's supposed to do doesn't help). – Bob – 2014-04-28T01:40:50.337

Would be interesting to see how this does with .NET Native http://blogs.msdn.com/b/dotnet/archive/2014/04/02/announcing-net-native-preview.aspx

– Rick Minerich – 2014-04-30T16:33:18.307

@Lembik I've just gone over all of it, as far as I can tell it should be identical to the other Python solution... now I'm really confused. – Bob – 2014-04-30T23:44:29.280

F# solution

Runtime is 0.030s when compiled to x86 on the CLR Core i7 4 (8) @ 3.4 Ghz

I have no idea if the code is correct.

Functional optimization (inline fold) -> 0.026s
Building via Console Project -> 0.022s
Added a better algorithm for generation of the permutation arrays -> 0.018s
Mono for Windows -> 0.089s
Running Alistair's Python script -> 0.259s

let inline ffoldi n f state =
    let mutable state = state
    for i = 0 to n - 1 do
        state <- f state i
    state

let product values n =
    let p = Array.length values
    Array.init (pown p n) (fun i ->
        (Array.zeroCreate n, i)
        |> ffoldi n (fun (result, i') j ->
            result.[j] <- values.[i' % p]
            result, i' / p
        )
        |> fst
    )

let convolute signals filter =
    let m = Array.length signals
    let n = Array.length filter
    let len = max m n - min m n + 1

    Array.init len (fun offset ->
        ffoldi n (fun acc i ->
            acc + filter.[i] * signals.[m - 1 - offset - i]
        ) 0
    )

let n = 6
let iters = 1000

let next =
    let arrays =
        product [|-1; 0; 0; 1|] n
        |> Array.filter (Array.forall ((=) 0) >> not)
    let rnd = System.Random()
    fun () -> arrays.[rnd.Next arrays.Length]

let signals = product [|-1; 1|] (n + 1)

let firstzero, bothzero =
    ffoldi signals.Length (fun (firstzero, bothzero) i ->
        let s = signals.[i]
        ffoldi iters (fun (first, both) _ ->
            let f = next()
            match convolute s f with
            | [|0; 0|] -> first + 1, both + 1
            | [|0; _|] -> first + 1, both
            | _ -> first, both
        ) (firstzero, bothzero)
    ) (0, 0)

printfn "firstzero %i" firstzero
printfn "bothzero %i" bothzero

David Grenier

Posted 2014-04-26T18:26:17.890

Reputation: 141

Ruby

Ruby (2.1.0) 0.277s
Ruby (2.1.1) 0.281s
Python (Alistair Buxton) 0.330s
Python (alemi) 0.097s

n = 6
iters = 1000
first_zero = 0
both_zero = 0

choices = [-1, 0, 0, 1].repeated_permutation(n).select{|v| [0] != v.uniq}

def convolve(v1, v2)
  [0, 1].map do |i|
    r = 0
    6.times do |j|
      r += v1[i+j] * v2[j]
    end
    r
  end
end

[-1, 1].repeated_permutation(n+1) do |s|
  iters.times do
    f = choices.sample
    fs = convolve s, f
    if 0 == fs[0]
      first_zero += 1
      if 0 == fs[1]
        both_zero += 1
      end
    end
  end
end

puts 'firstzero %i' % first_zero
puts 'bothzero %i' % both_zero

Landstander

Posted 2014-04-26T18:26:17.890

Reputation: 131

thread wouldnt be complete without PHP

6.6x faster

PHP v5.5.9 - 1.223 0.646 sec;

vs

Python v2.7.6 - 8.072 sec

<?php

$n = 6;
$iters = 1000;
$firstzero = 0;
$bothzero = 0;

$x=time();
$y=34353;
$z=57768;
$w=1564; //PRNG seeds

function myRand() {
    global $x;
    global $y;
    global $z;
    global $w;
    $t = $x ^ ($x << 11);
    $x = $y; $y = $z; $z = $w;
    return $w = $w ^ ($w >> 19) ^ $t ^ ($t >> 8);
}

function array_cartesian() {
    $_ = func_get_args();
    if (count($_) == 0)
        return array();
    $a = array_shift($_);
    if (count($_) == 0)
        $c = array(array());
    else
        $c = call_user_func_array(__FUNCTION__, $_);
    $r = array();
    foreach($a as $v)
        foreach($c as $p)
            $r[] = array_merge(array($v), $p);
    return $r;
}

function rand_array($a, $n)
{
    $r = array();
    for($i = 0; $i < $n; $i++)
        $r[] = $a[myRand()%count($a)];
    return $r;
}

function convolve($a, $b)
{
    // slows down
    /*if(count($a) < count($b))
        return convolve($b,$a);*/
    $result = array();
    $w = count($a) - count($b) + 1;
    for($i = 0; $i < $w; $i++){
        $r = 0;
        for($k = 0; $k < count($b); $k++)
            $r += $b[$k] * $a[$i + $k];
        $result[] = $r;
    }
    return $result;
}

$cross = call_user_func_array('array_cartesian',array_fill(0,$n+1,array(-1,1)));

foreach($cross as $S)
    for($i = 0; $i < $iters; $i++){
        while(true)
        {
            $F = rand_array(array(-1,0,0,1), $n);
            if(in_array(-1, $F) || in_array(1, $F))
                break;
        }
        $FS = convolve($S, $F);
        if(0==$FS[0]) $firstzero += 1;
        if(0==$FS[0] && 0==$FS[1]) $bothzero += 1;
    }

echo "firstzero $firstzero\n";
echo "bothzero $bothzero\n";

Used a custom random generator (stolen from C answer), PHP one sucks and numbers dont match
convolve function simplified a bit to be more fast
Checking for array-with-zeros-only is very optimized too (see $F and $FS checkings).

Outputs:

$ time python num.py 
firstzero 27050
bothzero 11990

real    0m8.072s
user    0m8.037s
sys 0m0.024s
$ time php num.php
firstzero 27407
bothzero 12216

real    0m1.223s
user    0m1.210s
sys 0m0.012s

Edit. Second version of script works for for just 0.646 sec:

<?php

$n = 6;
$iters = 1000;
$firstzero = 0;
$bothzero = 0;

$x=time();
$y=34353;
$z=57768;
$w=1564; //PRNG seeds

function myRand() {
    global $x;
    global $y;
    global $z;
    global $w;
    $t = $x ^ ($x << 11);
    $x = $y; $y = $z; $z = $w;
    return $w = $w ^ ($w >> 19) ^ $t ^ ($t >> 8);
}

function array_cartesian() {
    $_ = func_get_args();
    if (count($_) == 0)
        return array();
    $a = array_shift($_);
    if (count($_) == 0)
        $c = array(array());
    else
        $c = call_user_func_array(__FUNCTION__, $_);
    $r = array();
    foreach($a as $v)
        foreach($c as $p)
            $r[] = array_merge(array($v), $p);
    return $r;
}

function convolve($a, $b)
{
    // slows down
    /*if(count($a) < count($b))
        return convolve($b,$a);*/
    $result = array();
    $w = count($a) - count($b) + 1;
    for($i = 0; $i < $w; $i++){
        $r = 0;
        for($k = 0; $k < count($b); $k++)
            $r += $b[$k] * $a[$i + $k];
        $result[] = $r;
    }
    return $result;
}

$cross = call_user_func_array('array_cartesian',array_fill(0,$n+1,array(-1,1)));

$choices = call_user_func_array('array_cartesian',array_fill(0,$n,array(-1,0,0,1)));

foreach($cross as $S)
    for($i = 0; $i < $iters; $i++){
        while(true)
        {
            $F = $choices[myRand()%count($choices)];
            if(in_array(-1, $F) || in_array(1, $F))
                break;
        }
        $FS = convolve($S, $F);
        if(0==$FS[0]){
            $firstzero += 1;
            if(0==$FS[1])
                $bothzero += 1;
        }
    }

echo "firstzero $firstzero\n";
echo "bothzero $bothzero\n";

Vitaly Dyatlov

Posted 2014-04-26T18:26:17.890

Reputation: 151

Q, 0.296 seg

n:6; iter:1000  /parametrization (constants)
c:n#0           /auxiliar constant (sequence 0 0.. 0 (n))
A:B:();         /A and B accumulates results of inner product (firstresult, secondresult)

/S=sequence with all arrays of length n+1 with values -1 and 1
S:+(2**m)#/:{,/x#/:-1 1}'m:|n(2*)\1 

f:{do[iter; F:c; while[F~c; F:n?-1 0 0 1]; A,:+/F*-1_x; B,:+/F*1_x];} /hard work
f'S               /map(S,f)
N:~A; +/'(N;N&~B) / ~A is not A (or A=0) ->bitmap.  +/ is sum (population over a bitmap)
                  / +/'(N;N&~B) = count firstResult=0, count firstResult=0 and secondResult=0

Q is a collection oriented language (kx.com)

Code rewrited to explote idiomatic Q, but no other clever optimizations

Scripting languages optimize programmer time, not execution time

Q is not the best tool for this problem

First coding attempt = not a winner, but reasonable time (approx. 30x speedup)

quite competitive among interpreters
stop and choose another problem

NOTES.-

program uses default seed (repeteable execs) To choose another seed for random generator use \S seed
Result is given as a squence of two ints, so there is a final i-suffix at second value 27421 12133i -> read as (27241, 12133)
Time not counting interpreter startup. \t sentence mesures time consumed by that sentence

J. Sendra

Posted 2014-04-26T18:26:17.890

Reputation: 396

Very interesting thank you. – None – 2016-05-22T08:42:09.083

Julia: 12.149 6.929 s

Despite their claims to speed, the initial JIT compilation time holds us back!

Note that the following Julia code is effectively a direct translation of the original Python code (no optimisations made) as a demonstration that you can easily transfer your programming experience to a faster language ;)

require("Iterators")

n = 6
iters = 1000
firstzero = 0
bothzero = 0

for S in Iterators.product(fill([-1,1], n+1)...)
    for i = 1:iters
        F = [[-1 0 0 1][rand(1:4)] for _ = 1:n]
        while all((x) -> round(x,8) == 0, F)
            F = [[-1 0 0 1][rand(1:4)] for _ = 1:n]
        end
        FS = conv(F, [S...])
        if round(FS[1],8) == 0
            firstzero += 1
        end
        if all((x) -> round(x,8) == 0, FS)
            bothzero += 1
        end
    end
end

println("firstzero ", firstzero)
println("bothzero ", bothzero)

Edit

Running with n = 8 takes 32.935 s. Considering that the complexity of this algorithm is O(2^n), then 4 * (12.149 - C) = (32.935 - C), where C is a constant representing the JIT compilation time. Solving for C we find that C = 5.2203, suggesting that actual execution time for n = 6 is 6.929 s.

nimble agar

Posted 2014-04-26T18:26:17.890

Reputation: 167

How about increasing n to 8 to see if Julia comes into its own then? – None – 2014-04-27T06:11:03.860

This ignores many of the performance tips here: http://julia.readthedocs.org/en/latest/manual/performance-tips/. See also the other Julia entry which does significantly better. The submission is appreciated though :-)

– StefanKarpinski – 2014-04-30T15:08:31.680

Rust, 6.6 ms, 1950x speedup

Pretty much a direct translation of Alistair Buxton's code to Rust. I considered making use of multiple cores with rayon (fearless concurrency!), but this didn't improve the performance, probably because it's very fast already.

extern crate itertools;
extern crate rand;
extern crate time;

use itertools::Itertools;
use rand::{prelude::*, prng::XorShiftRng};
use std::iter;
use time::precise_time_ns;

fn main() {
    let start = precise_time_ns();

    let n = 6;
    let iters = 1000;
    let mut first_zero = 0;
    let mut both_zero = 0;
    let choices_f: Vec<Vec<i8>> = iter::repeat([-1, 0, 0, 1].iter().cloned())
        .take(n)
        .multi_cartesian_product()
        .filter(|i| i.iter().any(|&x| x != 0))
        .collect();
    // xorshift RNG is faster than default algorithm designed for security
    // rather than performance.
    let mut rng = XorShiftRng::from_entropy(); 
    for s in iter::repeat(&[-1, 1]).take(n + 1).multi_cartesian_product() {
        for _ in 0..iters {
            let f = rng.choose(&choices_f).unwrap();
            if f.iter()
                .zip(&s[..s.len() - 1])
                .map(|(a, &b)| a * b)
                .sum::<i8>() == 0
            {
                first_zero += 1;
                if f.iter().zip(&s[1..]).map(|(a, &b)| a * b).sum::<i8>() == 0 {
                    both_zero += 1;
                }
            }
        }
    }
    println!("first_zero = {}\nboth_zero = {}", first_zero, both_zero);

    println!("runtime {} ns", precise_time_ns() - start);
}

And Cargo.toml, as I use external dependencies:

[package]
name = "how_slow_is_python"
version = "0.1.0"

[dependencies]
itertools = "0.7.8"
rand = "0.5.3"
time = "0.1.40"

Speed comparison:

$ time python2 py.py
firstzero: 27478
bothzero: 12246
12.80user 0.02system 0:12.90elapsed 99%CPU (0avgtext+0avgdata 23328maxresident)k
0inputs+0outputs (0major+3544minor)pagefaults 0swaps
$ time target/release/how_slow_is_python
first_zero = 27359
both_zero = 12162
runtime 6625608 ns
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 2784maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps

6625608 ns is about 6.6 ms. This means 1950 times speedup. There are many optimizations possible here, but I was going for readability rather than performance. One possible optimization would be use arrays instead of vectors for storing choices, as they will always have n elements. It's also possible to use RNG other than XorShift, as while Xorshift is faster than the default HC-128 CSPRNG, it's slowest than naivest of PRNG algorithms.

Konrad Borowski

Posted 2014-04-26T18:26:17.890

Reputation: 11 185

How slow is Python really? (Or how fast is your language?)

Answers

C++ bit magic

0.84ms with simple RNG, 1.67ms with c++11 std::knuth

edit

Python2.7 + Numpy 1.8.1: 10.242 s

Fortran 90+: 0.029 s 0.003 s 0.022 s 0.010 s

Rust: 0.011s

Original Python: 8.3

C

J

MATLAB 0.024s

Julia: 0.30 s

Op's Python: 21.36 s (Core2 duo)

Haskell: ~2000x speedup per core

C# 0.135s

Single-threaded C

Parallel C#:

Test output:

Windows (.NET)

Linux (mono)

F# solution

Ruby

thread wouldnt be complete without PHP

6.6x faster

PHP v5.5.9 - 1.223 0.646 sec;

vs

Python v2.7.6 - 8.072 sec

Q, 0.296 seg

Julia: 12.149 6.929 s

Edit

Rust, 6.6 ms, 1950x speedup