Computer: you do the math

This challenge is partly an algorithms challenge, partly about mathematics, and partly simply a fastest-code challenge.

For some positive integer n, consider a uniformly random string of 1s and 0s of length n and call it A. Now also consider a second uniformly chosen random string of length n whose values are -1, 0, or 1 and call it B_pre. Now let B be B_pre + B_pre, that is, B_pre concatenated with itself.

Now consider the inner product of A and B[j,...,j+n-1] and call it Z_j (that is, Z_j = A[1]*B[j] + A[2]*B[j+1] + ... + A[n]*B[j+n-1]), indexing from 1.

Task

The output should be a list of n+1 fractions. The ith term in the output should be the exact probability that all of the first i terms, that is every Z_j with j <= i, equal 0.

Score

The largest n for which your code gives the correct output in under 10 minutes on my machine.

Tie Breaker

If two answers have the same score, the one submitted first wins.

In the (very very) unlikely event that someone finds a method to get unlimited scores, the first valid proof of such a solution will be accepted.

Hint

Do not try to solve this problem mathematically; it's too hard. The best way, in my view, is to go back to the basic definitions of probability from high school and find smart ways to get the code to do an exhaustive enumeration of the possibilities.
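
If you want to sanity-check an entry for very small n, a direct enumeration works. The following Python sketch is only a reference (far too slow to score); it literally counts, over all 2^n choices of A and 3^n choices of B_pre, how many pairs keep the first i inner products at zero:

from fractions import Fraction
from itertools import product

def probabilities(n):
    # counts[i] = number of (A, B_pre) pairs for which Z_1, ..., Z_{i+1} are all zero
    counts = [0] * (n + 1)
    for A in product((0, 1), repeat=n):
        for B_pre in product((-1, 0, 1), repeat=n):
            B = B_pre + B_pre                       # B_pre concatenated with itself
            for i in range(n + 1):                  # windows j = 1, ..., n+1
                if sum(A[k] * B[i + k] for k in range(n)) != 0:
                    break
                counts[i] += 1
    total = 6 ** n                                  # 2^n choices of A times 3^n choices of B_pre
    return [Fraction(c, total) for c in counts]

print(probabilities(3))                             # first entry equals 88/216 = 11/27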

Languages and libraries

You can use any language which has a freely available compiler/interpreter/etc. for Linux and any libraries which are also freely available for Linux.

My machine

The timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code. As a consequence, only use easily available free software, and please include full instructions on how to compile and run your code.


Some test outputs. Consider just the first output for each n, that is, when i = 1. For n from 1 to 13 they should be:

 1: 4/6
 2: 18/36
 3: 88/216
 4: 454/1296
 5: 2424/7776
 6: 13236/46656
 7: 73392/279936
 8: 411462/1679616
 9: 2325976/10077696
10: 13233628/60466176
11: 75682512/362797056
12: 434662684/2176782336
13: 2505229744/13060694016

You can also find a general formula for i=1 at http://oeis.org/A081671 .

Leaderboard (split by language)

  • n = 15. Python + parallel python + pypy in 1min49s by Jakube
  • n = 17. C++ in 3min37s by Keith Randall
  • n = 16. C++ in 2min38s by kuroi neko

user9206

Posted 2014-12-11T09:07:36.940

Reputation:

What is your graphics card? Looks like a job for a GPU. – Michael M. – 2014-12-11T22:57:57.643

@Knerd I added a table of probabilities to the question instead. I hope it is helpful. – None – 2014-12-12T09:55:18.777

@Mig PCI bridge: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 1) . This may be disappointing news for you, sorry. – None – 2014-12-12T11:04:17.563

http://oeis.org/A081671 – kennytm – 2014-12-12T11:25:59.683

@KennyTM Thanks I'll add that to the question. – None – 2014-12-12T11:27:32.263

@Lembik could you please add the execution time on your machine to the leaderboard? I'm really curious to see what your 8 cores can make of my code. – None – 2014-12-12T19:23:50.707

@kuroineko It only runs on one core! Have you tried it in your ubuntu install? – None – 2014-12-12T19:44:07.287

@Lembik Pfff... I can't get VirtualBox to allocate more than 1 CPU to my VM, so I can't test that. The horrible processor affinity code does not help either. Would you kindly comment out the lock_on_core() call and see if the scheduler does make some use of your 8 cores when left to its own initiative? – None – 2014-12-12T19:54:44.790

I know nothing about windows but http://askubuntu.com/questions/365615/how-do-i-enable-multiple-cores-in-my-virtual-enviroment seems related to your problem. Maybe there is a windows version of this. I also commented out lock_on_core(index); but it seems to have made no difference. – None – 2014-12-12T19:57:36.543

Pfff (bis)... VirtualBox does not allow me to change the settings despite what the article claims, and frankly I'm getting tired of this. Now about your native Ubuntu environment, I wonder what use threads are if the bloody OS and/or compiler does not manage to run them on all available cores? Is there some Linux magic to invoke to make that possible at all? – None – 2014-12-12T20:01:44.370

It really should work. Could you make a minimal example for me to test and possibly post on SO? – None – 2014-12-12T20:03:29.360

If it helps, https://bpaste.net/show/ee23472615c3 works and uses all my cores and I have to compile it with g++ -O3 -std=c++0x -pthread -Wl,--no-as-needed Stefan.cpp -o Stefan – None – 2014-12-12T20:07:40.467

@Knerd How can I say no. I will try to work out how to run your code in Linux but any help is much appreciated. – None – 2014-12-11T11:46:08.597

Ok, sorry for deleting comments. For all that didn't read them, the question was whether F# or C# are allowed :) – Knerd – 2014-12-11T11:47:11.467

The other question again, do you maybe have an example of a valid input output? – Knerd – 2014-12-11T11:47:44.220

Answers


C++, n=18 in 9 min on 8 threads

(Let me know if it makes it under 10 minutes on your machine.)

I take advantage of several forms of symmetry in the B array. Those are cyclic (shift by one position), reversal (reverse the order of the elements), and sign (take the negative of each element). First I compute the list of Bs we need to try and their weight. Then each B is run through a fast routine (using bitcount instructions) for all 2^n values of A.

Here's the result for n==18:

> time ./a.out 18
 1: 16547996212044 / 101559956668416
 2:  3120508430672 / 101559956668416
 3:   620923097438 / 101559956668416
 4:   129930911672 / 101559956668416
 5:    28197139994 / 101559956668416
 6:     6609438092 / 101559956668416
 7:     1873841888 / 101559956668416
 8:      813806426 / 101559956668416
 9:      569051084 / 101559956668416
10:      510821156 / 101559956668416
11:      496652384 / 101559956668416
12:      493092812 / 101559956668416
13:      492186008 / 101559956668416
14:      491947940 / 101559956668416
15:      491889008 / 101559956668416
16:      449710584 / 101559956668416
17:      418254922 / 101559956668416
18:      409373626 / 101559956668416

real    8m55.854s
user    67m58.336s
sys 0m5.607s

Compile the program below with g++ --std=c++11 -O3 -mpopcnt -pthread dot.cc (as noted in the comments, don't forget -pthread on Linux).

#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <thread>
#include <mutex>
#include <chrono>

using namespace std;

typedef long long word;

word n;

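// Hot loop for a single B: bpos and bneg hold the positions of the +1s and
// -1s of B as 2n-bit masks (the n-bit pattern duplicated into both halves),
// and w is the symmetry weight of this B. For every n-bit A mask, slide A
// across B by shifting; a window's inner product is zero exactly when it
// covers as many +1s as -1s, i.e. popcount(m & bpos) == popcount(m & bneg).
// cnt[n-1-k] accumulates w for each A whose first k+1 window products are all zero.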
void inner(word bpos, word bneg, word w, word *cnt) {
    word maxi = n-1;
    for(word a = (1<<n)-1; a >= 0; a--) {
        word m = a;
        for(word i = maxi; i >= 0; i--, m <<= 1) {
            if(__builtin_popcount(m&bpos) != __builtin_popcount(m&bneg))
                break;
            cnt[i]+=w;
        }
    }
}

word pow(word n, word e) {
    word r = 1;
    for(word i = 0; i < e; i++) r *= n;
    return r;
}

typedef struct {
    word b;
    word weight;
} Bentry;

mutex block;
Bentry *bqueue;
word bhead;
word btail;
word done = -1;

word maxb;

// compute -1*b
word bneg(word b) {
    word w = 1;
    for(word i = 0; i < n; i++, w *= 3) {
        word d = b / w % 3;
        if(d == 1)
            b += w;
        if(d == 2)
            b -= w;
    }
    return b;
}

// rotate b one position
word brot(word b) {
    b *= 3;
    b += b / maxb;
    b %= maxb;
    return b;
}

// reverse b
word brev(word b) {
    word r = 0;
    for(word i = 0; i < n; i++) {
        r *= 3;
        r += b % 3;
        b /= 3;
    }
    return r;
}

// individual thread's work routine
void work(word *cnt) {
    while(true) {
        // get a queue entry to work on
        block.lock();
        if(btail == done) {
            block.unlock();
            return;
        }
        if(bhead == btail) {
            block.unlock();
            this_thread::sleep_for(chrono::microseconds(10));
            continue;
        }
        word i = btail++;
        block.unlock();

        // thread now owns bqueue[i], work on it
        word b = bqueue[i].b;
        word w = 1;
        word bpos = 0;
        word bneg = 0;
        for(word j = 0; j < n; j++, b /= 3) {
            word d = b % 3;
            if(d == 1)
                bpos |= 1 << j;
            if(d == 2)
                bneg |= 1 << j;
        }
        bpos |= bpos << n;
        bneg |= bneg << n;
        inner(bpos, bneg, bqueue[i].weight, cnt);
    }
}

int main(int argc, char *argv[]) {
    n = atoi(argv[1]);

    // allocate work queue
    maxb = pow(3, n);
    bqueue = (Bentry*)(malloc(maxb*sizeof(Bentry)));

    // start worker threads
    word procs = thread::hardware_concurrency();
    vector<thread> threads;
    vector<word*> counts;
    for(word p = 0; p < procs; p++) {
        word *cnt = (word*)calloc(64+n*sizeof(word), 1);
        threads.push_back(thread(work, cnt));
        counts.push_back(cnt);
    }

    // figure out which Bs we actually want to test, and with which weights
    bool *bmark = (bool*)calloc(maxb, 1);
    for(word i = 0; i < maxb; i++) {
        if(bmark[i]) continue;
        word b = i;
        word w = 0;
        for(word j = 0; j < 2; j++) {
            for(word k = 0; k < 2; k++) {
                for(word l = 0; l < n; l++) {
                    if(!bmark[b]) {
                        bmark[b] = true;
                        w++;
                    }
                    b = brot(b);
                }
                b = bneg(b);
            }
            b = brev(b);
        }
        bqueue[bhead].b = i;
        bqueue[bhead].weight = w;
        block.lock();
        bhead++;
        block.unlock();
    }
    block.lock();
    done = bhead;
    block.unlock();

    // add up results from threads
    word *cnt = (word*)calloc(n,sizeof(word));
    for(word p = 0; p < procs; p++) {
        threads[p].join();
        for(int i = 0; i < n; i++) cnt[i] += counts[p][i];
    }
    for(word i = 0; i < n; i++)
        printf("%2lld: %14lld / %14lld\n", i+1, cnt[n-1-i], maxb<<n);
    return 0;
}

Keith Randall

Posted 2014-12-11T09:07:36.940

Reputation: 19 865

Good, that spares me from working further on my own pet monster... – None – 2014-12-15T07:03:24.347

Thanks for this. You have the current winning entry. We do have to remember -pthread again. I get to n=17 on my machine. – None – 2014-12-15T09:24:06.943

Oops.. You should have got the full bounty. Sorry I missed the deadline. – None – 2014-12-21T12:13:15.217

@Lembik: no problem. – Keith Randall – 2014-12-21T18:00:50.073


Python 2 using pypy and pp: n = 15 in 3 minutes

Also just a simple brute force. Interesting to see that I get nearly the same speed as kuroi neko's C++. My code can reach n = 12 in about 5 minutes. And I only run it on one virtual core.

edit: Reduce search space by a factor of n

I noticed that a cycled vector A* of A produces the same probabilities as the original vector A when I iterate over B. E.g. the vector (1, 1, 0, 1, 0, 0) has the same probabilities as each of the vectors (1, 0, 1, 0, 0, 1), (0, 1, 0, 0, 1, 1), (1, 0, 0, 1, 1, 0), (0, 0, 1, 1, 0, 1) and (0, 1, 1, 0, 1, 0) when choosing a random B. Therefore I don't have to iterate over each of these 6 vectors, but only over one of them and replace count[i] += 1 with count[i] += cycle_number.

This reduces the complexity from Theta(6^n) to Theta(6^n / n). Therefore for n = 13 it's about 13 times as fast as my previous version. It calculates n = 13 in about 2 minutes 20 seconds. For n = 14 it is still a little bit too slow; it takes about 13 minutes.

edit 2: Multi-core-programming

Not really happy with this next improvement. I decided to also try executing my program on multiple cores. On my 2 + 2 cores I can now calculate n = 14 in about 7 minutes. Only a factor of 2 improvement.

The code is available in this github repo: Link. The multi-core programming makes it a little bit ugly.

edit 3: Reducing search space for A vectors and B vectors

I noticed the same mirror-symmetry for the vectors A as kuroi neko did. I'm still not sure why this works (or whether it works for every n).

Reducing the search space for the B vectors is a bit cleverer. I replaced the generation of the vectors (itertools.product) with my own function. Basically, I start with an empty list and put it on a stack. While the stack is not empty, I remove a list; if it does not yet have length n, I generate 3 other lists (by appending -1, 0, 1) and push them onto the stack. If a list has length n, I can evaluate the sums.

Now that I generate the vectors myself, I can filter them depending on whether I can still reach sum = 0 or not. E.g. if my vector A is (1, 1, 1, 0, 0) and my vector B looks like (1, 1, ?, ?, ?), I know that I cannot fill in the ?s with values so that A*B = 0. So I don't have to iterate over any of the vectors B of the form (1, 1, ?, ?, ?).

We can improve on this if we ignore the values for i = 1. As noted in the question, the values for i = 1 form the sequence A081671. There are many ways to calculate them; I chose the simple recurrence a(n) = (4*(2*n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n. Since we can calculate i = 1 in basically no time, we can filter more vectors for B. E.g. A = (0, 1, 0, 1, 1) and B = (1, -1, ?, ?, ?). We can ignore vectors where the first ? = 1, because A * cycled(B) > 0 for all these vectors. I hope you can follow; it's probably not the best example.

With this I can calculate n = 15 in 6 minutes.
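
For reference, a minimal sketch of such a recurrence helper (this is only an illustration of the formula above; the oeis_A081671 call in the snippet further down is assumed to do something along these lines):

def oeis_A081671(n):
    # a(0) = 1, a(1) = 4, a(k) = (4*(2*k-1)*a(k-1) - 12*(k-1)*a(k-2)) / k
    if n == 0:
        return 1
    a, b = 1, 4
    for k in range(2, n + 1):
        a, b = b, (4 * (2 * k - 1) * b - 12 * (k - 1) * a) // k
    return b

# oeis_A081671(13) == 2505229744, the n = 13 value from the question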

edit 4:

Quickly implemented kuroi neko's great idea, which says that B and -B produce the same results. Speedup x2. The implementation is only a quick hack, though. n = 15 in 3 minutes.

Code:

For the complete code visit Github. The following code is only a representation of the main features. I left out imports, multicore programming, printing the results, ...

count = [0] * n
count[0] = oeis_A081671(n)

#generating all important vector A
visited = set(); todo = dict()
for A in product((0, 1), repeat=n):
    if A not in visited:
        # generate all vectors, which have the same probability
        # mirrored and cycled vectors
        same_probability_set = set()
        for i in range(n):
            tmp = [A[(i+j) % n] for j in range(n)]
            same_probability_set.add(tuple(tmp))
            same_probability_set.add(tuple(tmp[::-1]))
        visited.update(same_probability_set)
        todo[A] = len(same_probability_set)

# for each vector A, create all possible vectors B
stack = []
for A, cycled_count in todo.iteritems():
    ones = [sum(A[i:]) for i in range(n)] + [0]
    # + [0], so that later ones[n] doesn't throw a exception
    stack.append(([0] * n, 0, 0, 0, False))

    while stack:
        B, index, sum1, sum2, used_negative = stack.pop()
        if index < n:
            # fill vector B[index] in all possible ways,
            # so that it's still possible to reach 0.
            if used_negative:
                for v in (-1, 0, 1):
                    sum1_new = sum1 + v * A[index]
                    sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                    if abs(sum1_new) <= ones[index+1]:
                        if abs(sum2_new) <= ones[index] - A[n-1]:
                            C = B[:]
                            C[index] = v
                            stack.append((C, index + 1, sum1_new, sum2_new, True))
            else:
                for v in (0, 1):
                    sum1_new = sum1 + v * A[index]
                    sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                    if abs(sum1_new) <= ones[index+1]:
                        if abs(sum2_new) <= ones[index] - A[n-1]:
                            C = B[:]
                            C[index] = v
                            stack.append((C, index + 1, sum1_new, sum2_new, v == 1))
        else:
            # B is complete, calculate the sums
            count[1] += cycled_count  # we know that the sum = 0 for i = 1
            for i in range(2, n):
                sum_prod = 0
                for j in range(n-i):
                    sum_prod += A[j] * B[i+j]
                for j in range(i):
                    sum_prod += A[n-i+j] * B[j]
                if sum_prod:
                    break
                else:
                    if used_negative:
                        count[i] += 2*cycled_count
                    else:
                        count[i] += cycled_count

Usage:

You have to install pypy (for Python 2!!!). The parallel python module isn't ported for Python 3. Then you have to install the parallel python module pp-1.6.4.zip. Extract it, cd into the folder and call pypy setup.py install.

Then you can call my program with

pypy you-do-the-math.py 15

It will automatically determine the number of CPUs. There may be some error messages after the program finishes; just ignore them. n = 16 should be possible on your machine.

Output:

Calculation for n = 15 took 2:50 minutes

 1  83940771168 / 470184984576  17.85%
 2  17379109692 / 470184984576   3.70%
 3   3805906050 / 470184984576   0.81%
 4    887959110 / 470184984576   0.19%
 5    223260870 / 470184984576   0.05%
 6     67664580 / 470184984576   0.01%
 7     30019950 / 470184984576   0.01%
 8     20720730 / 470184984576   0.00%
 9     18352740 / 470184984576   0.00%
10     17730480 / 470184984576   0.00%
11     17566920 / 470184984576   0.00%
12     17521470 / 470184984576   0.00%
13     17510280 / 470184984576   0.00%
14     17507100 / 470184984576   0.00%
15     17506680 / 470184984576   0.00%

Notes and ideas:

  • I have an i7-4600M processor with 2 cores and 4 threads. It doesn't matter if I use 2 or 4 threads. The CPU usage is 50% with 2 threads and 100% with 4 threads, but it still takes the same amount of time. I don't know why. I checked that each thread only has to process half the amount of data when there are 4 threads, checked the results, ...
  • I use a lot of lists. Python isn't very efficient at storing them, and I have to copy lots of lists, ... So I thought of using an integer instead. I could use the bits 00 (for 0) and 11 (for 1) in the vector A, and the bits 10 (for -1), 00 (for 0) and 01 (for 1) in the vector B. For the product of A and B, I would only have to calculate A & B and count the 01 and 10 blocks (see the sketch below). Cycling can be done by shifting the vector and using masks, ... I actually implemented all this; you can find it in some of my older commits on Github. But it turned out to be slower than with lists. I guess pypy really optimizes list operations.
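
For illustration, that 2-bit encoding could look roughly like this (a hypothetical sketch, not the code from the old commits):

def encode_A(A):
    # two bits per position: 0b11 for a 1, 0b00 for a 0
    e = 0
    for a in reversed(A):
        e = (e << 2) | (0b11 if a else 0b00)
    return e

def encode_B(B):
    # two bits per position: 0b01 for +1, 0b10 for -1, 0b00 for 0
    e = 0
    for b in reversed(B):
        e = (e << 2) | (0b01 if b == 1 else 0b10 if b == -1 else 0b00)
    return e

def dot_is_zero(a_enc, b_enc, n):
    # keep only the B blocks where A has a 1, then compare the number of +1 and -1 blocks
    x = a_enc & b_enc
    plus = bin(x & int("01" * n, 2)).count("1")
    minus = bin(x & int("10" * n, 2)).count("1")
    return plus == minus

# dot_is_zero(encode_A((1, 1, 0)), encode_B((1, -1, 0)), 3) is True, since 1*1 + 1*(-1) + 0*0 = 0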

Jakube

Posted 2014-12-11T09:07:36.940

Reputation: 21 462

On my PC the n=12 run takes 7:25 while my C++ piece of junk takes about 1:23, which makes it about 5 times faster. With only two true cores, my CPU gains something like a 2.5 factor compared with a single-threaded application, so a true 8-core CPU should run something like 3 times faster, and that's not counting the basic single-core speed improvement over my ageing i3-2100. Whether going through all these C++ hoops to tackle an exponentially growing computation time is worth the effort is debatable, though. – None – 2014-12-12T16:07:52.880

I'm getting a feeling of http://codegolf.stackexchange.com/questions/41021/find-highest-scoring-matrix-without-property-x... Would de Bruijn sequence be useful? – kennytm – 2014-12-12T16:36:31.283

about multithreading, you could squeeze out a bit more of your 2+2 cores by locking each thread on one. The x2 gain is due to the scheduler shifting your threads around each time a matchstick is moved in the system. With core locking, you would probably get a x2.5 gain instead. No idea if Python allows setting processor affinity, though. – None – 2014-12-12T19:17:38.780

Thanks, I will look into it. But I'm pretty much a newbie in multithreading. – Jakube – 2014-12-12T19:26:29.523

http://nbviewer.ipython.org/gist/minrk/5500077 has some mention of this, albeit using a different tool for the parallelism. – None – 2014-12-12T19:45:49.007

The pypy people would suggest not using itertools. I don't know how much of a speedup this would give however. – None – 2014-12-12T21:00:38.837

I don't think so. It seems quite fast. My own implementation of the product function is slower. One thing that helps a bit is to store all possible vectors B and iterate over them, instead of generating them each time. It reduces the time for n = 14 from 7 minutes to 6 minutes, but it uses 500 MB RAM for each core. – Jakube – 2014-12-12T21:46:45.560

How about numba or cython next? – None – 2014-12-13T10:59:39.820

I suspect you could speed up your inner loop by generating the true B vector (your B variable is actually what Lembik calls Bpre) and iterating over it in a single run. That would reduce index computation somewhat. – None – 2014-12-13T19:51:21.017

It also seems to me that you would gain quite a bit by swapping the A and B loops, since B computation is the slowest process (you don't want to recompute all B values for each A group). That would probably spare you from pre-computing the huge B range for equivalent performance. – None – 2014-12-13T20:12:24.817

I actually did a few improvements regarding iterating over the B-loop. I can now do n=15 in 9 minutes on my laptop. Haven't published the changes though. – Jakube – 2014-12-13T21:03:57.453

You may be able to do some bit tricks to speed up the inner products themselves. Maybe something like http://stackoverflow.com/questions/19732598/fast-inner-product-of-ternary-vectors ? – None – 2014-12-14T07:49:22.630

Your A-dependent B optimization is very clever and promising. Instead of relying on a brute-force contest where C++ is bound to win, it would be interesting to let the program output the number of inner products actually computed. That would be a pretty reliable metric to compare the optimizations. What do you think? – None – 2014-12-14T22:02:13.043

Thanks for this! I can't test it now but I will do it as soon as possible tomorrow. – None – 2014-12-14T22:07:37.770

@kuroineko That's a very nice idea indeed. – None – 2014-12-14T22:08:00.070

@Jakube now you've done it... I simply must teach your trick to my woolly bully :) – None – 2014-12-14T22:43:03.653


woolly bully - C++ - way too slow

Well, since a better programmer took on the C++ implementation, I'm calling it quits on this one.

#include <cstdlib>
#include <cmath>
#include <vector>
#include <bitset>
#include <future>
#include <iostream>
#include <iomanip>

using namespace std;

/*
6^n events will be generated, so the absolute max
that can be counted by a b-bit integer is
E(b*log(2)/log(6)), i.e. n=24 for a 64-bit counter

To enumerate 3 possible values of a size n vector we need
E(n*log(3)/log(2))+1 bits, i.e. 39 bits
*/
typedef unsigned long long Counter; // counts up to 6^24

typedef unsigned long long Benumerator; // 39 bits
typedef unsigned long      Aenumerator; // 24 bits

#define log2_over_log6 0.3869

#define A_LENGTH ((size_t)(8*sizeof(Counter)*log2_over_log6))
#define B_LENGTH (2*A_LENGTH)

typedef bitset<B_LENGTH> vectorB;

typedef vector<Counter> OccurenceCounters;

// -----------------------------------------------------------------
// multithreading junk for CPUs detection and allocation
// -----------------------------------------------------------------
int number_of_CPUs(void)
{
    int res = thread::hardware_concurrency();
    return res == 0 ? 8 : res;
}

#ifdef __linux__
#include <sched.h>
void lock_on_CPU(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}
#elif defined (_WIN32)
#include <Windows.h>
#define lock_on_CPU(cpu) SetThreadAffinityMask(GetCurrentThread(), 1 << cpu)
#else
// #warning is not really standard, so this might still cause compiler errors on some platforms. Sorry about that.
#warning "Thread processor affinity settings not supported. Performances might be improved by providing a suitable alternative for your platform"
#define lock_on_CPU(cpu)
#endif

// -----------------------------------------------------------------
// B values generator
// -----------------------------------------------------------------
struct Bvalue {
    vectorB p1;
    vectorB m1;
};

struct Bgenerator {
    int n;                 // A length
    Aenumerator stop;      // computation limit
    Aenumerator zeroes;    // current zeroes pattern
    Aenumerator plusminus; // current +1/-1 pattern
    Aenumerator pm_limit;  // upper bound of +1/-1 pattern

    Bgenerator(int n, Aenumerator start=0, Aenumerator stop=0) : n(n), stop(stop)
    {
        // initialize generator so that first call to next() will generate first value
        zeroes    = start - 1;
        plusminus = -1;
        pm_limit  = 0;
    }

    // compute current B value
    Bvalue value(void)
    {
        Bvalue res;
        Aenumerator pm = plusminus;
        Aenumerator position = 1;
        int i_pm = 0;
        for (int i = 0; i != n; i++)
        {
            if (zeroes & position)
            {
                if (i_pm == 0)  res.p1 |= position; // first non-zero value fixed to +1
                else         
                {
                    if (pm & 1) res.m1 |= position; // next non-zero values
                    else        res.p1 |= position;
                    pm >>= 1;
                }
                i_pm++;
            }
            position <<= 1;
        }
        res.p1 |= (res.p1 << n); // concatenate 2 Bpre instances
        res.m1 |= (res.m1 << n);
        return res;
    }

    // next value
    bool next(void)
    {
        if (++plusminus == pm_limit)
        {
            if (++zeroes == stop) return false;
            plusminus = 0;
            pm_limit = (1 << vectorB(zeroes).count()) >> 1;
        }
        return true;
    }

    // calibration: produces ranges that will yield the approximate same number of B values
    vector<Aenumerator> calibrate(int segments)
    {
        // setup generator for the whole B range
        zeroes = 0;
        stop = 1 << n;
        plusminus = -1;
        pm_limit = 0;

        // divide range into (nearly) equal chunks
        Aenumerator chunk_size = ((Aenumerator)pow (3,n)-1) / 2 / segments;

        // generate bounds for zeroes values
        vector<Aenumerator> res(segments + 1);
        int bound = 0;
        res[bound] = 1;
        Aenumerator count = 0;
        while (next()) if (++count % chunk_size == 0) res[++bound] = zeroes;
        res[bound] = stop;
        return res;
    }
};

// -----------------------------------------------------------------
// equiprobable A values merging
// -----------------------------------------------------------------
static char A_weight[1 << A_LENGTH];
struct Agroup {
    vectorB value;
    int     count;
    Agroup(Aenumerator a = 0, int length = 0) : value(a), count(length) {}
};
static vector<Agroup> A_groups;

Aenumerator reverse(Aenumerator n) // this works on N-1 bits for a N bits word
{
    Aenumerator res = 0;
    if (n != 0) // must have at least one bit set for the rest to work
    {
        // construct left-padded reverse value
        for (int i = 0; i != 8 * sizeof(n)-1; i++)
        {
            res |= (n & 1);
            res <<= 1;
            n >>= 1;
        }

        // shift right to eliminate trailing zeroes
        while (!(res & 1)) res >>= 1;
    }
    return res;
}

void generate_A_groups(int n)
{
    static bitset<1 << A_LENGTH> lookup(0);
    Aenumerator limit_A = (Aenumerator)pow(2, n);
    Aenumerator overflow = 1 << n;
    for (char & w : A_weight) w = 0;

    // gather rotation cycles
    for (Aenumerator a = 0; a != limit_A; a++)
    {
        Aenumerator rotated = a;
        int cycle_length = 0;
        for (int i = 0; i != n; i++)
        {
            // check for new cycles
            if (!lookup[rotated])
            {
                cycle_length++;
                lookup[rotated] = 1;
            }

            // rotate current value
            rotated <<= 1;
            if (rotated & overflow) rotated |= 1;
            rotated &= (overflow - 1);
        }

        // store new cycle
        if (cycle_length > 0) A_weight[a] = cycle_length;
    }

    // merge symmetric groups
    for (Aenumerator a = 0; a != limit_A; a++)
    {
        // skip already grouped values
        if (A_weight[a] == 0) continue;

        // regroup a symmetric pair
        Aenumerator r = reverse(a);
        if (r != a)
        {
            A_weight[a] += A_weight[r];
            A_weight[r] = 0;
        }  
    }

    // generate groups
    for (Aenumerator a = 0; a != limit_A; a++)
    {
        if (A_weight[a] != 0) A_groups.push_back(Agroup(a, A_weight[a]));
    }
}

// -----------------------------------------------------------------
// worker thread
// -----------------------------------------------------------------
OccurenceCounters solve(int n, int index, Aenumerator Bstart, Aenumerator Bstop)
{
    OccurenceCounters consecutive_zero_Z(n, 0);  // counts occurrences of the first i terms of Z being 0

    // lock on assigned CPU
    lock_on_CPU(index);

    // enumerate B vectors
    Bgenerator Bgen(n, Bstart, Bstop);
    while (Bgen.next())
    {
        // get next B value
        Bvalue B = Bgen.value();

        // enumerate A vector groups
        for (const auto & group : A_groups)
        {
            // count consecutive occurences of inner product equal to zero
            vectorB sliding_A(group.value);
            for (int i = 0; i != n; i++)
            {
                if ((sliding_A & B.p1).count() != (sliding_A & B.m1).count()) break;
                consecutive_zero_Z[i] += group.count;
                sliding_A <<= 1;
            }
        }
    }
    return consecutive_zero_Z;
}

// -----------------------------------------------------------------
// main
// -----------------------------------------------------------------
#define die(msg) { cout << msg << endl; exit (-1); }

int main(int argc, char * argv[])
{
    int n = argc == 2 ? atoi(argv[1]) : 16; // arbitrary value for debugging
    if (n < 1 || n > 24) die("vectors of length between 1 and 24 is all I can (try to) compute, guv");

    auto begin = time(NULL);

    // one worker thread per CPU
    int num_workers = number_of_CPUs();

    // regroup equiprobable A values
    generate_A_groups(n);

    // compute B generation ranges for proper load balancing
    vector<Aenumerator> ranges = Bgenerator(n).calibrate(num_workers);

    // set workers to work
    vector<future<OccurenceCounters>> workers(num_workers);
    for (int i = 0; i != num_workers; i++)
    {
        workers[i] = async(
            launch::async, // without this parameter, C++ will decide whether execution shall be sequential or asynchronous (isn't C++ fun?).
            solve, n, i, ranges[i], ranges[i+1]); 
    }

    // collect results
    OccurenceCounters result(n + 1, 0);
    for (auto& worker : workers)
    {
        OccurenceCounters partial = worker.get();
        for (size_t i = 0; i != partial.size(); i++) result[i] += partial[i]*2; // each result counts for a symmetric B pair
    }
    for (Counter & res : result) res += (Counter)1 << n; // add null B vector contribution
    result[n] = result[n - 1];                           // the last two probabilities are equal by construction

    auto duration = time(NULL) - begin;

    // output
    cout << "done in " << duration / 60 << ":" << setw(2) << setfill('0') << duration % 60 << setfill(' ')
        << " by " << num_workers << " worker thread" << ((num_workers > 1) ? "s" : "") << endl;
    Counter events = (Counter)pow(6, n);
    int width = (int)log10(events) + 2;
    cout.precision(5);
    for (int i = 0; i <= n; i++) cout << setw(2) << i << setw(width) << result[i] << " / " << events << " " << fixed << (float)result[i] / events << endl;

    return 0;
}

Building the executable

It's a standalone C++11 source that compiles without warnings and runs smoothly on:

  • Win7 & MSVC2013
  • Win7 & MinGW - g++4.7
  • Ubuntu & g++4.8 (in a VirtualBox VM with 2 CPUs allocated)

If you compile with g++, use: g++ -O3 -pthread -std=c++11
Forgetting -pthread will produce a nice and friendly core dump.

Optimizations

  1. The last Z term is equal to the first (Bpre x A in both cases), so the last two results are always equal, which dispenses with computing the last Z value.
    The gain is negligible, but coding it costs nothing, so you might as well use it.

  2. As Jakube discovered, all cyclic shifts of a given A vector produce the same probabilities.
    You can compute these with a single instance of A and multiply the result by the number of its possible rotations. Rotation groups can easily be pre-computed in a negligible amount of time, so this is a huge net speed gain.
    Since a vector of length n has n rotations, the complexity drops from O(6^n) to roughly O(6^n/n), basically going a step further for the same computation time.

  3. It appears pairs of mirror-symmetric patterns also generate the same probabilities, for instance 100101 and 101001.
    I have no mathematical proof of that, but intuitively, when presented with all possible B patterns, each symmetric A value will be convolved with the corresponding symmetric B value for the same global outcome.
    This allows regrouping some more A vectors, for an approximately 30% reduction in the number of A groups.

  4. WRONG For some semi-mysterious reason, all patterns with only one or two bits set produce the same result. This does not represent that many distinct groups, but still they can be merged at virtually no cost.

  5. The vectors B and -B (B with all components multiplied by -1) produce the same probabilities
    (for instance [ 1, 0,-1, 1] and [-1, 0, 1,-1]).
    Except for the null vector (all components equal to 0), B and -B form a pair of distinct vectors.
    This allows cutting the number of B values in half by considering only one of each pair and multiplying its contribution by 2, adding the known global contribution of the null B to each probability only once (a sketch of this bookkeeping follows the list).
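
A standalone sketch of that counting scheme (plain Python, only to illustrate the bookkeeping behind optimisation 5; the helper names are made up and this is not the C++ code above):

from itertools import product

def zero_run(A, B_pre, n):
    # number of leading window products Z_1, Z_2, ... that are zero
    B = B_pre + B_pre
    r = 0
    while r <= n and sum(A[k] * B[r + k] for k in range(n)) == 0:
        r += 1
    return r

def counts_via_half_the_Bs(n):
    counts = [0] * (n + 1)
    for A in product((0, 1), repeat=n):
        for B_pre in product((-1, 0, 1), repeat=n):
            first_non_zero = next((v for v in B_pre if v), 0)
            if first_non_zero != 1:
                continue                      # skip -B representatives and the null vector
            for i in range(zero_run(A, B_pre, n)):
                counts[i] += 2                # B and -B contribute identically
        for i in range(n + 1):
            counts[i] += 1                    # the null B_pre makes every Z_j zero
    return counts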

How it works

The number of B values is huge (3^n), so pre-computing them would require indecent amounts of memory, which would slow down computation and eventually exhaust available RAM.
Unfortunately, I could not find a simple way of enumerating the half set of optimized B values, so I resorted to coding a dedicated generator.

The mighty B generator was a lot of fun to code, though languages that support yield mechanisms would have allowed programming it in a much more elegant way.
What it does, in a nutshell, is consider the "skeleton" of a Bpre vector as a binary vector where 1s represent actual -1 or +1 values.
Among all these +1/-1 potential values, the first one is fixed to +1 (thus selecting one of each possible B/-B pair), and all remaining possible +1/-1 combinations are enumerated.
Lastly, a simple calibration system ensures each worker thread will process a range of values of approximately the same size.

A values are heavily filtered for regrouping into equiprobable chunks.
This is done in a pre-computing phase that brute-force examines all possible values.
This part has a negligible O(2^n) execution time and does not need to be optimized (the code being already unreadable enough as it is!).

To evaluate the inner product (which only needs to be tested against zero), the -1 and +1 components of B are regrouped into binary vectors.
The inner product is null if (and only if) there is an equal number of +1s and -1s among the B values corresponding to non-zero A values.
This can be computed with simple masking and bit-count operations, helped by std::bitset, which produces reasonably efficient bit-count code without having to resort to ugly intrinsic instructions.

The work is equally divided among cores, with forced CPU affinity (every little bit helps, or so they say).

Example result

C:\Dev\PHP\_StackOverflow\C++\VectorCrunch>release\VectorCrunch.exe 16
done in 8:19 by 4 worker threads
 0  487610895942 / 2821109907456 0.17284
 1   97652126058 / 2821109907456 0.03461
 2   20659337010 / 2821109907456 0.00732
 3    4631534490 / 2821109907456 0.00164
 4    1099762394 / 2821109907456 0.00039
 5     302001914 / 2821109907456 0.00011
 6     115084858 / 2821109907456 0.00004
 7      70235786 / 2821109907456 0.00002
 8      59121706 / 2821109907456 0.00002
 9      56384426 / 2821109907456 0.00002
10      55686922 / 2821109907456 0.00002
11      55508202 / 2821109907456 0.00002
12      55461994 / 2821109907456 0.00002
13      55451146 / 2821109907456 0.00002
14      55449098 / 2821109907456 0.00002
15      55449002 / 2821109907456 0.00002
16      55449002 / 2821109907456 0.00002

Performances

Multithreading should work perfectly on this, though only "true" cores will fully contribute to computation speed. My CPU has only 2 cores for 4 logical CPUs, and the gain over a single-threaded version is "only" about 3.5.

Compilers

An initial problem with multithreading led me to believe the GNU compilers were performing worse than Microsoft.

After more thorough testing, it appears g++ wins the day yet again, producing approximately 30% faster code (the same ratio I noticed on two other computation-heavy projects).

Notably, the std::bitset library is implemented with dedicated bit count instructions by g++ 4.8, while MSVC 2013 uses only loops of conventional bit shifts.

As one could have expected, compiling in 32 or 64 bits makes no difference.

Further refinements

I noticed a few A groups producing the same probabilities after all the reduction operations, but I could not identify a pattern that would allow regrouping them.

Here are the pairs I noticed for n=11:

  10001011 and 10001101
 100101011 and 100110101
 100101111 and 100111101
 100110111 and 100111011
 101001011 and 101001101
 101011011 and 101101011
 101100111 and 110100111
1010110111 and 1010111011
1011011111 and 1011111011
1011101111 and 1011110111

user16991

Posted 2014-12-11T09:07:36.940

Reputation:

Could you give some advice on how to compile and run this? I tried g++ -Wall -std=c++11 kuroineko.cpp -o kuroineko and ./kuroineko 12 but it gives terminate called after throwing an instance of 'std::system_error' what(): Unknown error -1 Aborted (core dumped) – None – 2014-12-12T10:12:12.723

I ran it through gdb where you get a little more info. It might be worth pasting the output of that too to the SO question. – None – 2014-12-12T15:12:18.377

g++ -Wall kuroineko.cpp -O3 -pthread -std=c++11 -o kuroineko works. – None – 2014-12-12T15:36:10.617

yep, pthread was the culprit... Out of curiosity, does the Linux executable use all 8 of your cores, and how far does it go compared with the python code? – None – 2014-12-12T15:37:24.977

@Sp3000 That seems very odd: on my vintage i3-2100 n=13 takes only a bit more than 5 minutes. What compiler and options did you use? – None – 2014-12-12T16:26:02.130

k, on my Windows 8 four core laptop C++ took 6:32 for n = 13, and Jakube's previous version took 42 min. – Sp3000 – 2014-12-12T16:36:34.223

Can you just turn the #error to a #warning and define lock_on_core() to be no-op? This allows the code be compiled on OS X. – kennytm – 2014-12-12T20:48:16.280

@KennyTM That should be fixed (along with the infuriating multithreading issue, by the way). – None – 2014-12-12T23:34:51.733

All cores being used now. I don't have 8 real cores either of course. – None – 2014-12-13T08:56:16.820

Your further refinements part is intriguing. If you group B instead of A, do you see similar patterns? – None – 2014-12-14T21:38:18.557

B vectors are more complicated to observe. Detecting symmetries in a binary vector is much simpler than in a 3-valued one. I'll see if I can come up with a representation that allows "visual" recognition, but I'm not sure that will be either easy or even possible – None – 2014-12-14T21:47:59.040

Also bear in mind that the "symmetric" patterns are only representatives of their whole rotation group. It does not seem the mysterious remaining equiprobable vector pairs are reducible by rotation+symmetry, though. – None – 2014-12-14T21:51:04.100

I would be interested to know, if you look at all B's with the same number of 1s and -1s, how many different groups do you get? Say you have 6 1s, 6 -1s and 4 0s. – None – 2014-12-14T22:06:50.167

Mmmm.. I see. That should be doable, with a bit of time and efforts :) – None – 2014-12-14T22:08:41.823

B = -B is a brilliant idea. Already implemented ;-) – Jakube – 2014-12-14T22:25:19.040

Your current example output is wrong. The correct data should be http://pastebin.com/BXX4hC1c – Jakube – 2014-12-14T23:38:20.143

Yep, seems like the empirical "2-bit A patterns all produce the same result" rule holds true only until n=15 :). Should be OK now. Thanks for the check. – None – 2014-12-15T03:46:32.050

Don't give up now! I really like your observations and I am excited you could make more. (I think -mpopcnt is also a big speedup if you can use that.) – None – 2014-12-15T10:20:06.500

I think the last two probabilities should always be the same. This is because the n+1th inner product is actually the same as the first one. – None – 2014-12-11T19:38:05.287

What I meant was that the first n inner products are zero if and only if the first n+1 are. The very last inner product doesn't provide any new information as you have already done it before. So the number of strings that give n zero products is exactly the same as the number that give n+1 zero products. – None – 2014-12-11T19:41:48.880

Out of interest, what were you computing instead exactly? – None – 2014-12-11T19:55:55.073

Thanks for the update but I don't understand the line "0 2160009216 2176782336". What exactly are you counting in this case? The probability that the first inner product is zero is much smaller than that. – None – 2014-12-11T20:10:49.293