How high can you go? (A coding+algorithms challenge)

Now that everyone has developed their (often amazing) low-level coding expertise for How slow is Python really? (Or how fast is your language?) and How Slow Is Python Really (Part II)?, it is time for a challenge which will also stretch your ability to improve an algorithm.

The following code computes a list of length 9 (for n = 8). Position i in the list (0-indexed) counts the number of (F, S) pairs for which the first i+1 inner products between F and successive length-n windows of S are all zero. To do this exactly, it iterates over all possible lists F of length n and lists S of length n+m-1.

#!/usr/bin/python
import itertools
import operator

n=8
m=n+1
leadingzerocounts = [0]*m
for S in itertools.product([-1,1], repeat = n+m-1):
    for F in itertools.product([-1,1], repeat = n):
        i = 0
        while (i<m and sum(map(operator.mul, F, S[i:i+n])) == 0):
            leadingzerocounts[i] +=1
            i+=1
print leadingzerocounts

The output is

[4587520, 1254400, 347648, 95488, 27264, 9536, 4512, 2128, 1064]

If you increase n to 10, 12, 14, 16, 18, 20, this code very rapidly becomes far too slow.

Rules

  • The challenge is to give the correct output for as large an n as possible. Only even values of n are relevant.
  • If there is a tie, the win goes to the fastest code on my machine for the largest n.
  • I reserve the right not to test code that takes more than 10 minutes.
  • You may change the algorithm in any way you like as long as it gives the correct output. In fact you will have to change the algorithm to make any decent progress toward winning.
  • The winner will be awarded one week from when the question was set.
  • The bounty will be awarded when it is due which is a little after when the winner will be awarded.

My Machine

The timings will be run on my machine: a standard Ubuntu install on an AMD FX-8350 eight-core processor. This also means I need to be able to run your code, so please use only easily available free software and include full instructions on how to compile and run it.

Status.

  • C. n = 12 in 49 seconds by @Fors
  • Java. n = 16 in 3:07 by @PeterTaylor
  • C++. n = 16 in 2:21 by @ilmale
  • Rpython. n = 22 in 3:11 by @primo
  • Java. n = 22 in 6:56 by @PeterTaylor
  • Nimrod. n = 24 in 9:28 by @ReimerBehrends

The winner was Reimer Behrends with an entry in Nimrod!

As a check, the output for n = 22 should be [12410090985684467712, 2087229562810269696, 351473149499408384, 59178309967151104, 9975110458933248, 1682628717576192, 284866824372224, 48558946385920, 8416739196928, 1518499004416, 301448822784, 71620493312, 22100246528, 8676573184, 3897278464, 1860960256, 911646720, 451520512, 224785920, 112198656, 56062720, 28031360, 14015680].


The competition is now over but... I will offer 200 points for every submission that increases n by 2 (within 10 minutes on my computer) until I run out of points. This offer is open forever.

user9206

Posted 2014-04-30T20:47:41.673

Reputation:

1"I reserve the right not to test code that takes more than a few minutes." > You should specify an exact time on your machine, otherwise this question will lack an objective winning criterion. – user12205 – 2014-04-30T20:50:07.450

14I love these "increase my speed" challenges. If you are building a commercial product with these, you are going to have one hell of a fast product, and you are also an evil genius. – Rainbolt – 2014-04-30T20:53:42.627

@Rusher I 100% promise I am not going to make 1 cent from any of this. I just like series of questions of increasing difficulty! It somehow reminds me of Bruce Lee movies :) – None – 2014-04-30T20:58:11.713

@Rusher Technically, unless you have permission from the poster, you cannot use these codes commercially, as they are licensed under CC-By-SA. But if your product is closed-source, it's not like anyone would know... – user12205 – 2014-04-30T21:25:45.080

1Perhaps a more informative title would draw attention to this? – TheDoctor – 2014-04-30T21:30:06.720

@TheDoctor Good point but do you have any suggestions? I am not good at titles. – None – 2014-04-30T21:37:34.390

How about "Optimize this algorithm for larger inputs"? I'm not really sure what it does. – TheDoctor – 2014-04-30T21:47:20.447

@TheDoctor Thank you. Is there any part of the code I can explain? – None – 2014-04-30T21:50:27.617

As a sanity check for claims that are large enough to be hard to check with another implementation, the first item should be 2^(2n) n! / ((n/2)!^2) – Peter Taylor – 2014-04-30T22:07:51.030

8If we keep doing this sort of challenge, I think we should at least attempt to solve a different problem to keep it interesting (not a variation on the same problem with extra specifications). – grovesNL – 2014-05-01T04:39:43.887

@grovesNL Yes this is the last challenge on this theme. However I would argue it's great to have a few of increasing difficulty so people can use the hard work they have put into previous challenges. – None – 2014-05-01T06:45:15.157

What is the purpose of m? Is the list S just 2n long, and leadingzerocounts just n+1 length? – qwr – 2014-05-01T20:45:59.813

@qwr Yes that is right. – None – 2014-05-01T21:05:15.700

@Lembik So are we calculating dot products? Because the lists are of unequal length – qwr – 2014-05-01T21:21:26.420

@qwr S is longer than F which is why the dot products are of the form F, S[i:i+n]. That is, F with a window of S of length n. – None – 2014-05-02T06:11:51.020

@ace Technically and practically, CC-BY-SA does allow commercial use. It just requires attribution and sharing of source code. CC-BY-NC-SA (NC=non-commercial) would exclude money making. – ojdo – 2014-05-02T12:30:25.747

8-core means you have 8 physical cores? Or is it 4-core with hyperthreading = 8 virtual cores? Or is it 8-core with hyperthreading for 16 virtual cores? – Claudiu – 2014-05-02T16:43:01.523

1@Claudiu It's this CPU: http://techreport.com/review/23750/amd-fx-8350-processor-reviewed – None – 2014-05-02T16:59:09.870

2@Claudiu his CPU has 8 physical cores, but the fetch/decode and FPU units are shared between cores. So when the bottleneck is in one of those areas, it behaves more like a quad-core. Abuse integer logic and avoid large code sizes and it's more like an 8-core. – Stefan – 2014-05-02T20:00:35.673

@Lembik Each additional bounty you grant will need to be at least twice the previous one (up until 500, of course). – primo – 2014-05-12T09:30:05.350

@primo :) I would run out of points too quickly. – None – 2014-05-12T12:47:00.840

@Lembik No, I mean, it's not a suggestion. Click the start a bounty button again. I will require you to give at least 200 rep, because you previously granted one for 100. – primo – 2014-05-15T04:43:55.620

Answers

20

Nimrod (N=22)

import math, locks

const
  N = 20
  M = N + 1
  FSize = (1 shl N)
  FMax = FSize - 1
  SStep = 1 shl (N-1)
  numThreads = 16

type
  ZeroCounter = array[0..M-1, int]
  ComputeThread = TThread[int]

var
  leadingZeros: ZeroCounter
  lock: TLock
  innerProductTable: array[0..FMax, int8]

proc initInnerProductTable =
  for i in 0..FMax:
    innerProductTable[i] = int8(countBits32(int32(i)) - N div 2)

initInnerProductTable()

proc zeroInnerProduct(i: int): bool =
  innerProductTable[i] == 0

proc search2(lz: var ZeroCounter, s, f, i: int) =
  if zeroInnerProduct(s xor f) and i < M:
    lz[i] += 1 shl (M - i - 1)
    search2(lz, (s shr 1) + 0, f, i+1)
    search2(lz, (s shr 1) + SStep, f, i+1)

when defined(gcc):
  const
    unrollDepth = 1
else:
  const
    unrollDepth = 4

template search(lz: var ZeroCounter, s, f, i: int) =
  when i < unrollDepth:
    if zeroInnerProduct(s xor f) and i < M:
      lz[i] += 1 shl (M - i - 1)
      search(lz, (s shr 1) + 0, f, i+1)
      search(lz, (s shr 1) + SStep, f, i+1)
  else:
    search2(lz, s, f, i)

proc worker(base: int) {.thread.} =
  var lz: ZeroCounter
  for f in countup(base, FMax div 2, numThreads):
    for s in 0..FMax:
      search(lz, s, f, 0)
  acquire(lock)
  for i in 0..M-1:
    leadingZeros[i] += lz[i]*2
  release(lock)

proc main =
  var threads: array[numThreads, ComputeThread]
  for i in 0 .. numThreads-1:
    createThread(threads[i], worker, i)
  for i in 0 .. numThreads-1:
    joinThread(threads[i])

initLock(lock)
main()
echo(@leadingZeros)

Compile with

nimrod cc --threads:on -d:release count.nim

(Nimrod can be downloaded here.)

This runs in the allotted time for n=20 (and for n=18 when only using a single thread, taking about 2 minutes in the latter case).

The algorithm uses a recursive search, pruning the search tree whenever a non-zero inner product is encountered. We also cut the search space in half by observing that for any pair of vectors (F, -F) we only need to consider one because the other produces the exact same sets of inner products (by negating S also).
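The pruned recursive search and the (F, -F) symmetry can be sketched in Python for a small n (my own reconstruction of the idea, not the Nimrod code itself), checked against the brute-force loop from the question:

```python
# Sketch (assumption: this mirrors the pruned search described above, using
# the bit encoding where a set bit means +1). Verified against brute force
# for n = 4.
from itertools import product

n = 4
m = n + 1

def brute_force():
    counts = [0] * m
    for S in product([-1, 1], repeat=n + m - 1):
        for F in product([-1, 1], repeat=n):
            i = 0
            while i < m and sum(f * s for f, s in zip(F, S[i:i + n])) == 0:
                counts[i] += 1
                i += 1
    return counts

def pruned():
    # Inner product is zero iff popcount(f ^ s) == n/2.
    counts = [0] * m

    def search(s, f, i):
        if i < m and bin(s ^ f).count('1') == n // 2:
            counts[i] += 1 << (m - i - 1)  # remaining bits of S are free
            search(s >> 1, f, i + 1)                      # incoming bit 0
            search((s >> 1) | (1 << (n - 1)), f, i + 1)   # incoming bit 1
    for f in range(1 << (n - 1)):   # only F with top bit clear ...
        for s in range(1 << n):
            search(s, f, 0)
    return [2 * c for c in counts]  # ... then double to account for -F

print(brute_force())
print(pruned())
```

Negating both F and S flips every bit in the encoding but leaves all the XORs, and hence all the inner products, unchanged, which is why halving the F range and doubling the counts is exact.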

The implementation uses Nimrod's metaprogramming facilities to unroll/inline the first few levels of the recursive search. This saves a little time when using gcc 4.8 and 4.9 as the backend of Nimrod and a fair amount for clang.

The search space could be further pruned by observing that we only need to consider values of S that differ in an even number of the first N positions from our choice of F. However, the complexity or memory needs of that do not scale for large values of N, given that the loop body is completely skipped in those cases.

Tabulating where the inner product is zero appears to be faster than using any bit counting functionality in the loop. Apparently accessing the table has pretty good locality.
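The tabulation trick rests on a small identity that can be checked in Python (a sketch of the idea, not the Nimrod code): for ±1 vectors encoded as N-bit integers with a set bit meaning +1, the inner product of F with a window S is N - 2·popcount(F ^ S), so a table indexed by F ^ S replaces per-iteration bit counting.

```python
# Sketch (assumption: this mirrors the innerProductTable idea). table[x] is
# popcount(x) - N/2, so the inner product equals -2 * table[f ^ s] and is
# zero exactly when the table entry is zero.
N = 8
table = [bin(x).count('1') - N // 2 for x in range(1 << N)]

def inner_product(f, s):
    # Decode to +/-1 vectors and multiply, for verification only.
    fv = [1 if f >> k & 1 else -1 for k in range(N)]
    sv = [1 if s >> k & 1 else -1 for k in range(N)]
    return sum(a * b for a, b in zip(fv, sv))

for f in (0b10110100, 0b00001111):
    for s in (0b11110000, 0b10101010):
        assert inner_product(f, s) == -2 * table[f ^ s]
```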

It looks as though the problem should be amenable to dynamic programming, considering how the recursive search works, but there is no apparent way to do that with a reasonable amount of memory.

Example outputs:

N=16:

@[55276229099520, 10855179878400, 2137070108672, 420578918400, 83074121728, 16540581888, 3394347008, 739659776, 183838720, 57447424, 23398912, 10749184, 5223040, 2584896, 1291424, 645200, 322600]

N=18:

@[3341140958904320, 619683355033600, 115151552380928, 21392898654208, 3982886961152, 744128512000, 141108051968, 27588886528, 5800263680, 1408761856, 438001664, 174358528, 78848000, 38050816, 18762752, 9346816, 4666496, 2333248, 1166624]

N=20:

@[203141370301382656, 35792910586740736, 6316057966936064, 1114358247587840, 196906665902080, 34848574013440, 6211866460160, 1125329141760, 213330821120, 44175523840, 11014471680, 3520839680, 1431592960, 655872000, 317675520, 156820480, 78077440, 39005440, 19501440, 9750080, 4875040]

For purposes of comparing the algorithm with other implementations, N=16 takes about 7.9 seconds on my machine when using a single thread and 2.3 seconds when using four cores.

N=22 takes about 15 minutes on a 64-core machine with gcc 4.4.6 as Nimrod's backend and overflows 64-bit integers in leadingZeros[0] (possibly not unsigned ones, haven't looked at it).


Update: I've found room for a couple more improvements. First, for a given value of F, we can enumerate the first 16 entries of the corresponding S vectors precisely, because they must differ in exactly N/2 places. So we precompute a list of bit vectors of size N that have N/2 bits set and use these to derive the initial part of S from F.
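The precomputation described above can be sketched in Python (my own illustration, not the answer's code): enumerate the N-bit vectors with exactly N/2 bits set, then XOR them with F to get every valid initial window of S.

```python
# Sketch (assumption): precomputing the balanced difference masks used to
# derive the initial part of S from F, as described above.
from itertools import combinations
from math import comb

N = 8

def balanced_vectors(N):
    # All N-bit values with exactly N/2 bits set.
    out = []
    for positions in combinations(range(N), N // 2):
        v = 0
        for p in positions:
            v |= 1 << p
        out.append(v)
    return out

masks = balanced_vectors(N)
assert len(masks) == comb(N, N // 2)   # C(8, 4) = 70

# For any F, the windows with zero inner product are exactly {F ^ d}:
F = 0b10011010
windows = [F ^ d for d in masks]
assert all(bin(F ^ w).count('1') == N // 2 for w in windows)
```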

Second, we can improve upon the recursive search by observing that we always know the value of F[N] (as the MSB is zero in the bit representation). This allows us to predict precisely which branch we recurse into from the inner product. While that would actually allow us to turn the entire search into a recursive loop, that actually happens to screw up branch prediction quite a bit, so we keep the top levels in its original form. We still save some time, primarily by reducing the amount of branching we are doing.

For some cleanup, the code now uses unsigned integers and fixes them at 64 bits (just in case someone wants to run this on a 32-bit architecture).

Overall speedup is between a factor of x3 and x4. N = 22 still needs more than eight cores to run in under 10 minutes, but on a 64-core machine it's now down to about four minutes (with numThreads bumped up accordingly). I don't think there's much more room for improvement without a different algorithm, though.

N=22:

@[12410090985684467712, 2087229562810269696, 351473149499408384, 59178309967151104, 9975110458933248, 1682628717576192, 284866824372224, 48558946385920, 8416739196928, 1518499004416, 301448822784, 71620493312, 22100246528, 8676573184, 3897278464, 1860960256, 911646720, 451520512, 224785920, 112198656, 56062720, 28031360, 14015680]

Updated again, making use of further possible reductions in search space. Runs in about 9:49 minutes for N=22 on my quadcore machine.

Final update (I think). Better equivalence classes for choices of F, cutting runtime for N=22 down to 57 seconds (edit: I had accidentally run the earlier 3:19 timing with only one thread) on my machine.

This change makes use of the fact that a pair of vectors produces the same leading zeros if one can be transformed into the other by rotating it. Unfortunately, a fairly critical low-level optimization requires that the top bit of F in the bit representation is always the same, and while using this equivalence slashed the search space quite a bit and reduced runtime by about one quarter over using a different state space reduction on F, the overhead from eliminating the low-level optimization more than offset it. However, it turns out that this problem can be eliminated by also considering the fact that F that are inverses of one another are also equivalent. While this added to the complexity of the calculation of the equivalence classes a bit, it also allowed me to retain the aforementioned low-level optimization, leading to a speedup of about x3.

One more update to support 128-bit integers for the accumulated data. To compile with 128 bit integers, you'll need longint.nim from here and to compile with -d:use128bit. N=24 still takes more than 10 minutes, but I've included the result below for those interested.

N=24:

@[761152247121980686336, 122682715414070296576, 19793870419291799552, 3193295704340561920, 515628872377565184, 83289931274780672, 13484616786640896, 2191103969198080, 359662314586112, 60521536552960, 10893677035520, 2293940617216, 631498735616, 230983794688, 102068682752, 48748969984, 23993655296, 11932487680, 5955725312, 2975736832, 1487591936, 743737600, 371864192, 185931328, 92965664]

import math, locks, unsigned

when defined(use128bit):
  import longint
else:
  type int128 = uint64 # Fallback on unsupported architectures
  template toInt128(x: expr): expr = uint64(x)

const
  N = 22
  M = N + 1
  FSize = (1 shl N)
  FMax = FSize - 1
  SStep = 1 shl (N-1)
  numThreads = 16

type
  ZeroCounter = array[0..M-1, uint64]
  ZeroCounterLong = array[0..M-1, int128]
  ComputeThread = TThread[int]
  Pair = tuple[value, weight: int32]

var
  leadingZeros: ZeroCounterLong
  lock: TLock
  innerProductTable: array[0..FMax, int8]
  zeroInnerProductList = newSeq[int32]()
  equiv: array[0..FMax, int32]
  fTable = newSeq[Pair]()

proc initInnerProductTables =
  for i in 0..FMax:
    innerProductTable[i] = int8(countBits32(int32(i)) - N div 2)
    if innerProductTable[i] == 0:
      if (i and 1) == 0:
        add(zeroInnerProductList, int32(i))

initInnerProductTables()

proc ror1(x: int): int {.inline.} =
  ((x shr 1) or (x shl (N-1))) and FMax

proc initEquivClasses =
  add(fTable, (0'i32, 1'i32))
  for i in 1..FMax:
    var r = i
    var found = false
    block loop:
      for j in 0..N-1:
        for m in [0, FMax]:
          if equiv[r xor m] != 0:
            fTable[equiv[r xor m]-1].weight += 1
            found = true
            break loop
        r = ror1(r)
    if not found:
      equiv[i] = int32(len(fTable)+1)
      add(fTable, (int32(i), 1'i32))

initEquivClasses()

when defined(gcc):
  const unrollDepth = 4
else:
  const unrollDepth = 4

proc search2(lz: var ZeroCounter, s0, f, w: int) =
  var s = s0
  for i in unrollDepth..M-1:
    lz[i] = lz[i] + uint64(w)
    s = s shr 1
    case innerProductTable[s xor f]
    of 0:
      # s = s + 0
    of -1:
      s = s + SStep
    else:
      return

template search(lz: var ZeroCounter, s, f, w, i: int) =
  when i < unrollDepth:
    lz[i] = lz[i] + uint64(w)
    if i < M-1:
      let s2 = s shr 1
      case innerProductTable[s2 xor f]
      of 0:
        search(lz, s2 + 0, f, w, i+1)
      of -1:
        search(lz, s2 + SStep, f, w, i+1)
      else:
        discard
  else:
    search2(lz, s, f, w)

proc worker(base: int) {.thread.} =
  var lz: ZeroCounter
  for fi in countup(base, len(fTable)-1, numThreads):
    let (fp, w) = fTable[fi]
    let f = if (fp and (FSize div 2)) == 0: fp else: fp xor FMax
    for sp in zeroInnerProductList:
      let s = f xor sp
      search(lz, s, f, w, 0)
  acquire(lock)
  for i in 0..M-1:
    let t = lz[i].toInt128 shl (M-i).toInt128
    leadingZeros[i] = leadingZeros[i] + t
  release(lock)

proc main =
  var threads: array[numThreads, ComputeThread]
  for i in 0 .. numThreads-1:
    createThread(threads[i], worker, i)
  for i in 0 .. numThreads-1:
    joinThread(threads[i])

initLock(lock)
main()
echo(@leadingZeros)

Reimer Behrends

Posted 2014-04-30T20:47:41.673

Reputation: 899

The result with N=22 is 12410090985684467712 which takes 63.42 bits and thus fits into an unsigned 64-bit. – Stefan – 2014-05-03T00:21:38.020

2You have definitely raised the bar very impressively. – None – 2014-05-03T08:37:09.980

1Good to see somebody using Nimrod. :) – cjfaure – 2014-05-03T11:59:36.843

@Stefan Maybe your coding wizardry could get this method below 10 minutes for N=22? – None – 2014-05-04T20:30:56.700

I tried N= 22 which did terminate after a few hours. However it gives me [-6036653088025083904, 2087229562810269696, 351473149499408384, 59178309967151104, 9975110458933248, 1682628717576192, 284866824372224, 48558946385920, 8416739196928, 1518499004416, 301448822784, 71620493312, 22100246528, 8676573184, 3897278464, 1860960256, 911646720, 451520512, 224785920, 112198656, 56062720, 28031360, 14015680] which seems to be an overflow error. I know no nimrod but is it possible to use unsigned ints to solve this? – None – 2014-05-05T07:22:36.927

@PeterTaylor I am on a 64 bit system which might help. Are you really on a 32 bit system? – None – 2014-05-05T10:26:38.987

@PeterTaylor I just checked and I think ReimerBehrends' answers for n = 18 are right. – None – 2014-05-05T12:20:32.120

@Lembik, yes, I found another place where I was using 32-bit integers not expecting them to overflow. – Peter Taylor – 2014-05-05T12:52:34.543

N=22 is only 29 minutes on my machine with this new code. – None – 2014-05-05T18:14:59.157

It turns out that I can cut the search space by a factor of 4 with very little coding effort and get a 4x speedup in the process. It's a bit tricky to prove that it's still correct, though. Basically, both the numbers of F and S vectors you have to consider can be cut in half. That's easy to show for S, but the F case is non-trivial (though the results are obviously correct). – Reimer Behrends – 2014-05-06T15:25:35.193

@ReimerBehrends The same symmetries I exploit. If you've found a vector pair S, F with S[0]=1, S[1]=1, and F[0]=1, you've actually found 3 more (S[0]=-1, S[1]=1, F[0]=-1), (S[0]=1, S[1]=-1, F[0]=-1), (S[0]=-1, S[1]=-1, F[0]=1), and the inverses of all of these for a total of 8. Because each of these has a -1 in at least one of these three positions, the search space can be limited so that these are never encountered, and instead counting each vector pair found as 8. – primo – 2014-05-07T02:36:31.790

n = 22 in 6:48! – None – 2014-05-07T08:50:52.940

I assume you mean numThreads = 16 ? – None – 2014-05-07T17:57:20.070

It seems to take 27 seconds?! Do I need to change anything apart from N to try N = 24 in terms of avoiding overflows? – None – 2014-05-07T17:59:43.803

Yeah, I had accidentally run it with only 1 thread to see how fast it was, so it's actually a bit faster than the 3:19 minutes I had noted. For N>=24, I have to rewrite the code a bit to avoid overflow when accumulating the result. I'll look at that. – Reimer Behrends – 2014-05-07T18:42:57.300

@Lembik, now you see why I tried to avoid explaining my method in much detail. :P – Peter Taylor – 2014-05-07T22:26:49.733

@PeterTaylor: Huh? – Reimer Behrends – 2014-05-07T23:11:14.430

Your new equivalence classes are the same as mine. – Peter Taylor – 2014-05-08T07:03:31.780

You have raised the bar to N = 24. Well done! (See new timings in question.) – None – 2014-05-08T09:00:32.863

@PeterTaylor: I'll take your word for it, but I didn't look at your answer and I didn't have to. Tip: Once you print out the F's as bit sequences, grouping together those values that generate the same sets of leading zeros (for small N, i.e. <= 10), there is a rather obvious pattern. – Reimer Behrends – 2014-05-08T20:21:47.623

The smiley was to indicate that I wasn't being entirely serious. – Peter Taylor – 2014-05-08T22:37:58.663

11

Java (n=22?)

I think most of the answers which do better than n=16 use a similar approach to this, although they differ in the symmetries they exploit and the way they divide the task between threads.

The vectors defined in the question can be replaced with bit strings, and the inner product with XORing the overlapping window and checking that there are exactly n/2 bits set (and hence n/2 bits cleared). There are n! / ((n/2)!)^2 (the central binomial coefficient) strings of n bits with n/2 bits set (which I call balanced strings), so for any given F there are that many windows of S which give a zero inner product. Moreover, the action of sliding S along one and checking whether we can still find an incoming bit which gives a zero inner product corresponds to looking for an edge in a graph whose nodes are the windows and whose edges link a node u to a node v whose first n-1 bits are the last n-1 bits of u.

For example, with n=6 and F=001001 we get this graph:

Graph for F=001001

and for F=001011 we get this graph:

Graph for F=001011

Then we need to count for each i from 0 to n how many paths of length i there are, summing over the graphs for every F. I think most of us are using depth-first search.

Note that the graphs are sparse: it's easy to prove that each node has in-degree at most 1 and out-degree at most 1. That also means that the only structures possible are simple chains and simple loops. This simplifies the DFS a bit.
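The degree bound can be verified exhaustively for a small n (my own check, assuming the graph construction described above): the two candidate successors of a window differ in exactly one bit, so at most one of them can be balanced, and likewise for predecessors.

```python
# Sketch verifying the sparsity claim: nodes are the balanced windows g
# (popcount(f ^ g) == n/2); edges shift the window by one bit. Every node
# has in-degree <= 1 and out-degree <= 1.
def check_degrees(n):
    mask = (1 << n) - 1
    half = n // 2
    for f in range(1 << n):
        nodes = {g for g in range(1 << n) if bin(f ^ g).count('1') == half}
        for u in nodes:
            succs = [v for v in (((u << 1) | b) & mask for b in (0, 1))
                     if v in nodes]
            preds = [v for v in ((u >> 1) | (b << (n - 1)) for b in (0, 1))
                     if v in nodes]
            if len(succs) > 1 or len(preds) > 1:
                return False
    return True

assert check_degrees(6)
```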

I exploit a couple of symmetries: the balanced strings are closed under bit inverse (the ~ operation in many languages from the ALGOL family), and under bit rotation, so we can group together values of F which are related by these operations and only do the DFS once.

public class CodeGolf26459v8D implements Runnable {
    private static final int NUM_THREADS = 8;

    public static void main(String[] args) {
        v8D(22);
    }

    private static void v8D(int n) {
        int[] bk = new int[1 << n];
        int off = 0;
        for (int i = 0; i < bk.length; i++) {
            bk[i] = Integer.bitCount(i) == n/2 ? off++ : -1;
        }

        int[] fwd = new int[off];
        for (int i = 0; i < bk.length; i++) {
            if (bk[i] >= 0) fwd[bk[i]] = i;
        }

        CodeGolf26459v8D[] runners = new CodeGolf26459v8D[NUM_THREADS];
        Thread[] threads = new Thread[runners.length];
        for (int i = 0; i < runners.length; i++) {
            runners[i] = new CodeGolf26459v8D(n, i, runners.length, bk, fwd);
            threads[i] = new Thread(runners[i]);
            threads[i].start();
        }

        try {
            for (int i = 0; i < threads.length; i++) threads[i].join();
        }
        catch (InterruptedException ie) {
            throw new RuntimeException("This shouldn't be reachable", ie);
        }

        long surviving = ((long)fwd.length) << (n - 1);
        for (int i = 0; i <= n; i++) {
            for (CodeGolf26459v8D runner : runners) surviving -= runner.survival[i];
            System.out.print(i == 0 ? "[" : ", ");
            java.math.BigInteger result = new java.math.BigInteger(Long.toString(surviving));
            System.out.print(result.shiftLeft(n + 1 - i));
        }
        System.out.println("]");
    }

    public final int n;
    protected final int id;
    protected final int numRunners;
    private final int[] bk;
    private final int[] fwd;

    public long[] survival;

    public CodeGolf26459v8D(int n, int id, int numRunners, int[] bk, int[] fwd) {
        this.n = n;
        this.id = id;
        this.numRunners = numRunners;

        this.bk = bk;
        this.fwd = fwd;
    }

    private int dfs2(int[] graphShape, int flip, int i) {
        if (graphShape[i] != 0) return graphShape[i];

        int succ = flip ^ (fwd[i] << 1);
        if (succ >= bk.length) succ ^= bk.length + 1;

        int j = bk[succ];
        if (j == -1) return graphShape[i] = 1;

        graphShape[i] = n + 1; // To detect cycles
        return graphShape[i] = dfs2(graphShape, flip, j) + 1;
    }

    @Override
    public void run() {
        int n = this.n;
        int[] bk = this.bk;
        int[] fwd = this.fwd;

        // NB The initial count is approx 2^(2n - 1.33 - 0.5 lg n)
        // For n=18 we overflow 32-bit
        // 64-bit is good up to n=32.
        long[] survival = new long[n + 1];
        boolean[] visited = new boolean[1 << (n - 1)];
        int th = 0;
        for (int f = 0; f < visited.length; f++) {
            if (visited[f]) continue;

            int m = 1, g = f;
            while (true) {
                visited[g] = true;
                int ng = g << 1;
                if ((ng >> (n - 1)) != 0) ng ^= (1 << n) - 1;
                if (ng == f) break;
                m++;
                g = ng;
            }

            if (th++ % numRunners != id) continue;

            int[] graphShape = new int[fwd.length];
            int flip = (f << 1) ^ f;
            for (int i = 0; i < graphShape.length; i++) {
                int life = dfs2(graphShape, flip, i);
                if (life <= n) survival[life] += m;
            }
        }

        this.survival = survival;
    }
}

On my 2.5GHz Core 2 I get

# n=18
$ javac CodeGolf26459v8D.java && time java CodeGolf26459v8D
[3341140958904320, 619683355033600, 115151552380928, 21392898654208, 3982886961152, 744128512000, 141108051968, 27588886528, 5800263680, 1408761856, 438001664, 174358528, 78848000, 38050816, 18762752, 9346816, 4666496, 2333248, 1166624]

real    0m3.131s
user    0m10.133s
sys     0m0.380s

# n=20
$ javac CodeGolf26459v8D.java && time java CodeGolf26459v8D
[203141370301382656, 35792910586740736, 6316057966936064, 1114358247587840, 196906665902080, 34848574013440, 6211866460160, 1125329141760, 213330821120, 44175523840, 11014471680, 3520839680, 1431592960, 655872000, 317675520, 156820480, 78077440, 39005440, 19501440, 9750080, 4875040]

real    1m8.706s
user    4m20.980s
sys     0m0.564s

# n=22
$ javac CodeGolf26459v8D.java && time java CodeGolf26459v8D
[12410090985684467712, 2087229562810269696, 351473149499408384, 59178309967151104, 9975110458933248, 1682628717576192, 284866824372224, 48558946385920, 8416739196928, 1518499004416, 301448822784, 71620493312, 22100246528, 8676573184, 3897278464, 1860960256, 911646720, 451520512, 224785920, 112198656, 56062720, 28031360, 14015680]

real    20m10.654s
user    76m53.880s
sys     0m6.852s

Since Lembik's computer has 8 cores and executed my earlier single-threaded program twice as fast as mine, I'm optimistic that it will execute n=22 in less than 8 minutes.

Peter Taylor

Posted 2014-04-30T20:47:41.673

Reputation: 41 901

7:17! Very nice. Would you mind explaining your method a little more? – None – 2014-05-06T09:08:58.807

6

C

It is basically just a slightly optimised implementation of the algorithm in the question. It can manage n=12 within the time limit.

#include <stdio.h>
#include <inttypes.h>

#define n 12
#define m (n + 1)

int main() {
    int i;
    uint64_t S, F, o[m] = {0};
    for (S = 0; S < (1LLU << (n + m - 1)); S += 2)
        for (F = 0; F < (1 << (n - 1)); F++)
            for (i = 0; i < m; i++)
                if (__builtin_popcount(((S >> i) & ((1 << n) - 1)) ^ F) == n >> 1)
                    o[i] += 4;
                else
                    break;
    for (i = 0; i < m; i++)
        printf("%" PRIu64 " ", o[i]);
    return 0;
}

Test run for n=12, including compilation:

$ clang -O3 -march=native -fstrict-aliasing -ftree-vectorize -Wall fast.c
$ time ./a.out 
15502147584 3497066496 792854528 179535872 41181184 9826304 2603008 883712 381952 177920 85504 42560 21280 
real    0m53.266s
user    0m53.042s
sys     0m0.068s
$

Comment: I just turned my brain on and used some simple combinatorics to calculate that the first value will always be n! / ((n / 2)!)^2 * 2^(n + m - 1). It seems to me that there must be a completely algebraic solution to this problem.
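The closed form is easy to sanity-check in Python against the outputs quoted in this thread (a quick verification of my own, not part of the original answer):

```python
# Check the closed-form first value n! / ((n/2)!)^2 * 2^(n + m - 1), with
# m = n + 1, against the sample outputs quoted in the question.
from math import factorial

def first_value(n):
    m = n + 1
    return factorial(n) // factorial(n // 2) ** 2 * 2 ** (n + m - 1)

assert first_value(8) == 4587520                   # n = 8 output in the question
assert first_value(22) == 12410090985684467712     # n = 22 check value
```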

Fors

Posted 2014-04-30T20:47:41.673

Reputation: 3 020

I get lots of warnings when I compile this. Try gcc -Wall -Wextra Fors.c -o Fors – None – 2014-04-30T21:46:56.570

There were a couple of unused variables forgotten from an earlier iteration, but I removed them so at least a couple of warnings should have disappeared.

I don't have GCC available at the moment (only Clang), and Clang doesn't give me any warnings at the moment (after removing the unused variables). And as Clang usually is more strict when it comes to warnings I must say I'm a bit surprised that you got any warnings at all. – Fors – 2014-04-30T21:53:41.153

It complains about Fors.c:13:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses] (twice) and also warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘uint64_t’ [-Wformat=] . In fact clang complains about the printf statement too for me, – None – 2014-04-30T21:55:26.190

With the latest changes GCC shouldn't throw any warning messages. – Fors – 2014-04-30T22:03:14.783

It still complains about Fors.c:13:49: warning: suggest parentheses around arithmetic in operand of ‘^’ [-Wparentheses] But in worse news... it takes longer than 10 minutes on my machine. – None – 2014-04-30T22:06:03.943

I guess GCC doesn't trust that the programmer knows the order of precedence in C.

What flags did you compile it with? I compiled it with -O3, which should be enough. With no optimisation flags at all I understand that it would take more than ten minutes. – Fors – 2014-04-30T22:08:25.270

gcc -Wall -Wextra -O3 Fors.c -o Fors . It also takes more than 10 minutes when compiled with clang -O3 Fors.c -o Forsclang – None – 2014-04-30T22:09:04.100

There's an easy speed-up by 4 exploiting symmetry. – Peter Taylor – 2014-04-30T22:16:03.733

You may try gcc -O3 -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize -Wall Fors.c -o Fors, but I don't know whether it will work. – user12205 – 2014-04-30T22:16:49.093

I just reran it and it took just over 5 minutes and 50 seconds. My LLVM version is 5.1, the OS is OS X and the CPU is a 2.5 GHz Intel Core i5 processor. I see no reason to why it runs so much slower on your machine, except potentially an older compiler.

I think I see the speed-up Peter Taylor. – Fors – 2014-04-30T22:25:13.920

Thanks. It is much faster with your gcc command line it turns out than using clang -O3 on my system. I was using 3.4-1ubuntu3 (tags/RELEASE_34/final) (based on LLVM 3.4) before. – None – 2014-05-01T12:41:36.453

5

Java, n=16

For any given value of F there are \binom{n}{n/2} vectors which have a zero inner product with it. So we can build a graph whose vertices are those matching vectors and whose edges correspond to the shifting of S, and then we just need to count paths of length up to n in the graph.

I haven't tried microoptimising this by replacing conditionals with bitwise operations, but each double-increment of n increases running time about 16-fold, so that's not going to make enough of a difference unless I'm pretty close to the threshold. On my machine, I'm not.

public class CodeGolf26459 {

    public static void main(String[] args) {
        v3(16);
    }

    // Order of 2^(2n-1) * n ops
    private static void v3(int n) {
        long[] counts = new long[n+1];
        int mask = (1 << n) - 1;
        for (int f = 0; f < (1 << (n-1)); f++) {
            // Find adjacencies
            long[] subcounts = new long[1 << n];
            for (int g = 0; g < (1 << n); g++) {
                subcounts[g] = Integer.bitCount(f ^ g) == n/2 ? 2 : -1;
            }

            for (int round = 0; round <= n; round++) {
                long count = 0;
                // Extend one bit.
                long[] next = new long[1 << n];
                for (int i = 0; i < (1 << n); i++) {
                    long s = subcounts[i];
                    if (s == -1) next[i] = -1;
                    else {
                        count += s;
                        int j = (i << 1) & mask;
                        if (subcounts[j] >= 0) next[j] += s;
                        if (subcounts[j + 1] >= 0) next[j + 1] += s;
                    }
                }
                counts[round] += count << (n - round);
                subcounts = next;
            }
        }

        System.out.print("[");
        for (long count : counts) System.out.print(count+", ");
        System.out.println("]");
    }
}

On my 2.5GHz Core 2 I get

$ javac CodeGolf26459.java && time java -server CodeGolf26459 
[55276229099520, 10855179878400, 2137070108672, 420578918400, 83074121728, 16540581888, 3394347008, 739659776, 183838720, 57447424, 23398912, 10749184, 5223040, 2584896, 1291424, 645200, 322600, ]

real    6m2.663s
user    6m4.631s
sys     0m1.580s

Peter Taylor

Posted 2014-04-30T20:47:41.673

Reputation: 41 901

Piggybacking since I don't want to implement my own solution right now. Each vertex has at most one successor, so you don't really need the array. To iterate efficiently over combinations of f and starting vertices, iterate over all f_xor_g with exactly n/2 set bits. For each of these, iterate over all f and take g = f ^ f_xor_g. – David Eisenstat – 2014-05-02T22:30:57.307
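One standard way to "iterate over all f_xor_g with exactly n/2 set bits" is Gosper's hack, which steps from one mask to the next larger integer with the same popcount (my sketch; David Eisenstat may have a different enumeration in mind).

```python
# Gosper's hack: enumerate all n-bit masks with exactly k set bits
# in increasing order.
def same_popcount_masks(n, k):
    v = (1 << k) - 1                    # smallest mask with k bits set
    limit = 1 << n
    while v < limit:
        yield v
        s = v & -v                      # lowest set bit
        r = v + s                       # ripple the low block of ones
        v = r | (((v ^ r) >> 2) // s)   # re-append the shifted-down ones

masks = list(same_popcount_masks(4, 2))
assert masks == [0b0011, 0b0101, 0b0110, 0b1001, 0b1010, 0b1100]
```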

@David, I know, and my version 7 does n=18 in one minute on my Atom netbook, but I can't post it until I get back from holiday. – Peter Taylor – 2014-05-02T22:36:04.727

4

RPython, N=22 ~3:23

Multi-threaded, using a stackless recursive descent. The program accepts two command line arguments: N, and the number of worker threads.
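The thread layout used below can be sketched in plain Python (a simplified illustration of mine, not the RPython code itself): worker t handles items t, t+T, t+2T, ..., accumulates into a private tally, and merges it into the shared result under a lock only once at the end.

```python
# Strided work partitioning: each of T threads processes every T-th
# item, keeping a thread-private accumulator to avoid lock contention.
import threading

def parallel_sum_of_squares(items, num_threads=2):
    totals = [0]
    lock = threading.Lock()

    def worker(start):
        local = 0                       # thread-private accumulator
        for i in range(start, len(items), num_threads):
            local += items[i] ** 2
        with lock:                      # merge once, at the end
            totals[0] += local

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals[0]

assert parallel_sum_of_squares(list(range(10)), 4) == 285
```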

from time import sleep

from rpython.rlib.rthread import start_new_thread, allocate_lock
from rpython.rlib.rarithmetic import r_int64, build_int, widen
from rpython.rlib.rbigint import rbigint

r_int8 = build_int('r_char', True, 8)

class ThreadEnv:
  __slots__ = ['n', 'counts', 'num_threads',
               'v_range', 'v_num', 'running', 'lock']

  def __init__(self):
    self.n = 0
    self.counts = [rbigint.fromint(0)]
    self.num_threads = 0
    self.v_range = [0]
    self.v_num = 0
    self.running = 0
    self.lock = None

env = ThreadEnv()

bt_bits = 12
bt_mask = (1<<bt_bits)-1
# computed compile time
bit_table = [r_int8(0)]
for i in xrange(1,1<<bt_bits):
  bit_table += [((i&1)<<1) + bit_table[i>>1]]

def main(argv):
  argc = len(argv)
  if argc < 2 or argc > 3:
    print 'Usage: %s N [NUM_THREADS=2]'%argv[0]
    return 1

  if argc == 3:
    env.num_threads = int(argv[2])
  else:
    env.num_threads = 2

  env.n = int(argv[1])
  env.counts = [rbigint.fromint(0)]*env.n
  env.lock = allocate_lock()

  v_range = []
  v_max = 1<<(env.n-1)
  v_num = 0
  v = (1<<(env.n>>1))-1
  while v < v_max:
    v_num += 1
    v_range += [v]
    if v&1:
      # special case odd v
      s = (v+1)&-v
      v ^= s|(s>>1)
    else:
      s = v&-v
      r = v+s
      # s is at least 2, skip two iterations
      i = 3
      s >>= 2
      while s:
        i += 1
        s >>= 1
      v = r|((v^r)>>i)
  env.v_range = v_range
  env.v_num = v_num

  for i in xrange(env.num_threads-1):
    start_new_thread(run,())

  # use the main process as a worker
  run()

  # wait for any laggers
  while env.running:
    sleep(0.05)

  result = []
  for i in range(env.n):
    result += [env.counts[i].lshift(env.n-i+3).str()]
  result += [env.counts[env.n-1].lshift(3).str()]
  print result
  return 0

def run():
  with env.lock:
    v_start = env.running
    env.running += 1

  n = env.n
  counts = [r_int64(0)]*n
  mask = (1<<n)-1
  v_range = env.v_range
  v_num = env.v_num
  z_count = 1<<(n-2)

  for i in xrange(v_start, v_num, env.num_threads):
    v = v_range[i]
    counts[0] += z_count
    counts[1] += v_num
    r = v^(v<<1)
    for w in v_range:
      # unroll counts[2] for speed
      # ideally, we could loop over x directly,
      # rather than over all v, only to throw the majority away
      # there's a 2x-3x speed improvement to be had here...
      x = w^r
      if widen(bit_table[x>>bt_bits]) + widen(bit_table[x&bt_mask]) == n:
        counts[2] += 1
        x, y = v, x
        o, k = 2, 3
        while k < n:
          # x = F ^ S
          # y = F ^ (S<<1)
          o = k
          z = (((x^y)<<1)^y)&mask
          # z is now F ^ (S<<2), possibly xor 1
          # what S and F actually are is of no consequence

          # the compiler hint `widen` lets the translator know
          # to store the result as a native int, rather than a signed char
          bt_high = widen(bit_table[z>>bt_bits])
          if bt_high + widen(bit_table[z&bt_mask]) == n:
            counts[k] += 1
            x, y = y, z
            k += 1
          elif bt_high + widen(bit_table[(z^1)&bt_mask]) == n:
            counts[k] += 1
            x, y = y, z^1
            k += 1
          else: k = n

  with env.lock:
    for i in xrange(n):
      env.counts[i] = env.counts[i].add(rbigint.fromrarith_int(counts[i]))
    env.running -= 1

def target(*args):
  return main, None

To Compile

Make a local clone of the PyPy repository using mercurial, git, or whatever you prefer. Type the following incantation (assuming the above script is named convolution-high.py):

$ pypy %PYPY_REPO%/rpython/bin/rpython --thread convolution-high.py

where %PYPY_REPO% represents an environment variable pointing to the repository you just cloned. Compilation takes about one minute.


Sample Timings

N=16, 4 threads:

$ timeit convolution-high-c 16 4
[55276229099520, 10855179878400, 2137070108672, 420578918400, 83074121728, 16540581888, 3394347008, 739659776, 183838720, 57447424, 23398912, 10749184, 5223040, 2584896, 1291424, 645200, 322600]
Elapsed Time:     0:00:00.109
Process Time:     0:00:00.390

N=18, 4 threads:

$ timeit convolution-high-c 18 4
[3341140958904320, 619683355033600, 115151552380928, 21392898654208, 3982886961152, 744128512000, 141108051968, 27588886528, 5800263680, 1408761856, 438001664, 174358528, 78848000, 38050816, 18762752, 9346816, 4666496, 2333248, 1166624]
Elapsed Time:     0:00:01.250
Process Time:     0:00:04.937

N=20, 4 threads:

$ timeit convolution-high-c 20 4
[203141370301382656, 35792910586740736, 6316057966936064, 1114358247587840, 196906665902080, 34848574013440, 6211866460160, 1125329141760, 213330821120, 44175523840, 11014471680, 3520839680, 1431592960, 655872000, 317675520, 156820480, 78077440, 39005440, 19501440, 9750080, 4875040]
Elapsed Time:     0:00:15.531
Process Time:     0:01:01.328

N=22, 4 threads:

$ timeit convolution-high-c 22 4
[12410090985684467712, 2087229562810269696, 351473149499408384, 59178309967151104, 9975110458933248, 1682628717576192, 284866824372224, 48558946385920, 8416739196928, 1518499004416, 301448822784, 71620493312, 22100246528, 8676573184, 3897278464, 1860960256, 911646720, 451520512, 224785920, 112198656, 56062720, 28031360, 14015680]
Elapsed Time:     0:03:23.156
Process Time:     0:13:25.437

primo

Posted 2014-04-30T20:47:41.673

Reputation: 30 891

9:26. Welcome to the 22 crew :) – None – 2014-05-06T09:48:34.140

I am not sure why but your new version is no faster for me. Still about 9:30 when I do time ./primo-c 22 8 . – None – 2014-05-06T18:52:31.930

@Lembik that would make sense if division were about as fast on average as 3 right shifts (3=Sum{(n+1)/(2^n)},n=1..infty). For certian architectures, I suppose that might be the case, but on mine division is noticably slower. Thanks for taking the time to test it :) – primo – 2014-05-07T01:56:12.227

3

Python 3.3, N = 20, 3.5min

Disclaimer: my intention is NOT to post this as my own answer, since the algorithm I'm using is only a shameless port from primo's RPython solution. My purpose here is only to show what you can do in Python if you combine the magic of Numpy and Numba modules.

Numba explained in short:

Numba is a just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). http://numba.pydata.org/

Update 1: I noticed after tossing the numbers around that we can simply skip some of the numbers completely. So now maxf becomes (1 << n) // 2 and maxs becomes maxf**2. This will speed up the process quite a bit. n = 16 now takes only ~48s (down from 4.5 min). I also have another idea which I'm going to try and see if I can make it go a little faster.

Update 2: Changed algorithm (primo's solution). While my port does not yet support multithreading, it is pretty trivial to add. It is even possible to release the CPython GIL using Numba and ctypes. This solution, however, runs very fast on a single core too!

import numpy as np
import numba as nb

bt_bits = 11
bt_mask = (1 << bt_bits) - 1
bit_table = np.zeros(1 << bt_bits, np.int32)

for i in range(0, 1 << bt_bits):
    bit_table[i] = ((i & 1) << 1) + bit_table[i >> 1]

@nb.njit("void(int32, int32, int32, int32, int64[:], int64[:])")
def run(n, m, start, re, counts, result):
    mask = (1 << n) - 1

    v_max = (1 << n) // 2
    rr = v_max // 2

    v = (1 << (n >> 1)) - 1
    while v < v_max:
        s = start

        while s < rr:
            f = v ^ s
            counts[0] += 8
            t = s << 1
            o, j = 0, 1

            while o < j and j < m:
                o = j
                w = (t ^ f) & mask
                bt_high = bit_table[w >> bt_bits]

                if bt_high + bit_table[w & bt_mask] == n:
                    counts[j] += 8
                    t <<= 1
                    j += 1
                elif bt_high + bit_table[(w ^ 1) & bt_mask] == n:
                    counts[j] += 8
                    t = (t | 1) << 1
                    j += 1

            s += re

        # advance v to the next value with the same number of set bits
        s = v & -v
        r = v + s
        o = v ^ r
        o = (o >> 2) // s
        v = r | o

    for e in range(m):
        result[e] += counts[e] << (n - e)

And finally:

if __name__ == "__main__":
    n = 20
    m = n + 1

    result = np.zeros(m, np.int64)
    counts = np.zeros(m, np.int64)

    s1 = time.time() * 1000
    run(n, m, 0, 1, counts, result)
    s2 = time.time() * 1000

    print(result)
    print("{0}ms".format(s2 - s1))

This runs on my machine in 212688ms or ~3.5min.

Anna Jokela

Posted 2014-04-30T20:47:41.673

Reputation: 79

Thanks. Now how about n = 18 ? :) – None – 2014-05-02T15:42:32.707

It has been almost 20 min since I started the program with n = 18. I think it is safe to say that Python can't solve this in time, even with Numba, using this particular algorithm. – Anna Jokela – 2014-05-02T16:04:49.257

I am optimistic that a better algorithm exists. – None – 2014-05-02T16:56:30.350

I tried pip install numba but it says it can't find llvmpy. I tried sudo pip install llvmpy but it says it can't find versioneer. I tried sudo pip install versioneer but it says I already have it. – None – 2014-05-02T18:24:24.370

Although I haven't got numba to work yet (I think I will have to install anaconda in the end) I am impressed by this. The question is, can you get it to solve N=22 using a similar method to the nimrod one? – None – 2014-05-05T10:27:56.523

Behrends' solutions is very clever! I could try to port it to Python and use Numba to see how fast it would run. Surprisingly Numba version runs much faster than the equivalent Lisp and C# versions on my machine. I'm currently working on an another type of a solution for this problem. It ALMOST works, but not quite yet =P – Anna Jokela – 2014-05-05T12:50:07.737

@AnnaJokela I think the translation may not be correct. s += re should be shifted two indentations to the left, and the block s = v & -v ... v = r | o should be shifted one indentation to the left. A few other notes, counts[j] += 1 is compiler optimized as counts[j]++, which is why I add 1 instead of 8, and shift an extra 3 at the end. When cycling through all values with the same bit count, odd v can be special cased: find the first 0 bit, flip that bit and the one preceeding it. – primo – 2014-05-07T04:40:22.450

@AnnaJokela I've modified your code a bit to satisfy my own curiosity: http://codepad.org/Y2rOQF77 Timings for n=14: CPython+Numpy+Numba - 78ms; PyPy - 375ms; CPython - 10.531s; CPython+Numpy - 36.203s. I'm impressed. Numba is 5x faster than PyPy (which was already 28x faster than CPython). And it's very nearly as fast as compiled RPython when single-threaded (for n=18: 15.453s vs. 14.687s).

– primo – 2014-05-07T06:36:08.883

Looks like Reimer Behrends' latest method is the one to copy! – None – 2014-05-07T18:02:41.613

It sure looks like it. I have to admit that the solution I was working on wasn't even half as good as his. I must tip a hat for him and his brains! – Anna Jokela – 2014-05-07T18:34:59.230

2

C++ N=16

I'm testing on an EeePC with an Atom, so my times don't make a lot of sense. :D
The Atom solves n=14 in 34 seconds and n=16 in 20 minutes. I want to test n=16 on the OP's PC. I'm optimistic.

The idea is that every time we find a solution for a given F we have actually found 2^i solutions, because we can change the lower i bits of S and get the same result.

#include <stdio.h>
#include <cinttypes>
#include <cstring>

int main()
{
   const int n = 16;
   const int m = n + 1;
   const uint64_t maxS = 1ULL << (2*n);
   const uint64_t maxF = 1ULL << n;
   const uint64_t mask = (1ULL << n)-1;
   uint64_t out[m]={0};
   uint64_t temp[m] = {0};
   for( uint64_t F = 0; F < maxF; ++F )
   {
      for( uint64_t S = 0; S < maxS; ++S )
      {
         int numSolution = 1;
         for( int i = n; i >= 0; --i )
         {
            const uint64_t window = S >> i;
            if( __builtin_popcount( mask & (window ^ F) ) == (n / 2) )
            {
               temp[i] += 1;
            } else {
               numSolution = 1 << i;
               S += numSolution - 1;
               break;
            }
         }
         for( int i = n; i >= 0; --i )
         {
            out[i] += temp[i]*numSolution;
            temp[i] = 0;
         }
      }
   }
   for( int i = n; i >= 0; --i )
   {
      uint64_t x = out[i];
      printf( "%lu ", x );
   }
   return 0;
}

to compile:

g++ 26459.cpp -std=c++11 -O3 -march=native -fstrict-aliasing -ftree-vectorize -Wall -pedantic -o 26459

ilmale

Posted 2014-04-30T20:47:41.673

Reputation: 1 147

1This is great. I have some half-baked ideas in fact for how to solve it for larger n. Would you like to hear them or could that spoil the competition? – None – 2014-05-01T12:49:43.897

2

JavaScript, n=12

On my computer it took 231.242 seconds. In the demo I'm using web workers to prevent freezing the browser. This could surely be improved further with parallel workers. I know JS doesn't stand a chance in this challenge, but I did it for fun!

Click to run the online Demo

var n = 8;        
var m = n + 1;
var o = [];
var popCount = function(bits) {
  var SK5  = 0x55555555,
      SK3  = 0x33333333,
      SKF0 = 0x0f0f0f0f,
      SKFF = 0xff00ff;

  bits -= (bits >> 1) & SK5;
  bits  = (bits & SK3) + ((bits >> 2) & SK3);
  bits  = (bits & SKF0) + ((bits >> 4) & SKF0);
  bits += bits >> 8;

  return (bits + (bits >> 15)) & 63;
};
for(var S = 0; S < (1 << n + m - 1); S += 2){
  for(var F = 0; F < (1 << n - 1); F += 1){
    for (var i = 0; i < m; i++){
      var c = popCount(((S >> i) & ((1 << n) - 1)) ^ F);
      if(c == n >> 1){
        if(!o[i]) o[i] = 0;
        o[i] += 4;
      } else break;
    }
  }
}
console.log(o); // in the web-worker demo this is a postMessage instead

rafaelcastrocouto

Posted 2014-04-30T20:47:41.673

Reputation: 315

What about one of those new(ish) fast javascript engines? Could those be used? – None – 2014-05-08T14:31:44.753

You mean something like dart?

– rafaelcastrocouto – 2014-05-08T14:38:53.033

1Actually I am wrong. You might as well just try both firefox and chromium. Unless you want to write it in asm.js of course :) – None – 2014-05-08T15:27:51.503

1challenge accepted ... gonna do it! – rafaelcastrocouto – 2014-05-08T18:28:40.730

1

Tried this and took my computer 5.4 seconds to do n=22

[235388928,86292480,19031048,5020640,1657928,783920,545408,481256,463832,460256,459744,459744,459744,459744,459744,459744,459744,459744,459744,459744,459744,459744]

http://i.imgur.com/FIJa2Ch.png

– Spedwards – 2014-05-09T12:09:54.267

omg ...what machine are you using? I want one! – rafaelcastrocouto – 2014-05-09T17:36:52.853

@Spedwards That's the wrong answer for n = 22. See the question for the correct output. – None – 2014-05-09T18:02:31.963

1

Fortran: n=12

I just made a quick'n'dirty version in Fortran, with no optimizations except OpenMP. It should squeeze in at just below 10 minutes for n=12 on the OP's machine; it takes 10:39 on my machine, which is slightly slower.

64-bit integers have a negative performance impact indeed; guess I would have to rethink the whole algorithm for this to be much faster. Don't know if I'm going to bother, I think I'll rather spend some time thinking up a good challenge that's more to my own tastes. If anyone else wants to take this and run with it, go ahead :)

program golf
use iso_fortran_env
implicit none
integer, parameter ::  n=12
integer :: F(n), S(2*n)
integer(int64) :: leadingzerocounts(n+1)
integer :: k
integer(int64) :: i,j,bindec,enc

leadingzerocounts=0

!$OMP parallel do private(i,enc,j,bindec,S,F,k) reduction(+:leadingzerocounts) schedule(dynamic)
do i=0,2**(2*n)-1
  enc=i
  ! Short loop to convert i into the array S with -1s and 1s
  do j=2*n,1,-1
    bindec=2**(j-1)
    if (enc-bindec .ge. 0) then
      S(j)=1
      enc=enc-bindec
    else
      S(j)=-1
    endif
  end do
  do j=0,2**(n)-1
    ! Convert j into the array F with -1s and 1s
    enc=j
    do k=n,1,-1
      bindec=2**(k-1)
      if (enc-bindec .ge. 0) then
        F(k)=1
        enc=enc-bindec
      else
        F(k)=-1
      endif
    end do
    ! Compute dot product   
    do k=1,n+1
      if (dot_product(F,S(k:k+n-1)) /= 0) exit
      leadingzerocounts(k)=leadingzerocounts(k)+1
    end do
  end do
end do
!$OMP end parallel do

print *, leadingzerocounts

end

semi-extrinsic

Posted 2014-04-30T20:47:41.673

Reputation: 641

1

Lua: n = 16

Disclaimer: my intention is NOT to post this as my own answer, since the algorithm I'm using is shamelessly stolen from Anna Jokela's clever answer, which was shamelessly stolen from ilmale's clever answer.

Besides, it's not even valid - it has inaccuracies caused by floating point numbers (it would be better if Lua supported 64-bit integers). However, I'm still uploading it, just to show how fast this solution is. It's a dynamic programming language, and yet I can calculate n = 16 in a reasonable time (1 minute on an 800MHz CPU).
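The precision issue is easy to demonstrate (shown here in Python, but Lua's plain numbers are the same IEEE-754 doubles): integers above 2^53 are no longer all exactly representable, so large counts silently lose their low bits.

```python
# Doubles represent every integer up to 2**53 exactly, and no further:
# adding 1 to 2**53 rounds straight back down.
exact_limit = 2 ** 53
assert float(exact_limit) + 1 == float(exact_limit)   # precision lost
assert float(exact_limit - 1) != float(exact_limit)   # still exact below it
```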

Run with LuaJIT, standard interpreter is too slow.

local bit = require "bit"
local band = bit.band
local bor = bit.bor
local bxor = bit.bxor
local lshift = bit.lshift
local rshift = bit.rshift

-- http://stackoverflow.com/a/11283689/736054
local function pop_count(w)
    local b1 = 1431655765
    local b2 = 858993459
    local b3 = 252645135
    local b7 = 63

    w = band(rshift(w, 1), b1) + band(w, b1)
    w = band(rshift(w, 2), b2) + band(w, b2)
    w = band(w + rshift(w, 4), b3)
    return band(rshift(w, 24) + rshift(w, 16) + rshift(w, 8) + w, b7)
end

local function gen_array(n, value)
    value = value or 0
    array = {}
    for i = 1, n do
        array[i] = value
    end
    return array
end

local n = 16
local u = math.floor(n / 2)
local m = n + 1
local maxf = math.floor(lshift(1, n) / 2)
local maxs = maxf ^ 2
local mask = lshift(1, n) - 1

local out = gen_array(m, 0)
local temp = gen_array(m, 0)


for f = 0, maxf do
    local s = 0
    while s <= maxs do
        local num_solution = 1

        for i = n, 0, -1 do
            if pop_count(band(mask, bxor(rshift(s, i), f))) == u then
                temp[i + 1] = temp[i + 1] + 8
            else
                num_solution = lshift(1, i)
                s = s + num_solution - 1
                break
            end
        end

        for i = 1, m do
            out[i] = out[i] + temp[i] * num_solution
            temp[i] = 0
        end

        s = s + 1
    end
end

for i = m, 1, -1 do
    print(out[i])
end

Konrad Borowski

Posted 2014-04-30T20:47:41.673

Reputation: 11 185

Thank you. I think recent lua versions use long long int which should be 64 bit on a 64 bit system. See "lua_integer" at http://www.lua.org/work/doc/manual.html .

– None – 2014-05-05T15:03:22.883

@Lembik: Interesting. Either way, it's standard Lua (which already supports long long instead of double with a compilation setting), not LuaJIT. – Konrad Borowski – 2014-05-05T15:25:45.993

I think I was just wrong in any case wrt luajit. One would need 5.3 which doesn't exist. The best advice lua people could give was "try 5.3-workx". – None – 2014-05-05T16:39:30.500