Nimrod (N=22)
import math, locks

const
  N = 20
  M = N + 1
  FSize = (1 shl N)
  FMax = FSize - 1
  SStep = 1 shl (N-1)
  numThreads = 16

type
  ZeroCounter = array[0..M-1, int]
  ComputeThread = TThread[int]

var
  leadingZeros: ZeroCounter
  lock: TLock
  innerProductTable: array[0..FMax, int8]

proc initInnerProductTable =
  for i in 0..FMax:
    innerProductTable[i] = int8(countBits32(int32(i)) - N div 2)

initInnerProductTable()

proc zeroInnerProduct(i: int): bool =
  innerProductTable[i] == 0

proc search2(lz: var ZeroCounter, s, f, i: int) =
  if zeroInnerProduct(s xor f) and i < M:
    lz[i] += 1 shl (M - i - 1)
    search2(lz, (s shr 1) + 0, f, i+1)
    search2(lz, (s shr 1) + SStep, f, i+1)

when defined(gcc):
  const
    unrollDepth = 1
else:
  const
    unrollDepth = 4

template search(lz: var ZeroCounter, s, f, i: int) =
  when i < unrollDepth:
    if zeroInnerProduct(s xor f) and i < M:
      lz[i] += 1 shl (M - i - 1)
      search(lz, (s shr 1) + 0, f, i+1)
      search(lz, (s shr 1) + SStep, f, i+1)
  else:
    search2(lz, s, f, i)

proc worker(base: int) {.thread.} =
  var lz: ZeroCounter
  for f in countup(base, FMax div 2, numThreads):
    for s in 0..FMax:
      search(lz, s, f, 0)
  acquire(lock)
  for i in 0..M-1:
    leadingZeros[i] += lz[i]*2
  release(lock)

proc main =
  var threads: array[numThreads, ComputeThread]
  for i in 0 .. numThreads-1:
    createThread(threads[i], worker, i)
  for i in 0 .. numThreads-1:
    joinThread(threads[i])

initLock(lock)
main()
echo(@leadingZeros)
Compile with
nimrod cc --threads:on -d:release count.nim
(Nimrod can be downloaded here.)
This runs in the allotted time for N=20 (and for N=18 when using only a single thread, taking about 2 minutes in the latter case).
The algorithm uses a recursive search, pruning the search tree whenever a non-zero inner product is encountered. We also cut the search space in half by observing that for any pair of vectors (F, -F) we only need to consider one: the other produces exactly the same sets of inner products (by negating S as well).
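To illustrate the (F, -F) symmetry, here is a small Python sketch (mine, not part of the submission) using the ±1 interpretation of the bit vectors: negating F negates every windowed inner product, and negating S as well restores them, so each pair yields identical leading-zero profiles.

```python
from itertools import product

def dot(f, s_win):
    # inner product of two ±1 vectors
    return sum(a * b for a, b in zip(f, s_win))

def leading_zero_profile(f, s):
    # number of leading windowed inner products F·S[i:i+n] that are zero
    n = len(f)
    count = 0
    for i in range(len(s) - n + 1):
        if dot(f, s[i:i + n]) != 0:
            break
        count += 1
    return count

n = 4
for f in product((1, -1), repeat=n):
    neg_f = tuple(-x for x in f)
    for s in product((1, -1), repeat=2 * n):
        neg_s = tuple(-x for x in s)
        # (F, S) and (-F, -S) yield identical profiles, so only one of
        # each (F, -F) pair needs to be searched, doubling the counts
        assert leading_zero_profile(f, s) == leading_zero_profile(neg_f, neg_s)
print("symmetry verified for n =", n)
```

This is why the worker only iterates f up to FMax div 2 and merges lz[i]*2 into the totals.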
The implementation uses Nimrod's metaprogramming facilities to unroll/inline the first few levels of the recursive search. This saves a little time when using gcc 4.8 and 4.9 as the backend of Nimrod and a fair amount for clang.
The search space could be further pruned by observing that we only need to consider values of S that differ in an even number of the first N positions from our choice of F. However, the complexity or memory needs of that do not scale for large values of N, given that the loop body is completely skipped in those cases.
Tabulating where the inner product is zero appears to be faster than using any bit counting functionality in the loop. Apparently accessing the table has pretty good locality.
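For the curious, the reason a single lookup suffices: with the ±1 vectors packed into bit masks, the inner product of F and a window of S equals N - 2·popcount(f xor s), so it vanishes exactly when the XOR has N/2 set bits. A small Python sketch (mine, not from the submission) cross-checks this:

```python
N = 8  # small illustrative size; the submission uses N up to 24

# innerProductTable analogue: popcount(x) - N/2, built once up front
inner_product_table = [bin(i).count("1") - N // 2 for i in range(1 << N)]

def zero_inner_product(x):
    return inner_product_table[x] == 0

def dot_from_bits(f, s):
    # direct ±1 inner product (bit 0 -> +1, bit 1 -> -1), for cross-checking
    return sum((1 - 2 * ((f >> j) & 1)) * (1 - 2 * ((s >> j) & 1))
               for j in range(N))

# the product is N - 2*popcount(f xor s), so it is zero exactly when
# the table entry for f xor s is zero
for f in range(1 << N):
    s = (f * 2654435761) & ((1 << N) - 1)  # arbitrary partner vector
    assert (dot_from_bits(f, s) == 0) == zero_inner_product(f ^ s)
print("table lookup agrees with direct inner products")
```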
It looks as though the problem should be amenable to dynamic programming, considering how the recursive search works, but there is no apparent way to do that with a reasonable amount of memory.
Example outputs:
N=16:
@[55276229099520, 10855179878400, 2137070108672, 420578918400, 83074121728, 16540581888, 3394347008, 739659776, 183838720, 57447424, 23398912, 10749184, 5223040, 2584896, 1291424, 645200, 322600]
N=18:
@[3341140958904320, 619683355033600, 115151552380928, 21392898654208, 3982886961152, 744128512000, 141108051968, 27588886528, 5800263680, 1408761856, 438001664, 174358528, 78848000, 38050816, 18762752, 9346816, 4666496, 2333248, 1166624]
N=20:
@[203141370301382656, 35792910586740736, 6316057966936064, 1114358247587840, 196906665902080, 34848574013440, 6211866460160, 1125329141760, 213330821120, 44175523840, 11014471680, 3520839680, 1431592960, 655872000, 317675520, 156820480, 78077440, 39005440, 19501440, 9750080, 4875040]
For purposes of comparing the algorithm with other implementations, N=16 takes about 7.9 seconds on my machine when using a single thread and 2.3 seconds when using four cores.
N=22 takes about 15 minutes on a 64-core machine with gcc 4.4.6 as Nimrod's backend and overflows 64-bit integers in leadingZeros[0] (possibly not unsigned ones; I haven't looked at it).
Update: I've found room for a couple more improvements. First, for a given value of F, we can enumerate the first N entries of the corresponding S vectors precisely, because they must differ in exactly N/2 places. So we precompute a list of bit vectors of size N that have N/2 bits set and use these to derive the initial part of S from F.
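This precomputation can be sketched in Python as follows (illustrative only; names are mine). It builds the analogue of the zeroInnerProductList in the code below: all N-bit masks with exactly N/2 bits set, which XORed with F give the admissible prefixes of S.

```python
from itertools import combinations
from math import comb

N = 6  # small illustrative size

def masks_with_half_bits(n):
    # all n-bit masks with exactly n/2 bits set; XORing F with one of
    # these yields an S prefix differing from F in exactly n/2 places
    return [sum(1 << b for b in bits)
            for bits in combinations(range(n), n // 2)]

masks = masks_with_half_bits(N)
assert len(masks) == comb(N, N // 2)

f = 0b101100
candidates = [f ^ m for m in masks]
# every candidate prefix has a zero inner product with F
assert all(bin(f ^ s).count("1") == N // 2 for s in candidates)
print(len(masks), "candidate S prefixes per value of F")
```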
Second, we can improve upon the recursive search by observing that we always know the value of F[N] (as the MSB is zero in the bit representation). This allows us to predict precisely which branch to recurse into from the inner product. While that would actually allow us to turn the entire search into a single loop, that happens to hurt branch prediction quite a bit, so we keep the top levels in their original form. We still save some time, primarily by reducing the amount of branching we do.
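A small Python model of this branch prediction (mine, not the Nimrod code): because the top bit of F is forced to zero, the two candidate extensions of S differ only in their top bit, so their table entries differ by exactly one and at most one extension keeps the inner product at zero. The iterative variant therefore visits exactly the same nodes as the two-way recursion:

```python
N = 6              # small illustrative size
M = N + 1
SSTEP = 1 << (N - 1)
table = [bin(i).count("1") - N // 2 for i in range(1 << N)]

def search_recursive(lz, s, f, i):
    # original two-way pruned search
    if table[s ^ f] == 0 and i < M:
        lz[i] += 1
        search_recursive(lz, s >> 1, f, i + 1)
        search_recursive(lz, (s >> 1) + SSTEP, f, i + 1)

def search_predicted(lz, s, f):
    # iterative variant: the table value alone selects the surviving branch
    for i in range(M):
        lz[i] += 1
        s >>= 1
        v = table[s ^ f]
        if v == 0:
            pass            # shifting in a 0 bit keeps the product zero
        elif v == -1:
            s += SSTEP      # shifting in a 1 bit keeps the product zero
        else:
            return          # neither bit works: prune

for f in range(SSTEP):      # top bit of F fixed at zero
    for s in range(1 << N):
        if table[s ^ f] != 0:
            continue        # only roots with a zero inner product
        a, b = [0] * M, [0] * M
        search_recursive(a, s, f, 0)
        search_predicted(b, s, f)
        assert a == b
print("predicted-branch search matches the two-way recursion")
```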
As some cleanup, the code now uses unsigned integers and fixes them at 64 bits (in case someone wants to run this on a 32-bit architecture).
Overall speedup is between a factor of x3 and x4. N=22 still needs more than eight cores to run in under 10 minutes, but on a 64-core machine it's now down to about four minutes (with numThreads bumped up accordingly). I don't think there's much more room for improvement without a different algorithm, though.
N=22:
@[12410090985684467712, 2087229562810269696, 351473149499408384, 59178309967151104, 9975110458933248, 1682628717576192, 284866824372224, 48558946385920, 8416739196928, 1518499004416, 301448822784, 71620493312, 22100246528, 8676573184, 3897278464, 1860960256, 911646720, 451520512, 224785920, 112198656, 56062720, 28031360, 14015680]
Updated again, making use of further possible reductions in search space. Runs in about 9:49 minutes for N=22 on my quadcore machine.
Final update (I think). Better equivalence classes for choices of F, cutting runtime for N=22 down to 57 seconds on my machine (edit: the 3:19 minutes originally reported here came from accidentally running with only one thread).
This change makes use of the fact that a pair of vectors produces the same leading zeros if one can be transformed into the other by rotation. Unfortunately, a fairly critical low-level optimization requires that the top bit of F in the bit representation is always the same. While using this equivalence slashed the search space quite a bit and reduced runtime by about one quarter over using a different state space reduction on F, the overhead from giving up the low-level optimization more than offset that gain. However, it turns out that this problem can be eliminated by also considering the fact that values of F that are bitwise complements of one another are equivalent as well. While this added a bit to the complexity of the calculation of the equivalence classes, it also allowed me to retain the aforementioned low-level optimization, leading to a speedup of about x3.
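The class computation can be modelled in Python like this (an illustrative sketch of the same idea as initEquivClasses in the code below, not a translation of it):

```python
N = 6
FMAX = (1 << N) - 1

def ror1(x):
    # rotate an N-bit value right by one
    return ((x >> 1) | (x << (N - 1))) & FMAX

def equivalence_classes():
    # group the values of F into orbits under rotation and complement;
    # one representative per orbit is searched, weighted by orbit size
    weights = {}
    for i in range(1 << N):
        orbit = set()
        r = i
        for _ in range(N):
            orbit.add(r)
            orbit.add(r ^ FMAX)   # complements are equivalent too
            r = ror1(r)
        weights[min(orbit)] = weights.get(min(orbit), 0) + 1
    return weights

classes = equivalence_classes()
# the weighted representatives cover the full space of F exactly once
assert sum(classes.values()) == 1 << N
print(len(classes), "classes cover all", 1 << N, "values of F")
```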
One more update to support 128-bit integers for the accumulated data. To compile with 128-bit integers, you'll need longint.nim from here and to compile with -d:use128bit. N=24 still takes more than 10 minutes, but I've included the result below for those interested.
N=24:
@[761152247121980686336, 122682715414070296576, 19793870419291799552, 3193295704340561920, 515628872377565184, 83289931274780672, 13484616786640896, 2191103969198080, 359662314586112, 60521536552960, 10893677035520, 2293940617216, 631498735616, 230983794688, 102068682752, 48748969984, 23993655296, 11932487680, 5955725312, 2975736832, 1487591936, 743737600, 371864192, 185931328, 92965664]
import math, locks, unsigned

when defined(use128bit):
  import longint
else:
  type int128 = uint64 # Fallback on unsupported architectures
  template toInt128(x: expr): expr = uint64(x)

const
  N = 22
  M = N + 1
  FSize = (1 shl N)
  FMax = FSize - 1
  SStep = 1 shl (N-1)
  numThreads = 16

type
  ZeroCounter = array[0..M-1, uint64]
  ZeroCounterLong = array[0..M-1, int128]
  ComputeThread = TThread[int]
  Pair = tuple[value, weight: int32]

var
  leadingZeros: ZeroCounterLong
  lock: TLock
  innerProductTable: array[0..FMax, int8]
  zeroInnerProductList = newSeq[int32]()
  equiv: array[0..FMax, int32]
  fTable = newSeq[Pair]()

proc initInnerProductTables =
  for i in 0..FMax:
    innerProductTable[i] = int8(countBits32(int32(i)) - N div 2)
    if innerProductTable[i] == 0:
      if (i and 1) == 0:
        add(zeroInnerProductList, int32(i))

initInnerProductTables()

proc ror1(x: int): int {.inline.} =
  ((x shr 1) or (x shl (N-1))) and FMax

proc initEquivClasses =
  add(fTable, (0'i32, 1'i32))
  for i in 1..FMax:
    var r = i
    var found = false
    block loop:
      for j in 0..N-1:
        for m in [0, FMax]:
          if equiv[r xor m] != 0:
            fTable[equiv[r xor m]-1].weight += 1
            found = true
            break loop
        r = ror1(r)
    if not found:
      equiv[i] = int32(len(fTable)+1)
      add(fTable, (int32(i), 1'i32))

initEquivClasses()

when defined(gcc):
  const unrollDepth = 4
else:
  const unrollDepth = 4

proc search2(lz: var ZeroCounter, s0, f, w: int) =
  var s = s0
  for i in unrollDepth..M-1:
    lz[i] = lz[i] + uint64(w)
    s = s shr 1
    case innerProductTable[s xor f]
    of 0:
      discard # s = s + 0
    of -1:
      s = s + SStep
    else:
      return

template search(lz: var ZeroCounter, s, f, w, i: int) =
  when i < unrollDepth:
    lz[i] = lz[i] + uint64(w)
    if i < M-1:
      let s2 = s shr 1
      case innerProductTable[s2 xor f]
      of 0:
        search(lz, s2 + 0, f, w, i+1)
      of -1:
        search(lz, s2 + SStep, f, w, i+1)
      else:
        discard
  else:
    search2(lz, s, f, w)

proc worker(base: int) {.thread.} =
  var lz: ZeroCounter
  for fi in countup(base, len(fTable)-1, numThreads):
    let (fp, w) = fTable[fi]
    let f = if (fp and (FSize div 2)) == 0: fp else: fp xor FMax
    for sp in zeroInnerProductList:
      let s = f xor sp
      search(lz, s, f, w, 0)
  acquire(lock)
  for i in 0..M-1:
    let t = lz[i].toInt128 shl (M-i).toInt128
    leadingZeros[i] = leadingZeros[i] + t
  release(lock)

proc main =
  var threads: array[numThreads, ComputeThread]
  for i in 0 .. numThreads-1:
    createThread(threads[i], worker, i)
  for i in 0 .. numThreads-1:
    joinThread(threads[i])

initLock(lock)
main()
echo(@leadingZeros)
1 – "I reserve the right not to test code that takes more than a few minutes." > You should specify an exact time on your machine, otherwise this question will lack an objective winning criterion. – user12205 – 2014-04-30T20:50:07.450
14 – I love these "increase my speed" challenges. If you are building a commercial product with these, you are going to have one hell of a fast product, and you are also an evil genius. – Rainbolt – 2014-04-30T20:53:42.627
@Rusher I 100% promise I am not going to make 1 cent from any of this. I just like series of questions of increasing difficulty! It somehow reminds me of Bruce Lee movies :) – None – 2014-04-30T20:58:11.713
@Rusher Technically, unless you have permission from the poster, you cannot use these codes commercially, as they are licensed under CC-By-SA. But if your product is closed-source, it's not like anyone would know... – user12205 – 2014-04-30T21:25:45.080
1Perhaps a more informative title would draw attention to this? – TheDoctor – 2014-04-30T21:30:06.720
@TheDoctor Good point but do you have any suggestions? I am not good at titles. – None – 2014-04-30T21:37:34.390
How about "Optimize this algorithm for larger inputs"? I'm not really sure what it does. – TheDoctor – 2014-04-30T21:47:20.447
@TheDoctor Thank you. Is there any part of the code I can explain? – None – 2014-04-30T21:50:27.617
As a sanity check for claims that are large enough to be hard to check with another implementation, the first item should be 2^(2n) n! / ((n/2)!^2) – Peter Taylor – 2014-04-30T22:07:51.030
8 – If we keep doing this sort of challenge, I think we should at least attempt to solve a different problem to keep it interesting (not a variation on the same problem with extra specifications). – grovesNL – 2014-05-01T04:39:43.887
@grovesNL Yes this is the last challenge on this theme. However I would argue it's great to have a few of increasing difficulty so people can use the hard work they have put into previous challenges. – None – 2014-05-01T06:45:15.157
What is the purpose of m? Is the list S just 2n long, and leadingzerocounts just n+1 in length? – qwr – 2014-05-01T20:45:59.813
@qwr Yes that is right. – None – 2014-05-01T21:05:15.700
@Lembik So are we calculating dot products? Because the lists are of unequal length – qwr – 2014-05-01T21:21:26.420
@qwr S is longer than F which is why the dot products are of the form F, S[i:i+n]. That is, F with a window of S of length n. – None – 2014-05-02T06:11:51.020
@ace Technically and practically, CC-BY-SA does allow commercial use. It just requires attribution and sharing of source code. CC-BY-NC-SA (NC=non-commercial) would exclude money making. – ojdo – 2014-05-02T12:30:25.747
8-core means you have 8 physical cores? Or is it 4-core with hyperthreading = 8 virtual cores? Or is it 8-core with hyperthreading for 16 virtual cores? – Claudiu – 2014-05-02T16:43:01.523
1 – @Claudiu It's this CPU http://techreport.com/review/23750/amd-fx-8350-processor-reviewed . – None – 2014-05-02T16:59:09.870
2 – @Claudiu his CPU has 8 physical cores, but the fetch/decode and FPU units are shared between cores. So when the bottleneck is in one of those areas, it behaves more like a quadcore. Abuse integer logic and avoid large code sizes and it's more like an 8-core. – Stefan – 2014-05-02T20:00:35.673
@Lembik Each additional bounty you grant will need to be at least twice the previous one (up until 500, of course). – primo – 2014-05-12T09:30:05.350
@primo :) I would run out of points too quickly. – None – 2014-05-12T12:47:00.840
@Lembik No, I mean, it's not a suggestion. Click the "start a bounty" button again. I will require you to give at least 200 rep, because you previously granted one for 100. – primo – 2014-05-15T04:43:55.620