Python 2 using pypy and pp: n = 15 in 3 minutes
Also just a simple brute force. It is interesting to see that I get nearly the same speed as kuroi neko with C++. My code can reach n = 12 in about 5 minutes, and I only run it on one virtual core.
edit: Reduce search space by a factor of n
I noticed that a cycled vector A* of A produces the same probabilities as the original vector A when I iterate over B.

E.g. the vector (1, 1, 0, 1, 0, 0) has the same probabilities as each of the vectors (1, 0, 1, 0, 0, 1), (0, 1, 0, 0, 1, 1), (1, 0, 0, 1, 1, 0), (0, 0, 1, 1, 0, 1) and (0, 1, 1, 0, 1, 0) when choosing a random B. Therefore I don't have to iterate over each of these 6 vectors, but only over one of them, and replace count[i] += 1 with count[i] += cycle_number.
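The cycle equivalence is easy to check by brute force. The snippet below (my own minimal sketch, shown only for the single dot product A*B; the same relabeling argument carries over to the shifted products) verifies that every rotation of an example A admits the same number of zero-sum vectors B:

```python
from itertools import product

def zero_count(A, n):
    # count the vectors B in {-1, 0, 1}**n with A*B == 0
    return sum(1 for B in product((-1, 0, 1), repeat=n)
               if sum(a * b for a, b in zip(A, B)) == 0)

n = 6
A = (1, 1, 0, 1, 0, 0)
# all cyclic rotations of A
rotations = [tuple(A[(i + j) % n] for j in range(n)) for i in range(n)]
counts = [zero_count(R, n) for R in rotations]
# every rotation gives the same count, because rotating A is just a
# relabeling of the positions of B
```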
This reduces the complexity from Theta(6^n) to Theta(6^n / n). Therefore for n = 13 it's about 13 times as fast as my previous version: it calculates n = 13 in about 2 minutes 20 seconds. For n = 14 it is still a little bit too slow; it takes about 13 minutes.
edit 2: Multi-core-programming
Not really happy with this improvement. I decided to also try to execute my program on multiple cores. On my 2 + 2 cores I can now calculate n = 14 in about 7 minutes. Only a factor of 2 improvement.
The code is available in this github repo: Link. The multi-core programming makes it a little bit ugly.
edit 3: Reducing search space for A
vectors and B
vectors
I noticed the same mirror-symmetry for the vectors A as kuroi neko did. I'm still not sure why this works (and whether it works for every n).
Reducing the search space for the B vectors is a bit cleverer. I replaced the generation of the vectors (itertools.product) with my own function. Basically, I start with an empty list and put it on a stack. While the stack is not empty, I remove a list; if its length is not equal to n, I generate 3 other lists (by appending -1, 0, 1) and push them onto the stack. If a list has length n, I can evaluate the sums.
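Stripped of the filtering, the stack-based generation looks roughly like this (a bare sketch; the real code below additionally carries the partial sums along so that branches can be pruned):

```python
from itertools import product

def generate_B(n):
    # iterative depth-first generation of all vectors in {-1, 0, 1}**n,
    # using an explicit stack instead of itertools.product
    stack = [[]]
    while stack:
        B = stack.pop()
        if len(B) == n:
            yield tuple(B)           # list is complete: evaluate the sums here
        else:
            for v in (-1, 0, 1):
                stack.append(B + [v])

vectors = set(generate_B(3))
# generates exactly the same vectors as itertools.product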
Now that I generate the vectors myself, I can filter them depending on whether the sum 0 is still reachable. E.g. if my vector A is (1, 1, 1, 0, 0) and my vector B looks like (1, 1, ?, ?, ?), I know that I cannot fill the ? with values such that A*B = 0. So I don't have to iterate over any of the vectors B of the form (1, 1, ?, ?, ?).
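The pruning criterion for this example boils down to comparing the partial sum against the number of remaining ones in A (a minimal sketch of the bound; the full code tracks this incrementally in sum1/sum2):

```python
A = (1, 1, 1, 0, 0)
prefix = (1, 1)                                  # B starts with (1, 1, ?, ?, ?)
partial = sum(a * b for a, b in zip(A, prefix))  # dot product so far
remaining_ones = sum(A[len(prefix):])            # how many A-entries can still matter
# each remaining B-entry is in {-1, 0, 1}, so the final sum lies within
# partial +/- remaining_ones; 0 is unreachable when |partial| > remaining_ones
reachable = abs(partial) <= remaining_ones
```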
We can improve on this if we handle the values for i = 1 separately. As noted in the question, the values for i = 1 are the sequence A081671. There are many ways to calculate those; I chose the simple recurrence a(n) = (4*(2*n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n. Since we can calculate i = 1 in basically no time, we can filter more vectors B. E.g. A = (0, 1, 0, 1, 1) and B = (1, -1, ?, ?, ?). We can ignore vectors where the first ? = 1, because A * cycled(B) > 0 for all of these vectors. I hope you can follow; it's probably not the best example.
With this I can calculate n = 15 in 6 minutes.
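The recurrence only takes a few lines. The initial values a(0) = 1, a(1) = 4 are my assumption here (not stated in the text); they reproduce the i = 1 count 83940771168 in the n = 15 output below:

```python
def oeis_A081671(n):
    # recurrence from above: a(n) = (4*(2n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n
    # assumed initial values: a(0) = 1, a(1) = 4
    if n == 0:
        return 1
    prev, cur = 1, 4
    for k in range(2, n + 1):
        prev, cur = cur, (4 * (2 * k - 1) * cur - 12 * (k - 1) * prev) // k
    return cur
```

oeis_A081671(15) gives 83940771168, which matches the first row of the output table (83940771168 / 470184984576, where 470184984576 = 6**15).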
edit 4:
Quickly implemented kuroi neko's great idea that B and -B produce the same results. Speedup x2. The implementation is only a quick hack, though. n = 15 in 3 minutes.
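The symmetry is easy to convince yourself of: A*(-B) = -(A*B), so the two dot products vanish together. A quick exhaustive check for a small n (my own sanity check, again only for the single dot product):

```python
from itertools import product

def dot(X, Y):
    return sum(x * y for x, y in zip(X, Y))

n = 4
ok = True
for A in product((0, 1), repeat=n):
    for B in product((-1, 0, 1), repeat=n):
        neg_B = tuple(-b for b in B)
        # dot(A, neg_B) == -dot(A, B), so both are zero together
        ok = ok and ((dot(A, B) == 0) == (dot(A, neg_B) == 0))
```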
Code:
For the complete code visit Github. The following code is only a representation of the main features. I left out imports, multicore programming, printing the results, ...
count = [0] * n
count[0] = oeis_A081671(n)

# generate all relevant vectors A
visited = set(); todo = dict()
for A in product((0, 1), repeat=n):
    if A not in visited:
        # generate all vectors which have the same probability:
        # mirrored and cycled vectors
        same_probability_set = set()
        for i in range(n):
            tmp = [A[(i+j) % n] for j in range(n)]
            same_probability_set.add(tuple(tmp))
            same_probability_set.add(tuple(tmp[::-1]))
        visited.update(same_probability_set)
        todo[A] = len(same_probability_set)

# for each vector A, create all possible vectors B
stack = []
for A, cycled_count in todo.iteritems():
    ones = [sum(A[i:]) for i in range(n)] + [0]
    # + [0], so that ones[n] doesn't throw an exception later
    stack.append(([0] * n, 0, 0, 0, False))
    while stack:
        B, index, sum1, sum2, used_negative = stack.pop()
        if index < n:
            # fill B[index] in all possible ways,
            # so that it's still possible to reach 0
            if used_negative:
                for v in (-1, 0, 1):
                    sum1_new = sum1 + v * A[index]
                    sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                    if abs(sum1_new) <= ones[index+1]:
                        if abs(sum2_new) <= ones[index] - A[n-1]:
                            C = B[:]
                            C[index] = v
                            stack.append((C, index + 1, sum1_new, sum2_new, True))
            else:
                for v in (0, 1):
                    sum1_new = sum1 + v * A[index]
                    sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                    if abs(sum1_new) <= ones[index+1]:
                        if abs(sum2_new) <= ones[index] - A[n-1]:
                            C = B[:]
                            C[index] = v
                            stack.append((C, index + 1, sum1_new, sum2_new, v == 1))
        else:
            # B is complete, calculate the sums
            count[1] += cycled_count  # we know that the sum == 0 for i = 1
            for i in range(2, n):
                sum_prod = 0
                for j in range(n-i):
                    sum_prod += A[j] * B[i+j]
                for j in range(i):
                    sum_prod += A[n-i+j] * B[j]
                if sum_prod:
                    break
                else:
                    if used_negative:
                        count[i] += 2*cycled_count
                    else:
                        count[i] += cycled_count
Usage:
You have to install pypy (for Python 2!). The parallel python module isn't ported to Python 3. Then you have to install the parallel python module pp-1.6.4.zip. Extract it, cd into the folder and call pypy setup.py install.
Then you can call my program with
pypy you-do-the-math.py 15
It will automatically determine the number of CPUs. There may be some error messages after the program finishes; just ignore them. n = 16 should be possible on your machine.
Output:
Calculation for n = 15 took 2:50 minutes
1 83940771168 / 470184984576 17.85%
2 17379109692 / 470184984576 3.70%
3 3805906050 / 470184984576 0.81%
4 887959110 / 470184984576 0.19%
5 223260870 / 470184984576 0.05%
6 67664580 / 470184984576 0.01%
7 30019950 / 470184984576 0.01%
8 20720730 / 470184984576 0.00%
9 18352740 / 470184984576 0.00%
10 17730480 / 470184984576 0.00%
11 17566920 / 470184984576 0.00%
12 17521470 / 470184984576 0.00%
13 17510280 / 470184984576 0.00%
14 17507100 / 470184984576 0.00%
15 17506680 / 470184984576 0.00%
Notes and ideas:
- I have an i7-4600M processor with 2 cores and 4 threads. It doesn't matter whether I use 2 or 4 threads. The CPU usage is 50% with 2 threads and 100% with 4 threads, but it still takes the same amount of time. I don't know why. I checked that each thread only has to handle half the amount of data when there are 4 threads, checked the results, ...
- I use a lot of lists. Python isn't quite efficient at storing them, and I have to copy lots of lists, so I thought of using an integer instead. I could use the bits 00 (for 0) and 11 (for 1) in the vector A, and the bits 10 (for -1), 00 (for 0) and 01 (for 1) in the vector B. For the product of A and B, I would then only have to calculate A & B and count the 01 and 10 blocks. Cycling can be done by shifting the vector and using masks. I actually implemented all this; you can find it in some of my older commits on Github. But it turned out to be slower than with lists. I guess pypy really optimizes list operations.
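The encoding works roughly like this (a sketch of the idea from memory, not the code from the old commits; the commit version also handles cycling with shifts and masks):

```python
def encode_A(A):
    # A-entry 0 -> bits 00, 1 -> bits 11 (two bits per entry, entry 0 lowest)
    x = 0
    for a in reversed(A):
        x = (x << 2) | (0b11 if a else 0b00)
    return x

def encode_B(B):
    # B-entry -1 -> bits 10, 0 -> bits 00, 1 -> bits 01
    x = 0
    for b in reversed(B):
        x = (x << 2) | {-1: 0b10, 0: 0b00, 1: 0b01}[b]
    return x

def dot_from_bits(a_bits, b_bits, n):
    # A & B keeps B's bit pair wherever A has a 1 and clears it wherever
    # A has a 0; each 01 pair contributes +1, each 10 pair -1
    prod = a_bits & b_bits
    total = 0
    for i in range(n):
        pair = (prod >> (2 * i)) & 0b11
        if pair == 0b01:
            total += 1
        elif pair == 0b10:
            total -= 1
    return total

A = (1, 0, 1, 1, 0)
B = (1, -1, 0, -1, 1)
bit_dot = dot_from_bits(encode_A(A), encode_B(B), len(A))
plain_dot = sum(a * b for a, b in zip(A, B))
# bit_dot agrees with the plain dot product
```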
What is your graphic card ? Looks like a job for a GPU. – Michael M. – 2014-12-11T22:57:57.643
@Knerd I added a table of probabilities to the question instead. I hope it is helpful. – None – 2014-12-12T09:55:18.777
@Mig PCI bridge: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 1) . This may be disappointing news for you, sorry. – None – 2014-12-12T11:04:17.563
http://oeis.org/A081671 – kennytm – 2014-12-12T11:25:59.683
@KennyTM Thanks I'll add that to the question. – None – 2014-12-12T11:27:32.263
@Lembik could you please add the execution time on your machine to the leaderboard? I'm really curious to see what your 8 cores can make of my code. – None – 2014-12-12T19:23:50.707
@kuroineko It only runs on one core! Have you tried it in your ubuntu install? – None – 2014-12-12T19:44:07.287
@Lembik Pfff... I can't get VirtualBox to allocate more than 1 CPU to my VM, so I can't test that. The horrible processor affinity code does not help either. Would you kindly comment out the lock_on_core() call and see if the scheduler does make some use of your 8 cores when left to its own initiative? – None – 2014-12-12T19:54:44.790
I know nothing about windows but http://askubuntu.com/questions/365615/how-do-i-enable-multiple-cores-in-my-virtual-enviroment seems related to your problem. Maybe there is a windows version of this. I also commented out lock_on_core(index); but it seems to have made no difference. – None – 2014-12-12T19:57:36.543
Pfff (bis)... VirtualBox does not allow me to change the settings despite what the article claims, and frankly I'm getting tired of this. Now about your native Ubuntu environment, I wonder what use threads are if the bloody OS and/or compiler does not manage to run them on all available cores? Is there some Linux magic to invoke to make that possible at all? – None – 2014-12-12T20:01:44.370
It really should work. Could you make a minimal example for me to test and possibly post on SO? – None – 2014-12-12T20:03:29.360
If it helps, https://bpaste.net/show/ee23472615c3 works and uses all my cores, and I have to compile it with g++ -O3 -std=c++0x -pthread -Wl,--no-as-needed Stefan.cpp -o Stefan – None – 2014-12-12T20:07:40.467
@Knerd How can I say no. I will try to work out how to run your code in linux but any help much appreciated. – None – 2014-12-11T11:46:08.597
Ok, sorry for deleting comments. For all that didn't read, it was if F# or C# are allowed :) – Knerd – 2014-12-11T11:47:11.467
The other question again, do you maybe have an example of a valid input output? – Knerd – 2014-12-11T11:47:44.220