Calculate the permanent as quickly as possible


The challenge is to write the fastest code possible for computing the permanent of a matrix.

The permanent of an n-by-n matrix A = (a_{i,j}) is defined as

perm(A) = Σ_{σ ∈ S_n} ∏_{i=1}^{n} a_{i,σ(i)}

Here S_n represents the set of all permutations of {1, 2, ..., n}.

As a small example, for a 2-by-2 matrix:

perm([[a, b], [c, d]]) = a·d + b·c

that is, the same expansion as the determinant, but with every term taken with a plus sign.

In this question matrices are all square and will only have the values -1 and 1 in them.
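
As a concrete reference (an unoptimized sketch, far too slow for the contest itself), the definition can be implemented directly in Python:

```python
from itertools import permutations

def permanent(A):
    """Permanent straight from the definition: O(n * n!). For checking only."""
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        term = 1
        for i in range(n):
            term *= A[i][sigma[i]]
        total += term
    return total

# First example matrix from the question:
M = [[ 1, -1, -1,  1],
     [-1, -1, -1,  1],
     [-1,  1, -1,  1],
     [ 1, -1, -1,  1]]
print(permanent(M))  # -4
```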

Examples

Input:

[[ 1 -1 -1  1]
 [-1 -1 -1  1]
 [-1  1 -1  1]
 [ 1 -1 -1  1]]

Permanent:

-4

Input:

[[-1 -1 -1 -1]
 [-1  1 -1 -1]
 [ 1 -1 -1 -1]
 [ 1 -1  1 -1]]

Permanent:

0

Input:

[[ 1 -1  1 -1 -1 -1 -1 -1]
 [-1 -1  1  1 -1  1  1 -1]
 [ 1 -1 -1 -1 -1  1  1  1]
 [-1 -1 -1  1 -1  1  1  1]
 [ 1 -1 -1  1  1  1  1 -1]
 [-1  1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [-1 -1  1 -1  1  1  1  1]]

Permanent:

192

Input:

[[1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1],
 [1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, -1],
 [-1, -1, 1, 1, 1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, -1],
 [-1, -1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1, -1],
 [-1, 1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1],
 [1, -1, 1, 1, -1, -1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, 1, -1, -1, -1],
 [1, -1, -1, 1, -1, -1, -1, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1],
 [1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1],
 [1, -1, -1, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1],
 [-1, -1, 1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1],
 [-1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1],
 [1, 1, -1, -1, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1],
 [-1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1],
 [1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, 1, 1, -1, 1, -1, 1],
 [1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1],
 [1, -1, -1, 1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, -1, -1],
 [-1, 1, 1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1],
 [1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1],
 [1, 1, 1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, 1, -1, -1, -1, -1, 1],
 [-1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1, -1]]

Permanent:

1021509632

The task

You should write code that, given an n by n matrix, outputs its permanent.

As I will need to test your code, it would be helpful if you could give a simple way for me to supply a matrix as input to your code, for example by reading it from standard input.

Be warned that the permanent can be large (the all 1s matrix is the extreme case).
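
To see just how large: every permutation term of the all-1s matrix is 1, so its permanent is exactly n!, which already overflows a signed 64-bit integer at n = 21. A quick sketch (illustrative only):

```python
from math import factorial

# The permanent of the all-1s n-by-n matrix is n! (every permutation contributes 1).
# The third column shows whether the value still fits in a signed 64-bit integer.
for n in (20, 21, 34):
    print(n, factorial(n), factorial(n) < 2**63)
```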

Scores and ties

I will test your code on random +-1 matrices of increasing size and stop the first time your code takes more than 1 minute on my computer. The scoring matrices will be consistent for all submissions in order to ensure fairness.

If two people get the same score then the winner is the one which is fastest for that value of n. If those are within 1 second of each other then it is the one posted first.

Languages and libraries

You can use any available language and libraries you like but no pre-existing function to compute the permanent. I will need to be able to run your code, so please include a full explanation of how to run/compile your code on Linux if at all possible.

Reference implementations

There is already a code-golf question with lots of code in different languages for computing the permanent of small matrices. Mathematica and Maple also both have permanent implementations if you can access those.
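
If neither is at hand, a tiny cross-checker via expansion along the first row (like a determinant expansion but with no alternating signs; still exponential, so small n only) might look like:

```python
def perm_minor(A):
    """Permanent by Laplace-style expansion along the first row (no sign flips)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        # Minor: drop row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += A[0][j] * perm_minor(minor)
    return total

print(perm_minor([[1, 2], [3, 4]]))  # 10  (1*4 + 2*3, no minus sign)
```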

My machine

The timings will be run on my 64-bit machine. This is a standard Ubuntu install with 8 GB of RAM, an AMD FX-8350 eight-core processor and a Radeon HD 4250. This also means I need to be able to run your code.

Low level information about my machine

cat /proc/cpuinfo | grep flags gives

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

I will ask a closely related follow-up multi-language question that doesn't suffer from the big-integer problem, so lovers of Scala, Nim, Julia, Rust and Bash can show off their languages too.

Leader board

  • n = 33 (45 seconds. 64 seconds for n = 34). Ton Hospel in C++ with g++ 5.4.0.
  • n = 32 (32 seconds). Dennis in C with gcc 5.4.0 using Ton Hospel's gcc flags.
  • n = 31 (54 seconds). Christian Sievers in Haskell
  • n = 31 (60 seconds). primo in rpython
  • n = 30 (26 seconds). ezrast in Rust
  • n = 28 (49 seconds). xnor with Python + pypy 5.4.1
  • n = 22 (25 seconds). Shebang with Python + pypy 5.4.1

Note: in practice the timings for Dennis's and Ton Hospel's entries vary a lot for mysterious reasons. For example, they seem to be faster after I have loaded a web browser! The timings quoted are the fastest across all the tests I have done.

user9206

Posted 2016-10-22T08:30:36.173

Reputation:

I read the first sentence, thought 'Lembik', scrolled down, yep - Lembik. – orlp – 2016-10-22T08:40:08.793

@orlp :) It's been a long time. – None – 2016-10-22T08:40:42.103

One other thing I forgot to ask in sandbox, are submissions limited to using a single thread or are multiple threads allowed? Also, I just noticed you mentioned your GPU type, are you also expecting and allowing submissions to use OpenCL or something similar? Finally, what is the deadline to when you will test our submissions? – miles – 2016-10-22T08:55:31.183

@miles Your code can take advantage of anything my PC has to offer. That is it can use all the cores and anything you can get out of my unimpressive GPU. – None – 2016-10-22T09:11:22.817

Could you please include a bigger test case we can check? – xnor – 2016-10-22T09:49:04.800

@xnor I am away from a computer for 36 hours but hopefully Shebang will post his/her simple python code soon that can be used for testing. Otherwise I will post a bigger example then. – None – 2016-10-22T09:51:17.687

@Lembik I added a large test case. I'll wait for someone to confirm it to be sure. – xnor – 2016-10-22T09:58:19.953

@Lembik What size of the matrix do you expect? Is it allowed to mix programming languages? – Martin Rosenau – 2016-10-22T10:21:41.273

@MartinRosenau You can mix in any way you like. Please do help me run your code though. It's hard to guess how fast people's code will be but anything more than n=25 is impressive. – None – 2016-10-22T10:25:14.500

Is Mathematica code allowed? – JungHwan Min – 2016-10-22T15:42:00.140

Can we rely on the fact that "In this question matrices are all square and will only have the values -1 and 1 in them", or should solutions be theoretically correct no matter what? – Socratic Phoenix – 2016-10-22T17:38:34.653

What is on the flags line on your systems /proc/cpuinfo ? – Ton Hospel – 2016-10-22T18:05:17.047

@JHM Yes, although it's not clear what score I can give. Maybe you could compare your code to the Python reference code? – None – 2016-10-22T18:51:36.317

@SocraticPhoenix Ideally your code should be correct for any real number inputs but I won't enforce it. – None – 2016-10-22T18:52:48.997

@TonHospel I will have to tell you in 24 hours, sorry. I am away from my computer. – None – 2016-10-22T18:53:56.767

One of the answers prints an approximate result, since it uses double-precision floats to store the permanent. Is that allowed? – Dennis – 2016-10-22T22:39:46.757

When you say "I will test your code on random +-1 matrices," I assume you mean matrices in which the entries are independent and identically distributed, from a distribution where -1 and +1 have equal probability? It would be good to be explicit about that, since the structure of the random matrices could conceivably affect the speed of the computation. – Nathaniel – 2016-10-23T03:06:48.800

@Nathaniel Yes that is right. – None – 2016-10-23T06:22:44.663

@Dennis Yes, the answer output should be the exact integer, not an approximation. – None – 2016-10-23T06:25:56.673

@SocraticPhoenix Do you know a way to use the fact that matrix entries are restricted to 1 and -1? I searched and thought a lot and didn't find one. (Well, I now use that the entries are not very big...) – Christian Sievers – 2016-10-26T00:42:27.830

@ChristianSievers I thought I might be able to do some magic with signs, but it didn't pan out... – Socratic Phoenix – 2016-10-26T10:33:02.123

Answers


gcc C++ n ≈ 36 (57 seconds on my system)

Uses Glynn's formula with a Gray code for updates if all column sums are even; otherwise it uses Ryser's method. Threaded and vectorized. Optimized for AVX, so don't expect much on older processors. Don't bother with n >= 35 for a matrix with only +1's even if your system is fast enough, since the signed 128-bit accumulator will overflow. For random matrices you will probably not hit the overflow. For n >= 37 the internal multipliers start to overflow for an all 1/-1 matrix, so only use this program for n <= 36.
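
For readers unfamiliar with the Ryser fallback mentioned above: it evaluates perm(A) = (-1)^n · Σ_{S ⊆ {1..n}} (-1)^{|S|} · ∏_i Σ_{j ∈ S} a_{i,j}. A minimal Python sketch of that formula, without any of the Gray-code updates, threading or vectorization this answer actually uses:

```python
def ryser_perm(A):
    """Ryser's formula: O(2^n * n^2) as written (Gray-code updates bring it to O(2^n * n))."""
    n = len(A)
    total = 0
    for mask in range(1, 1 << n):  # non-empty column subsets S, as bitmasks
        prod = 1
        for i in range(n):
            # Row sum restricted to the columns in S.
            prod *= sum(A[i][j] for j in range(n) if mask >> j & 1)
        total += (-1) ** bin(mask).count("1") * prod
    return (-1) ** n * total
```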

Just give the matrix elements on STDIN separated by any kind of whitespace

permanent
1 2
3 4
^D

permanent.cpp:

/*
  Compile using something like:
    g++ -Wall -O3 -march=native -fstrict-aliasing -std=c++11 -pthread -s permanent.cpp -o permanent
*/

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cstdint>
#include <climits>
#include <array>
#include <vector>
#include <thread>
#include <future>
#include <ctgmath>
#include <immintrin.h>

using namespace std;

bool const DEBUG = false;
int const CACHE = 64;

using Index  = int_fast32_t;
Index glynn;
// Number of elements in our vectors
Index const POW   = 3;
Index const ELEMS = 1 << POW;
// Over how many floats we distribute each row
Index const WIDTH = 9;
// Number of bits in the fraction part of a floating point number
int const FLOAT_MANTISSA = 23;
// Type to use for the first add/multiply phase
using Sum  = float;
using SumN = __restrict__ Sum __attribute__((vector_size(ELEMS*sizeof(Sum))));
// Type to convert to between the first and second phase
using ProdN = __restrict__ int32_t __attribute__((vector_size(ELEMS*sizeof(int32_t))));
// Type to use for the third and last multiply phase.
// Also used for the final accumulator
using Value = __int128;
using UValue = unsigned __int128;

// Wrap Value so C++ doesn't really see it and we can put it in vectors etc.
// Needed since C++ doesn't fully support __int128
struct Number {
    Number& operator+=(Number const& right) {
        value += right.value;
        return *this;
    }
    // Output the value
    void print(ostream& os, bool dbl = false) const;
    friend ostream& operator<<(ostream& os, Number const& number) {
        number.print(os);
        return os;
    }

    Value value;
};

using ms = chrono::milliseconds;

auto nr_threads = thread::hardware_concurrency();
vector<Sum> input;

// Allocate cache aligned datastructures
template<typename T>
T* alloc(size_t n) {
    T* mem = static_cast<T*>(aligned_alloc(CACHE, sizeof(T) * n));
    if (mem == nullptr) throw(bad_alloc());
    return mem;
}

// Work assigned to thread k of nr_threads threads
Number permanent_part(Index n, Index k, SumN** more) {
    uint64_t loops = (UINT64_C(1) << n) / nr_threads;
    if (glynn) loops /= 2;
    Index l = loops < ELEMS ? loops : ELEMS;
    loops /= l;
    auto from = loops * k;
    auto to   = loops * (k+1);

    if (DEBUG) cout << "From=" << from << "\n";
    uint64_t old_gray = from ^ from/2;
    uint64_t bit = 1;
    bool bits = (to-from) & 1;

    Index nn = (n+WIDTH-1)/WIDTH;
    Index ww = nn * WIDTH;
    auto column = alloc<SumN>(ww);
    for (Index i=0; i<n; ++i)
        for (Index j=0; j<ELEMS; ++j) column[i][j] = 0;
    for (Index i=n; i<ww; ++i)
        for (Index j=0; j<ELEMS; ++j) column[i][j] = 1;
    Index b;
    if (glynn) {
        b = n > POW+1 ? n - POW - 1: 0;
        auto c = n-1-b;
        for (Index k=0; k<l; k++) {
            Index gray = k ^ k/2;
            for (Index j=0; j< c; ++j)
                if (gray & 1 << j)
                    for (Index i=0; i<n; ++i)
                        column[i][k] -= input[(b+j)*n+i];
                else
                    for (Index i=0; i<n; ++i)
                        column[i][k] += input[(b+j)*n+i];
        }
        for (Index i=0; i<n; ++i)
            for (Index k=0; k<l; k++)
                column[i][k] += input[n*(n-1)+i];

        for (Index k=1; k<l; k+=2)
            column[0][k] = -column[0][k];

        for (Index i=0; i<b; ++i, bit <<= 1) {
            if (old_gray & bit) {
                bits = bits ^ 1;
                for (Index j=0; j<ww; ++j)
                    column[j] -= more[i][j];
            } else {
                for (Index j=0; j<ww; ++j)
                    column[j] += more[i][j];
            }
        }

        for (Index i=0; i<n; ++i)
            for (Index k=0; k<l; k++)
                column[i][k] /= 2;
    } else {
        b = n > POW ? n - POW : 0;
        auto c = n-b;
        for (Index k=0; k<l; k++) {
            Index gray = k ^ k/2;
            for (Index j=0; j<c; ++j)
                if (gray & 1 << j)
                    for (Index i=0; i<n; ++i)
                        column[i][k] -= input[(b+j)*n+i];
        }

        for (Index k=1; k<l; k+=2)
            column[0][k] = -column[0][k];

        for (Index i=0; i<b; ++i, bit <<= 1) {
            if (old_gray & bit) {
                bits = bits ^ 1;
                for (Index j=0; j<ww; ++j)
                    column[j] -= more[i][j];
            }
        }
    }

    if (DEBUG) {
        for (Index i=0; i<ww; ++i) {
            cout << "Column[" << i << "]=";
            for (Index j=0; j<ELEMS; ++j) cout << " " << column[i][j];
            cout << "\n";
        }
    }

    --more;
    old_gray = (from ^ from/2) | UINT64_C(1) << b;
    Value total = 0;
    SumN accu[WIDTH];
    for (auto p=from; p<to; ++p) {
        uint64_t new_gray = p ^ p/2;
        uint64_t bit = old_gray ^ new_gray;
        Index i = __builtin_ffsl(bit);
        auto diff = more[i];
        auto c = column;
        if (new_gray > old_gray) {
            // Phase 1 add/multiply.
            // Uses floats until just before loss of precision
            for (Index i=0; i<WIDTH; ++i) accu[i] = *c++ -= *diff++;

            for (Index j=1; j < nn; ++j)
                for (Index i=0; i<WIDTH; ++i) accu[i] *= *c++ -= *diff++;
        } else {
            // Phase 1 add/multiply.
            // Uses floats until just before loss of precision
            for (Index i=0; i<WIDTH; ++i) accu[i] = *c++ += *diff++;

            for (Index j=1; j < nn; ++j)
                for (Index i=0; i<WIDTH; ++i) accu[i] *= *c++ += *diff++;
        }

        if (DEBUG) {
            cout << "p=" << p << "\n";
            for (Index i=0; i<ww; ++i) {
                cout << "Column[" << i << "]=";
                for (Index j=0; j<ELEMS; ++j) cout << " " << column[i][j];
                cout << "\n";
            }
        }

        // Convert floats to int32_t
        ProdN prod32[WIDTH] __attribute__((aligned (32)));
        for (Index i=0; i<WIDTH; ++i)
            // Unfortunately gcc doesn't recognize the static_cast<int32_t>
            // as a vector pattern, so force it with an intrinsic
#ifdef __AVX__
            //prod32[i] = static_cast<ProdN>(accu[i]);
            reinterpret_cast<__m256i&>(prod32[i]) = _mm256_cvttps_epi32(accu[i]);
#else   // __AVX__
            for (Index j=0; j<ELEMS; ++j)
                prod32[i][j] = static_cast<int32_t>(accu[i][j]);
#endif  // __AVX__

        // Phase 2 multiply. Uses int64_t until just before overflow
        int64_t prod64[3][ELEMS];
        for (Index i=0; i<3; ++i) {
            for (Index j=0; j<ELEMS; ++j)
                prod64[i][j] = static_cast<int64_t>(prod32[i][j]) * prod32[i+3][j] * prod32[i+6][j];
        }
        // Phase 3 multiply. Collect into __int128. For large matrices this will
        // actually overflow but that's ok as long as all 128 low bits are
        // correct. Terms will cancel and the final sum can fit into 128 bits
        // (This will start to fail at n=35 for the all 1 matrix)
        // Strictly speaking this needs the -fwrapv gcc option
        for (Index j=0; j<ELEMS; ++j) {
            auto value = static_cast<Value>(prod64[0][j]) * prod64[1][j] * prod64[2][j];
            if (DEBUG) cout << "value[" << j << "]=" << static_cast<double>(value) << "\n";
            total += value;
        }
        total = -total;

        old_gray = new_gray;
    }

    return bits ? Number{-total} : Number{total};
}

// Prepare datastructures, Assign work to threads
Number permanent(Index n) {
    Index nn = (n+WIDTH-1)/WIDTH;
    Index ww = nn*WIDTH;

    Index rows  = n > (POW+glynn) ? n-POW-glynn : 0;
    auto data = alloc<SumN>(ww*(rows+1));
    auto pointers = alloc<SumN *>(rows+1);
    auto more = &pointers[0];
    for (Index i=0; i<rows; ++i)
        more[i] = &data[ww*i];
    more[rows] = &data[ww*rows];
    for (Index j=0; j<ww; ++j)
        for (Index i=0; i<ELEMS; ++i)
            more[rows][j][i] = 0;

    Index loops = n >= POW+glynn ? ELEMS : 1 << (n-glynn);
    auto a = &input[0];
    for (Index r=0; r<rows; ++r) {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<loops; ++i)
                more[r][j][i] = j == 0 && i %2 ? -*a : *a;
            for (Index i=loops; i<ELEMS; ++i)
                more[r][j][i] = 0;
            ++a;
        }
        for (Index j=n; j<ww; ++j)
            for (Index i=0; i<ELEMS; ++i)
                more[r][j][i] = 0;
    }

    if (DEBUG)
        for (Index r=0; r<=rows; ++r)
            for (Index j=0; j<ww; ++j) {
                cout << "more[" << r << "][" << j << "]=";
                for (Index i=0; i<ELEMS; ++i)
                    cout << " " << more[r][j][i];
                cout << "\n";
            }

    // Send work to threads...
    vector<future<Number>> results;
    for (auto i=1U; i < nr_threads; ++i)
        results.emplace_back(async(DEBUG ? launch::deferred: launch::async, permanent_part, n, i, more));
    // And collect results
    auto r = permanent_part(n, 0, more);
    for (auto& result: results)
        r += result.get();

    free(data);
    free(pointers);

    // For glynn we should double the result, but we will only do this during
    // the final print. This allows n=34 for an all 1 matrix to work
    // if (glynn) r *= 2;
    return r;
}

// Print 128 bit number
void Number::print(ostream& os, bool dbl) const {
    const UValue BILLION = 1000000000;

    UValue val;
    if (value < 0) {
        os << "-";
        val = -value;
    } else
        val = value;
    if (dbl) val *= 2;

    uint32_t output[5];
    for (int i=0; i<5; ++i) {
        output[i] = val % BILLION;
        val /= BILLION;
    }
    bool print = false;
    for (int i=4; i>=0; --i) {
        if (print) {
            os << setfill('0') << setw(9) << output[i];
        } else if (output[i] || i == 0) {
            print = true;
            os << output[i];
        }
    }
}

// Read matrix, check for sanity
void my_main() {
    Sum a;
    while (cin >> a)
        input.push_back(a);

    size_t n = sqrt(input.size());
    if (input.size() != n*n)
        throw(logic_error("Read " + to_string(input.size()) +
                          " elements which does not make a square matrix"));

    vector<double> columns_pos(n, 0);
    vector<double> columns_neg(n, 0);
    Sum *p = &input[0];
    for (size_t i=0; i<n; ++i)
        for (size_t j=0; j<n; ++j, ++p) {
            if (*p >= 0) columns_pos[j] += *p;
            else         columns_neg[j] -= *p;
        }
    std::array<double,WIDTH> prod;
    prod.fill(1);

    int32_t odd = 0;
    for (size_t j=0; j<n; ++j) {
        prod[j%WIDTH] *= max(columns_pos[j], columns_neg[j]);
        auto sum = static_cast<int32_t>(columns_pos[j] - columns_neg[j]);
        odd |= sum;
    }
    glynn = (odd & 1) ^ 1;
    for (Index i=0; i<WIDTH; ++i)
        // A float has an implicit 1. in front of the fraction so it can
        // represent 1 bit more than the mantissa size. And 1 << (mantissa+1)
        // itself is in fact representable
        if (prod[i] && log2(prod[i]) > FLOAT_MANTISSA+1)
            throw(range_error("Values in matrix are too large. A subproduct reaches " + to_string(prod[i]) + " which doesn't fit in a float without loss of precision"));

    for (Index i=0; i<3; ++i) {
        auto prod3 = prod[i] * prod[i+3] * prod[i+6];
        if (log2(prod3) >= CHAR_BIT*sizeof(int64_t)-1)
            throw(range_error("Values in matrix are too large. A subproduct reaches " + to_string(prod3) + " which doesn't fit in an int64"));
    }

    nr_threads = pow(2, ceil(log2(static_cast<float>(nr_threads))));
    uint64_t loops = UINT64_C(1) << n;
    if (glynn) loops /= 2;
    if (nr_threads * ELEMS > loops)
        nr_threads = max(loops / ELEMS, UINT64_C(1));
    // if (DEBUG) nr_threads = 1;

    cout << n << " x " << n << " matrix, method " << (glynn ? "Glynn" : "Ryser") << ", " << nr_threads << " threads" << endl;

    // Go for the actual calculation
    auto start = chrono::steady_clock::now();
    auto perm = permanent(n);
    auto end = chrono::steady_clock::now();
    auto elapsed = chrono::duration_cast<ms>(end-start).count();

    cout << "Permanent=";
    perm.print(cout, glynn);
    cout << " (" << elapsed / 1000. << " s)" << endl;
}

// Wrapper to print any exceptions
int main() {
    try {
        my_main();
    } catch(exception& e) {
        cerr << "Error: " << e.what() << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

Ton Hospel

Posted 2016-10-22T08:30:36.173

Reputation: 14 114

flags: (the same /proc/cpuinfo flags list as quoted in the question above) – None – 2016-10-23T19:43:15.357

I am still debugging my test harness to run your code but it looks very fast, thank you! I was wondering if the larger int size might be causing a speed problem (as you suggested). I saw https://accu.org/index.php/articles/1849 in case it's of any interest. – None – 2016-10-23T19:45:41.440

I had to modify your code to remove the quick_exit calls, as they made it very hard to use in a test harness. Out of interest, why are you using Ryser's formula when the wiki seems to claim the other one should be twice as fast? – None – 2016-10-24T12:59:38.733

@Lembik I switched to Ryser's formula since with the other I need to scale back by 2 << (n-1) at the end which means my int128 accumulator overflowed far before that point. – Ton Hospel – 2016-10-24T13:35:56.490

@TonHospel Oh I see. Do you think there is any further room for improvement now? – None – 2016-10-24T13:41:17.417

@Lembik Yes :-) – Ton Hospel – 2016-10-25T01:45:11.543


C99, n ≈ 33 (35 seconds)

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE 12
#define NUM_THREADS 8

#define popcnt __builtin_popcountll
#define BILLION (1000 * 1000 * 1000)
#define UPDATE_ROW_PPROD() \
    update_row_pprod(row_pprod, row, rows, row_sums, mask, mask_popcnt)

typedef __int128 int128_t;

static inline int64_t update_row_pprod
(
    int64_t* row_pprod, int64_t row, int64_t* rows,
    int64_t* row_sums, int64_t mask, int64_t mask_popcnt
)
{
    int64_t temp = 2 * popcnt(rows[row] & mask) - mask_popcnt;

    row_pprod[0] *= temp;
    temp -= 1;
    row_pprod[1] *= temp;
    temp -= row_sums[row];
    row_pprod[2] *= temp;
    temp += 1;
    row_pprod[3] *= temp;

    return row + 1;
}

int main(int argc, char* argv[])
{
    int64_t size = argc - 1, rows[argc - 1];
    int64_t row_sums[argc - 1];
    int128_t permanent = 0, sign = size & 1 ? -1 : 1;

    if (argc == 2)
    {
        printf("%d\n", argv[1][0] == '-' ? -1 : 1);
        return 0;
    }

    for (int64_t row = 0; row < size; row++)
    {
        char positive = argv[row + 1][0] == '+' ? '-' : '+';

        sign *= ',' - positive;
        rows[row] = row_sums[row] = 0;

        for (char* p = &argv[row + 1][1]; *p; p++)
        {
            rows[row] <<= 1;
            rows[row] |= *p == positive;
            row_sums[row] += *p == positive;
        }

        row_sums[row] = 2 * row_sums[row] - size;
    }

    #pragma omp parallel for reduction(+:permanent) num_threads(NUM_THREADS)
    for (int64_t mask = 1; mask < 1LL << (size - 1); mask += 2)
    {
        int64_t mask_popcnt = popcnt(mask);
        int64_t row = 0;
        int128_t row_prod = 1 - 2 * (mask_popcnt & 1);
        int128_t row_prod_high = -row_prod;
        int128_t row_prod_inv = row_prod;
        int128_t row_prod_inv_high = -row_prod;

        for (int64_t chunk = 0; chunk < size / CHUNK_SIZE; chunk++)
        {
            int64_t row_pprod[4] = {1, 1, 1, 1};

            for (int64_t i = 0; i < CHUNK_SIZE; i++)
                row = UPDATE_ROW_PPROD();

            row_prod *= row_pprod[0], row_prod_high *= row_pprod[1];
            row_prod_inv *= row_pprod[3], row_prod_inv_high *= row_pprod[2];
        }

        int64_t row_pprod[4] = {1, 1, 1, 1};

        while (row < size)
            row = UPDATE_ROW_PPROD();

        row_prod *= row_pprod[0], row_prod_high *= row_pprod[1];
        row_prod_inv *= row_pprod[3], row_prod_inv_high *= row_pprod[2];
        permanent += row_prod + row_prod_high + row_prod_inv + row_prod_inv_high;
    }

    permanent *= sign;

    if (permanent < 0)
        printf("-"), permanent *= -1;

    int32_t output[5], print = 0;

    output[0] = permanent % BILLION, permanent /= BILLION;
    output[1] = permanent % BILLION, permanent /= BILLION;
    output[2] = permanent % BILLION, permanent /= BILLION;
    output[3] = permanent % BILLION, permanent /= BILLION;
    output[4] = permanent % BILLION;

    if (output[4])
        printf("%u", output[4]), print = 1;
    if (print)
        printf("%09u", output[3]);
    else if (output[3])
        printf("%u", output[3]), print = 1;
    if (print)
        printf("%09u", output[2]);
    else if (output[2])
        printf("%u", output[2]), print = 1;
    if (print)
        printf("%09u", output[1]);
    else if (output[1])
        printf("%u", output[1]), print = 1;
    if (print)
        printf("%09u\n", output[0]);
    else
        printf("%u\n", output[0]);
}

Input is currently a bit cumbersome; it is taken with rows as command-line arguments, where each entry is represented by its sign, i.e., + indicates a 1 and - indicates a -1.

Test run

$ gcc -Wall -std=c99 -march=native -Ofast -fopenmp -fwrapv -o permanent permanent.c
$ ./permanent +--+ ---+ -+-+ +--+
-4
$ ./permanent ---- -+-- +--- +-+-
0
$ ./permanent +-+----- --++-++- +----+++ ---+-+++ +--++++- -+-+-++- +-+-+-+- --+-++++
192
$ ./permanent +-+--+++----++++-++- +-+++++-+--+++--+++- --+++----+-+++---+-- ---++-++++++------+- -+++-+++---+-+-+++++ +-++--+-++++-++-+--- +--+---+-++++---+++- +--+-++-+++-+-+++-++ +-----+++-----++-++- --+-+-++-+-++++++-++ -------+----++++---- ++---++--+-++-++++++ -++-----++++-----+-+ ++---+-+----+-++-+-+ +++++---+++-+-+++-++ +--+----+--++-+----- -+++-++--+++--++--++ ++--++-++-+++-++-+-+ +++---+--++---+----+ -+++-------++-++-+--
1021509632
$ time ./permanent +++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}     # 31
8222838654177922817725562880000000

real    0m8.365s
user    1m6.504s
sys     0m0.000s
$ time ./permanent ++++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}   # 32
263130836933693530167218012160000000

real    0m17.013s
user    2m15.226s
sys     0m0.001s
$ time ./permanent +++++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} # 33
8683317618811886495518194401280000000

real    0m34.592s
user    4m35.354s
sys     0m0.001s

Dennis

Posted 2016-10-22T08:30:36.173

Reputation: 196 637

Do you have any ideas for improvements? – xnor – 2016-10-24T01:57:49.220

@xnor A few. I want to try packed multiplication with SSE and partially unrolling the big loop (to see if I can speed up parallelization and to compute more than 4 values at once without calling popcnt). If that saves any time, the next big hurdle is the integer type. For randomly generated matrices, the permanent is comparatively small. If I can find an easy way to compute a bound before doing the actual calculation, I might wrap the whole thing in a big conditional. – Dennis – 2016-10-24T02:16:29.153

@Dennis About unrolling the loop, a small possible optimization is to make the top row be all +1's. – xnor – 2016-10-24T02:23:28.547

@xnor Yeah, I tried that at some point, but then reverted the change to try something else (which didn't work out at all). The bottleneck seems to be the integer multiplication (which is slow for 64 bits and really slow for 128), which is why I hope SSE will help a bit. – Dennis – 2016-10-24T02:37:37.177

@Dennis I see. About bounds, one non-obvious bound is in terms of the operator norm: |Per(M)| <= |M|^n. See https://arxiv.org/pdf/1606.07474v1.pdf – xnor – 2016-10-24T02:40:54.270

Let us continue this discussion in chat. – Dennis – 2016-10-24T03:17:21.617

You know, I really like this input format. Do you mind if I use it in my answer? – R. Kap – 2016-10-31T05:31:34.873

@R.Kap Not at all. Go ahead. – Dennis – 2016-10-31T05:32:45.150


Python 2, n ≈ 28

from operator import mul

def fast_glynn_perm(M):
    row_comb = [sum(c) for c in zip(*M)]
    n=len(M)

    total = 0
    old_grey = 0 
    sign = +1

    binary_power_dict = {2**i:i for i in range(n)}
    num_loops = 2**(n-1)

    for bin_index in xrange(1, num_loops + 1):  
        total += sign * reduce(mul, row_comb)

        new_grey = bin_index^(bin_index/2)
        grey_diff = old_grey ^ new_grey
        grey_diff_index = binary_power_dict[grey_diff]

        new_vector = M[grey_diff_index]
        direction = 2 * cmp(old_grey,new_grey)      

        for i in range(n):
            row_comb[i] += new_vector[i] * direction

        sign = -sign
        old_grey = new_grey

    return total/num_loops

Uses the Glynn formula with a Gray code for updates. Runs up to n=23 in a minute on my machine. One can surely do better by implementing this in a faster language and with better data structures. This doesn't use the fact that the matrix is ±1-valued.

A Ryser formula implementation is very similar, summing over all 0/1 vectors of coefficients rather than ±1 vectors. It takes about twice as long as Glynn's formula because it sums over all 2^n such vectors, whereas Glynn's formula halves that by symmetry, using only those starting with +1.

from operator import mul

def fast_ryser_perm(M):
    n=len(M)
    row_comb = [0] * n

    total = 0
    old_grey = 0 
    sign = +1

    binary_power_dict = {2**i:i for i in range(n)}
    num_loops = 2**n

    for bin_index in range(1, num_loops) + [0]: 
        total += sign * reduce(mul, row_comb)

        new_grey = bin_index^(bin_index/2)
        grey_diff = old_grey ^ new_grey
        grey_diff_index = binary_power_dict[grey_diff]

        new_vector = M[grey_diff_index]
        direction = cmp(old_grey, new_grey)

        for i in range(n):
            row_comb[i] += new_vector[i] * direction

        sign = -sign
        old_grey = new_grey

    return total * (-1)**n

xnor

Posted 2016-10-22T08:30:36.173

Reputation: 115 687

Awesome. Have you got pypy to test too? – None – 2016-10-22T10:30:01.333

@Lembik No, I don't have much installed. – xnor – 2016-10-22T10:30:22.477

I will use pypy when I test it too. Can you see how to implement the other fast formula? I find it confusing. – None – 2016-10-22T10:33:11.200

@Lembik What's the other fast formula? – xnor – 2016-10-22T10:33:49.710

The Balasubramanian-Bax/Franklin-Glynn formula. – None – 2016-10-22T10:39:11.193

As a reference, on my machine with pypy this was able to easily calculate n=28 in 44.6 seconds. Lembik's system seems to be fairly comparable to mine in speed, if not a bit faster. – Kade – 2016-10-22T14:15:33.207

What does p ^ p/2 mean: the xor of p and p/2? – RosLuP – 2016-10-23T06:22:35.893

@RosLuP Yes, that's the formula for the p-th element of the Gray code. – xnor – 2016-10-23T06:23:31.330

If p=(2^n) -1 => p^(p/2)=2^(n-1) – RosLuP – 2016-10-23T06:53:00.443

For pypy the top tip is to remove all the builtin functions, for example zip and sum, and replace them with pure Python code. – None – 2016-10-25T12:01:12.600

5

Haskell, n=31 (54s)

With a lot of invaluable contributions by @Angs: use Vector, use short circuit products, look at odd n.

import Control.Parallel.Strategies
import qualified Data.Vector.Unboxed as V
import Data.Int

type Row = V.Vector Int8

x :: Row -> [Row] -> Integer -> Int -> Integer
x p (v:vs) m c = let c' = c - 1
                     r = if c>0 then parTuple2 rseq rseq else r0
                     (a,b) = ( x p                  vs m    c' ,
                               x (V.zipWith(-) p v) vs (-m) c' )
                             `using` r
                 in a+b
x p _      m _ = prod m p

prod :: Integer -> Row -> Integer
prod m p = if 0 `V.elem` p then 0 
                           else V.foldl' (\a b->a*fromIntegral b) m p

p, pt :: [Row] -> Integer
p (v:vs) = x (foldl (V.zipWith (+)) v vs) (map (V.map (2*)) vs) 1 11
           `div` 2^(length vs)
p [] = 1 -- handle 0x0 matrices too  :-)

pt (v:vs) | even (length vs) = p ((V.map (2*) v) : vs ) `div` 2
pt mat                       = p mat

main = getContents >>= print . pt . map V.fromList . read

My first attempts at parallelism in Haskell. You can see a lot of optimization steps through the revision history. Amazingly, they were mostly very small changes. The code is based on the formula in the section "Balasubramanian-Bax/Franklin-Glynn formula" in the Wikipedia article on computing the permanent.

p computes the permanent. It is called via pt which transforms the matrix in a way that is always valid, but especially useful for the matrices that we get here.

Compile with ghc -O2 -threaded -fllvm -feager-blackholing -o <name> <name>.hs. To run with parallelisation, give it runtime parameters like this: ./<name> +RTS -N. Input is from stdin with nested comma separated lists in brackets like [[1,2],[3,4]] as in the last example (newlines allowed everywhere).

Christian Sievers

Posted 2016-10-22T08:30:36.173

Reputation: 6 366

Thank you. Some comments. line 3 is repeated in the source. Also it doesn't seem to actually run in parallel. At least "top" doesn't show more than 1 core working. – None – 2016-10-23T19:07:13.220

Oops, that's the result of partially changing the code. Did you use the specified runtime parameters? – Christian Sievers – 2016-10-23T19:20:08.023

My mistake! Makes it to 26 now. – None – 2016-10-23T19:26:15.493

Oh, I hoped for more with 8 cores... – Christian Sievers – 2016-10-23T19:28:45.543

This is my testing code https://bpaste.net/show/e74c87dfd19d in case you want to try it.

– None – 2016-10-23T19:30:40.303

I was able to get a speed improvement of 20-25% by plugging in Data.Vector. The changes excluding changed function types: import qualified Data.Vector as V, x (V.zipWith(-) p v) vs (-m) c' ), p (v:vs) = x (foldl (V.zipWith (+)) v vs) (map (V.map (2*)) vs) 1 11, main = getContents >>= print . p . map V.fromList . read – Angs – 2016-10-25T15:19:15.620

@Angs Thanks a lot! I didn't really feel like looking into better suited datatypes. It's amazing how little things have to change (also had to use V.product). That only gave me ~10%. Changed the code so that the vectors only contain Ints. That's okay because they are only added, the big numbers come from multiplication. Then it was ~20%. I had tried the same change with the old code, but at that time it slowed it down. I tried again because it allows to use unboxed vectors, which helped a lot! – Christian Sievers – 2016-10-26T00:14:37.930

@christian-sievers Glad I could be of help. Here's another fun luck-based optimization I found: x p _ m _ = m * (sum $ V.foldM' (\a b -> if b==0 then Nothing else Just $ a*fromIntegral b) 1 p) - product as a monadic fold where 0 is a special case. Seems to be beneficial more often than not. – Angs – 2016-10-26T11:35:56.947

@Angs Great! I changed that into a form that doesn't need Traversable (I see your not changing product earlier was no mistake...) for ghc from e.g. Debian stable. It's using the form of the input, but that seems fine: we're not relying on it, only optimizing for it. Makes timing much more exciting: my random 30x30 matrix is slightly faster than 29x29, but then 31x31 takes 4x the time. - That INLINE doesn't seem to work for me. AFAIK it's ignored for recursive functions. – Christian Sievers – 2016-10-26T14:31:18.430

@christian-sievers Yeah, I was about to say something about that product but forgot. It seems only even lengths have zeros in p, so for odd length we should use the regular product instead of the short-circuiting to get the best of both worlds. – Angs – 2016-10-26T14:53:42.800

@Angs But that's only true for this special kind of input. And the overhead isn't that bad. – Christian Sievers – 2016-10-26T15:26:49.557

Much faster indeed! – None – 2016-10-26T18:46:53.597

@Angs Thanks for your observation that there are no zeros in the odd n case. It's obvious in hindsight, but I may not have noticed it. (It also explains that my results with 29x29, 30x30 and 31x31 matrices were nothing special.) Now that you made me think about it, I found a way to force zeros into the odd case. – Christian Sievers – 2016-10-27T00:10:32.600

Faster again! Out of interest, how hard would it be for your code to support floating point inputs? – None – 2016-10-27T07:02:09.207

@lembik I think I can answer that for you: just replace all Int8 and Integer with Double, remove the fromIntegral and change _`div`_ to /. Was n=30 close? :) – Angs – 2016-10-27T08:00:54.433

@Angs Thank you. There is a great deal of non-determinism in the run times so I ran it a few more times and eventually got to 57 seconds for n = 30 :) – None – 2016-10-27T08:38:48.020

@lembik that's weird. I was wondering because on my system, n+1 is always faster than n when n is odd. You can also change m * prod p to m * V.product p because with floating point inputs the optimization won't work. – Angs – 2016-10-27T08:59:27.233

@Lembik Yes, all the things @Angs said, and you can undo the optimization for the odd case, which costs almost nothing, but is completely useless after undoing the also useless product optimization: in main replace pt with p, and remove the definition of pt. (And of prod. And the last two import lines.) @Angs, do you have that optimization? With it I'm back to smoothly increasing running times. – Christian Sievers – 2016-10-27T10:44:20.797

@ChristianSievers I thought I had but I had rolled my own main and forgot to change p to pt. – Angs – 2016-10-27T14:10:13.620

By the way, is the 11 experimental or how did you end up with it? – Angs – 2016-10-27T15:37:49.493

@Angs Yes, completely experimental and totally not understood by me. The number is higher than expected, and statistics (with run time option -s) still sometimes show GC'ed sparks (why does that happen at all?), but lowering the number such that most sparks are converted slightly increases execution time. – Christian Sievers – 2016-10-27T16:05:55.047

@Lembik For the new version, please note the added compiler options. – Christian Sievers – 2016-10-29T13:03:19.653

4

Rust + extprim

This straightforward Ryser-with-Gray-code implementation takes about 90 seconds (up from 65 before an overflow fix) to run n=31 on my laptop. I imagine your machine will get there in well under 60s. I'm using extprim 1.1.1 for i128.

I have never used Rust and have no idea what I'm doing. No compiler options other than whatever cargo build --release does. Comments/suggestions/optimizations are appreciated.

Invocation is identical to Dennis' program.

use std::env;
use std::thread;
use std::sync::Arc;
use std::sync::mpsc;

extern crate extprim;
use extprim::i128::i128;

static THREADS : i64 = 8; // keep this a power of 2.

fn main() {
  // Read command line args for the matrix, specified like
  // "++- --- -+-" for [[1, 1, -1], [-1, -1, -1], [-1, 1, -1]].
  let mut args = env::args();
  args.next();

  let mat : Arc<Vec<Vec<i64>>> = Arc::new(args.map( |ss|
    ss.trim().bytes().map( |cc| if cc == b'+' {1} else {-1}).collect()
  ).collect());

  // Figure how many iterations each thread has to do.
  let size = 2i64.pow(mat.len() as u32);
  let slice_size = size / THREADS; // Assumes divisibility.

  let mut accumulator : i128;
  if slice_size >= 4 { // permanent() requires 4 divides slice_size
    let (tx, rx) = mpsc::channel();

    // Launch threads.
    for ii in 0..THREADS {
      let mat = mat.clone();
      let tx = tx.clone();
      thread::spawn(move ||
        tx.send(permanent(&mat, ii * slice_size, (ii+1) * slice_size))
      );
    }

    // Accumulate results.
    accumulator = extprim::i128::ZERO;
    for _ in 0..THREADS {
      accumulator += rx.recv().unwrap();
    }
  }
  else { // Small matrix, don't bother threading.
    accumulator = permanent(&mat, 0, size);
  }
  println!("{}", accumulator);
}

fn permanent(mat: &Vec<Vec<i64>>, start: i64, end: i64) -> i128 {
  let size = mat.len();
  let sentinel = std::i64::MAX / size as i64;

  let mut bits : Vec<bool> = Vec::with_capacity(size);
  let mut sums : Vec<i64> = Vec::with_capacity(size);

  // Initialize gray code bits.
  let gray_number = start ^ (start / 2);

  for row in 0..size {
    bits.push((gray_number >> row) % 2 == 1);
    sums.push(0);
  }

  // Initialize column sums
  for row in 0..size {
    if bits[row] {
      for column in 0..size {
        sums[column] += mat[row][column];
      }
    }
  }

  // Do first two iterations with initial sums
  let mut total = product(&sums, sentinel);
  for column in 0..size {
    sums[column] += mat[0][column];
  }
  bits[0] = true;

  total -= product(&sums, sentinel);

  // Do rest of iterations updating gray code bits incrementally
  let mut gray_bit : usize;
  let mut idx = start + 2;
  while idx < end {
    gray_bit = idx.trailing_zeros() as usize;

    if bits[gray_bit] {
      for column in 0..size {
        sums[column] -= mat[gray_bit][column];
      }
      bits[gray_bit] = false;
    }
    else {
      for column in 0..size {
        sums[column] += mat[gray_bit][column];
      }
      bits[gray_bit] = true;
    }

    total += product(&sums, sentinel);

    if bits[0] {
      for column in 0..size {
        sums[column] -= mat[0][column];
      }
      bits[0] = false;
    }
    else {
      for column in 0..size {
        sums[column] += mat[0][column];
      }
      bits[0] = true;
    }

    total -= product(&sums, sentinel);
    idx += 2;
  }
  return if size % 2 == 0 {total} else {-total};
}

#[inline]
fn product(sums : &Vec<i64>, sentinel : i64) -> i128 {
  let mut ret : Option<i128> = None;
  let mut tally = sums[0];
  for ii in 1..sums.len() {
    if tally.abs() >= sentinel {
      ret = Some(ret.map_or(i128::new(tally), |n| n * i128::new(tally)));
      tally = sums[ii];
    }
    else {
      tally *= sums[ii];
    }
  }
  if ret.is_none() {
    return i128::new(tally);
  }
  return ret.unwrap() * i128::new(tally);
}

ezrast

Posted 2016-10-22T08:30:36.173

Reputation: 491

Could you give copy and pasteable command lines for installing extprim and compiling the code please. – None – 2016-10-27T11:30:01.153

The output looks like "i128!(-2)" where -2 is the correct answer. Is this expected and could you change it to just to output the permanent please? – None – 2016-10-27T11:58:51.457

@Lembik: Output should be fixed now. Looks like you figured out compilation, but I threw it in Git so you can do git clone https://gitlab.com/ezrast/permanent.git; cd permanent; cargo build --release if you want to be sure of having the same setup as me. Cargo will handle dependencies. Binary goes in target/release. – ezrast – 2016-10-27T21:28:29.337

Unfortunately this gives the wrong answer for n = 29. https://bpaste.net/show/99d6e826d968

– None – 2016-10-31T13:08:02.060

@Lembik gah, sorry, intermediate values were overflowing earlier than I thought. It's fixed, though the program is a lot slower now. – ezrast – 2016-11-02T01:29:50.740

3

Python 2, n = 22 [Reference]

This is the 'reference' implementation I shared with Lembik yesterday; it misses making it to n=23 by a few seconds on his machine, while on my machine it does it in about 52 seconds. To achieve these speeds you need to run it through PyPy.

The first function calculates the permanent similarly to how the determinant could be calculated: it recurses over submatrices until you are left with a 2x2, to which you can apply the basic rule. It is incredibly slow.

The second function implements Ryser's formula (the second equation listed on Wikipedia). The set S is essentially the powerset of {1,...,n} (variable s_list in the code).

from random import *
from time import time
from itertools import*

def perm(a): # naive method, recurses over submatrices, slow 
    if len(a) == 1:
        return a[0][0]
    elif len(a) == 2:
        return a[0][0]*a[1][1]+a[1][0]*a[0][1]
    else:
        tsum = 0
        for i in xrange(len(a)):
            transposed = [zip(*a)[j] for j in xrange(len(a)) if j != i]
            tsum += a[0][i] * perm(zip(*transposed)[1:])
        return tsum

def perm_ryser(a): # Ryser's formula, using matrix entries
    maxn = len(a)
    n_list = range(1,maxn+1)
    s_list = chain.from_iterable(combinations(n_list,i) for i in range(maxn+1))
    total = 0
    for st in s_list:
        stotal = (-1)**len(st)
        for i in xrange(maxn):
            stotal *= sum(a[i][j-1] for j in st)
        total += stotal
    return total*((-1)**maxn)


def genmatrix(d):
    mat = []
    for x in xrange(d):
        row = []
        for y in xrange(d):
            row.append([-1,1][randrange(0,2)])
        mat.append(row)
    return mat

def main():
    for i in xrange(1,24):
        k = genmatrix(i)
        print 'Matrix: (%dx%d)'%(i,i)
        print '\n'.join('['+', '.join(`j`.rjust(2) for j in a)+']' for a in k)
        print 'Permanent:',
        t = time()
        p = perm_ryser(k)
        print p,'(took',time()-t,'seconds)'

if __name__ == '__main__':
    main()

Kade

Posted 2016-10-22T08:30:36.173

Reputation: 7 463

I think you should rephrase the description "similar to how the determinant would be calculated". It's not like the method for determinants is slow for permanents, but one slow method for determinants works similarly (and as slowly) for permanents. – Christian Sievers – 2016-10-24T11:45:52.640

1@ChristianSievers Good point, I've altered it. – Kade – 2016-10-24T12:28:25.667

3

Mathematica, n ≈ 20

p[m_] := Last[Fold[Take[ListConvolve[##, {1, -1}, 0], 2^Length[m]]&,
  Table[If[IntegerQ[Log2[k]], m[[j, Log2[k] + 1]], 0], {j, Length[m]}, {k, 0, 2^Length[m] - 1}]]]

Using the Timing command, a 20x20 matrix requires about 48 seconds on my system. This is not as efficient as the other answers since it relies on the fact that the permanent can be found as a coefficient in the product of polynomials formed from each row of the matrix. Efficient polynomial multiplication is performed by creating the coefficient lists and convolving them with ListConvolve. This requires about O(2^n n^2) time, assuming convolution is performed using a fast Fourier transform or similar, which requires O(n log n) time.
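The trick behind the Mathematica one-liner can be sketched in plain Python: encode column k as the monomial x^(2^k), so in the product over rows of Σ_k a[j][k]·x^(2^k) the coefficient of x^(2^n − 1) collects exactly the terms that use every column once, i.e. the permanent. (This sketch uses naive O(2^n·n^2) convolution instead of ListConvolve's FFT; the function name is mine, not from the answer.)

```python
def perm_by_polynomials(m):
    # Column k is encoded as the monomial x^(2^k). Multiplying the n row
    # polynomials and reading off the coefficient of x^(2^n - 1) forces
    # each column to be used exactly once, which yields the permanent.
    n = len(m)
    size = 1 << n
    poly = [0] * size     # coefficient list, index = exponent
    poly[0] = 1
    for row in m:
        nxt = [0] * size
        for p, c in enumerate(poly):
            if c:
                for k in range(n):
                    q = p + (1 << k)
                    if q < size:          # exponents past 2^n - 1 can never
                        nxt[q] += c * row[k]  # come back down, so drop them
        poly = nxt
    return poly[size - 1]
```

Uniqueness of binary representations guarantees that n powers of two summing to 2^n − 1 must all be distinct, which is why the single coefficient suffices.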

miles

Posted 2016-10-22T08:30:36.173

Reputation: 15 654

2

RPython 5.4.1, n ≈ 32 (37 seconds)

from rpython.rlib.rtime import time
from rpython.rlib.rarithmetic import r_int, r_uint
from rpython.rlib.rrandom import Random
from rpython.rlib.rposix import pipe, close, read, write, fork, waitpid
from rpython.rlib.rbigint import rbigint

from math import log, ceil
from struct import pack

bitsize = len(pack('l', 1)) * 8 - 1

bitcounts = bytearray([0])
for i in range(16):
  b = bytearray([j+1 for j in bitcounts])
  bitcounts += b


def bitcount(n):
  bits = 0
  while n:
    bits += bitcounts[n & 65535]
    n >>= 16
  return bits


def main(argv):
  if len(argv) < 2:
    write(2, 'Usage: %s NUM_THREADS [N]'%argv[0])
    return 1
  threads = int(argv[1])

  if len(argv) > 2:
    n = int(argv[2])
    rnd = Random(r_uint(time()*1000))
    m = []
    for i in range(n):
      row = []
      for j in range(n):
        row.append(1 - r_int(rnd.genrand32() & 2))
      m.append(row)
  else:
    m = []
    strm = ""
    while True:
      buf = read(0, 4096)
      if len(buf) == 0:
        break
      strm += buf
    rows = strm.split("\n")
    for row in rows:
      r = []
      for val in row.split(' '):
        r.append(int(val))
      m.append(r)
    n = len(m)

  a = []
  for row in m:
    val = 0
    for v in row:
      val = (val << 1) | -(v >> 1)
    a.append(val)

  batches = int(ceil(n * log(n) / (bitsize * log(2))))

  pids = []
  handles = []
  total = rbigint.fromint(0)
  for i in range(threads):
    r, w = pipe()
    pid = fork()
    if pid:
      close(w)
      pids.append(pid)
      handles.append(r)
    else:
      close(r)
      total = run(n, a, i, threads, batches)
      write(w, total.str())
      close(w)
      return 0

  for pid in pids:
    waitpid(pid, 0)

  for handle in handles:
    strval = read(handle, 256)
    total = total.add(rbigint.fromdecimalstr(strval))
    close(handle)

  print total.rshift(n-1).str()

  return 0


def run(n, a, mynum, threads, batches):
  start = (1 << n-1) * mynum / threads
  end = (1 << n-1) * (mynum+1) / threads

  dtotal = rbigint.fromint(0)
  for delta in range(start, end):
    pdelta = rbigint.fromint(1 - ((bitcount(delta) & 1) << 1))
    for i in range(batches):
      pbatch = 1
      for j in range(i, n, batches):
        pbatch *= n - (bitcount(delta ^ a[j]) << 1)
      pdelta = pdelta.int_mul(pbatch)
    dtotal = dtotal.add(pdelta)

  return dtotal


def target(*args):
  return main

To compile, download the most recent PyPy source, and execute the following:

pypy /path/to/pypy-src/rpython/bin/rpython matrix-permanent.py

The resulting executable will be named matrix-permanent-c or similar in the current working directory.

As of PyPy 5.0, RPython's threading primitives are a lot less primitive than they used to be. Newly spawned threads require the GIL, which is more or less useless for parallel computations. I've used fork instead, so it will not work on Windows; it fails to compile there (unresolved external symbol _fork).

The executable accepts up to two command line parameters. The first is the number of threads, the second optional parameter is n. If it is provided, a random matrix will be generated, otherwise it will be read from stdin. Each row must be newline separated (without a trailing newline), and each value space separated. The third example input would be given as:

1 -1 1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 1 -1
1 -1 1 1 1 1 1 -1 1 -1 -1 1 1 1 -1 -1 1 1 1 -1
-1 -1 1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 1 -1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 -1
-1 1 1 1 -1 1 1 1 -1 -1 -1 1 -1 1 -1 1 1 1 1 1
1 -1 1 1 -1 -1 1 -1 1 1 1 1 -1 1 1 -1 1 -1 -1 -1
1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 -1 -1 1 1 1 -1
1 -1 -1 1 -1 1 1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1
1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1
-1 -1 1 -1 1 -1 1 1 -1 1 -1 1 1 1 1 1 1 -1 1 1
-1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1
1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 1
-1 1 1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1
1 1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 1 -1 1 1 -1 1 -1 1
1 1 1 1 1 -1 -1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1
1 -1 -1 1 -1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1
-1 1 1 1 -1 1 1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1
1 1 -1 -1 1 1 -1 1 1 -1 1 1 1 -1 1 1 -1 1 -1 1
1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 -1 1
-1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1 1 -1 -1

Sample Usage

$ time ./matrix-permanent-c 8 30
8395059644858368

real    0m8.582s
user    1m8.656s
sys     0m0.000s

Method

I've used the Balasubramanian-Bax/Franklin-Glynn formula, with a runtime complexity of O(2^n n). However, instead of iterating over the δ in Gray code order, I've replaced the vector-row multiplication with a single xor operation (mapping (1, -1) → (0, 1)). The vector sum can likewise be found in a single operation, by taking n minus twice the popcount.
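The xor/popcount trick is compact enough to illustrate in plain Python (a slow, single-threaded sketch of the same idea, not the RPython program above; names are mine):

```python
def perm_xor_popcount(m):
    # Pack each row as a bitmask: matrix entry 1 -> bit 0, entry -1 -> bit 1.
    n = len(m)
    a = []
    for row in m:
        v = 0
        for x in row:
            v = (v << 1) | (1 if x == -1 else 0)
        a.append(v)
    total = 0
    # Glynn symmetry: the top delta bit stays fixed at +1,
    # so only 2^(n-1) delta vectors are needed.
    for delta in range(1 << (n - 1)):
        sign = -1 if bin(delta).count('1') & 1 else 1
        p = sign
        for v in a:
            # <delta, row> as +/-1 vectors equals n - 2*popcount(delta ^ row)
            p *= n - 2 * bin(delta ^ v).count('1')
        total += p
    return total >> (n - 1)   # exact: the sum is divisible by 2^(n-1)
```

The arithmetic right shift at the end is exact division because the Glynn sum is always a multiple of 2^(n-1), even when negative.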

primo

Posted 2016-10-22T08:30:36.173

Reputation: 30 891

Unfortunately the code gives the wrong answer for https://bpaste.net/show/8690251167e7

– None – 2016-10-31T12:57:21.390

@Lembik updated. Out of curiosity, could you tell me the result of the following code? https://bpaste.net/show/76ec65e1b533

– primo – 2016-10-31T13:51:48.693

It gives "True 18446744073709551615". I added the results for your very nice code now as well. – None – 2016-10-31T19:55:17.167

@Lembik thanks. I had already split the multiplication as to not overflow 63-bits. Was the result listed taken with 8 threads? Does 2 or 4 make a difference? If 30 finishes in 25, it seems like 31 should be under a minute. – primo – 2016-10-31T20:06:30.347

-1

Racket 84 bytes

The following simple function works for smaller matrices but hangs on my machine for larger ones:

(for/sum((p(permutations(range(length l)))))(for/product((k l)(c p))(list-ref k c)))

Ungolfed:

(define (f ll) 
  (for/sum ((p (permutations (range (length ll))))) 
    (for/product ((l ll)(c p)) 
      (list-ref l c))))

The code can easily be modified to handle an unequal number of rows and columns.

Testing:

(f '[[ 1 -1 -1  1]
     [-1 -1 -1  1]
     [-1  1 -1  1]
     [ 1 -1 -1  1]])

(f '[[ 1 -1  1 -1 -1 -1 -1 -1]
 [-1 -1  1  1 -1  1  1 -1]
 [ 1 -1 -1 -1 -1  1  1  1]
 [-1 -1 -1  1 -1  1  1  1]
 [ 1 -1 -1  1  1  1  1 -1]
 [-1  1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [-1 -1  1 -1  1  1  1  1]])

Output:

-4
192

As I mentioned above, it hangs on testing following:

(f '[[1 -1 1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 1 -1]
 [1 -1 1 1 1 1 1 -1 1 -1 -1 1 1 1 -1 -1 1 1 1 -1]
 [-1 -1 1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1 -1 -1 1 -1 -1]
 [-1 -1 -1 1 1 -1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 -1]
 [-1 1 1 1 -1 1 1 1 -1 -1 -1 1 -1 1 -1 1 1 1 1 1]
 [1 -1 1 1 -1 -1 1 -1 1 1 1 1 -1 1 1 -1 1 -1 -1 -1]
 [1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 -1 -1 1 1 1 -1]
 [1 -1 -1 1 -1 1 1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1]
 [1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1]
 [-1 -1 1 -1 1 -1 1 1 -1 1 -1 1 1 1 1 1 1 -1 1 1]
 [-1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1]
 [1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 1]
 [-1 1 1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1]
 [1 1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 1 -1 1 1 -1 1 -1 1]
 [1 1 1 1 1 -1 -1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1]
 [1 -1 -1 1 -1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1]
 [-1 1 1 1 -1 1 1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1]
 [1 1 -1 -1 1 1 -1 1 1 -1 1 1 1 -1 1 1 -1 1 -1 1]
 [1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 -1 1]
 [-1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1 1 -1 -1]])

rnso

Posted 2016-10-22T08:30:36.173

Reputation: 1 635

Is this answer better suited to the code-golf version rather than the speed version of this question? – None – 2016-10-25T06:05:41.820