C++ with pthreads, n = 147 156
The latest result is from running the same code as before on a beefier machine. This was now run on a desktop with a quad-core i7 (Core i7-4770), which got to n=156 in 120 seconds. The result is:
7858103688882482349696225090648142317093426691269441606893544257091315906431773702676266198643058148987365151560565922891852481847049321541347582728793175114543840164406674137410614843200
This is using a dynamic programming algorithm. I initially pondered approaches where the result would be built row by row, but I could never come up with a way to expand the solution without tracking a ton of state.
The key insights that enabled a reasonably efficient solution were:
- Since pawns on black squares can only attack pawns on other black squares, and the same is true for white squares, the black and white squares are independent, and can be processed separately. And since they are equivalent, we only need to process one of the two.
- The problem gets much easier when processing the board diagonal by diagonal.
If you look at one diagonal of a valid configuration, it always consists of a sequence of black pawns followed by a sequence of white pawns (where either sequence can also be empty). In other words, each diagonal can be fully characterized by its number of black pawns.
Therefore, the state tracked for each diagonal is the number of valid pawn configurations for each combination of:
- Number of black pawns in the row (or in other words, the position within the diagonal that separates the black pawns from the white pawns).
- Total count of black pawns used. We need to track the whole thing per pawn count because we only need the equal number of black pawns and white pawns at the very end. While processing the diagonals, the counts can be different, and still result in valid solutions in the end.
When stepping from one diagonal to the next one, there is another constraint to build valid solutions: The position that separates black pawns from white pawns can not increase. So the number of valid configurations is calculated as the sum of the valid configurations of the previous diagonal for positions that are equal or larger.
The basic DP step is then very simple. Each value in a diagonal is just a sum of values from the previous diagonal. The only somewhat painful part is calculating the indices and loop ranges correctly. Since we're working on diagonals, the length increases during the first half of the calculation, and decreases for the second half, which makes the calculation of the loop ranges more cumbersome. There's also some considerations for the values at the boundary of the board, since they only have diagonal neighbors on one side when stepping from diagonal to diagonal.
The amount of memory used is O(n^3). I keep two copies of the state data, and ping pong between them. I believe it would be possible to operate with a single instance of the state data. But you would have to be very careful that no values are updated before the old values are fully consumed. Also, it would not work well for the parallel processing I introduced.
Runtime complexity is... polynomial. There are 4 nested loops in the algorithm, so at first sight it would look like O(n^4). But you obviously need bigints at these sizes, and the numbers themselves also get longer at larger sizes. The number of digits in the result seems to increase roughly proportionally to n, which would make the whole thing O(n^5). On the other hand, I found some performance improvements, which avoids going through the full range of all loops.
So while this is still a fairly expensive algorithm, it gets much farther than the algorithms that enumerate solutions, which are all exponential.
Some notes on the implementation:
- While there can be up to 2*n^2 black pawns on the black squares, I only calculate the configuration numbers up to n^2 black pawns. Since there's a symmetry between black and white pawns, the configuration count for k and 2*n^2-k are the same.
- The number of solutions is calculated at the end from the configuration counts on the black squares based on a similar symmetry. The total number of solutions (which need to have 2*n^2 pawns of each color) is the number of configurations for k black pawns on one color of squares multiplied by the number of configurations for 2*n^2-k black pawns on the other color of squares, summed over all k.
- In addition to just storing the configuration counts per diagonal position and pawn count, I also store the range of pawn counts that have valid configurations per position. This allows cutting down the range of the inner loop substantially. Without this, I found that a lot of zeros were being added. I got a very substantial performance improvement from this.
- The algorithm parallelizes fairly well, particularly at large sizes. The diagonals have to be processes sequentially, so there is a barrier at the end of each diagonal. But the positions within the diagonal can be processed in parallel.
- Profiling shows that the bottleneck is clearly in adding bigint values. I played around with some variations of the code, but it's not heavily optimized. I suspect that there could be a significant improvement from inline assembly to use 64-bit additions with carry.
Main algorithm code. THREADS
controls the number of threads used, where the number of CPU cores should be a reasonable starting point:
#ifndef THREADS
#define THREADS 2
#endif
#if THREADS > 1
#include <pthread.h>
#endif
#include <vector>
#include <iostream>
#include <sstream>
#include "BigUint.h"
typedef std::vector<BigUint> BigUintVec;
typedef std::vector<int> IntVec;
static int N;
static int NPawn;
static int NPos;
static BigUintVec PawnC[2];
static IntVec PawnMinC[2];
static IntVec PawnMaxC[2];
#if THREADS > 1
static pthread_mutex_t ThreadMutex;
static pthread_cond_t ThreadCond;
static int BarrierCount;
#endif
#if THREADS > 1
static void ThreadBarrier()
{
pthread_mutex_lock(&ThreadMutex);
--BarrierCount;
if (BarrierCount)
{
pthread_cond_wait(&ThreadCond, &ThreadMutex);
}
else
{
pthread_cond_broadcast(&ThreadCond);
BarrierCount = THREADS;
}
pthread_mutex_unlock(&ThreadMutex);
}
#endif
static void* countThread(void* pData)
{
int* pThreadIdx = static_cast<int*>(pData);
int threadIdx = *pThreadIdx;
int prevDiagMin = N - 1;
int prevDiagMax = N;
for (int iDiag = 1; iDiag < 2 * N; ++iDiag)
{
BigUintVec& rSrcC = PawnC[1 - iDiag % 2];
BigUintVec& rDstC = PawnC[iDiag % 2];
IntVec& rSrcMinC = PawnMinC[1 - iDiag % 2];
IntVec& rDstMinC = PawnMinC[iDiag % 2];
IntVec& rSrcMaxC = PawnMaxC[1 - iDiag % 2];
IntVec& rDstMaxC = PawnMaxC[iDiag % 2];
int diagMin = prevDiagMin;
int diagMax = prevDiagMax;;
if (iDiag < N)
{
--diagMin;
++diagMax;
}
else if (iDiag > N)
{
++diagMin;
--diagMax;
}
int iLastPos = diagMax;
if (prevDiagMax < diagMax)
{
iLastPos = prevDiagMax;
}
for (int iPos = diagMin + threadIdx; iPos <= iLastPos; iPos += THREADS)
{
int nAdd = iPos - diagMin;
for (int iPawn = nAdd; iPawn < NPawn; ++iPawn)
{
rDstC[iPos * NPawn + iPawn] = 0;
}
rDstMinC[iPos] = NPawn;
rDstMaxC[iPos] = -1;
int iFirstPrevPos = iPos;
if (!nAdd)
{
iFirstPrevPos = prevDiagMin;
}
for (int iPrevPos = iFirstPrevPos;
iPrevPos <= prevDiagMax; ++iPrevPos)
{
int iLastPawn = rSrcMaxC[iPrevPos];
if (iLastPawn + nAdd >= NPawn)
{
iLastPawn = NPawn - 1 - nAdd;
}
if (rSrcMinC[iPrevPos] > iLastPawn)
{
continue;
}
if (rSrcMinC[iPrevPos] < rDstMinC[iPos])
{
rDstMinC[iPos] = rSrcMinC[iPrevPos];
}
if (iLastPawn > rDstMaxC[iPos])
{
rDstMaxC[iPos] = iLastPawn;
}
for (int iPawn = rSrcMinC[iPrevPos];
iPawn <= iLastPawn; ++iPawn)
{
rDstC[iPos * NPawn + iPawn + nAdd] += rSrcC[iPrevPos * NPawn + iPawn];
}
}
if (rDstMinC[iPos] <= rDstMaxC[iPos])
{
rDstMinC[iPos] += nAdd;
rDstMaxC[iPos] += nAdd;
}
}
if (threadIdx == THREADS - 1 && diagMax > prevDiagMax)
{
int pawnFull = (iDiag + 1) * (iDiag + 1);
rDstC[diagMax * NPawn + pawnFull] = 1;
rDstMinC[diagMax] = pawnFull;
rDstMaxC[diagMax] = pawnFull;
}
prevDiagMin = diagMin;
prevDiagMax = diagMax;
#if THREADS > 1
ThreadBarrier();
#endif
}
return 0;
}
static void countPawns(BigUint& rRes)
{
NPawn = N * N + 1;
NPos = 2 * N;
PawnC[0].resize(NPos * NPawn);
PawnC[1].resize(NPos * NPawn);
PawnMinC[0].assign(NPos, NPawn);
PawnMinC[1].assign(NPos, NPawn);
PawnMaxC[0].assign(NPos, -1);
PawnMaxC[1].assign(NPos, -1);
PawnC[0][(N - 1) * NPawn + 0] = 1;
PawnMinC[0][N - 1] = 0;
PawnMaxC[0][N - 1] = 0;
PawnC[0][N * NPawn + 1] = 1;
PawnMinC[0][N] = 1;
PawnMaxC[0][N] = 1;
#if THREADS > 1
pthread_mutex_init(&ThreadMutex, 0);
pthread_cond_init(&ThreadCond, 0);
BarrierCount = THREADS;
int threadIdxA[THREADS] = {0};
pthread_t threadA[THREADS] = {0};
for (int iThread = 0; iThread < THREADS; ++iThread)
{
threadIdxA[iThread] = iThread;
pthread_create(threadA + iThread, 0, countThread, threadIdxA + iThread);
}
for (int iThread = 0; iThread < THREADS; ++iThread)
{
pthread_join(threadA[iThread], 0);
}
pthread_cond_destroy(&ThreadCond);
pthread_mutex_destroy(&ThreadMutex);
#else
int threadIdx = 0;
countThread(&threadIdx);
#endif
BigUint solCount;
BigUintVec& rResC = PawnC[1];
for (int iPawn = 0; iPawn < NPawn; ++iPawn)
{
BigUint nComb = rResC[(N - 1) * NPawn + iPawn];
nComb *= nComb;
if (iPawn < NPawn - 1)
{
nComb *= 2;
}
solCount += nComb;
}
std::string solStr;
solCount.toDecString(solStr);
std::cout << solStr << std::endl;
}
int main(int argc, char* argv[])
{
std::istringstream strm(argv[1]);
strm >> N;
BigUint res;
countPawns(res);
return 0;
}
This also needs a bigint class that I wrote for this purpose. Note that this is not a general purpose bigint class. It does just enough to support the operations used by this specific algorithm:
#ifndef BIG_UINT_H
#define BIG_UINT_H
#include <cstdint>
#include <string>
#include <vector>
class BigUint
{
public:
BigUint()
: m_size(1),
m_cap(MIN_CAP),
m_valA(m_fixedValA)
{
m_valA[0] = 0;
}
BigUint(uint32_t val)
: m_size(1),
m_cap(MIN_CAP),
m_valA(m_fixedValA)
{
m_valA[0] = val;
}
BigUint(const BigUint& rhs)
: m_size(rhs.m_size),
m_cap(MIN_CAP),
m_valA(m_fixedValA)
{
if (m_size > MIN_CAP)
{
m_cap = m_size;
m_valA = new uint32_t[m_cap];
}
for (int iVal = 0; iVal < m_size; ++iVal)
{
m_valA[iVal] = rhs.m_valA[iVal];
}
}
~BigUint()
{
if (m_cap > MIN_CAP)
{
delete[] m_valA;
}
}
BigUint& operator=(uint32_t val)
{
m_size = 1;
m_valA[0] = val;
return *this;
}
BigUint& operator=(const BigUint& rhs)
{
if (rhs.m_size > m_cap)
{
if (m_cap > MIN_CAP)
{
delete[] m_valA;
}
m_cap = rhs.m_size;
m_valA = new uint32_t[m_cap];
}
m_size = rhs.m_size;
for (int iVal = 0; iVal < m_size; ++iVal)
{
m_valA[iVal] = rhs.m_valA[iVal];
}
return *this;
}
BigUint& operator+=(const BigUint& rhs)
{
if (rhs.m_size > m_size)
{
resize(rhs.m_size);
}
uint64_t sum = 0;
for (int iVal = 0; iVal < m_size; ++iVal)
{
sum += m_valA[iVal];
if (iVal < rhs.m_size)
{
sum += rhs.m_valA[iVal];
}
m_valA[iVal] = sum;
sum >>= 32u;
}
if (sum)
{
resize(m_size + 1);
m_valA[m_size - 1] = sum;
}
return *this;
}
BigUint& operator*=(const BigUint& rhs)
{
int resSize = m_size + rhs.m_size - 1;
uint32_t* resValA = new uint32_t[resSize];
uint64_t sum = 0;
for (int iResVal = 0; iResVal < resSize; ++iResVal)
{
uint64_t carry = 0;
for (int iLhsVal = 0;
iLhsVal <= iResVal && iLhsVal < m_size; ++iLhsVal)
{
int iRhsVal = iResVal - iLhsVal;
if (iRhsVal < rhs.m_size)
{
uint64_t prod = m_valA[iLhsVal];
prod *= rhs.m_valA[iRhsVal];
uint64_t newSum = sum + prod;
if (newSum < sum)
{
++carry;
}
sum = newSum;
}
}
resValA[iResVal] = sum & UINT64_C(0xFFFFFFFF);
sum >>= 32u;
sum += carry << 32u;
}
if (resSize > m_cap)
{
if (m_cap > MIN_CAP)
{
delete[] m_valA;
}
m_cap = resSize;
m_valA = resValA;
}
else
{
for (int iVal = 0; iVal < resSize; ++iVal)
{
m_valA[iVal] = resValA[iVal];
}
delete[] resValA;
}
m_size = resSize;
if (sum)
{
resize(m_size + 1);
m_valA[m_size - 1] = sum;
}
return *this;
}
void divMod(uint32_t rhs, uint32_t& rMod)
{
uint64_t div = 0;
for (int iVal = m_size - 1; iVal >= 0; --iVal)
{
div <<= 32u;
div += m_valA[iVal];
uint64_t val = div / rhs;
div -= val * rhs;
if (val || iVal == 0 || iVal < m_size - 1)
{
m_valA[iVal] = val;
}
else
{
--m_size;
}
}
rMod = div;
}
void toDecString(std::string& rStr) const
{
std::vector<char> digits;
BigUint rem(*this);
while (rem.m_size > 1 || rem.m_valA[0])
{
uint32_t digit = 0;
rem.divMod(10, digit);
digits.push_back(digit);
}
if (digits.empty())
{
rStr = "0";
}
else
{
rStr.clear();
rStr.reserve(digits.size());
for (int iDigit = digits.size() - 1; iDigit >= 0; --iDigit)
{
rStr.append(1, '0' + digits[iDigit]);
}
}
}
private:
static const int MIN_CAP = 8;
void resize(int newSize)
{
if (newSize > m_cap)
{
uint32_t* newValA = new uint32_t[newSize];
for (int iVal = 0; iVal < m_size; ++iVal)
{
newValA[iVal] = m_valA[iVal];
}
if (m_cap > MIN_CAP)
{
delete[] m_valA;
}
m_cap = newSize;
m_valA = newValA;
}
for (int iVal = m_size; iVal < newSize; ++iVal)
{
m_valA[iVal] = 0;
}
m_size = newSize;
}
int m_size;
int m_cap;
uint32_t* m_valA;
uint32_t m_fixedValA[MIN_CAP];
};
#endif // BIG_UINT_H
2What is the output format? – Doorknob – 2015-06-25T09:08:51.183
any format. Provided that you explain it and it is not a puzzle in itself. @Doorknob – Agnishom Chattopadhyay – 2015-06-25T09:26:43.040
Answers should be allowed to only count them and not output them as there can be far too many to print. – feersum – 2015-06-25T09:59:53.203
2For test: has someone any idea of the numbers involved? I found 3 solution for 2x2 and 28 solutions for 4x4 – edc65 – 2015-06-25T10:42:37.230
@feersum Okay, I changed the problem statement – Agnishom Chattopadhyay – 2015-06-25T11:14:31.703
@edc65 the next number is a little more than 400, I think – Agnishom Chattopadhyay – 2015-06-25T11:15:45.290
1@edc65, I make it 3, 30, 410. I've checked the 3 and 30 by an alternative method. – Peter Taylor – 2015-06-25T13:28:22.387
1I had my code generate the first few: 3, 30, 410, 6148, 96120, 1526700. Although, I have no way of checking. Anyone get the same? – cmxu – 2015-06-25T14:51:05.340
too bad! I have 28 and 408. Wonder what am I missing – edc65 – 2015-06-25T14:58:17.383
1To clarify on operator precedence, when you say
2n^2
pawns, is that(2n)^2
or2(n^2)
? – Reto Koradi – 2015-06-25T15:08:53.2671@RetoKoradi I would assume he means 2(n^2) pawns, as a 2n by 2n board can only fit 2(n^2) of each color. – cmxu – 2015-06-25T15:10:15.927
@Changming Yes, that makes sense. I was only looking at the value in isolation. But in the context, that's the only real option. – Reto Koradi – 2015-06-25T15:14:07.573
1I think the standard is that 2n^2 is 2*(n^2) which is what I meant. edc65's values are correct – Agnishom Chattopadhyay – 2015-06-25T15:18:06.890
@PeterTaylor How did you confirm 3 and 30, because according to Agnishom 30 is incorrect – cmxu – 2015-06-25T15:37:49.010
3, 30 and 410 are correct. – Agnishom Chattopadhyay – 2015-06-25T16:03:19.150
@edc65, I would guess that you're missing the cases that all the white pawns are on the white squares, and that all the white pawns are on the black squares. – Peter Taylor – 2015-06-25T16:13:13.873
1
Here is an image: Link
– Agnishom Chattopadhyay – 2015-06-25T16:21:12.267@AgnishomChattopadhyay So clarification, because the question text is slightly contradictory. We don't have to output the solutions, but do we have to find them or just find the number of solutions? I.E feersum's solution finds the number of solutions without technically ever finding a solution, if I'm reading it correctly. – Cain – 2015-06-25T23:51:03.847
@Cain You just have to output the number of solutions – Agnishom Chattopadhyay – 2015-06-26T12:37:27.477
The image link in your comment above does not work anymore for me. Can you restore it? I didn't have a chance to look at it closely before it disappeared. – Reto Koradi – 2015-06-26T14:53:27.643