Supersonic domino tilings

10

Task

Write a program that reads three integers m, n either from STDIN or as command-line arguments, prints all possible tilings of a rectangle of dimensions m × n by 2 × 1 and 1 × 2 dominos and finally the number of valid tilings.

Dominos of an individual tiling have to be represented by two dashes (-) for 2 × 1 and two vertical bars (|) for 1 × 2 dominos. Each tiling (including the last one) has to be followed by a linefeed.

For scoring purposes, you also have to accept a flag from STDIN or as command line argument that makes your program print only the number of valid tilings, but not the tilings itself.

Your program may not be longer than 1024 bytes. It has to work for all inputs such that m × n ≤ 64.

(Inspired by Print all domino tilings of 4x6 rectangle.)

Example

$ sdt 4 2
----
----

||--
||--

|--|
|--|

--||
--||

||||
||||

5
$ sdt 4 2 scoring
5

Scoring

Your score is determined by the execution time of your program for the input 8 8 with the flag set.

To make this a fastest code rather than a fastest computer challenge, I will run all submissions on my own computer (Intel Core i7-3770, 16 GiB PC3-12800 RAM) to determine the official score.

Please leave detailed instructions on how to compile and/or execute your code. If you require a specific version of your language's compiler/interpreter, make a statement to that effect.

I reserve the right to leave submissions unscored if:

  • There is no free (as in beer) compiler/interpreter for my operating system (Fedora 21, 64 bits).

  • Despite our efforts, your code doesn't work and/or produces incorrect output on my computer.

  • Compilation or execution take longer than an hour.

  • Your code or the only available compiler/interpreter contain a system call to rm -rf ~ or something equally fishy.

Leaderboard

I've re-scored all submissions, running both compilations and executions in a loop with 10,000 iterations for compilation and between 100 and 10,000 iterations for execution (depending on the speed of the code) and calculating the mean.

These were the results:

User          Compiler   Score                              Approach

jimmy23013    GCC (-O0)    46.11 ms =   1.46 ms + 44.65 ms  O(m*n*2^n) algorithm.
steveverrill  GCC (-O0)    51.76 ms =   5.09 ms + 46.67 ms  Enumeration over 8 x 4.
jimmy23013    GCC (-O1)   208.99 ms = 150.18 ms + 58.81 ms  Enumeration over 8 x 8.
Reto Koradi   GCC (-O2)   271.38 ms = 214.85 ms + 56.53 ms  Enumeration over 8 x 8.

Dennis

Posted 2015-05-31T13:31:41.570

Reputation: 196 637

Why not make this a GOLF contest? :( – orlp – 2015-05-31T14:36:23.080

No, I mean a GOLF challenge.

– orlp – 2015-05-31T15:06:37.377

2If you had suggested that in the sandbox, I might have. That would have saved my CPU and me a lot of work... – Dennis – 2015-05-31T15:11:14.607

@user23013: No, the results do not have to be sorted. I'll edit the question. – Dennis – 2015-05-31T16:19:05.483

I'm slightly confused. What I got is that every domino type (- or |) must appear at least twice. But then why is there no -||-? Or is it that the side-dominos (-) must always appear consecutively? Some more examples would be nice, too... – kirbyfan64sos – 2015-05-31T18:03:27.040

3@kirbyfan64sos The way I understand it, there's only one type of domino, which you can rotate. If it's horizontal, it looks like this: --. If it's vertical, it is two |, one below the other. – Reto Koradi – 2015-05-31T18:41:22.797

@RetoKoradi That makes sense. Thanks! – kirbyfan64sos – 2015-05-31T19:18:05.640

isn't this going to be I/O-limited? – feersum – 2015-05-31T20:32:17.933

@feersum: Possibly. I assumed this would take a lot longer. Shows how much I know about writing fast code. :/ – Dennis – 2015-05-31T20:39:42.870

@feersum I'm worried about that, too. Redirecting to /dev/null will cut out part of the I/O overhead. But you still have to generate strings, and it seems likely that this could take much longer than finding the solutions. An alternative might have been to only count the solutions, without having to output them. – Reto Koradi – 2015-05-31T20:58:25.820

Do we have to check for the case where m*n is odd? E.g. for 9x7, there are no solutions. – Reto Koradi – 2015-05-31T23:36:35.150

@RetoKoradi: Your code has to work for odd mn, which means it should print nothing. – Dennis – 2015-05-31T23:39:21.950

Unfortunately our concerns were correct. My initial implementation uses almost 99% of the time for output. Finding the solutions for the 8x6 size is under 10 ms on a modest laptop. At least if I got it right... Do you want to share how many solutions there are? Does the number start with a 1 and end with 89? – Reto Koradi – 2015-06-01T01:01:27.603

@Dennis If I were you I would re-post this challenge using the GOLF scoring method - right now it's kinda useless since it just becomes an I/O contest. – orlp – 2015-06-01T04:56:07.553

@RetoKoradi: As you already know by now, there are 167089 tilings for 8x6. Sorry for not answering earlier. – Dennis – 2015-06-01T05:00:33.047

@orlp: Given how unfamiliar I am with GOLF (and assembly in general, I might add), I'm not sure I should be the one to do that. At least not right now. You're more than welcome to though. – Dennis – 2015-06-01T05:13:36.323

1Your challenge is not bad. The problem is that our top coders are too strong. My solution that checks validity of rows and columns stays near 1 minute for 6x8. – edc65 – 2015-06-02T07:00:46.440

1I think the best strategy now is to use assembly, and try to get a binary file less than 1024 bytes, to get rid of the complication time. – jimmy23013 – 2015-06-04T07:44:15.990

Answers

5

C

A straightforward implementation...

#include<stdio.h>
int a,b,c,s[65],l=0,countonly;
unsigned long long m=0;
char r[100130];
void w(i,x,o){
    int j,k;
    j=(1<<b)-1&~x;
    if(i<a-1){
        s[i+1]=j;
        w(i+1,j,1);
        for(k=j&j/2&-o;j=k&-k;k&=~j)
            w(i,x|3*j,j);
    }
    else if((j/3|j/3*2)==j)
        if(countonly)
            m++;
        else{
            if(c==b)
                for(k=0;k<b;r[k++,l++]=10)
                    for(j=0;j<a;j++)
                        r[l++]=45+79*!((s[j]|s[j+1])&(1<<k));
            else
                for(j=0;j<a;r[j++,l++]=10)
                    for(k=0;k<b;k++)
                        r[l++]=124-79*!((s[j]|s[j+1])&(1<<k));
            r[l++]=10;
            if(l>=100000){
                fwrite(r,l,1,stdout);
                l=0;
            }
        }
}

int main(){
    scanf("%d %d %d",&a,&b,&countonly);
    c=b;
    if(a<b){a^=b;b^=a;a^=b;}
    s[0]=s[a]=0;
    w(0,0,1);
    if(countonly)
        printf("%llu\n",m);
    else if(l)
        fwrite(r,l,1,stdout);
}

The cheating version

#include<stdio.h>
#include<string.h>
int a,b,c,s[65],l=0,countonly;
unsigned long long m=0,d[256];
char r[100130];
void w2(){
    int i,j,k,x;
    memset(d,0,sizeof d);
    d[0]=1;
    j=0;
    for(i=0;i<a-1;i++){
        for(k=1;k<(1<<(b-1));k*=2)
            for(x=0;x<(1<<(b-2));x++)
                d[(x+x/k*k*3+k*3)^j]+=d[(x+x/k*k*3)^j];
        j^=(1<<b)-1;
    }
    for(x=0;x<(1<<b);x++)
        if((x/3|x/3*2)==x)
            m+=d[x^((1<<b)-1)^j];
    printf("%llu\n",m);
}

void w(i,x,o){
    int j,k;
    j=(1<<b)-1&~x;
    if(i<a-1){
        s[i+1]=j;
        w(i+1,j,1);
        for(k=j&j/2&-o;j=k&-k;k&=~j)
            w(i,x|3*j,j);
    }
    else if((j/3|j/3*2)==j){
        if(c==b)
            for(k=0;k<b;r[k++,l++]=10)
                for(j=0;j<a;j++)
                    r[l++]=45+79*!((s[j]|s[j+1])&(1<<k));
        else
            for(j=0;j<a;r[j++,l++]=10)
                for(k=0;k<b;k++)
                    r[l++]=124-79*!((s[j]|s[j+1])&(1<<k));
        r[l++]=10;
        if(l>=100000){
            fwrite(r,l,1,stdout);
            l=0;
        }
    }
}

int main(){
    scanf("%d %d %d",&a,&b,&countonly);
    c=b;
    if(a<b){a^=b;b^=a;a^=b;}
    s[0]=s[a]=0;
    if(countonly)
        w2();
    else{
        w(0,0,1);
        if(l)
            fwrite(r,l,1,stdout);
    }
}

Explanation of the faster algorithm

It scans from left to right, and maintain the state d[i][j], where:

  • i is in [0,m), which means the current column.
  • j is a bit vector of size n, where the bit would be 1 if the corresponding position in column i is already occupied before starting working on this column. I.e. it is occupied by the right half of a --.
  • d[i][j] is the total number of different tilings.

Then say e[i][j] = the sum of d[i][k] where you can put vertical dominoes base on k to form a j. e[i][j] would be the number of tilings where each 1 bit in j is occupied by anything but the left half of a --. Fill them with -- and you'll get d[i+1][~j] = e[i][j]. e[m-1][every bit being 1] or d[m][0] is the final answer.

A naive implementation will get you the time complexity somewhere near g[n]=2*g[n-1]+g[n-2] = O((sqrt(2)+1)^n) (already fast enough if n=m=8). But you can instead firstly loop for each possible domino, and try to add it to every tiling that can have this domino added, and merge the result to the original array d (like the algorithm for the Knapsack problem). And this becomes O(n*2^n). And everything else are implementation details. The whole code runs in O(m*n*2^n).

jimmy23013

Posted 2015-05-31T13:31:41.570

Reputation: 34 042

@Dennis You probably want to start a poll to change it. – jimmy23013 – 2015-06-01T01:37:47.850

@Dennis Not sure increasing the size would have helped much. While it increases the computation time substantially, it also produces about 100 times more output. Relatively speaking, the amount of output is actually larger. – Reto Koradi – 2015-06-01T02:34:40.883

1st version Execution: 0.286 s Compilation: 0.053 s Sum: 0.339 s 2nd version Execution: 0.002 s Compilation: 0.061 s Sum: 0.063 s (What did just happen here?) – Dennis – 2015-06-02T04:29:21.553

@Dennis It used another algorithm in O(mn2^n) if the flag is set. – jimmy23013 – 2015-06-02T04:38:15.323

@Dennis Was it the spirit of the rule change that a completely different algorithm could be used for the scoring run? I figured that it would still have to enumerate the solutions, and only skip the output. Then again, I didn't think that it would be possible to calculate the count orders of magnitude faster. That sounds intriguing. I would definitely love to see an explanation of how this works. – Reto Koradi – 2015-06-02T04:51:13.873

@RetoKoradi: I can't say this is what I had in mind, because I didn't even consider the possibility... – Dennis – 2015-06-02T05:26:14.960

@jimmy23013: To be clear, there would be no way of actually generating the output from within w2()? The line that updates d looks like d[(x+x/k*k*3)^j] and d[(x+x/k*k*3+k*3)^j] both represent a certain set of outputs, but frankly I don't really know what is going on. – Dennis – 2015-06-02T05:34:31.930

@Dennis Right. d[x]+=d[y] adds two counts together, and there will be d[y]+=d[x] later, so each of the d isn't for a single tiling. So adding generation code will be complex. (It's not impossible, if you think each count is a tree containing duplicated nodes, but it would be a much worse idea than my original code.) Explanations of the algorithm will be added soon. – jimmy23013 – 2015-06-02T05:44:05.587

1Execution: 190 ms Compilation: 68 ms Sum: 258 ms (-O1 seems to be the sweet spot. I've tried all optimization levels.) – Dennis – 2015-06-02T15:03:18.387

3

C

After a round of optimizations, and adapted for the modified rules:

typedef unsigned long u64;

static int W, H, S;
static u64 RM, DM, NSol;
static char Out[64 * 2 + 1];

static void place(u64 cM, u64 vM, int n) {
  if (n) {
    u64 m = 1ul << __builtin_ctzl(cM); cM -= m;

    if (m & RM) {
      u64 nM = m << 1;
      if (cM & nM) place(cM - nM, vM, n - 1);
    }

    if (m & DM) {
      u64 nM = m << W;
      vM |= m; vM |= nM; place(cM - nM, vM, n - 1);
    }
  } else if (S) {
    ++NSol;
  } else {
    char* p = Out;
    for (int y = 0; y < H; ++y) {
      for (int x = 0; x < W; ++x) { *p++ = vM & 1 ? '|' : '-'; vM >>= 1; }
      *p++ = '\n';
    }
    *p++ = '\0';
    puts(Out);
    ++NSol;
  }
}

int main(int argc, char* argv[]) {
  W = atoi(argv[1]); H = atoi(argv[2]); S = (argc > 3);

  int n = W * H;
  if (n & 1) return 0;

  for (int y = 0; y < H; ++y) {
    RM <<= W; RM |= (1ul << (W - 1)) - 1;
  }
  DM = (1ul << (W * (H - 1))) - 1;

  place(-1, 0, n >> 1);
  printf("%lu\n", NSol);

  return 0;
}

I started bumping into the length limit of 1024 characters, so I had to reduce the readability somewhat. Much shorter variable names, etc.

Build instructions:

> gcc -O2 Code.c

Run with solution output enabled:

> ./a.out 8 8 >/dev/null

Run with only solution count:

> ./a.out 8 8 s

Some comments:

  • With the larger test example, I do want optimization now. While my system is different (Mac), around -O2 seems to be good.
  • The code has gotten slower for the case where output is generated. This was a conscious sacrifice for optimizing the "count only" mode, and for reducing the code length.
  • There will be a few compiler warnings because of missing includes and external declarations for system functions. It was the easiest way to get me to below 1024 characters in the end without making the code totally unreadable.

Also note that the code still generates the actual solutions, even in "count only" mode. Whenever a solution is found, the vM bitmask contains a 1 for the positions that have a vertical bar, and a 0 for positions with a horizontal bar. Only the conversion of this bitmask into ASCII format, and the actual output, is skipped.

Reto Koradi

Posted 2015-05-31T13:31:41.570

Reputation: 4 870

@Dennis New version. Execution should be unchanged, but compilation faster. If we need to optimize for compile time, we don't need no system headers! – Reto Koradi – 2015-06-01T04:49:35.163

@Dennis Updated for new scoring, and for a round of optimizations. Note that I do want optimization now, something like -O2 should be good. – Reto Koradi – 2015-06-02T07:32:06.480

Execution: 256 ms Compilation: 65 ms Sum: 321 ms (-O2 seems to be the sweet spot. I've tried all optimization levels.) – Dennis – 2015-06-02T14:58:21.467

1

C

The concept is to first find all possible arrangements of horizontal dominoes in a row, store them in r[] and then organize them to give all possible arrangements of vertical dominoes.

The code for positioning the horizontal dominoes in a row is modified from this answer of mine: https://codegolf.stackexchange.com/a/37888/15599. It's slow for the wider grids but that isn't a problem for the 8x8 case.

The innovation is in the way the rows are assembled. If the board has an odd number of rows it is turned through 90 degrees in the input parsing, so it now has an even number of rows. Now I place some vertical dominoes across the centreline. Because of symmetry, if there are c ways of arranging the remaining dominoes in the bottom half, there must also be c ways of arranging the remaining dominoes in the top half, meaning that for a given arrangement of vertical dominoes on the centreline, there are c*c possible solutions. Therefore only the centreline plus one half of the board is analysed when the program is required to print only the number of solutions.

f() builds the table of possible arrangements of horizontal dominoes, and scans through the possible arrangements of vertical dominoes on the centreline. it then calls recursive function g() which fills in the rows. If printing is required, function h() is called to do this.

g() is called with 3 parameters. y is the current row and d is the direction (up or down) in which we are filling the board from the centre outwards. x contains a bitmap indicates the vertical dominoes that are incomplete from the previous row. All possible arrangements of dominoes in a row from r[] are tried. In this array, a 1 represents a vertical domino and a pair of zeroes a horizontal domino. A valid entry in the array must have at least enough 1's to finish any incomplete vertical dominoes from the last row: (x&r[j])==x. It may have more 1's which indicates that new vertical dominoes are being started. For the next row, then, we need only the new dominoes so we call the procedure again with x^r[j].

If an end row has been reached and there are no incomplete vertical dominoes hanging of the top or bottom of the board x^r[j]==0 then the half has been succesfully completed. If we are not printing, it is sufficient to complete the bottom half and use c*c to work out the total number of arrangements. If we are printing, it will be necessary to complete also the top half and then call the printing function h().

CODE

unsigned int W,H,S,n,k,t,r[1<<22],p,q[64];
long long int i,c,C;


//output: ascii 45 - for 0, ascii 45+79=124 | for 1
h(){int a;
  for(a=n;a--;a%W||puts(""))putchar(45+(q[a/W]>>a%W)%2*79);
  puts("");
}

g(y,d,x){int j;
  for(j=0;j<p;j++)if((x&r[j])==x){
    q[y]=r[j];
    if(y%(H-1)==0){
       (x^r[j])==0?
        y?c++,S||g(H/2-1,-1,i):h() 
       :0;
    }else{
      g(y+d,d,x^r[j]);
    }

  }    
}

e(z){int d;
  for(d=0;z;d++)z&=z-1;return n/2+1+d&1; 
}

f(){
  //k=a row full of 1's
  k=(1ULL<<W)-1;

  //build table of possible arrangements of horizontal dominoes in a row;
  //flip bits so that 1=a vertical domino and save to r[]
  for(i=0;i<=k;i++)(i/3|i/3<<1)==i?r[p++]=i^k:0;

  //for each arrangement of vertical dominoes on the centreline, call g()
  for(i=0;i<=k;i++)e(i)?c=0,g(H/2,1,i),C+=c*c:0;
  printf("%llu",C);
}


main(int argc, char* argv[]) {
  W = atoi(argv[1]); H = atoi(argv[2]); S = (argc > 3);
  1-(n=W*H)%2?
      H%2?W^=H,H^=W,W^=H,t=1:0,f()
    :puts("0");

}

Note that input with an odd number of rows and even number of columns is turned though 90 degrees in the parsing phase. If this is unacceptable, the printing function h() can be changed to accomodate it. (EDIT: not required, see comments.)

EDIT: A new function e() has been used to check the parity of i (i.e the number of dominoes straddling the centreline.) The parity of i (the number of half-dominoes on the centreline protruding into each half of the board) must be the same as the oddness of the total number of spaces in each half (given by n/2) because only then can the dominoes fill all available space. This edit eliminates half the values of i and therefore makes my program approximately twice as fast.

Level River St

Posted 2015-05-31T13:31:41.570

Reputation: 22 049

Execution: 18 ms Compilation: 50 ms Sum: 68 ms (-O0 was the sweet spot for the total. Other options slowed down compilation.) – Dennis – 2015-06-03T02:53:41.400

This either never terminates, or at least takes a very long time, for input 32 2 s. I stopped it after about 15 minutes. – Reto Koradi – 2015-06-03T05:08:32.900

@RetoKoradi indeed, yet 2 32 s runs almost instantly. The scanning through all possible vertical dominoes is extremely wasteful for the H=2 case, because in fact we already have all necessary info in r[]. I'm very pleased with the official time for 8 8 s Here's a patch for the case you mention: if(H==2){C=p;if(!S)for(i=0;i<p;i++)q[0]=q[1]=r[i],h();}else for(i=0;i<=k;i++)c=0,g(H/2,1,i),C+=c*c; As you can see this snippet will run instantly for H=2 with the flag set. The overall runtime is then limited by the building of r[] which certainly has room for improvement. – Level River St – 2015-06-03T06:55:16.483

For completeness here's the patch for turning the output the right way up, if required: if(t)for(a=n;a--;a%H||puts(""))putchar(124-(q[a%H]>>a/H)%2*79);else for(a=n;a--;a%W||puts(""))putchar(45+(q[a/W]>>a%W)%2*79); Codelength is still well below 1000 bytes and impact on compile time should be minimal. I didn't include these patches last night as I was too tired. – Level River St – 2015-06-03T07:36:41.987

I meant to comment on that last night, but I forgot. Since the scoring is done on a square, I'm not going to insist on a particular order. – Dennis – 2015-06-03T17:39:26.013

Out of curiosity: Would it be possible to apply your approach recursively if the number of rows is a power of 2? – Dennis – 2015-06-03T17:43:22.560

@Dennis while there may be mileage in splitting the board into sections & analysing each separately, I'm not sure there's anything special about powers of 2. With 4 or 8 slices, the top&bottom ones would have 1 clean edge but the others wouldn't. So it might work just as well with 6. Exploiting symmetry more may help. For example analyse 1 quadrant, list all possible tilings by the overlaps on their inner "ragged" edge, then combine them like a 4 piece jigsaw. Symmetry is used to reduce complexity of computational fluid dynamics. You can study 1/4 or 1/2 of a round object instead of the whole – Level River St – 2015-06-03T21:08:09.980

@Dennis I have just done a new edit that should make the program twice as fast. Let me know the official time! – Level River St – 2015-06-04T21:47:01.983

Execution: 10 ms Compilation: 53 ms Sum: 63 ms (-O0) Tied to the millisecond! I'll rerun the test this weekend (with a lot more iterations) to see if I can determine a winner. When I wrote up this challenge, I expected to measure execution time in minutes, not milliseconds... – Dennis – 2015-06-05T02:37:37.237

I've re-scored all submissions, determining the most favorable optimization levels and compiling and executing the fast answers 10,000 times. All timings appear lower than before since I timed the entire loop and divided by the number of iterations. Your answer turned out to be a just a tad slower than Jimmy's. Let me know if you make any improvements. – Dennis – 2015-06-08T06:42:48.910

@Dennis Thanks for checking thoroughly. I think I'll leave it now. I expected to win with my last edit, but I didn't reckon on the compile time increasing by 3ms with respect to my original answer. Jimmy's answer is better on both run time and compile time. In fact if I reduced my run time to zero he would still beat me! It was interesting to see different approaches, which these domino questions have always produced. – Level River St – 2015-06-08T12:14:08.173