Find Possible Word Rectangles

8

3

Johnny is trying to create crossword puzzles, but he is having difficulty making words match each other.

He has come up with several simple word rectangles: that is, groups of words that form a rectangle where all horizontal and vertical paths form a word.

//2x2 
PA
AM

//2x3
GOB
ORE

//3x3
BAG
AGO
RED

//3x4
MACE
AGES
WEES

However, to make a good puzzle, he needs some word rectangles that are somewhat larger than 3x4. Instead of agonizing over the arrangement letters for hours, Johnny would prefer to have a program that does this for him--and in as few characters as possible, because long blocks of code are extremely intimidating for casual programmers like Johnny.


Given

  • a text file dictionary where words are separated by newlines in alphabetical order,
  • input specifying the number of rows and columns in the words rectangle (which can be provided however is most convenient in your programming language of choice)

generate at least one word rectangle. If it is not possible to generate a word rectangle with the given lexicon and dimensions, the program need not have defined behaviour. It is not necessary for the program to be able to generate rectangles which contain more than 64 letters or have dimensions that exceed 8 in either direction. The program should be able to complete in a reasonable amount of time, say, in thirty minutes or less.


EDIT: If you are doing an NxN rectangle, you are permitted to use a smaller dictionary file that only contains words that have a length of N letters.

Peter Olson

Posted 2011-05-12T05:58:27.330

Reputation: 7 412

I may be able to adapt the code I wrote for a Java4k game.

– Peter Taylor – 2011-05-12T07:03:24.830

1I'm curious if anyone has an implementation that can generate an 8x8 from that word list in under ½ hour. – MtnViewMark – 2011-05-14T06:05:54.177

@MtnViewMark If not, I'll be willing to lower the size requirement. Even so, I think that Keith's implementation could be made quite a bit faster if he could do some checking to reduce the number of possiblities. – Peter Olson – 2011-05-14T14:44:42.760

and all words must be different? – Ming-Tang – 2011-05-14T23:27:43.360

@shinkirou No, there can be duplicate words in the rectangle. – Peter Olson – 2011-05-15T04:10:37.567

Did anyone manage to generate a solution grid greater than 6x6 in under ½ hour? – MtnViewMark – 2011-05-15T16:07:08.683

@MtnViewMark, 7x6 in 21 seconds. But I left it running for 5 hours looking for 8x8 without finding anything. I have an idea for a reduction to exact set cover which I want to try. – Peter Taylor – 2011-05-15T21:46:22.837

@Peter - That's fast.. or do you just a blazing computer on your desk? What is the implementation in? And are you using something other than one of the two algorithms represented here? – MtnViewMark – 2011-05-15T22:52:20.180

@MtnViewMark, 2.5GHz probably doesn't count as "blazing" nowadays. Implementation in Java. Prefix-based with heuristics; I think it's a bit more complex than the algorithms posted, although I can't decipher the Python one entirely. – Peter Taylor – 2011-05-16T06:04:58.997

@Peter - the Python one is the same algorithm as the awk one: Recursive enumeration of all possible rows combinations, pruning on if the columns are viable prefixes. My Haskell solution enumerates by alternating filling out rows and columns, and now prunes by viable prefixes as well. Can you share your method? – MtnViewMark – 2011-05-17T04:17:21.987

1@MtnViewMark, rather than testing merely whether a prefix is viable I use a trie which stores the number of children under each node to give a heuristic for which continuations are most likely to give a solution. (I then pick one at random weighted by the heuristic - I should probably try just picking the best candidate, since variability of results isn't explicitly in the spec). I'm testing the reduction to set cover, solving with Knuth's dancing links, on the basis that it applies heuristics globally rather than to prefixes, but so far not promising. Maybe there's a better reduction. – Peter Taylor – 2011-05-17T08:42:57.157

Answers

5

Haskell, 586 characters

import Data.List
import qualified Data.Vector as V
import System
data P=P[String](V.Vector P)
e=P[]$V.replicate 128 e
(a:z)∈(P w v)|a>' '=z∈(V.!)v(fromEnum a);_∈(P w _)=w
p∫w=q p w where q(P u v)=P(w:u).V.accum q v.x;x(a:z)=[(fromEnum a,z)];x _=[]
l%n=foldl'(∫)e.filter((==n).length)$l
d§(p,q:r)=map(\w->p++w:r)$q∈d;_§(p,_)=[p]
(d¶e)i=filter(not.any(null.(∈e))).map transpose.(d§).splitAt i
s i n d e m|i==n=["":m]|1<3=(d¶e)i m>>=(e¶d)i>>=s(i+1)n d e
p[b,a,n]w=take n$s 0(a`max`b)(w%a)(w%b)$replicate b$replicate a ' '
main=do{g<-map read`fmap`getArgs;interact$unlines.concat.p g.lines}

Invoked by supplying 3 arguments: number of rows, number of columns, number of solutions; and the word list is accepted on stdin:

$> ghc -O3 2554-WordRect.hs 
[1 of 1] Compiling Main             ( 2554-WordRect.hs, 2554-WordRect.o )
Linking 2554-WordRect ...

$> time ./2554-WordRect 7 7 1 < 2554-words.txt

zosters
overlet
seriema
trimmer
element
remends
startsy

real    0m22.381s
user    0m22.094s
sys     0m0.223s

As you can see 7×7 runs relatively fast. Still timing 8×8 and 7×6....

It would be 9 characters shorter to remove the number of solutions argument and just produce all solutions, but then it becomes impossible to time.


  • Edit: (585 → 455) replaced custom data structure with a simple map of prefix string to possible replacements; oddly, this is a bit slower, perhaps because Map String a is slower than a hand-built tree of Map Char a...
  • Edit: (455 → 586) Bigger?!?!! This version does more optimization of the search space, using both the techniques of my original solution, and of the python and awk solutions. Further, the custom data structure, based on Vector is much faster than using a simple Map. Why do this? Because I'm think a solution that is closer to the goal of 8x8 in under ½ hour is preferable to a shorter solution.

MtnViewMark

Posted 2011-05-12T05:58:27.330

Reputation: 4 779

1What version of GHC are you using? Using the same commands (except I need --make after ghc) with the word list posted in the question, I'm getting words that aren't in the list, like "zymesz" and "youthy". – Joey Adams – 2011-05-14T16:52:37.690

2Perhaps the requirements should be altered to say that the output grid shouldn't be its own transpose. – Peter Taylor – 2011-05-14T17:09:22.903

1@Joey: Did you convert the words.txt file to local line endings? I'm running on Mac OS X, and I converted the file to \n line endings before using it. – MtnViewMark – 2011-05-14T19:21:06.343

Thanks, that worked. The input file has to be in lowercase and have line endings compatible with one's system. – Joey Adams – 2011-05-14T21:36:19.000

5

Python, 232 chars

x,y=input()
H=[]
P={}
for w in open('words.txt'):
 l=len(w)-2
 if l==x:H+=[w]
 if l==y:
  for i in range(y+1):P[w[:i]]=1
def B(s):
 if(x+2)*y-len(s):[B(s+w)for w in H if all((s+w)[i::x+2]in P for i in range(x))]
 else:print s
B('')

It can only handle up to 6x6 in the 1/2 hour limit, though.

Keith Randall

Posted 2011-05-12T05:58:27.330

Reputation: 19 865

1Does it generate all the pair or valid words pair only? When I read vertical from above to bottom, it does not seems to form a valid word. Eg., I got one result "aahs","abet","lack", but vertical reading "stk" is not in the words list, and also for above I pass parameter 3,3 but its return words 3x4 – YOU – 2011-05-13T06:20:13.180

Hmmm, that's not what I get. I suspect the problem is with linebreaks in the dictionary. The provided file has Windows linebreaks (\r\n) in it, so the length of the words is len(w)-2. If you somehow converted the line breaks (or if Python on Windows does that for you?), change it to len(w)-1 and that should fix it. – Keith Randall – 2011-05-13T16:17:02.500

...and change the other +2s to +1s. – Keith Randall – 2011-05-13T16:17:55.043

Ah, I see. I tested with Windows, and Linux. Python on Windows automatically removed \r from w, and on Linux I have converted file to Unix format, so both didn't work. – YOU – 2011-05-14T03:42:38.670

This is a very elegant solution. – asoundmove – 2011-05-15T05:26:25.143

3

Java (1065 bytes)

import java.util.*;public class W{public static void main(String[]a){new
W(Integer.parseInt(a[0]),Integer.parseInt(a[1]));}W(int w,int h){M
H=new M(),V=new M();String L;int i,j,l,m,n=w*h,p[]=new int[n];long
I,J,K,M,C=31;long[]G=new long[h],T=new long[w],W[]=new long[n][],X;try{Scanner
S=new Scanner(new java.io.File("words.txt"));while(0<1){L=S.nextLine();l=L.length();for(i=0;i>>l<1;i++){K=0;for(j=0;j<l;j++)K+=(i>>j&1)*(L.charAt(j)-96L)<<5*j;if(l==w)H.put(K,H.g(K)+1);if(l==h)V.put(K,V.g(K)+1);}}}catch(Exception
E){}while(n-->0){j=1;if(W[n]==null){M=1L<<62;for(i=w*h;i-->0;){m=i/w;l=i%w*5;if((G[m]>>l&C)<1){X=new
long[27];I=K=0;for(;K++<26;){J=H.g(G[m]+(K<<l))*V.g(T[i%w]+(K<<5*m));X[(int)K]=K-32*J;I+=J;}if(I<1)j=0;if(I<M){M=I;p[n]=i;W[n]=X;}}}}X=W[n];Arrays.sort(X);M=X[0]*j;X[0]=0;K=M&C;i=p[n]%w;j=p[n]/w;l=5*i;m=5*j;G[j]&=~(C<<l);G[j]+=K<<l;T[i]&=~(C<<m);T[i]+=K<<m;if(M>=0){W[n]=null;n+=2;}}for(long
A:G){L="";for(i=0;i<w;)L+=(char)(96+(C&A>>5*i++));System.out.println(L);}}class
M extends HashMap<Long,Long>{long g(Long s){return get(s)!=null?get(s):0;}}}

A long way from being the shortest, but I think it's the closest to meeting the timing constraints. I saved 14 bytes by assuming that the input file has been filtered to words of the right lengths; on my netbook, if you feed the whole words.txt then it spends the first minute preprocessing it, discarding most of what it produces, and then takes a mere 20 or so seconds to solve 7x7. On my desktop it does the whole thing in under 15 seconds, giving:

rascals
areolae
serrate
coroner
alanine
latents
seeress

I've let it run for over 50 hours without finding a solution to 8x7 or 8x8. 8-letter words seem to be a critical boundary for this problem - it just hovers around half-full without making much progress.

The approach used is full pivoting and a heuristic based on the number of possible horizontal completions times the number of possible vertical completions. E.g. if we have intermediate grid

*ean*
algae
*ar**
*ier*
*nee*

then we give the top-left corner a heuristic value of count(aean*)count(aa***) + count(bean*)count(ba***) + ... + count(zean*)count(za***). Of all the cells we pick the one with the smallest heuristic value (i.e. hardest to satisfy), and then work though the letters in descending order of the amount they contributed to the heuristic value of that cell (i.e. starting with the most likely to succeed).

Peter Taylor

Posted 2011-05-12T05:58:27.330

Reputation: 41 901

Lovely approach and explanation. – MtnViewMark – 2011-05-25T05:44:38.983

2

F#

Backtracking solution, but I'm going to optimize the search space later.

open System

(*-NOTES
    W=7 H=3
    abcdefg<-- searching from wordsW
    abcdefg
    abcdefg
    ^
    partial filtering from wordsH
  *)

let prefix (s : char[]) (a : char[]) =
  a.[0..s.Length - 1] = s

let filterPrefix (s : char[]) =
  Array.filter (prefix s)

let transpose (s : char[][]) =
  [|
    for y = 0 to s.[0].Length - 1 do
      yield [|
        for x = 0 to s.Length - 1 do
          yield s.[x].[y]
      |]
  |]

[<EntryPoint>]
let main (args : String[]) =
  let filename, width, height = "C:\Users\AAA\Desktop\words.txt", 3, 3
  let lines = System.IO.File.ReadAllLines filename |> Array.map (fun x -> x.ToCharArray())
  let wordsW = Array.filter (Array.length >> ((=) width)) lines
  let wordsH = Array.filter (Array.length >> ((=) height)) lines

  let isValid (partial : char[][]) =
    if partial.Length = 0 then
      true
    else
      seq {
        for part in transpose partial do
          yield Seq.exists (prefix part) wordsH
      }
      |> Seq.reduce (&&)

  let slns = ref []
  let rec back (sub : char[][]) =
    if isValid sub then
      if sub.Length = height then
        slns := sub :: !slns
      else
        for word in wordsW do
          back <| Array.append sub [| word |]

  back [| |]
  printfn "%A" !slns
  0

Ming-Tang

Posted 2011-05-12T05:58:27.330

Reputation: 5 383

1

awk, 283

(may need to add 14 for parameter input flags)

Call with e.g. awk -v x=2 -v y=2...
Find first match and print it (283 chars):

{if(x==L=length)w[++n]=$0;if(y==L)for(;L>0;L--)W[substr($0,1,L)]++}END{for(i[Y=1]++;i[j=1]<=n;){b[Y]=w[i[Y]];while(j<=x){s="";for(k=1;k<=Y;k++)s=s substr(b[k],j,1);W[s]?0:j=x;j++}if(W[s])if(Y-y)i[++Y]=0;else{for(k=1;k<=Y;k++)print b[k];exit}i[Y]++;while(Y>1&&i[Y]>n)i[--Y]++}print N}

Find number of matches (245 chars, much slower):

{if(x==L=length)w[++n]=$0;if(y==L)for(;L>0;L--)W[substr($0,1,L)]++}END{for(i[Y=1]++;i[j=1]<=n;){b[Y]=w[i[Y]];while(j<=x){s="";for(k=1;k<=Y;k++)s=s substr(b[k],j,1);W[s]?0:j=x;j++}W[s]?Y-y?i[++Y]=0:N++:0;i[Y]++;while(Y>1&&i[Y]>n)i[--Y]++}print N}

For both programs (of course more so for the solutions count), the running time far exceeds 30 minutes for some values of x and y.

Just as a matter of interest, here is the word count for each word length:

 2     85
 3    908
 4   3686
 5   8258
 6  14374
 7  21727
 8  26447
 9  16658
10   9199
11   5296
12   3166
13   1960
14   1023
15    557
16    261
17    132
18     48
19     16
20      5
21      3

asoundmove

Posted 2011-05-12T05:58:27.330

Reputation: 358