Fastest python code to find a set of winning words in this game

This is a word game from a set of activity cards for children. Below the rules is code to find the best triplet using /usr/share/dict/words. I thought it was an interesting optimization problem, and am wondering if people can find improvements.

Rules

Choose one letter from each of the sets below.
Choose a word using the letters chosen (and any others).
Score the word.
- Each letter from the chosen set gets the number shown with the set (repeats included).
- AEIOU count 0
- All other letters are -2
Repeat steps 1-3 above (no reusing letters in step 1) twice more.
Final score is the sum of the three word scores.

Sets

(set 1 scores 1 point, set 2 scores 2 points, etc.)

Code:

from itertools import permutations
import numpy as np

points = {'LTN' : 1,
          'RDS' : 2,
          'GBM' : 3,
          'CHP' : 4,
          'FWV' : 5,
          'YKJ' : 6,
          'QXZ' : 7}

def tonum(word):
    word_array = np.zeros(26, dtype=np.int)
    for l in word:
        word_array[ord(l) - ord('A')] += 1
    return word_array.reshape((26, 1))

def to_score_array(letters):
    score_array = np.zeros(26, dtype=np.int) - 2
    for v in 'AEIOU':
        score_array[ord(v) - ord('A')] = 0
    for idx, l in enumerate(letters):
        score_array[ord(l) - ord('A')] = idx + 1
    return np.matrix(score_array.reshape(1, 26))

def find_best_words():
    wlist = [l.strip().upper() for l in open('/usr/share/dict/words') if l[0].lower() == l[0]]
    wlist = [l for l in wlist if len(l) > 4]
    orig = [l for l in wlist]
    for rep in 'AEIOU':
        wlist = [l.replace(rep, '') for l in wlist]
    wlist = np.hstack([tonum(w) for w in wlist])

    best = 0
    ct = 0
    bestwords = ()
    for c1 in ['LTN']:
        for c2 in permutations('RDS'):
            for c3 in permutations('GBM'):
                for c4 in permutations('CHP'):
                    for c5 in permutations('FWV'):
                        for c6 in permutations('YJK'):
                            for c7 in permutations('QZX'):
                                vals = [to_score_array(''.join(s)) for s in zip(c1, c2, c3, c4, c5, c6, c7)]
                                ct += 1
                                print ct, 6**6
                                scores1 = (vals[0] * wlist).A.flatten()
                                scores2 = (vals[1] * wlist).A.flatten()
                                scores3 = (vals[2] * wlist).A.flatten()
                                m1 = max(scores1)
                                m2 = max(scores2)
                                m3 = max(scores3)
                                if m1 + m2 + m3 > best:
                                    print orig[scores1.argmax()], orig[scores2.argmax()], orig[scores3.argmax()], m1 + m2 + m3
                                    best = m1 + m2 + m3
                                    bestwords = (orig[scores1.argmax()], orig[scores2.argmax()], orig[scores3.argmax()])
    return bestwords, best


if __name__ == '__main__':
    import timeit
    print timeit.timeit('print find_best_words()', 'from __main__ import find_best_words', number=1)

The matrix version is what I came up with after writing one in pure python (using dictionaries and scoring each word independently), and another in numpy but using indexing rather than matrix multiplication.

The next optimization would be to remove the vowels from the scoring entirely (and use a modified ord() function), but I wonder if there are even faster approaches.

EDIT: added timeit.timeit code

EDIT: I'm adding a bounty, which I'll give to whichever improvement I most like (or possibly multiple answers, but I'll have to accrue some more reputation if that's the case).

thouis

Posted 2011-04-29T13:18:48.417

Reputation: 93

3BTW, I wrote the code to give my eight-year-old three words to memorize for when he played the game against his mother. Now I know what xylopyrography means. – None – 2011-04-29T13:25:06.163

2This is a fun question. I think you could make it more likely to get responses if you provide the following: (1) A link to an online word list so everyone works with the same data set. (2) Put your solution in a single function. (3) Run that function using the time-it module to show timings. (4) Make sure to put the loading of dictionary data outside of the function so that we are not testing disk speed. People then can use the existing code as a framework for comparing their solutions. – None – 2011-04-29T14:17:27.930

I'll rewrite to use timeit, but for fair comparisons I'd have to use my own machine (which I'm happy to do for people posting solutions). A word list should be available on most systems, but if not, there are several here: http://wordlist.sourceforge.net/

– None – 2011-04-29T15:07:24.097

1Fair comparisons can be had if each user times your solution and any other posted solutions against their own on their own machine. There will be some differences cross platform, but in general this method works. – None – 2011-04-29T17:57:15.073

Good point. I've added the timeit code. For testing, I suggest trimming the word list to 100 or so entries. – None – 2011-04-29T20:17:23.033

Why is this python specific? There seems to be no need for it. – Juan – 2011-04-29T21:57:14.507

@Juan: It came from SO; this wasn't originally intended to be a code challenge/puzzle. – Joey – 2011-04-30T07:07:43.157

@Joey: Exactly. @Juan: I'm interested in any new methods for solving the problem, but more in algorithmic improvements than particular implementations. – thouis – 2011-04-30T16:56:19.760

1Hm, in that case I wonder whether this is the correct site. I think SO would have been the best fit. – Joey – 2011-04-30T17:13:22.787

Feel free to flag it to be moved back... I'm not sure why it got moved in the first place. – thouis – 2011-04-30T18:46:19.803

thouis: Because some mod (one who I have never seen here, though) thought it would fit better here than on SO. I guess it was more an outside-this-site decision. At the very least it could have been discussed :) – Joey – 2011-05-01T14:23:44.363

Answers

Using Keith's idea of precomputing the best possible score for each word, I managed to reduce the execution time to about 0.7 seconds on my computer (using a list of 75,288 words).

The trick is to go through the combinations of words to be played instead of all combinations of letters picked. We can ignore all but a few combinations of words (203 using my word list) because they can't get a higher score than we've already found. Nearly all of the execution time is spent precomputing word scores.

Python 2.7:

import collections
import itertools


WORDS_SOURCE = '../word lists/wordsinf.txt'

WORDS_PER_ROUND = 3
LETTER_GROUP_STRS = ['LTN', 'RDS', 'GBM', 'CHP', 'FWV', 'YKJ', 'QXZ']
LETTER_GROUPS = [list(group) for group in LETTER_GROUP_STRS]
GROUP_POINTS = [(group, i+1) for i, group in enumerate(LETTER_GROUPS)]
POINTS_IF_NOT_CHOSEN = -2


def best_word_score(word):
    """Return the best possible score for a given word."""

    word_score = 0

    # Score the letters that are in groups, chosing the best letter for each
    # group of letters.
    total_not_chosen = 0
    for group, points_if_chosen in GROUP_POINTS:
        letter_counts_sum = 0
        max_letter_count = 0
        for letter in group:
            if letter in word:
                count = word.count(letter)
                letter_counts_sum += count
                if count > max_letter_count:
                    max_letter_count = count
        if letter_counts_sum:
            word_score += points_if_chosen * max_letter_count
            total_not_chosen += letter_counts_sum - max_letter_count
    word_score += POINTS_IF_NOT_CHOSEN * total_not_chosen

    return word_score

def best_total_score(words):
    """Return the best score possible for a given list of words.

    It is fine if the number of words provided is not WORDS_PER_ROUND. Only the
    words provided are scored."""

    num_words = len(words)
    total_score = 0

    # Score the letters that are in groups, chosing the best permutation of
    # letters for each group of letters.
    total_not_chosen = 0
    for group, points_if_chosen in GROUP_POINTS:
        letter_counts = []
        # Structure:  letter_counts[word_index][letter] = count
        letter_counts_sum = 0
        for word in words:
            this_word_letter_counts = {}
            for letter in group:
                count = word.count(letter)
                this_word_letter_counts[letter] = count
                letter_counts_sum += count
            letter_counts.append(this_word_letter_counts)

        max_chosen = None
        for letters in itertools.permutations(group, num_words):
            num_chosen = 0
            for word_index, letter in enumerate(letters):
                num_chosen += letter_counts[word_index][letter]
            if num_chosen > max_chosen:
                max_chosen = num_chosen

        total_score += points_if_chosen * max_chosen
        total_not_chosen += letter_counts_sum - max_chosen
    total_score += POINTS_IF_NOT_CHOSEN * total_not_chosen

    return total_score


def get_words():
    """Return the list of valid words."""
    with open(WORDS_SOURCE, 'r') as source:
        return [line.rstrip().upper() for line in source]

def get_words_by_score():
    """Return a dictionary mapping each score to a list of words.

    The key is the best possible score for each word in the corresponding
    list."""

    words = get_words()
    words_by_score = collections.defaultdict(list)
    for word in words:
        words_by_score[best_word_score(word)].append(word)
    return words_by_score


def get_winning_words():
    """Return a list of words for an optimal play."""

    # A word's position is a tuple of its score's index and the index of the
    # word within the list of words with this score.
    # 
    # word played: A word in the context of a combination of words to be played
    # word chosen: A word in the context of the list it was picked from

    words_by_score = get_words_by_score()
    num_word_scores = len(words_by_score)
    word_scores = sorted(words_by_score, reverse=True)
    words_by_position = []
    # Structure:  words_by_position[score_index][word_index] = word
    num_words_for_scores = []
    for score in word_scores:
        words = words_by_score[score]
        words_by_position.append(words)
        num_words_for_scores.append(len(words))

    # Go through the combinations of words in lexicographic order by word
    # position to find the best combination.
    best_score = None
    positions = [(0, 0)] * WORDS_PER_ROUND
    words = [words_by_position[0][0]] * WORDS_PER_ROUND
    scores_before_words = []
    for i in xrange(WORDS_PER_ROUND):
        scores_before_words.append(best_total_score(words[:i]))
    while True:
        # Keep track of the best possible combination of words so far.
        score = best_total_score(words)
        if score > best_score:
            best_score = score
            best_words = words[:]

        # Go to the next combination of words that could get a new best score.
        for word_played_index in reversed(xrange(WORDS_PER_ROUND)):
            # Go to the next valid word position.
            score_index, word_chosen_index = positions[word_played_index]
            word_chosen_index += 1
            if word_chosen_index == num_words_for_scores[score_index]:
                score_index += 1
                if score_index == num_word_scores:
                    continue
                word_chosen_index = 0

            # Check whether the new combination of words could possibly get a
            # new best score.
            num_words_changed = WORDS_PER_ROUND - word_played_index
            score_before_this_word = scores_before_words[word_played_index]
            further_points_limit = word_scores[score_index] * num_words_changed
            score_limit = score_before_this_word + further_points_limit
            if score_limit <= best_score:
                continue

            # Update to the new combination of words.
            position = score_index, word_chosen_index
            positions[word_played_index:] = [position] * num_words_changed
            word = words_by_position[score_index][word_chosen_index]
            words[word_played_index:] = [word] * num_words_changed
            for i in xrange(word_played_index+1, WORDS_PER_ROUND):
                scores_before_words[i] = best_total_score(words[:i])
            break
        else:
            # None of the remaining combinations of words can get a new best
            # score.
            break

    return best_words


def main():
    winning_words = get_winning_words()
    print winning_words
    print best_total_score(winning_words)

if __name__ == '__main__':
    main()

This returns the solution ['KNICKKNACK', 'RAZZMATAZZ', 'POLYSYLLABLES'] with a score of 95. With the words from Keith's solution added to the word list I get the same result as him. With thouis's "xylopyrography" added, I get ['XYLOPYROGRAPHY', 'KNICKKNACKS', 'RAZZMATAZZ'] with a score of 105.

flornquake

Posted 2011-04-29T13:18:48.417

Reputation: 1 467

Here's an idea - you can avoid checking lots of words by noticing that most words have awful scores. Say you have found a pretty good scoring play that gets you 50 points. Then any play with more than 50 points must have a word of at least ceil(51/3)=17 points. So any word that can't possibly generate 17 points can be ignored.

Here's some code that does the above. We compute the best possible score for each word in the dictionary, and store it in an array indexed by score. Then we use that array to check only words that have the required minimum score.

from itertools import permutations
import time

S={'A':0,'E':0,'I':0,'O':0,'U':0,
   'L':1,'T':1,'N':1,
   'R':2,'D':2,'S':2,
   'G':3,'B':3,'M':3,
   'C':4,'H':4,'P':4,
   'F':5,'W':5,'V':5,
   'Y':6,'K':6,'J':6,
   'Q':7,'X':7,'Z':7,
   }

def best_word(min, s):
    global score_to_words
    best_score = 0
    best_word = ''
    for i in xrange(min, 100):
        for w in score_to_words[i]:
            score = (-2*len(w)+2*(w.count('A')+w.count('E')+w.count('I')+w.count('O')+w.count('U')) +
                      3*w.count(s[0])+4*w.count(s[1])+5*w.count(s[2])+6*w.count(s[3])+7*w.count(s[4])+
                      8*w.count(s[5])+9*w.count(s[6]))
            if score > best_score:
                best_score = score
                best_word = w
    return (best_score, best_word)

def load_words():
    global score_to_words
    wlist = [l.strip().upper() for l in open('/usr/share/dict/words') if l[0].lower() == l[0]]
    score_to_words = [[] for i in xrange(100)]
    for w in wlist: score_to_words[sum(S[c] for c in w)].append(w)
    for i in xrange(100):
        if score_to_words[i]: print i, len(score_to_words[i])

def find_best_words():
    load_words()
    best = 0
    bestwords = ()
    for c1 in permutations('LTN'):
        for c2 in permutations('RDS'):
            for c3 in permutations('GBM'):
            print time.ctime(),c1,c2,c3
                for c4 in permutations('CHP'):
                    for c5 in permutations('FWV'):
                        for c6 in permutations('YJK'):
                            for c7 in permutations('QZX'):
                                sets = zip(c1, c2, c3, c4, c5, c6, c7)
                                (s1, w1) = best_word((best + 3) / 3, sets[0])
                                (s2, w2) = best_word((best - s1 + 2) / 2, sets[1])
                                (s3, w3) = best_word(best - s1 - s2 + 1, sets[2])
                                score = s1 + s2 + s3
                                if score > best:
                                    best = score
                                    bestwords = (w1, w2, w3)
                                    print score, w1, w2, w3
    return bestwords, best


if __name__ == '__main__':
    import timeit
    print timeit.timeit('print find_best_words()', 'from __main__ import find_best_words', number=1)

The minimum score gets rapidly up to 100, which means we need only consider 33+ point words, which is a very small fraction of the overall total (my /usr/share/dict/words has 208662 valid words, only 1723 of which are 33+ points = 0.8%). Runs in about half an hour on my machine and generates:

(('MAXILLOPREMAXILLARY', 'KNICKKNACKED', 'ZIGZAGWISE'), 101)

Keith Randall

Posted 2011-04-29T13:18:48.417

Reputation: 19 865

Nice. I'll add that to the matrix solution (removing words as their score falls too low), but this is significantly better than any of the pure python solutions I had come up with. – thouis – 2011-05-01T18:28:44.733

1I'm not sure I've ever seen that many nested for loops before. – Peter Olson – 2011-05-02T14:02:46.087

Combining your idea with matrix scoring (and a tighter upper bound on best possible score) trims the time down to about 80 seconds on my machine (from around an hour). code here

– thouis – 2011-05-02T15:14:49.520

A good portion of that time is in the precomputation of best-possible-scores, which could be made a lot faster. – thouis – 2011-05-02T15:17:24.573