Same word when missing letters

6

Idea

From a given word dictionary (containing only plain letters, i.e. no accents or other special characters) and a given fixed word length, find every set of words that become the same when a given number of letters is missing.

For example, un--ed could match united, untied or unused.

Using my dictionary /usr/share/dict/american-english, searching among 6-letter words with 2 letters missing, I find:

  • 19242 different possible combinations (shapes),
  • the most-matched shape is -a-ing, which matches saving, gaming, hating and more than 80 other words,
  • there are 11209 shapes matched by exactly two words, like -hre-d, which matches only shrewd and thread.

The goal of this challenge is to write the most efficient algorithm (and implement it in any language) to:

  1. extract all words of a given fixed length from a dictionary,
  2. find all groups of words that match the same shape containing a given fixed number of missing letters,
  3. sort and present the output from the most-matched shape to the least-matched shape.

There are 3 variables:

  1. Dictionary file name (could be STDIN)
  2. Fixed word length
  3. Number of missing characters

These can be passed in any way you find useful.

Samples

A quick sample may explain better: searching for shapes of 18 letters with 2 missing letters gives only one match:

same /usr/share/dict/american-english 18 2
electrocardiogra--: electrocardiograms,electrocardiograph

or

same /usr/share/dict/american-english 8 2
17: --ddling: coddling,cuddling,diddling,fiddling,fuddling,huddling,meddling,middling,muddling,paddling,peddling,piddling,puddling,riddling,saddling,toddling,waddling
17: -us-iest: bushiest,cushiest,duskiest,dustiest,fussiest,fustiest,gushiest,gustiest,huskiest,lustiest,mushiest,muskiest,mussiest,mustiest,pushiest,pussiest,rustiest
16: --tching: batching,bitching,botching,catching,ditching,fetching,hatching,hitching,latching,matching,notching,patching,pitching,retching,watching,witching
15: sta--ing: stabbing,stabling,stacking,staffing,staining,stalking,stalling,stamping,standing,stapling,starling,starring,starting,starving,stashing
15: --ttered: battered,bettered,buttered,fettered,guttered,lettered,littered,mattered,muttered,pattered,pottered,puttered,tattered,tittered,tottered
...
 2: -e-onate: detonate,resonate
 2: c-mmune-: communed,communes
 2: trou-le-: troubled,troubles
 2: l-t-rals: laterals,literals
 2: sp-ci-us: spacious,specious
 2: -oardi-g: boarding,hoarding
 2: pre-ide-: presided,presides
 2: blen-he-: blenched,blenches
 2: sw-llin-: swelling,swilling
 2: p-uc-ing: plucking,pouching

or

The shape ---l- matches hello and world, as well as 297 other words:

same /usr/share/dict/american-english 5 4 |
    sed -ne '/hello/{s/^ *\([0-9]*: [a-z-]*\):.*,world,.*$/\1/p}'
299: ---l-

Test file

For testing, I've posted this 62861-word extract of my dict file in plain ASCII, usable from jsfiddle, jsbin or even from your filesystem (thanks to the access-control-allow-origin directive) (sample: @edc65's answer, with score count).

Ranking

This is a fastest algorithm challenge.

The goal is to find the quickest way of building shapes, making comparisons and sorting the list.

But the natural speed of the language must not matter. So if one solution is slower but simpler, and uses a more efficient algorithm than another, the simpler (maybe slower) one has to win.

The score is the number of logical steps required for the whole job, using any dictionary file (my personal English dictionary: wc -l /usr/share/dict/american-english -> 99171) and different combinations of word length and number of missing letters.

The lowest score wins.

F. Hauri

Posted 2014-08-02T17:11:09.230

Reputation: 2 654

Question was closed 2018-01-03T01:36:15.043

It looks like you're mixing up the winning criteria a bit here. First you say "most efficient", then you say it's a code-challenge. But then you say scoring is by upvotes, which would mean popularity-contest. Also, you're using the fastest-code tag which is for runtime as opposed to the fastest-algorithm tag for computational complexity. I'm confused. – Martin Ender – 2014-08-02T17:32:52.130

@MartinBüttner You're right, I was not so clear... I hope this is better now. – F. Hauri – 2014-08-02T18:55:33.830

I see... scoring by computational complexity isn't a new thing here though, and I think with the help of the community it should be possible to figure it out for most algorithms. I wouldn't side step this by scoring by votes, because you can't control people to vote by efficiency either. This sounds like a decent challenge for a plain fastest-algorithm scoring. – Martin Ender – 2014-08-02T19:03:50.627

@MartinBüttner Ok, this would imply consistent work to evaluate each answer, but that's the goal anyway. – F. Hauri – 2014-08-02T19:31:50.970

Based on the samples given, it looks like you already have a solution. Would you submit it? – Sparr – 2014-08-02T20:15:06.333

My test command: cat <(time tee >(wc -l) >(sed -ne '1,3{s/^\(.\{'$[COLUMNS-4]'\}\).*$/\1.../;p};4s/^.*$/.../p;5{N;N;:a;$!N;$!{ s/^[^\n]*\n//;ta};};$p;') >/dev/null < <(./same_missing-X.py /usr/share/dict/american-english 4 2 )) Doing so permit quick change of parameters for another run :-) – F. Hauri – 2014-08-02T23:36:17.017

Can someone give the URL of a suitable dictionary? I have this one, used in another challenge, but it's quite big. https://github.com/noirotm/lingo/blob/master/wordlist.txt

– edc65 – 2014-08-03T06:39:26.760

Yes, this is big, but correct. @edc65 the file you use is fine. You could drop some lines for testing: sed -ne '/^[a-z]\{2,15\}$/{p;N;N;N;N;d}' <wordlist.txt >wordlist-test.txt, with more or fewer ;N to make a smaller or bigger file. – F. Hauri – 2014-08-03T07:57:08.460

Playing with my solution, I found that many words repeat a lot, like: accen-e-:2:accensed,accented accen--d:2:accensed,accented acce-s-d:2:accensed,accessed ac-en-ed:2:accensed,accented a-cen-ed:2:accensed,accented -ccen-ed:2:accensed,accented ... and counting. Is that correct and expected? – edc65 – 2014-08-03T10:31:58.697

@edc65 Yes, the job is purely mechanical... Even where only one letter differs, two words can match a 2-missing-letter mask... – F. Hauri – 2014-08-03T19:31:01.560

Answers

7

C++11

Excuse the variable names.

Calculates the Hamming distance between each pair of words, then tries replacing characters in words with '-' and computes the Hamming distance between the masked word and the unmasked words (those that had a low enough Hamming distance). See comments for details.

#include <algorithm>
#include <cassert>
#include <cstdio>
#include <fstream>
#include <ios>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Calculates hamming distance between a and b, both of which are length m
inline int fries(const std::string &a, const std::string &b, int m){
    int c = 0;
    for(int i = 0; i < m; ++i)
        a[i]==b[i] || ++c;
    return c;
}
using namespace std;

int main(int eggs, char **bagels){
    ios_base::sync_with_stdio(false);
    // make sure proper number of arguments are passed
    assert(eggs == 4);

    int m, k;
    // read in m (word length) and k (number of '-'s)
    sscanf(bagels[2], "%i", &m);
    sscanf(bagels[3], "%i", &k);

    // bacon will store all words that are of length m
    vector<string> bacon;
    bacon.reserve(500000);

    // Save only words that are of length m
    ifstream milk(bagels[1]);
    {
        string toast;
        while(getline(milk, toast))
            if(toast.length() == m)
                bacon.push_back(toast);
    }

    // ham[i] will store the indices of strings with a hamming distance from bacon[i] that is at most k
    vector<vector<int>> ham(bacon.size());

    // Calculate hamming distance, if <=k, add to ham. 
    for(int i = 0, sz = bacon.size(); i < sz; ++i){
        // Uses i+1 to ensure each PAIR is only checked once
        for(int j = i+1; j < sz; ++j){
            unsigned strip = fries(bacon[i], bacon[j], m);
            if(strip && strip <= k)
                ham[i].push_back(j);
        }
    }

    // cereal[string], where string is the matching shape, and the vector is a vector of indices of strings that are matched by string
    map<string, vector<int>> cereal;
    for(int i = 0, sz = ham.size(); i < sz; ++i){
        // skip words that are not similar enough to other words
        if(ham[i].size() == 0) continue;

        // Iterate through all versions of bacon[i] with k '-'s
        // THIS ONLY ITERATES THROUGH COMBINATIONS NOT PERMUTATIONS
        vector<bool> cream(m);
        fill(cream.begin()+k,cream.end(),true);
        do{
            std::string juice = bacon[i];
            // Add in '-'s
            for(int i = 0; i < m; ++i){
                if(!cream[i]) juice[i] = '-';
            }
            // Ensure that this pattern has not yet been processed
            if(!cereal.count(juice)){
                // For each of the words with hamming distance<=k, check hamming distance with the matching shape
                // If hamming distance == k, then it matches, so save to cereal
                for(int j = 0, ssz = ham[i].size(); j < ssz; ++j){
                    if(fries(juice, bacon[ham[i][j]], m) == k)
                        cereal[juice].push_back(ham[i][j]);
                }
                // In the case that there ARE matches, put the index of the original string at the end
                if(cereal.count(juice))
                    cereal[juice].push_back(i);
            }
        }while(next_permutation(cream.begin(), cream.end()));
    }
    // Sorts all the matching shapes by number of matches
    vector<pair<string, vector<int>>> orange_juice;
    copy(cereal.begin(), cereal.end(), back_inserter(orange_juice));
    sort(orange_juice.begin(), orange_juice.end(), [](const pair<string, vector<int>> &a, const pair<string, vector<int>> &b){
        return a.second.size() > b.second.size();
    }); 
    // Iterate through sorted matching shapes, print
    for(auto cake:orange_juice){
        // bacon[cake.second.back()] is the original string from which the pattern was derived
        cout << cake.second.size() << ": " << cake.first << ": " << bacon[cake.second.back()];
        // do not print original string twice
        cake.second.pop_back();
        // print other strings
        for(auto coffee:cake.second){
            cout << "," << bacon[coffee];
        }
        cout << "\n";
    }
    return 0;
}

es1024

Posted 2014-08-02T17:11:09.230

Reputation: 8 953

It's an algorithm-based challenge, so please explain (and/or de-obfuscate ;) – F. Hauri – 2014-08-04T10:19:06.757

@F.Hauri Added explanation+comments throughout source – es1024 – 2014-08-04T10:51:00.263

Ok, I've learned a lot today. If I understand rightly, the use of next_permutation seems very interesting, but for an 8-char length this would generate 8×7×6×5×4×3×2×1 = 40320 patterns (which you would drop, using another loop, to) 8, 28, 56, 70, 56, 28 or 8 different patterns, for 1, 2, 3, 4, 5, 6 or 7 missing letters... C++ is very quick, but I think you could improve your algorithm... (if I understood rightly, meaning the case where next_permutation won't present the same pattern while working with the booleans cream) – F. Hauri – 2014-08-04T19:56:18.737

@F.Hauri Right now, the use of next_permutation generates combinations, so for 8-char length and 3 missing characters, it would generate 8*7*6/(3*2*1) = 56 patterns. – es1024 – 2014-08-04T21:55:23.793

If so, why did you have to Ensure that this pattern has not yet been processed? – F. Hauri – 2014-08-04T23:19:21.660

If you have words abcd, abce, and abcf, there would be the lists [abcd,abce,abcf] and [abce,abcf], where each list would have generated abc-; Ensure that this pattern has not yet been processed is to ensure that the second time does not get processed. next_permutation generates combinations for each of the lists. – es1024 – 2014-08-04T23:23:59.973

So to get 56 final patterns, you have to check over 40320 primary patterns (generated by next_permutation) in order to wipe duplicates?! – F. Hauri – 2014-08-06T00:23:49.780

Have a look at my non answer and please comment – F. Hauri – 2014-08-06T15:24:55.897

@F.Hauri This does not generate 40320 patterns per word. This generates C(8,3) = 56 patterns for 8-char length and 3 missing characters, per unique pair of words with a low enough Hamming distance. I've also edited my answer to remove some dead code (including a conditional you mentioned). – es1024 – 2014-08-06T21:22:50.163

@F.Hauri is it me, or do you always have the wrong tone? – edc65 – 2014-08-06T21:30:11.957

@edc65 You're right, I do apologise. I was a little annoyed because of the work of understanding this code, evaluating the score, learning about C++'s next_permutation behaviour... Then my first question about this was answered unclearly (maybe not so clearly asked). So I've done another dive into the C++ docs to, in the end, reach a score higher than if binary construction were used. – F. Hauri – 2014-08-09T08:41:29.070

bacon.reserve(500000); yeeeessss – cjfaure – 2014-08-09T11:25:46.003

4

C++, O((l choose m) · (l − m) · n log n)

Where l is the word length, m is the number of "wildcards" (or dashes) and n is the number of words (of length l.)

Skimming over the submissions, it seems that they're all either inherently quadratic in the number of words or very liberal with their use of resources, while there is a very straightforward O(n log n) solution: for every possible pattern (l choose m), sort the word list (n log n) while ignoring the masked-out letters (comparing l - m characters). Now all the words that compare equal (modulo masked-out letters) are grouped together and can be collected with a linear scan. This not only scales better but is also very easy to parallelize over different patterns.

Another thing worth noting: I bet most of the submissions spend a significant amount of time just doing I/O. Buffer more and avoid I/O like the plague.

Compile with: g++ wildcards.cpp -owildcards -std=c++11 -pthread -O3.

Run with: ./wildcards <word-list> <l> <m>.

#define THREAD_COUNT 2
#define WILDCARD '-'

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <iterator>
#include <algorithm>
#include <utility>
#include <thread>
#include <atomic>
#include <sstream>
#include <cstdlib>
#include <cstddef>
#include <functional> // for std::bind and std::placeholders

using namespace std;

/* Writes an integer to a string. */
template <typename String, typename T>
void write_int(String& str, T n) {
    if (n < 0) {
        str += '-';
        n = -n;
    }
    int p10 = 1;
    for (T m = n; m; m /= 10, p10 *= 10);
    if (p10 > 1) p10 /= 10;
    for (; p10; p10 /= 10) str += '0' + (n / p10 % 10);
}

/* Function object that lexicographically compares a starting segment of two
 * containers. */
struct fixed_size_lexicographical_compare {
    /* Starting segment size. */
    size_t size;

    explicit fixed_size_lexicographical_compare(size_t size): size(size) {}

    /* If we ever actually need a generic version... */
    template <typename Container>
    bool operator()(const Container& x, const Container& y) const = delete;
    /* Fast specialized case for strings. */
    bool operator()(const char* x, const char* y) const {
        for (size_t i = 0; i < size; ++i) {
            int c = x[i] - y[i];
            if (c) return c < 0;
        }
        return false;
    }
};

/* A pattern is an array of boolean values (avoiding the dreaded vector<bool>)
 * where a truth value represents a wildcard. */
typedef vector<char> pattern_type;
typedef vector<pattern_type> patterns_type;
/* We use a buffer of type `word_data_type` to hold all the word data in one big
 * chunk, and a separate array of type `words_type` to point to individual words
 * in the buffer. */
typedef vector<char> word_data_type;
typedef vector<const char*> words_type;
/* The result type. */
struct result {
    /* The pattern and matching set of words. */
    const pattern_type* pattern;
    const words_type* word_list;
    /* Index range into word_list. */
    pair<size_t, size_t> words;

    words_type::const_iterator begin() const {
        return next(word_list->begin(), words.first);
    }
    words_type::const_iterator end() const {
        return next(word_list->begin(), words.second);
    }
    size_t size() const { return words.second - words.first; }

    /* Write the result to a string. */
    void write(string& str) const {
        size_t word_length = pattern->size();
        write_int(str, size());
        str += ": "; {
            const char* w = *begin();
            for (size_t i = 0; i < word_length; ++i)
                str += (*pattern)[i] ? WILDCARD : w[i];
        }
        str += ':';
        for (const char* w : *this) {
            str += ' ';
            copy(w, w + word_length, back_inserter(str));
        }
        str += '\n';
    }
};
typedef vector<result> results_type;

/* Thread context. */
struct context {
    size_t word_length, wildcards;
    const patterns_type* patterns;
    atomic<size_t>* next_pattern;
    const word_data_type* word_data;
    results_type results;
    words_type result_words;
};
void thread_proc(context& ctx) {
    const char* ctx_word_data = &ctx.word_data->front();
    const size_t ctx_word_data_size = ctx.word_data->size();

    const size_t word_length = ctx.word_length, wildcards = ctx.wildcards;
    const size_t masked_word_length = word_length - wildcards;
    const size_t word_count = ctx_word_data_size / word_length;

    /* For each pattern, we make a copy of the word data buffer without the
     * masked out letters. We can already allocate space for the buffer and
     * initialize the array of pointers into the buffer. */
    word_data_type word_data(word_count * masked_word_length);
    words_type words;
    words.reserve(word_count);
    for (auto i = word_data.begin(); i != word_data.end(); i += masked_word_length)
        words.push_back(&*i);

    while (true) {
        /* Grab a pattern. */
        size_t pattern_index; {
            pattern_index = (*ctx.next_pattern)++;
            if (pattern_index >= ctx.patterns->size())
                break;
        }
        const pattern_type& pattern = (*ctx.patterns)[pattern_index];

        /* Copy the word data while skipping the wildcards. */
        for (size_t i = 0, j = 0; i < ctx_word_data_size; ++i) {
            if (!pattern[i % word_length])
                word_data[j++] = ctx_word_data[i];
        }
        /* Sort the words. */
        sort(
            words.begin(), words.end(),
            fixed_size_lexicographical_compare(masked_word_length)
        );

        /* Traverse the list of words while ... */
        for (auto i = words.begin(); i != words.end(); ) {
            /* ... looking for adjacent words that compare equal (when
             * masked.) */
            auto j = find_if(
                next(i), words.end(),
                bind(
                    fixed_size_lexicographical_compare(masked_word_length),
                    *i, placeholders::_1
                )
            );
            /* If more than one word compares equal, add a result. */
            if (j != next(i)) {
                size_t begin_index = ctx.result_words.size();
                for (; i != j; ++i)
                    /* Make sure to index the word from the global word buffer,
                     * not the local copy. */
                    ctx.result_words.push_back(
                        ctx_word_data +
                        (*i - &word_data.front()) / masked_word_length * word_length
                    );
                /* Sort the words alphabetically (not really part of the
                 * algorithm, just for neatness.) */
                sort(
                    next(ctx.result_words.begin(), begin_index), ctx.result_words.end(),
                    fixed_size_lexicographical_compare(word_length)
                );
                ctx.results.push_back(
                    {&pattern, &ctx.result_words, {begin_index, ctx.result_words.size()}}
                );
            } else
                i = j;
        }
    }
}

int main(int argc, const char* argv[]) {
    ios_base::sync_with_stdio(false);

    if (argc != 4) {
        cerr << "Wrong number of arguments" << endl;
        return 1;
    }

    int word_length = atoi(argv[2]), wildcards = atoi(argv[3]);
    if (word_length <= 0 || wildcards <= 0 || word_length < wildcards) {
        cerr << "Invalid arguments" << endl;
        return 1;
    }

    /* Generate all possible patterns. */
    patterns_type patterns; {
        pattern_type pattern(word_length);
        fill_n(pattern.begin(), wildcards, true);
        do
            patterns.push_back(pattern);
        while (prev_permutation(pattern.begin(), pattern.end()));
    }

    /* Read the word list. */
    clog << "Reading word list..." << endl;
    word_data_type word_data; {
        string word;
        ifstream file(argv[1]);
        if (!file.is_open()) {
            cerr << "Failed to open `" << argv[1] << "'" << endl;
            return 1;
        }
        while (getline(file, word)) {
            if (word.size() == word_length)
                /* Add the word to the end of the buffer. */
                move(word.begin(), word.end(), back_inserter(word_data));
        }
    }

    context ctx[THREAD_COUNT];
    thread threads[THREAD_COUNT];
    results_type results;
    atomic<size_t> next_pattern(0);

    /* Run the threads. */
    clog << "Searching..." << endl;
    for (int i = 0; i < THREAD_COUNT; ++i) {
        ctx[i].word_length = size_t(word_length);
        ctx[i].wildcards = size_t(wildcards);
        ctx[i].patterns = &patterns;
        ctx[i].next_pattern = &next_pattern;
        ctx[i].word_data = &word_data;
        threads[i] = thread(thread_proc, ref(ctx[i]));
    }
    /* Collect the results. */
    for (int i = 0; i < THREAD_COUNT; ++i) {
        threads[i].join();
        results.reserve(results.size() + ctx[i].results.size());
        move(ctx[i].results.begin(), ctx[i].results.end(), back_inserter(results));
        results_type().swap(ctx[i].results);
    }
    sort(
        results.begin(), results.end(),
        [] (const result& r1, const result& r2) { return r1.size() > r2.size(); }
    );
    /* Print the results. */
    clog << "Writing results..." << endl; {
        string str;
        for (const result& r : results) {
            r.write(str);
            if (str.size() >= (1 << 20)) {
                cout << str;
                str.clear();
            }
        }
        cout << str;
    }
}

DarwinBot

Posted 2014-08-02T17:11:09.230

Reputation: 701

Wow, this seems very interesting, it's a totally different approach! I would love it if you posted a javascript version of this (it may help me to understand and evaluate a score... (or maybe you could add a score++ in your source)) – F. Hauri – 2014-08-12T13:55:31.280

This is very fast. About 20 times faster than my implementation. – Ray – 2014-08-12T21:36:55.533

3

JavaScript (E6) + HTML or C# console or C++ console

Edit 3 At last, here is the C++ version. It's not exactly C++11, as it compiles with Visual Studio 2010, but it requires -std=c++11 with gcc. Compiled with max optimization at 64 bit, it is marginally faster than C#. Time for 11,3 is ~7 sec.
Sorry, no funny variable names, but on the other hand it's at least 10 times faster than the other C++ answer.
Side note: many C++ libraries have weak hashing functions that cause more collisions and bad performance with hash tables. I included in the C++ source one of the better hashing functions freely available.

Edit 2 Added an implementation in C# - same algorithm, no parallelism - Time for 11,3 is ~11.5 sec

Edit Removed dependencies on EcmaScript 6, Chrome is really faster:

For 18,2 found 965 in 80sec

chlorophylli-e-ous:3:chlorophylliferous,chlorophylligenous,chlorophylligerous
intercommunicati--:3:intercommunicating,intercommunication,intercommunicative
non--structiveness:3:nondestructiveness,noninstructiveness,nonobstructiveness
oversentimentali--:3:oversentimentalism,oversentimentality,oversentimentalize
transubstantiati--:3:transubstantiating,transubstantiation,transubstantiative
...

Using a hashtable (a javascript object). It is quite fast for a scripting language.
The score ... uuu ... help!!!

Time for 11,3 with my huge 351k word list is 4min 10sec

Screenshot

Problems (for the javascript version)

  1. needs to load the dictionary from local storage, so I can't make a jsfiddle. It works locally, opening the file with any compatible browser (even a recent MSIE will do well)
  2. hits the time limit for javascript execution in some browsers (which I already set high, to 1 minute)

Next time I'll try in C++

Javascript Code

<!DOCTYPE html>
<html>
<head>
<title>Same word when missing letters</title>
</head>
<body>
Load dictionary<input type="file" id="dictinput" /><span id="spWords"></span><br>
Word size: <input id="tbWordSize"><br>
Missing letters: <input id="tbLetters"><br>
<button id="gobutton" onclick="go()">GO</button><br>
<div id="Info" style="width:95%"></div>
<pre id="Output" style="width:95%; height: 600px; overflow: auto; border: 1px solid #000"></pre>
<script type="text/javascript">
var words
  function readWordFile(evt) {
    var f = evt.target.files[0]; 
    if (f) {
      var r = new FileReader();
      r.onload = function(e) { 
          words = e.target.result.match(/[a-z]+/gi)
        console.log( "Got the file " + f.name + "\nWords " + words.length);  
                document.getElementById('spWords').innerHTML = 'Words: ' + words.length;

      }
      r.readAsText(f);
    } else { 
      console.log   ("Failed to load file");
    }
  }

  function bitcount(u) {
    for (var n=0; u; ++n, u=u&(u-1));
    return n;
  }

  function elab(word, mask, wordSize, letters, store)
  {
    var chars, p, bit
    for (; --mask;)
    {
      if (bitcount(mask) == letters)
      {
        chars='';
        for (bit=1, p = 0; p < wordSize; p++, bit+=bit)
        {
          chars += bit & mask ? '-' : word[p];
        }
        if (!store[chars]) store[chars] = [];
        store[chars].push(word);
      }
    }
  }     

  function go()
  {
    document.getElementById('Output').innerHTML = '...';
    document.getElementById('Info').innerHTML = '...running...';
    store = {};

    setTimeout( function() {
      var tstop, tstart = new Date;
      var wordSize = document.getElementById('tbWordSize').value;
      var letters = document.getElementById('tbLetters').value;
      var result, found, i, mask;
      console.log(tstart);
      console.log(wordSize, letters, words.length);

      mask = 1<<wordSize;
      for (i = 0; i < words.length; i++)
      {
        if (words[i].length == wordSize)
          elab(words[i], mask, wordSize, letters, store);
      }
      keys = Object.keys(store);
      keys.sort(function(a,b) { return store[b].length - store[a].length});
      tstop = new Date;
      console.log(tstop);
      console.log('Time to scan (ms) '+ (tstop-tstart));

      found = 0;
      result = keys.map(function(key) { return store[key][1] ? (found++, key + ':' + store[key].length + ':' + store[key] + '\n'): ''});

      document.getElementById('Output').innerHTML = result.join('');
      tstop = new Date;
      console.log(tstop);
      console.log('Total ' + found + ' time (ms) ' + (tstop-tstart));

      document.getElementById('Info').innerHTML = "#Found " + found + " in time (ms) " + (tstop-tstart);
    }, 100);
  }
  document.getElementById('dictinput').addEventListener('change', readWordFile, false);
</script>
</body>
</html>

C# Code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace SameWord
{
    class Program
    {
        Dictionary<String, List<String>> store = new Dictionary<string, List<string>>(100000); // start big

        string wordfile;
        int wordsize;
        int letters;
        int tmask;

        Program(string[] args)
        {
            wordfile = args[0];
            wordsize = Convert.ToInt32(args[1]);
            letters = Convert.ToInt32(args[2]);
            tmask = 1 << wordsize;
        }

        void Dowork()
        {
            DateTime start = DateTime.UtcNow;
            String word;
            using (StreamReader input = new StreamReader(wordfile))
            {
                while ((word = input.ReadLine()) != null)
                {
                    if (word.Length == wordsize)
                    {
                        Elab(word);
                    }
                }
            }

            var results = store
              .Where(x => x.Value.Count > 1)
              .OrderByDescending(x => x.Value.Count)
              .Select(x => new { x.Key, x.Value });
            int found = 0;
            using (StreamWriter output = new StreamWriter(wordfile + "." + wordsize + "." + letters + ".txt"))
            {
                foreach (var line in results)
                {
                    found++;
                    output.WriteLine(line.Key + ":" + line.Value.Count + ":" + String.Join(",", line.Value));
                }
            }
            Console.WriteLine("Time {0}. Found {1}", DateTime.UtcNow - start, found);
        }

        int BitCount(int u)
        {
            int n;
            for (n = 0; u > 0; ++n, u = u & (u-1));
            return n;
        }

        void Elab(string word)
        {
            for(int mask = tmask; --mask > 0; )
            {
                if (BitCount(mask) == letters)
                {
                    int p, bit;
                    var chars = word.ToCharArray();
                    for (bit = 1, p = 0; bit <= mask; p++, bit += bit)
                    {
                        if ((bit & mask) != 0) chars[p] = '-';
                    }
                    string key = new String(chars);
                    List<String> wordskey;

                    if (!store.TryGetValue(key, out wordskey))
                    {
                        store.Add(key, wordskey = new List<String>());
                    }
                    wordskey.Add(word);
                }
            }
        }

        static void Main(string[] args)
        {
            new Program(args).Dowork();
        }
    }
}

C++ Code

#include <stdio.h>
#include <stdint.h>

#include <string>
#include <unordered_map>
#include <forward_list>
#include <list>
#include <vector>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <ctime>

using namespace std;

// SuperFastHash by Paul Hsieh 2008

#define get16bits(d) (*((const uint16_t *) (d)))

uint32_t SuperFastHash (const char * data, int len) {
    uint32_t hash = len, tmp;
    int rem;

    if (len <= 0 || data == NULL) return 0;

    rem = len & 3;
    len >>= 2;

    /* Main loop */
    for (;len > 0; len--) {
        hash  += get16bits (data);
        tmp    = (get16bits (data+2) << 11) ^ hash;
        hash   = (hash << 16) ^ tmp;
        data  += 2*sizeof (uint16_t);
        hash  += hash >> 11;
    }

    /* Handle end cases */
    switch (rem) {
        case 3: hash += get16bits (data);
                hash ^= hash << 16;
                hash ^= ((signed char)data[sizeof (uint16_t)]) << 18;
                hash += hash >> 11;
                break;
        case 2: hash += get16bits (data);
                hash ^= hash << 11;
                hash += hash >> 17;
                break;
        case 1: hash += (signed char)*data;
                hash ^= hash << 10;
                hash += hash >> 1;
    }

    /* Force "avalanching" of final 127 bits */
    hash ^= hash << 3;
    hash += hash >> 5;
    hash ^= hash << 4;
    hash += hash >> 17;
    hash ^= hash << 25;
    hash += hash >> 6;

    return hash;
}

size_t fast_hash( const string & value )
{
    return SuperFastHash(value.data(), value.length());
}

typedef unordered_map<string, vector<string>, decltype(&fast_hash)> MyStore;
MyStore store(100000, fast_hash);

int wordSize, letters; 

int BitCount(int u)
{
    int n;
    for (n = 0; u; ++n, u = u & (u-1));
    return n;
}

void Elab(string word)
{
    char chars[100];
    for(int mask = 1<<wordSize; --mask; )
    {
        if (BitCount(mask) == letters)
        {
            int p, bit;
            const char *wdata = word.data();
            for (bit = 1, p = 0; p < wordSize; p++, bit += bit)
            {
                chars[p] = bit & mask ? '-' : wdata[p];
            }
            vector<string> &wordsKey = store[string(chars, wordSize)];
            wordsKey.push_back(word);
        }
    }
}

bool mycomp (const pair<const string,vector<string>> &a, const pair<const string,vector<string>> &b) 
{
    return (a.second.size() > b.second.size());
}

int main(int argc, char **argv)
{
    string wordFile = argv[1];
    sscanf(argv[2], "%d", &wordSize);
    sscanf(argv[3], "%d", &letters); 

    ifstream infile(wordFile);
    {
        string word;
        while(getline(infile, word))
            if(word.length() == wordSize)
                Elab(word);
    }
    forward_list<pair<const string,vector<string>>> results;
    int found = 0;
    for (MyStore::iterator it = store.begin();  it != store.end(); ++it) 
    {
        if (it->second.size() > 1)
        {
            found++;
            results.push_front(*it);
        }
    }
    cerr << "Found " << found << "\n";

    results.sort(mycomp);

    stringstream sf;
    sf << wordFile << '.' << wordSize << '.' << letters << ".txt";
    ofstream outfile(sf.str());

    for(forward_list<pair<const string,vector<string>>>::iterator it = results.begin(); it != results.end(); ++it)
    {
        int size = it->second.size();
        outfile << it->first << ':' << size << ':' << it->second[0];
        for(size_t i = 1; i < size; ++i)
        {
            outfile << ',' << it->second[i];
        }
        outfile << '\n';
    }
}

edc65

Posted 2014-08-02T17:11:09.230

Reputation: 31 086

The count is wrong: your 1st line displays the same word many times... The algorithm doesn't seem to be the fastest... – F. Hauri – 2014-08-03T17:12:55.777

@F.Hauri the count is not wrong; I count the same word many times because the same word has different missing letters. I pointed out just this in a comment 6 hours ago; look at your question and maybe reply. My question is: how do we manage such repetitions? – edc65 – 2014-08-03T17:23:56.550

In fact I don't count words at all. I count the keys that group words, like '---ological' (what you call 'shape') – edc65 – 2014-08-03T17:30:18.293

There is a small shared dictionary http://jsfiddle.net/38cUH/2/ with your script...

– F. Hauri – 2014-08-03T19:40:33.250

Have a look at my non-answer and please comment – F. Hauri – 2014-08-06T15:24:16.760

3

Because this needs more Java Streaming:

Running the unofficial 351k word list at 11, 3 in ~7 seconds, beating the JavaScript version by only 4 minutes and 3 seconds. ;)

My secret: instead of comparing every possible strike-out combination for each word, I just calculate the word distance (the number of differing letters) between each pair of words, generate a strike-out word for that match if applicable, and add it to an incremental map. This way, only those strike-out words that actually have at least one match are generated.

Input: wordListFile wordSize wordDistanceLimit

public class WordList {

  static boolean acceptableWordDistance(final String a, final String b, final int maxWordDistance) {
    int result = 0;
    for (int i = a.length() - 1; i >= 0; --i) {
      if (a.charAt(i) != b.charAt(i)) {
        ++result;
        if (result > maxWordDistance) {
          return false;
        }
      }
    }
    return (result > 0);
  }

  public static void main(String[] args) throws IOException {
//    args = new String[]{"wordlist.txt", "11", "3"};
    final String fileName = args[0];
    final int wordLength = Integer.parseInt(args[1]);
    final int maxWordDistance = Integer.parseInt(args[2]);

    final List<String> allWords = Files.lines(Paths.get(fileName)).parallel()
            .filter(s -> s.length() == wordLength)
            .distinct()
            .collect(Collectors.toList());

    final HashMap<String, Set<String>> result = new HashMap<>();
    allWords.parallelStream()
            .forEach(a -> {
              allWords.stream()
              .filter(b -> acceptableWordDistance(a, b, maxWordDistance))
              .forEach(b -> {
                final StringBuilder buildA = new StringBuilder(a);
                for (int i = b.length() - 1; i >= 0; --i) {
                  if (buildA.charAt(i) != b.charAt(i)) {
                    buildA.replace(i, i + 1, "-");
                  }
                }
                result.merge(buildA.toString(), new TreeSet<>(Arrays.asList(b)), (list, v) -> {
                  list.addAll(v);
                  return list;
                });
              });
            });

    result.entrySet().parallelStream()
            .sorted((a, b) -> -Integer.compare(a.getValue().size(), b.getValue().size()))
            .forEachOrdered(e -> System.out.println(e.getKey() + ": " + Arrays.toString(e.getValue().toArray())));
  }
}
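The pairing idea above can be sketched in Python (my own illustration of the same algorithm, not TwoThe's code; the tiny word list is made up):

```python
from collections import defaultdict

def word_distance_ok(a, b, max_dist):
    # Count differing positions; bail out as soon as the limit is exceeded.
    diff = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            diff += 1
            if diff > max_dist:
                return False
    return diff > 0  # identical words do not count as a match

def group_by_strikeout(words, max_dist):
    # Only generate a strike-out pattern when a pair of words actually
    # matches, so unmatched shapes are never created at all.
    groups = defaultdict(set)
    for a in words:
        for b in words:
            if word_distance_ok(a, b, max_dist):
                shape = ''.join('-' if ca != cb else ca
                                for ca, cb in zip(a, b))
                groups[shape].add(b)
    return groups

groups = group_by_strikeout(['united', 'untied', 'unused'], 2)
print(sorted(groups['un--ed']))  # ['united', 'untied', 'unused']
```

Because every shape is derived from a matching pair, each stored shape necessarily has at least two words, so no singleton-filtering pass is needed afterwards.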

TwoThe

Posted 2014-08-02T17:11:09.230

Reputation: 359

This seems very interesting; it's a different way of thinking. But I'm not sure I really understand how the loops are constructed. I will have to make this work on my desk, but I have no experience with Java... +1 for the different approach! – F. Hauri – 2014-08-03T19:59:37.203

Bah! Just 4 minutes less! Now I MUST write a C++ implementation to be sure that your algorithm is better. – edc65 – 2014-08-04T12:13:11.280

Have a look at my non-answer and please comment. – F. Hauri – 2014-08-06T15:23:08.473

If you iterate over all words as a and as b, then when checking the distance for a==b you check all the characters before returning false, since result>0. This increases the score by number of words × word length! – F. Hauri – 2014-08-10T15:01:27.590

2

Python

This is an exceptionally naive approach. Runtime is abysmal for even moderate use cases.

n = number of words of length l

l = length of words

m = number of missing letters (ish)

I think my runtime complexity is O(n*n*l*l*m).

The slowest internal function is the string masking. I'm not sure of the optimal way to do that in Python. I tried representing words and masks as strings and as lists, and ended up with the method below, where words are strings and masks are lists of integer positions.

The algorithm itself could be sped up by filtering out words more effectively than comparing them for every mask.
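One way to do the filtering hinted at above is to bucket words by their masked form, one dictionary pass per mask, instead of re-scanning the whole word list for every word (a hypothetical sketch, not this answer's code; it also uses `itertools.combinations` so each mask is generated only once):

```python
from itertools import combinations
from collections import defaultdict

def shapes(words, word_len, missing):
    # One pass over the words per mask: bucket by masked form,
    # then keep only the buckets with more than one word.
    results = {}
    for mask in combinations(range(word_len), missing):
        buckets = defaultdict(list)
        for word in words:
            masked = ''.join('-' if i in mask else c
                             for i, c in enumerate(word))
            buckets[masked].append(word)
        for masked, group in buckets.items():
            if len(group) > 1:
                results[masked] = group
    return results

print(shapes(['united', 'untied', 'unused'], 6, 2))
# {'un--ed': ['united', 'untied', 'unused']}
```

This drops the inner word loop, turning the O(n²) per-mask comparison into a single O(n) grouping pass per mask.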

#!/usr/bin/env python

import sys
import itertools

if len(sys.argv)!=4:
    print "Usage: same.py dictionaryfile wordlength missingletters"
    exit(1)

dictionaryfile = sys.argv[1]
wordlength = int(sys.argv[2])
missingletters = int(sys.argv[3])

words = [line.strip() for line in open(dictionaryfile) if line==line.lower() and len(line.strip())==wordlength]

# print words

def mask_string(string,mask):
    new_string = ""
    for i in range(len(string)):
        if i in mask:
            new_string += "-"
        else:
            new_string += string[i]
    return new_string

masks = itertools.permutations(range(wordlength),missingletters)

results = {}

for mask in masks:
    patterns = set()
    for word in words:
        masked_word = mask_string(word,mask)
        if masked_word in patterns:
            continue        
        results[masked_word] = set()
        results[masked_word].add(word)
        patterns.add(masked_word)
        for word2 in words:
            masked_word2 = mask_string(word2,mask)
            if masked_word2 == masked_word and word2 != word:
                # print word + " and " + word2 + " match with mask " + "".join(mask)
                results[masked_word].add(word2)

for k,v in sorted(results.items(), key=lambda x: (-len(x[1]),x[0])):
    if len(v)>1:
        print str(len(v)) + ": " + k + ": " + ",".join(v)

Sample output for 3,1 (1.8 seconds):

19: -at: bat,cat,eat,fat,gat,hat,kat,lat,mat,nat,oat,pat,rat,sat,tat,vat,wat,yat,zat
19: ta-: taa,tab,tad,tae,tag,tai,taj,tal,tam,tan,tao,tap,tar,tat,tau,tav,taw,tax,tay
...
2: zi-: zig,zip
2: zo-: zoa,zoo

And for 3,2 (0.5 seconds):

237: -a-: aal,aam,baa,bac,bad,bae,bag,bah,bal,bam,ban,bap,bar,bas,bat,baw,bay,cab,cad,cag,cal,cam,can,cap,car,cat,caw,cay,dab,dad,dae,dag,dah,dak,dal,dam,dan,dao,dap,dar,das,daw,day,ean,ear,eat,fad,fae,fag,fam,fan,far,fat,fay,gab,gad,gag,gaj,gal,gam,gan,gap,gar,gas,gat,gau,gaw,gay,gaz,had,hag,hah,hak,ham,han,hao,hap,hat,hau,haw,hay,iao,jab,jag,jam,jap,jar,jaw,jay,kai,kan,kat,kay,lab,lac,lad,lag,lai,lak,lam,lan,lap,lar,las,lat,law,lax,lay,mac,mad,mae,mag,mal,man,mao,map,mar,mas,mat,mau,maw,may,naa,nab,nae,nag,nak,nam,nan,nap,nar,nat,naw,nay,oaf,oak,oam,oar,oat,pac,pad,pah,pal,pam,pan,pap,par,pat,pau,paw,pax,pay,rab,rad,rag,rah,raj,ram,ran,rap,ras,rat,raw,rax,ray,saa,sab,sac,sad,sag,sah,sai,saj,sal,sam,san,sao,sap,sar,sat,saw,sax,say,taa,tab,tad,tae,tag,tai,taj,tal,tam,tan,tao,tap,tar,tat,tau,tav,taw,tax,tay,vag,van,vas,vat,vau,wab,wad,wae,wag,wah,wan,wap,war,was,wat,waw,wax,way,yad,yah,yak,yam,yan,yap,yar,yas,yat,yaw,zac,zad,zag,zak,zar,zat,zax
198: -o-: boa,bob,bod,bog,bom,bon,boo,bop,bor,bot,bow,boy,cob,cod,coe,cog,col,con,coo,cop,cor,cos,cot,cow,cox,coy,coz,dob,doc,dod,doe,dog,dom,don,dop,dor,dos,dot,dow,eon,fob,fod,foe,fog,foo,fop,for,fot,fou,fow,fox,foy,goa,gob,god,gog,goi,gol,gon,goo,gor,gos,got,goy,hob,hod,hoe,hog,hoi,hop,hot,how,hox,hoy,ion,job,joe,jog,jot,jow,joy,koa,kob,koi,kon,kop,kor,kos,kou,loa,lob,lod,lof,log,loo,lop,lot,low,lox,loy,mob,mog,mon,moo,mop,mor,mot,mou,mow,moy,noa,nob,nod,nog,non,nor,not,now,noy,pob,pod,poe,poh,poi,pol,pom,pon,pop,pot,pow,pox,poy,rob,roc,rod,roe,rog,roi,rot,row,rox,sob,soc,sod,soe,sog,soh,sok,sol,son,sop,sot,sou,sov,sow,soy,toa,tod,toe,tog,toi,tol,ton,too,top,tor,tot,tou,tow,tox,toy,voe,vog,vol,vow,wob,wod,woe,wog,won,woo,wop,wot,wow,woy,yoe,yoi,yok,yom,yon,yor,yot,you,yow,yox,yoy,zoa,zoo
...
2: -x-: axe,oxy
2: q--: qua,quo

And for 4,2 (39 seconds):

146: -a-e: babe,bade,bake,bale,bane,bare,base,bate,baze,cade,cage,cake,came,cane,cape,care,case,cate,cave,dace,dade,dale,dame,dare,date,daze,ease,eave,face,fade,fage,fake,fame,fare,fate,faze,gade,gage,gale,game,gane,gape,gare,gate,gave,gaze,hade,haje,hake,hale,hame,hare,hate,have,haze,jade,jake,jane,jape,kale,kame,lace,lade,lake,lame,lane,late,lave,laze,mace,made,mage,make,male,mane,mare,mate,maze,nace,nake,name,nane,nape,nave,naze,pace,page,pale,pane,pape,pare,pate,pave,race,rage,rake,rame,rane,rape,rare,rase,rate,rave,raze,sabe,sade,safe,sage,sake,sale,same,sane,sare,sate,save,tade,take,tale,tame,tane,tape,tare,tate,tave,vade,vage,vale,vane,vare,vase,wabe,wace,wade,wage,wake,wale,wame,wane,ware,wase,wave,yade,yaje,yale,yare,yate
121: -i-e: aide,aile,aire,bice,bide,bike,bile,bine,bite,cine,cise,cite,cive,dice,dike,dime,dine,dire,dite,dive,fice,fide,fife,fike,file,fine,fire,fise,five,gibe,give,hide,hike,hipe,hire,hive,jibe,jive,kibe,kike,kipe,kite,life,like,lile,lime,line,lire,lite,live,mice,mide,mike,mile,mime,mine,mire,mise,mite,nice,nide,nife,nine,oime,pice,pike,pile,pine,pipe,pise,pize,ribe,rice,ride,rife,rile,rime,rine,ripe,rise,rite,rive,sice,side,sife,sike,sile,sime,sine,sipe,sire,sise,site,size,tice,tide,tige,tile,time,tine,tipe,tire,tite,vice,vile,vine,vire,vise,vive,wice,wide,wife,wile,wime,wine,wipe,wire,wise,wite,wive,yite
...
2: z-z-: zizz,zuza
2: zy--: zyga,zyme

Sparr

Posted 2014-08-02T17:11:09.230

Reputation: 5 758

Nice effort, but even if the goal is reached, this way doesn't seem to be the quickest: I think your host must be able to do the last try in less than two seconds, using Python. (Try asking for 10-char length, with 3 missing, with the same file ;-) – F. Hauri – 2014-08-02T23:58:35.317

@F.Hauri I know that this solution is too slow. I hope someone else submits a better solution. – Sparr – 2014-08-03T00:29:52.577

my dictionary is also about 3x as long as yours. maybe I should try a smaller dictionary. – Sparr – 2014-08-03T01:13:08.540

You could gain a little by import string def mask_string(istr,mask): new_string=list(istr) for dig in mask: new_string[dig]="_" return string.join(new_string,'') – F. Hauri – 2014-08-12T14:04:48.453

As you already test len(v)>1:, you could wipe lines 35 to 44 (inclusive, i.e. from if masked_word in patterns to ...add(word2)) and replace them simply by: if masked_word not in results:\\n results[masked_word] = set()\\n results[masked_word].add(word) – F. Hauri – 2014-08-12T14:08:12.390

1

Same shape when missing letters

My approach was to build lists of shapes for counting matches.

There are 3 versions of the same prototype: the first was useful for building charWalk, and one comes with a link to JSFIDDLE.

Even if execution times differ, all these versions have to reach the same score.

perl version

Here is my 1st Perl version:

#!/usr/bin/perl -w

use strict;
use Time::HiRes qw|time|;

my $dict="/usr/share/dict/american-english";
my $fixLen=8;
my $numBlank=2;
$dict = $ARGV[0] if $ARGV[0] && -f $ARGV[0];
$fixLen=$ARGV[1] if $ARGV[1] && $ARGV[1] > 1 && $ARGV[1] <100;
$numBlank=$ARGV[2] if $ARGV[2] && $ARGV[2] > 0 && $ARGV[2] < $fixLen;

my %array;
my $cword;
my %counters;
my $started=time();

sub charWalk {
    my ( $word, $cnt, $lhs ) = @_;
    if ( $cnt > 1 ) {
        my $ncnt=$cnt-1;
        for my $i ( 0 .. length($word) - $cnt ) {
            charWalk( substr( $word, $i + 1 ), $ncnt,
                $lhs . substr( $word, 0, $i ) . "-" );
        };
    } else {
        for my $i ( 0 .. length($word) - $cnt ) {
            push @{ $array{ $lhs.  substr( $word, 0, $i ).
                "-".  substr( $word, $i + 1 ) } },
              $cword;
        };
    };
}

open my $fh,"<".$dict or die;

while (<$fh>) {
    $counters{'words'}++;
    chomp;
    $cword = $_;
    charWalk( $cword, $numBlank,"" ) if $cword=~/^[a-z]{$fixLen}$/ &&
        $counters{'matchlen'}++;
}
map {
    $counters{'found'}++;
    printf "%2d: %s: %s\n", 1+$#{$array{$_} }, $_, join ",", @{ $array{$_} };
} sort {
    $#{ $array{$b} } <=> $#{ $array{$a} }
} grep {
    $#{ $array{$_} } > 0
} keys %array;

print "Counters: ".join(", ",map {sprintf "%s: %s",$_,$counters{$_}} keys %counters).". Time: ".(time()-$started)."\n";

bash proto for charWalk

For this version, I've used an unreasonably small dictionary file, made by dropping 90% of my smallest dictionary file:

 sed -e 'p;N;N;N;N;N;N;N;N;N;d' <testdict.txt >tiny-dict.txt

Well, here it is:

#!/bin/bash

file=${1:-tiny-dict.txt}
wsiz=${2:-11}
mltr=${3:-3}

charWalk () { 
    local word=$1 wlen=${#1} cnt=${2:-1} lhs=$3 pnt;
    [ $cnt -gt $wlen ] && return;
    if [ $cnt -gt 1 ]; then
        for ((pnt=0; pnt<=wlen-cnt; pnt++))
        do
            charWalk ${word:pnt+1} $[cnt-1] $lhs${word:0:pnt}-;
        done;
    else
        for ((pnt=0; pnt<=wlen-cnt; pnt++))
        do
            store[$lhs${word:0:pnt}-${word:pnt+1}]+=" $cword";
        done;
    fi
}

declare -A store=();

while read cword ;do
    charWalk $cword $mltr
  done < <(
    grep "^[a-z]\{$wsiz\}$" <$file
)

count=0
for i in ${!store[@]};do
    if [ ${#store[$i]} -gt $[1+wsiz] ];then
        w=(${store[$i]})
        ((count+=1))
        echo ${#w[@]} $i : ${w[@]}
    fi
done > >( sort -rn )

echo "Found: $count matchs."

javascript

Seeing @edc65's post, I took some parts of his script to build a JSFIDDLE.

Mainly 3 parts (+HTML):

charWalk for building all possible pattern

function charWalk(word,cnt,lhs) {
    if (cnt>1) {
        var ncnt=cnt-1;
        for (var k=0;k<=word.length-cnt;k++) {
            charWalk(word.substr(k+1),ncnt,lhs+word.substr(0,k)+"-");
        };
    } else {
        for (var k=0;k<=word.length-cnt;k++) {
            if (typeof(gS.shapes[lhs+word.substr(0,k)+"-"+word.substr(k+1)])
                == 'undefined') {
                gS.shapes[lhs+word.substr(0,k)+"-"+word.substr(k+1)]=[];
            };
            gS.shapes[lhs+word.substr(0,k)+"-"+
                      word.substr(k+1)].push(gS.currentWord);
        };
    };
}

Main loop.

function go() {
    var wordSize = document.getElementById('tbWordSize').value;
    var letters = document.getElementById('tbLetters').value;
    gS.shapes={};
    for (var i = 0; i < gS.words.length; i++) {
        if (gS.words[i].length == wordSize) {                        
            gS.currentWord=gS.words[i];
            charWalk(gS.currentWord, letters, "" );
        };
    };
    var keys = Object.keys(gS.shapes).
        filter(
            function(i){
                return gS.shapes[i][1];
            }
        ).sort(
            function(a,b) {
                return gS.shapes[b].length - gS.shapes[a].length;
            }
        );
    var intLen=gS.shapes[keys[0]].length.toFixed(0).length;
    gS.result = keys.map(
        function(key) {
            return gS.shapes[key].length.toFixed(0)+ ' : '+
                key +' : '+ gS.shapes[key];
        });
    document.getElementById('Output').innerHTML = gS.result.join('\n');
};

Init part

var gS={sharedDictUrl:'http://cartou.ch/sharedstuff/testdict.txt'};
window.onload=function() {
    var zrequest = new XMLHttpRequest();
    zrequest.open('GET',gS.sharedDictUrl,false);
    zrequest.onload = loadWords; zrequest.send(null);
    document.getElementById('gobutton').addEventListener('click',go);
}
function loadWords(e) {
    gS.words = e.target.responseText.match(/[a-z]+/g);
    document.getElementById('spWords').innerHTML = "Shared dict loaded: " +
        gS.words.length + ' words.)';
};

HTML part

<!DOCTYPE html>
<html>
<head>
<title>Same word when missing letters</title>
</head>
<script type="text/javascript" src="same-missing.js"></script>
<body><button id="gobutton">GO</button>
 <span id="spWords">Loading dictionary...</span><br>
 Word size: <input id="tbWordSize" size="3" value="11"> &nbsp; 
 Missing letters: <input id="tbLetters" size="3" value="3">
 <br>
 <div id="Info">---</div>
 <pre id="Output"></pre>
</body>
</html>

F. Hauri

Posted 2014-08-02T17:11:09.230

Reputation: 2 654

Very nice jsfiddle! The JS code here seems corrupted instead. – edc65 – 2014-08-09T09:28:51.493

...and the performance is impressive. – edc65 – 2014-08-09T09:32:27.573

@edc65 Code posted, rewritten and tested. – F. Hauri – 2014-08-09T10:22:28.340

1

Haskell

Time complexity is O(C(L, m) * L * n + k log(k)), where

  • C(L, m) = L! / ((L - m)! * m!), the number of masks
  • n: number of words with length L
  • m: number of missing letters per mask
  • L: word length
  • k: number of results

Steps

  • Read and filter words
  • Generate all masks, each looks like "0011000"
  • For each mask, apply it to all words. For example, applying mask "00110" to the word "masks" results in "ma--s".
  • Group the words by first applying mask on them and then group by a hashmap.
  • For each group, process it and get a result.
  • Concat, sort and then print all results.

Usage

Name the file wordcalc.hs and then compile it with

ghc -O3 wordcalc.hs

Then run it with

./wordcalc testdict.txt 11 3

Or without arguments, which runs the built-in profiling:

./wordcalc

Performance

Tested on my four-year-old i3 laptop running Arch Linux.

testdict.txt 11 3 -> Time: 1.639882s Count: 31695
testdict.txt 11 5 -> Time: 5.328377s Count: 217138
wordlist.txt 11 3 -> Time: 18.316671s Count: 323347
wordlist.txt 11 5 -> Time: 63.536398s Count: 2069632

Code

import Control.Monad
import Data.Function
import Data.Time.Clock
import Data.List
import qualified Data.HashMap.Strict as Map
import System.Environment
import System.IO

type Dict = [String]
type Mask = [Bool]
data Result = Result Int String [String]

instance Show Result where
  show (Result count mask words) = concat $ [show count, ": ", mask, ": ", concat $ intersperse "," words]

main = do
  args <- getArgs
  if length args == 3 then
    let [dictPath, wordLen, maskCount] = args
    in mapM_ print =<< run dictPath (read wordLen) (read maskCount)
  else
    profile [
      ("testdict.txt", 11, 3)
    , ("testdict.txt", 11, 5)
    , ("wordlist.txt", 11, 3)
    , ("wordlist.txt", 11, 5)
    ]

profile :: [(FilePath, Int, Int)] -> IO ()
profile cases = do
  forM_ cases $ \(path, n, m) -> do
    handle <- openFile ".profile-tmp" WriteMode
    t0 <- getCurrentTime
    results <- run path n m
    mapM_ (hPrint handle) results
    hClose handle
    t1 <- getCurrentTime
    putStrLn . unwords $ [
      path, show n, show m, "->", "Time:", show $ diffUTCTime t1 t0, "Count:", show $ length results]

run :: FilePath -> Int -> Int -> IO [Result]
run dictPath wordLen maskCount = do
  words <- readWords wordLen dictPath
  let sort = sortBy (compare `on` (\(Result c _ _) -> -c))
  let masks = genMasks wordLen maskCount
  return . sort $ calc masks words 

readWords :: Int -> FilePath -> IO [String]
readWords n path = return . filter ((==n) . length) . lines =<< readFile path

calc :: [Mask] -> [String] -> [Result]
calc masks words = do
  mask <- masks
  let group = Map.elems . Map.fromListWith (++)
  words' <- group [(applyMask' mask w, [w]) | w <- words]
  let count = length words'
  if count > 1 then
    return $ Result count (applyMask mask . head $ words') words'
  else
    fail ""

genMasks :: Int -> Int -> [Mask]
genMasks n m
  | n < m = []
  | m == 0 = [replicate n False]
  | otherwise = map (True:) (genMasks (n-1) (m-1)) ++ map (False:) (genMasks (n-1) m)

applyMask, applyMask' :: Mask -> String -> String
applyMask = zipWith (\b c -> if b then '-' else c)
applyMask' mask word = [c | (b, c) <- zip mask word, not b]

CoffeeScript Version

Try it on jsfiddle.

MASK_CHAR = '-'

output = (text) ->
  div = $('#output')
  div.html(div.html() + text + '\n')

clear = () ->
  $('#output').html('')

compare = (a, b) ->
  if a < b then return -1
  if a > b then return 1
  return 0

class Result
  constructor: (@mask, @words) ->
    @count = @words.length

  toString: () ->
    @count + ': ' + @mask + ': ' + @words.join ','

calc = (words, wordLen, maskCount) ->
  words1 = words.filter (w) -> w.length == wordLen
  results = []
  for mask in genMasks(wordLen, maskCount)
    calcForMask(words1, mask, results)
  results.sort (r1, r2) -> - compare(r1.count, r2.count)
  return results.filter (r) -> r.count > 1

applyMask = (mask, word) ->
  ((if mask[i] != MASK_CHAR then word[i] else MASK_CHAR) for i in [0..mask.length-1]).join ''

calcForMask = (words, mask, results) ->
  resultMap = {}
  for word in words
    maskedWord = applyMask(mask, word)
    if not resultMap[maskedWord]?
      resultMap[maskedWord] = [word]
    else
      resultMap[maskedWord].push(word)
  for maskedWord, words of resultMap
    results.push(new Result(maskedWord, words))

genMasks = (n, m) ->
  if n < m then return []
  if m == 0
    if n == 0 then return [[]]
    return [('x' for i in [1..n])]
  use = genMasks(n - 1, m - 1)
  notUse = genMasks(n - 1, m)
  for x in use
    x.push(MASK_CHAR)
  for x in notUse
    x.push('x')
  return use.concat notUse

run = (dictURL, wordLen, maskCount, onStart, onFinish) ->
  $.ajax
    method: 'get'
    url: dictURL
    dataType: 'text'
  .success (data) ->
    output('File size: ' + data.length)
    words = data.match(/[a-z]+/g)
    output('Calculating...')
    onStart(words)
    results = calc(words, wordLen, maskCount)
    onFinish(results)
  .fail (error) ->
    output('Failed to fetch words')
    output('Error: ' + arguments)

profile = (dictURL, wordLen, maskCount) ->
  timer = {}
  clear()
  # output(('-' for i in [1..120]).join '')
  output('Fetching words...')
  onStart = (words) ->
    output('Got ' + words.length + ' words')
    timer.timeStart = new Date()
  onFinish = (results) ->
    timer.timeEnd = new Date()
    timer.time = timer.timeEnd - timer.timeStart
    output('Time used: ' + (timer.time / 1000).toFixed(3) + ' sec')
    output('Count: ' + results.length)
    if $('#output_result').is(':checked')
      output((r.toString() for r in results).join '\n')
    output('Finished')

  run(dictURL, wordLen, maskCount, onStart, onFinish)

$ () ->
  $('#button').click () ->
    profile($('#url').val(), parseInt($('#word_len').val()), parseInt($('#mask_count').val()))
  # $('#button').click()

Another version with same algorithm: Python 3

import sys
import itertools
from collections import defaultdict

def calc(words, n, m):
    words = [w for w in words if len(w) == n]
    results = []
    for mask in itertools.combinations(range(n), m):
        groups = defaultdict(list)
        for word in words:
            word1 = list(word)
            for i in mask:
                word1[i] = '-'
            word1 = ''.join(word1)
            groups[word1].append(word)
        for word1, words1 in groups.items():
            if len(words1) > 1:
                results.append((len(words1), word1, words1))
    results.sort(key=lambda x: -x[0])
    return results

def main():
    path, n, m = sys.argv[1:]
    words = list(map(str.rstrip, open(path)))
    for c, w, ws in calc(words, int(n), int(m)):
        print('{}: {}: {}'.format(c, w, ','.join(ws)))

main()

Ray

Posted 2014-08-02T17:11:09.230

Reputation: 1 946

Interesting post! Unfortunately I don't understand how it works. Compiled on my desk, it runs a little slower than my Perl version, but that's not the right way to calculate the score. Could you offer a JavaScript version, or maybe explain how I can add score++ into your source... – F. Hauri – 2014-08-12T15:40:35.030

@F.Hauri I don't like javascript so I wrote a coffeescript instead. Updated the answer. – Ray – 2014-08-12T19:26:11.090

Wow, fine, thanks! You could have a look at my scoring method: Fiddle. Feel free to comment if you think I'm wrong!

– F. Hauri – 2014-08-12T20:18:06.777

1

C++11

Not quite sure how to define a "logical operation" so I may be a little more/less strict than you'd like, but I gave it a shot.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <list>
#include <map>
#include <queue>
#include <string>
#include <vector>

#include <cstring>

using namespace std;

static size_t _g_score = 0;

typedef vector<char> WordMask;
typedef vector<WordMask> WordMaskList;

typedef list<string const *> WordSet;

struct ScoredString : public string
{
    ScoredString(string s) : string(s) { }
    bool operator <(ScoredString const rhs) const
    {
            _g_score++;
            return strcmp(c_str(), rhs.c_str()) < 0;
    }
};

typedef map<ScoredString, WordSet *> WordMaskMap;

struct WordDatabase
{
    vector<string *> wordList;
    WordMaskMap maskMap;
};

static WordMaskList makeMaskList(int wordLen, int numBlanks);
static WordDatabase createWordDatabase(ifstream &file, int wordLen, WordMaskList const &masks);

int main(int argc, char **argv)
{
    if (argc < 4)
    {
            cout << "Usage:  perms <dict_file> <word_len> <num_blanks>";
            return 1;
    }

    ifstream file(argv[1]);
    if (!file.is_open())
    {
            cerr << "Failed to open dictionary file: [" << argv[1] << "]!" << endl;
            return 1;
    }

    int wordLen = atoi(argv[2]);

    // generate all mask permutations
    WordMaskList masks = makeMaskList(wordLen, atoi(argv[3]));

    // read file and create the database of words and their mappings
    WordDatabase db = createWordDatabase(file, wordLen, masks);

    // iterate the mapping database and create a priority (sorted) queue
    //   using the desired sorting criteria 
    struct QueueData
    {
            string const *mask;
            WordSet const *ws;
            size_t wsSz;

            bool operator <(QueueData const rhs) const
            {
                    _g_score++;
                    if (wsSz == rhs.wsSz)
                    {
                            return strcmp(mask->c_str(), rhs.mask->c_str()) < 0 ? false : true;
                    }

                    return wsSz < rhs.wsSz;
            }
    };

    priority_queue<QueueData> q;

    for (auto i = db.maskMap.begin(); i != db.maskMap.end(); ++i)
    {
            if (i->second->size() > 1)
            {
                    QueueData qd = {&i->first, i->second, i->second->size()};
                    q.push(qd);
                    _g_score++;
            }
    }

    // and emit the queue by removing items from it
    while (!q.empty())
    {
            QueueData qd = q.top();
            cout << qd.wsSz << ": " << *(qd.mask) << ": { ";
            for(auto i = qd.ws->begin(); i != qd.ws->end(); ++i)
            {
                    cout << *(*i) << " ";
            }
            cout << "}" << endl;
            q.pop();
            _g_score++;
    }

    cout << endl << "Score:  " << _g_score << endl;
    return 0;
}

static WordMaskList makeMaskList(int wordLen, int numBlanks)
{
    WordMaskList masks;

    WordMask mask(wordLen);

    fill_n(mask.begin(), numBlanks, true);
    do
    {
            masks.push_back(mask);

    } while (prev_permutation(mask.begin(), mask.end()));

    return masks;
}


static void mapMaskedWord(WordMaskMap &mapInto, string const &word, WordMask const &mask)
{
    string mword;
    int i = 0;
    for (auto j=mask.begin(); j != mask.end(); ++j, ++i)
    {
            mword += *j ? '-' : word[i];             
    }

    auto wsi = mapInto.find(mword);
    WordSet *ws;
    if (wsi != mapInto.end())
    {
            ws = wsi->second;
    }
    else
    {
            ws = new WordSet();

            mapInto.insert(make_pair(mword, ws));
    }

    ws->push_back(&word);
    _g_score++;
}

static WordDatabase createWordDatabase(ifstream &file, int wordLen, WordMaskList const &masks)
{
    WordDatabase db;
    string word;

    while (getline(file, word))
    {
            db.wordList.push_back(new string(word));

            if (word.size() == wordLen)
            {
                    for (auto mi = masks.begin(); mi != masks.end(); ++mi)
                    {
                            mapMaskedWord(db.maskMap, *(db.wordList.back()), *mi);
                    }
            }
    }

    return db;
}

The algorithm is pretty straightforward; I create a map of ("mask", word set) pairs as I read the dictionary file. Each word in the dictionary file having a matching length is inserted into the map's sets once per mask permutation. Then I iterate the entire map to create a sorted queue to print. This is filtered to only those map entries whose set has > 1 element.

I define an operation as:

  • any comparison performed by the map/queue to add elements
  • any insertion/deletion/iteration

Like I said, I may have been a little more/less lenient. To be honest, I'd prefer just to count the comparisons as that's what you're trying to limit most during the execution.

DreamWarrior

Posted 2014-08-02T17:11:09.230

Reputation: 571

Interesting approach, quite efficient. In order to match my score counting and check the mask generation method, I've added score++ before lines 124, 138 and 131, but this doesn't increase the score much. – F. Hauri – 2014-08-13T10:42:39.547

@F.Hauri; Ok, I also changed the set to a list to make insertions into the mask "set" more efficient since uniqueness is already guaranteed and I was inserting a pointer so any ordering provided by the set was useless. – DreamWarrior – 2014-08-13T15:17:01.773

0

Counter draft (Not an answer!)

I would try to use this as a draft for the final counting.

Scoreboard

Here is a score table where some attempts were counted or evaluated, using the shared dictionary file and the pairs (word length, missing chars): (4,1 5,2 11,3 11,9)

  Test pairs      4,1              5,2             11,3             11,9
  • 1? F.Hauri - charwalk + hash Fiddle

              105 564          289 575        3 329 871        2 849 711
    
  • 1? Ray - O(C(n, m) * m * L * n * log(n)) - Fiddle

               94 418          467 888       18 231 225        3 572 406
    
  • 1? DarwinBot - n log n based - no fiddle no score.

    TODO...
    
  • 2? edc65 - bitcount + hash Fiddle

              258 457          992 807       83 344 443       70 071 001
    
  • 2? DreamWarrior - filtered sorted queue - no fiddle

              322 239        1 563 388       55 460 636        8 471 745 
    
  • 3 es1024 - eggs, fries, milk and cream - Fiddle

           14 299 352       75 187 045          aborted        not tried
    
  • 4 TwoThe - parallelStream acceptableWordDistance - Fiddle

           22 991 392      112 794 153          aborted        not tried
    
  • 5 Sparr - exceptionally naive approach - (no fiddle)

           70 234 852    1 081 946 020          aborted        not tried
    

Note: there is a big difference between the two first-ranked answers:

While pairs with small numbers make @Ray's solution appear a lot quicker than mine, for big numbers this is totally reversed:

pairs       2,1      3,2          9,4     18,2         18,8
F.Hauri  63 353   69 279    6 706 593   73 772    4 623 416
Ray         598    7 490   23 065 888   73 416   26 075 903

@Ray seems to be using the same approach as @DarwinBot, but I don't yet understand how it works... I still have a lot of work to do all around, in order to finalize this scoreboard... :-b

Mask generation

This challenge contains one demanding step: mask or pattern generation.

Among the 4 answers, there are already 3 (maybe 4, I'm not totally sure) different approaches:

  • @Sparr seems to do roughly the same as @es1024: build a base mask containing WordLength − missing letters concatenated with the missing number of _, then iterate over all possible combinations.

  • @edc65 uses a binary-based construction, searching for shapes containing the correct number of missing letters.

  • @TwoThe uses a totally different method where word pairs are known, and from one pair we just have to search for the differences.

If I count correctly, taking as a sample a word length of 8 chars and 2 missing chars:

  • Iterating over all orderings of the 8 chars implies 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 40320 steps, which then have to be deduplicated, regardless of the number of missing chars.
  • Building binary masks for 8 chars generates 256 shapes, to be filtered for having the right number of 0 bits (or 1 bits, depending on how the routine works): 256 shapes + 256 tests + 28 mask creations = 540 steps (at least).

where, for 8 chars with 2 missing, there are 28 different masks.
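To illustrate how the 256 binary shapes get filtered down to 28 masks, here is a small Python sketch of both generation styles (my own illustration, not taken from any answer; function names are made up):

```python
from itertools import combinations

def masks_by_bits(length, missing):
    # edc65-style: enumerate all 2^length bit patterns and keep only
    # those whose popcount equals the number of missing letters.
    return [m for m in range(1 << length) if bin(m).count('1') == missing]

def masks_by_positions(length, missing):
    # Direct generation: exactly C(length, missing) tuples, no filtering pass.
    return list(combinations(range(length), missing))

print(len(masks_by_bits(8, 2)), len(masks_by_positions(8, 2)))  # 28 28
```

Both yield the same 28 masks for (8, 2); the difference is that the bit-pattern version pays for 256 candidates plus 256 popcount tests, while direct position generation produces only the 28 survivors.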

In a first-day comment, @Sparr was right when saying it looks like you already have a solution ;-)

For comparison, my 1st prototype algorithm was written in shell, where it's really easy to count steps (using set -x). So for an 8 char word length, and

  • 1 missing -> 8 masks, 32 steps
  • 2 missing -> 28 masks, 148 steps
  • 3 missing -> 56 masks, 392 steps
  • 4 missing -> 70 masks, 658 steps
  • 5 missing -> 56 masks, 728 steps
  • 6 missing -> 28 masks, 532 steps
  • 7 missing -> 8 masks, 248 steps

Note: using set -x is very strict for counting steps, as each comparison, test or assignment is counted as one step.
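The mask counts listed above are simply the binomial coefficients C(8, k), which can be checked in a line of Python:

```python
from math import comb

# Number of masks for word length 8, per number of missing letters k.
print([comb(8, k) for k in range(1, 8)])  # [8, 28, 56, 70, 56, 28, 8]
```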

Word iteration

There are mainly two different methods:

Where @es1024, @TwoThe and @Sparr use a double iteration to make comparisons, @edc65 uses only one iteration over all words, storing counters keyed by shape.

The second method seems more efficient for a big dictionary file. (It's the one I used in my first try. ;)
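The single-iteration method can be sketched in Python as follows (an illustration of the idea, not edc65's actual code): each word contributes one increment per shape, and no word is ever compared against another word.

```python
from collections import Counter
from itertools import combinations

def shape_counts(words, missing):
    """Single pass over the word list: count how many words share each shape."""
    counts = Counter()
    for word in words:
        for positions in combinations(range(len(word)), missing):
            chars = list(word)
            for p in positions:
                chars[p] = '-'
            counts[''.join(chars)] += 1
    return counts

counts = shape_counts(['shrewd', 'thread', 'spread'], 2)
print(counts['-hre-d'])  # 2
```

For N words this costs O(N * C(L, m)) counter updates, instead of the O(N^2) word-pair comparisons of the double-iteration approach.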

Algorithm improvement

  1. Instead of:

     for () { if () { ... } else { ... } }
    

    prefer to write:

     if () { for () { ... } } else { for () { ... } }
    

    whenever possible

  2. In the same spirit: filter the results before sorting!

    prefer

      keys.filter(match).sort().map( printout )
    

    over:

      keys.sort().map( if (match) { ... } )
    

    ... if you can.
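Applied to this challenge, tip 2 could look like this in Python (a toy illustration with made-up data): the cheap linear filter runs first, so the O(n log n) sort only touches the surviving shapes.

```python
# Hypothetical intermediate result: shape -> matching words.
shapes = {'-a-ing': ['saving', 'gaming', 'hating'],
          '-hre-d': ['shrewd', 'thread'],
          'zy-ote': ['zygote']}

# Filter first (cheap pass), then sort only the survivors.
survivors = [s for s in shapes if len(shapes[s]) > 1]
for shape in sorted(survivors, key=lambda s: len(shapes[s]), reverse=True):
    print(f"{len(shapes[shape])}: {shape}: {','.join(shapes[shape])}")
```

This prints the shapes from most to least matching, in the same output format as the challenge samples.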

F. Hauri

Posted 2014-08-02T17:11:09.230

Reputation: 2 654

This first post is subject to comments; there is no score counting for now... Please feel free to comment. – F. Hauri – 2014-08-04T23:21:24.913

I'd like to see more different answers. About counting the steps: some steps are 'heavier' than others. For instance, I could avoid calling bit count when building the mask, but these added steps make the algorithm faster, not slower. The same goes for the complex shifts and loops of the hashing function in the cpp version. – edc65 – 2014-08-06T16:05:56.703

@edc65 I agree; in fact, your answer is already the best (for now; I will publish mine at the end of this week). I like your JavaScript; I'm using it now for my own JavaScript version. (For info: a comparison at the 11,3 pair gives a score of 83 344 443 with your JS version (6605 ms), while my version reaches 3 329 871 (in 3.063 sec).) – F. Hauri – 2014-08-06T18:49:23.443

My answer is updated. Now it uses a hash table instead of sorting to group the intermediate results, so the log(n) factor in the time complexity should be dropped. The CoffeeScript code already uses this method, so you don't need to update the score. For easier understanding, I've added a Python version. – Ray – 2014-08-13T11:26:30.967