Find letter-similar words in a word list

(I am not sure that this is the best place to put this question, but I will give it a shot. Forgive me if it is improper. This is just a fun project with which I am tinkering and I am looking for advice on how best to implement it; but it is also something of a 'challenge' (not implying that it is hard!), so I thought that it might fit here)

Here is the basics of the problem.

I have a list of words (strings); basically, it is a partial dictionary in a language but nothing but the words are included. I want to check whether I can make any of the following exchanges in their spelling exactly once and yet still generate a word which is already present in the list:

• (b f) • (p v) • (c z) • (j s)

By "exchange exactly once", I mean that only one occurrence of one letter in a word gets switched to its partner.

If such an exchange produces a word in the list, then I want to be notified about it (by printing the pair of words and moving to the next line). I want to check the whole word list for any pair of words such that if exactly one occurrence of one of these letters in them is changed to the partner letter, then the new word is also in the list (being the other word).

To be clear, these are not all of the consonants which are present in the word list; there are also vowels. These letters are just pairs which I want to exchange.

So, "bats" would become "fats" and vice-versa, and perhaps both of those words are in the list; if so, then the check would tell me so. On the other hand, "baba" would generate "faba" or "bafa" (and vice-versa), but not "fafa" (because that would be exchanging two occurrences of one letter); likewise, "jazz" would generate "sazz", "jacz", and "jazc", but nothing else; none of the generated words in this example would be in an English word list.

There are additional constraints on the list though, which are helpful: • Every word in the list is four letters long (so I do not need to check string lengths). • No occurrence of a letter can be adjacent to another occurrence of the same letter. So, "jazz" does not occur in the list because there is the substring "zz", which adjacently repeats a letter. • "b", "j", "v", and "z" may be mutually adjacent, and "p", "c", "f", and "s" may be mutually adjacent, but none of those first four characters can be adjacent to any of the second four characters. • All words in the list are of the form 'CVCC' or the form 'CCVC', where 'C' represents a consonant and 'V' represents a vowel. This means that the list is partitioned into two classes of words.

The word list is approximately one thousand five hundred (1500) words long.

I want to do this efficiently. A really simply effort, which I will describe, seems rather poor.

This effort would be this:

0) Look at the file of words. Sort the word list alphabetically. (It comes this way)

1) Run through the word list and discard any of the words which lack any of these letters: "b", "p", "v", "f", "z", "s", "j", "c". Counting multiplicities (words which have multiple occurrences of these letters get counted as many times as there are occurrences), the list has N words; in other terms, these letters occur N times in the word list.

2) Sort the word list into two sublists based on the form of the words (put them into two different files and close the original one, then open each of these new files in turn?). One will have $N_1$ words and the other will have $N_2$ words, counting multiplicities (where $N_1 + N_2 = N$). It turns out that $N_1, N_2 > 300$.

3) Now, in each list, run through and compare each occurrence of a letter against the corresponding letter in each subsequent word. So, the first word will be checked against the second, third, fourth, ..., nth words - but really, the ith letter in the first word will be checked against the ith letter in the jth word, running over all $i \in [1,4], j>1$; then the second word will be compared against the third, fourth, fifth, ..., nth words; ...; all the way down to the penultimate word, which will end the loop. Comparison will just at most three checks (see if the first letters form such a pair, see if the second consonants form such a pair, see if the third consonants form such a pair; if the ith consonant of the word being checked against all of the subsequent words is not one of these letters, then we can ignore the ith consonant in all of the checks and move on to the (i+1)st consonant).

(I should say that words may differ in only their vowel without notifying me, such as "fats" and "fits". This is okay and I do not want to be notified about these pairs.)

4) If a word differs by only one occurrence of a letter from another word and the differing letters are one of the pairs which I mentioned, then print both words (separated by a space) and move to a new line.

Done.

Now, this is really slow. I think that it takes at least $(N1-1)! + (N2-1)!$ comparisons. I can speed it up a bit (for example: since the sorting is alphabetic, I know that "b"-initial words could/should be checked against only "f"-initial words for their first consonant (I do not need to run through the whole list - just the "f"-initial section of it) - but this is only for the first letters, and it makes the code somewhat less elegant).

Is there any better way to do this?

(If we are framing it as a challenge: Produce the method which needs to do the least number of comparisons. Ties will be broken in favor of code 'elegance' and brevity (in bytes).)

I am most familiar with C++ and want to do it in that language if possible. But I will take any advice or code which works.

regular-expression

user173897

Posted 2017-02-08T08:11:07.973

Reputation: 121

Shouldn't this go to Stack Overflow? – Matthew Roh – 2017-02-08T14:08:30.147

Do you think so? I was not sure. I will post it there in any case. – user173897 – 2017-02-08T21:23:18.970

Find letter-similar words in a word list

No answers