Find all strings at distance k

6

The edit (or Levenshtein) distance between two strings is the minimal number of single character insertions, deletions and substitutions needed to transform one string into the other.

The challenge is to write code to output all strings with edit distance exactly k from a fixed string of length 10, where k is a variable set in your code as is the fixed string. The alphabet is 0 or 1. To test, take

1111011010

as the fixed string. The twist is that you can't output the same string twice.

You may not use any library or built in functions that compute the Levenshtein distance for you.

Score

This is a standard code-golf challenge. You can assume k will be a one-digit number.

Reference Implementation

(added by Martin Büttner)

For reference, here is a Mathematica solution (which isn't valid, due to use of EditDistance) which can be used to check results for correctness:

s = "1111011010";
k = 2;
Union @ Cases[
     Join @@ Array[# <> "" & /@ {"0", "1"}~Tuples~# &, 2 k + 1, 10 - k],
     x_ /; x~EditDistance~s == k
]

For the first ten values of k I the following results:

 k | # of results

 0              1
 1             28
 2            302
 3           1652
 4           5533
 5          14533
 6          34808
 7          80407
 8         180663
 9         395923

user9206

Posted 2014-09-19T17:03:48.663

Reputation:

3I'd suggest you make the input string variable, too. Also, you might want to specify which forms of I/O are allowed. Full program? Function? ARGV? STDIN? STDOUT? Return value? Oh one more thing, all strings over which alphabet? – Martin Ender – 2014-09-19T17:05:35.100

1And examples are always a good idea. Plus explaining what "edit distance" in your description AND linking to wikipedia is also good. – AndoDaan – 2014-09-19T17:11:39.643

2What constraints are there on the alphabet? Even for 1/0 only, there are a large number of outputs for a given string length and distance. It's much worse if you're using [a-z], [A-Za-z], etc. – Geobits – 2014-09-19T17:29:42.893

Reminds me of the "Causes Challenge". And Norvig's spell checker. The additional constraint that the found words are in some dictionary might also be interesting; it would mean you can iterate the word list and compute edit distance, or chose to mutate the input string and check each against the dictionary. The no-word-twice seems to be natural to both approaches, though, and not be hard to program around. – Will – 2014-09-19T17:36:34.617

@MartinBüttner Thanks. No input needed. You can just put the string directly into the source. – None – 2014-09-19T17:44:47.557

@Geobits The alphabet is 1 or 0. The output size can be large is k is large and so is the string. – None – 2014-09-19T17:47:22.473

Do we count the length of the code by assuming k is set to a one-digit number such as 1? – Ingo Bürk – 2014-09-19T18:08:08.410

1@IngoBürk Usually, if the input can be hardcoded that means you just assume that k already has its value, so you don't count the assignment. – Martin Ender – 2014-09-19T18:09:44.427

@IngoBürk Yes assume k is a one-digit number. – None – 2014-09-19T18:10:58.500

1Can you give some example results? I get 302 different strings for k = 2. – Martin Ender – 2014-09-19T18:57:08.747

@MartinBüttner Banned. See update. – None – 2014-09-19T19:10:26.443

1

I've added the number of results for different k so people can check there submissions (along with the script used to generate them). I would have added a CW answer but that's not where test data belongs.

– Martin Ender – 2014-09-20T12:45:47.727

Can we assume klength(s)? – TwiNight – 2014-09-21T07:29:13.503

@TwiNight length(s) = 10 and k is a single digit number.. so ... yes :) – None – 2014-09-21T15:59:24.653

@Lembik I was thinking about cases like s='1010' and k=5 – TwiNight – 2014-09-22T03:43:05.670

Answers

1

CJam, 72 64 bytes

QaK{_{_,),\f{[\:Lm>_,L>{_)1^+_);@}*_'0+\'1+]{Lm<}/}~}%_&}*]W%:-p

Assumes the string is stored in Q and the integer in K. This can be achieved by executing

"1111011010":Q;1:K;

for example. Try it online.

Example run

$ cjam <(echo '"1111011010":Q;1:K;'; cat edit-distance.cjam)
["1111011011" "11110110100" "11110110101" "111101101" "1111011000" "11110110110" "111101100" "1111011110" "11110110010" "111101110" "1111010010" "11110111010" "111101010" "1111001010" "11110101010" "1111111010" "11110011010" "111111010" "1110011010" "11111011010" "111011010" "1101011010" "11101011010" "1011011010" "11011011010" "0111011010" "10111011010" "01111011010"]
$ cjam <(echo '"1111011010":Q;1:K;'; sed 's/p/,p/' < edit-distance.cjam)
28
$ cjam <(echo '"1111011010":Q;2:K;'; sed 's/p/,p/' < edit-distance.cjam)
302

How it works

The idea is to recursively compute all strings at distance d or less ("or less" part not exhaustive), then remove the strings at distance k - 1 or less from the array of strings at distance k or less.

Qa                " Push R := [Q].                                                    ";
K{                " Do the following K times:                                         ";
  _               " Push a copy of R.                                                 ";
  {               " For each S ∊ R:                                                   ";
    _,),\         " Push A := [ 0 1 ... len(S) ] below S.                             ";
    f{            " For each I ∊ A, push I S and:                                     ";
      [           "                                                                   ";
        \:Lm>     " Rotate S (L := I) characters to the right.                        ";
        _,L       " Push len(S) L.                                                    ";
        >{        " If len(S) > L:                                                    ";
          _)1^+   " XOR the last character code of a copy of S with 1.                ";
          _);@    " Remove the last character of a copy of S.                         ";
        }*        "                                                                   ";
        _'0+      " Append '0' to a copy of S.                                        ";
        \'1+      " Append '1' to S.                                                  ";
      ]           " Collect the modified values of S in an array T.                   ";
      {Lm<}/      " For each U ∊ T, push U rotated L characters to the left.          ";
    }~            " Dump the resulting array on the stack.                            ";
  }%              " Collect the results in R.                                         ";
  _&              " Remove duplicates from R.                                         ";
}*                "                                                                   ";
]W%               " Collect the different values of R in an array and reverse it.     ";
:-                " Reduce the array using setwise difference.                        ";
p                 " Print a string representation of the resulting array.             ";

Dennis

Posted 2014-09-19T17:03:48.663

Reputation: 196 637

Very impressive! Could you give some explanation of how you did it? – None – 2014-09-30T11:47:12.103

1I always do, I just like to make sure I'm done golfing first. I'll post an explanation in a few hours. – Dennis – 2014-09-30T13:37:48.913

3

Scala 168 228

My first submission had a bug where it didn't consider multiple insertions at the same point in the input string.

After

val k = 1
val s = "1111011010"

The code that counts is:

def e(k:Int,s:String):Set[String]=if(k<1)Set(s)else e(k-1,s).flatMap(e=>"01"map(_+e))++(if(s=="")Set()else e(k,s.tail).map(s.head+_)++e(k-1,s.tail).flatMap(e=>(Set("0","1","")-(s take 1))map(_+e)))
print(e(k,s))

To run, save both sections together as a file and run like:

$ scala distancek.scala
Set(111111010, 111101010, 1111001010, 11110011010, 1101011010, 11101011010, 111101101, 111101100, 11111011010, 1111011011, 1111011000, 11110101010, 1111010010, 11110110110, 11110110101, 111101110, 11110110010, 0111011010, 1110011010, 11110110100, 10111011010, 11011011010, 01111011010, 1011011010, 1111111010, 111011010, 11110111010, 1111011110)
$

Recursive function to build the edited strings. k is the number of edits remaining, s is the string to edit. Slightly ungolfed:

def e(k:Int,s:String):Set[String]=
  if(k<1) Set(s)                           // No edits left
  else e(k-1,s).flatMap(e=>"01"map(_+e))++ // Insert a character, consume an edit, continue from the same point
      (if(s=="") Set()                     // We're at the end of the string so only insertions are allowed
       else e(k,s.tail).map(s.head+_)++    // skip past this character, making no edit
       e(k-1,s.tail).flatMap(e=>(Set("0","1","")-(s take 1))map(_+e))) // replace this character with its inverse or ""

Note that each step makes at least one of k, s smaller so we're guaranteed to terminate.


First submission (Incorrect)

print((0 to 10).permutations.flatMap{i=>((s.map(_+""):+"")zip i).:\(Set("")){case((s,i),a)=>a.flatMap{a=>(if(i<k)Set(1+s,0+s,"","0","1")-s else Set(s))map(_+a)}}}toSet)

We take the permutations of 0..10 and zip them with the word s + a terminal "" (for insertions at the end). All the positions < k are positions where an edit should occur. For each edit we replace the string at that position with

Set(1+s,0+s,"","0","1")-s # don't count substituting s for itself as an edit.

and finally convert to a set to remove duplicates.

paradigmsort

Posted 2014-09-19T17:03:48.663

Reputation: 71

Could you provide instructions for how to run your code and get it to output the list of strings at distance k? I have never used scala and would like to test it. – None – 2014-09-20T04:25:40.610

2

JavaScript (E6) 167 183

Assuming the definition of variables k (required distance) and s (fixed string, 1111011010 for test). As requested it outputs in console just the list of string, so good luck for counting.

o={},o[s]=1,
A=x=>o[x]||(o[x]=d,d-k||console.log(x)),
S=r=>{
  for(l='';A(l+1+r),A(l+0+r),c=r[0];l+=c)
    r=r.slice(1),A(l+r),A(l+1+r),A(l+0+r)
};
for(d=0;++d<=k;)
  for(q in o)S(q)

Less golfed and more sensible, a function with 2 parameters and a count in output

F=(k,s)=>
{
  o={}, o[s]=1, h=0,
  add=x=>{
    o[x]||(o[x] = d, d == k && console.log(++h,x));
  },
  scan=s=>{
    add(s+1)
    add(s+0)
    for (l='',r=s,i=0; r; l += c) // l,r:left and right
    {
      add(l+1+r)
      add(l+0+r)
      c=r[0]
      r=r.slice(1)
      add(l+r)
      add(l+1+r)
      add(l+0+r)
    }
  };
  for(d=0; ++d <= k;)
    for (q in o) 
      scan(q)
}

Test In FireFox/FireBug console

F(1,'1111011010')      

Output

1 11110110101
2 11110110100
3 11111011010
4 01111011010
5 111011010
6 0111011010
7 10111011010
8 1011011010
9 11011011010
10 1101011010
11 11101011010
12 1110011010
13 11110011010
14 111111010
15 1111111010
16 11110111010
17 111101010
18 1111001010
19 11110101010
20 1111010010
21 11110110010
22 111101110
23 1111011110
24 11110110110
25 111101100
26 1111011000
27 111101101
28 1111011011

Just counting, no string output:

10 -> 850064 (in 3min & 600 MB)
11 -> 1795537 (in 8min & 900 MB)

edc65

Posted 2014-09-19T17:03:48.663

Reputation: 31 086

2

Haskell, 136 134

import Data.List
n%l=iterate(>>=foldr(\t@(x:y)r->['0':t,'1':t,'0':y,'1':y,y]++map(x:)r)["0","1"].init.tails)[l]!!n
r=nub(n%l)\\(n-1)%l

upon running r will become the list of all strings with edit distance of 3 from "1111011010". the string is embedded in the code at line 5 and the required distance is at the end of the last line.

edit: explanation is removed because it is nor relevant anymore

proud haskeller

Posted 2014-09-19T17:03:48.663

Reputation: 5 866

Would you mind adding a little explanation? Your solution looks very interesting. Also, other people haven't included the length of the string in their count. – None – 2014-09-21T18:16:39.037

@Lembik should i assume any Haskell knowledge in my explanation? – proud haskeller – 2014-09-21T19:25:09.050

I love your answer.. sadly CJam always wins. But why did you remove the explanation? It was very informative. – None – 2014-10-03T08:57:19.643

@Lembik Some of it was out of date and I didn't want to update it at the time. Do you want a renewed version? – proud haskeller – 2014-10-03T09:01:01.793

Yes please! If you don't mind. I also posed a new question you might like. – None – 2014-10-03T09:12:24.283

1

Mathematica, 244 bytes

l=Length;e[s_,t_]:=(m=l@s;n=l@t;Clear@f;f[0,m_]:=m;f[m_,0]:=m;For[j=0,j++<n,For[i=0,i++<m,p=f[i-1,j-1];f[i,j]=If[s[[i]]==t[[j]],p,Min[p,f[i,j-1],f[i-1,j]]+1]]];f[m,n])#<>""&/@Union@Cases[Join@@Array[{"0","1"}~Tuples~#&,2k+1,l@s-k],x_/;x~e~s==k]

This is basically the same as the reference implementation I added to the question, except I reimplemented EditDistance (as e) using the Wagner-Fischer algorithm.

That is, I simply generate all possible strings of the relevant lengths, and then I just reject all which aren't an edit distance of 2 away from the input string.

This expects the distance to be stored in k and the input as an array of characters in s.

Less golf:

l = Length;
e[s_, t_] := (
  m = l@s; n = l@t;
  Clear@f;
  f[0, m_] := m;
  f[m_, 0] := m;
  For[j = 0, j++ < n,
   For[i = 0, i++ < m,
    p = f[i - 1, j - 1];
    f[i, j] = If[s[[i]] == t[[j]],
      p,
      Min[p, f[i, j - 1], f[i - 1, j]] + 1
      ]
    ]
   ];
  f[m, n]
  )
# <> "" & /@ 
 Union@Cases[Join @@ Array[{"0", "1"}~Tuples~# &, 2 k + 1, l@s - k], 
   x_ /; x~e~s == k]

Martin Ender

Posted 2014-09-19T17:03:48.663

Reputation: 184 808

1

APL, 109 103 101

Assuming s and k are defined

~⍨/{{∪⊃,/{⍵≡'':⍕¨0 1⋄(,/⍕¨↑(⍎¨⍵)∘≠¨↓∘.=⍨x),(/∘⍵¨↓∘.≠⍨x),(,⍨-0,x)⌽¨,'01'∘.,⌽∘⍵¨0,x←⍳⍴⍵}¨⍵}⍣⍵⊂s}¨0,⍳k

Explantion
The program consists of 3 nested functions.

The inner function {⍵≡'':⍕¨0 1⋄(,/⍕¨↑(⍎¨⍵)∘≠¨↓∘.=⍨x),(/∘⍵¨↓∘.≠⍨x),(,⍨-0,x)⌽¨,'01'∘.,⌽∘⍵¨0,x←⍳⍴⍵}returns all string that is distance 1 from the argument

(,⍨-0,x)⌽¨,'01'∘.,⌽∘⍵¨0,x←⍳⍴⍵ calculate insertions:
⍴⍵ length of the argument
generate array from 1 to that number
x← assign to variable x
0, concatenate 0 to the start of the array
⌽∘⍵¨ and for each of the number in the array, rotate the argument by that amount, generating an array of strings (rotate is defined as taking characters from the start of the string and appending to the back)
'01'∘., generate a matrix whose first row are the strings with 0 concatenated to the start and the second row are the strings with 1 concatenated to the start , unravel the matrix to give a simple array of its elements
(,⍨-0,x) give the negatives of the array constructed by concatenating 0 to the start of x, and then concatenate that with another copy of it
⌽¨ finally, rotate the strings back by using the negative numbers

(/∘⍵¨↓∘.≠⍨x) calculates deletions:
∘.≠⍨x generate a square matrix consisting of 0's on the diagonal and 1's elsewhere, whose side is same length as x
split the matrix rows to create a nested array
/∘⍵¨ for each of those arrays, replicate the characters of the argument according the the array (in the case where the array only contains 0's and 1's, it means where there is a 1 in the array, keep the character in the corresponding index and where there is a 0, discard the corresponding character)

(,/⍕¨↑(⍎¨⍵)∘≠¨↓∘.=⍨x) calculates substitutions:
∘.=⍨x generate a square matrix consisting of 1's on the diagonal and 0's elsewhere, whose side is same length as x
split the matrix rows to create a nested array
(⍎¨⍵)∘≠¨ generate an array by converting the argument into a numerical array of its digits, then XOR it with each of the arrays from the split
"mix" the nested array to create a matrix (reverse of split) ⍕¨ for each element of the matrix (which is either 1 or 0), convert to string
,/ for each row of the matrix, concatenate the elements

⍵≡'':⍕¨0 1 is a special case for the empty string (the substitution part will throw an error on empty string)
⍵≡'': if the argument is the empty string...
⍕¨0 1 return this (an array of the strings '0' and '1')

The middle function {∪⊃,/{...}¨⍵} takes an array of strings and return an array of all strings that is distance 1 from any of them

{...}¨⍵ for each of the strings in the argument, use the inner function to get all strings distance 1 for it
,/ concatenate all of the returned arrays
as the result is "wrapped" in a scalar, "unwrap" it
remove duplicates

The outer function {{...}⍣⍵⊂s}, given a number, calculates all string that is potentially that distance from s

{...}⍣⍵ repeat the middle function times ( is the argument)
⊂s "wrap" s and call the function with that as the argument

The remaining part ~⍨/{...}¨0,⍳k

0,⍳k generate array from 0 to k
{...}¨ for each of those numbers, call the outer function with it
~⍨/ remove the strings found in distances 0,1,2,...,k-1 from the strings found in distance k. This is necessary because the "distance 1" function is stateless - it does not consider stuff like substituting the same character twice or inserting and deleting the same character.

TwiNight

Posted 2014-09-19T17:03:48.663

Reputation: 4 187

For those of us not used to APL, can you explain a) how to run your code in practice on Linux and b) why you get to use non-ASCII characters? – None – 2014-09-21T16:03:20.260

@Lembik Dyalog used to be free for non-commercial use but I just checked and it is no longer free... Then the best I can do is to provide results/screenshots for the input you want. – TwiNight – 2014-09-22T03:42:22.210

@Lembik 2) because that's how APL works? – edc65 – 2014-09-22T10:16:01.093

I tried this out using http://www.gnu.org/software/apl/ but with no luck. I just ran "apl" and then tried to copy and paste your code. I get "SYNTAX ERROR".

– None – 2014-09-24T17:59:44.780

I think http://ngn.github.io/apl/web/index.html would work. Could you explain how to set k to 1 and s to 1111011010 so I can test your code?

– None – 2014-09-24T18:06:30.310

@Lembik: Prepend s←"1111011010" and k←2 as two separate lines. – Dennis – 2014-09-29T14:42:22.613