Average number of strings with Levenshtein distance up to 3

6

2

The Levenshtein distance between two strings is the minimum number of single character insertions, deletions, or substitutions to convert one string into the other one. Given a binary string \$S\$ of length \$n\$, we are a interested in the number of different strings of length \$n\$ which have distance at most \$3\$ from \$S\$.

For example, if \$S = 0000\$ there are four strings with Levenshtein distance exactly \$3\$ from \$S\$, six with distance exactly \$2\$, four with distance exactly \$1\$ and exactly one with distance \$0\$. This makes a total of \$15\$ distinct strings with distance at most \$3\$ from the string \$0000\$. The only string with distance greater than \$3\$ is \$1111\$.

However, this number depends on the value of \$S\$. For example if \$S = 0010\$ then the number of distinct strings with distance at most \$3\$ is \$16\$, in other words all of them.

For this task the input is a value of \$n \geq 3\$. Your code must output the average number of binary strings of length \$n\$ which have Levenshtein distance at most \$3\$ from a uniform and randomly sampled string \$S\$. Your answer can be output in any standard way you choose but it must be exact.

Examples

n = 3. Average = 8
n = 4. Average = 15 1/2
n = 5. Average = 28 13/16
n = 6. Average = 50 15/16
n = 7. Average = 85 23/64
n = 8. Average = 135 27/32
n = 9. Average = 206 77/256
n = 10. Average = 300 181/256
n = 11. Average = 423 67/1024
n = 12. Average = 577 99/256
n = 13. Average = 767 2793/4096.
n = 14. Average = 997 3931/4096.
n = 15. Average = 1272 3711/16384.

Score

Your score is the highest value of \$n\$ you can reach.

Anush

Posted 2019-12-27T14:39:42.103

Reputation: 3 202

Answers

9

GAP and the automata package

The average number is the number of pairs of words of length \$n\$ with Levenshtein distance up to three, divided by \$2^n\$.

It is not very difficult to construct an nondeterministic finite automaton over the alphabet of pairs of bits that accepts the word \$(a_1,b_1)(a_2,b_2)\dots(a_n,b_n)\$ iff the binary words \$a_1a_2\dots a_n\$ and \$b_1b_2\dots b_n\$ have Levenshtein distance up to three. My version uses 14 states. This automaton can be transformed into a minimal deterministic one, which has 39 states. From its transition function we can get a matrix that describes the number of ways we can get from one state to another. Now counting the number of ways we can get from the initial state to some accepting state is just a matter of multiplication.

The implicit recurrance could be simplified because some values are always equal, and it might be solved to give a closed formula, but it seems to be good enough as is.

LoadPackage("automata");

nfa := Automaton("nondet", 14, 4,
     [[[1,5,9],[2,7,11],3,4,[5,13],[7,13],[7,14],14,
       [9,13],[11,13],[11,14],14,13,14],
      [[2,5,10],[3,7,12],4,0,[7,14],[5,14],0,7,
       [10,14],[12,14],12,0,14,0],
      [[2,6,9],[3,8,11],4,0,[6,14],[8,14],8,0,
       [11,14],[9,14],0,11,14,0],
      [[1,6,10],[2,8,12],3,4,[8,13],[6,13],14,[8,14],
       [12,13],[10,13],14,[12,14],13,14] ],
     [1], [1..14] );

dfa := MinimalizedAut(nfa);

size := NumberStatesOfAutomaton(dfa);;

mat := NullMat(size, size);;
for row in TransitionMatrixOfAutomaton(dfa) do
  for i in [1..size] do
    mat[i][row[i]] := mat[i][row[i]]+1;
  od;
od;

init := 0 * [1..size];;
init[InitialStatesOfAutomaton(dfa)[1]] := 1;;

fin := 0 * [1..size];;
for i in FinalStatesOfAutomaton(dfa) do
  fin[i] := 1;
od;

f := function(n)
  local res, intpart, fraction;
  res := init * mat^n * fin / 2^n;
  intpart := Int(res);
  fraction := res-intpart;
  Print("n = ", n, ". Average = ", intpart);
  if fraction <> 0 then
    Print(" ",fraction);
  fi;
  Print(".\n");
end;

Try it online!

Put it in a file, start gap and read the file with a command like Read("l3.gap");, then try something like f(20); or for i in [0..100] do f(i); od;.

Here are some results:

n = 0. Average = 1.
n = 1. Average = 2.
n = 2. Average = 4.
n = 3. Average = 8.
n = 4. Average = 15 1/2.
n = 5. Average = 28 13/16.
n = 6. Average = 50 15/16.
n = 7. Average = 85 23/64.
n = 8. Average = 135 27/32.
n = 9. Average = 206 77/256.
n = 10. Average = 300 181/256.
n = 11. Average = 423 67/1024.
n = 12. Average = 577 99/256.
n = 13. Average = 767 2793/4096.
n = 14. Average = 997 3931/4096.
n = 15. Average = 1272 3711/16384.
n = 16. Average = 1594 3985/8192.
n = 17. Average = 1968 48645/65536.
n = 18. Average = 2398 65249/65536.
n = 19. Average = 2889 64891/262144.
n = 20. Average = 3443 16339/32768.
n = 30. Average = 13385 268434611/268435456.
n = 40. Average = 34128 68719475971/137438953472.
n = 50. Average = 69670 281474976708241/281474976710656.
n = 60. Average = 124013 36028797018963093/72057594037927936.
n = 70. Average = 201155 295147905179352821071/295147905179352825856.
n = 80. Average = 305098 75557863725914323416001/151115727451828646838272.
n = 90. Average = 439840 309485009821345068724773101/
309485009821345068724781056.
n = 100. Average = 609383 9903520314283042199192993177/
19807040628566084398385987584.
n = 1000. Average = 660694208 
669692879491417075592765655662501131600878007315958504652343992731469406953085\
076558248986759809911329746670573470716765741965803557696277249036098418660925\
245910485926514436588817162816398196367372136384565404686473871329212422972447\
846496629816432160699779855408885478776864478289024177325353755091/
133938575898283415118553131132500226320175601463191700930468798546293881390617\
015311649797351961982265949334114694143353148393160711539255449807219683732185\
049182097185302887317763432563279639273474427276913080937294774265842484594489\
5692993259632864321399559710817770957553728956578048354650708508672.
n = 10000. Average = 666066942458 
[fractional part removed]

Pari/GP, 51 bytes

All the eigenvalues of the matrix are integers (could I have known or expected that?), and I found this formula for \$n\ge 2\$:

f(n)=(40+6*n-4*n^2)/2^n-83/2+331/12*n-6*n^2+2/3*n^3

Try it online!

Christian Sievers

Posted 2019-12-27T14:39:42.103

Reputation: 6 366

Wow! Can this be extended to larger values than 3 easily? It looks like the part you did by hand is the original NFA. I hope this doesn't put other people off answering in different languages. – Anush – 2019-12-28T17:00:26.907

1It should be possible, I wouldn't say easily, but also not too difficult. I think I'd create the NFA by program. The number of states will increase exponentially: the possibility of insertions and deletions means that the words get out of sync, so $a_i$ has to be compared with e.g. $b_{i-1}$, so the state needs to remember the last value. The nice thing about distance 3 is that there is at most one insertion and one deletion. For bigger distances, the strings can get more out of sync, and more history has to be preserved in the state. – Christian Sievers – 2019-12-28T18:36:10.640

12 states are enough for the NFA: states 3 and 13 can be joined, and states 4 and 14 can also be joined. Of course that doesn't change the DFA and therefore makes almost no difference. (This is a result of thinking about generalisation.) – Christian Sievers – 2019-12-28T20:44:42.270

That sounds promising.. If you feel I should ask a separate question about generalisation please do let me know as I am very interested in it. – Anush – 2019-12-28T22:50:46.413

The bounty is definitely yours as soon as I am allowed to give it. What is the matrix explicitly? It would be interesting to see if we can see why the eigenvalues are integers. – Anush – 2019-12-31T11:28:42.810

To see the matrix, replace the footer in the first TIO link with Print(mat);. By construction it is a matrix with nonnegative integers as entries, all rows sums equal to 4 and some symmetries from related states. – Christian Sievers – 2019-12-31T11:46:49.930

Thank you very much. – Anush – 2019-12-31T12:10:43.430

What is your current thinking about how hard n=4 would be? I am tempted to ask a follow-up question. – Anush – 2020-01-01T10:19:33.457

I think $n=4$ would not be difficult, but I'm not completely sure if the computation of the DFA is feasible. I wonder if it is too similar for a question here. BTW, I like your other Levenshtein question. I always like variations of questions I've put a lot of effort into. – Christian Sievers – 2020-01-01T17:03:28.693

Thank you in relation to the other question. I am hopeful for some more interesting answers to that. I will wait a bit before posing the n = 4 version which might help its acceptance. I like to ask variants of hard questions for exactly the reason you suggest. – Anush – 2020-01-01T17:53:35.673