Average number of strings with Levenshtein distance up to 4

7

2

This is a version of this question which should not have such a straightforward solution and so should be more of an interesting coding challenge. It seems, for example, very likely there is no easy to find closed form solution, even though we have only increased the bound by one from the previous version. Having said that, you never know...

The Levenshtein distance between two strings is the minimum number of single character insertions, deletions, or substitutions to convert one string into the other one. Given a binary string \$S\$ of length \$n\$, we are a interested in the number of different strings of length \$n\$ which have distance at most \$4\$ from \$S\$.

For example, if \$S = 0000\$ there are four strings with Levenshtein distance exactly \$3\$ from \$S\$, six with distance exactly \$2\$, four with distance exactly \$1\$ and exactly one with distance \$0\$. This makes a total of \$15\$ distinct strings with distance at most \$3\$ from the string \$0000\$. The only string with distance exactly \$4\$ is \$1111\$.

For this task the input is a value of \$n \geq 4\$. Your code must output the average number of binary strings of length \$n\$ which have Levenshtein distance at most \$4\$ from a uniform and randomly sampled string \$S\$. Your answer can be output in any standard way you choose but it must be exact.

Examples

  • n = 4. Average \$16\$.
  • n = 5. Average 31 \$\frac{11}{16}\$.
  • n = 6. Average 61 \$\frac{21}{32}\$.
  • n = 7. Average 116 \$\frac{7}{8}\$.
  • n = 8. Average 214 \$\frac{43}{128}\$.
  • n = 9. Average 378 \$\frac{49}{246}\$.
  • n = 10. Average 640 \$\frac{301}{512}\$.
  • n = 11. Average 1042 \$\frac{1}{16}\$.
  • n = 12. Average 1631 \$\frac{1345}{2048}\$.
  • n = 13. Average 2466 \$\frac{3909}{4096}\$.
  • n = 14. Average 3614 \$\frac{563}{8192}\$

Score

Your score is the highest value of \$n\$ you can reach in less than a minute on my linux box. If two answers get the same value then the first posted (or amended) wins. The timing is for each value separately.

My CPU is an Intel(R) Xeon(R) CPU X5460.

Anush

Posted 2020-01-15T12:20:22.150

Reputation: 3 202

2less than a minute on my linux box - how are we to access your box for testing? ssh? we'll need an account ...could we come on over to yours? T_T – Noodle9 – 2020-01-15T13:58:10.627

4@Noodle9 :) It's a standard thing for a [tag:fastest-code] competition to say what machine the testing will be done on. You should also state when you answer what you get on your PC of course. – Anush – 2020-01-15T14:11:32.110

1Doesn't this admit the same type of solution as the previous challenge, with a 25-state NFA instead of a 12-state NFA? – Nitrodon – 2020-01-15T18:56:54.230

@Nitrodon I am not sure but the person who answered the previous question suggested it wasn't so simple in the comments. – Anush – 2020-01-15T20:12:14.810

@Nitrodon Hadn't thought about the concrete number before, but yes, 25 states should be correct – Christian Sievers – 2020-01-15T21:04:45.710

I see a DFA with 266 states, and a matrix with complex eigenvalues. – Christian Sievers – 2020-01-17T19:48:48.587

@ChristianSievers Fascinating! But what does it mean? – Anush – 2020-01-17T20:12:16.540

1@Anush Closed formula contains trigonometric functions (or powers of complex numbers), or has to be written as depending on $n \mod k$ for, in this case, $k\in{3,4}$ (maybe not both are needed), – Christian Sievers – 2020-01-17T20:28:58.647

Answers

5

GAP

So my previous approach works indeed when we replace the definition of nfa with this:

nfa := Automaton("epsilon", 25, 5,  
      [[[1,6,7],[2,8,9],[3,10,11],4,5,[6,18],
       [7,19],8,9,10,11,[8,22],[9,23],10,11,
       0,0,18,19,22,23,0,0,0,0],
      [[2,6,13],[3,8,15],[4,10,17],5,0,[8,18],
       [13,21],10,15,0,17,[6,22],[15,25],8,17,
       10,0,0,21,0,25,18,0,22,0],
      [[2,12,7],[3,14,9],[4,16,11],5,0,[12,20],
       [9,19],14,11,16,0,[14,24],[7,23],16,9,
       0,11,20,0,24,0,0,19,0,23],
      [[1,12,13],[2,14,15],[3,16,17],4,5,[14,20],
       [15,21],16,17,0,0,[12,24],[13,25],14,15,
       16,17,0,0,0,0,20,21,24,25],
      [0,0,0,0,0,3,3,4,4,5,5,3,3,4,4,5,5,10,
       11,16,17,10,11,16,17] ],
     [1], [1..25] );

This automaton was constructed with a program that I have explained here.

The resulting DFA has 266 states.

We can again compute as much values as we like:

n = 16. Average = 7150 23011/32768.
n = 17. Average = 9714 60059/65536.
n = 18. Average = 12940 42919/131072.
n = 19. Average = 16935 3985/8192.
n = 20. Average = 21817 225447/524288.
n = 21. Average = 27711 716517/1048576.
n = 22. Average = 34752 552357/2097152.
n = 23. Average = 43081 714481/1048576.
n = 24. Average = 52850 7923401/8388608.
n = 25. Average = 64219 9352665/16777216.
n = 50. Average = 1508800 387601224486599/562949953421312.
n = 100. Average = 29267070 38719469213378154341013566355/
633825300114114700748351602688.

The eigenvalues of the matrix are \$4,2,1,0\$ and \$-1\$ as before and additionally two conjugate pairs of complex numbers: \$\pm i\$, and the third root of unity and its conjugate (which is also its square and its inverse).

Usually complex eigenvalues mean that the closed formula contains trigonometric functions, but since there were already terms corresponding to the eigenvector \$1\$ to compensate, I took the liberty to use the periodic functions \$ n \bmod 3\$ and \$ -n \bmod 3\$ to handle the third root of unity.

The imaginary unit does not contribute to the closed formula. Neither does \$4\$, because it comes from the sink state, which is not counted. It can be removed from the automaton, and the resulting matrix no longer has this eigenvalue. The eigenvalue \$-1\$ does contribute, even though it didn't in the distance 3 case.

The resulting closed formula, which is quite long, works for \$n\ge 5\$. (That's an effect of the \$0\$ eigenvalues.) Since the challenge asks for \$n\ge 4\$, the following code has a special case for small values.

Pari/GP

f(n)=if(n>4,168264301/345744-4115011/16464*n+19597/336*n^2-161/24*n^3+17/48*n^4+(-9188617/19208-268724/3087*n-4658/441*n^2+449/81*n^3-515/324*n^4+(-1)^n/72+(56188/194481-824/9261*n+4/3969*n^2)*(n%3)+(-1756/64827+94/27783*n-26/3969*n^2)*(-n%3))*2^-n,2^n)

Try it online!

Christian Sievers

Posted 2020-01-15T12:20:22.150

Reputation: 6 366

Wow! I hope this answer is awarded a prize at some point. I am really looking forward to the math.SE answer as I would love to fully understand the steps needed. How much can be automated so we can try it with any values of $k$? – Anush – 2020-01-19T18:10:20.627