Estimate the mean minimum Hamming distance

3

2

Task

Inputs \$b \leq 100\$ and \$n \geq 2\$. Consider \$n\$ binary strings, each of length \$b\$ sampled uniformly and independently. We would like to compute the expected minimum Hamming distance between any pair. If \$n = 2\$ the answer is always \$b/2\$.

Correctness

Your code should ideally be within \$\pm0.5\$ of the correct mean. However, as I don't know what the correct mean is, for values of \$n\$ which are not too large I have computed an estimate and your code should be within \$\pm0.5\$ of my estimate (I believe my estimate is within \$\pm0.1\$ of the correct mean). For larger values than my independent tests will allow, your answer should explain why it is correct.

  • \$b = 99, n = 2^{16}.\$ Estimated avg. \$19.61\$.
  • \$b = 89, n = 2^{16}.\$ Estimated avg. \$16.3\$.
  • \$b = 79, n = 2^{16}.\$ Estimated avg. \$13.09\$.
  • \$b = 69, n = 2^{16}.\$ Estimated avg. \$10.03\$.
  • \$b = 59, n = 2^{16}.\$ Estimated avg. \$7.03\$.
  • \$b = 99, n = 2^{15}.\$ Estimated avg. \$20.67\$.
  • \$b = 89, n = 2^{15}.\$ Estimated avg. \$17.26\$.
  • \$b = 79, n = 2^{15}.\$ Estimated avg. \$13.99\$.
  • \$b = 69, n = 2^{15}.\$ Estimated avg. \$10.73\$.
  • \$b = 59, n = 2^{15}.\$ Estimated avg. \$7.74\$.
  • \$b = 99, n = 2^{14}.\$ Estimated avg. \$21.65\$.
  • \$b = 89, n = 2^{14}.\$ Estimated avg. \$18.23\$.
  • \$b = 79, n = 2^{14}.\$ Estimated avg. \$14.83\$.
  • \$b = 69, n = 2^{14}.\$ Estimated avg. \$11.57\$.
  • \$b = 59, n = 2^{14}.\$ Estimated avg. \$8.46\$.

Score

Your base score will be the number of the examples above that your code gets right within \$\pm0.5\$. On top of this you should add \$x > 16\$ for the largest \$n = 2^x\$ for which your code gives the right answer within \$\pm0.5\$ for all of \$b = 59, 69, 79, 89, 99\$ (and also all smaller values of \$x\$). That is, if your code can achieve this for \$n=2^{18}\$ then you should add \$18\$ to your score. You only get this extra score if you can also solve all the instances for all smaller \$n\$ as well. The number of strings \$n\$ will always be a power of two.

Time limits

For a single \$n, b\$ pair, your code should run on TIO without timing out.

Rougher approximation answers for larger values of \$n\$

For larger values of \$n\$ my independent estimates will not be as accurate but may serve as a helpful guide nonetheless. I will add them here as I can compute them.

  • \$b = 99, n = 2^{17}.\$ Rough estimate \$18.7\$.
  • \$b = 99, n = 2^{18}.\$ Rough estimate \$17.9\$.
  • \$b = 99, n = 2^{19}.\$ Rough estimate \$16.8\$.

Anush

Posted 2020-02-07T22:19:07.883

Reputation: 3 202

4I thought this was posted on the Sandbox earlier today. – S.S. Anne – 2020-02-08T01:19:06.573

@S.S.Anne Yes I posted it there first. I look forward to the first answers! – Anush – 2020-02-08T06:38:39.603

1The time limit is probably unnecessary as it both limits users to tio languages and needlessly disqualifies solutions that take a long time, but do solve the problem. I understand wanting to prevent brute forcing, but I’d argue that should also be valid – ATaco – 2020-02-08T07:44:53.657

1@ATaco I understand that point of view but it completely changes the challenge. Brute force solutions are valid here of course. It's just a question of what score you get. The time limit/TIO is also crucial as otherwise you will just get a higher score by having a more powerful computer or leaving it to run for longer. – Anush – 2020-02-08T07:48:44.597

Using TIO for timing is not a good idea. Had you left this in the sandbox longer, I'd have suggested you switching to requiring that submissions pass your examples within a certain time limit which you would verify (like [tag:fastest-code]). What you currently have makes it extremely difficult to determine what the score of any submission should actually be.

– FryAmTheEggman – 2020-02-08T19:03:11.707

@FryAmTheEggman I understand the complications all too well. [tag:fastest-code] is highly non-ideal as people can't test their code properly for speed themselves. Using TIO is non-ideal because it's not super reliable timing-wise. For this challenge where the exact timing is not so critical I decided that on balance it was the best option. Clearly if you have to report an exact time TIO is no good. It would be awesome if there were a server we could use for timing. – Anush – 2020-02-08T19:10:40.877

Answers

11

Python 3, score = big(?)

from math import exp, log, log1p

def f(b, n):
    e = n * (n - 1) / 2
    m = 0
    c = 1
    s = 0
    t = 1 << b
    for k in range(b):
        s += c
        m += exp(e * (log1p(-s / t) if 2 * s < t else log((t - s) / t)))
        c = c * (b - k) // (k + 1)
    return m

Try it online!

The Hamming distance \$D_{x, y}\$ between any two of the strings (\$1 \le x < y \le n\$) is a random variable distributed as the Binomial distribution \$B{\left(b, \tfrac12\right)}\$:

$$ \Pr[D_{x, y} > k] = 1 - \frac{1}{2^b} \sum_{i=0}^k \binom bi. $$

Furthermore, any (distinct) two of these \$\binom n2\$ random variables are independent.

If we make the incorrect but empirically useful assumption that all \$\binom n2\$ of these random variables are independent, then we can compute the distribution of their minimum \$D = \min_{1 \le x < y \le n} D_{x,y}\$ exactly:

$$ \begin{gather*} \Pr[D > k] = \left(1 - \frac{1}{2^b} \sum_{i=0}^k \binom bi\right)^{\binom n2}, \\ \mathbb E[D] = \sum_{k=0}^{b-1} \Pr[D > k]. \end{gather*} $$

This result can be computed extremely quickly and seems to be close enough. Numerical evidence suggests that the absolute error is worst at \$b = 1, n = 3\$ (where we estimate the expected minimum as \$\tfrac18\$ when it is actually \$0\$), and drops off quickly to zero as \$b\$ and/or \$n\$ get large.

Anders Kaseorg

Posted 2020-02-07T22:19:07.883

Reputation: 29 242

Awesome! And quickly done too. – Anush – 2020-02-08T11:37:38.157