Can you calculate the average Levenshtein distance exactly?

11

3

The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, or substitutions needed to convert one string into the other.

The challenge is to compute the average Levenshtein distance between two independently and uniformly chosen random binary strings of length n each. Your output must be exact but can be given in any easy-to-understand, human-readable form.

Examples:

These are the answers for n up to 24.

1 1/2 
2 1 
3 47/32 
4 243/128 
5 1179/512
6 2755/1024 
7 12561/4096 
8 56261/16384
9 124329/32768 
10 2175407/524288 
11 589839/131072
12 40664257/8388608 
13 174219279/33554432 
14 742795299/134217728
15 1576845897/268435456
16 13340661075/2147483648
17 14062798725/2147483648
18 59125997473/8589934592
19 123976260203/17179869184
20 259354089603/34359738368
21 8662782598909/1099511627776
22 72199426617073/8796093022208
23 150173613383989/17592186044416
24 1247439983177201/140737488355328
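
For reference, here is a minimal brute-force sketch in Python (it enumerates all 4^n ordered pairs, so it is useless score-wise) that reproduces the first rows of the table and can be used to validate faster solutions:

from fractions import Fraction
from itertools import product

def lev(s, t):
    # Textbook dynamic programming, one row at a time.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(cur[j - 1] + 1,             # insertion
                           prev[j] + 1,                # deletion
                           prev[j - 1] + (a != b)))    # substitution or match
        prev = cur
    return prev[-1]

def average(n):
    # Exact mean over all 2^n * 2^n ordered pairs of binary strings.
    strings = ["".join(bits) for bits in product("01", repeat=n)]
    total = sum(lev(s, t) for s in strings for t in strings)
    return Fraction(total, 4 ** n)

for n in range(1, 7):
    print(n, average(n))  # 1 1/2, 2 1, 3 47/32, ...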

Score

Your score is the highest value of n you can reach. Where humanly possible, I will run your code on my Linux machine for 10 minutes and then kill the job to get the score.

Notes

As always, this should be a competition per language. I will maintain a leaderboard that shows the best score for each language used in an answer. I will also give a bounty of 50 points for the first answer to get to n = 20.

My CPU is an Intel(R) Xeon(R) CPU X5460.

Leaderboard

  • n = 18 in Python+numba by Shamis (timing pending...).
  • n = 19 in Java by Bob Genom (278 seconds).
  • n = 19 in C by ngn (257 seconds).
  • n = 21 in Rust by Anders Kaseorg (297 seconds). 150 point bonus awarded.

Anush

Posted 2019-12-29T20:41:36.757

Reputation: 3 202

@LuisMendo I managed to update it from my phone. – Anush – 2019-12-29T22:15:54.850

1 This formulation of the question allows you to pre-create a table of output values. – Кирилл Малышев – 2019-12-30T02:53:41.533

2 If you prohibit tables, it will be possible to derive a mathematical function from the table that will give the correct result for the selected n. – Кирилл Малышев – 2019-12-30T02:57:39.087

2

Interesting, even the limiting form of the average distance is not known precisely, so I doubt a closed form is known.

– xnor – 2019-12-30T06:15:16.233

4 @КириллМалышев The standard loophole rules prohibit precomputing the answers. If you manage to work out a closed formula, then you deserve both the win and the bounty that I will give you. – Anush – 2019-12-30T08:30:58.600

1"As always this should be a competition per language" why on earth would a [fastest-code] competition be per-language? To prevent people from creating magic languages that excel at calculating the average Levenshtein distance exactly and nothing else and easily winning? We're supposedly comparing the programs' performance and ignoring their code in [fastest-code], similarly to how we ignore the programs' performance and compare their code in [code-golf] challenges. The equivalent of per-language things in [code-golf] is probably a per-CPU competition in [fastest-code] or something. – my pronoun is monicareinstate – 2019-12-31T05:58:01.720

@mypronounismonicareinstate My reasoning is that we are testing the ability of the coder not the different programming languages. If you can make a Python program that is faster than anyone else’s then that is great. It’s fairly obvious that hand optimised assembly is always the fastest programming language. To compare with code-golf, we are comparing the performance and ignoring the length of the code. – Anush – 2019-12-31T06:18:01.137

1 @Anush “It’s fairly obvious that hand optimised assembly is always the fastest programming language.” This is actually false, because most assembly nowadays is actually more like C bytecode than the actual primitives the processor will execute. You've got microcode, vectorisation… – wizzwizz4 – 2019-12-31T18:42:37.970

@wizzwizz4 Also optimal register allocation by hand is not very easy. – Anush – 2019-12-31T19:37:54.683

I noticed that my solution does not scale very well on your test computer: (version: n, mine, yours, factor) -> {(v4 17 148 452 3.1), (v3 16 271 483 1.7), (v1 15 139 224 1.6)}. Can you confirm this? – Bob Genom – 2020-01-11T14:59:11.487

Answers

10

Rust, score ≈ 22

This uses a dynamic programming approach (I’ve added an explanation here) whose running time seems to scale as approximately \$\tilde O(2^{1.5n})\$, rather than the \$\tilde O(2^{2n})\$ of a brute-force search. On my Ryzen 7 1800X (8 cores/16 threads), it gets through \$1 \le n \le 21\$ in 1.7 minutes, \$1 \le n \le 22\$ in 5.1 minutes.

Now using SIMD for the inner loop.
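
In outline: fix the first string a, then scan the second string b one character at a time, maintaining a map from the current column of the standard DP table to the number of prefixes of b that produce that column. Identical columns merge, and that merging is where the savings over brute force come from. Here is a simplified Python paraphrase of that state-counting scheme (it omits the symmetry pruning, the clamping that lets more states collide, and the SIMD):

from collections import defaultdict

def total_distance(n):
    """Sum of lev(a, b) over all 2^n * 2^n pairs; divide by 4^n for the mean."""
    total = 0
    for a_bits in range(1 << n):
        a = [(a_bits >> i) & 1 for i in range(n)]
        # Map: DP column (distances from every prefix of a to the current
        # prefix of b) -> number of prefixes of b that produce that column.
        states = {tuple(range(n + 1)): 1}
        for _ in range(n):  # append one character of b at a time
            new_states = defaultdict(int)
            for col, count in states.items():
                for c in (0, 1):
                    new = [col[0] + 1]
                    for i in range(1, n + 1):
                        new.append(min(new[i - 1] + 1,                  # insertion
                                       col[i] + 1,                      # deletion
                                       col[i - 1] + (a[i - 1] != c)))   # substitution
                    new_states[tuple(new)] += count  # identical columns merge here
            states = new_states
        total += sum(count * col[n] for col, count in states.items())
    return total  # total_distance(3) == 94, i.e. 94/64 == 47/32

Adjacent entries of a column differ by at most 1, so the merged states stay relatively few; that merging is what appears to push the running time toward \$\tilde O(2^{1.5n})\$.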

src/main.rs

use fxhash::FxBuildHasher;
use itertools::izip;
use rayon::prelude::*;
use std::arch::x86_64::*;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::mem;
use typed_arena::Arena;

#[global_allocator]
static ALLOC: mimallocator::Mimalloc = mimallocator::Mimalloc;

type Distance = i8;
type Count = u32;
type Total = u64;

#[derive(Debug)]
struct Distances(__m128i);

impl PartialEq for Distances {
    fn eq(&self, other: &Distances) -> bool {
        unsafe {
            let x = _mm_xor_si128(self.0, other.0);
            _mm_testz_si128(x, x) != 0
        }
    }
}

impl Eq for Distances {}

impl Hash for Distances {
    fn hash<H: Hasher>(&self, state: &mut H) {
        unsafe {
            _mm_extract_epi64(self.0, 0).hash(state);
            _mm_extract_epi64(self.0, 1).hash(state);
        }
    }
}

fn main() {
    let splat0 = unsafe { _mm_set1_epi8(0) };
    let splat1 = unsafe { _mm_set1_epi8(1) };
    let splatff = unsafe { _mm_set1_epi8(!0) };
    let splat7f = unsafe { _mm_set1_epi8(0x7f) };
    let seq = unsafe { _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0) };
    let grow0 = unsafe {
        _mm_set_epi8(
            -0x80, 0, 0x20, 0, 0x8, 0, 0x2, 0, 0, 0x40, 0, 0x10, 0, 0x4, 0, 0x1,
        )
    };
    let grow1 = unsafe {
        _mm_set_epi8(
            0, 0x40, 0, 0x10, 0, 0x4, 0, 0x1, -0x80, 0, 0x20, 0, 0x8, 0, 0x2, 0,
        )
    };

    for n in 1 as Distance.. {
        if n > 31
            || (n as Count).leading_zeros() < n as u32
            || (n as Total).leading_zeros() < 2 * n as u32
        {
            break;
        }

        let total: Total = (0u32..1 << (n - 1))
            .into_par_iter()
            .map(|a| {
                let mut a_sym = a.reverse_bits();
                a_sym ^= (a_sym >> 31).wrapping_neg();
                a_sym >>= 32 - n as usize;
                if a_sym < a {
                    return 0;
                }

                let arena = Arena::<Distances>::new();
                let stride = (n as usize + 16) / 16 * 16;
                let idx = |i: Distance, j: Distance| i as usize + stride * j as usize;
                let both = |[x, y]: [Distance; 2]| x.max(y);
                let mut worst = vec![[stride as Distance; 2]; idx(0, n + 1)];
                for j in 0..=n {
                    worst[idx(n, j)] = [n - j; 2];
                }
                for i in (0..n).rev() {
                    worst[idx(i, n)] = [n - i; 2];
                    let good = (a >> i & 1) as usize;
                    let bad = good ^ 1;
                    for j in (0..n).rev() {
                        worst[idx(i, j)][good] = both(worst[idx(i + 1, j + 1)]);
                        worst[idx(i, j)][bad] = 1 + worst[idx(i + 1, j)][bad]
                            .min(both(worst[idx(i, j + 1)]))
                            .min(both(worst[idx(i + 1, j + 1)]));
                    }
                }
                let worst: &[Distances] = arena.alloc_extend(
                    worst
                        .into_iter()
                        .map(both)
                        .collect::<Box<[Distance]>>()
                        .chunks(16)
                        .map(|chunk| {
                            Distances(unsafe {
                                _mm_loadu_si128(chunk as *const [i8] as *const __m128i)
                            })
                        }),
                );

                let mut states: HashMap<&[Distances], Count, FxBuildHasher> = HashMap::default();
                let mut new_states = HashMap::default();

                states.insert(
                    arena.alloc_extend(
                        (0..n + 1)
                            .step_by(16)
                            .map(|i| Distances(unsafe { _mm_add_epi8(_mm_set1_epi8(i), seq) })),
                    ),
                    1,
                );

                let bvs: Vec<_> = [a, !a]
                    .iter()
                    .map(|b| {
                        arena.alloc_extend((0..n + 1).step_by(16).map(|i| unsafe {
                            let x = _mm_set1_epi16(((b << 1) >> i) as i16);
                            Distances(_mm_xor_si128(
                                _mm_cmpeq_epi8(
                                    _mm_or_si128(
                                        _mm_and_si128(x, grow0),
                                        _mm_and_si128(_mm_alignr_epi8(x, x, 1), grow1),
                                    ),
                                    splat0,
                                ),
                                splatff,
                            ))
                        }))
                    })
                    .collect();

                for j in 1..=n {
                    new_states.reserve(2 * states.len());
                    let worst_slice = &worst[idx(0, j) / 16..idx(0, j + 1) / 16];
                    for (state, count) in states.drain() {
                        for bv in &bvs {
                            let mut x = j;
                            let mut y = n.into();
                            let mut bound = n;

                            let new_state: &mut [Distances] =
                                arena.alloc_extend(izip!(&**bv, state, worst_slice).map(
                                    |(&Distances(bc), &Distances(yc), &Distances(wc))| unsafe {
                                        let o = _mm_min_epi8(
                                            _mm_add_epi8(yc, splat1),
                                            _mm_sub_epi8(
                                                _mm_insert_epi8(_mm_slli_si128(yc, 1), y, 0),
                                                bc,
                                            ),
                                        );
                                        y = _mm_extract_epi8(yc, 15);
                                        let o = _mm_sub_epi8(o, seq);
                                        let o = _mm_min_epi8(o, _mm_set1_epi8(x));
                                        let o = _mm_sub_epi8(splat7f, o);
                                        let o = _mm_max_epu8(o, _mm_slli_si128(o, 1));
                                        let o = _mm_max_epu8(o, _mm_slli_si128(o, 2));
                                        let o = _mm_max_epu8(o, _mm_slli_si128(o, 4));
                                        let o = _mm_max_epu8(o, _mm_slli_si128(o, 8));
                                        let o = _mm_sub_epi8(splat7f, o);
                                        x = _mm_extract_epi8(o, 15) as i8 + 16;
                                        let o = _mm_add_epi8(o, seq);
                                        let z = _mm_add_epi8(o, wc);
                                        let z = _mm_min_epi8(z, _mm_srli_si128(z, 1));
                                        let z = _mm_min_epi8(z, _mm_srli_si128(z, 2));
                                        let z = _mm_min_epi8(z, _mm_srli_si128(z, 4));
                                        let z = _mm_min_epi8(z, _mm_srli_si128(z, 8));
                                        bound = bound.min(_mm_extract_epi8(z, 0) as i8);
                                        Distances(o)
                                    },
                                ));

                            let bound = unsafe { _mm_set1_epi8(bound) };
                            for (i, Distances(x)) in (0..).step_by(16).zip(&mut *new_state) {
                                *x = unsafe {
                                    _mm_min_epi8(
                                        *x,
                                        _mm_sub_epi8(
                                            bound,
                                            _mm_abs_epi8(_mm_add_epi8(_mm_set1_epi8(i - j), seq)),
                                        ),
                                    )
                                };
                            }

                            *new_states.entry(&*new_state).or_insert(0) += count;
                        }
                    }
                    mem::swap(&mut states, &mut new_states);
                }

                let control = unsafe { _mm_insert_epi8(splatff, (n % 16).into(), 0) };
                Total::from(
                    states
                        .into_iter()
                        .map(|(state, count)| unsafe {
                            count
                                * _mm_extract_epi8(
                                    _mm_shuffle_epi8(state[n as usize / 16].0, control),
                                    0,
                                ) as Count
                        })
                        .sum::<Count>(),
                ) * if a_sym == a { 1 } else { 2 }
            })
            .sum();

        let shift = total.trailing_zeros();
        println!(
            "{} {}/{}",
            n,
            total >> shift,
            (1 as Total) << (2 * n as u32 - 1 - shift),
        );
    }
}

Cargo.toml

[package]
name = "levenshtein"
version = "0.1.0"
authors = ["Anders Kaseorg <andersk@mit.edu>"]
edition = "2018"

[profile.release]
lto = true
codegen-units = 1

[dependencies]
fxhash = "0.2.1"
itertools = "0.8.2"
mimallocator = "0.1.3"
rayon = "1.3.0"
typed-arena = "2.0.0"

Running

RUSTFLAGS='-C target-cpu=native' cargo build --release
target/release/levenshtein

Output

(With cumulative timing data prepended by ts -s %.s from moreutils.)

0.000008 1 1/2
0.000150 2 1/1
0.000219 3 47/32
0.000282 4 243/128
0.000344 5 1179/512
0.000413 6 2755/1024
0.000476 7 12561/4096
0.000538 8 56261/16384
0.000598 9 124329/32768
0.000660 10 2175407/524288
0.000721 11 589839/131072
0.000782 12 40664257/8388608
0.000843 13 174219279/33554432
0.006964 14 742795299/134217728
0.068070 15 1576845897/268435456
0.310136 16 13340661075/2147483648
1.062122 17 14062798725/2147483648
3.586745 18 59125997473/8589934592
11.265840 19 123976260203/17179869184
33.691822 20 259354089603/34359738368
101.514674 21 8662782598909/1099511627776
307.427106 22 72199426617073/8796093022208
956.299101 23 150173613383989/17592186044416
3077.477731 24 1247439983177201/140737488355328
10276.205241 25 5173410986415247/562949953421312
34550.754308 26 5356540527479769/562949953421312

Static core2 build for Anush

Anders Kaseorg

Posted 2019-12-29T20:41:36.757

Reputation: 29 242

A really great first answer! I look forward to timing it. – Anush – 2019-12-30T08:37:35.137

I have no idea why my AMD 8350 is so much slower than your CPU. In theory it should be roughly half the speed. – Anush – 2019-12-30T20:01:25.550

@Anush Curious. Is this with the revised version I posted after your first comment (using a_sym to cut down the search space by a factor of 4)? And a current version of Rust (HashMap was replaced with a faster implementation a few months ago)? – Anders Kaseorg – 2019-12-31T01:47:58.700

Yes I am using the latest version and rustc 1.40.0 (73528e339 2019-12-16). I will see if I can find another computer to test it on. – Anush – 2019-12-31T10:55:55.530

For interest I made a static compilation and ran it on my computer and two others. For mine it gets to n=19. On Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz it gets to n = 18 and on Intel(R) Xeon(R) CPU X5460 @ 3.16GHz it gets to n = 16. Could you statically compile on your machine and give me the compiled executable to test by any chance? – Anush – 2019-12-31T11:30:31.793

@Anush I tried a static build with musl and it was also slow—I think musl comes with a very slow memory allocator. Try this revision, which uses mimalloc instead of the system allocator. Here’s a static binary if you need it.

– Anders Kaseorg – 2019-12-31T17:59:28.527

@Anush …which made me realize, if memory allocation overhead is an issue, then typed-arena is even better. This version is another 31% faster (new static binary).

– Anders Kaseorg – 2020-01-01T21:28:57.823

I gave up on my computer which just seems very slow for this sort of multi-threaded code for some mysterious reason (the amd 8350 is made up of 4 "modules" that each consist of 2 integer processing units and one floating point unit so I can't see why it is less than half the speed). I found a server with an Intel CPU and your code is now eligible for the 50 point bounty. – Anush – 2020-01-01T21:52:52.393

7 Would you mind writing a few words about your method please? It's not obvious to me how you have formulated the problem to make it amenable to dynamic programming, for example. – Anush – 2020-01-02T17:30:03.843

Could you make me a static binary again please? My CPU is -march=core2 -msse4.1 in terms of gcc features if that makes a difference. – Anush – 2020-01-08T16:48:18.163

@Anush I converted the inner loop to SIMD and added a new static binary. – Anders Kaseorg – 2020-01-09T08:34:29.147

A very nice speedup! – Anush – 2020-01-09T09:42:26.357

5

Java, score ≈ 19

My solution is a recursive approach. It still has the \$\tilde O(2^{2n})\$ running time of a brute-force search; in other words, if n increases by 1, the runtime increases by a factor of 4 (even when using multi-threading).
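
Concretely, the recursion walks a binary tree over the bits of the text t, carrying Myers' bit-parallel column state (VP, VN) down each branch, so all texts sharing a j-bit prefix share the work for that prefix. A compact Python paraphrase of just that recursion (my sketch; the real code below adds the symmetry weighting and streams):

def total_distance(n, p):
    """Sum of lev(p, t) over all 2^n texts t, sharing DP work across prefixes."""
    mask = (1 << n) - 1

    def step(vp, vn, eq):
        # One column of Myers' bit-parallel DP; eq marks the positions of p
        # matching the next text bit (p itself for a 1 bit, ~p for a 0 bit).
        d0 = (((eq & vp) + vp) ^ vp) | eq | vn
        hp = vn | ~(d0 | vp)
        hn = vp & d0
        x = (hp << 1) | 1
        return (hn << 1) | ~(d0 | x), d0 & x  # new (vp, vn)

    def go(j, vp, vn):
        if j == n:
            # lev(p, t) = n + popcount(VP) - popcount(VN) over the low n bits
            return n + bin(vp & mask).count("1") - bin(vn & mask).count("1")
        return sum(go(j + 1, *step(vp, vn, eq)) for eq in (~p, p))

    return go(0, mask, 0)

For example, total_distance(1, 0) returns 1: the distances from "0" to the two texts "0" and "1" are 0 and 1.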

This is obviously not enough to compete with Anders Kaseorg's code.

I observed and used some symmetries to squeeze out some (linear) factors.
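
The symmetries: the distance is unchanged when both strings are reversed, and when both are complemented, so main below only processes patterns p that are minimal against their bit-reversal rp and the complement np of that reversal, weighting each result by the size of its orbit. A small Python sketch of just that weighting logic (helper names are mine):

def rev(p, n):
    # Bit-reversal of an n-bit pattern, like Integer.reverse(p) >>> (32 - n).
    r = 0
    for i in range(n):
        r = (r << 1) | ((p >> i) & 1)
    return r

def weight(p, n):
    # Each canonical p stands in for its whole orbit under reversal and
    # complement: weight 4 generically, weight 2 for (anti-)palindromes.
    rp = rev(p, n)               # reversal of p
    np_ = ((1 << n) - 1) & ~rp   # complement of the reversal
    if p > rp or p > np_:
        return 0                 # non-canonical; counted via its representative
    return 2 if (p == rp or p == np_) else 4

Summing weight(p, n) over all p gives exactly 2^n, which is why restricting buildX to canonical patterns loses nothing.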

import java.util.stream.IntStream;
// version 5.1
public class AvgLD51_MT {

    public static void main(String[] argv) {
        long t0=System.currentTimeMillis();
        for (int n=1; ;n++) {
            int VP = (1 << n) - 1; // 1m;
            int VN = 0; // 0m;
            int max=1<<(n-1);
            final int N=n;
            long sum=IntStream.range(0, max).mapToLong(p-> {
                int rp = Integer.reverse(p)>>>(32-N);
                int np = VP & ~rp;
                if (p <= rp && p <= np) {
                    if (p == rp || p == np) {
                        return 2*buildX(N, p, 0, 1, VP, VN);
                    } else {
                        return 4*buildX(N, p, 0, 1, VP, VN);
                    }
                }
                return 0;
            }).parallel().sum();
            long gcd=gcd(sum, (1L<<(2*n)));
            System.out.printf("%f   %d  %d  %d/%d\n", (double)(System.currentTimeMillis()-t0)/(1000), n, sum, sum/gcd, (1L<<(2*n))/gcd);
            sum*=2;
        }
    }

    /**
     * Myers (and Hyyrö) injected into my recursive buildX function (see version 4).
     * Pattern p is fixed. Text t is generated by recursion.
     *
     * Myers' (and Hyyrö's) bit-parallel Levenshtein distance,
     * taken and adapted (for the special case m == n) from:
     * https://www.win.tue.nl/~jfg/educ/bit.mat.pdf
     * http://www.mi.fu-berlin.de/wiki/pub/ABI/RnaSeqP4/myers-bitvector-verification.pdf
     * https://www.sciencedirect.com/science/article/pii/S157086670400053X
     * https://www.researchgate.net/publication/266657812_Thread-cooperative_bit-parallel_computation_of_Levenshtein_distance_on_GPU
     */
    static long buildX(int n, int p, int t, int j, int VP, int VN){
        final int HMASK = 1 << (n - 1); // 10^(m-1)
        final int VMASK = (1<<n)-1;
        long score=0;
        int Bj, D0, HP, HN, VP1, VN1, X;

        // assume a 0 at Tj
        Bj= ~p;
        // compute diagonal delta vector
        D0 = ((VP + (Bj & VP)) ^ VP) | Bj | VN;

        // update horizontal delta values
        HN = VP & D0;
        HP = VN | ~(VP | D0);
        // Scoring and output
        // carry = rev(n, j)*(Integer.bitCount(HP & HMASK) - Integer.bitCount(HN & HMASK));
        X = (HP << 1) | 1;
        VN1 = (HN << 1) | ~(X | D0);
        VP1 = X & D0;
        if (j!=HMASK) {         
            // update vertical delta values
            score = buildX(n, p, t, 2*j, VN1, VP1);
        } else {
            score = n + Integer.bitCount(VMASK & VN1) - Integer.bitCount(VMASK & VP1);
        }

        // assume a 1 at Tj
        Bj= p;
        // compute diagonal delta vector
        D0 = ((VP + (Bj & VP)) ^ VP) | Bj | VN;

        // update horizontal delta values
        HN = VP & D0;
        HP = VN | ~(VP | D0);
        // Scoring and output
        // carry += rev(n, j)*(Integer.bitCount(HP & HMASK) - Integer.bitCount(HN & HMASK));
        X = (HP << 1) | 1;
        VN1 = (HN << 1) | ~(X | D0);
        VP1 = X & D0;
        if (j!=HMASK) {         
            // update vertical delta values
            return score + buildX(n, p, t, 2*j, VN1, VP1);
        } else {
            return n + score + Integer.bitCount(VMASK & VN1) - Integer.bitCount(VMASK & VP1);
        }
    }

    static long gcd(long numerator, long denominator) {
        long gcd = denominator;
        while (numerator != 0) {
            long tmp=numerator; numerator=gcd % numerator; gcd=tmp;
        }
        return gcd;
    }
}

Version 5.1

Like version 5 but is multi-threaded by using streams.

0.000000    1   2   1/2                           
...
0.748000    15  6307383588  1576845897/268435456
2.359000    16  26681322150 13340661075/2147483648
10.062000   17  112502389800    14062798725/2147483648
35.387000   18  473007979784    59125997473/8589934592
156.396000  19  1983620163248   123976260203/17179869184
572.525000  20  8299330867296   259354089603/34359738368

Version 5

Myers' code directly injected into my recursive buildX function. As a consequence, no extra call to LevenshteinDistance is needed any more.

0.000000    1   2   1/2                           
...
2.134000    15  6307383588  1576845897/268435456
7.571000    16  26681322150 13340661075/2147483648
32.705000   17  112502389800    14062798725/2147483648
119.952000  18  473007979784    59125997473/8589934592
523.186000  19  1983620163248   123976260203/17179869184

Version 4.1

Like version 4 but is multi-threaded by using streams.

0.000000    1   2   1/2                           
...
0.764000    13  348438558   174219279/33554432
1.525000    14  1485590598  742795299/134217728
4.417000    15  6307383588  1576845897/268435456
15.445000   16  26681322150 13340661075/2147483648
63.199000   17  112502389800    14062798725/2147483648
259.179000  18  473007979784    59125997473/8589934592

Version 4

Uses Myers, Hyyrö bit-parallel LevenshteinDistance.

0.000000    1   2   1/2                           
...
8.203000    15  6307383588  1576845897/268435456
35.326000   16  26681322150 13340661075/2147483648
148.577000  17  112502389800    14062798725/2147483648  
629.084000  18  473007979784    59125997473/8589934592
2615.031000 19  1983620163248   123976260203/17179869184       

Version 3

Copied and uses getLevenshteinDistance(..) from Apache StringUtils. BTW: using the threshold variant made no difference for me (I used threshold = bitCount(s^t), i.e. the Hamming distance, which upper-bounds the Levenshtein distance for equal-length strings).

0.000000    1   2   1/2                           
...
60.190000   15  6307383588  1576845897/268435456
271.020000  16  26681322150 13340661075/2147483648
1219.544000 17  112502389800    14062798725/2147483648          

Version 2

Found more symmetries on recursion.

0.000000    1   2   1/2                           
...
105.389000  15  6307383588  1576845897/268435456          
447.617000  16  26681322150 13340661075/2147483648        
2105.316000 17  112502389800    14062798725/2147483648        

Version 1

0.000000    1   2   1/2                           
0.068000    2   16  1/1                           
0.070000    3   94  47/32                         
0.071000    4   486 243/128                       
0.073000    5   2358    1179/512                      
0.074000    6   11020   2755/1024                     
0.076000    7   50244   12561/4096                    
0.086000    8   225044  56261/16384                   
0.111000    9   994632  124329/32768                  
0.223000    10  4350814 2175407/524288                
0.640000    11  18874848    589839/131072                 
1.842000    12  81328514    40664257/8388608              
7.387000    13  348438558   174219279/33554432            
29.998000   14  1485590598  742795299/134217728           
139.217000  15  6307383588  1576845897/268435456          
581.465000  16  26681322150 13340661075/2147483648  

Bob Genom

Posted 2019-12-29T20:41:36.757

Reputation: 846

1Thank you. I look forward to timing it. – Anush – 2020-01-06T07:59:18.017

Got to 15 on my test CPU in 224 seconds (983 for n = 16). I notice the code uses a single thread. Can it be made multi-threaded? – Anush – 2020-01-06T10:56:56.460

I'm still in the mode of tuning the core algorithm. Once done, I'll try to progress to multi-threading. BTW: how did you get the values for your "answer for n up to 23"? – Bob Genom – 2020-01-06T17:11:29.643

For up to 16 I just used my slow Python code and for the rest I used Anders Kaseorg's code. Very glad to hear improvements are planned! – Anush – 2020-01-06T17:13:06.693

1 You, sir, just made my day. The bit-parallel distance works like a charm. I am up two levels just by using it. – Shamis – 2020-01-11T10:03:47.323

You have achieved a great speedup! But I still want to know what Kaseorg has done in terms of dynamic programming. – Anush – 2020-01-12T10:25:00.370

1 @Shamis: Because the alphabet is "0", "1", I thought there must be some specialized Levenshtein distance version for binary numbers. Then I found Myers' solution on the net and gave it a try. – Bob Genom – 2020-01-12T18:56:09.783

1 @Anush: From my findings after diving two weeks into the field of Levenshtein distances: a chance to reach O(2^(1.5n)) is to build and use a "Levenshtein automaton" for each word w. That is 2^n automata a[w], each of which can (they say) be built easily in time O(|w|). Each a[w] is applied to the 2^n words stored in a compressed trie dictionary. The time for this is proportional to the number of nodes in the trie. I guess, when stepping from n-1 to n, maintaining and applying the dictionary can be done (e.g. by using dynamic programming) in at least O(2^(0.5n)). Does that sound plausible to you? – Bob Genom – 2020-01-13T07:45:54.080

Very interesting. I hope someone is brave enough to try implementing this in a non-Rust programming language! – Anush – 2020-01-13T09:14:58.493

I think that I know what Anders did. However my implementation, while working, is not nearly as fast as even my original solution even though it should/could be faster in theory. – Shamis – 2020-01-14T09:11:48.583

4

C

// gcc -O3 -pthread -march=native a.c && ./a.out
#define _GNU_SOURCE
#include<stdio.h>
#include<unistd.h>
#include<pthread.h>
#define _(a...){return({a;});}
#define $(x,a...)if(x){a;}
#define P(x,a...)if(x)_(a)
#define W(x,a...)while(x){a;}
#define F(i,n,a...)for(I i=0,n_=(n);i<n_;i++){a;}
#define S static
typedef void V;typedef int I;typedef long long L;typedef struct{I x,r;pthread_barrier_t*b;}A;
S I n,x1,msk,nt;S L f1(I,I,I,I);
S L gcd(L x,L y)_(W(x,L z=x;x=y%x;y=z)y)S I rev(I x)_(I r=0;F(i,n,r+=(x>>i&1)<<(n-1-i))r)
S L f0(I x,I j,I vp_,I vn_,I pm)_(I d0=(((pm&vp_)+vp_)^vp_)|pm|vn_,hp=vn_|~(d0|vp_),hp1=hp<<1|1,vp=(d0&vp_)<<1|~(d0|hp1),vn=d0&hp1;f1(x,j,vp,vn))
S L f1(I x,I j,I vp_,I vn_)_(P(!--j,__builtin_popcount(msk&vp_)-__builtin_popcount(msk&vn_))f0(x,j,vp_,vn_,x)+f0(x,j,vp_,vn_,~x))
S V*f2(A*a)_(I x=a->x;L s[3]={};W(x<x1,I rx=rev(x),nx=msk&~rx;$(x<=rx&&x<=nx,s[(x!=rx)+(x!=nx)]+=f1(x,n+1,msk,0))x+=nt)
 a->r=s[0]+2*s[1]+4*s[2];pthread_barrier_wait(a->b);NULL)
S L f3()_(L r=(L)n<<2*n;pthread_barrier_t b;pthread_barrier_init(&b,0,nt);A a[nt];pthread_t t[nt];
 F(i,nt,cpu_set_t c;CPU_ZERO(&c);CPU_SET(i,&c);pthread_attr_t h;pthread_attr_init(&h);pthread_attr_setaffinity_np(&h,sizeof(cpu_set_t),&c);
  a[i].x=i;a[i].r=0;a[i].b=&b;pthread_create(t+i,0,(V*(*)(V*))f2,a+i))
 F(i,nt,pthread_join(t[i],0);r+=a[i].r)pthread_barrier_destroy(&b);r)
I main()_(nt=2*sysconf(_SC_NPROCESSORS_CONF);
 W(1,n++;x1=1<<(n-1);msk=(1<<n)-1;L p=f3(),q=1ll<<2*n,d=gcd(p,q);printf("%d %lld/%lld\n",n,p/d,q/d);fflush(stdout))0)

ngn

Posted 2019-12-29T20:41:36.757

Reputation: 11 449

2 Could you ungolf the code please? I look forward to timing it, but what n do you get to? – Anush – 2020-01-07T06:45:18.313

This isn't Code Golf. – S.S. Anne – 2020-01-11T16:11:39.863

7 I think this is just how ngn writes C code -- it's not golfed. – xnor – 2020-01-12T13:28:59.770

3

Python, score ≈ 15, 17, 18. Requiem for a Dream.

So far for my attempt to decipher the algorithm: just one simple symmetry as a result. The upside is that I managed to get up to 19. The downside is obviously a shattered hope XD. To add insult to injury, I think that Bob Genom already has it. (Noticed after I dismantled my previous horror of a solution into something readable-ish.) It also might be that what I considered annoying edge cases was actually a result of me overcomplicating things. Dear oh dear. Still, I think that some way of caching the Levenshtein computation might be the way to go. Just not the one I tried last time.

7.215967655181885 16 13340661075 / 2147483648
24.544007539749146 17 14062798725 / 2147483648
93.72401142120361 18 59125997473 / 8589934592
379.6802065372467 19 123976260203 / 17179869184

Added multiprocessing. The most expensive thing at the moment is locks, and I have yet to figure out a way to bypass the need for them. My manual attempts are slower than Pool.starmap, which makes me slightly sad.

Tried a block-processing approach in an attempt to gain another linear factor; however, for some reason this slowed the code down by a lot. Over-optimization seems to backfire.

1.640207052230835 13 174219279 / 33554432
1.9370124340057373 14 742795299 / 134217728
3.1867198944091797 15 1576845897 / 268435456
9.054970979690552 16 13340661075 / 2147483648
37.539693117141724 17 14062798725 / 2147483648
158.5456690788269 18 59125997473 / 8589934592

Thanks to Bob Genom's answer and his latest distance algorithm, I managed to up the speed. I also noted that one of the attempts at a linear symmetry backfired: the code runs faster after I removed it. Probably something to do with ranges?

...............
0.6873703002929688 13 174219279 / 33554432
2.0464255809783936 14 742795299 / 134217728
7.808838605880737 15 1576845897 / 268435456
33.9985032081604 16 13340661075 / 2147483648
145.6884548664093 17 14062798725 / 2147483648

It took me quite a while, and I ran into quite a few Python limitations. My attempt to parallelize was ground to a halt by the GIL. Figuring out how to make processes talk to each other will take a while. I have a few more ideas to try, but my brain is starting to melt. I spent the last two hours juggling indices; my current approach is to embed the symmetries directly into the loops. Recursion was much slower, and Numba doesn't like to interact with Python objects. For some reason it sees nothing to parallelize in this code, and I have no clue whether the parallel part does anything, since the CPU is only at 20% capacity.

This approach is still brute-forcing, but with the embedded symmetries it takes the computation down a notch: many of the combinations are not even considered.

I took the liberty of starting from the 2nd floor. I do not consider that cheating, since it is negligible time-wise and the lowest levels introduce a swath of very vexing edge cases.

If I have time, I will try to do these: rewrite the code in something faster, probably C; figure out a decent way to use parallelization, maybe in C; and add a bit of caching. That one will be tricky, especially in combination with the embedding.

0.0 2 1 / 1
0.483562707901001 3 47 / 32
0.483562707901001 4 243 / 128
0.483562707901001 5 1179 / 512
0.483562707901001 6 2755 / 1024
0.483562707901001 7 12561 / 4096
0.5001938343048096 8 56261 / 16384
0.5334563255310059 9 124329 / 32768
0.6999850273132324 10 2175407 / 524288
1.3333814144134521 11 589839 / 131072
3.7170190811157227 12 40664257 / 8388608
15.165801048278809 13 174219279 / 33554432
62.91589903831482 14 742795299 / 134217728
266.3912649154663 15 1576845897 / 268435456

I would love to give a GPU a go for this task. However, I failed miserably for now XD.

from numba import jit, cuda, prange
import time
import multiprocessing as mp

@jit(nopython=True, fastmath=True, nogil=True)#, parallel=True)
def LevenshteinDistance(n, p, t):
        np=~p
        HMASK = (1 << (n - 1))
        VP = (1 << n) - 1
        VN = 0
        score = n
        for j in range(0,n):
            if (t & (1<<j)) != 0:
                Bj = p
            else:
                Bj = np
            D0 = ((VP + (Bj & VP)) ^ VP) | Bj | VN
            HN = VP & D0
            HP = VN | ~(VP | D0)

            if ((HP & HMASK) != 0):
             score += 1;
            elif ((HN & HMASK) != 0):
             score -= 1;
            X = (HP << 1) | 1
            VN = X & D0
            VP = (HN << 1) | ~(X | D0)
        return score

@jit(nopython=True, fastmath=True)#, parallel=True)
def dispatchLev(i, level):
    halfSize = 1 << (level - 1) - 1
    iRange = halfSize
    levelSize = 1 << (level - 1)
    mask = levelSize - 1
    halfSize = levelSize >> 1
    rangeUpper = iRange - i
    indexI = i + halfSize
    baseI = indexI << 1
    sum = 0
    for indexJ in range(0, rangeUpper):
        baseJ = indexJ << 1
        if (mask ^ indexJ) == indexI:
            a = LevenshteinDistance(level, baseI + 1, baseJ)
            b = LevenshteinDistance(level, baseI, baseJ + 1)
            sum += a + b
        else:
            a = LevenshteinDistance(level, baseI + 1, baseJ)
            b = LevenshteinDistance(level, baseI, baseJ + 1)
            sum += 2 * (a + b)

    return sum

def computeSum(level):
    levelSize = 1 << (level - 1)
    halfSize = levelSize >> 1
    curSum = 0
    iRange = halfSize
    test = [(x, level) for x in range(0, iRange)]
    if len(test) > 1:
        a = myPool.starmap(dispatchLev, test)
        curSum += sum(a)
        #for x, level in test:
        #    curSum += dispatchLev(x,level)

    else:
        a = dispatchLev(0, level)
        curSum += a
    return curSum


def gcd(num, den):
    gcdRet = den
    tmp = 0
    while num != 0:
        tmp = num
        num = gcdRet % num
        gcdRet = tmp

    return gcdRet


if __name__ == '__main__':
    t1 = time.time()
    print("beginning")
    prevSum = 16
    bruteForceCarry = 6
    levelMask = 0
    target = 20
    curSum = 0
    bruteForce = 0
    myPool = mp.Pool(mp.cpu_count())
    processArray = []
    resultArray = []



    for level in range(3, target):
        levelSize = 1 << level
        halfSize = levelSize >> 1
        bruteForce = computeSum(level)
        bruteForceCarry = 2 * bruteForceCarry + bruteForce
        curSum = prevSum + bruteForceCarry
        curSum = curSum * 2
        t2 = time.time()
        wholeSize = levelSize * levelSize
        divisor = gcd(curSum, wholeSize)
        a = int(curSum / divisor)
        b = int(wholeSize / divisor)
        print(t2 - t1, level, a, "/", b)
        prevSum = curSum

Shamis

Posted 2019-12-29T20:41:36.757

Reputation: 141

Thanks! It might be fun to try pypy and then cython too. – Anush – 2020-01-10T17:32:53.777

I am currently rewriting the code in C, I want to see how much is it Python being slow and how much it is the limitation of the approach I have chosen. – Shamis – 2020-01-10T19:50:25.783

I asked an expert and they said "numba won't automatically parallelise loops, you have to opt-in with prange and then guarantee all the safety yourself". See http://numba.pydata.org/numba-doc/latest/user/performance-tips.html#parallel-true

– Anush – 2020-01-10T20:03:18.600

I tried that more than once, actually. I did seemingly the same thing as they did. And if I am not terribly wrong, there is no inherent conflict in the code; I am currently not caching any values. The version that used Pool.apply was working without issues. However, after refactoring everything, the Pool decided that it doesn't want to be friends with Numba. – Shamis – 2020-01-10T20:19:19.040

The gitter for numba https://gitter.im/numba/numba has experts who are very happy to help. It would be worth asking there.

– Anush – 2020-01-10T20:20:27.367

The main question is, can anyone work out what the dynamic programming algorithm is that Kaseorg has used? – Anush – 2020-01-11T13:55:00.767

Would this all be easier in Julia? I am following your investigations with great interest. – Anush – 2020-01-14T11:55:01.387

Not sure, I don't know that one. At the start I chose Python since I use it at work and I can use it reasonably fast. I rewrote the first computation in C and it was not that much faster than Numba, actually. However, when I once again began digging into the issue, I ran into the restrictions that Numba imposes on type safety. In the end, I have yet to find a way to make Python code thread-unsafe, so most of the time is still spent in locks. I already spent more time than intended, so I opted for posting the partial results. Work waits, and sleep is also nice :-(. – Shamis – 2020-01-14T12:58:17.447