Compute the maximum number of runs possible for as large a string as possible

24

9

[This question is a follow up to Compute the runs of a string ]

A period p of a string w is any positive integer p such that w[i]=w[i+p] whenever both sides of this equation are defined. Let per(w) denote the size of the smallest period of w . We say that a string w is periodic iff per(w) <= |w|/2.

So informally a periodic string is just a string that is made up from another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.

For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.

As more examples, abcabca, ababababa and abcabcabc are also periodic.

For those who like regexes, this one detects if a string is periodic or not:

\b(\w*)(\w+\1)\2+\b

The task is to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.

A substring w is a maximal periodic substring (run) if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run" with the same period.

Because two runs can represent the same string of characters occurring in different places in the overall string, we will represent runs by intervals. Here is the above definition repeated in terms of intervals.

A run (or maximal periodic substring) in a string T is an interval [i...j] with j>=i, such that

  • T[i...j] is a periodic word with the period p = per(T[i...j])
  • It is maximal. Formally, neither T[i-1] = T[i-1+p] nor T[j+1] = T[j+1-p]. Informally, the run cannot be contained in a larger run with the same period.

Denote by RUNS(T) the set of runs in string T.

Examples of runs

  • The four maximal periodic substrings (runs) in string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.

  • The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs): T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.

  • The string T = atatbatatb contains the following three runs. They are: T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.

Here I am using 1-indexing.

The task

Write code so that for each integer n starting at 2, you output the largest numbers of runs contained in any binary string of length n.

Score

Your score is the highest n you reach in 120 seconds such that for all k <= n, no one else has posted a higher correct answer than you. Clearly if you have all optimum answers then you will get the score for the highest n you post. However, even if your answer is not the optimum, you can still get the score if no one else can beat it.

Languages and libraries

You can use any available language and libraries you like. Where feasible, it would be good to be able to run your code so please include a full explanation for how to run/compile your code in Linux if at all possible.

Example optima

In the following: n, optimum number of runs, example string.

2 1 00
3 1 000
4 2 0011
5 2 00011
6 3 001001
7 4 0010011
8 5 00110011
9 5 000110011
10 6 0010011001
11 7 00100110011
12 8 001001100100
13 8 0001001100100
14 10 00100110010011
15 10 000100110010011
16 11 0010011001001100
17 12 00100101101001011
18 13 001001100100110011
19 14 0010011001001100100
20 15 00101001011010010100
21 15 000101001011010010100
22 16 0010010100101101001011

What exactly should my code output?

For each n your code should output a single string and the number of runs it contains.

My Machine The timings will be run on my machine. This is a standard ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.

Leading answers

  • 49 by Anders Kaseorg in C. Single threaded and run with L = 12 (2GB of RAM).
  • 27 by cdlane in C.

user9206

Posted 2016-07-06T08:49:46.147

Reputation:

11You know what do with the results from this challenge... – Martin Ender – 2016-07-08T09:46:13.363

@MartinEnder Yes! I am hoping someone will come up with something better than the current super naive effort. Or at least code it in nim :) – None – 2016-07-08T12:15:54.797

1If you only want us to consider only {0,1} -strings, please explicitly state that. Otherwise the alphabet could possibly be infinite, and I don't see why your testcases should be optimal, because it seems you only searched {0,1} strings too. – flawr – 2016-07-11T15:24:37.553

@flawr I have edited the question along the lines you suggested. The alphabet can't be any larger than the length of the string of course. – None – 2016-07-11T15:27:32.400

3@flawr, I searched strings over a ternary alphabet for n up to 12 and it never beat the binary alphabet. Heuristically I would expect that a binary string should be optimal, because adding more characters increases the minimum length of a run. – Peter Taylor – 2016-07-12T14:10:35.560

1In the optimum results above you have "12 7 001001010010" but my code pumps out "12 8 110110011011" where period 1 runs are (11, 11, 00, 11, 11), period 3 runs are (110110, 011011) and there's a period 4 run (01100110) -- where am I going wrong in my run counting? – cdlane – 2016-07-19T04:35:59.343

@cdlane I think you are right there is a bug in orlp's runs function. Can you provide answers for n =2..22 instead so I can put them in the question or a working runs function I can use by any chance please? – None – 2016-07-19T10:20:35.180

The example optima for n = 5, n = 13 are wrong; see comments on cdlane’s answer. I am confident in the optima posted on my answer; I’ve verified them against a dumb Python implementation for n ≤ 21. – Anders Kaseorg – 2016-07-20T11:28:44.093

1@cdlane 0000 has one run. Consider the period of 000... it is always 1 no matter how many zeros. – None – 2016-07-20T16:15:21.880

I now readily accept that there is only one period 1 run in '0000', but how does that negate the period 2 run that it also contains? – cdlane – 2016-07-20T16:45:03.197

1Refer back to the definition . Let per(w) denote the size of the smallest period of w . There is no substring with smallest period length 2. – None – 2016-07-20T17:11:14.263

Answers

9

C

This does a recursive search for optimal solutions, heavily pruned using an upper bound on the number of runs that could be finished by the unknown remainder of the string. The upper bound computation uses a gigantic lookup table whose size is controlled by the constant L (L=11: 0.5 GiB, L=12: 2 GiB, L=13: 8 GiB).

On my laptop, this goes up through n = 50 in 100 seconds; the next line comes at 142 seconds.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N (8*sizeof(unsigned long long))
#define L 13
static bool a[N], best_a[N];
static int start[N/2 + 1], best_runs;
static uint8_t end_runs[2 << 2*L][2], small[N + 1][2 << 2*L];

static inline unsigned next_state(unsigned state, int b)
{
    state *= 2;
    state += b;
    if (state >= 2 << 2*L) {
        state &= ~(2 << 2*L);
        state |= 1 << 2*L;
    }
    return state;
}

static void search(int n, int i, int runs, unsigned state)
{
    if (i == n) {
        int r = runs;
        unsigned long long m = 0;
        for (int p = n / 2; p > 0; p--) {
            if (i - start[p] >= 2*p && !(m & 1ULL << start[p])) {
                m |= 1ULL << start[p];
                r++;
            }
        }
        if (r > best_runs) {
            best_runs = r;
            memcpy(best_a, a, n*sizeof(a[0]));
        }
    } else {
        a[i] = false;
        do {
            int r = runs, bound = 0, saved = 0, save_p[N/2], save_start[N/2], p, s = next_state(state, a[i]);
            unsigned long long m = 0;
            for (p = n/2; p > i; p--)
                if (p > L)
                    bound += (n - p + 1)/(p + 1);
            for (; p > 0; p--) {
                if (a[i] != a[i - p]) {
                    if (i - start[p] >= 2*p && !(m & 1ULL << start[p])) {
                        m |= 1ULL << start[p];
                        r++;
                    }
                    save_p[saved] = p;
                    save_start[saved] = start[p];
                    saved++;
                    start[p] = i + 1 - p;
                    if (p > L)
                        bound += (n - i)/(p + 1);
                } else {
                    if (p > L)
                        bound += (n - (start[p] + p - 1 > i - p ? start[p] + p - 1 : i - p))/(p + 1);
                }
            }
            bound += small[n - i - 1][s];

            if (r + bound > best_runs)
                search(n, i + 1, r, s);
            while (saved--)
                start[save_p[saved]] = save_start[saved];
        } while ((a[i] = !a[i]));
    }
}

int main()
{
    for (int n = 0; n <= N; n++) {
        if (n <= 2*L) {
            for (unsigned state = 1U << n; state < 2U << n; state++) {
                for (int b = 0; b < 2; b++) {
                    int r = 0;
                    unsigned long long m = 0;
                    for (int p = n / 2; p > 0; p--) {
                        if ((b ^ state >> (p - 1)) & 1) {
                            unsigned k = state ^ state >> p;
                            k &= -k;
                            k <<= p;
                            if (!(k & ~(~0U << n)))
                                k = 1U << n;
                            if (!((m | ~(~0U << 2*p)) & k)) {
                                m |= k;
                                r++;
                            }
                        }
                    }
                    end_runs[state][b] = r;
                }
                small[0][state] = end_runs[state][0] + end_runs[state][1];
            }
        }

        for (int l = 2*L < n - 1 ? 2*L : n - 1; l >= 0; l--) {
            for (unsigned state = 1U << l; state < 2U << l; state++) {
                int r0 = small[n - l - 1][next_state(state, 0)] + end_runs[state][0],
                    r1 = small[n - l - 1][next_state(state, 1)] + end_runs[state][1];
                small[n - l][state] = r0 > r1 ? r0 : r1;
            }
        }

        if (n >= 2) {
            search(n, 1, 0, 2U);
            printf("%d %d ", n, best_runs);
            for (int i = 0; i < n; i++)
                printf("%d", best_a[i]);
            printf("\n");
            fflush(stdout);
            best_runs--;
        }
    }
    return 0;
}

Output:

$ gcc -mcmodel=medium -O2 runs.c -o runs
$ ./runs
2 1 00
3 1 000
4 2 0011
5 2 00011
6 3 001001
7 4 0010011
8 5 00110011
9 5 000110011
10 6 0010011001
11 7 00100110011
12 8 001001100100
13 8 0001001100100
14 10 00100110010011
15 10 000100110010011
16 11 0010011001001100
17 12 00100101101001011
18 13 001001100100110011
19 14 0010011001001100100
20 15 00101001011010010100
21 15 000101001011010010100
22 16 0010010100101101001011
23 17 00100101001011010010100
24 18 001001100100110110011011
25 19 0010011001000100110010011
26 20 00101001011010010100101101
27 21 001001010010110100101001011
28 22 0010100101101001010010110100
29 23 00101001011010010100101101011
30 24 001011010010110101101001011010
31 25 0010100101101001010010110100101
32 26 00101001011010010100101101001011
33 27 001010010110100101001011010010100
34 27 0001010010110100101001011010010100
35 28 00100101001011010010100101101001011
36 29 001001010010110100101001011010010100
37 30 0010011001001100100010011001001100100
38 30 00010011001001100100010011001001100100
39 31 001001010010110100101001011010010100100
40 32 0010010100101101001010010110100101001011
41 33 00100110010001001100100110010001001100100
42 35 001010010110100101001011010110100101101011
43 35 0001010010110100101001011010110100101101011
44 36 00101001011001010010110100101001011010010100
45 37 001001010010110100101001011010110100101101011
46 38 0010100101101001010010110100101001011010010100
47 39 00101101001010010110100101101001010010110100101
48 40 001010010110100101001011010010110101101001011010
49 41 0010100101101001010010110100101101001010010110100
50 42 00101001011010010100101101001011010110100101101011
51 43 001010010110100101001011010110100101001011010010100

Here are all optimal sequences for n ≤ 64 (not just the lexicographically first), generated by a modified version of this program and many hours of computation.

A remarkable nearly-optimal sequence

The prefixes of the infinite fractal sequence

1010010110100101001011010010110100101001011010010100101…

that is invariant under the transformation 101 ↦ 10100, 00 ↦ 101:

  101   00  101   101   00  101   00  101   101   00  101   101   00  …
= 10100 101 10100 10100 101 10100 101 10100 10100 101 10100 10100 101 …

seem to have very nearly optimal numbers of runs—always within 2 of optimal for n ≤ 64. The number of runs in the first n characters divided by n approaches (13 − 5√5)/2 ≈ 0.90983. But it turns out this is not the optimal ratio—see comments.

Anders Kaseorg

Posted 2016-07-06T08:49:46.147

Reputation: 29 242

Thank you the answer and your corrections. What do you think the prospects are for a non brute force solution ? – None – 2016-07-20T15:38:55.440

1

@Lembik I don’t know. I think my current solution is somewhat faster than o(2^N) given enough memory, but it’s still exponential. I haven’t found a direct formula that completely skips the search process, but one could exist. I conjecture that the Thue–Morse sequence is asymptotically optimal with N⋅5/6 − O(log N) runs, but it seems to stay a handful of runs behind the actual optimum.

– Anders Kaseorg – 2016-07-21T05:03:38.947

Interestingly, 42/50 > 5/6. – None – 2016-07-21T21:50:23.033

1@Lembik One should always expect asymptotic predictions to sometimes be beaten by small amounts. But actually I was completely wrong—I found a much better sequence that seems to be approaching N⋅(13 − 5√5)/2 ≈ N⋅0.90983 runs. – Anders Kaseorg – 2016-07-21T23:55:10.400

Very impressive. I think the 0.90983 conjecture is not right however. Check out https://bpaste.net/show/287821dc7214 . It has length 1558 and has 1445 runs.

– None – 2016-07-23T06:51:15.037

@Lembik That’s a fascinating sequence. Most of it consists of a repeated block of length 492 (echo edfcff|sed 's/f/ded/g;s/e/cdb/g;s/d/abc/g;s/c/babb0b/g;s/b/a11/g;s/a/010/g'), and just repeating that block arbitrarily often gives a runs-to-length ratio approaching 77/82 ≈ 0.93902. – Anders Kaseorg – 2016-07-23T08:20:16.157

Check my algorithm at stackoverflow with that it takes <1s to compute the first 64 solutions. Here is the output for 100 after 31.2 seconds: 100 88 31.2 0010100101101001010010110100101000100101001011010010100101101001010001010010110100101001011010010100

– bebbo – 2016-11-10T18:27:50.883

2

Since it's not a race if there's only one horse, I'm submitting my solution though it's only a fraction the speed of Anders Kaseorg's and a only a third as cryptic. Compile with:

gcc -O2 run-count.c -o run-count

The heart of my algorithm is a simple shift and XOR scheme:

enter image description here

A run of zeros in the XOR result that's greater than or equal to the current period/shift indicates a run in the original string for this period. From this you can tell how long the run was and where it starts and ends. The rest of the code is overhead, setting up the situation and decoding the results.

I'm hoping it will make at least 28 after two minutes on Lembik's machine. (I wrote a pthread version, but only managed to make it run even slower.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

enum { START = 0, WIDTH } ;

// Compare and shuffle just one thing while storing two
typedef union {
    uint16_t token;
    uint8_t data[sizeof(uint16_t)];
} overlay_t;

#define SENTINAL (0)  // marks the end of an array of overlay_t

#define NUMBER_OF_BITS (8 * sizeof(uint64_t))

void period_runs(uint64_t xor_bits, uint8_t nbits, uint8_t period, overlay_t *results) {

    overlay_t *results_ptr = results;
    uint8_t count = 0;

    for (uint8_t position = 0; position < nbits; position++) {

        if (xor_bits & 1ULL) {

            if ((nbits - position) < period) {
                break;  // no room left to succeed further
            }

            if (count >= period) {  // we found a run

                results_ptr->data[START] = position - (count - 1);
                results_ptr->data[WIDTH] = period + count;
                results_ptr++;
            }

            count = 0;
        } else {

            count++;
        }

        xor_bits >>= 1;
    }

    if (count >= period) {  // process the final run, if any

        results_ptr->data[START] = 0;
        results_ptr->data[WIDTH] = period + count;
        results_ptr++;
    }

    results_ptr->token = SENTINAL;
}

void number_runs(uint64_t number, uint8_t bit_length, overlay_t *results) {

    overlay_t sub_results[bit_length];
    uint8_t limit = bit_length / 2 + 1;
    uint64_t mask = (1ULL << (bit_length - 1)) - 1;

    overlay_t *results_ptr = results;
    results_ptr->token = SENTINAL;

    for (uint8_t period = 1; period < limit; period++) {

        uint64_t xor_bits = mask & (number ^ (number >> period));  // heart of the code
        period_runs(xor_bits, bit_length - period, period, sub_results);

        for (size_t i = 0; sub_results[i].token != SENTINAL; i++) {

            bool stop = false;  // combine previous and current results

            for (size_t j = 0; !stop && results[j].token != SENTINAL; j++) {

                // lower period result disqualifies higher period result over the same span 
                stop = (sub_results[i].token == results[j].token);
            }

            if (!stop) {

                (results_ptr++)->token = sub_results[i].token;
                results_ptr->token = SENTINAL;
            }
        }

        mask >>= 1;
    }
}

int main() {

    overlay_t results[NUMBER_OF_BITS];

    for (uint8_t bit_length = 2; bit_length < 25; bit_length++) {

        int best_results = -1;
        uint64_t best_number = 0;

        for (uint64_t number = 1ULL << (bit_length - 1); number < (1ULL << bit_length); number++) {

            // from the discussion comments, I should be able to solve this
            // with just bit strings that begin "11...", so toss the rest
            if ((number & (1ULL << (bit_length - 2))) == 0ULL) {

                continue;
            }

            uint64_t reversed = 0;

            for (uint8_t i = 0; i < bit_length; i++) {

                if (number & (1ULL << i)) {

                    reversed |= (1ULL << ((bit_length - 1) - i));
                }
            }

            if (reversed > number) {

                continue;  // ~ 1/4 of bit_strings are simply reversals, toss 'em
            }

            number_runs(number, bit_length, results);
            overlay_t *results_ptr = results;
            int count = 0;

            while ((results_ptr++)->token != SENTINAL) {

                count++;
            }

            if (count > best_results) {

                best_results = count;
                best_number = number;
            }
        }

        char *best_string = malloc(bit_length + 1);
        uint64_t number = best_number;
        char *string_ptr = best_string;

        for (int i = bit_length - 1; i >= 0; i--) {

            *(string_ptr++) = (number & (1ULL << i)) ? '1' : '0';
        }

        *string_ptr = '\0';

        printf("%u %d %s\n", bit_length, best_results, best_string);

        free(best_string);
    }

    return 0;
}

Output:

> gcc -O2 run-count.c -o run-count
> ./run-count
2 1 11
3 1 110
4 2 1100
5 2 11000
6 3 110011
7 4 1100100
8 5 11001100
9 5 110010011
10 6 1100110011
11 7 11001100100
12 8 110110011011
13 8 1100100010011
14 10 11001001100100
15 10 110010011001000
16 11 1100100110010011
17 12 11001100100110011
18 13 110011001001100100
19 14 1101001011010010100
20 15 11010110100101101011
21 15 110010011001001100100
22 16 1100101001011010010100
23 17 11010010110100101001011
24 18 110100101001011010010100
25 19 1100100110010001001100100
26 20 11010010100101101001010010
27 21 110010011001000100110010011
28 22 1101001010010110100101001011

cdlane

Posted 2016-07-06T08:49:46.147

Reputation: 351

Welcome second horse! Small query, why do you and the other answer suggest -O2 instead of -O3? – None – 2016-07-23T06:37:29.547

@Lembik, with -O2 optimization, I could measure a time difference running the code but I couldn't measure any additional with -O3. Since we're supposedlly trading off safetly for speed, I figured the highest level that actually made a difference was best. If you believe my code will rank higher with -O3, go for it! – cdlane – 2016-07-24T15:30:16.757

-O3 isn't intended to be "unsafe". It enables auto-vectorization, but there's probably nothing to vectorize here. It can sometimes make code slower, e.g. if it uses a branchless cmov for something where a branch would have predicted very well. But usually it should help. It's usually worth trying clang as well, to see which of gcc or clang make better code for a specific loop. Also, it almost always helps to use -march=native, or at least -mtune=native if you still want a binary that runs anywhere. – Peter Cordes – 2016-12-16T22:03:19.933