Restricted memory optimization

9

The edit (or Levenshtein) distance between two strings is the minimal number of single character insertions, deletions and substitutions needed to transform one string into the other. If the two strings have length n each, it is well known that this can be done in O(n^2) time by dynamic programming. The following Python code performs this calculation for two strings s1 and s2.

def edit_distance(s1, s2):
    l1 = len(s1)
    l2 = len(s2)

    matrix = [range(l1 + 1)] * (l2 + 1)
    for zz in range(l2 + 1):
      matrix[zz] = range(zz,zz + l1 + 1)
    for zz in range(0,l2):
      for sz in range(0,l1):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
        else:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
    return matrix[l2][l1]

In this task you have to get as close are you can to computing the edit distance but with a severe memory restriction. Your code is allowed to define one array containing 1000 32-bit integers and this is to be the only temporary storage you use in your computation. All variables and data structures are to be contained in this array. In particular, you would not be able to implement the algorithm above as for strings of length 1000 as it would require you to store at least 1,000,000 numbers. Where your language doesn't naturally have 32 bit integers (for example Python) you simply need to make sure you never store a number larger than 2^32-1 in the array.

You may read in the data using any standard library of your choice without worrying about the memory restrictions in that part. In order to make the competition fair for the main part of your code, you may only use operations that are functionally equivalent to those in the C programming language and cannot use any external libraries.

To be extra clear, the memory to store the input data or used by your language's interpreter, JVM etc. does not count towards your limit and you may not write anything to disk. You must assume the input data is read-only when in memory so you can't reuse that to gain more working space.

What do I have to implement?

Your code should read in a file in the following format. It will have three lines. The first line is the true edit distance. The second is string 1 and the third is string 2. I will test it with the sample data at https://bpaste.net/show/6905001d52e8 where the strings have length 10,000 but it shouldn't be specialised for this data. It should output the smallest edit distance it can find between the two strings.

You will also need to prove your edit distance actually comes from a valid set of edits. Your code should have a switch which turns it into a mode that may use more memory (as much as you like) and outputs the edit operations that give your edit distance.

Score

Your score will be the (optimal edit distance/divided by the edit distance you find) * 100. To start things off, notice that you can get a score by just counting the number of mismatches between the two strings.

You can use any language you like which is freely available and easy to install in Linux.

Tie break

In the case of a tie-break, I will run your code on my Linux machine and the fastest code wins.

user9206

Posted 2014-09-11T10:32:03.950

Reputation:

https://bpaste.net/show/6905001d52e8 gives me a 404 page! – sergiol – 2017-02-24T00:25:15.533

Would for(int i=0;i<=5;i++) be allowed because it's storing data in i? – Beta Decay – 2014-09-14T06:19:08.100

2@BetaDecay Yes although to follow the rules more closely you would do something like { uint32_t foo[1000]; for (foo[0] = 0; foo[0] < 5; ++foo[0]) printf("%d ", foo[0]); } This is assuming your array of 32 bit integers will be called foo. – None – 2014-09-14T06:51:43.217

What is the point of having the true edit distance in the file? Is the program actually supposed to read it? Or (what seems more sensible) is it just in there for you to see how successful the program was? – feersum – 2014-09-14T23:08:17.607

@feersum Exactly. It's just there so you can see what your score is easily. – None – 2014-09-15T05:48:50.890

Answers

4

C++, Score 92.35

Estimation Algorithm: The algorithm finds the first place the two string differ, and then tries all possible N operation permutations (insert, delete, replace - characters that match are skipped without consuming an operation). It scores each possible set of operations based on how much farther that set of operations successfully matches the two strings, plus how much it causes the string lengths to converge. After determining the highest-scoring set of N operations, the first operation in the set is applied, the next mismatch is found, and the process repeats until the end of the string is reached.

The program tries all values of N from 1-10 and selects the level that gave the best results. N=10 is generally the best now that the scoring method takes the string length into consideration. Higher values of N would probably be even better, but take exponentially more time.

Memory Usage: Since the program is purely iterative, it needs very little memory. Only 19 variables are used to track the program state. These are set by #defines to act as global variables.

Usage: The program is used the same as feersum's: the first parameter is assumed to be the file, and any additional parameters indicate that the edits should be shown. The program always prints the estimated edit distance, and the score.

Verification Output: The verification output it formatted in three rows:

11011111100101100111100110100 110 0 0000   0 01101
R I          IR     R        D   D D    DDD D     D
01 1111110010 0001110001101000110101000011101011010

The top row is the target string, the middle is the operations, and the bottom is the string being edited. Spaces in the operation line indicate that the characters match. 'R' indicates that the edit string has it's character in that position replaced with the target string's character. 'I' indicates that the edit string has the target string's character inserted at that position. 'D' indicates that the edit string has it's character in that position deleted. The edit and target strings have spaces inserted when the other has a character inserted or deleted so they line up.

#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <math.h>
#include <fstream>

int memory[1000];
#define first (*(const char **)&memory[0])
#define second (*(const char **)&memory[1])
#define block_ia memory[2]
#define block_ib memory[3]
#define block_n memory[4]
#define block_op memory[5]
#define block_o memory[6]
#define block_x memory[7]
#define n memory[8]
#define opmax memory[9]
#define best_op memory[10]
#define best_score memory[11]
#define score memory[12]
#define best_counter memory[13]
#define la memory[14]
#define lb memory[15]
#define best memory[16]
#define bestn memory[17]
#define total memory[18]

// verification variables
char printline1[0xffff]={};
char *p1=printline1;
char printline2[0xffff]={};
char *p2=printline2;
char printline3[0xffff]={};
char *p3=printline3;


// determine how many characters match after a set of operations
int block(){
    block_ia=0;
    block_ib=0;
    for ( block_x=0;block_x<block_n;block_x++){
        block_o = block_op%3;
        block_op /= 3;
        if ( block_o == 0 ){ // replace
            block_ia++;
            block_ib++;
        } else if ( block_o == 1 ){ // delete
            block_ib++;
        } else { // insert
            if ( first[block_ia] ){ 
                block_ia++;
            }
        }
        while ( first[block_ia] && first[block_ia]==second[block_ib] ){ // find next mismatch
            block_ia++;
            block_ib++;
        }
        if ( first[block_ia]==0 ){
            return block_x;
        }
    }
    return block_n;
}

// find the highest-scoring set of N operations for the current string position
void bestblock(){
    best_op=0;
    best_score=0;
    la = strlen(first);
    lb = strlen(second);
    block_n = n;
    for(best_counter=0;best_counter<opmax;best_counter++){
        block_op=best_counter;
        score = n-block();
        score += block_ia-abs((la-block_ia)-(lb-block_ib));
        if ( score > best_score ){
            best_score = score;
            best_op = best_counter;
        }
    }
}

// prepare edit confirmation record
void printedit(const char * a, const char * b, int o){
    o%=3;
    if ( o == 0 ){ // replace
        *p1 = *a;
        if ( *b ){
            *p2 = 'R';
            *p3 = *b;
            b++;
        } else {
            *p2 = 'I';
            *p3 = ' ';
        }
        a++;
    } else if ( o == 1 ){ // delete
        *p1 = ' ';
        *p2 = 'D';
        *p3 = *b;
        b++;
    } else { // insert
        *p1 = *a;
        *p2 = 'I';
        *p3 = ' ';
        a++;
    }
    p1++;
    p2++;
    p3++;
    while ( *a && *a==*b ){
        *p1 = *a;
        *p2 = ' ';
        *p3 = *b;
        p1++;
        p2++;
        p3++;
        a++;
        b++;
    }
}


int main(int argc, char * argv[]){

    if ( argc < 2 ){
        printf("No file name specified\n");
        return 0;
    }

    std::ifstream file(argv[1]);
    std::string line0,line1,line2;
    std::getline(file,line0);
    std::getline(file,line1);
    std::getline(file,line2);

    // begin estimating Levenshtein distance
    best = 0;
    bestn = 0;
    for ( n=1;n<=10;n++){ // n is the number of operations that can be in a test set
        opmax = (int)pow(3.0,n);
        first = line1.c_str();
        second = line2.c_str();
        while ( *first && *first == *second ){
            first++;
            second++;
        }
        total=0;
        while ( *first && *second ){
            bestblock();
            block_n=1;
            block_op=best_op;
            block();
            total ++;
            first += block_ia;
            second += block_ib;
        }
        // when one string is exhausted, all following ops must be insert or delete
        while(*second){
            total++;
            second++;
        }
        while(*first){
            total++;
            first++;
        }
        if ( !best || total < best ){
            best = total;
            bestn = n;
        }
    }
    // done estimating Levenshtein distance

    // dump info to prove the edit distance actually comes from a valid set of edits
    if ( argc >= 3 ){
        p1 = printline1;
        p2 = printline2;
        p3 = printline3;
        n = bestn;
        opmax = (int)pow(3.0,n);
        first = line1.c_str();
        second = line2.c_str();
        while ( *first && *first == *second ){
            *p1 = *first;
            *p2 = ' ';
            *p3 = *second;
            p1++;
            p2++;
            p3++;
            first++;
            second++;
        }
        while ( *first && *second){
            bestblock();
            block_n=1;
            block_op=best_op;
            block();
            printedit(first,second,best_op);
            first += block_ia;
            second += block_ib;
        }
        while(*second){
            *p1=' ';
            *p2='D';
            *p3=*second;
            p1++;
            p2++;
            p3++;
            second++;
        }
        while(*first){
            *p1=*first;
            *p2='I';
            *p3=' ';
            p1++;
            p2++;
            p3++;
            first++;
        }

        p1 = printline1;
        p2 = printline2;
        p3 = printline3;
        int ins=0;
        int del=0;
        int rep=0;
        while ( *p1 ){
            int a;
            for ( a=0;a<79&&p1[a];a++)
                printf("%c",p1[a]);
            printf("\n");
            p1+=a;
            for ( a=0;a<79&&p2[a];a++){
                ins += ( p2[a] == 'I' );
                del += ( p2[a] == 'D' );
                rep += ( p2[a] == 'R' );
                printf("%c",p2[a]);
            }
            printf("\n");
            p2+=a;
            for ( a=0;a<79&&p3[a];a++)
                printf("%c",p3[a]);
            printf("\n\n");
            p3+=a;
        }
        printf("Best N=%d\n",bestn);
        printf("Inserted = %d, Deleted = %d, Replaced=%d, Total = %d\nLength(line1)=%d, Length(Line2)+ins-del=%d\n",ins,del,rep,ins+del+rep,line1.length(),line2.length()+ins-del);
    }

    printf("%d, Score = %0.2f\n",best,2886*100.0/best);
    system("pause");
    return 0;
}

Sir_Lagsalot

Posted 2014-09-11T10:32:03.950

Reputation: 4 898

7

C++   75.0

The program is designed to work with arbitrary text strings. They can be of any different lengths as long as neither exceeds 13824 characters. It uses 1,897 16-bit integers, which is equivalent to 949 32-bit integers. At first I was writing it in C, but then realized there was no function for reading a line.

The first command-line argument should be a filename. If a second argument exists, a summary of the edits is printed. The first line in the file is ignored while the second and third are the strings.

The algorithm is a doubly blocked version of the usual algorithm. It performs basically the same number of operations, but is of course much less accurate, since if a common subsequence gets split over the edge of a block, much of the potential savings are lost.

#include <cstring>
#include <inttypes.h>
#include <iostream>
#include <fstream>

#define M 24
#define MAXLEN (M*M*M)
#define SETMIN(V, X) if( (X) < (V) ) { (V) = (X); }
#define MIN(X, Y) ( (X) < (Y) ? (X) : (Y) )

char A[MAXLEN+1], B[MAXLEN+1];
uint16_t d0[M+1][M+1], d1[M+1][M+1], d2[M+1][M+1];

int main(int argc, char**argv)
{

    if(argc < 2)
        return 1;

    std::ifstream fi(argv[1]);

    std::string Astr, Bstr;
    for(int i = 3; i--;)
        getline(fi, i?Bstr:Astr);
    if(!fi.good()) {
        printf("Error reading file");
        return 5;
    }
    if(Astr.length() > MAXLEN || Bstr.length() > MAXLEN) {
        printf("String too long");
        return 7;
    }

    strcpy(A, Astr.c_str());
    strcpy(B, Bstr.c_str());

    uint16_t lA = Astr.length(), lB = Bstr.length();
    if(!lA || !lB) {
        printf("%d\n", lA|lB);
        return 0;
    }
    uint16_t nbA2, nbB2, bA2, bB2, nbA1, nbB1, bA1, bB1, nbA0, nbB0, bA0, bB0; //block, number of blocks
    uint16_t iA2, iB2, iA1, iB1, jA2, jB2, jA1, jB1; //start, end indices of block

    nbA2 = MIN(M, lA);
    nbB2 = MIN(M, lB);
    for(bA2 = 0; bA2 <= nbA2; bA2++) {
        iA2 = lA * (bA2-1)/nbA2,  jA2 = lA * bA2/nbA2;
        for(bB2 = 0; bB2 <= nbB2; bB2++) {
            if(!(bA2|bB2)) {
                d2[0][0] = 0;
                continue;
            }
            iB2 = lB * (bB2-1)/nbB2,  jB2 = lB * bB2/nbB2;
            d2[bA2][bB2] = ~0;
            if(bB2)
                SETMIN(d2[bA2][bB2], d2[bA2][bB2-1] + (jB2-iB2));
            if(bA2)
                SETMIN(d2[bA2][bB2], d2[bA2-1][bB2] + (jA2-iA2));

            if(bA2 && bB2) {
                nbA1 = MIN(M, jA2-iA2);
                nbB1 = MIN(M, jB2-iB2);
                for(bA1 = 0; bA1 <= nbA1; bA1++) {
                    iA1 = iA2 + (jA2-iA2) * (bA1-1)/nbA1, jA1 = iA2 + (jA2-iA2) * bA1/nbA1;
                    for(bB1 = 0; bB1 <= nbB1; bB1++) {
                        if(!(bA1|bB1)) {
                            d1[0][0] = 0;
                            continue;
                        }
                        iB1 = iB2 + (jB2-iB2) * (bB1-1)/nbB1, jB1 = iB2 + (jB2-iB2) * bB1/nbB1;
                        d1[bA1][bB1] = ~0;
                        if(bB1)
                            SETMIN(d1[bA1][bB1], d1[bA1][bB1-1] + (jB1-iB1));
                        if(bA1)
                            SETMIN(d1[bA1][bB1], d1[bA1-1][bB1] + (jA1-iA1));

                        if(bA1 && bB1) {
                            nbA0 = jA1-iA1;
                            nbB0 = jB1-iB1;
                            for(bA0 = 0; bA0 <= nbA0; bA0++) {
                                for(bB0 = 0; bB0 <= nbB0; bB0++) {
                                    if(!(bA0|bB0)) {
                                        d0[0][0] = 0;
                                        continue;
                                    }
                                    d0[bA0][bB0] = ~0;
                                    if(bB0)
                                        SETMIN(d0[bA0][bB0], d0[bA0][bB0-1] + 1);
                                    if(bA0)
                                        SETMIN(d0[bA0][bB0], d0[bA0-1][bB0] + 1);
                                    if(bA0 && bB0)
                                        SETMIN(d0[bA0][bB0], d0[bA0-1][bB0-1] + (A[iA1 + nbA0 - 1] != B[iB1 + nbB0 - 1]));
                                }
                            }
                            SETMIN(d1[bA1][bB1], d1[bA1-1][bB1-1] + d0[nbA0][nbB0]);
                        }
                    }
                }

                SETMIN(d2[bA2][bB2], d2[bA2-1][bB2-1] + d1[nbA1][nbB1]);
            }
        }
    }
    printf("%d\n", d2[nbA2][nbB2]);

    if(argc == 2)
        return 0;

    int changecost, total = 0;
    for(bA2 = nbA2, bB2 = nbB2; bA2||bB2; ) {
        iA2 = lA * (bA2-1)/nbA2,  jA2 = lA * bA2/nbA2;
        iB2 = lB * (bB2-1)/nbB2,  jB2 = lB * bB2/nbB2;
        if(bB2 && d2[bA2][bB2-1] + (jB2-iB2) == d2[bA2][bB2]) {
            total += changecost = (jB2-iB2);
            char tmp = B[jB2];
            B[jB2] = 0;
            printf("%d %d deleted {%s}\n", changecost, total, B + iB2);
            B[jB2] = tmp;
            --bB2;
        } else if(bA2 && d2[bA2-1][bB2] + (jA2-iA2) == d2[bA2][bB2]) {
            total += changecost = (jA2-iA2);
            char tmp = B[jA2];
            A[jA2] = 0;
            printf("%d %d inserted {%s}\n", changecost, total, A + iA2);
            A[jA2] = tmp;
            --bA2;
        } else {
            total += changecost = d2[bA2][bB2] - d2[bA2-1][bB2-1];
            char tmpa = A[jA2], tmpb = B[jB2];
            B[jB2] = A[jA2] = 0;
            printf("%d %d changed {%s} to {%s}\n", changecost, total, B + iB2, A + iA2);
            A[jA2] = tmpa, B[jB2] = tmpb;
            --bA2, --bB2;
        }
    }


    return 0;
}

feersum

Posted 2014-09-11T10:32:03.950

Reputation: 29 566

Thank you for being the first answerer! What is your score? – None – 2014-09-15T05:50:18.477

@Lembik OK, I have calculated the score, assuming it is based only on the one example. – feersum – 2014-09-15T09:46:14.993

This is great. Do you think it's possible to get a much higher score? – None – 2014-09-15T16:06:38.917

3

Python, 100

I did manage to calculate the edit distance perfectly in the allotted memory limit. Sadly, this entry violates two rules of the challenge, in letter if not in spirit.

First, I have not actually stored my data in 1000 32-bit ints. For 10000-character strings, my program creates two 10000-element arrays that will only contain +1, 0, or -1. At 1.585 bits per ternary number, it would be possible to pack those 20000 trits into 31700 bits, leaving 300 bits as more than enough for my 7 remaining 16-bit integers.

Second, I have not implemented the required mode for showing edits. I have, alternately, implemented a mode that prints out the full edit matrix. It is absolutely possible to calculate the edit path from that matrix, but I don't have time right now to implement it.

#!/usr/bin/env python

import sys

# algorithm originally from
# https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows

print_rows = False
if len(sys.argv) > 2:
    print_rows = True

def LevenshteinDistance(s, t):
    # degenerate cases
    if s == t:
        return 0
    if len(s) == 0:
        return len(t)
    if len(t) == 0:
        return len(s)

    # create two work vectors of integer distance deltas

    # these lists will only ever contain +1, 0, or -1
    # so they COULD be packed into 1.585 bits each
    # 15850 bits per list, 31700 bits total, leaving 300 bits for all the other variables

    # d0 is the previous row
    # initialized to 0111111... which represents 0123456...
    d0 = [1 for i in range(len(t)+1)]
    d0[0] = 0        
    if print_rows:
        row = ""
        for i in range(len(t)+1):
            row += str(i) + ", "
        print row

    # d1 is the row being calculated
    d1 = [0 for i in range(len(t)+1)]

    for i in range(len(s)-1):
        # cummulative values of cells north, west, and northwest of the current cell
        left = i+1
        upleft = i
        up = i+d0[0]
        if print_rows:
            row = str(left) + ", "
        for j in range(len(t)):
            left += d1[j]
            up += d0[j+1]
            upleft += d0[j]
            cost = 0 if (s[i] == t[j]) else 1
            d1[j + 1] = min(left + 1, up + 1, upleft + cost) - left
            if print_rows:
                row += str(left+d1[j+1]) + ", "

        if print_rows:
            print row

        for c in range(len(d0)):
            d0[c] = d1[c]

    return left+d1[j+1]

with open(sys.argv[1]) as f:
    lines = f.readlines()

perfect = lines[0]
string1 = lines[1]
string2 = lines[2]
distance = LevenshteinDistance(string1,string2)
print "edit distance: " + str(distance)
print "score: " + str(int(perfect)*100/distance) + "%"

example input:

2
101100
011010

example verbose output:

0, 1, 2, 3, 4, 5, 6,
1, 1, 1, 2, 3, 4, 5,
2, 1, 2, 2, 2, 3, 4,
3, 2, 1, 2, 3, 2, 3,
4, 3, 2, 1, 2, 3, 3,
5, 4, 3, 2, 1, 2, 3,
6, 5, 4, 3, 2, 2, 2,
edit distance: 2
score: 100%

Sparr

Posted 2014-09-11T10:32:03.950

Reputation: 5 758