Compress data with context free grammars

9

It is possible to compress some kinds of data, such as human text or source code, with straight-line grammars. You basically create a grammar whose language has exactly one word – the uncompressed data. In this task, you have to write a program that implements this method of data compession.

Input

The input is a string of not more than 65535 bytes length. It is guaranteed, that the input matches the regular expression [!-~]+ (i.e. at least one printable ASCII character excluding whitespace).

An example input is

abcabcbcbcabcacacabcabab

Output

The output is a set of rules that form a grammar that describes exactly one word (the input). Each nonterminal is denoted by a decimal number greater than 9. The start symbol is symbol number ten. An example output corresponding to the example input is given below; its syntax is described further below:

10=11 11 12 12 11 13 13 11 14 14
11=a 12
12=b c
13=a c
14=a b

Each rule has the form <nonterminal>=<symbol> <symbol> ... with an arbitrary whitespace-separated number of symbols on the right side. Each output that obeys the following restrictions and derives exactly the input string is valid.

Restrictions

In order to stop people from doing strange things, a number of restrictions are taking place:

  • Each nonterminal must appear at least twice on the right side of a rule. For instance, the following grammar for the input abcabc is illegal because rule 12 appears only once:

    10=12
    11=a b c
    12=11 11
    
  • No sequence of two adjacent symbols may appear more than once in all right-hand sides of all rules, except if they overlap. For instance, the following grammar for the input abcabcbc is illegal since the sequence bc appears twice:

    10=11 11 b c
    11=a b c
    

    A valid grammar would be:

    10=11 11 12
    11=a 12
    12=b c
    
  • Your program must terminate in less than one minute for each valid input that is not longer than 65535 bytes.

  • As usual, you may not use any facility of your language or any library function that makes the solution trivial or implements a large part of it.

Sample input

Generate sample input with the following C program.

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
  unsigned int i,j = 0,k;

  if (argc != 3
     || 2 != sscanf(argv[1],"%u",&i)
      + sscanf(argv[2],"%u",&k)) {
    fprintf(stderr,"Usage: %s seed length\n",argv[0]);
    return EXIT_FAILURE;
  }

  srand(i);

  while(j < k) {
    i = rand() & 0x7f;
    if (i > 34 && i != 127) j++, putchar(i);
  }

  return EXIT_SUCCESS;
}

The sample input generated by the program above will usually not result in good compression results. Consider using human text or source code as example input.

Winning criteria

This is code golf; the program with the shortest source code wins. For extra credit, write a program that reconstructs the input from the output.

FUZxxl

Posted 2012-10-28T21:00:35.447

Reputation: 9 656

Ha ha. I already have a few versions of this implemented (but not golfed) in Java for the [tag:kolmogorov-complexity] questions... – Peter Taylor – 2012-10-28T23:18:35.377

@PeterTaylor Which questions exactly? – FUZxxl – 2012-10-29T17:41:00.237

It doesn't necessarily find a short enough answer to be worth posting (I'm adding grammar generating strategies and grammar engines slowly), but the core script in that project tries them out on http://codegolf.stackexchange.com/questions/1682 , http://codegolf.stackexchange.com/questions/6043 , http://codegolf.stackexchange.com/questions/4191 , http://codegolf.stackexchange.com/questions/4356 and a few components of other questions.

– Peter Taylor – 2012-10-30T07:54:53.897

Answers

3

GolfScript, 111 108 characters

1/{.}{:^1<{^1$/,2>.{;,)^<.0?)!}*}do-1<.,1>{^1$/[10):10]*0+\+}{;^}if(\}while][0]%.,,]zip{))9+`"="+\~" "*+}%n*

This is a quite clumsy approach using GolfScript. The second version performs much better than the initial one. It is much longer than the intended code but my implementation had nested do-loops and this caused issues with the interpreter.

Examples:

> abcba
10=a b c b a

> abcabcbc
10=11 11 12
11=a 12
12=b c

> abcabcbcbcabcacacabcabab
10=11 12 12 13 14 14 c 11 15
11=15 13
12=c b
13=14 b
14=c a
15=a b

Howard

Posted 2012-10-28T21:00:35.447

Reputation: 23 109

1

Retracted - algorithm can't handle all cases. C, 422 (fixed to remove dups in output, and dropped character)

Initial implementation, will begin golfing it.

Since the rules don't require it to actually compress this brute force naïve approach will do...

Can process 65535 length within 10 seconds

n,m[99999];
c,r[99999][2];

g,i,s,t;

main(){
    for(;(m[n]=getchar())>32;n++);

    while(!g){ // loop until no further changes
        g=1;
        for(s=0;s<n-1;s++) {
            for(t=s+2;t<n-1;t++)if(m[s]==m[t]&&m[s+1]==m[t+1]){
                // create rule
                r[c][0]=m[s];
                r[c++][1]=m[s+1];
                g=0;
                // substitute
                for(i=t=s;i<n;i++){
                    if(m[i]==r[c-1][0]&&m[i+1]==r[c-1][1]){
                        m[t++]=-c;
                        i++;
                    }else
                        m[t++]=m[i];
                }
                n=t;
            }
        }
    }

    for(s=-1;s<c;s++){
        printf("%d=",s+11);
        for(t=0;t<(s<0?n:2);t++){
            i=(s<0?m:r[s])[t];
            i<0?printf("%d ",10-i):printf("%c ",i);
        }
        printf("\n");
    }

}

Sample run:

echo abcabcbcbcabcacacabcabab | a.out
10=11 12 13 13 12 14 14 12 12 11 
11=a b 
12=c 11 
13=c b 
14=c a

baby-rabbit

Posted 2012-10-28T21:00:35.447

Reputation: 1 623

Your code doe not work according to the specification. It generates output that violates the rule no sequence of two characters may appear twice; consider the input abcdabcd. Also, your code apparently removes the last byte from the input stream. Look here for an example to observe both effects: http://ideone.com/3Xvtyv

– FUZxxl – 2012-10-29T21:55:24.180

Also, your sample output is apparently wrong. – FUZxxl – 2012-10-29T21:57:24.240

You're right - I failed - I'll look at it when I get back from work :P – baby-rabbit – 2012-10-29T22:01:48.840

It's not removing the last byte from the input for me - and my sample output is correct (for me).. Let's play "spot the bug"! – baby-rabbit – 2012-10-29T22:03:02.987

The sample output you posted definitly is. The expanded form of rule 10 ends with rule 14 which in turn ends with "ca". The last c is actually 5 positions before the end. – FUZxxl – 2012-10-29T22:12:10.543

both bugs fixed – baby-rabbit – 2012-10-29T22:24:44.113

Unfortunately, it doesn't work for input abcabc. Yields 10=12 12; 11=a b; 12=11 c and thus violates Each nonterminal must appear at least twice on the right side of a rule.. – Howard – 2012-10-30T07:52:38.537

ah! Well spotted, so my thinking is wrong, this can't be solved by making all subsequent rules lengths 2, e.g. for "abcabc" the only valid result (for the competition rules) is "10=11 11, 11=a b c" – baby-rabbit – 2012-10-30T08:20:40.787