Lossless English language text compression challenge

12

7

Challenge:

Your challenge (should you choose to accept it) is to compress and decompress the 5MB "Complete Works of William Shakespeare" as found here: http://www.gutenberg.org/cache/epub/100/pg100.txt

(MD5: a810f89e9f8e213aebd06b9f8c5157d8)

Rules:

  • You must take input via STDIN and output via STDOUT...
  • ...and you must provide an identical decompressed result to the input.
    • (This is to say you must be able to cat inpt.txt | ./cmprss | ./dcmpress | md5 and get the same MD5 as above.)
    • (Anything via STDERR is to be discarded.)
  • You must use less than 2048 characters for your total source code.
    • (This is not code-golf. You are not being scored based on length of source-code. This is was just a rule to keep things finite.)
    • (Take the concatenated length of all source code if you have split it out.)
  • You must be able to (theoretically) process similar plain-text inputs too.
    • (e.g. hard coding a mechanism which is only capable of outputting the provided Shakespeare input is unacceptable.)
    • (The compressed size of other documents is irrelevant - provided the decompressed result is identical to the alternative input.)
  • You may use any choice of language(s).
    • (e.g. feel free to compress using awk and decompress using java)
  • You may write two separate programs or combine them with some form of "switch" as you please.
    • (There must be clear demonstrations of how to invoke both the compression and decompression modes)
  • You may not use any external commands (e.g. through exec()).
    • (If you are using a shell language - sorry. You'll have to make do with built-ins. You're welcome to post an "unacceptable" answer for the sake of sharing and enjoyment - but it won't be judged!)
  • You may not use any built-in or library provided functions who's stated purpose is to compress data (like gz, etc)
    • (Altering the encoding is not considered compression in this context. Some discretion may be applied here. Feel free to argue the acceptableness of your solution in the submission.)
  • Please try to have fun if choose to participate!

All good competitions have an objective definition of winning; ergo:

  • Provided all rules are adhered to, the smallest compressed output (in STDOUT bytes) wins.
    • (Report your output please via ./cmprss | wc -c)
  • In the event of a draw (identical output sizes), the most community up-voted wins.
  • In the event of a second draw (identical community up-votes), I'll pick a winner based on completely subjective examination of elegance and pure genius. ;-)

How to submit:

Please format your entry using this template:

<language>, <compressed_size>
-----------------------------

<description>  (Detail is encouraged!)

    <CODE...
    ...>

<run instructions>

I would encourage readers and submitters to converse through comments - I believe there's real opportunity for people to learn and become better programmers through codegolf.stack.

Winning:

I'm away on vacation soon: I may (or may not) be monitoring submissions over the next few weeks and will draw the challenge to a close on the 19th September. I hope this offers a good opportunity for people to think and submit - and for a positive sharing of techniques and ideas.

If you've learned something new from participating (as a reader or submitter) please leave a comment of encouragement.

wally

Posted 2015-08-17T21:07:03.477

Reputation: 263

Question was closed 2015-08-18T20:34:03.270

1You should tag this code-challenge. – kirbyfan64sos – 2015-08-17T21:08:38.073

Thank-you @kirbyfan64sos! First time submitting. Appreciate the (extremely prompt) feedback. – wally – 2015-08-17T21:09:55.850

1Is taking the input as an function argument allowed? E.g. A solution in languages such as JavaScript couldn't be run from the command line, AFAIK. In my case, it would simply be much easier to run in the browser. – ETHproductions – 2015-08-17T22:04:45.177

We could change the rules for that. I think for JS specifically it could be done in node though? Thoughts? – wally – 2015-08-17T22:24:42.580

Can we remove the disclaimer part of the txt? – Roberto Anić Banić – 2015-08-17T23:06:11.633

1Why the template? Are you going to create a stack snippet which depends on it? – Peter Taylor – 2015-08-18T06:17:27.113

@PeterTaylor - programmer's habit I guess! Thought it would keep things tidier - and allow an easy statistic report if desired. – wally – 2015-08-18T07:08:41.937

@RobertoNicbaAnićBanić - I thought about this myself. Given keeping it in makes the challenge slightly harder (increases dictionary sizes etc) and given the license detail etc, I figured there was no harm keeping it. Makes the challenge simpler to describe too (just take the raw .txt and have fun). – wally – 2015-08-18T07:11:35.707

1@ETHproductions nodejs has a readline plugin... – Daniel Alder – 2015-08-18T07:20:08.833

Could you remove the 2048 bytes limit? It might be too low for really verbose or joke languages if someone wants to answer with them and do something competitive – Fatalize – 2015-08-18T07:52:49.140

@Fatalize - I'm more than happy to - should we keep some sort of limit to keep things sane? What would you propose? – wally – 2015-08-18T08:04:44.353

Well I personally don't see the benefit of having any limit at all for this challenge – Fatalize – 2015-08-18T08:12:42.163

@Fatalize - I did some playing, and it seems I was being hugely optimistic to have suggestions in even a C-like language in <2k chars. Along with your recommendation, I've dropped the rule. Thanks! – wally – 2015-08-18T08:29:42.043

1@wally while 2k might be optimisitc... I find having no limit is actually worse for three reasons: 1. makes it difficult to determine know when to stop, 2. there is too much possibility, so hard to start, 3. necessity is the mother of invention... so without a limit, invention suffers. I would suggest perhaps changing the scoring to be something like (bytes saved)/(size of source code) so that there is benefit for more efficient code – Moogie – 2015-08-18T10:03:59.297

@Moogie - that's where I started when I posted yesterday. The loop I infinitely traverse is "code-golf makes the code hard to read and there's already many such puzzles posted" (voteing against any code length limit/scoring) but equally (as you say) "no code length limit makes the task extremely broad". I also don't want to unfairly penalize naturally-verbose languages. Arguably for compression the speed and memory requirements are most important in "real world scenarios". Maybe I should have scored on those too. Perhaps someone would care to follow this challenge with a better follow-up? – wally – 2015-08-18T10:45:13.620

@wally - i understand. I will try to participate with the original 2048 characters just to constrain myself :P – Moogie – 2015-08-18T11:36:35.433

2If there is no code size limit, what stops me from writing a compress program that prints 0 bytes, and a decompress program that's hard-coded to print the entire works of Shakespeare? – Lynn – 2015-08-18T14:47:10.387

4A rule could be added that says that the code should theoretically work with other inputs, which solves the problem @Mauris pointed out. – kirbyfan64sos – 2015-08-18T16:48:52.780

@kirbyfan64sos - done! Thank you for the help. – wally – 2015-08-18T20:03:14.803

1

That doesn't fix the problem at all. The only sensible way to handle it is to score on length of code + length of compressed text, but then it's a duplicate of this question.

– Peter Taylor – 2015-08-18T20:34:31.920

Answers

5

Perl 5, 3651284

Just a simple word-based dictionary scheme. Analyzes the word frequency of the corpus, and uses that to determine whether to use one or two bytes of overhead per word. Uses two special symbols for the bytes \0 and \1 since they do not appear in the corpus. There are plenty of other symbols that could be used. This was not done. Does not do any huffman encoding or any of that jazz.

Compression script shakespeare.pl:

use strict;
use warnings;
use bytes;

my $text = join "", <>;
my @words = split/([^a-zA-Z0-9]+)/, $text;


my %charfreq;
for( my $i = 0; $i<length($text); ++$i ) {
    $charfreq{ substr($text, $i, 1) }++
}
for my $i ( 0..255 ) {
    my $c = chr($i);
    my $cnt = $charfreq{$c} // 0;
}



my %word_freq;
foreach my $word ( @words ) {
    $word_freq{ $word }++;
}


my $cnt = 0;
my ( @dict, %rdict );
foreach my $word ( sort { $word_freq{$b} <=> $word_freq{$a} || $b cmp $a } keys %word_freq ) {
    last if $word_freq{ $word } == 1; 


    my $repl_length = $cnt < 127 ? 2 : 3;
    if( length( $word ) > $repl_length ) {
        push @dict, $word;
        $rdict{ $word } = $cnt;
        $cnt++;
    }
}


foreach my $index ( 0..$
    print "$dict[$index]\0";
}
print "\1";


foreach my $word ( @words ) {
    my $index = $rdict{ $word };
    if ( defined $index && $index <= 127 ) {
        print "\0" . chr( $index );
    } elsif ( defined $index ) {
        my $byte1 = $index & 127;
        my $byte2 = $index >> 7;
        print "\1" . chr( $byte2 ) . chr( $byte1 );
    } else {
        print $word;
    }
}

Decompression script deshakespeare.pl:

use strict;
use warnings;
use bytes;

local $/;
my $compressed = <>;
my $text = $compressed;
$text =~ s/^.+?\x{1}//ms;
my $dictionary = $compressed;
$dictionary =~ s/\x{1}.*$//ms;


my $cnt = 0;
my @dict;
foreach my $word ( split "\0", $dictionary ) {

    push @dict, $word;
}


my @words = split /(\x{0}.|\x{1}..)/ms, $text;
foreach my $word ( @words ) {
    if( $word =~ /^\x{0}(.)/ms ) {
        print $dict[ ord( $1 ) ];
    } elsif( $word =~ /^\x{1}(.)(.)/ms ) {
        my $byte1 = ord( $1 );
        my $byte2 = ord( $2 );
        my $index = ( $byte1 << 7 ) + $byte2;
        print $dict[ $index ];
    } else {
        print $word;
    }
}

Run using:

perl shakespeare.pl < pg100.txt >pg100.txt.compressed
perl deshakespeare.pl <pg100.txt.compressed >pg100.txt.restored
diff pg100.txt pg100.txt.restored

skibrianski

Posted 2015-08-17T21:07:03.477

Reputation: 1 197