Text compression and decompression — "Nevermore."

38

13

With the recent discussion about the use of compression tools in code golf, I thought it would be a nice challenge to write your own text compressor and decompressor.

Challenge:

Write two programs: one to compress ASCII text into a sequence of bytes, and another to decompress it. The programs need not be in the same language.

The first program should read a piece of ASCII text (from a file or from standard input, or using whatever mechanism is most natural to the language) and output a compressed version of it. (The compressed output may consist or arbitrary bytes; it need not be readable.) The second program should read the output of the first and recreate the original input text.

Scoring:

The score of a solution will be the sum of the following three counts:

  1. The length of the compressor program in characters.
  2. The length of the output of the compressor, given the test input below, in bytes.
  3. The length of the decompressor program (if different from the compressor) in characters.

You should note all three counts and their sum in your answer. Since this is code golf, the lower the score, the better.

Rules and restrictions:

  • You may not use any pre-existing compression or decompression tools or libraries, even if they come bundled with your chosen language. If in doubt about whether a given tool or function is allowed, please ask.

  • Your compressor program must be capable of handling input consisting of any printable ASCII text, including tabs (ASCII 9) and line feeds (ASCII 10). You may, but are not required to, handle arbitrary Unicode and/or binary input.

  • Your decompressor program must produce exactly the same output as was given to the compressor as input. In particular, take care not to output a trailing line feed if the input did not have one. (The test input below does have a trailing line feed, so you'll need to test for this separately. Tip for GolfScript: '':n.)

  • Your compressor and decompressor may be the same program (with the appropriate mode selected e.g. with a command line switch). In that case, its length is only counted once.

  • The programs should not be exceedingly slow or memory-hungry. If either compressing or decompressing the test input takes more than a minute on my not-so-new desktop (2.2GHz AMD Athlon64 X2) or consumes more than a gigabyte of RAM, I'm going to rule the solution invalid. These limits are deliberately lax — please try not to push them. (See amendment below: you need to be able to handle at least 100 kB of input within these limits.)

  • Even though only the test input matters for scoring, you should at least make an effort at compressing arbitrary input text. A solution that achieves a decent compression ratio only for the test input, and for nothing else, is technically valid but isn't going to get an upvote from me.

  • Your compressor and decompressor programs should be self-contained. In particular, if they depend on being able to read some file or network resource that is not part of your chosen language's standard runtime environment, the length of that file or resource should be counted as part of the length of the program(s). (This is to disallow "compressors" that compare the input to a file on the web and output zero bytes if they match. Sorry, but that's not a new trick anymore.)

Amendments and clarifications:

  • Your compressor must be able to handle files consisting of at least 100 kB of typical English text within reasonable time and memory usage (at most one minute and one GB of memory). Your decompressor must be able to decompress the resulting output within the same limits. Of course, being able to handle files longer than that is perfectly fine and commendable. It's OK to split long input files into chunks and compress them individually, or to use other means to trade off compression efficiency for speed for long inputs.

  • Your compressor may require its input to be given using your preferred platform's native newline representation (LF, CR+LF, CR, etc.), as long as your decompressor uses the same newline representation in its output. Of course, it's also fine for the compressor to accept any kind of newlines (or even only Unix newlines regardless of platform), as long as your decompressor then outputs the same kind of newlines as in the original input.

Test input:

To judge the compression efficiency of the answers, the following test input (The Raven by Edgar Allan Poe, courtesy of Project Gutenberg) will be used:

Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
"'T is some visiter," I muttered, "tapping at my chamber door--
                                          Only this, and nothing more."

Ah, distinctly I remember it was in the bleak December,
And each separate dying ember wrought its ghost upon the floor.
Eagerly I wished the morrow:--vainly I had sought to borrow
From my books surcease of sorrow--sorrow for the lost Lenore--
For the rare and radiant maiden whom the angels name Lenore--
                                          Nameless here for evermore.

And the silken sad uncertain rustling of each purple curtain
Thrilled me--filled me with fantastic terrors never felt before;
So that now, to still the beating of my heart, I stood repeating
"'T is some visiter entreating entrance at my chamber door
Some late visiter entreating entrance at my chamber door;--
                                          This it is, and nothing more."

Presently my soul grew stronger; hesitating then no longer,
"Sir," said I, "or Madam, truly your forgiveness I implore;
But the fact is I was napping, and so gently you came rapping,
And so faintly you came tapping, tapping at my chamber door,
That I scarce was sure I heard you"--here I opened wide the door;--
                                          Darkness there, and nothing more.

Deep into that darkness peering, long I stood there wondering, fearing,
Doubting, dreaming dreams no mortal ever dared to dream before;
But the silence was unbroken, and the darkness gave no token,
And the only word there spoken was the whispered word, "Lenore!"
This I whispered, and an echo murmured back the word, "Lenore!"
                                          Merely this and nothing more.

Back into the chamber turning, all my soul within me burning,
Soon again I heard a tapping, somewhat louder than before.
"Surely," said I, "surely that is something at my window lattice;
Let me see, then, what thereat is, and this mystery explore--
Let my heart be still a moment and this mystery explore;--
                                          'T is the wind and nothing more!"

Open here I flung the shutter, when, with many a flirt and flutter,
In there stepped a stately Raven of the saintly days of yore.
Not the least obeisance made he; not a minute stopped or stayed he;
But, with mien of lord or lady, perched above my chamber door--
Perched upon a bust of Pallas just above my chamber door--
                                          Perched, and sat, and nothing more.

Then this ebony bird beguiling my sad fancy into smiling,
By the grave and stern decorum of the countenance it wore,
"Though thy crest be shorn and shaven, thou," I said, "art sure no craven,
Ghastly grim and ancient Raven wandering from the Nightly shore,--
Tell me what thy lordly name is on the Night's Plutonian shore!"
                                          Quoth the Raven, "Nevermore."

Much I marvelled this ungainly fowl to hear discourse so plainly,
Though its answer little meaning--little relevancy bore;
For we cannot help agreeing that no living human being
Ever yet was blessed with seeing bird above his chamber door--
Bird or beast upon the sculptured bust above his chamber door,
                                          With such name as "Nevermore."

But the Raven, sitting lonely on the placid bust, spoke only
That one word, as if his soul in that one word he did outpour.
Nothing further then he uttered--not a feather then he fluttered--
Till I scarcely more than muttered, "Other friends have flown before--
On the morrow _he_ will leave me, as my hopes have flown before."
                                          Then the bird said, "Nevermore."

Startled at the stillness broken by reply so aptly spoken,
"Doubtless," said I, "what it utters is its only stock and store,
Caught from some unhappy master whom unmerciful Disaster
Followed fast and followed faster till his songs one burden bore--
Till the dirges of his Hope that melancholy burden bore
                                          Of 'Never--nevermore.'"

But the Raven still beguiling all my sad soul into smiling,
Straight I wheeled a cushioned seat in front of bird and bust and door;
Then, upon the velvet sinking, I betook myself to linking
Fancy unto fancy, thinking what this ominous bird of yore--
What this grim, ungainly, ghastly, gaunt and ominous bird of yore
                                          Meant in croaking "Nevermore."

This I sat engaged in guessing, but no syllable expressing
To the fowl whose fiery eyes now burned into my bosom's core;
This and more I sat divining, with my head at ease reclining
On the cushion's velvet lining that the lamplight gloated o'er,
But whose velvet violet lining with the lamplight gloating o'er
                                          _She_ shall press, ah, nevermore!

Then, methought, the air grew denser, perfumed from an unseen censer
Swung by seraphim whose foot-falls tinkled on the tufted floor.
"Wretch," I cried, "thy God hath lent thee--by these angels he hath sent thee
Respite--respite and nepenthe from thy memories of Lenore!
Quaff, oh quaff this kind nepenthe, and forget this lost Lenore!"
                                          Quoth the Raven, "Nevermore."

"Prophet!" said I, "thing of evil!--prophet still, if bird or devil!--
Whether Tempter sent, or whether tempest tossed thee here ashore,
Desolate yet all undaunted, on this desert land enchanted--
On this home by Horror haunted--tell me truly, I implore--
Is there--_is_ there balm in Gilead?--tell me--tell me, I implore!"
                                          Quoth the Raven, "Nevermore."

"Prophet!" said I, "thing of evil--prophet still, if bird or devil!
By that Heaven that bends above, us--by that God we both adore--
Tell this soul with sorrow laden if, within the distant Aidenn,
It shall clasp a sainted maiden whom the angels name Lenore--
Clasp a rare and radiant maiden whom the angels name Lenore."
                                          Quoth the Raven, "Nevermore."

"Be that word our sign of parting, bird or fiend!" I shrieked, upstarting--
"Get thee back into the tempest and the Night's Plutonian shore!
Leave no black plume as a token of that lie thy soul hath spoken!
Leave my loneliness unbroken!--quit the bust above my door!
Take thy beak from out my heart, and take thy form from off my door!"
                                          Quoth the Raven, "Nevermore."

And the Raven, never flitting, still is sitting, still is sitting
On the pallid bust of Pallas just above my chamber door;
And his eyes have all the seeming of a demon's that is dreaming,
And the lamplight o'er him streaming throws his shadow on the floor;
And my soul from out that shadow that lies floating on the floor
                                          Shall be lifted--nevermore!

The correct test input (encoded with Unix-style LF newlines) should be 7043 bytes long, and have the hexadecimal MD5 hash 286206abbb7eca7b1ab69ea4b81da227. (md5sum -t should produce the same hash value even if you use CR+LF newlines on DOS/Windows.) The output of your decompressor should have the same length and hash.

Ps. Keep in mind that this challenge is only as hard as you make it. Really, anything under 7043 counts as a good score. (At the other end of the scale, I'll be extremely impressed if anyone achieves a score under 2500.)

Ilmari Karonen

Posted 2012-01-26T20:59:11.013

Reputation: 19 513

So I take it you don't want to see any lossy compression? – Mr. Llama – 2012-01-27T17:27:56.937

2Preemptive note for people who can't get the MD5 hash to match: the text file has Unix newlines for line enders. Also, be sure you have the final newline in the file for the full 7043 byte length. – Mr. Llama – 2012-01-27T17:31:52.797

@GigaWatt: Yeah, I should've been more explicit about the newlines. Since I've restricted the input to ASCII text only, I guess I could allow people to use whatever newline convention feels most natural for them, as long as they use it consistently. I'll try to think of a nice way to phrase that in the challenge. And no, the compressor should not be lossy. – Ilmari Karonen – 2012-01-27T18:46:07.510

How about file length, is it required to run (in acceptable time) only for files in the order of size of the example, or also for much bigger files (>some MB)? – ceased to turn counterclockwis – 2012-01-27T19:33:26.170

@leftaroundabout: I'll... actually have to think about that a bit more. My first impulse was to say that only the test input matters, but that's not really where I want to go with this challenge. I think I'll set some reasonable minimum input length that you must be able to handle, maybe something like 100 kB or 1 MB. Of course, you could always split long input files into chunks of manageable size and compress them separately; that's what many general purpose compressors do. – Ilmari Karonen – 2012-01-27T19:59:51.893

1If the output is given as a program in the same language as the compressor, can we count the length of the decompressor as zero? – Peter Taylor – 2013-07-03T08:19:30.040

@PeterTaylor: That seems reasonable enough. – Ilmari Karonen – 2013-07-03T08:46:51.153

Answers

19

Perl, 3502 = 133 + 3269 + 100

The encoder:

#!/usr/bin/perl -0
$_=<>;for$e(map{~chr}0..255){++$p{$_}for/..|.\G./gs;
%p=$s=(sort{$p{$a}<=>$p{$b}}keys%p)[-1];$d.=/\Q$e/?$/:s/\Q$s/$e/g&&$s}print$_,$d

And the decoder:

#!/usr/bin/perl -0777
sub d{($p=$d{$_})?d(@$p):print for@_}
sub r{%d=map{chr,ord($c=pop)&&[pop,$c]}0..255;&d}r<>=~/./gs

For purists who prefer to avoid using command-line switches: You can remove the shebang line and add $/=chr; to the encoder and $/=$,; to the decoder to get the same effect. (This would bring the score up to 3510.)

This code uses a very primitive compression scheme:

  • Find the two-char bigram that appears most frequently in the source text.
  • Replace the bigram with a currently-unused byte value.
  • Repeat until there are no more repeated bigrams (or no more unused byte values).

Someone out there may recognize this as a simplified version of "re-pair" compression (short for recursive pairs).

It's not a very good general compression scheme. It only does well with things like ASCII text, where there are a lot of unused byte values, and even then it typically gets no more than a 45-50% ratio. However, it has the advantage of being implementable with a minimum of code. The decompressor in particular can be quite compact. (Most of the chars in my decoder script are for retrieving the bigram dictionary.)

Here's an ungolfed version of the code:

#!/usr/bin/perl
use strict;
use warnings;
# Run with -d to decode.
if ($ARGV[0] eq "-d") {
    shift;
    $_ = join "", <>;
    my @in = split //;
    my %dict;
    foreach my $n (0 .. 255) {
        my $c = shift @in;
        $dict{chr $n} = [ $c, shift @in ] if ord $c;
    }
    sub decode {
        foreach (@_) {
            if ($dict{$_}) {
                decode(@{$dict{$_}});
            } else {
                print $_;
            }
        }
    }
    decode @in;
} else {
    $_ = join "", <>;
    my @dict;
    for (my $n = 255 ; $n >= 0 ; --$n) {
        my $symbol = chr $n;
        if (!/\Q$symbol/) {
            my %pop;
            ++$pop{$_} for /../gs, /(?!^)../gs;
            my $str = (sort { $pop{$b} <=> $pop{$a} } keys %pop)[0];
            s/\Q$str/$symbol/g;
            $dict[$n] = $str;
        }
    }
    for (0..255) { $dict[$_] ||= "\0" }
    print @dict, $_;
}

One expression in the golfed encoder requires explanation, I think, and that is (sort{$p{$a}<=>$p{$b}}keys%p)[-1], to get the key with the highest value. That looks like it should be written as (sort{$p{$b}<=>$p{$a}}keys%p)[0], which does the same thing and is one character shorter. The reason I didn't write it that way is that it alters the selected key in the case when there are multiple keys with the highest value. By sheer chance, this caused the resulting output for the test input to be 10 bytes longer. I hated to take on the useless extra character, but not enough to sacrifice 9 points from my score.

In your face, Golfscript! (Haha, Golfscript would totally come over here and kick my ass if it could hear me.)

breadbox

Posted 2012-01-26T20:59:11.013

Reputation: 6 893

3

Wow, that's pretty impressive! Ps. This seems to be the generally accepted answer regarding the counting of command line switches.

– Ilmari Karonen – 2012-05-12T11:22:43.760

Dang, I read that earlier but I failed to notice that bit in the middle. It sounds like the upshot is that: you don't count the initial hyphen character (because you can just add it to the -e option bundle), unless your code contains a single-quote character, in which case you do count the hyphen (because now you have to run it from a file with a shebang line to avoid paying for escaping the single-quote on the command-line). – breadbox – 2012-05-12T18:26:03.850

1

Technique is also called Byte pair encoding. Nice implementation

– roblogic – 2019-09-02T04:37:17.343

@roblogic Thanks for the reference; that's good to know. – breadbox – 2019-09-02T19:32:14.920

20

Python, 3514 = 294 + 2894 + 326

Basically a bzip2 implementation. It does a Burrows-Wheeler transform, a move-to-front transform, a simple Huffman encoding into a bit stream, converts that bit stream to an integer and writes out bytes.

Encoder:

import sys
S=range(128)
H={0:'0'}
for b in range(7):
 for i in range(1<<b,2<<b):H[i]='1'*b+'10'+bin(i)[3:]
I=sys.stdin.read()+'\0'
N='1'
for x in sorted(I[i:]+I[:i]for i in range(len(I))):i=S.index(ord(x[-1]));N+=H[i];S=[S[i]]+S[:i]+S[i+1:]
N=int(N,2)
while N:sys.stdout.write(chr(N%256));N>>=8

S is the move-to-front queue, H is the Huffman encoder, and N is the bitstream.

The encoding reduces the test input to about 41% of its original size.

Decoder:

import sys
N=0
b=1
for c in sys.stdin.read():N+=ord(c)*b;b<<=8
N=bin(N)[3:]
S=range(128)
L=''
while N:
 n=N.find('0')
 if n:i=2**n/2+int('0'+N[n+1:2*n],2);N=N[2*n:]
 else:i=0;N=N[1:]
 L+=chr(S[i]);S=[S[i]]+S[:i]+S[i+1:]
S=''
i=L.find('\0')
for j in L:S=L[i]+S;i=L[:i].count(L[i])+sum(c<L[i]for c in L)
sys.stdout.write(S[:-1])

Keith Randall

Posted 2012-01-26T20:59:11.013

Reputation: 19 865

1I was tempted to implement the BWT and do a true form of compression but got too lazy. :P – Mr. Llama – 2012-01-27T22:25:44.237

8

8086 Assembler / MS_DOS

Compressor: 155

jNiAxBCO2I7AM/+9/QW5AAGK2TPAq4rDqv7D4va6AQkz9lK0BrL/zSFadDK7
/f+DwwM733QNOTd19ThHAnXwid7r34k1iEUC6BMAtACKRQJr8AODxwPryrQC
zSHrxFIz0ovGuwMA9/Nai9iKztPL0ePQ0nMWgPr+cgtSsv60Bs0hWoDq/rQG
zSGyAf7JdeA5/XUHA+2DxQP+xsM=

Data: 3506

Decompressor: 203

ieWD7CCM2IDEEI7YjsAz/7kAAYrZM8CrisOq/sPi9rYJxkb0Abn9BehtAIl2
/uhTAOhkAIl28Dv3cy3oRgCLRv6JBYt28Il2/oM8AHQEizTr94pEAohFAoPH
AznPddL+xgPJg8ED68mLdv6JNYM8AHQEizTr94pEAohFAol+/on+aFgBgzwA
dAdWizTo9f9etAaKVALNIcMz9ojz/k70dRu0BrL/zSF0IDz+cgi0BrL/zSEE
/sZG9AiIRvLQZvLR1v7Ldddr9gPDzSA=

Total: 3864

Use this Base64 decoder and save the binary files as 'compress.com' and 'decompress.com' and then do:

compress < source > compressed_file
decompress < compressed_file > copy_of_source

in a DOS shell (tested with WinXP). There's no error checking so compressing big files will create incorrect results. A few small additions and it could cope with any sized file. Also, it can't decompress to binary as it can't output a 0xff value (the compressed data escapes the 0xff value as 0xfe 0xff with 0xfe escaped as 0xfe 0xfe). Using command line filenames would overcome the binary output problem, but would be a bigger executable.

Skizz

Posted 2012-01-26T20:59:11.013

Reputation: 2 225

What kind of compression algorithms does the program use? – Sir_Lagsalot – 2012-02-02T14:50:35.680

@Sir_Lagsalot: It uses a variable bit width LZW (the one used in GIF files). – Skizz – 2012-02-02T15:05:49.587

6

Bash Poem (566+117) + 4687 = 5370

For fun I have disguised a compressor as a poem:

for I in my chamber nodded, nearly napping, suddenly heard rapping, tapping upon my door    \
"'T is some visiter" \ I\  muttered, o\'er lamplight "nothing more" \
just this sainted maiden whom the angels name Lenore    \
And "Prophet!" said me "thing of evil" -- "prophet still, if bird or devil!"    \
Leave no token of that lie thy soul hath spoken and sitting take thy ore from This floor    \
But you velvet bird from some shore above   \
here this with sad raven before his word still spoke nothing    \
"                                          " Quoth the Raven Never more;                    do C=$[C+1];E=`perl -e "print chr($C+128)"`;echo "s/$I/$E/g">>c;echo "s/$E/$I/g">>d;done;LANG=C sed -f $1;rm c d

This is a unified compressor: run with the option "c" it will compress, and with "d" it will decompress. It has two parts: a 566 byte "readers digest" version of the poem and (2) a 117 byte suffix where all the "real" bash is done.

With some care (e.g. starting the poem with "for I in") bash will interpret the "lossy" version of poem as an array. It replaces each element of the array with a non-ASCII character (we assume the input is ASCII so there are no collisions). One minor advantage of this solution: since we make use of the fact that we can assume the input is ASCII, the output of this compression will never be longer than its input, regardless of what the input and/or lossy part is.

The rule that this comes closest to violating is the rule about providing a decent compression ratio on other texts. However, It shaves 1386 bytes off the GPL V2 text, well over it's own size, which seems to match the OPs definition of decent. Thus it seems to provide so-called decent compression on general texts. This is because pretty much any English text will have "the" "that" etc. Clearly it will work better if you replace the "lossy" part with text resembling the original you want to losslessly compress.

Splitting pictures and audio into lossy and non-lossy parts is a known technique. This doesn't work as well for text: 4687 bytes isn't that great even if we exclude the 566 bytes from the lossy version and we can't really automatically generate a lossy version of text the same we can for audio. On the plus side this means every time you compress something with this compressor, you can have the fun of creating a lossy version by hand. So this seems like a reasonable "for fun" solution.

gmatht

Posted 2012-01-26T20:59:11.013

Reputation: 676

5

C++, 4134 bytes (code = 1357, compressed = 2777)

This does a Burrows-Wheeler transform + a Move-To-Front like Keith Randall's, but then compresses the resulting byte sequence using an adaptive Range Coder. Unfortunately, the improved compression from the range coder isn't enough to offset C++'s verbosity. I could golf this code some more, namely use a different input/output method, but it wouldn't be enough to beat the other submissions with the current algorithm. The code is Windows specific, and only ascii text is supported.
To compress: "C text_file compressed_file"
To decompress: "D compressed_file uncompressed_file"
Pretty much any command line error or file error will crash the program, and it takes the better part of a minute to encode or decode the poem.

#include <windows.h>
#include <algorithm>
typedef DWORD I;typedef BYTE u;
#define W while
#define A(x)for(a=0;a<x;a++)
#define P(x)*o++=x;
I q,T=1<<31,B=T>>8,a,l,f[257],b,G=127,p=G,N=255;I Y(u*i,u*j){return
memcmp(i,j,l)<0;}I E(u*i,u*o){b=0;I L=0,h=0,R=T;u*c=o,*e=i+l;W(i<e){I
r=R/p,s=0;A(*i)s+=f[a];s*=r;L+=s;R=*i<N?r*f[*i++]++:R-s;p++;W(R<=B){if((L>>23)<N){for(;h;h--)P(N)P(L>>23)}else{if(L&T){o[-1]++;for(;h;h--)P(0)P(L>>23)}else
h++;}R<<=8;L<<=8;L&=T-1;}}P(L>>23)P(L>>15)P(L>>7)return
o-c;}void D(u*i,u*o){I R=128,L=*i>>1;u*e=o+l;W(o<e){W(R<=B){L<<=8;L|=((*i<<7)|(i++[1]>>1))&N;R<<=8;}I
h=R/p,m=L/h,x=0,v=0;W(v<=m)v+=f[x++];P(--x);L-=h*(v-f[x]);R=h*f[x]++;p++;}}void
main(I Z,char**v){u d[1<<16];I c=*v[1]<68,s;HANDLE F=CreateFileA(v[2],T,0,0,3,0,0),o=CreateFileA(v[3],T/2,0,0,2,0,0);ReadFile(F,d,GetFileSize(F,0),&l,0);l=c?l:*(I*)d;A(G)f[a]=1;u M[256];A(G)M[a]=a+1;u*g=new u[l*3],*h=g+l;if(c){memcpy(d+l,d,l);u**R=new
u*[l];A(l)R[a]=d+a;std::sort(R,R+l,Y);A(l){b=R[a][l-1];I
i=strchr((char*)M,b)-(char*)M;memmove(M+1,M,i);*M=g[a]=b;h[a]=i;}s=E(h,d+l+8);}else{D(d+8,g);A(l){I
k=g[a];g[a]=M[k];memmove(M+1,M,k);*M=g[a];}}u**j=new u*[l];A(l)j[a]=new
u[l*2],memset(j[a],0,l*2),j[a]+=l;A(l){for(b=0;b<l;)*--j[b]=g[b++];std::sort(j,j+l,Y);}if(c){A(l){if(!memcmp(j[a],d,l)){I*t=(I*)(d+l);*t=l;t[1]=a;g=d+l,l=s+8;}}}else
g=j[*(I*)(d+4)];WriteFile(o,g,l,&q,0);}

Sir_Lagsalot

Posted 2012-01-26T20:59:11.013

Reputation: 4 898

5

JavaScript, 393 (code) + 3521 (test) = 3914 (total)

This program iteratively substitutes unused byte values for 2- to 4- character chunks of the input. Each substitution is scored based on the original chunk's frequency and length, and the best substitution is chosen each time around. I would add a final Huffman coding stage if I could figure out how to do it in a relatively small number of characters. Decompression is essentially a series of find-and-replace operations.

Usage

C() provides compression; U() provides decompression. As JavaScript's strings are based on 16-bit Unicode code units, only the least significant 8 bits of each code unit are used in the compressed data format; this is compatible with Firefox's btoa() and atob() functions for Base64 encoding. (usage example)

This program may only work in Firefox because of a non-standard "g" option to .replace().

Code

Golfed code:

S=String.fromCharCode;function C(c){h=[];for(f=0;129>f;++f){g='';i=0;for(e=2;5>e;++e){d={};for(a=0;a<=c.length-e;a+=e)b="K"+c.substr(a,e),d[b]=d[b]?d[b]+1:1;for(b in d)a=d[b],a=a*e-(1+e+a),a>i&&(g=b.slice(1),i=a)}if(!g)break;h[f]=g;c=c.replace(g,S(127+f),"g")}return h.join("\1")+"\1"+c}function U(a){c=a.split("\1");a=c.pop();for(b=c.length,d=127+b;b--;)a=a.replace(S(--d),c[b],"g");return a}

Before golfing:

function compress(str) {

    var hash, offset, match, iteration, expansions, bestMatch, bestScore, times, length, score;

    expansions = [];

    for (iteration = 0; iteration < 129; ++iteration) {

        bestMatch = null;
        bestScore = 0;

        for (length = 2; length < 5; ++length) {

            hash = {};

            for (offset = 0; offset <= str.length - length; offset += length) {
                match = 'K' + str.substr(offset, length);
                hash[match] = hash[match] ? hash[match] + 1 : 1;
            }

            for (match in hash) {
                times = hash[match];
                score = times * length - (1 + length + times);
                if (score > bestScore) {
                    bestMatch = match.slice(1);
                    bestScore = score;
                }
            }

        }

        if (!bestMatch) {
            break;
        }

        expansions[iteration] = bestMatch;
        str = str.replace(bestMatch, String.fromCharCode(127 + iteration), 'g');

    }

    return expansions.join('\u0001') + '\u0001' + str;
}

function uncompress(str) {
    var i, j, expansions;

    expansions = str.split('\u0001');
    str = expansions.pop();

    for (j = expansions.length, i = 127 + j; j--;) {
        str = str.replace(String.fromCharCode(--i), expansions[j], 'g');
    }

    return str;
}

PleaseStand

Posted 2012-01-26T20:59:11.013

Reputation: 5 369

Why do I get C(text).length=7301? (FF 60.0.2) – l4m2 – 2018-10-15T09:27:38.790

3

PHP, (347 + 6166 + 176) = 6689

So I went with a simplistic dictionary + substitution approach.

If a word appears multiple times and it's shorter to (encode the word + save the substitution entry) then it makes the replacement. If the "word" happens to be a number, it does it anyways to prevent accidental substitutions during decompression. The "dictionary" of substitutions is joined by null bytes, followed by two null bytes, followed by the body the substitution works on.

Possible improvements:

  • Windows doesn't like piping more than 4kb of data around, so find a better way than using files.
  • The ability to match long strings of whitespace and count them as "words" without adding too much code.
  • Coming up with something better substitutions instead of using numbers.

Usage: the compressor looks for a file called "i" and writes the compressed data to "o". The decompressor looks for "o" and writes the uncompressed data to "d". This is my shoddy workaround to Windows not liking to pipe boats of data around.


compress.php (347)

<?$d=file_get_contents('i');$z=chr(0);preg_match_all('|\b(\w+)\b|',$d,$m);$n=0;foreach($m[0]as$w){$l=strlen($w);$q[$w]=isset($q[$w])?$q[$w]+$l:$l;}arsort($q);foreach($q as$w=>$s){$l=strlen($w);$c=$s/$l;if($c*strlen($n)+$l<$s||is_int($w)){$d=preg_replace('|\b'.preg_quote($w).'\b|',$n++,$d);$f[]=$w;}}file_put_contents('o',implode($z,$f).$z.$z.$d);

Expanded version with comments and explanation.


Output sample without dictionary. Kinda funny to look at.
Normal size: 6166.

Ah, distinctly I remember it 45 in 0 bleak December,
25 each separate dying ember wrought its ghost 39 0 37.
Eagerly I wished 0 88:--vainly I had sought to borrow
From 9 books surcease of 43--43 for 0 lost 8--
For 0 rare 1 67 40 54 0 26 38 8--
                                          Nameless 63 for evermore.

25 0 silken sad uncertain rustling of each purple curtain
Thrilled me--filled me 19 fantastic terrors never felt 17;
So 4 now, to 13 0 beating of 9 64, I stood repeating
"'T is 57 31 36 49 at 9 2 5
Some late 31 36 49 at 9 2 5;--
                                          58 it is, 1 10 16."

decompress.php (176)

<?$z=chr(0);$d=file_get_contents('o');list($w,$d)=explode($z.$z,$d);$w=explode($z,$w);$n=0;foreach($w as$r){$d=preg_replace('|\b'.$n++.'\b|',$r,$d);};file_put_contents('d',$d);

Expanded version with explanation.


Any suggestions for improvement welcome.

Edit: Added the "unrolled" versions of the code and added tons of comments. Should be easy to follow.

Mr. Llama

Posted 2012-01-26T20:59:11.013

Reputation: 2 387

Gah! Same language and method as I was using! Dammit. Though I hadn't got as far as skipping single words. – Gareth – 2012-01-27T18:45:28.867

what happens when there are numbers within the text? it would end up replacing the original numbers with an out-of-place word. Though I took a similar approach (regex splits, finding common words to substitute and producing a replacement dictionary and gluing it on with nulls), I used unicode characters instead of numbers (starting from chr(128), since anything after that is nonprintable in standard ascii) – Blazer – 2012-01-27T19:56:51.723

@Blazer: Actually, there's code in place (namely ||is_int($w)) to handle numbers by always adding them to the dictionary, but it seems to be buggy: after compressing and decompressing the whole Gutenberg E-text, the output starts with The 4 3 EBook 2 The Raven, by Edgar Allan Poe. :-( I suspect the problem is that something is getting replaced twice; you might want consider using strtr() instead to avoid that issue. – Ilmari Karonen – 2012-01-27T20:40:31.003

@Ilmari if you have a number-heavy document, adding those numbers to the dictionary could result in the compression being larger than it was originally. to store several 1-2 character long items is not effective. like if you were to replace the word 'a' in the document – Blazer – 2012-01-27T21:35:02.810

@Blazer - For all compression algorithms there's certain inputs that will result in a larger output. It's inherent in lossless compression, just like the inability to reliably compress entropic data. – Mr. Llama – 2012-01-27T22:24:22.630

@IlmariKaronen - I'll look into it. I specifically included word boundary markers for the matching/replacing, so it shouldn't be caused by overlapping "symbols". If anything, the actual numbers should have made it into the dictionary and been back-substituted correctly. – Mr. Llama – 2012-01-27T22:24:43.817

@GigaWatt: I think the problem is that you're doing the substitution on multiple passes. So, for example, you first replace of with 2 and then, because it's also in the dictionary, 2 with, say, 747. The decompressor sees 747 and decodes it back to 2 (but not to of, since that dictionary entry has been processed already). – Ilmari Karonen – 2012-01-27T23:04:39.077

3

GolfScript, 3647 (compressed size 3408 + code size 239)

128,{[.;]''+}%:d;8:k;{2k?={1k+:k;}*}:|;{2base}:b;{.[0]*@b+0@->}:$;.0=
{'':&,:i;1/{.d&@+?.0<{;d,i@d&@:&.0=:i;[+]+:d;k$\|}{:i;&\+:&;}if}%[0]k*+[]*8/{b}%"\0"\+}
{1>{8$}/][]*:^;{^k<b^k>:^;}:r~{.}{d,|d=:&r..d,<{d=}{;&}if[1<&\+]d\+:d;}while;}if

The algorithm used is LZW compression with variable-width codes. The first line is shared code, the second is the compression code and the third is the decompression code.

It handles files with ASCII characters in the range 1-127, and it recognizes compressed files automatically (they start with a 0 byte), so there are no parameters required to decompress.

Example run:

$ md5sum raven.txt
286206abbb7eca7b1ab69ea4b81da227  raven.txt
$ ruby golfscript.rb compress.gs < raven.txt > raven.lzw
$ ls -l raven.lzw
-rw-r--r-- 1 ahammar ahammar 3408 2012-01-27 22:27 raven.lzw
$ ruby golfscript.rb compress.gs < raven.lzw | md5sum
286206abbb7eca7b1ab69ea4b81da227  -

Note: I started on this long before the requirement to handle 100kb was added, so I haven't tested it on input of that size. However, it takes about 30 seconds to compress the test input and 5 seconds to decompress it, using about 20MB of memory at its peak.

hammar

Posted 2012-01-26T20:59:11.013

Reputation: 4 011

Compressing a 76 kB file seems to take about 19 minutes, while decompressing it takes 10. That is kind of slow, but then again, it does pass the original rules, so... I dunno. Seems kind of unfair not to allow it under the circumstances. I guess I could invoke an implicit "grandfather clause" for you or something. – Ilmari Karonen – 2012-01-27T23:15:01.433

3

Haskell, 3973

Late to the party, and not going to win, but I had fun writing it so I might as well post it.

It's a straightforward variable-width implementation of LZW, with a dictionary explicitely limited to printable ASCII, tab and linefeed. Run with no arguments, it compresses standard input to file C. Run with any argument (but "--decompress" would be a reasonable bet), it decompresses file C to standard output.

import List
import System
import Data.Binary
q=head
main=getArgs>>=m
m[]=getContents>>=encodeFile"C".s 97 128 1 0.e 97h
m _=decodeFile"C">>=putStr.d tail""96h.u 97 128
h=zip[0..].map(:[])$"\t\n"++[' '..'~']
e _ _[]=[]
e n s y=c:e(n+1)((n,take(1+l)y):s)(drop(l)y)where{Just(c,p)=find((`isPrefixOf`y).snd)s;l=length p}
d _ _ _ _[]=""
d f p n s(x:y)=t++d id t(n+1)(f$(n,p++[q t]):s)y where t=maybe(p++[q p])id$lookup x s
s _ _ _ a[]=a::Integer
s n w o a y|n>w=s n(2*w)o a y|0<1=s(n+1)w(o*w)(a+o*q y)(tail y)
u _ _ 0=[]
u n w x|n>w=u n(2*w)x|0<1=(x`mod`w::Integer):u(n+1)w(x`div`w)
  • code size: 578
  • compressed sample size: 3395

J B

Posted 2012-01-26T20:59:11.013

Reputation: 9 638