Lossy Text Compression

9

2

Background

Of the 256 possible characters that a byte can represent, only a few of these are used under most circumstances. Couldn't we somehow take advantage of this, and make our text files smaller by eliminating the need for the rarely used letters?

Many letters don't add any value in most situations, and can be replaced by more common letters. For example, a lower-case "L", capital "I", and the number "1" look nearly identical in most situations, so they can be consolidated.

There is little need for capital letters, so they can be dispensed with. The decompression/display program could even automatically capitalize the first letter of every sentence, common names, etc.

Rules

Entries will be judged on:

  • compression ratio
  • readability after de-compression

Entries will be tested against the plain text version of this article: http://en.wikipedia.org/wiki/Babbage and a randomly selected BBC News article.

Extra marks will be awarded for; preserving any mark-up, beautifying after de-compression (i.e. Capitalising sentences etc).

Languages

  • Any you like, but must easily compile (or be interpreted) on a basic *nix box.

rjstelling

Posted 2011-04-13T14:55:45.830

Reputation: 191

Assuming the input consists only of printable ASCII (7-bit) characters, we don't need any of the non-printable ASCII chars. Replacing \r\n with \n, replacing \t with some spaces, converting all upper-case letters to lower-case, and ignoring or substituting 6 seldomly used special characters, we're left with 64 characters that we need to encode -- in other words, a 4-bit encoding would be sufficient. We could therefore store 2 characters per byte, saving up to 50%. If we take 8-bit input (like ISO-8859 letters with diacritic marks) we would need to convert them to plain ASCII beforehand. – tmh – 2015-12-22T21:44:12.807

So PowerShell is out? Bummer. – Joey – 2011-04-13T16:44:54.837

1Haskell: main = interact (\x -> take 90 x ++ " yada yada yada") – Joey Adams – 2011-04-13T19:26:26.053

1Note also that "readability after decompression" is a fairly subjective criterion. – Joey – 2011-04-13T20:02:18.280

Especially on a Unix-Box, we need the distinction upper case, lower case. :) And finding the beginning of a sent. Isn't trivial, if the u. Uses abbrev.! :) – user unknown – 2011-04-21T03:34:54.550

Do we want to compress the alphabet or the text? :) L = l = 1 compresses the characters needed to represent our thoughts. But "one apple" = "1 apl" compresses the text. – anemgyenge – 2011-04-22T17:00:19.857

remove all vowels and reduce inflections, then in decompression, use dictionary to find the correct full word – Ming-Tang – 2011-04-24T02:01:03.737

Answers

11

Perl

Very inefficient and has bad rates. Requires /usr/share/dict/words.

Compressor

#!/usr/bin/perl

$M = 2;
$N = 1;
$Min = 3;
$Max = 8;

while (<>) {
  for (split /\s+/) {
    s/[^a-z]//i;
    ($p) = m/([^a-z]*)$/;
    $_ = lc $_;
    $l = (length $_) - (length $p);
    s/^and$/A/;
    s/^he$/H/;
    s/^in$/I/;
    s/^of$/O/;
    s/^you$/U/;
    s/^the$/Z/;
    if (length $_ >= $Min) {
      if (length $_ <= $Max) {
        s/ed/D/g;
        s/ing\b/N/g;
        s/er/R/g;
        s/'s/S/g;
        s/th/T/g;
        s/[aeo]{1,2}//g;
        $_ .= $l;
      } else {
        s/^(.{$M})(.+)(\w{$N})$/$1.(length$2).$3/e;
      }
    }
    $a .= $_ . $p . ' ';
  }
}
print $a;

Decompressor

#!/usr/bin/perl

$M = 2;
$N = 1;

open D, '/usr/share/dict/words';
chomp, push @W, $_ while <D>;
close D;

while (<>) {
  for (split /\s+/) {
    ($_, $p) = m/^(.+)([^a-z]*)$/;
    s/^A$/and/;
    s/^H$/he/;
    s/^I$/in/;
    s/^O$/of/;
    s/^U$/you/;
    s/^Z$/the/;
    if ($_ =~ m/^(\w{$M})(\d+)(\w{$N})$/) {
      $r = '^' . quotemeta($1) . ('\w' x $2) . quotemeta($3) . '$';
      ($_) = (grep /$r/, @W);
      $_ .= $4;
    } else {
      ($_, $l) = m/^(.+)(\d+)$/;
      s/D/ed/g;
      s/N/ing/g;
      s/R/er/g;
      s/S/'s/g;
      s/T/th/g;
      $r = '[aeo]{0,2}';
      for $y(split //) { $r .= (quotemeta $y) . '[aiueo]{0,2}' }
      ($_) = (grep /^(?=[a-z]{$l})$r$/, @W);
    }
    $a .= $_ . $p . ' ';
  }
}
print $a;

Ming-Tang

Posted 2011-04-13T14:55:45.830

Reputation: 5 383

3

Perl, 0 chars

Compression ratio of infinity, though not that readable after decompression so it will lose some marks.

Ry-

Posted 2011-04-13T14:55:45.830

Reputation: 5 283

2

Bash, 5 chars

My lazy entry that just might win:

bzip2

Lossless, so it preserves readability perfectly and gets all the extra marks! Compression ratio on the Babbage html is 4.79x (153804 to 32084 bytes).

Keith Randall

Posted 2011-04-13T14:55:45.830

Reputation: 19 865

Somehow I knew that was coming with that challenge ;-) – Joey – 2011-04-14T11:07:48.613

That's going to be hard to beat. – Lowjacker – 2011-04-18T15:57:25.797

Hah! I beat it in both length and compression ratio ;) – Ry- – 2011-04-21T01:03:02.427

2xz, even shorter and better ratio :) – OneOfOne – 2011-04-24T08:52:48.340