Background
Of the 256 values that a byte can represent, only a few are used in most text. Couldn't we take advantage of this and make our text files smaller by eliminating the rarely used characters?
Many letters add little value and can be replaced by more common ones. For example, a lower-case "L", a capital "I", and the digit "1" look nearly identical in most situations, so they can be consolidated.
There is little need for capital letters, so they can be dispensed with. The decompression/display program could even automatically capitalize the first letter of every sentence, common names, etc.
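A minimal sketch of the idea above, in Python. The merge table and the function names `squash` and `recapitalize` are made up for illustration, and the capitalisation rule is a naive assumption (a letter at the start of the text or after ".", "!" or "?" followed by whitespace), so abbreviations and names will trip it up.

```python
import re

# Hypothetical merge table: capital "I" and the digit "1" collapse onto "l".
MERGE = str.maketrans({"I": "l", "1": "l"})


def squash(text: str) -> str:
    """Lossy step: merge look-alike characters first, then drop capitals."""
    return text.translate(MERGE).lower()


def recapitalize(text: str) -> str:
    """Display step: capitalise the first letter of the text and of every
    sentence that follows '.', '!' or '?' plus whitespace."""
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


if __name__ == "__main__":
    stored = squash("Ill take 1 apple. Then I leave!")
    print(stored)
    print(recapitalize(stored))
```

Note that the merge is not reversible: after `squash`, the display program cannot tell which "l" used to be an "I" or a "1", which is exactly the readability trade-off the challenge asks entries to balance.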
Rules
Entries will be judged on:
- compression ratio
- readability after de-compression
Entries will be tested against the plain text version of this article: http://en.wikipedia.org/wiki/Babbage and a randomly selected BBC News article.
Extra marks will be awarded for preserving any mark-up and for beautifying the text after de-compression (e.g. capitalising sentences).
Languages
- Any you like, but must easily compile (or be interpreted) on a basic *nix box.
Assuming the input consists only of printable ASCII (7-bit) characters, we don't need any of the non-printable ASCII chars. Replacing \r\n with \n, replacing \t with some spaces, converting all upper-case letters to lower-case, and ignoring or substituting 6 seldom-used special characters, we're left with 64 characters that we need to encode -- in other words, a 6-bit encoding would be sufficient. We could therefore store 4 characters in every 3 bytes, saving up to 25%. If we take 8-bit input (like ISO-8859 letters with diacritic marks) we would need to convert them to plain ASCII beforehand. – tmh – 2015-12-22T21:44:12.807
So PowerShell is out? Bummer. – Joey – 2011-04-13T16:44:54.837
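A rough Python sketch of the fixed-width packing tmh's comment describes. A 64-symbol alphabet needs 6 bits per symbol (2^6 = 64), so four characters fit into three bytes. The particular alphabet below (which 6 punctuation marks are dropped), the 4-space tab expansion, the "?" substitute and the 4-byte length header are all assumptions made for the example, not part of the comment.

```python
import string

# Hypothetical 64-symbol alphabet: newline, space, a-z, 0-9 and the first 26
# punctuation marks (drops the 6 marks _ ` { | } ~).
ALPHABET = "\n " + string.ascii_lowercase + string.digits + string.punctuation[:26]
assert len(ALPHABET) == 64
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text: str) -> str:
    """Lossy preprocessing: unify newlines, expand tabs, drop capitals."""
    text = text.replace("\r\n", "\n").replace("\t", "    ").lower()
    # Anything outside the alphabet is substituted with "?" (an assumption).
    return "".join(ch if ch in INDEX else "?" for ch in text)


def pack(text: str) -> bytes:
    """Pack each character into 6 bits, preceded by a 4-byte character count."""
    codes = [INDEX[ch] for ch in normalize(text)]
    out = bytearray(len(codes).to_bytes(4, "big"))
    acc = nbits = 0
    for code in codes:
        acc = (acc << 6) | code
        nbits += 6
        while nbits >= 8:              # emit every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1        # keep only the bits not yet emitted
    if nbits:                          # zero-pad the final partial byte
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack(data: bytes) -> str:
    """Inverse of pack(); returns the normalized text."""
    count = int.from_bytes(data[:4], "big")
    chars = []
    acc = nbits = 0
    for byte in data[4:]:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 6 and len(chars) < count:
            nbits -= 6
            chars.append(ALPHABET[(acc >> nbits) & 0x3F])
        acc &= (1 << nbits) - 1
    return "".join(chars)
```

The length header is there because the zero padding in the last byte would otherwise be indistinguishable from symbol 0 (the newline). On long inputs the output approaches 75% of the original size, which is the ceiling for any fixed 6-bit code.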
Haskell:
main = interact (\x -> take 90 x ++ " yada yada yada")
– Joey Adams – 2011-04-13T19:26:26.053
Note also that "readability after decompression" is a fairly subjective criterion. – Joey – 2011-04-13T20:02:18.280
Especially on a Unix-Box, we need the distinction upper case, lower case. :) And finding the beginning of a sent. Isn't trivial, if the u. Uses abbrev.! :) – user unknown – 2011-04-21T03:34:54.550
Do we want to compress the alphabet or the text? :) L = l = 1 compresses the characters needed to represent our thoughts. But "one apple" = "1 apl" compresses the text. – anemgyenge – 2011-04-22T17:00:19.857
Remove all vowels and reduce inflections; then, during decompression, use a dictionary to find the correct full word. – Ming-Tang – 2011-04-24T02:01:03.737
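One possible reading of the vowel-dropping suggestion, sketched in Python. The helper names and the toy word list are hypothetical; the sketch only covers the vowel part (not the inflection reduction), treats only a, e, i, o, u as vowels, and resolves collisions by taking the first matching dictionary entry, which a real entry would probably replace with a frequency-ordered word list.

```python
import re

VOWELS = set("aeiou")


def strip_vowels(word: str) -> str:
    """Drop vowels; all-vowel words such as "a" are kept whole."""
    stripped = "".join(ch for ch in word if ch not in VOWELS)
    return stripped or word


def build_lookup(dictionary):
    """Map each consonant skeleton to the first dictionary word producing it."""
    lookup = {}
    for word in dictionary:
        lookup.setdefault(strip_vowels(word), word)
    return lookup


def compress(text: str) -> str:
    """Lower-case the text and strip the vowels from every word."""
    return re.sub(r"[a-z]+", lambda m: strip_vowels(m.group()), text.lower())


def decompress(text: str, lookup) -> str:
    """Replace every known skeleton by its dictionary word; leave the rest."""
    return re.sub(r"[a-z]+", lambda m: lookup.get(m.group(), m.group()), text)


if __name__ == "__main__":
    words = ["the", "quick", "brown", "fox"]   # toy dictionary, for illustration
    packed = compress("The quick brown fox")
    print(packed)
    print(decompress(packed, build_lookup(words)))
```

The weak spot is ambiguity: "fox", "fix" and "fax" all share the skeleton "fx", so without context the decompressor can only guess.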