Who wants to be a Kolmogorov complexity winner?

22

5

You mission today is to invent a text compressor.

Task

You'll write two functions:

  • The packer is a function that accepts a string of ASCII characters (U+0000 to U+007F) and outputs a Unicode string (U+0000 to U+10FFFF), containing the fewest characters possible.

  • The unpacker is a function that accepts an encoded Unicode string and outputs exactly the original ASCII string.

Input

The only authorized input is the ASCII string (for the packer) and the packed Unicode string (for the unpacker). No user input, no internet connection, no use of file system.

Your functions can have access to this list of english words. You can use this list as a local txt file, or copy its content in your source code as a string or an array of strings.

You can't hardcode the snippets below in your functions.

Output

The only authorized output for both functions is a string.

The output of the unpacker must contain exactly the same characters as the input of the packer.

Your inputs and outputs can use any character encoding supporting all Unicode (UTF-8/16/32, GB18030, ...), as your score will only depend on the number of Unicode characters in the output. Please precise which encoding you're using, though.

To count the number of Unicode characters in your output, you can use this tool: http://mothereff.in/byte-counter

Scoring

Your entry must be able to pack and unpack the 10 following text snippets (that I took on this forum).

Your score will be the sum of the sizes of your 10 packed strings (in Unicode characters) + the size of your two functions (in Unicode characters, too)

Don't count the size of the dictionnary if you use it.

Please include in your entries the "score" of each snippet and their packed version.

Lowest score wins.

Data

Here are the snippets to encode to compute your score:

1: Rick Roll lyrics (1870b): We're no strangers to code golf, you know the rules, and so do I

We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy
I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

We've known each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it
And if you ask me how I'm feeling
Don't tell me you're too blind to see

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

(Ooh, give you up)
(Ooh, give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)

We've know each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it

I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

2: The Golfer (412b): Golfing ASCII-art

      '\                   .  .                        |>18>>
        \              .         ' .                   |
       O>>         .                 'o                |
        \       .                                      |
        /\    .                                        |
       / /  .'                                         |
 jgs^^^^^^^`^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3: the number diamond (233b): Print this diamond

        1
       121
      12321
     1234321
    123454321
   12345654321
  1234567654321
 123456787654321
12345678987654321
 123456787654321
  1234567654321
   12345654321
    123454321
     1234321
      12321
       121
        1

4: the alphabet four times (107b): Print the alphabet four times

abcdefghijklmnopqrstuvwxyz
qwertyuiopasdfghjklzxcvbnm
pyfgcrlaoeuidhtnsqjkxbmwvz
zyxwvutsrqponmlkjihgfedcba

5: Old McDonald's lyrics (203b): Old MacDonald function

Old MacDonald had a farm, E-I-E-I-O,
And on that farm he had a cow, E-I-E-I-O,
With a moo moo here and a moo moo there,
Here a moo, there a moo, everywhere a moo moo,
Old MacDonald had a farm, E-I-E-I-O!

6: Rock around the clock lyrics (144b): Rock Around the Clock

1, 2, 3 o'clock, 4 o'clock rock,
5, 6, 7 o'clock, 8 o'clock rock,
9, 10, 11 o'clock, 12 o'clock rock,
We're gonna rock around the clock tonight.

7: Hello World (296b): Say "Hello" to the world in ASCII art

 _   _      _ _                             _     _ _
| | | | ___| | | ___    __      _____  _ __| | __| | |
| |_| |/ _ \ | |/ _ \   \ \ /\ / / _ \| '__| |/ _` | |
|  _  |  __/ | | (_) |   \ V  V / (_) | |  | | (_| |_|
|_| |_|\___|_|_|\___( )   \_/\_/ \___/|_|  |_|\__,_(_)
                    |/

8: Irish blessing (210b): An Old Irish Blessing

May the road rise up to meet you
May the wind be always at your back
May the sun shine warm upon your face
The rains fall soft upon your fields
And until we meet again
May God hold you in the hollow of His hand

9: There was an old lady lyrics (1208b): There Was an Old Lady

There was an old lady who swallowed a fly.  
I don't know why she swallowed that fly,  
Perhaps she'll die.

There was an old lady who swallowed a spider,  
That wriggled and iggled and jiggled inside her.  
She swallowed the spider to catch the fly,  
I don't know why she swallowed that fly,  
Perhaps she'll die.

There was an old lady who swallowed a bird,  
How absurd to swallow a bird.  
She swallowed the bird to catch the spider,  
She swallowed the spider to catch the fly,  
I don't know why she swallowed that fly,  
Perhaps she'll die.

There was an old lady who swallowed a cat,  
Imagine that to swallow a cat.  
She swallowed the cat to catch the bird,  
She swallowed the bird to catch the spider,  
She swallowed the spider to catch the fly,  
I don't know why she swallowed that fly,  
Perhaps she'll die.

There was an old lady who swallowed a dog,  
What a hog to swallow a dog.  
She swallowed the dog to catch the cat,  
She swallowed the cat to catch the bird,  
She swallowed the bird to catch the spider,  
She swallowed the spider to catch the fly,  
I don't know why she swallowed that fly,  
Perhaps she'll die.

There was an old lady who swallowed a horse,  
She died of course.

10: gettysburg address (1452b): How random is the Gettysburg Address

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us-that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion-that we here highly resolve that these dead shall not have died in vain-that this nation, under God, shall have a new birth of freedom-and that government of the people, by the people, for the people, shall not perish from the earth.

Total (uncompressed): 6135 chars/bytes.

Have fun!

xem

Posted 2014-07-17T21:13:02.447

Reputation: 5 523

Not immediately obvious how this is a programming language. The 'compiler' doesn't take source code. I guess rather it creates a program which is then interpreted – Claudiu – 2014-07-17T21:48:10.317

7This isn't a challenge to invent a language, this is a challenge to compress something. – Justin – 2014-07-17T21:56:57.147

2I think not including the size of the compiler/executor (compressor/decompressor) in the score makes this challenge a little open-ended. At some point, the line between dictionary and hard-coding will become very thin. – Dennis – 2014-07-17T23:12:13.750

2Darn, and here I was already typing

`private static final String RICK_ROLL_RETURN = "We're no strangers to love...`
 – Graph Theory  – 2014-07-18T00:16:54.377

To make it a real language creation challenge, maybe specify that it has to be Turing-complete and the executor can't use "eval" or similar. – histocrat – 2014-07-18T02:36:24.317

Okay, I removed the mention of "language". now it's just a packer/unpacker. – xem – 2014-07-18T05:40:46.163

Hey guys, I've made the rules clearer (I hope) by providing an authorized dictionnary of english words. People can use it in their packer/unpacker, but they can't use any other external input. Hope this helps. – xem – 2014-07-18T17:37:34.830

1I don't think you've addressed Dennis' observation. – Peter Taylor – 2014-07-18T20:00:22.377

1@xem I think the post could be improved by organizing the information into sections such like #Task, #Input, #Output, #Scoring. I also think that the size of the source code for the compressor and the decompressor should be included in the score. This doesn't hurt anything, but it solves the problem Dennis pointed out. – Rainbolt – 2014-07-18T21:30:56.903

UTF8 to UTF32 correct? – Milo – 2014-07-19T00:52:56.183

@Milo I do not think so since the score is based on the number of unicode characters of the encoded output. UTF8 to UTF32 would have no effect on the score. – gxtaillon – 2014-07-19T01:40:33.893

@gxtaillon well it changes the characters we can use. Our input is 0-127 but our output is much larger because U+10FFFF is UTF-32 only I am pretty sure. Unless you use four UTF-8 unicode characters to make one UTF-32 – Milo – 2014-07-19T02:17:36.427

@xem what order should the snippets be output? better yet, instead of listing the links to the snippets, please provide the official expected output. – Moogie – 2014-07-19T04:16:21.320

Hi all, I've edited all the question, it's now a code-golf challenge. The size of your code counts in the score. I've clearly explained which input and output are expected and I've provided the dictionnary as an array of strings, if anyoune needs it. – xem – 2014-07-19T06:58:30.853

@moogie no particular order, you should pack them separately – xem – 2014-07-19T06:59:38.717

@Milo & gxtaillon, I've made this point clearer: use the encoding that you want. – xem – 2014-07-19T07:01:55.413

Answers

6

Haskell - 5322 points

Bytes of code : 686

Original size : 6147 = 1871+415+234+108+204+145+297+211+1209+1453

Encoded size : 4636 = 1396+233+163+92+153+115+197+164+979+1144

Score : 686+ 4636

Character count compression : ~25%

Code

Optimizations aside, this stores values between 0 and 7f in unicode characters as their prime factors.

It does not lower the number of bytes of the encoded output, it only lowers the number of unicode characters. For instance, test #4 contains 108 characters and the encoded output, 92. Their respective sizes are however 108 and 364 bytes.

import Data.Bits
import Data.List
import Data.Numbers.Primes
import qualified Data.Text as T
a=nub$concat$map(primeFactors)[0..127]
d(a:r)c=let s=shift c 5in if s<=0x10ffffthen d r(s+a)else c:d r a
d[]c=[c]
f(a:r)=let o=a.&.0x1fin(if o/=a then f((shiftR a 5):r)else f r)++[o]
f[]=[]
g=T.pack.map(toEnum).(\(a:r)->d r a).concatMap j.map(map(\x->head$elemIndices x a)).map(primeFactors.fromEnum).T.unpack
h=T.pack.map(toEnum.product.map((!!)a)).i.f.reverse.map(fromEnum).T.unpack
i(a:r)=let z=a`clearBit`4;x=if a`testBit`4then(take z$repeat$head r,tail r)else splitAt z r in[fst x]++i(snd x)
i[]=[]
j x@(a:_)=let l=length x in if(take l(repeat a))==x then[l`setBit`4,a]else[l]++x
j[]=[0]

How it works

  • Encoding

    1. Each character is converted to its numeric equivalent, lets call this number n.
    2. n is then converted into a list of its prime factors, ps.
      • It conveniently happens that the numbers 0 through 127 have 32 common prime factors, excluding 1. This mean they, the factors, can be stored as indexes on as little as 5 bits.
      • 1 is a special case and is represented by an empty list.
    3. With ps the encoding can now start.
      1. Each number of ps is converted into its index in the list of 32 unique factors (In the above code this list is identified as a).
      2. (Keep in mind at this point we are dealing with a list of list of indexes of prime factors) To proceed to the next step, ps needs to be flattened. To preserve the integrity of the data, each list is converted into another list of two parts
        1. The first element stores its length and if the it is composed of the same factor.
          • There are at most 6 prime factors per list, this information is stored on the rightmost 3 bits. The fifth bit is used as a flag to indicate if the list is comprised of a single factor.
        2. The remaining elements are the indexes themselves or a single index if there is less than two different factors in the list.
      3. These lists are then concatenated into a single flattened list, fs.
    4. The elements of fs can then be packed into unicode characters using bit shifting.
  • Decoding

    • Do the encoding steps in reverse.
    • If you are wondering how 1 fits into this, I would like to remind you that product [] == 1.

Tests

Using this interface for testing would be painful so I used this function to provide the results below.

edTest f = do
    t <- readFile f
    let txt = T.pack t
        enc = g txt
        dec = h enc
        tst = txt == dec
    putStrLn $ (show $ T.length txt) ++ "," ++ (show $ T.length enc) ++ "," ++ (show $ T.length dec)++","++(show tst)
    putStrLn $ if not tst then T.unpack txt ++ "\n---NEXT---\n" ++ T.unpack dec else ""


λ> edTest "1"
1871,1396,1871,True

λ> edTest "2"
412,233,412,True

λ> edTest "3"
234,163,234,True

λ> edTest "4"
108,92,108,True

λ> edTest "5"
204,153,204,True

λ> edTest "6"
145,115,145,True

λ> edTest "7"
297,197,297,True

λ> edTest "8"
211,164,211,True

λ> edTest "9"
1209,979,1209,True

λ> edTest "10"
1453,1144,1453,True

Sample

The output of the encoding function g for test #4 is this
"\99429\582753\135266\70785\35953\855074\247652\1082563\68738\49724\164898\68157\99429\67973\1082404\587873\73795\298017\330818\198705\69861\1082435\595009\607426\36414\69873\855074\265249\346275\67779\68738\77985\1082513\821353\132131\101410\247652\1082562\49724\164898\67649\594977\34915\67746\50273\135265\103997\563265\103457\1086021\99399\584802\70753\73889\34882\582722\411459\67779\68740\1084516\1082563\1091681\103491\313282\49724\164897\68705\135741\69858\50241\607426\35905\608421\1082435\69858\50274\71777\43075\298018\280517\1082404\67971\36017\955425\67665\919600\100452\132129\214883\35057\856097\101474\70753\135737"
or if you are an adept of gibberish, this
豱숼踾숼衣쑡衂숼쑁豁쑢ꡃ貱裱

Additionnal notes

  • Using http://mothereff.in/byte-counter, directory listings and edTest the size of the tests are all consistent but still differs from the indicated size in the question.
  • Test #10 contains a couple of EM DASHes () that I replaced with - since they are outside of the 0-7f range.
  • Further compression could be achieved using the remaining fourth bit while flattening, for instance, 00 base case, 01 repeat all, 10 repeat except last, 11 repeat except last two.
  • The test files and the code are all available here https://github.com/gxtaillon/codegolf/tree/master/Kolmogorov

gxtaillon

Posted 2014-07-17T21:13:02.447

Reputation: 577

Hello, thanks for this answer! :) I didn't understand what happens in binary when you convert abcdefghijklm... to 豱..., could you explain it a little bit more please? Also, I've fixed the char counts and converted the the em-dashes in #10, in the question. My char counts are still different than yours, though. Dunno why, I used the mothereff.in tool. – xem – 2014-07-20T07:38:53.753

@xem The intricate details have been revealed. – gxtaillon – 2014-07-20T18:22:42.797

My mind is (figuratively) literally blown by the idea that the numbers 0 and 2-127 can be encoded on 5 bits. Did you find this by yourself or was it something known? Bonus question: how many bits do you need to store only the printable ascii characters, i.e. 95 different chars? – xem – 2014-07-20T18:43:33.777

@xem The numbers are not encoded on 5 bits, each of their factors are. I would be very happy if I had found a way of encoding 7 bits on only 5. As for ascii characters, using this method they would still need 5 bits each. – gxtaillon – 2014-07-20T19:15:58.020

ahah okay, that makes more sense. So, you use a few bits (4 or 5?) to specify how many factors you'll use, then you add the indexes of those factors, encoded on 5 bits. Right? – xem – 2014-07-20T21:35:17.527

1Since in the range you specified there are at most 6 factors per number, the length uses 3 out of a 5 bits "block". Then the indexes are encoded on 5 bits, yes. In this implementation, one of the 2 unused bits in the length block is used to get additional compression. – gxtaillon – 2014-07-21T02:41:05.837

4

C++ (C++11), 2741 points

This answer uses UTF-32 as the encoding for the compressed text.

#include <cstdio>
#include <iostream>
#include <locale>
#include <string>
#define L locale
using namespace std;long b,n,i,l;void c(){string d;char x;while((x=cin.get())!=EOF)d+=x;b=(d.size()*7)%20;n=5;wcout.imbue(L());for(char y:d){b=b<<7|y&127;n+=7;if(n>=20)wcout.put(b>>(n-=20)&0xFFFFF);}if(n)wcout.put(b<<20-n&0xFFFFF);}void d(){wstring d;wchar_t w;wcin.imbue(L());while((w=wcin.get())!=EOF)d+=w;l=-1;for(wchar_t y:d){b=b<<20|y;n+=20;if(l<0)l=b>>15&31,n-=5;while(n>=7&(i<d.size()-1|n>20-l))cout.put(b>>(n-=7)&127);++i;}}int main(int t,char**a){L::global(L("en_US.utf8"));**++a<'d'?c():d();}

Char counts and scoring

Code: 593 chars (the trailing newline is stripped)

Compressed texts (unicode characters): 654 + 145 + 82 + 38 + 51 + 104 + 73 + 423 + 506 = 2148 (counted with wc -m for the number of unicode characters rather than bytes, the byte counts are, as as with @gxtaillon's answer, higher than the originals, 8413 bytes in total, as counted with wc -c).

Compression ratio (ASCII to unicode): 35.01% (using the 6135 bytes from the question (same as wc -c))

Beware:

A lot of shells cannot handle the unicode characters this program produces. Thus, decompressing may lead to the text being cut off somewhere when the shell cannot handle a character, as input is taken via stdin from the shell.

Compiling

It should compile with clang++ and g++ -std=c++11, but it will show some warnings about operator precedence, as an expression like b<<20-n&0xFFFFF will not be treated as ((b << 20) - n) & 0xFFFFF, as one might expect, but rather as (b << (20 - n)) & 0xFFFFF.

Usage

  • Compile the program into an executable, e.g. ./compress.
  • Run the program as ./compress c to compress or ./compress d to decompress. (Careful, leaving out the option gives a SEGFAULT (error checking is so character expensive...) and other options (such as using D instead of d) may give unexpected results
  • Input is read from stdin and output written to stdout

How it works

Ungolfed

#include <cstdio>
#include <iostream>
#include <locale>
#include <string>

using namespace std;

long b, n, i, l;

// Compress
void c() {
    string d;
    char x;
    // Read from STDIN until EOF
    while((x = cin.get()) != EOF)
        d += x;
    // Calculate the number of bits used from the last unicode character
    // (maximum 19) and store it into the first 5 bits
    b = (d.size() * 7) % 20;
    n = 5;
    // Set the output locale to allow unicode
    wcout.imbue(locale());
    // For each character in the input...
    for (char y : d) {
        // Add its bit representation (7 bits) to the right of the buffer
        // by shifting the buffer left and ORing with the character code
        b = (b << 7) | (y & 127);
        // Add 7 to the bit counter
        n += 7;
        // If enough data is present (20 bits per output character),
        // calculate the output character by shifting the buffer right,
        // so that the last 20 bits are the left 20 bits of the buffer.
        // Also decrement the bit counter by 20, as 20 bits are removed.
        if (n >= 20)
            wcout.put((b >> (n -= 20)) & 0xFFFFF);
    }
    // If there are still bits in the buffer, write them to the front of
    // another unicode character
    if (n)
        wcout.put((b << (20 - n)) & 0xFFFFF);
}

// Decompress
void d() {
    wstring d;
    wchar_t w;
    // Set STDIN to UNICODE
    wcin.imbue(locale());
    // Read wide characters from STDIN (std::wcin) until EOF
    while ((w = wcin.get()) != EOF)
        d += w;
    // `l' represents the number of bits used in the last unicode character.
    // It will be set later
    l = -1;
    // For each input character...
    for (wchar_t y : d) {
        // Add it to the buffer and add 20 to the bit counter
        b = (b << 20) | y;
        n += 20;
        // If the number of bits in the last unicode character has not been
        // set yet, read the first 5 buffer bits into `l'. This is
        // necessary because the last character may contain more than 7
        // (one entire uncompressed character) unused bits which may
        // otherwise be interpreted as garbage.
        if (l < 0) {
            l = (b >> 15) & 31;
            n -= 5;
        }
        // As long as there is data to turn into output characters
        // (at least 7 bits in the buffer and either not the last
        // unicode character or before the unused bits)
        while (n >= 7 && ((i < d.size() - 1) || (n > (20 - l)))
            cout.put((b >> (n -= 7)) & 127); // Output the left 7 bits in the buffer as an ASCII character
        ++i; // Increment the character index, so that we know when we reach the last input character
    }
}
int main(int t, char**a) {
    // Set the default locale to en_US.utf8 (with unicode)
    locale::global(locale("en_US.utf8"));
    // Decide whether to compress or decompress.
    // This is just fancy pointer stuff for a[1][0] < 'd' ? c() : d()
    (**(++a) < 'd') ? c() : d();
}

Explanation

As all unicode characters from U+0000 to U+10FFFF are allowed, we can use 20 bits per unicode char: U+FFFFF uses 20 bits and is still included in the allowed range. Thus, we just try to cram all the individual ASCII char bits into the unicode characters to store multiple ASCII characters in one unicode character. However, we also need to store the number of bits used in the last unicode character, because unused garbage bits may otherwise be interpreted as proper compressed ASCII characters. As the maximum number of used bits in the last unicode character is 20, we will need 5 bits for that, which are placed in the beginning of the compressed data.

Example output

This is the output for example #4 (as given by less):

<U+4E1C5><U+8F265><U+CD9F4><U+69D5A><U+F66DD><U+DBF87><U+1E5CF><U+A75ED>
<U+DFC79><U+F42B8><U+F7CBC><U+BA79E><U+BA77F>쏏<U+A356B><U+D9EBC><U+63ED8>
<U+B76D1><U+5C3CE><U+6CF8F><U+96CC3><U+BF2F5><U+D3934><U+74DDC><U+F8EAD>
<U+7E316><U+DEFDB><U+D0AF5><U+E7C77><U+EDD7A><U+73E5C><U+786FD><U+DB766>
<U+BD5A7><U+467CD><U+97263><U+C5840>

( give <U+C3CF><U+266CF> as character codes, but I might have gotten that wrong)

hlt

Posted 2014-07-17T21:13:02.447

Reputation: 181

2

Python 3, 289+818=1107 points

Only lightly golfed.

import zlib as Z
def p(s):
 z=int.from_bytes(Z.compress(s),'big');o=''
 while z:
  z,d=divmod(z,1<<20)
  if d>0xd000:d+=1<<16
  o+=chr(d)
 return o[::-1]
def u(s):
 i=0
 for c in s:
  d=ord(c)
  if d>0xe000:d-=1<<16
  i=(i<<20)+d
 return Z.decompress(i.to_bytes(i.bit_length()//8+1,'big'))

The total code size is 289 bytes, and encodes the given 6135 bytes into 818 Unicode characters -- the total output byte count is 3201 bytes, significantly smaller than the original inputs.

Encodes using zlib, then secondarily using unicode encoding. Some extra logic was needed to avoid surrogates (which Python really hates).

Example output from #4, as seen by less (37 Unicode chars):

x<U+AC0DC><U+BB701><U+D0200><U+D00B0><U+AD2F4><U+EEFC5><U+F4F34>멍<U+3C63A><U+2F62C><U+BA5B6><U+4E70A><U+F7D88><U+FF138><U+40CAE>
<U+CB43E><U+C30F5><U+6FFEF><U+698BE><U+9D73A><U+95199><U+BD941><U+10B55E><U+88889><U+75A1F><U+4C4BB><U+5C67A><U+1089A3><U+C75A7>
<U+38AC1><U+4B6BB><U+592F0>ᚋ<U+F2C9B>

Driver program for testing:

if __name__ == '__main__':
    import os
    total = 0
    for i in range(1,10+1):
        out = p(open('data/%d.txt'%i,'rb').read())
        total += len(out)
        open('out/%d.bin'%i,'w',encoding='utf8').write(out)
    print(total)
    for i in range(1,10+1):
        out = u(open('out/%d.bin'%i,'r',encoding='utf8').read())
        open('data2/%d.txt'%i,'wb').write(out)

Output byte counts:

 607 out/1.bin
 128 out/2.bin
 101 out/3.bin
 143 out/4.bin
 177 out/5.bin
 145 out/6.bin
 186 out/7.bin
 222 out/8.bin
 389 out/9.bin
1103 out/10.bin
3201 total

nneonneo

Posted 2014-07-17T21:13:02.447

Reputation: 11 445

1Isn't the fact that this is using a compression library cheating a bit? – Beta Decay – 2014-11-01T10:53:19.817

@BetaDecay: It doesn't restrict that in the question, so I figured it was fair game. – nneonneo – 2014-11-01T17:59:12.710

Also, you need to include a decompresser. – Beta Decay – 2014-11-04T07:07:00.070

@BetaDecay: p is the packer, u is the unpacker. – nneonneo – 2014-11-04T15:04:42.780

1

Python 2 - 1141 points

from zlib import *;v=256
def e(b):
 x=0
 for c in compress(b,9):x=(x*v)+ord(c)
 b=bin(x)[2:]
 return "".join(unichr(int("1"+b[a:a+19],2))for a in range(0,len(b),19))
def d(s):
 y=int("".join(bin(ord(a))[3:]for a in s),2);x=""
 while y:y,d=(y/v,chr(y%v));x=d+x
 return decompress(x)

Code size is 281 bytes and it encodes the 6135 bytes into 860 unicode characters.

How it works:

To Encode:

  1. Compress the string to be encoded.
  2. Interpret the compressed string as a base 256 number.
  3. Convert the number to binary.
  4. Split the binary into groups of 19 bits, add a 1 bit to the beginning of each of them, and then convert to Unicode characters.

Decoding is the reverse.

Note that some versions of python can only handle unicode characters up to 0xFFFF, and thus this code will raise a ValueError.

pppery

Posted 2014-07-17T21:13:02.447

Reputation: 3 987