Create self-extractor script (relaxed rules)

8

2

This is the target language-relaxed version of Generate a self-extractor application

Create a program in any language, that takes an arbitrary binary input, and generates a valid program (also in any language), which when compiled and run (or interpreted) will print this binary input.

So you could do this:

$ cat file.dat | python program.py > gen.rb
$ ruby gen.rb > file.dat.new
$ diff file.dat file.dat.new
$ echo $?
0

The scoring depends on the byte count of the original code (O), and the square of the byte count of the generated codes for the following inputs:

  • A: The whole lyrics of Never gonna give you up
  • B: The uncompressed grayscale 32x32 BMP version of Lena Söderberg
  • C: The original source code (the generator, O)
  • D: The generated source code for the A input

Score is O + max(A, O)^2 + max(B, O)^2 + max(C, O)^2 + max(D, O)^2.

Score explained:

  1. The values for the generated variants are squared to encourage optimizing the size of the generated code (and the size of the original code is reflected in C as well).
  2. Also to encourage minimizing the length of the generator code (as opposed to the generated code), if the generated code is smaller than the generator, then the size of the generator will be used instead.

Lowest score wins.

Clarification of the rules:

  • You can use the standard libraries of the chosen language(s), except for libraries that are designed (or can be used directly) for compression and/or decompression (like zlib, bz2, etc.). Using other kinds of libraries, like base64 are fine, but they have to be part of the standard library of the chosen language(s).
  • The generated code has to be able to print the original data without any external help, like network connections or pipes, etc. (e.g. if you move the gen.exe to another, isolated computer it should be able to generate the output there as well). You also have to assume that the generated code (if you are using an interpreted target language) will be run the following way: $ langint script.scr > output where langint is the interpreter of your chosen language.
  • The code generator should also work for any other input (generates code that regenerates the original input), but compression performance for them is not measured in the score (this means that optimizing the generator only for the test inputs are okay). However if there is (or anyone finds) at least one input with a maximum size of 65535 for which the code generator does not work, then the score of the submission is infinite.
  • The input files are supplied in base64 to make sure everyone will run the app on the same inputs. Please note that the lyrics have windows style line endings. Base64 decoding of the input is not part of the challenge, it can be done separately (eg the file.dat on the example is already the base64 decoded file).

File inputs in base 64:

Never gonna give you up

V2UncmUgbm8gc3RyYW5nZXJzIHRvIGxvdmUNCllvdSBrbm93IHRoZSBydWxlcyBhbmQgc28gZG8g
SQ0KQSBmdWxsIGNvbW1pdG1lbnQncyB3aGF0IEknbSB0aGlua2luZyBvZg0KWW91IHdvdWxkbid0
IGdldCB0aGlzIGZyb20gYW55IG90aGVyIGd1eQ0KSSBqdXN0IHdhbm5hIHRlbGwgeW91IGhvdyBJ
J20gZmVlbGluZw0KR290dGEgbWFrZSB5b3UgdW5kZXJzdGFuZA0KDQpOZXZlciBnb25uYSBnaXZl
IHlvdSB1cA0KTmV2ZXIgZ29ubmEgbGV0IHlvdSBkb3duDQpOZXZlciBnb25uYSBydW4gYXJvdW5k
IGFuZCBkZXNlcnQgeW91DQpOZXZlciBnb25uYSBtYWtlIHlvdSBjcnkNCk5ldmVyIGdvbm5hIHNh
eSBnb29kYnllDQpOZXZlciBnb25uYSB0ZWxsIGEgbGllIGFuZCBodXJ0IHlvdQ0KDQpXZSd2ZSBr
bm93biBlYWNoIG90aGVyIGZvciBzbyBsb25nDQpZb3VyIGhlYXJ0J3MgYmVlbiBhY2hpbmcgYnV0
DQpZb3UncmUgdG9vIHNoeSB0byBzYXkgaXQNCkluc2lkZSB3ZSBib3RoIGtub3cgd2hhdCdzIGJl
ZW4gZ29pbmcgb24NCldlIGtub3cgdGhlIGdhbWUgYW5kIHdlJ3JlIGdvbm5hIHBsYXkgaXQNCkFu
ZCBpZiB5b3UgYXNrIG1lIGhvdyBJJ20gZmVlbGluZw0KRG9uJ3QgdGVsbCBtZSB5b3UncmUgdG9v
IGJsaW5kIHRvIHNlZQ0KDQpOZXZlciBnb25uYSBnaXZlIHlvdSB1cA0KTmV2ZXIgZ29ubmEgbGV0
IHlvdSBkb3duDQpOZXZlciBnb25uYSBydW4gYXJvdW5kIGFuZCBkZXNlcnQgeW91DQpOZXZlciBn
b25uYSBtYWtlIHlvdSBjcnkNCk5ldmVyIGdvbm5hIHNheSBnb29kYnllDQpOZXZlciBnb25uYSB0
ZWxsIGEgbGllIGFuZCBodXJ0IHlvdQ0KDQpOZXZlciBnb25uYSBnaXZlIHlvdSB1cA0KTmV2ZXIg
Z29ubmEgbGV0IHlvdSBkb3duDQpOZXZlciBnb25uYSBydW4gYXJvdW5kIGFuZCBkZXNlcnQgeW91
DQpOZXZlciBnb25uYSBtYWtlIHlvdSBjcnkNCk5ldmVyIGdvbm5hIHNheSBnb29kYnllDQpOZXZl
ciBnb25uYSB0ZWxsIGEgbGllIGFuZCBodXJ0IHlvdQ0KDQooT29oLCBnaXZlIHlvdSB1cCkNCihP
b2gsIGdpdmUgeW91IHVwKQ0KKE9vaCkNCk5ldmVyIGdvbm5hIGdpdmUsIG5ldmVyIGdvbm5hIGdp
dmUNCihHaXZlIHlvdSB1cCkNCihPb2gpDQpOZXZlciBnb25uYSBnaXZlLCBuZXZlciBnb25uYSBn
aXZlDQooR2l2ZSB5b3UgdXApDQoNCldlJ3ZlIGtub3cgZWFjaCBvdGhlciBmb3Igc28gbG9uZw0K
WW91ciBoZWFydCdzIGJlZW4gYWNoaW5nIGJ1dA0KWW91J3JlIHRvbyBzaHkgdG8gc2F5IGl0DQpJ
bnNpZGUgd2UgYm90aCBrbm93IHdoYXQncyBiZWVuIGdvaW5nIG9uDQpXZSBrbm93IHRoZSBnYW1l
IGFuZCB3ZSdyZSBnb25uYSBwbGF5IGl0DQoNCkkganVzdCB3YW5uYSB0ZWxsIHlvdSBob3cgSSdt
IGZlZWxpbmcNCkdvdHRhIG1ha2UgeW91IHVuZGVyc3RhbmQNCg0KTmV2ZXIgZ29ubmEgZ2l2ZSB5
b3UgdXANCk5ldmVyIGdvbm5hIGxldCB5b3UgZG93bg0KTmV2ZXIgZ29ubmEgcnVuIGFyb3VuZCBh
bmQgZGVzZXJ0IHlvdQ0KTmV2ZXIgZ29ubmEgbWFrZSB5b3UgY3J5DQpOZXZlciBnb25uYSBzYXkg
Z29vZGJ5ZQ0KTmV2ZXIgZ29ubmEgdGVsbCBhIGxpZSBhbmQgaHVydCB5b3UNCg0KTmV2ZXIgZ29u
bmEgZ2l2ZSB5b3UgdXANCk5ldmVyIGdvbm5hIGxldCB5b3UgZG93bg0KTmV2ZXIgZ29ubmEgcnVu
IGFyb3VuZCBhbmQgZGVzZXJ0IHlvdQ0KTmV2ZXIgZ29ubmEgbWFrZSB5b3UgY3J5DQpOZXZlciBn
b25uYSBzYXkgZ29vZGJ5ZQ0KTmV2ZXIgZ29ubmEgdGVsbCBhIGxpZSBhbmQgaHVydCB5b3UNCg0K
TmV2ZXIgZ29ubmEgZ2l2ZSB5b3UgdXANCk5ldmVyIGdvbm5hIGxldCB5b3UgZG93bg0KTmV2ZXIg
Z29ubmEgcnVuIGFyb3VuZCBhbmQgZGVzZXJ0IHlvdQ0KTmV2ZXIgZ29ubmEgbWFrZSB5b3UgY3J5
DQpOZXZlciBnb25uYSBzYXkgZ29vZGJ5ZQ0KTmV2ZXIgZ29ubmEgdGVsbCBhIGxpZSBhbmQgaHVy
dCB5b3U=

Lena

Qk02CAAAAAAAADYEAAAoAAAAIAAAACAAAAABAAgAAAAAAAAEAAB1CgAAdQoAAAABAAAAAAAAAAAA
AAEBAQACAgIAAwMDAAQEBAAFBQUABgYGAAcHBwAICAgACQkJAAoKCgALCwsADAwMAA0NDQAODg4A
Dw8PABAQEAAREREAEhISABMTEwAUFBQAFRUVABYWFgAXFxcAGBgYABkZGQAaGhoAGxsbABwcHAAd
HR0AHh4eAB8fHwAgICAAISEhACIiIgAjIyMAJCQkACUlJQAmJiYAJycnACgoKAApKSkAKioqACsr
KwAsLCwALS0tAC4uLgAvLy8AMDAwADExMQAyMjIAMzMzADQ0NAA1NTUANjY2ADc3NwA4ODgAOTk5
ADo6OgA7OzsAPDw8AD09PQA+Pj4APz8/AEBAQABBQUEAQkJCAENDQwBEREQARUVFAEZGRgBHR0cA
SEhIAElJSQBKSkoAS0tLAExMTABNTU0ATk5OAE9PTwBQUFAAUVFRAFJSUgBTU1MAVFRUAFVVVQBW
VlYAV1dXAFhYWABZWVkAWlpaAFtbWwBcXFwAXV1dAF5eXgBfX18AYGBgAGFhYQBiYmIAY2NjAGRk
ZABlZWUAZmZmAGdnZwBoaGgAaWlpAGpqagBra2sAbGxsAG1tbQBubm4Ab29vAHBwcABxcXEAcnJy
AHNzcwB0dHQAdXV1AHZ2dgB3d3cAeHh4AHl5eQB6enoAe3t7AHx8fAB9fX0Afn5+AH9/fwCAgIAA
gYGBAIKCggCDg4MAhISEAIWFhQCGhoYAh4eHAIiIiACJiYkAioqKAIuLiwCMjIwAjY2NAI6OjgCP
j48AkJCQAJGRkQCSkpIAk5OTAJSUlACVlZUAlpaWAJeXlwCYmJgAmZmZAJqamgCbm5sAnJycAJ2d
nQCenp4An5+fAKCgoAChoaEAoqKiAKOjowCkpKQApaWlAKampgCnp6cAqKioAKmpqQCqqqoAq6ur
AKysrACtra0Arq6uAK+vrwCwsLAAsbGxALKysgCzs7MAtLS0ALW1tQC2trYAt7e3ALi4uAC5ubkA
urq6ALu7uwC8vLwAvb29AL6+vgC/v78AwMDAAMHBwQDCwsIAw8PDAMTExADFxcUAxsbGAMfHxwDI
yMgAycnJAMrKygDLy8sAzMzMAM3NzQDOzs4Az8/PANDQ0ADR0dEA0tLSANPT0wDU1NQA1dXVANbW
1gDX19cA2NjYANnZ2QDa2toA29vbANzc3ADd3d0A3t7eAN/f3wDg4OAA4eHhAOLi4gDj4+MA5OTk
AOXl5QDm5uYA5+fnAOjo6ADp6ekA6urqAOvr6wDs7OwA7e3tAO7u7gDv7+8A8PDwAPHx8QDy8vIA
8/PzAPT09AD19fUA9vb2APf39wD4+PgA+fn5APr6+gD7+/sA/Pz8AP39/QD+/v4A////AEmnoKA4
OUVOT0tDSHCAhIiNkpOUnrDNtmJ3dWBUZFk/aJSbp0MwOkU9W0IpUGSCi4+QjJKiu9ieXliNaExV
ZUVZiZerTS45ND1jSyhDY2qNj4uNmazH2H9SS5i1WFZhTlB8l6tMMTk1QVtcOzRkaYSOiZKgt8/K
foCJj8eRWWBbXHaZqkcwRDM8a3RSOkxneYuMmKi/26B8lYeO2bNjZWJycpqsTi5CO0twc0ozPVVz
i5Cer8fcZG6WjIKt0nBhZn5TmqxaNUhNV3ZqPzU7RXKJjaLEzok/c5SEcXbQnktZOU+kpWI9QWBu
dU06MjU9ZIGUpJlbOk9whX1+e8bXi0UuY6qhXEk/aXJ4PTU0OkJjh5qmYShASW+dmZaFvdzWujZe
raRoVUNpYJV1MDZCYoOKfo2HPDhDaaGalYi02tXWQF6soWZdQlJgbJJRLkp6i4uIm6hlNEFjmZiT
i6rT0tVKV6igYV9MXV09U3xHRXyNmpeHqI09PmSGnJWOoczQ0klKpZ1lalRhQjdWWH5AeJWll3ur
oVU9YGahmJGRvszQT1GooF94WlNGPFw3e3tzobKjh7ejZERcSJqalYugxsheW6uhX29zWUlTWCo/
pI+IkZ+GsotURl81ip2Zlo2ftWBkqp1ZX3OTU0pMNi9rwGxMbYGyZz5GWy5woZqbmZCMY2uqm1hk
aoFyY0U5Nz2Rs4l2oMG5cUlRMVScop+ZlY5ka6qaWWVrdYZ/UjY3P0+bs6mvxtuHUUQ2OouhnJmZ
mWdtqZlVYWZwp317WD1FOkqYuKy+xmZwQjEvZp+ko6CcZ3OqllZhYX2zf36AZEQ7RVmku7i4doeP
QipGmaehnZtjdqyVVmJcmKl5f4SIdVhpUFCkxMPFpoiCPy58pZ+bml5yq5VXYl6YoHWCh46Ukpea
lp6yv8fPw590JlKcoJ2dYHCqlFhjYJOKan2Mk6Crrq61uLK2wsrQxbE2L3mjn5xhb6aQVmRli3hs
eYKYq7GyucG/yMemnr3W1U4pQ5WhnFlqpIxUYmh+e218goeluL3DycvZoo+BSKvQRjAuYZ6hV2yk
jFdjanZ5b3eFiIquw8nM1r1nlYczd5tZLTM0eKNpaKeLV2RpdXpzc3uIjqe8ydbJamSZgmKTko89
LTBCg5Rupo1VY2p0e313foaUr77GtH9ncJWNjJKPlI08LzE9pY+ijlplbXl+gIKCjZ6wsp18eXRy
jpeYk42S1nUpMy2ipaSMX2dsen6AgYKBgoWBfX+Ad3WTnp2bkLm7dVgwMZ2hp41haG17gYKDg4N/
f4ODgIB3dZyinpigzIJ2glsun5yokmJqbnyDhIWFhoSDhYOBgnSCnp2bmMKob3t9gWU=

SztupY

Posted 2014-01-12T20:24:17.730

Reputation: 3 639

Yay! A new challenge :) – Timtech – 2014-01-12T21:31:03.877

1Max(D,O)? Did you mean Min(D,O)? Otherwise we'll be given the score of the uncompressed file unless we manage to make it even larger. – John Dvorak – 2014-01-12T21:43:10.060

@JanDvorak No... say I have a 25 char generator (O). And D is what O generated given A. So if A was 100 and O made it 50 (50 = D), Max(D,O) = 50 – Timtech – 2014-01-12T21:54:42.553

Answers

1

PHP - 141.047.262 45.922.273 13.895.520 13.779.516

My code applies a handcrafted huffmann encoding to the input, see below for an uncompressed and commented version.

  • O: 1145
  • A: 1901 (42 bytes smaller)
  • B: 2180 (78 bytes larger)
  • C: 1223 (78 bytes larger)
  • D: 1979

Demo:

[timwolla@~/workspace/php]cat lena.txt |php huffmann.php |php|md5sum && md5sum lena.txt
ebcc2e2a501121b37dcbac3e438be58b  -
ebcc2e2a501121b37dcbac3e438be58b  lena.txt
[timwolla@~/workspace/php]cat rick.txt |php huffmann.php |php|md5sum && md5sum rick.txt
549d9635795ef37857e0b3dc02e52e53  -
549d9635795ef37857e0b3dc02e52e53  rick.txt

Minified version:

<?php
$q=file_get_contents('php://stdin');$_='str_split';$i=$_($q);$t=[];$c=array_count_values($i);foreach($c as$k=>$v)$t[]=['c'=>$v,'v'=>$k];do{usort($t,function($a,$b){return$a['c']-$b['c'];});$a=array_shift($t);$b=array_shift($t);$t[]=['c'=>$a['c']+$b['c'],'l'=>$a,'r'=>$b];}while(count($t)>1);$r=[];function t($t,&$r,$j=''){if(isset($t['v']))$r[$t['v']]=$j;else{t($t['l'],$r,$j.'1');t($t['r'],$r,$j.'0');}}t($t[0],$r);uksort($r,function($a,$b){return$a<$b;});$z=$_(array_reduce($i,function($a,$b)use($r){return $a.$r[$b];},""),8);$z[]=str_pad(array_pop($z),8,0);$s="";for($i=0;$i<256;$i++)$s.=isset($r[chr($i)])?$r[chr($i)].';':';';$w='<?php
$_=<<<\'EOT\'
'.array_reduce($z,function($a,$b){return $a.chr(bindec($b));},"").'
EOT;
$s=explode(";","'.$s.'");$r=[];foreach($s as$k=>$v)$v?$r[chr($k)]=$v:1;$o=implode("",array_map(function($a){return str_pad(decbin(ord($a)),8,0,0);},'.$_.'($_)));for(;$l=strlen($o);){for($i=1;$i<$l;$i++){$p=substr($o,0,$i);if(($k=array_search($p,$r,true))!==false){echo$k;$o=substr($o,$i);continue 2;}}die;}';
do{$x=md5(mt_rand());}while(strpos($q,$x));
$q="<?=<<<'x$x'
$q
x$x;
";echo strlen($w)<strlen($q)?$w:$q;

Uncompressed and commented version: (Note that it will surely spit out the unencoded version, because of the comment)

<?php
$q=file_get_contents('php://stdin'); // read input
$_='str_split'; // shorter name (saves some bytes in the end)
$i=$_($q); // convert input into array of chars
$c=$t=[];
$c=array_count_values($i); // calculate frequency table
foreach($c as$k=>$v){
    // build bottom of huffmann tree
    $t[]=[
        'c'=>$v, // count
        'v'=>$k // value (= the character)
    ];
}
do{
    usort($t,function($a,$b){return$a['c']-$b['c'];}); // sort elements of the tree-to-be
    $a=array_shift($t); // lowest count
    $b=array_shift($t); // 2nd lowest count

    // and merge them
    $t[]=[
        'c'=>$a['c']+$b['c'], // cumulated cost of this part of the tree
        'l'=>$a, // left
        'r'=>$b // right
    ];
}
while(count($t)>1);
$r=[];
// recursive tree traverser to build the bits of the encoding
function t($t,&$r,$j='') {
    if(isset($t['v'])) {
        // abort once we found a value
        $r[$t['v']]=$j;
    }
    else {
        // going left = 1
        t($t['l'],$r,$j.'1');
        // going right = 0
        t($t['r'],$r,$j.'0');
    }
}
t($t[0],$r);
// sort the bit-mapping by ascii value of the key
uksort($r,function($a,$b){return$a<$b;});

// encode the input by concatenating the bits and split the result into chunks of 8 bits
$z=$_(array_reduce($i,function($a,$b)use($r){return $a.$r[$b];},""),8);
// ensure the last part is 8 bit long
$z[]=str_pad(array_pop($z),8,0);

// encode the mapping into a semicolon separated string, unmapped characters are empty (that's why I sorted the mapping above)
$s="";
for($i=0;$i<256;$i++) {
    $s.=isset($r[chr($i)])?$r[chr($i)].';':';';
}

// build the decoder
$w='<?php
$_=<<<\'EOT\'
'.array_reduce($z,function($a,$b){return $a.chr(bindec($b));},"" /* // convert the 8-bit chunks into characters */).'
EOT;

// split the encoded map at the semicolons
$s=explode(";","'.$s.'");
$r=[];
foreach($s as$k=>$v) {
    $v?$r[chr($k)]=$v:1; // if there are some bits save them into the map
}
// convert the encoded characters back into a long string of 0 and 1
$o=implode("",array_map(function($a){return str_pad(decbin(ord($a)),8,0,0);},'.$_.'($_)));

for(;$l=strlen($o);){
    // search the whole string for an character in the map
    for($i=1;$i<$l;$i++){
        $p=substr($o,0,$i);
        if(($k=array_search($p,$r,true))!==false){
            // character found, spit it out
            echo$k;
            // and remove the bits
            $o=substr($o,$i);
            continue 2;
        }
    }
    // nothing found, we are finished, the remaining characters are a result of the padding to multiples of 8
    die;
}';
// search a NOWDOC delimiter that does not occur in the input
do{
    $x=md5(mt_rand());
}
while(strpos($q,$x));

// build a program that simply spits out the input unchanged
$q="<?=<<<'x$x'
$q
x$x;
";

// check which one is shorter :-)
echo strlen($w)<strlen($q)?$w:$q;

TimWolla

Posted 2014-01-12T20:24:17.730

Reputation: 1 878