Generate a self-extractor application

5

2

Create a program in any language, that takes an arbitrary binary input, and generates a valid C++11 program, which when compiled and run will print this binary input.

So you could do this (on linux with gcc, where program is the code you've written and file.dat is one of the test inputs):

$ cat file.dat | ./program > gen.cpp
$ g++ -std=c++11 -pedantic -static -O2 -o gen gen.cpp
$ ./gen > file.dat.new
$ diff file.dat file.dat.new
$ echo $?
0

The scoring depends on the byte count of the original code (O), and the square of the byte count of the generated codes for the following inputs:

  • A: The whole lyrics of Never gonna give you up
  • B: The uncompressed grayscale 32x32 BMP version of Lena Söderberg
  • C: The original source code (the generator, O)
  • D: The generated source code for the A input

Score is O + A^2 + B^2 + C^2 + D^2. The values for the generated variants are squared to encourage optimizing the size of the generated code (and the size of the original code is reflected in C as well).

Lowest score wins.

Clarification of the rules:

  • The generated code has to conform to the C++11 standard
  • You have to be able to compile, run and test all generated codes with at least one of the GCC, Clang or MSVC++ compilers (the example above is just an example for linux and gcc, but you should get the idea on how it should work).
  • The C++11 code can only use the standard C++11 libraries, so external libraries (like zlib) are not allowed.
  • The generated code has to be able to print the original data without any external help, like network connections or pipes, etc. (e.g. if you move the gen.exe to another, isolated computer it should be able to generate the output there as well)
  • The code generator should also work for any other input (generates code that regenerates the original input), but compression performance for them is not measured in the score (this means that optimizing the generator only for the test inputs are okay).
  • The input files are supplied in base 64 only because they contain binary data (the image at least). They need to be decoded separately, which is not part of the challenge

File inputs in base 64:

Never gonna give you up

V2UncmUgbm8gc3RyYW5nZXJzIHRvIGxvdmUNCllvdSBrbm93IHRoZSBydWxlcyBhbmQgc28gZG8g
SQ0KQSBmdWxsIGNvbW1pdG1lbnQncyB3aGF0IEknbSB0aGlua2luZyBvZg0KWW91IHdvdWxkbid0
IGdldCB0aGlzIGZyb20gYW55IG90aGVyIGd1eQ0KSSBqdXN0IHdhbm5hIHRlbGwgeW91IGhvdyBJ
J20gZmVlbGluZw0KR290dGEgbWFrZSB5b3UgdW5kZXJzdGFuZA0KDQpOZXZlciBnb25uYSBnaXZl
IHlvdSB1cA0KTmV2ZXIgZ29ubmEgbGV0IHlvdSBkb3duDQpOZXZlciBnb25uYSBydW4gYXJvdW5k
IGFuZCBkZXNlcnQgeW91DQpOZXZlciBnb25uYSBtYWtlIHlvdSBjcnkNCk5ldmVyIGdvbm5hIHNh
eSBnb29kYnllDQpOZXZlciBnb25uYSB0ZWxsIGEgbGllIGFuZCBodXJ0IHlvdQ0KDQpXZSd2ZSBr
bm93biBlYWNoIG90aGVyIGZvciBzbyBsb25nDQpZb3VyIGhlYXJ0J3MgYmVlbiBhY2hpbmcgYnV0
DQpZb3UncmUgdG9vIHNoeSB0byBzYXkgaXQNCkluc2lkZSB3ZSBib3RoIGtub3cgd2hhdCdzIGJl
ZW4gZ29pbmcgb24NCldlIGtub3cgdGhlIGdhbWUgYW5kIHdlJ3JlIGdvbm5hIHBsYXkgaXQNCkFu
ZCBpZiB5b3UgYXNrIG1lIGhvdyBJJ20gZmVlbGluZw0KRG9uJ3QgdGVsbCBtZSB5b3UncmUgdG9v
IGJsaW5kIHRvIHNlZQ0KDQpOZXZlciBnb25uYSBnaXZlIHlvdSB1cA0KTmV2ZXIgZ29ubmEgbGV0
IHlvdSBkb3duDQpOZXZlciBnb25uYSBydW4gYXJvdW5kIGFuZCBkZXNlcnQgeW91DQpOZXZlciBn
b25uYSBtYWtlIHlvdSBjcnkNCk5ldmVyIGdvbm5hIHNheSBnb29kYnllDQpOZXZlciBnb25uYSB0
ZWxsIGEgbGllIGFuZCBodXJ0IHlvdQ0KDQpOZXZlciBnb25uYSBnaXZlIHlvdSB1cA0KTmV2ZXIg
Z29ubmEgbGV0IHlvdSBkb3duDQpOZXZlciBnb25uYSBydW4gYXJvdW5kIGFuZCBkZXNlcnQgeW91
DQpOZXZlciBnb25uYSBtYWtlIHlvdSBjcnkNCk5ldmVyIGdvbm5hIHNheSBnb29kYnllDQpOZXZl
ciBnb25uYSB0ZWxsIGEgbGllIGFuZCBodXJ0IHlvdQ0KDQooT29oLCBnaXZlIHlvdSB1cCkNCihP
b2gsIGdpdmUgeW91IHVwKQ0KKE9vaCkNCk5ldmVyIGdvbm5hIGdpdmUsIG5ldmVyIGdvbm5hIGdp
dmUNCihHaXZlIHlvdSB1cCkNCihPb2gpDQpOZXZlciBnb25uYSBnaXZlLCBuZXZlciBnb25uYSBn
aXZlDQooR2l2ZSB5b3UgdXApDQoNCldlJ3ZlIGtub3cgZWFjaCBvdGhlciBmb3Igc28gbG9uZw0K
WW91ciBoZWFydCdzIGJlZW4gYWNoaW5nIGJ1dA0KWW91J3JlIHRvbyBzaHkgdG8gc2F5IGl0DQpJ
bnNpZGUgd2UgYm90aCBrbm93IHdoYXQncyBiZWVuIGdvaW5nIG9uDQpXZSBrbm93IHRoZSBnYW1l
IGFuZCB3ZSdyZSBnb25uYSBwbGF5IGl0DQoNCkkganVzdCB3YW5uYSB0ZWxsIHlvdSBob3cgSSdt
IGZlZWxpbmcNCkdvdHRhIG1ha2UgeW91IHVuZGVyc3RhbmQNCg0KTmV2ZXIgZ29ubmEgZ2l2ZSB5
b3UgdXANCk5ldmVyIGdvbm5hIGxldCB5b3UgZG93bg0KTmV2ZXIgZ29ubmEgcnVuIGFyb3VuZCBh
bmQgZGVzZXJ0IHlvdQ0KTmV2ZXIgZ29ubmEgbWFrZSB5b3UgY3J5DQpOZXZlciBnb25uYSBzYXkg
Z29vZGJ5ZQ0KTmV2ZXIgZ29ubmEgdGVsbCBhIGxpZSBhbmQgaHVydCB5b3UNCg0KTmV2ZXIgZ29u
bmEgZ2l2ZSB5b3UgdXANCk5ldmVyIGdvbm5hIGxldCB5b3UgZG93bg0KTmV2ZXIgZ29ubmEgcnVu
IGFyb3VuZCBhbmQgZGVzZXJ0IHlvdQ0KTmV2ZXIgZ29ubmEgbWFrZSB5b3UgY3J5DQpOZXZlciBn
b25uYSBzYXkgZ29vZGJ5ZQ0KTmV2ZXIgZ29ubmEgdGVsbCBhIGxpZSBhbmQgaHVydCB5b3UNCg0K
TmV2ZXIgZ29ubmEgZ2l2ZSB5b3UgdXANCk5ldmVyIGdvbm5hIGxldCB5b3UgZG93bg0KTmV2ZXIg
Z29ubmEgcnVuIGFyb3VuZCBhbmQgZGVzZXJ0IHlvdQ0KTmV2ZXIgZ29ubmEgbWFrZSB5b3UgY3J5
DQpOZXZlciBnb25uYSBzYXkgZ29vZGJ5ZQ0KTmV2ZXIgZ29ubmEgdGVsbCBhIGxpZSBhbmQgaHVy
dCB5b3U=

Lena

Qk02CAAAAAAAADYEAAAoAAAAIAAAACAAAAABAAgAAAAAAAAEAAB1CgAAdQoAAAABAAAAAAAAAAAA
AAEBAQACAgIAAwMDAAQEBAAFBQUABgYGAAcHBwAICAgACQkJAAoKCgALCwsADAwMAA0NDQAODg4A
Dw8PABAQEAAREREAEhISABMTEwAUFBQAFRUVABYWFgAXFxcAGBgYABkZGQAaGhoAGxsbABwcHAAd
HR0AHh4eAB8fHwAgICAAISEhACIiIgAjIyMAJCQkACUlJQAmJiYAJycnACgoKAApKSkAKioqACsr
KwAsLCwALS0tAC4uLgAvLy8AMDAwADExMQAyMjIAMzMzADQ0NAA1NTUANjY2ADc3NwA4ODgAOTk5
ADo6OgA7OzsAPDw8AD09PQA+Pj4APz8/AEBAQABBQUEAQkJCAENDQwBEREQARUVFAEZGRgBHR0cA
SEhIAElJSQBKSkoAS0tLAExMTABNTU0ATk5OAE9PTwBQUFAAUVFRAFJSUgBTU1MAVFRUAFVVVQBW
VlYAV1dXAFhYWABZWVkAWlpaAFtbWwBcXFwAXV1dAF5eXgBfX18AYGBgAGFhYQBiYmIAY2NjAGRk
ZABlZWUAZmZmAGdnZwBoaGgAaWlpAGpqagBra2sAbGxsAG1tbQBubm4Ab29vAHBwcABxcXEAcnJy
AHNzcwB0dHQAdXV1AHZ2dgB3d3cAeHh4AHl5eQB6enoAe3t7AHx8fAB9fX0Afn5+AH9/fwCAgIAA
gYGBAIKCggCDg4MAhISEAIWFhQCGhoYAh4eHAIiIiACJiYkAioqKAIuLiwCMjIwAjY2NAI6OjgCP
j48AkJCQAJGRkQCSkpIAk5OTAJSUlACVlZUAlpaWAJeXlwCYmJgAmZmZAJqamgCbm5sAnJycAJ2d
nQCenp4An5+fAKCgoAChoaEAoqKiAKOjowCkpKQApaWlAKampgCnp6cAqKioAKmpqQCqqqoAq6ur
AKysrACtra0Arq6uAK+vrwCwsLAAsbGxALKysgCzs7MAtLS0ALW1tQC2trYAt7e3ALi4uAC5ubkA
urq6ALu7uwC8vLwAvb29AL6+vgC/v78AwMDAAMHBwQDCwsIAw8PDAMTExADFxcUAxsbGAMfHxwDI
yMgAycnJAMrKygDLy8sAzMzMAM3NzQDOzs4Az8/PANDQ0ADR0dEA0tLSANPT0wDU1NQA1dXVANbW
1gDX19cA2NjYANnZ2QDa2toA29vbANzc3ADd3d0A3t7eAN/f3wDg4OAA4eHhAOLi4gDj4+MA5OTk
AOXl5QDm5uYA5+fnAOjo6ADp6ekA6urqAOvr6wDs7OwA7e3tAO7u7gDv7+8A8PDwAPHx8QDy8vIA
8/PzAPT09AD19fUA9vb2APf39wD4+PgA+fn5APr6+gD7+/sA/Pz8AP39/QD+/v4A////AEmnoKA4
OUVOT0tDSHCAhIiNkpOUnrDNtmJ3dWBUZFk/aJSbp0MwOkU9W0IpUGSCi4+QjJKiu9ieXliNaExV
ZUVZiZerTS45ND1jSyhDY2qNj4uNmazH2H9SS5i1WFZhTlB8l6tMMTk1QVtcOzRkaYSOiZKgt8/K
foCJj8eRWWBbXHaZqkcwRDM8a3RSOkxneYuMmKi/26B8lYeO2bNjZWJycpqsTi5CO0twc0ozPVVz
i5Cer8fcZG6WjIKt0nBhZn5TmqxaNUhNV3ZqPzU7RXKJjaLEzok/c5SEcXbQnktZOU+kpWI9QWBu
dU06MjU9ZIGUpJlbOk9whX1+e8bXi0UuY6qhXEk/aXJ4PTU0OkJjh5qmYShASW+dmZaFvdzWujZe
raRoVUNpYJV1MDZCYoOKfo2HPDhDaaGalYi02tXWQF6soWZdQlJgbJJRLkp6i4uIm6hlNEFjmZiT
i6rT0tVKV6igYV9MXV09U3xHRXyNmpeHqI09PmSGnJWOoczQ0klKpZ1lalRhQjdWWH5AeJWll3ur
oVU9YGahmJGRvszQT1GooF94WlNGPFw3e3tzobKjh7ejZERcSJqalYugxsheW6uhX29zWUlTWCo/
pI+IkZ+GsotURl81ip2Zlo2ftWBkqp1ZX3OTU0pMNi9rwGxMbYGyZz5GWy5woZqbmZCMY2uqm1hk
aoFyY0U5Nz2Rs4l2oMG5cUlRMVScop+ZlY5ka6qaWWVrdYZ/UjY3P0+bs6mvxtuHUUQ2OouhnJmZ
mWdtqZlVYWZwp317WD1FOkqYuKy+xmZwQjEvZp+ko6CcZ3OqllZhYX2zf36AZEQ7RVmku7i4doeP
QipGmaehnZtjdqyVVmJcmKl5f4SIdVhpUFCkxMPFpoiCPy58pZ+bml5yq5VXYl6YoHWCh46Ukpea
lp6yv8fPw590JlKcoJ2dYHCqlFhjYJOKan2Mk6Crrq61uLK2wsrQxbE2L3mjn5xhb6aQVmRli3hs
eYKYq7GyucG/yMemnr3W1U4pQ5WhnFlqpIxUYmh+e218goeluL3DycvZoo+BSKvQRjAuYZ6hV2yk
jFdjanZ5b3eFiIquw8nM1r1nlYczd5tZLTM0eKNpaKeLV2RpdXpzc3uIjqe8ydbJamSZgmKTko89
LTBCg5Rupo1VY2p0e313foaUr77GtH9ncJWNjJKPlI08LzE9pY+ijlplbXl+gIKCjZ6wsp18eXRy
jpeYk42S1nUpMy2ipaSMX2dsen6AgYKBgoWBfX+Ad3WTnp2bkLm7dVgwMZ2hp41haG17gYKDg4N/
f4ODgIB3dZyinpigzIJ2glsun5yokmJqbnyDhIWFhoSDhYOBgnSCnp2bmMKob3t9gWU=

SztupY

Posted 2014-01-04T02:26:24.000

Reputation: 3 639

It's a shame about the C++ requirement, because this is an interesting exercise but not worth learning C++ for. I have written a Java program which generates Golfscript programs for [tag:kolmogorov-complexity] questions. – Peter Taylor – 2014-01-04T08:38:02.623

Why the C++ requirement? Why can't it generate a source/script in any language that when compiled and run prints the source? – Sylwester – 2014-01-08T21:07:50.153

As the scoring depends much more on the length of the generated code as on the original one having codes generate terse results via golfscript or similar would probably skew the results much. I could have said you are allowed to use c++ and java and c and ..., but it's much cleaner to say that this is partially a C++ question. I chose C++11, because it is high level enough, terse enough, and it already has the raw string literal syntax (helping keep the byte count low when storing lots of binary data). I'll create a new, unbounded question once this is over though. – SztupY – 2014-01-08T21:17:52.347

Answers

2

C

The extremely repetitive song makes it possible to compress this rather heavily. This program analyses the structure of the input and picks one of two compressors (one for the song, one for the picture), or falls back to raw output. I manage to shrink both the song and the picture at half the input size!

Tested on Debian Jessie using gcc and g++.

Score: 1185 + 957^2 + 1223^2 + 1276^2 + 1123^2 = 5 302 068

#include<stdlib.h>
#include<string.h>
#include<stdio.h>
#define R ;return;
#define X(a) ;o(34);a;o(34);
#define Y(a,b) X(q(s+a,b))
#define V ;r(j-66)X(q(h,strlen(h)));o(10);
#define H w("#include<stdio.h>\nint main(){fwrite(\"");
#define F(a) ;w("\",1,");printf("%d",g-a);w(",stdout);}")R}
#define M char
M*s,*h;int g=0,c,t=0,i=0,j=65,z,f,b[5][2]={210,224,225,235,208,388,403,568,142,208},n[7]={0,1,3,2,1,0,4};o(M a){putchar(a);}e(int a){if(a==10){w("\\n")R}if(a==13){w("\\r")R}a==34||a==92?o(92):1;o(a);}q(M*a,int b){for(;b--;a++)e(*a);}w(M*a){printf(a);}d(){w("#define ");o(j++);o(32);}r(int a){t=b[a][1]-b[a][0];h=memcpy(calloc(1,t+1),s+b[a][0],t);}l(M*a,int b,int*c){while(*a){f=1;for(i=0;i<b&&f;i++){r(c[i]);z=strlen(h);if(!memcmp(a,h,z)){X(e(32);o(65+c[i]);e(32));a+=z;f=0;}free(h);}if(f){e(*a);a++;}}}main(){while((c=getchar())!=EOF){s=realloc(s,++g);s[g-1]=c;if(!c)t=g;}if(g==1943&&!t){d()V d()V d();r(2)X(l(h,2,n))o(10);d()V d()V H l(s,5,n+2)F(0)if(g==2102){unsigned M*a=s+53;for(;i<=255;i++,a+=4)for(t=1;t<4;t++)if(*(a+t)!=i||*a){H q(s,g)F(0)H q(s,53)X(w(",1,53,stdout);for(int i=0,j;i<=255;i++){for(j=0;j<4;j++)putchar(j?i:0);};fwrite("))q(s+1077,g-1077)F(1077)H q(s,g)F(0)

Before minimizing it (I also changed some identifiers during minimization, but at least you can see the algorithm):

#include<stdlib.h>
#include<string.h>
#include<stdio.h>
#define X(a) ;o('"');a;o('"');
#define Y(a,b) X(q(s+(a),(b)))
#define V(a) X(q(rr(a),strlen(rr(a))));o(10);
#define H ;w("#include<stdio.h>\nint main(){fwrite(\"");
#define F(a) ;w("\",1,");printf("%d",sz-a);w(",stdout);}");}

//Input memory handling
char*s;
int sz=0,c,t=0;
m(char a){
    s = realloc(s,++sz);
    s[sz-1] = a;
    if(a==0||a=='%')t=sz;
}

//Print 1 byte
o(char a){putchar(a);}

//Print 1 byte with C escaping
e(int a){
    if(a==10){w("\\n");return;}
    if(a==13){w("\\r");return;}
    if(a=='"'||a=='\\')o('\\');
    o(a);
}

//Print b bytes
p(char*a,int b){for(;b--;a++)o(*a);}

//Print b bytes with escaping
q(char*a,int b){for(;b--;a++)e(*a);}

//Print a NUL-terminated string
w(char*a){printf("%s",a);}

//Helper for #defines
d(char a){w("#define ");o(a);o(' ');}

//Read a predefined portion of rickroll text
char*rr(int a){
    int b[5][2]={{0xd2,0xe0},{0xd0,0x184},{0xe1,0xeb},{0x193,0x238},{0x8e,0xd0}};
    int sb=b[a][1]-b[a][0];
    return memcpy(calloc(1,sb+1),s+b[a][0],sb);
}

//Print a with substrings from rickroll substrings in c
//replaced with constants from g
rs(char*a,int b,int*c,char*g){
    int i,si,f;char*d;
    while(*a){
        f=0;
        for(i=0;i<b&&!f;i++){
            d=rr(c[i]);si=strlen(d);
            if(!memcmp(a,d,si)) {
                X(e(' ')&&e(g[i])&&e(' '));
                a=a+si;
                f=1;
            }
            free(d);
        }
        if(!f){
            e(*a);a++;
        }
    }
}

/*Rick Astley compression mode
  Defines constants for common substrings and uses them if and where they
  appear in the input text.
*/
r(){
    char b[1];
    d('A')V(0)
    d('B')V(2)
    int n[2]={0,2};
    d('C')X(rs(rr(1),2,n,"AB"))o(10);
    d('D')V(3)
    d('E')V(4)
    H
    int m[5]={3,1,2,0,4};
    rs(s,5,m,"DCBAE")
    F(0)

/*Lena compression mode
  Looks for color definintion table in the input text, and prints that using a
  for loop in the generated code. Fallback to raw mode if no color table.
 */
l() {
    int i=0,j;unsigned char*a=s+0x35;
    for(;i<=0xff;i++){
        for(j=1;j<4;j++)
            if(*(a+(i*4)+j)!=i||*(a+(i*4))){z();return;}
    }
    H
    q(s,0x35);
    // j?i:0 to output NULL byte first
    w("\",1,0x35,stdout);for(int i=0,j;i<=0xff;i++){for(j=0;j<4;j++)putchar(j?i:0);};fwrite(\"");
    q(s+0x435,sz-0x435)
    F(0x435)

//Raw mode
z(){
    H
    q(s,sz)
    F(0)

//Read entire input and pick compression method based on size
main() {
    while((c=getchar())!=EOF)m(c);
    if(sz==1943&&!t)r();
    else if(sz==2102)l();
    else z();
    return 0;
}

Emil Vikström

Posted 2014-01-04T02:26:24.000

Reputation: 149

1

Python 3.2 (36,737,736)

Just to get this started, a totally dumb-as-a-rock-no-compression implementation score.

import sys,base64
print("""#include <cstdio>
int main(){char*s="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";char t[256]={0};for(int i=0;*s;s++,i++)t[*s]=i;for(s="%s";*s;s+=4){char c[]={char(t[*s]<<2|t[s[1]]>>4&3),char(t[s[1]]<<4&240|t[s[2]]>>2&15),char(t[s[2]]<<6&192|t[s[3]]&63)};fwrite(c,3-(s[3]=='=')-(s[2]=='='),1,stdout);}}"""%(base64.b64encode(sys.stdin.buffer.read())).decode())

Basically it just base64 encodes the file and decodes it in C++. Tested with clang++ on MacOS Mavericks, current Xcode. It compiles with style warnings, but I don't see any rule against that :)

Score is 408 + 2914^2 + 3126^2 + 866^2 + 4210^2 = 36,737,736

Joachim Isaksson

Posted 2014-01-04T02:26:24.000

Reputation: 1 161