Kolmogorov Complexity Meta Golfer



  • Mike Bufardeci (Pyth) - 175 bytes

  • Leo (Retina) - 175 bytes

  • devRicher (Lua) - 182 bytes

  • Peter Taylor (CJam) - Waiting for clarification

  • Lyth (C++11) - Waiting for clarification

Edit: Several changes made, should not affect any already submitted answers.

Edit #2: Very few people are taking advantage of the 1024 byte limit. The point of this challenge is to output code golf answers, not things like the example at the bottom of this post.

Edit #3: Here is the paragraph to test against: WASHINGTON - These are chaotic and anxious days inside the National Security Council, the traditional center of management for a president's dealings with an uncertain world. I will test the submissions myself later.

At PPCG, we pride ourselves in writing programs to solve problems. However, it's very time consuming to write the programs to solve problems. Instead we should write programs to write programs to solve problems!

In this case, the problem is Kolmogorov Complexity. You must write a program that golfs a program, that outputs piece of text.

Main program -> Generated program -> Text

You do not necessarily need to golf the main program. Your score is the length of the generated program.

You will not know the final text when writing the main program. Therefore, you will not be able to write the generated program by hand. The main program should take a string input and create the generated program, which correctly outputs the text.

On February 12th, the first paragraph of the headline of the New York Times will be used for scoring. I will post the article at 10 AM ESThttps://www.nytimes.com/

  • You may use any language for the main program and generated program. They do not have to be the same.

  • The final text will never contain these characters: ", ', \n, \r.

  • The main program is limited to 1024 bytes (not characters). The generated program has no byte limit, as long as it outputs the text.

  • Compression algorithms built in to your language are not allowed (both ways), unless you code it from scratch. This includes any type of text compression libraries, and base conversion. This does not include Regexes.

  • Standard loopholes apply.

Example: (JavaScript)
f=(g)=>{console.log("console.log('" + g + "')")}

Julian Lachniet

Posted 2017-01-29T13:06:03.767

Reputation: 3 216

9Compression algorithms built in to your language are not allowed That's fuzzy. What qualifies as compression algorithm? Things like run-length decoding are probably borderline – Luis Mendo – 2017-01-29T13:47:42.807

1The code is expected to run in a reasonable amount of time. That's unenforceable. What's reasonable? – Dennis – 2017-01-29T13:51:18.333

2In order for this challenge to be objective you must lock in your choice of text right now, before revealing it, by posting a hash of it. Otherwise you have 100% free reign in choosing the winner by choosing a text that best matches your chosen answer. – orlp – 2017-01-29T14:10:34.827

So the main challenge is to compress an unknown string, and the score is determined by the resulting program? – devRicher – 2017-01-29T14:12:15.407

@Dennis, e.g. The code does not take hours to run. Mainly so it can't be a brute force a hash solution. – Julian Lachniet – 2017-01-29T14:14:29.713

@orlp I have chosen the text, but I don't think there is a fair way to prove I chose it in advance. If you have a suggestion, I would appreciate it. – Julian Lachniet – 2017-01-29T14:14:36.177

@devRicher Yes. – Julian Lachniet – 2017-01-29T14:14:51.860

So I do not know my score until you publish the string? – devRicher – 2017-01-29T14:15:35.737

@devRicher This is clearly stated in the challenge: "In two weeks, I will post what text I have chosen. " – Julian Lachniet – 2017-01-29T14:16:40.980

5@JulianLachniet Yes there is, if you publish the hash of the text you chose right now we know when you post the text later that it's the correct text that you chose now. – orlp – 2017-01-29T14:16:59.280

3@JulianLachniet If you want an even stronger guarantee, you can instead post a function that generates a text given a number, and state that the number used to generate the scoring text will be the hash of the headline of a big publicly available newspaper at a certain date. This way no one knows the text ahead of time, including you. – orlp – 2017-01-29T14:23:30.903

3Also, why is the main program restricted to 512 bytes? It makes any attempt of making your own compression algorithm (since pre-made ones are invalid) a lot harder in common languages. – devRicher – 2017-01-29T14:45:20.767

@orlp I will try that. – Julian Lachniet – 2017-01-29T15:34:00.393

@devRicher Because you could just paste in the code of a built in. Trough I agree that the 512 bytes should be more. – dzaima – 2017-01-29T15:34:05.657

@devRicher 512 might be too small. But, it is necessary to have done cutoff. I will change it to 1024 – Julian Lachniet – 2017-01-29T15:34:42.717

It seems that none of the people who have voted to put this question on hold have explained their reason to do so. To me, the only unclear restriction seems to be "The code is expected to run in a reasonable amount of time" - and I think it's debatable if this is too unclear. As such, I'm voting to reopen. – Leo – 2017-01-30T10:56:23.813

@Leo Let me explain then. What is considered a built in compression algorithm? A posted answer uses a built in to do base conversion on the string, does that count? – Mike Bufardeci – 2017-01-30T18:45:57.827

I really like this challenge and think there shouldn't be a winning criterion and that the objective should be to create one in every language. – Magic Octopus Urn – 2017-01-30T18:55:25.943

@carusocomputing One of the rules of this site is that all challenges need an objective winning criterion.

– Mike Bufardeci – 2017-01-30T19:00:12.137

@MikeBufardeci yes, I am familiar with the rules, but there are multiple posts that say the goal is not to find the shortest answer in a language, rather the shortest answer in all languages. These posts are usually on the more interesting side. – Magic Octopus Urn – 2017-01-30T19:01:27.340

@carusocomputing That's the nature of all code golf questions. The golfscript answers compete with the other golfscript answers, the java answers compete with the other java answers, etc. That wouldn't really work with this scoring criteria however. – Mike Bufardeci – 2017-01-30T19:07:54.897

@MikeBufardeci I'm talking more along the lines of [tag:popularity-contest]. – Magic Octopus Urn – 2017-01-30T19:11:26.103

Let us continue this discussion in chat.

– Mike Bufardeci – 2017-01-30T19:12:26.560

As @JulianLachniet stated in chat, using built in base conversion is not allowed.

– Mike Bufardeci – 2017-01-30T20:35:48.377

"paragraph of the headline"? How long will it approximately be? – smls – 2017-01-30T20:36:41.450

1@smls here is an example: WASHINGTON — As President Trump signed a sweeping executive order on Friday, shutting the borders to refugees and others from seven largely Muslim countries, the secretary of homeland security was on a White House conference call getting his first full briefing on the global shift in policy. – Julian Lachniet – 2017-01-30T20:46:32.497

Okay, wait, so hold the #@%! up. Can I at least use a built-in to DECOMPRESS? Like, can I generate code that uses a built in but the code to generate the code doesn't use the built-in? – Magic Octopus Urn – 2017-01-30T21:23:28.033

1@JulianLachniet Make sure to include any rule changes and clarifications in your challenge description. – mbomb007 – 2017-01-30T21:32:25.413


  • What are "Reflexes"? 2. How are you able to guarantee that an unwritten paragraph in an English-language newspaper "will never contain these characters: ", ', ..."?
  • < – Peter Taylor – 2017-01-31T08:20:40.083

    @PeterTaylor typo – Julian Lachniet – 2017-01-31T11:41:20.813

    When you say "Compression algorithms built in to your language are not allowed (both ways)", do you mean built-in compression and decompression are not allowed? – Mike Bufardeci – 2017-01-31T15:34:51.073

    Can the input contain backslashes? Or non-ASCII characters? – smls – 2017-01-31T17:19:51.790

    Also, is it okay if the main program takes a few minutes to run? – smls – 2017-01-31T17:42:26.837

    @MikeBufardeci Yes. – Julian Lachniet – 2017-01-31T21:35:50.997

    @smls Of course. – Julian Lachniet – 2017-01-31T21:36:00.277

    @JulianLachniet I am confused: you have two seemingly contradictory statements: You do not necessarily need to golf the main program. Your score is the length of the generated program. and The main program is limited to 1024 bytes (not characters). The generated program has no byte limit, as long as it outputs the text. What is the reason for the limit on the main program if we are scored on the generated program? – Moogie – 2017-02-02T05:42:56.607

    I ask because non-trival answers that perform a compression process via main program and decompression via generated program are penalized by having to effectively store both compression and decompression processes within the 1024 byte limit... but are scored only on the size of the generated program that contains only the decompression process – Moogie – 2017-02-02T05:51:49.847

    @Moogle The point of the byte limit is to prevent copying and pasting large compression algorithms. Without it, your code could be infinitely long. As for your second question, that is really the point of the challenge. – Julian Lachniet – 2017-02-02T11:08:44.713

    2@JulianLachniet Ok, If that is the intention then that is fine. I do, however think unless the input text that will be output by the generated file is significantly larger than one paragraph then the trivial answers will always win it is very unlikely that the ~100 or so bytes saved via a decent compression alg will not offset the cost of implementing the decompresor alg. But i will like to be proven incorrect :-) – Moogie – 2017-02-02T22:33:16.897

    Couldn't you just have a few long strings (e.g. ~10 paragraphs from NY times on the day of posting this challenge) as the testcases instead of this "I'll post the message in the future" stuff? – clismique – 2017-02-05T04:08:02.500

    @Qwerp-Derp If I posted the test cases, you would be able to bias your code to work better on the test cases. By not giving the text in advance, you have to try to objectively write a text compressor. – Julian Lachniet – 2017-02-05T13:16:56.823

    @JulianLachniet But given a big enough "pool" of test-cases, it's nearly impossible to optimise for the test case at all, simply because there's too much to optimise for. It doesn't necessarily have to be 10 paragraphs - a bigger sample size (e.g. maybe 50-100 paragraphs) is almost guaranteed to eliminate any bias. – clismique – 2017-02-06T08:51:57.430

    @Qwerp-Derp That is true, but this is easier than finding 100 paragraphs about different subjects. – Julian Lachniet – 2017-02-06T11:05:33.587

    @JulianLachniet It's already possible to bias the code to work better for this challenge than in general. Many assumptions can be made about next Sunday's NY times article. It will likely start with "WASHINGTON – " and contain the phrase "President Trump" at least once. Each of those words will probably appear again in the article. – Mike Bufardeci – 2017-02-06T19:14:50.327

    I'm a bit sad that the testing paragraph is so short :( Anyway, is that "Nationalional" a typo? @JulianLachniet – Leo – 2017-02-13T11:11:06.060

    "Paragraph"? That's just a sentence – user41805 – 2017-02-13T11:39:24.897

    @Kritixi_Lithos It's both. I would choose a different paragraph, but someone would accuse me of being biased by not following my own rules. – Julian Lachniet – 2017-02-13T11:55:32.530

    @PeterTaylor Your code doesn't seem to work for me. What environment are you using? – Julian Lachniet – 2017-02-13T21:33:36.213

    @Lyth I am unable to run the code on my computer. What environment are you using? – Julian Lachniet – 2017-02-13T21:34:17.130

    I'm voting to close this question because the secret string has now been revealed, meaning that one can no longer post answers to the original challenge of compressing an unknown string. – pppery – 2019-10-22T19:12:31.033



    Retina, worst case (3+original length) bytes

    EDIT: added support for newlines, in order to try this code in actual kolmogorov complexity challenges. Behaviour on strings without newlines remains unchanged.

    Retina is great at substituting substrings, so that's what we're going to do!

    This program looks for repeated sequences of characters and replaces them every time with a single character not appearing in the full string. The generated program then starts with the "compressed" string and applies all substitutions in reverse.

    This code adds 3 bytes of boilerplate (a newline at the beginning, and a \` towards the end to suppress a trailing newline), so in the worst case (when there are no repeated sequences worth substituting) we have a length of 3 bytes more than the starting string. On average, this will shave a few bytes from the original length, not much but still a compression.

    Here's the code with some comments:

    First, we use a trick with transliteration ranges to change newlines into , which is how newlines are printed in a Retina program


    then we start the main loop, duplicating the first line


    we only substitute when the sequence is repeated enough times (depending on its length)


    to find a character not in the string, we append a list of candidates to the string, using " as separator, and then remove duplicates

    $2" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#%&\'()*+,-./:;?@\^_`{|}~¶$2¶$1

    here the actual substitution in the text happens


    in case we couldn't find a repeated sequence or an unused character, we restore the string to the original state (which makes us break from the loop)


    at the end, we add the 3 bytes of boilerplate


    Try it online!

    In the linked example, the 292 bytes paragraph gets compressed to a 286 bytes Retina code.

    NOTE: the substitutions made are not optimal, we just make every time the first suitable substitution we find; I don't know if an algorithm finding the optimal substitutions could run in a reasonable amount of time.


    Posted 2017-01-29T13:06:03.767

    Reputation: 8 482


    Pyth, worst case length(input) + 1 bytes

    =kNJ++kzkV92Iq/zC+32N0 aYC+32N))V.:z7I>/JN2=J+++\::JNK.)Y+\\K++kNk))V.:z6I>/JN2=J+++\::JN=K.)Y+\\K++kNk))V.:z5I>/JN2=J+++\::JN=K.)Y+\\K++kNk))V.:z4I>/JN3=J+++\::JN=K.)Y+\\K++kNk))V.:z3I>/JN4=J+++\::JN=K.)Y+\\K++kNk))V.:z2I>/JN7=J+++\::JN=K.)Y+\\K++kNk))?qeJkPJJ

    Test Suite Available Online
    Generated Programs from Test Suite


    This program no longer abuses the fact that the article will be the first paragraph of a NY Times article and is now a general compressor of short strings. There are still some assumptions being made here, specifically that not all characters will appear in the text and that the text is at least seven characters long.

    The high level explanation is that the program searches thorough the text for progressively shorter repeated strings, then replaces each repeated with a character not present in the text. Each string length has its own break even point where making a substitution is actually shorter than leaving the text alone. For strings of length 7 only 2 occurrences are needed but for strings of length 3 the break even point is 4 occurrences.

    Main Program

    The main program mostly consists of six loops: one each for strings of length seven down through strings of length 2. Each loop goes over all substrings of the input with that length, counts how often they appear in the article and determines if a replacement should be made. If a replacement is made then the string is wrapped with a replacement method to obtain the original string.

    This breakdown is itself broken down into several parts for readability. This first part is setting up for the main loops.

    =kNJ++kzkV95Iq/zC+32N0 aYC+32N))
      N                               N is initialized to " (the double quote character)
    =kN                               assign " to k
           z                          z is initialized to the input
        ++kzk                         wrap z with double quotes
       J                              and assign that to J (J automatically assigns on first use)
              92     +32      +32     32-126 is the "printable" character range
              92     +32      +32     | causes issues so we stop before it (code point 124)
             V92                   )  for each printable character:
                  /zC+32N              count occurrences of it in the input
                Iq       0        )    if there are 0 occurrences:  
                           aYC+32N      append that character to list Y (Y is initialized to an empty list)

    The six main loops are nearly identical. I will breakdown a general loop for the sake of simplicity.

       z                                   z is the input
     .:z5                                  get each substring of a certain length (here 5)
    V                                   )  for each string:
           /JN                              count how many times it appears in the input
         I>   2                        )    if count > break even point: (here 2)
                           .)Y               pop a character off Y (unused characters list)
                         =K                  assign it to K (K automatically assigns the first time but must be manually assigned after that)
                      :JN K                  regex replacement:
                       J                      in the input
                        N                     replace current string
                          K                   with popped character
                   +\:                       prepend : (the colon character)
                  +           +\\K           append \ and popped character
                 +                ++kNk      append current string wrapped in double quotes
               =J                            and assign to variable J

    This last section only saves one byte on the generated program but also ensures that the worst case is as long as our trivial answer.

        k     k is still set to " (the double quote character)
       J  JJ  J is the string to be returned
    ?qeJk     if the last character of J is ":
         PJ    return J without the last character
           J   return J

    Generated Program

    If no substitutions are made the generated program will be in this format:


    This will have a length of length(input) + 1 bytes. This is the worst case, and I fear it may also be the average case. One paragraph does not have a lot of repeated strings, especially because professional writers try to vary their words.

    If one substitution is made then the generated program will be in this format:

    :"(input with replacements)"\(unused character)"(replacement string)

    In Pyth, this is the format for replacement with regular expressions. Instances of the unused character in the input are replaced with the replacement string.

    This adds a static 6 characters to the string, but in the worst case we are hitting the break even point mentioned earlier.

    Here is the format for a generated program with two replacements:

    ::"(input with replacements)"\(char 1)"(str 1)"\(char 2)"(str 2)

    As replacements stack up we simply add another colon to the front and add the substring and replacement character to the end.

    I have not determined what the size reduction is in the best case. If anyone can determine the size of the generated program in the best case please let me know.

    Mike Bufardeci

    Posted 2017-01-29T13:06:03.767

    Reputation: 1 680


    CJam, Burrows-Wheeler transform and run-length encoding

    I think it's stupid to ban builtins for base-conversion as "compression", but I've implemented my own, as well as implementing run-length encoding and decoding manually rather than using the builtin.

    e# Reduce range, add start-of-message marker
    e# Burrows-Wheeler transform
    e# RLE without e`, decrementing each runlength for compactness
    e# Pair [runlength-1 ch] to a single int
    e# Base-convert to the minimal possible base
    e# Base-convert to base 256
    e# Format as a string literal with as few special cases as possible
    e# Decoder base-converts to base 256, then base B, unpairs, run-length decodes, and
    e# inverse Burrows-Wheeler transforms
    • I is the increment which must be added to each decoded value
    • P is the base used to combine run-length with char
    • B is the base used to convert the whole thing

    Peter Taylor

    Posted 2017-01-29T13:06:03.767

    Reputation: 41 901


    CJam, range reduction and base conversion

    I think it's stupid to ban builtins for base-conversion as "compression", but I've implemented my own.

    e# Reduce range by offsetting towards 0 as far as possible
    e# Base-convert to the minimal possible base
    e# Base-convert to base 256
    e# Format as a string literal with as few special cases as possible
    e# Decoder base-converts to base 256 and then base B, adding the offset

    The text is too small to benefit greatly from techniques which require a lot of decoding, but if we can subtract 32 from each byte and base-convert then we get an asymptotic improvement of at least 2.4%, and with luck the input source will be pure ASCII and we get an asymptotic improvement of about 18.7%. With an overhead of about 26 bytes we stand a decent chance of getting compression.

    If the output is allowed to have a trailing NUL byte, one byte can be saved by removing that final ;.

    Peter Taylor

    Posted 2017-01-29T13:06:03.767

    Reputation: 41 901

    This is not a valid answer: Compression algorithms built in to your language are not allowed (both ways), unless you code it from scratch. This includes any type of text compression libraries, and base conversion. This does not include Regexes. – Julian Lachniet – 2017-02-02T21:29:24.680

    @JulianLachniet, it's a perfectly valid answer. "unless you code it from scratch". – Peter Taylor – 2017-02-02T22:20:54.900

    If you are coding it from scratch, that's fine. Your post does not make that clear. – Julian Lachniet – 2017-02-02T22:57:51.837

    @JulianLachniet, I thought my other answer made it pretty clear, but as you insist I've copied across the relevant introduction. – Peter Taylor – 2017-02-03T08:19:57.430


    Pyth, length(input) + 1 bytes


    Online interpreter available here.

    This will return a generated program of the form "(input). Put another way, the main program prepends a double quote to the input text.

    Main Program Explanation

    +      Concatenate
     N     " (the variable N defaults to a double quote mark)
      z    and the input

    Generated Program Explanation

    "                start of a string
     (input)         the input text 
            (eof)    strings are closed automatically at the end of file

    This is the shortest of the "input length plus a constant" answers. I'm posting this mostly to set a bar for the non-trivial answers. It's unclear if a non-trivial answer can exist within the rules as they are currently worded, but I am setting this bar nonetheless.

    Mike Bufardeci

    Posted 2017-01-29T13:06:03.767

    Reputation: 1 680


    Submissions should attempt to be competitive. There is already a trivial answer, and to me, there shouldn't be any. "Try to optimize your score" http://meta.codegolf.stackexchange.com/a/7076/34718

    – mbomb007 – 2017-01-30T19:26:20.567

    Let us continue this discussion in chat.

    – mbomb007 – 2017-01-30T22:04:41.050


    C++11, arithmetic coding, overhead >= 382 bytes

    Compression algorithm is shamelessly copied from Simple byte-aligned binary arithmetic coder by Fabian 'ryg' Giesen

    #include <iostream>
    using std::cout;
    void pu(char c){c=='\r'?cout.write(")OMG\" \"\\r\" R\"OMG(",17):cout.put(c);}
    int main() {
        uint32_t lo{0},hi{~0u},x,i,prob{256};
        char ch;
        cout<<R"!*(#include <iostream>
    void operator ""_p(const char* q,size_t z){uint32_t d{0},l{0},h{~0u},p{256},x,t;size_t r{0},i;for(i=4;i-->0;)d=(d<<8)|q[r++];while(r<=z){char y{0};for(i=8;i-->0;){x=l+((uint64_t(h-l)*p)>>9);t=(d<=x);t?(h=x):(l=x+1);while((l^h)<(1u<<24)){d=(d<<8)|q[r++];l<<=8;h=(h<<8)|0xff;}p+=t?((512-p)>>5):-(p>>5);y+=y+t;}std::cout.put(y);}}int main(){R"OMG()!*";
        while (std::cin.get(ch)) {
            for (size_t b=8; b-->0;) {
                int bit{ch&(1<<b)};
                x = lo+((uint64_t(hi-lo)*prob)>>9);
                bit ? (hi=x) : (lo=x+1);
                while ((lo^hi)<(1u<<24)) {pu(lo>>24);lo<<=8;hi=(hi<<8)|0xff;}
                prob += bit?((512-prob)>>5):-(prob>>5);
        for (i=4;i-->0;) {pu(lo>>24);lo<<=8;}
        cout<<")OMG\"_p;return 0;}";
        return 0;


    1. Main program reads the input and uses BinShiftModel to generate an encoded bitstream.
    2. The bitstream is then included into a C++11 user-defined raw literal in the output program.
    3. The literal is being processed by the user-defined function, operator ""_p(const char* string, size_t size), printing out the decoded stream.

    Some extra care was necessary due to all newline values being converted to \n.


    The code does not look for different compression paths (there were, but some optimal value was chosen instead).

    It is possible to improve the compression ratio, either by limiting input to 7 bits (did not do that because "em-dash" in the sample became corrupted) or by storing MSB from all input bytes first, which will likely be roughly same for 1/4 of the text.

    Either way, algorithm overhead is huge on small texts.


    g++ -std=c++11 -o golf golf.cpp && ./golf < sample.txt > degolf.cpp && g++ -std=c++11 -o degolf degolf.cpp && ./degolf


    Posted 2017-01-29T13:06:03.767

    Reputation: 781

    As implemented this is actually an adaptives binary arithmetic coder because it doesn't track the partially coded byte (you would need 255 different probs to achieve byte order). This usally doesn't work very well at all for text (probably 1-2%). A byte oriented coder would give you around 1/3 reduction rate for english text. – Christoph – 2017-02-13T07:41:47.190

    @Christoph you're right. I do not expect byte-oriented coder to fit into 1024 bytes with decoder, though. – Lyth – 2017-02-13T10:54:46.250

    There are only a few minor changes which probably pay of quite fast: prob would have to be an array of 256 elements and used as prob[ctx] (problem maybe to init it all to 256 but changing math to use prob[ctx] as prob[ctx]+256 and init to 0 should do the trick). ctx would be set to 1 at the start of each byte and gets shifted in the coded bit ctx = (ctx << 1) |bit; after each bit. – Christoph – 2017-02-13T11:35:51.080




    Try it here.

    Will always be string length + 7 bytes long.


    using System;
    namespace H {
        class K {
            static string P;
            static void Main(string text)
                P = text;

    Works if the program itself is added as a dependency where you run the resulting output program. If you would to run the resulting code in a OutputText.exe, you will have to add OutputCodeThatOutputsText.exe as a dependency.

    Resulting code is always ()=>{Console.WriteLine(H.K.P)}.


    Posted 2017-01-29T13:06:03.767

    Reputation: 1 609