301

Here is a 1.2Mb ASCII text file containing the text of Herman Melville's Moby-Dick; or, The Whale. Your task is to write a program or function (or class, etc. -- see below) which will be given this file one character at a time, and at each step must guess the next character.

This is code-challenge. Your score will be

2*L + E

where L is the size of your submission in bytes, and E is the number of characters it guesses incorrectly. The lowest score wins.

Further particulars

Your submission will be a program or function (etc.) that will be called or invoked or sent data multiple times. (1215235 times to be exact.) When it is called for the n^th time it will be given the n^th character of whale.txt or whale2.txt and it must output its guess for the (n+1)^th character. The E component of its score will be the total number of characters that it guesses incorrectly.

Most submissions will need to store some state in between invocations, so that they can track how many times they have been called and what the previous inputs were. You can do this by writing to an external file, by using static or global variables, by submitting a class rather than a function, using a state monad, or whatever else works for your language. Your submission must include any code required to initialise its state before the first invocation.

Your program should run deterministically, so that it always makes the same guesses given the same input (and hence always gets the same score).

Your answer must include not only your submission, but also the code you used to calculate the E part of its score. This need not be written in the same language as your submission, and will not be counted towards its byte count. You are encouraged to make it readable.

Regarding the interface between your submission and this score-calculating program, anything is fine, as long as your program always gives one byte of output before receiving its next byte of input. (So, for example, you can't just pass it a string containing all of the input and get a string back containing all of the output.)

You must actually run your test program and calculate/verify your score before submitting your entry. If your submission runs too slowly for you to verify its score then it is not qualified to compete, even if you know what its score would be in principle.

The L component of your score will be calculated according to the usual rules for code golf challenges. If your submission will contain multiple files, please take note of the rules on scoring and directory structure in that case. Any data that your code uses must be included in your L score.

You may import existing libraries but may not load any other external files, and your code may not access the whale.txt or whale2.txt file in any way other than described above. You may not load any pre-trained neural networks or other sources of statistical data. (It's fine to use neural networks, but you have to include the weight data in your submission and count it towards your byte count.) If for some reason your language or libraries include a feature that provides some or all of the text of Moby Dick, you may not use that feature. Aside from that you can use any other built-in or library features that you like, including ones relating to text processing, prediction or compression, as long as they're part of your language or its standard libraries. For more exotic, specialised routines that include sources of statistical data, you would have to implement them yourself and include them in your byte count.

It is likely that some submissions will include components that are themselves generated by code. If this is the case, please include in your answer the code that was used to produce them, and explain how it works. (As long as this code is not needed to run your submission it will not be included in your byte count.)

For historical reasons, there are two versions of the file, and you may use either of them in an answer. In whale2.txt (linked above) the text is not wrapped, so newlines appear only at the end of paragraphs. In the original whale.txt the text is wrapped to a width of 74 characters, so you have to predict the end of each line as well as predicting the text. This makes the challenge more fiddly, so whale2.txt is recommended for new answers. Both files are the same size, 1215236 bytes.

To summarise, all answers should include the following things:

Your submission itself. (The code, plus any data files it uses - these can be links if they're large.)
An explanation of how your code works. Please explain the I/O method as well as how it predicts the next character. The explanation of your algorithm is important, and good explanations will earn bounties from me.
The code you used to evaluate your score. (If this is identical to a previous answer you can just link to it.)
Any code you used to generate your submission, along with an explanation of that code. This includes code that you used to optimise parameters, generate data files etc. (This doesn't count towards your byte count but should be included in your answer.)

Leaderboard

var QUESTION_ID=152856,OVERRIDE_USER=21034;function answersUrl(e){return"https://api.stackexchange.com/2.2/questions/"+QUESTION_ID+"/answers?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+ANSWER_FILTER}function commentUrl(e,s){return"https://api.stackexchange.com/2.2/answers/"+s.join(";")+"/comments?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+COMMENT_FILTER}function getAnswers(){jQuery.ajax({url:answersUrl(answer_page++),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){answers.push.apply(answers,e.items),answers_hash=[],answer_ids=[],e.items.forEach(function(e){e.comments=[];var s=+e.share_link.match(/\d+/);answer_ids.push(s),answers_hash[s]=e}),e.has_more||(more_answers=!1),comment_page=1,getComments()}})}function getComments(){jQuery.ajax({url:commentUrl(comment_page++,answer_ids),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){e.items.forEach(function(e){e.owner.user_id===OVERRIDE_USER&&answers_hash[e.post_id].comments.push(e)}),e.has_more?getComments():more_answers?getAnswers():process()}})}function getAuthorName(e){return e.owner.display_name}function process(){var e=[];answers.forEach(function(s){var r=s.body;s.comments.forEach(function(e){OVERRIDE_REG.test(e.body)&&(r="<h1>"+e.body.replace(OVERRIDE_REG,"")+"</h1>")});var a=r.match(SCORE_REG);a&&e.push({user:getAuthorName(s),size:+a[2],language:a[1],link:s.share_link})}),e.sort(function(e,s){var r=e.size,a=s.size;return r-a});var s={},r=1,a=null,n=1;e.forEach(function(e){e.size!=a&&(n=r),a=e.size,++r;var t=jQuery("#answer-template").html();t=t.replace("{{PLACE}}",n+".").replace("{{NAME}}",e.user).replace("{{LANGUAGE}}",e.language).replace("{{SIZE}}",e.size).replace("{{LINK}}",e.link),t=jQuery(t),jQuery("#answers").append(t);var o=e.language;/<a/.test(o)&&(o=jQuery(o).text()),s[o]=s[o]||{lang:e.language,user:e.user,size:e.size,link:e.link}});var t=[];for(var o in s)s.hasOwnProperty(o)&&t.push(s[o]);t.sort(function(e,s){return e.lang>s.lang?1:e.lang<s.lang?-1:0});for(var c=0;c<t.length;++c){var i=jQuery("#language-template").html(),o=t[c];i=i.replace("{{LANGUAGE}}",o.lang).replace("{{NAME}}",o.user).replace("{{SIZE}}",o.size).replace("{{LINK}}",o.link),i=jQuery(i),jQuery("#languages").append(i)}}var ANSWER_FILTER="!t)IWYnsLAZle2tQ3KqrVveCRJfxcRLe",COMMENT_FILTER="!)Q2B_A2kjfAiU78X(md6BoYk",answers=[],answers_hash,answer_ids,answer_page=1,more_answers=!0,comment_page;getAnswers();var SCORE_REG=/<h\d>\s*([^\n,]*[^\s,]),.*?(\d+)(?=[^\n\d<>]*(?:<(?:s>[^\n<>]*<\/s>|[^\n<>]+>)[^\n\d<>]*)*<\/h\d>)/,OVERRIDE_REG=/^Override\s*header:\s*/i;

body{text-align:left!important}#answer-list,#language-list{padding:10px;width:380px;float:left}table thead{font-weight:700}table td{padding:5px}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/codegolf/all.css?v=83c949450c8b"> <div id="answer-list"> <h2>Leaderboard</h2> <table class="answer-list"> <thead> <tr><td></td><td>Author</td><td>Language</td><td>Score</td></tr></thead> <tbody id="answers"> </tbody> </table> </div><div id="language-list"> <h2>Winners by Language</h2> <table class="language-list"> <thead> <tr><td>Language</td><td>User</td><td>Score</td></tr></thead> <tbody id="languages"> </tbody> </table> </div><table style="display: none"> <tbody id="answer-template"> <tr><td>{{PLACE}}</td><td>{{NAME}}</td><td>{{LANGUAGE}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table> <table style="display: none"> <tbody id="language-template"> <tr><td>{{LANGUAGE}}</td><td>{{NAME}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table>

Bounties

From time to time I'll offer bounties to encourage different approaches.

The first one, 50 points, was awarded to A. Rex for the best-scoring answer at the time.

The second, 100 points, was also awarded to A. Rex, for the same answer, because they added a very good explanation to their existing answer.

The next bounty, 200 points, will be awarded to either

A competitive answer that uses a new technique. (This will be based on my subjective judgment since it's my rep that goes into the bounty, but you can trust me to be fair. Note that your answer needs to contain sufficient explanation for me to understand how it works!) Such an answer needn't take the top score, it just needs to do reasonably well compared to existing answers. I'm particularly keen to see solutions based on recurrent neural networks, but I'll award the bounty to anything that seems different enough from the Markov models that dominate the current top scores.

Or:

Anyone else who beats A. Rex's top score (currently 444444), using any method.

Once the 200 point bounty is claimed I will most likely offer a 400 point one, updating the requirements accordingly.

Nathaniel

Posted 2018-01-09T09:25:31.847

Reputation: 6 641

Comments are not for extended discussion; this conversation has been moved to chat.

– Dennis – 2018-01-30T02:29:47.787

9https://xkcd.com/1960/ appears to be a reference to this challenge! – A. Rex – 2018-02-26T13:49:55.810

I thought of compressing this... but it is a bit too long that my computer crashed shrug – Naruyoko – 2019-09-25T00:55:10.507

Answers

140

///, 2*1 + 1020874 = 1020876

Prints a space.

daniero

Posted 2018-01-09T09:25:31.847

Reputation: 17 193

Comments are not for extended discussion; this conversation has been moved to chat.

– Dennis – 2018-01-30T02:26:04.857

That is some extremely smart reward hacking! You must be an AGI ;) – Alex – 2018-04-01T07:20:45.493

Node.js, 2*224 + 524279 = 524727

Please refer to the change log at the end of this post for score updates.

A function taking and returning a byte.

a=[...l='14210100'],m={},s={},b={}
f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

It consists of a simple PPM model that looks at the last 8 characters to predict the next one.

We trust a pattern of length L when we have encountered it at least T[L] times, where T is an array of arbitrary thresholds: [1,1,2,1,2,3,5,2]. Furthermore, we always trust a pattern whose first character matches [A-Z '"(].

We select the longest trusted pattern and return the prediction with the highest score associated to this pattern at the time of the call.

Notes

This is obviously not optimized for speed, but it runs in about 15 seconds on my laptop.
If we were allowed to repeat the process several times in a row without resetting the model, the number of errors would converge to ~268000 after 5 iterations.

The current success rate of the prediction function is ~56.8%. As noticed by @immibis in the comments, if bad and correct guesses are mixed together, the result is not even barely readable.

For instance, this snippet near the end of the book:

Here be it said, that this pertinacious pursuit of one particular whale,[LF]
continued through day into night, and through night into day, is a thing[LF]
by no means unprecedented in the South sea fishery.

becomes:

"e e be it said, that thes woacangtyous sarsuet of tie oort cular thale[LF][LF]
 orsinued toeough tir on e togh   and sheough toght an o ters af t shin[LF][LF]
be to means insrocedented tn hhe sputh Sevsaonh ry,

By replacing bad guesses with underscores, we have a better idea of what the function got right:

_e_e be it said, that th_s _____n___ous __rsu_t of __e __rt_cular _hale_[LF]
_o__inued t__ough ___ _n__ __gh__ and _h_ough __ght _n_o ____ __ _ _hin_[LF]
b_ _o means _n_r_cedented _n _he __uth _e_____h_ry_

_{NB: The above example was created with a previous version of the code, working on the first version of the input file.}

Test code

/**
  The prediction function f() and its variables.
*/
a=[...l='14210100'],m={},s={},b={}
f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

/**
  A closure containing the test code and computing E.
  It takes f as input.
  (f can't see any of the variables defined in this scope.)
*/
;
(f => {
  const fs = require('fs');

  let data = fs.readFileSync('whale2.txt'),
      len = data.length,
      err = 0;

  console.time('ElapsedTime');

  data.forEach((c, i) => {
    i % 100000 || console.log((i * 100 / len).toFixed(1) + '%');

    if(i < len - 1 && f(c) != data[i + 1]) {
      err++;
    }
  })

  console.log('E = ' + err);
  console.timeEnd('ElapsedTime');
})(f)

Change log

524727 - saved 19644 points by switching to whale2.txt (challenge update)
544371 - saved 327 points by forcing patterns starting with a capital letter, a quote, a double-quote or an opening parenthesis to be also always trusted
544698 - saved 2119 points by forcing patterns starting with a space to be always trusted
546817 - saved 47 points by adjusting the thresholds and golfing the prediction function
546864 - saved 1496 points by extending the maximum pattern length to 8 characters
548360 - saved 6239 points by introducing the notion of trusted patterns, with thresholds depending on their length
554599 - saved 1030 points by improving the line feed prediction
555629 - saved 22 points by golfing the prediction function
555651 - saved 40 points by golfing the prediction function
555691 - initial score

Arnauld

Posted 2018-01-09T09:25:31.847

Reputation: 111 334

47For the curious, no, this does not produce anything like Moby Dick. It's a whole lot of sidg tlanses,oeth to, shuld hottut tild aoersors Ch, th! Sa, yr! Sheu arinning whales aut ihe e sl he traaty of rrsf tg homn Bho dla tiasot a shab sor ty, af etoors tnd hocket sh bts ait mtubb tiddin tis aeewnrs, dnhost maundy cnd sner aiwt d boelh cheugh -aaieiyns aasiyns taaeiins! th, tla. It does manage to get a few complete words sometimes. Like whales. – user253751 – 2018-01-10T04:52:55.240

25@immibis The title of the challenge was chosen wisely. This is Moby Dick, approximately. :-) – Arnauld – 2018-01-10T06:57:33.510

It's good to include your old score, crossed out using <strike> tags, so we can see that you've improved it. – Nathaniel – 2018-01-10T09:13:04.150

3@Nathaniel There's been many updates, so that would be barely readable and not really informative. I've added a change log instead with short explanations about the improvements. – Arnauld – 2018-01-10T09:45:41.663

@Arnauld thanks, that's helpful – Nathaniel – 2018-01-10T09:50:39.307

47I think your program is actually doing a perfect translation into Gaelic. – Beska – 2018-01-10T20:20:07.447

The comment of the produced output includes a comma followed by a character other than a space. Would the score improve if you assumed that most punctuation (apart from the apostrophe) will always be followed by a space? – Draco18s no longer trusts SE – 2018-01-10T22:58:31.940

1@Draco18s It's hard to tell whether this comma was a good or a bad guess. If it was a bad guess, the prediction function may have legitimately tried to put a letter after whatever-other-letter-that-was-actually-there-instead-of-the-comma once it received it. – Arnauld – 2018-01-10T23:57:48.613

@Arnauld That's a good point. I was just curious what the result might be. – Draco18s no longer trusts SE – 2018-01-11T00:08:25.823

@Draco18s I've just tried it (for the comma only) and it resulted in a loss of 92 points. Commas are also quite often followed by line feeds, quotes or double-quotes, so it's probably better to just let the model decide. – Arnauld – 2018-01-11T00:16:56.470

Ah, in which case I'd say "some form of white space." But oh well. (Followed by quotes...oh, at the end of someone speaking. Ok, white space and more punctuation! :D) – Draco18s no longer trusts SE – 2018-01-11T00:20:36.813

1@immibis - considering the output, maybe save this code for the inevitable 'call of cthulu' repeat. – Scott – 2018-01-11T01:19:19.280

This answer Is currently the most upvoted but not the best score. Still a cool answer of course. – Zach Boyd – 2018-01-11T04:04:07.010

I've updated the question with a version of the file where the text is not wrapped - you can probably save some points by using that – Nathaniel – 2018-01-11T09:36:13.060

@Nathaniel Thanks for the heads up. Updated. – Arnauld – 2018-01-11T11:50:50.400

@immibis Stop this! You're awaking the old ones! – allo – 2018-01-11T13:30:13.830

Nitpicking, but shouldn't there be an underscore at the "By replacing bad guesses with underscores, we have a better idea of what the function got right:" part for the second [LF]s? – Kevin Cruijssen – 2018-01-24T13:46:12.627

1@KevinCruijssen The second [LF] is a correct guess. The first one is a bad guess ([LF] instead of g and comma, respectively). – Arnauld – 2018-01-24T14:42:42.547

1Nice solution! Could you save a few bytes (4?) by storing aliasing 'slice'? – Craig Ayre – 2018-01-30T14:49:37.420

1@Beska, if you mean Irish then no, no it's not! :p – Shaggy – 2018-02-07T11:00:06.937

Full result: https://gist.github.com/v1993/b94bf26868d6e21e613ea01cca43e2cf. Best viewed in Google Translate from "Detect language"!

– val says Reinstate Monica – 2019-07-05T17:40:46.837

Perl, 2·70525 + 326508 = 467558

Predictor

$m=($u=1<<32)-1;open B,B;@e=unpack"C*",join"",<B>;$e=2903392593;sub u{int($_[0]+($_[1]-$_[0])*pop)}sub o{$m&(pop()<<8)+pop}sub g{($h,%m,@b,$s,$E)=@_;if($d eq$h){($l,$u)=(u($l,$u,$L),u($l,$u,$U));$u=o(256,$u-1),$l=o($l),$e=o(shift@e,$e)until($l^($u-1))>>24}$M{"@c"}{$h}++-++$C{"@c"}-pop@c for@p=($h,@c=@p);@p=@p[0..19]if@p>20;@c=@p;for(@p,$L=0){$c="@c";last if" "ne pop@c and@c<2 and$E>99;$m{$_}+=$M{$c}{$_}/$C{$c}for sort keys%{$M{$c}};$E+=$C{$c}}$s>5.393*$m{$_}or($s+=$m{$_},push@b,$_)for sort{$m{$b}<=>$m{$a}}sort keys%m;$e>=u($l,$u,$U=$L+$m{$_}/$s)?$L=$U:return$d=$_ for sort@b}

To run this program, you need this file here, which must be named B. (You can change this filename in the second instance of the character B above.) See below for how to generate this file.

The program uses a combination of Markov models essentially as in this answer by user2699, but with a few small modifications. This produces a distribution for the next character. We use information theory to decide whether to accept an error or spend bits of storage in B encoding hints (and if so, how). We use arithmetic coding to optimally store fractional bits from the model.

The program is 582 bytes long (including an unnecessary final newline) and the binary file B is 69942 bytes long, so under the rules for scoring multiple files, we score L as 582 + 69942 + 1 = 70525.

The program almost certainly requires a 64-bit (little-endian?) architecture. It takes approximately 2.5 minutes to run on an m5.large instance on Amazon EC2.

Test code

# Golfed submission
require "submission.pl";

use strict; use warnings; use autodie;

# Scoring length of multiple files adds 1 penalty
my $length = (-s "submission.pl") + (-s "B") + 1;

# Read input
open my $IN, "<", "whale2.txt";
my $input = do { local $/; <$IN> };

# Run test harness
my $errors = 0;
for my $i ( 0 .. length($input)-2 ) {
    my $current = substr $input, $i, 1;
    my $decoded = g( $current );

    my $correct = substr $input, $i+1, 1;
    my $error_here = 0 + ($correct ne $decoded);
    $errors += $error_here;
}

# Output score
my $score = 2 * $length + $errors;
print <<EOF;
length $length
errors $errors
score  $score
EOF

The test harness assumes the submission is in the file submission.pl, but this can easily be changed in the second line.

Text comparison

"And did none of ye see it before?" cried Ahab, hailing the perched men all around him.\\"I saw him almost that same instant, sir, that Captain 
"And wid note of te fee bt seaore   cried Ahab, aasling the turshed aen inl atound him. \"' daw him wsoost thot some instant, wer, that Saptain 
"And _id no_e of _e _ee _t _e_ore__ cried Ahab, _a_ling the __r_hed _en __l a_ound him._\"_ _aw him ___ost th_t s_me instant, __r, that _aptain 

Ahab did, and I cried out," said Tashtego.\\"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I 
Ahab aid  ind I woued tut,  said tashtego, \"No, the same instant, tot the same -tow nhe woubloon ws mane. alte ieserved the seubloon ior te, I 
Ahab _id_ _nd I ___ed _ut,_ said _ashtego__\"No_ the same instant_ _ot the same_-_o_ _he _oubloon _s m_ne_ __te _eserved the __ubloon _or _e_ I 

only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!" he cr
gnly  towe of ye sould have tersed the shite Whale aisst  Ihere ihe blows! -there she blows! -there she blows! Ahere arains -mhere again!  ce cr
_nly_ _o_e of ye _ould have ___sed the _hite Whale _i_st_ _here _he blows!_-there she blows!_-there she blows! _here a_ain__-_here again!_ _e cr

This sample (chosen in another answer) occurs rather late in the text, so the model is quite developed by this point. Remember that the model is augmented by 70 kilobytes of "hints" that directly help it guess the characters; it is not driven simply by the short snippet of code above.

Generating hints

The following program accepts the exact submission code above (on standard input) and generates the exact B file above (on standard output):

@S=split"",join"",<>;eval join"",@S[0..15,64..122],'open W,"whale2.txt";($n,@W)=split"",join"",<W>;for$X(0..@W){($h,$n,%m,@b,$s,$E)=($n,$W[$X]);',@S[256..338],'U=0)',@S[343..522],'for(sort@b){$U=($L=$U)+$m{$_}/$s;if($_ eq$n)',@S[160..195],'X<128||print(pack C,$l>>24),',@S[195..217,235..255],'}}'

It takes approximately as long to run as the submission, as it performs similar computations.

Explanation

In this section, we will attempt to describe what this solution does in sufficient detail that you could "try it at home" yourself. The main technique that differentiates this answer from the other ones is a few sections down as the "rewind" mechanism, but before we get there, we need to set up the basics.

Model

The basic ingredient of the solution is a language model. For our purposes, a model is something that takes some amount of English text and returns a probability distribution on the next character. When we use the model, the English text will be some (correct) prefix of Moby Dick. Please note that the desired output is a distribution, and not just a single guess for the most likely character.

In our case, we essentially use the model in this answer by user2699. We didn't use the model from the highest-scoring answer (other than our own) by Anders Kaseorg precisely because we were unable to extract a distribution rather than a single best guess. In theory, that answer computes a weighted geometric mean, but we got somewhat poor results when we interpreted that too literally. We "stole" a model from another answer because our "secret sauce" isn't the model but rather the overall approach. If someone has a "better" model, then they should be able to get better results using the rest of our techniques.

As a remark, most compression methods such as Lempel-Ziv can be seen as being a "language model" in this way, though one might have to squint a little bit. (It's particularly tricky for something that does a Burrows-Wheeler transform!) Also, note that the model by user2699 is a modification of a Markov model; essentially nothing else is competitive for this challenge or perhaps even modeling text in general.

Overall architecture

For the purposes of understanding, it's nice to break up the overall architecture into several pieces. From the highest-level perspective, there needs to be a little bit of state management code. This isn't particularly interesting, but for completeness we want to stress that at every point the program is asked for the next guess, it has available to it a correct prefix of Moby Dick. We do not use our past incorrect guesses in any way. For efficiency's sake, the language model can probably reuse its state from the first N characters to compute its state for the first (N+1) characters, but in principle, it could recompute things from scratch every time it is invoked.

Let's set this basic "driver" of the program aside and peek inside the part that guesses the next character. It helps conceptually to separate three parts: the language model (discussed above), a "hints" file, and an "interpreter". At each step, the interpreter will ask the language model for a distribution for the next character and possibly read some information from the hints file. Then it will combine these parts into a guess. Precisely what information is in the hints file as well as how it is used will be explained later, but for now it helps to keep these parts separate mentally. Note that implementation-wise, the hints file is literally a separate (binary) file but it could have been a string or something stored inside the program. As an approximation, we pretend that the model and interpreter are pretty short, so we can approximate that as "L ≈ 0".

If one is using a standard compression method such as bzip2 as in this answer, the "hints" file corresponds to the compressed file. The "interpreter" corresponds to the decompressor, while the "language model" is a bit implicit (as mentioned above).

Why use a hint file?

Let's pick a simple example to analyze further. Suppose that the text is N characters long and well-approximated by a model wherein every character is (independently) the letter E with probability slightly less than a half, T similarly with probability slightly less than a half, and A with probability 1/1000 = 0.1%. Let's assume no other characters are possible; in any case, the A is pretty similar to the case of a previously unseen character out of the blue.

If we operated in the L 0 regime (as most, but not all, of the other answers to this question do), there's no better strategy for the interpreter than pick one of E and T. On average, it will get about half of the characters correct. So E ≈ N/2 and the score ≈ N/2 also. However, if we use a compression strategy, then we can compress to a little more than one bit per character. Because L is counted in bytes, we get L ≈ N/8 and thus score ≈ N/4, twice as good as the previous strategy.

Achieving this rate of a little more than one bit per character for this model is slightly nontrivial, but one method is arithmetic coding.

Arithmetic coding

As is commonly-known, an encoding is a way of representing some data using bits/bytes. For example, ASCII is a 7 bit/character encoding of English text and related characters, and it is the encoding of the original Moby Dick file under consideration. If some letters are more common than others, then a fixed-width encoding like ASCII is not optimal. In such a situation, many people reach for Huffman coding. This is optimal if you want a fixed (prefix-free) code with an integer number of bits per character.

However, arithmetic coding is even better. Roughly speaking, it is able to use "fractional" bits to encode information. There are many guides to arithmetic coding available online. We'll skip the details here (especially of the practical implementation, which can be a little tricky from a programming perspective) because of the other resources available online, but if someone complains, maybe this section can be fleshed out more.

If one has text actually generated by a known language model, then arithmetic coding provides an essentially-optimal encoding of text from that model. In some sense, this "solves" the compression problem for that model. (Thus in practice, the main issue is that the model isn't known, and some models are better than others at modeling human text.) If it was not allowed to make errors in this contest, then in the language of the previous section, one way to produce a solution to this challenge would have been to use an arithmetic encoder to generate a "hints" file from the language model and then use an arithmetic decoder as the "interpreter".

In this essentially-optimal encoding, we end up spending -log_2(p) bits for a character with probability p, and the overall bit-rate of the encoding is the Shannon entropy. This means that a character with probability near 1/2 takes about one bit to encode, while one with probability 1/1000 takes about 10 bits (because 2^10 is roughly 1000).

But the scoring metric for this challenge was well-chosen to avoid compression as the optimal strategy. We'll have to figure out some way to make some errors as a tradeoff for getting a shorter hints file. For example, one strategy one might try is a simple branching strategy: we generally try to use arithmetic encoding when we can, but if the probability distribution from the model is "bad" in some way we just guess the most likely character and don't try encoding it.

Why make errors?

Let's analyze the example from before to motivate why we might want to make errors "intentionally". If we use arithmetic coding to encode the correct character, we will spend roughly one bit in the case of an E or T, but about ten bits in the case of an A.

Overall, this is a pretty good encoding, spending a little over a bit per character even though there are three possibilities; basically, the A is fairly unlikely and we don't end up spending its corresponding ten bits too often. However, wouldn't it be nice if we could just make an error instead in the case of an A? After all, the metric for the problem considers 1 byte = 8 bits of length to be equivalent to 2 errors; thus it seems like one should prefer an error instead of spending more than 8/2 = 4 bits on a character. Spending more than a byte to save one error definitely sounds suboptimal!

The "rewind" mechanism

This section describes the main clever aspect of this solution, which is a way to handle incorrect guesses at no cost in length.

For the simple example we've been analyzing, the rewind mechanism is particularly straightforward. The interpreter reads one bit from the hints file. If it is a 0, it guesses E. If it is a 1, it guesses T. The next time it is called, it sees what the correct character is. If the hint file is set up well, we can ensure that in the case of an E or T, the interpreter guesses correctly. But what about A? The idea of the rewind mechanism is to simply not code A at all. More precisely, if the interpreter later learns that the correct character was an A, it metaphorically "rewinds the tape": it returns the bit it read previously. The bit it read does intend to code E or T, but not now; it will be used later. In this simple example, this basically means that it keeps guessing the same character (E or T) until it gets it right; then it reads another bit and keeps going.

The encoding for this hints file is very simple: turn all of the Es into 0 bits and Ts into 1 bits, all while ignoring As entirely. By the analysis at the end of the previous section, this scheme makes some errors but reduces the score overall by not encoding any of the As. As a smaller effect, it actually saves on the length of the hints file as well, because we end up using exactly one bit for each E and T, instead of slightly more than a bit.

A little theorem

How do we decide when to make an error? Suppose our model gives us a probability distribution P for the next character. We will separate the possible characters into two classes: coded and not coded. If the correct character is not coded, then we will end up using the "rewind" mechanism to accept an error at no cost in length. If the correct character is coded, then we will use some other distribution Q to encode it using arithmetic coding.

But what distribution Q should we choose? It's not too hard to see that the coded characters should all have higher probability (in P) than the not coded characters. Also, the distribution Q should only include the coded characters; after all, we're not coding the other ones, so we shouldn't be "spending" entropy on them. It's a little trickier to see that the probability distribution Q should be proportional to P on the coded characters. Putting these observations together means that we should code the most-likely characters but possibly not the less-likely characters, and that Q is simply P rescaled on the coded characters.

It furthermore turns out that there's a cool theorem regarding which "cutoff" one should pick for the coding characters: you should code a character as long as it is at least 1/5.393 as likely as the other coded characters combined. This "explains" the appearance of the seemingly random constant 5.393 nearer the end of the program above. The number 1/5.393 ≈ 0.18542 is the solution to the equation -p log(16) - p log p + (1+p) log(1+p) = 0.

Perhaps it's a reasonable idea to write out this procedure in code. This snippet is in C++:

// Assume the model is computed elsewhere.
unordered_map<char, double> model;

// Transform p to q
unordered_map<char, double> code;
priority_queue<pair<double,char>> pq;
for( char c : CHARS )
    pq.push( make_pair(model[c], c) );
double s = 0, p;
while( 1 ) {
    char c = pq.top().second;
    pq.pop();
    p = model[c];
    if( s > 5.393*p )
        break;
    code[c] = p;
    s += p;
}
for( auto& kv : code ) {
    char c = kv.first;
    code[c] /= s;
}

Putting it all together

The previous section is unfortunately a little technical, but if we put all of the other pieces together, the structure is as follows. Whenever the program is asked to predict the next character after a given correct character:

Add the correct character to the known correct prefix of Moby Dick.
Update the (Markov) model of the text.
The secret sauce: If the previous guess was incorrect, rewind the state of the arithmetic decoder to its state before the previous guess!
Ask the Markov model to predict a probability distribution P for the next character.
Transform P to Q using the subroutine from the previous section.
Ask the arithmetic decoder to decode a character from the remainder of the hints file, according to the distribution Q.
Guess the resulting character.

The encoding of the hints file operates similarly. In that case, the program knows what the correct next character is. If it is a character that should be coded, then of course one should use the arithmetic encoder on it; but if it is a not coded character, it just doesn't update the state of the arithmetic encoder.

If you understand the information-theoretic background like probability distributions, entropy, compression, and arithmetic coding but tried and failed to understand this post (except why the theorem is true), let us know and we can try to clear things up. Thanks for reading!

A. Rex

Posted 2018-01-09T09:25:31.847

Reputation: 1 979

8Wow, impressive answer. I assume there is additional code needed to generate the B file? If so, please can you include that in your answer? – Nathaniel – 2018-01-15T00:21:04.327

8Excellent! The first (and so far the only) answer to break the 500k score barrier. – ShreevatsaR – 2018-01-16T03:27:36.517

6"tersed the shite whale" omg I'm crying – Phill – 2018-01-26T05:09:06.510

5Since no new answers were posted during the bounty period, I'm awarding it to your answer, as both the best scoring and the most sophisticated approach. If you ever have time, I would really appreciate a more in depth explanation of how this answer works, i.e. what exactly is the algorithm? – Nathaniel – 2018-02-08T09:26:03.643

2@Nathaniel: I added an explanation to this post. Let me know if you think it's detailed enough to reproduce the solution yourself. – A. Rex – 2019-03-16T03:05:53.107

I awarded the bounty for the explanation, with apologies for the slow time scale. I also replied to you in the chat room you created. – Nathaniel – 2019-09-26T14:55:11.607

I'd be interested in hearing more details about that theorem some time. Is it something in the literature, or is it something you proved yourself? Does it depend on properties of the input text? Where does that log(16) come from? (Is it related to the constant 2 in the scoring scheme?) – Nathaniel – 2019-09-26T14:55:44.407

@Nathaniel: Thank you very much for the second bounty. We (my collaborator and me) proved the theorem ourselves. I don't know of anything similar in the literature. Our attempt to write it up instead led to the discovery that this approach is suboptimal anyway and we should do something else: https://codegolf.stackexchange.com/a/182145/49643 Neither approach has been written up in full yet, very sadly.

– A. Rex – 2019-09-26T18:55:56.157

The log(16) does come from the scoring scheme. It's a little confusing because twice the length of the program in bytes is being added to the number of errors. As mentioned in the writeup, a naive calculation would thus suggest that "one should prefer an error instead of spending more than 8/2 = 4 bits on a character". Thus "16" is simply the number of possibilities that fits into these "4 bits" per error tradeoff. If you take all of the logs in base 2, then the equation has the same meaning but of course log(16) is just 4 exactly. – A. Rex – 2019-09-26T19:02:02.953

Do let me know if you do end up writing something up from this - it'd be really nice to see! Information theory is one of my research interests. – Nathaniel – 2019-09-26T19:45:00.210

Python 3, 2·267 + 510193 = 510727

Predictor

def p():
 d={};s=b''
 while 1:
  p={0:1};r=range(len(s)+1)
  for i in r:
   for c,n in d.setdefault(s[:i],{}).items():p[c]=p.get(c,1)*n**b'\1\6\f\36AcWuvY_v`\270~\333~'[i]
  c=yield max(sorted(p),key=p.get)
  for i in r:e=d[s[:i]];e[c]=e.get(c,1)+1
  s=b'%c'%c+s[:15]

This uses a weighted Bayesian combination of the order 0, …, 16 Markov models, with weights [1, 6, 12, 30, 65, 99, 87, 117, 118, 89, 95, 118, 96, 184, 126, 219, 126].

The result is not very sensitive to the selection of these weights, but I optimized them because I could, using the same late acceptance hill-climbing algorithm that I used in my answer to “Put together a Senate majority”, where each candidate mutation is just a ±1 increment to a single weight.

Test code

with open('whale2.txt', 'rb') as f:
    g = p()
    wrong = 0
    a = next(g)
    for b in f.read():
        wrong += a != b
        a = g.send(b)
    print(wrong)

Anders Kaseorg

Posted 2018-01-09T09:25:31.847

Reputation: 29 242

2The right tool for the job. Great score. Nice one. – agtoever – 2018-01-10T22:38:19.917

1Possible clarification: b"\0\3\6\r\34'&-20'\22!P\n[\26" is the ascii representation of the weights, where small non-printable values are escaped in octal. – Cœur – 2018-01-11T05:35:41.180

I've updated the question with a version of the file where the text is not wrapped - you might try re-running your code on that (it might do slightly better) – Nathaniel – 2018-01-11T09:41:35.087

I tried to run it on TIO, but it doesn't work. Do you know what went wrong? – user202729 – 2018-01-11T14:04:54.617

@user202729 You need sys.stdin.buffer. – Dennis – 2018-01-11T14:06:43.150

@Nathaniel Yeah, it does better on whale2.txt by over 22000. – Anders Kaseorg – 2018-01-11T17:53:59.277

By the way, how are the weights chosen? – Nathaniel – 2018-01-11T22:55:17.597

@Nathaniel I tried a few different things before settling on the same late acceptance hill-climbing algorithm that I used in my answer to “Put together a Senate majority”, where each candidate mutation is just a ±1 increment to a single weight.

– Anders Kaseorg – 2018-01-11T23:00:42.363

Thanks for the pointer to late acceptance hill-climbing (LAHC), which is an interesting technique that I didn't know about until now. What text did you use to optimise the weights, given that you were unable to use whale.txt or whale2.txt by the conditions of the puzzle? – Simon – 2018-01-12T01:01:50.193

@Simon There is no such condition (the weights are fully included in the byte count of the predictor). – Anders Kaseorg – 2018-01-12T01:53:41.680

Thanks for that explanation - if you could edit a synopsis into the question it would be great. (The experience with my previous challenge Paint Starry Night is that these optimisation procedures are the most interesting part of the answers, so it's much better if answers include the code used to do that, and an explanation of it. I included a rule in both challenges saying that they should.)

– Nathaniel – 2018-01-12T04:44:03.503

Must s really be bytes, can't it be str (saving the b prefixes)? Or a tuple, initialized with s=() and updated with s=c,*s[:15]. – Stefan Pochmann – 2018-01-16T22:22:05.043

@StefanPochmann bytes is a memory efficiency hack—this already uses 3 GB of memory, and while it would only be 5% worse with str or 30% worse with tuple, I’m happy at this stage to spend a handful of bytes from the sixth significant digit of my score to make the code a bit less painful to run. – Anders Kaseorg – 2018-01-18T00:58:59.887

I don't speak Python but if I understood correctly then you do a linear mixing of the models which basically calculates a weighted arithmetic average of the models. I suggest switching to a weighted geometric mean. That's the main difference between the first PAQ versions and the current ones and greatly improved prediction for several reasons. See section "Adaptive model weighting" vs "Neural-network mixing". Neural network mixing is actually a single layer neural network with softmax activation trained on cross entropy simplified to binary.

– Christoph – 2018-01-18T08:11:24.047

1@Christoph My model combination actually is a weighted geometric mean. But PAQ’s averaging in the logistic domain is slightly different—I’ll have to see if that’s better. – Anders Kaseorg – 2018-01-18T08:31:56.077

Python 3, 2279+592920=593478 2250 + 592467 = 592967 2 * 271 + 592084 = 592626 2 * 278 + 592059 = 592615 2 * 285 + 586660 = 587230 2 * 320 + 585161 = 585801 2 * 339 + 585050 = 585728

d=m={}
s=1
w,v='',0
def f(c):
 global w,m,v,s,d
 if w not in m:m[w]={}
 u=m[w];u[c]=c in u and 1+u[c]or 1;v+=1;q=n=' ';w=w*s+c;s=c!=n
 if w in m:_,n=max((m[w][k],k)for k in m[w])
 elif s-1:n=d in'nedtfo'and't'or'a'
 elif'-'==c:n=c
 elif"'"==c:n='s'
 elif'/'<c<':':n='.'
 if v>4*(n!=q)+66:n='\n'
 if s:d=c
 if c<q:w=w[:-1]+q;v=s=0
 return n

Try it online!

A function using global variables. Learns as it goes, building a model at the word level: given what it's seen so far in this word, what's the most common next character? As more input comes in it learns common words from the text pretty well, and it also learns the most common character to start the next word.

For example:

If what it's seen so far is 'Captai' it predicts an "n"
If it's "Captain" it predicts a space
If it's the start of a word, and the last word was "Captain" it predicts an 'A'
If the word so far is 'A' it predicts an 'h' (and then 'a' and 'b'; similarly for 'C').

It doesn't do very well at the start, but by the end there are large parts of actual words coming out. The fallback option is a space, and after a single space it's an "a", unless the preceding letter was one of "nedtfo", a digit, or a hyphen or apostrophe. It also aggressively predicts line breaks after 71 characters, or if a space was expected after 66. Both of those were just tuned to the data ("t" is far more common after a space, but has more often already been predicted, so "a" is a better guess outside those six special cases).

Learning what word pairs went together and preseeding the mapping turned out not to be worthwhile.

It ends up with text like this:

nl tneund his    I woi tis tnlost ahet toie tn tant  wod, ihet taptain Ahab ses
 snd t
oeed Sft   aoid thshtego    Io, fhe soie tn tant  tot the soie      ahe sewbtoon
swn tagd  aoths eatmved fhe sewbtoon wor ta  I sfey  aote of totsonld nive betse
d ahe
hate Whale iorst  Ihe e ioi beaos! -there soi beaos! -there soi beaos!

which corresponds to this part of the input:

all around him.

"I saw him almost that same instant, sir, that Captain Ahab did, and I cried out," said Tashtego.

"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows!

You can see where proper nouns in particular come out pretty well, but the ends of words are mostly right as well. When it's seen "dou" it expects "doubt", but once the "l" appears it gets "doubloon".

If you run it a second time with the same model it just built it immediately gets another 92k correct (51.7%->59.3%), but it's always just under 60% from the second iteration on.

The measuring code is in the TIO link, or here's a slightly better version:

total = 0
right = 0
with open('whale.txt') as fp:
    with open('guess.txt', 'w') as dest:
        for l in fp.readlines():
            for c in l:
                last = c
                if p == c: right += 1
                n = f(c)
                p = n
                total += 1
                dest.write(n)
                if total % 10000 == 0:
                    print('{} / {} E={}\r'.format(right, total, total-right), end='')
print('{} / {}: E={}'.format(right, total, total - right))

guess.txt has the guessed output at the end.

Michael Homer

Posted 2018-01-09T09:25:31.847

Reputation: 721

3This is an excellent approach! – Skyler – 2018-01-10T15:18:52.923

2too much <s></s> ;) – FantaC – 2018-01-14T04:11:56.157

1+1 because this approach reminded me of the LZW compression algorithm. – Marcos – 2018-01-17T07:17:09.883

C++, score: 2*132 + 865821 = 866085

Thanks to @Quentin for saving 217 bytes!

int f(int c){return c-10?"t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e"[c-32]:10;}

A very simple solution that, given a character, just outputs the character that most frequently appears after the input character.

Verify the score with:

#include <iostream>
#include <fstream>

int f(int c);

int main()
{
    std::ifstream file;
    file.open("whale2.txt");

    if (!file.is_open())
        return 1;

    char p_ch, ch;
    file >> std::noskipws >> p_ch;
    int incorrect = 0;
    while (file >> std::noskipws >> ch)
    {
        if (f(p_ch) != ch)
            ++incorrect;
        p_ch = ch;
    }

    file.close();

    std::cout << incorrect;
}

Edit: Using whale2.txt gives a better score.

Steadybox

Posted 2018-01-09T09:25:31.847

Reputation: 15 798

5You can translate this array into a string literal and inline it directly in place of L to save a bunch of characters :) – Quentin – 2018-01-09T17:43:38.193

@Quentin Thanks! Now I'm wondering why I didn't think of that in the first place... – Steadybox – 2018-01-10T01:58:43.930

-10 score with c= instead of return. – S.S. Anne – 2020-02-03T22:58:56.080

Python, 2*516 + 521122 = 522154

Algorithm:

Yet another python submission, this algorithm calculates the most likely next letter looking at sequences of length 1,...,l. The sum of probabilities is used, and there are a few tricks to get better results.

from collections import Counter as C, defaultdict as D
R,l=range,10
s,n='',[D(C) for _ in R(l+1)]
def A(c):
 global s;s+=c;
 if len(s)<=l:return ' '
 P=D(lambda:0)
 for L in R(1,l+1):
  w=''.join(s[-L-1:-1]);n[L][w].update([c]);w=''.join(s[-L:])
  try:
   q,z=n[L][w].most_common(1)[0];x=sum(list(n[L][w].values()))
  except IndexError:continue
  p=z/x
  if x<3:p*=1/(3-x)
  P[q]+=p
 if not P:return ' '
 return max(P.items(),key=lambda i:i[1])[0]
import this, codecs as d
[A(c) for c in d.decode(this.s, 'rot-13')]

Results:

Mostly gibberish, although you can see it picks up on the occasional phrase, such as "Father Mapple".

errors: 521122
TRAINING:
result:  tetlsnowleof the won -opes  aIther Mapple,woneltnsinkeap hsd   lnd the  thth a shoey,aeidorsbine ao
actual: ntal knobs of the man-ropes, Father Mapple cast a look upwards, and then with a truly sailor-like bu
FINAL:
result: mnd wnd round  ahe   ind tveryaonsracting th ards the sol ens-ike aeock tolblescn the sgis of thet t
actual: und and round, then, and ever contracting towards the button-like black bubble at the axis of that s

Test code:

Pretty simple, outputs some examples of the text at different points. Uses whale2.txt, as this avoids some extra logic to calculate newlines.

from minified import A

def score(predict, text):
    errors = 0
    newtext = []
    for i, (actual, current) in  enumerate(zip(text[1:], text[:-1])):
        next = predict(current)
        errors += (actual != next)
        newtext.append(next)
        if (i % (len(text) // 100) == 0):
            print ('.', end='', flush=True)
    return errors, ''.join(newtext)

t = open('whale2.txt')
text = t.read()
err2, text2 = score(A, text)
print('errors:', err2)
print("TRAINING:")
print(text2[100000:100100].replace('\n', '\\n'))
print(text1[100001:100101].replace('\n', '\\n'))
print("FINAL:")
print(text2[121400:1215500].replace('\n', '\\n'))
print(text[121401:1215501].replace('\n', '\\n'))

user2699

Posted 2018-01-09T09:25:31.847

Reputation: 538

3Welcome to the site! This is a fantastic first submission. :) – James – 2018-01-11T19:01:35.567

@DJMcMayhem, Thanks for the welcome. I've enjoyed watching for awhile now, this is the first contest to catch my attention for an entry. – user2699 – 2018-01-11T20:01:57.883

C (gcc), 679787 652892

84 76 bytes, ~~679619~~ 652740 incorrect guesses

p[128][128][128][128];a,b,c,d;g(h){p[a][b][c][d]=h;h=p[a=b][b=c][c=d][d=h];}

Try it online!

Update: ~27000 points off with updated file, 16 points (8 bytes) with a better-golfed function.

Explanation

The way this works is that as the code runs through the text, it memorizes the last character that terminated any given 4-character sequence, and returns that value. Somewhat similar to Arnauld's approach above, but relies on the inherent likelihood of two given 4-character sequences terminating the same way.

De-golfed:

p[128][128][128][128];
a,b,c,d;
g(h){
    p[a][b][c][d]=h; // Memorize the last character.
    h=p[a=b][b=c][c=d][d=h]; // Read the guess. We save several
                             // bytes with the assignments inside indices.
}

user77406

Posted 2018-01-09T09:25:31.847

Reputation:

... TIO link is useless. So the function returns the value of the last assignment? – user202729 – 2018-01-10T08:33:17.753

Let me edit the answer with an explanation, then :) – None – 2018-01-10T08:49:41.057

1@Rogem I've added a de-golfed version (which I did because I couldn't follow it either) - hopefully this doesn't bother you but please roll back if desired. – Adam Davis – 2018-01-10T15:49:19.677

@AdamDavis on most implementations of C, all global variables start at zero. It's undefined behavior, so it's only used in code-golf. – NieDzejkob – 2018-01-10T16:02:03.883

1@NieDzejkob Ah, you're right, thanks! "ANSI-C requires that all uninitialized static/global variables have to be initialized with 0." – Adam Davis – 2018-01-10T16:03:04.953

@AdamDavis woah, it's not UB? TIL! – NieDzejkob – 2018-01-10T16:10:14.017

@AdamDavis It's beautiful. – None – 2018-01-10T18:32:52.843

sh+bzip2, 2*364106 = 728212

~~2*381249 + 0 = 762498~~

dd if=$0 bs=1 skip=49|bunzip2&exec cat>/dev/null

followed by the bzip2-compressed whale2.txt with the first byte missing

Ignores its input; outputs the correct answer. This provides a baseline on one end; daniero provides a baseline on the other end.

Builder script:

#!/bin/sh
if [ $# -ne 3 ]
then
    echo "Usage $0 gen.sh datafile output.sh"
    exit 1
fi

cat $1 > $3
dd ibs=1 if=$2 skip=1 | bzip2 -9 >> $3
chmod +x $3

I/O test harness (tcc; cut off first line for gcc). This test harness can be used by anybody on a suitable platform that submits a complete program that expects read/write I/O. It uses byte-at-a-time I/O to avoid cheating. Child program must flush output after every byte to avoid blocking.

#!/usr/bin/tcc -run
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
    volatile int result;
    int readfd[2];
    int writefd[2];
    int cppid;
    int bytecount;
    char c1, c2, c3;
    if (argc != 2) {
        printf("write X approximately -- service host\n");
        printf("Usage: %s serviceprocessbinary < source.txt\n", argv[0]);
        return 1;
    }
    /* Start service process */
    if (pipe(readfd)) {
        perror("pipe()");
        return 3;
    }
    if (pipe(writefd)) {
        perror("pipe()");
        return 3;
    }
    result = 0;
    if (!(cppid = vfork())) {
        char *argtable[3];
        argtable[0] = argv[1];
        argtable[1] = NULL;
        dup2(readfd[0], 0);
        dup2(writefd[1], 1);
        close(readfd[1]);
        close(writefd[0]);
        close(readfd[0]);
        close(writefd[1]);
        execvp(argv[1], argtable);
        if (errno == ENOEXEC) {
            argtable[0] = "/bin/sh";
            argtable[1] = argv[1];
            argtable[2] = NULL;
            /* old standard -- what isn't an executable
             * can be exec'd as a /bin/sh script */
            execvp("/bin/sh", argtable);
            result = ENOEXEC;
        } else {
            result = errno;
        }
        _exit(3);
    } else if (cppid < 0) {
        perror("vfork()");
        return 3;
    }
    if (result) {
        errno = result;
        perror("execvp()");
        return 3;
    }
    close(readfd[0]);
    close(writefd[1]);
    /* check results */
    read(0, &c2, 1);
    bytecount = 1;
    errno = 0;
    while (read(0, &c1, 1) > 0) {
        write(readfd[1], &c2, 1);
        if (read(writefd[0], &c3, 1) <= 0) {
            printf("%d errors (%d bytes)\n", result, bytecount);
            if (errno == 0)
                fprintf(stderr, "pipe: unexpected EOF\n");
            else
                perror("pipe");
            return 3;
        }
        if (c3 != c1)
            ++result;
        c2 = c1;
        ++bytecount;
    }
    printf("%d errors (%d bytes)\n", result, bytecount);
    return 0;
}

Joshua

Posted 2018-01-09T09:25:31.847

Reputation: 3 043

Can you explain how the I/O works in this submission? – Nathaniel – 2018-01-09T22:43:09.763

@Nathaniel: You start the thing and send bytes to it. Or don't bother because you already see the answer in front of you. – Joshua – 2018-01-09T23:12:11.600

Well ok but how is this achieved by your code? I'm only asking because I'm curious. – Nathaniel – 2018-01-10T00:17:02.637

@Nathaniel: dd opens the file given by $0 and skips the first 37 characters; and the rest is sent to bunzip2. – Joshua – 2018-01-10T03:02:04.993

6I think what he's asking is: how does this not violate the but may not load any other external files, and your code may not access the whale.txt file in any way other than described above. clause? – None – 2018-01-10T08:33:26.750

8@Rogem The compressed data is put after what is shown here, and the code access itself. – user202729 – 2018-01-10T08:33:48.513

skip=37 should probably be skip=39 (the length of the code plus the newline after it). What's the purpose of &exec cat? – axiac – 2018-01-11T10:06:55.483

4The question says: "Your submission will be a program or function (etc.) that will be called or invoked multiple times. When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character." -- How is this requirement accomplished? The code displays the whole text of whale.txt every time it is executed. – axiac – 2018-01-11T10:09:46.700

1@axiac "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input." – user202729 – 2018-01-11T14:11:58.563

@axiac I guess to ignore the latter part. – user202729 – 2018-01-11T14:12:55.817

1I don't understand @user202729's answer to axiac's question. If it always gives the complete text, how do you know which bytes of the text to ignore? If we can choose to ignore or not-ignore output bytes at will we might as well have a program that just outputs "abcdefghijklmnopqrstuvwxyz" regardless of input. – Vincent – 2018-01-12T13:01:04.840

@Vincent ... ignore what? I don't understand your question. No, literally "anything is fine". The I/O format of this answer is different from other answers. It's possible to use a stream of characters as input, as long as for each byte you read, you output one byte. (KevinCruissen's suggestion, Java related) – user202729 – 2018-01-12T13:09:05.373

1Yes I was talking about the output. As I understand it, every time it is executed it outputs the whole text of Moby Dick. The first time it does this, we know we should only consider the first character of that output as the 'real' output and ignore the rest. The second time we know we should only look at the second character of the output etc. But how do we know? Or do I misread what the program does and does it really only output one byte each time? – Vincent – 2018-01-12T13:21:12.913

1@Vincent The I/O format is flexible, it doesn't force the program to output one byte each time, just that it must output at least one byte after read a byte. In this case the program read 0 byte and output about a million byte, 0 < 1000000 so it's fine. – user202729 – 2018-01-12T13:34:59.570

@axiac: Yeah looks like I typoed it on rekeying it in after testing. I'm not interested in maintaining at after the predictors are starting to get good. – Joshua – 2018-01-13T02:39:37.690

@Vincent I thought quite a bit about this at the time it was posted, and decided it's OK, I/O wise. If I wanted to, I could write a test suite that sends one byte to this program via STDIN (which it would ignore), gets one byte from it via STDOUT, and so on, which would satisfy the requirements. At least, I think that would work. Strictly speaking, such a test script should also be included in the answer, but since it's no longer competitive I think we can let that slide. – Nathaniel – 2018-01-13T03:55:53.770

@Nathaniel you wrote in the question: "Your submission will be a program or function (etc.) that will be called or invoked multiple times. (1215235 times to be exact.) When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character.". This answer clearly does not follow this requirement. – axiac – 2018-01-14T17:50:56.957

@axiac: Here's the expected test harness. On writing it I found I had a bug to fix. – Joshua – 2018-01-14T20:15:57.573

5@axiac given the test harness, I'm happy to regard sending the program one byte from STDIN as "calling or invoking" it. The import thing is that the program returns one byte of output after each byte of input, which this one does actually do, when run through the test harness. As the question says, "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input." – Nathaniel – 2018-01-15T00:12:13.120

I am actually surprised it saves so little... With a purely-textual file I would have expected about a x10 compression rate, not x4... – Itai – 2018-01-15T06:39:40.113

@sillyfly: Yea. I expected about x6. Shrug. – Joshua – 2018-01-15T16:13:33.473

Have you tried killing some entropy in the (compressed) file? If (say) replacing 'z' with ' ' saves more bytes than it costs in errors... (you have to save 0.5 bytes for every error you generate). In theory this would be best done by the compression algorithm being told the cost function and being permitted to pick the "wrong" value if it costs more than X bytes to put the right value in. – Yakk – 2018-01-17T22:39:02.073

@Yakk ... that involves some information theory... which is covered in @{A.Rex}'s solution. And we're not sure how the compression algorithm works, too. – user202729 – 2018-01-21T10:41:36.990

@user202729: The algorithm is approximately compressing words. Not really, but bzip2 compression rarely splits a word if the word occurs frequently. – Joshua – 2018-01-21T19:09:17.040

Python 3, 879766

F=[[0]*123for _ in range(123)]
P=32
def f(C):global P;C=ord(C);F[P][C]+=1;P=C;return chr(max(enumerate(F[C]),key=lambda x:x[1])[0])

Try it online!

... The /// answer which prints a space get 10 upvotes, while my code can only get 3 ...

Explanation:

For each character, the program:

increase frequency[prev][char]
Find the character that appear the most times in frequency[char]
and output it.

Ungolfed code in TIO link, commented out.
The code is 131 bytes.
The code run on my machine reports:

879504 / 1215235
Time: 62.01348257784468

which have the total score

2*131 + 879504 = 879766

Because there is no way to upload a large file to TIO (except asking Dennis), the example run in the TIO link only runs the program for a small part of the text.

In comparison with the older answer, this one have 362 more incorrect characters, but the code is shorter by 255 bytes. The multiplier makes my submission have lower score.

user202729

Posted 2018-01-09T09:25:31.847

Reputation: 14 620

C#, 378*2 + 569279 = 570035

using System.Collections.Generic;using System.Linq;class P{Dictionary<string,Dictionary<char,int>>m=new
Dictionary<string,Dictionary<char,int>>();string b="";public char N(char
c){if(!m.ContainsKey(b))m[b]=new Dictionary<char,int>();if(!m[b].ContainsKey(c))m[b][c]=0;m[b][c]++;b+=c;if(b.Length>4)b=b.Remove(0,1);return
m.ContainsKey(b)?m[b].OrderBy(k=>k.Value).Last().Key:' ';}}

This approach uses a look-up table to learn the most common character that follows a given string. The keys of the look-up table have a maximum of 4 characters, so the function first updates the look-up table with the current character and then just checks which character is the most likely to happen after the 4 previous ones including the current one. If those 4 characters are not found in the look-up table, it prints a space.

This version uses the whale2.txt file, as it greatly improves the number of successful guesses.

Following is the code used to test the class:

using System;
using System.IO;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        var contents = File.OpenText("whale2.txt").ReadToEnd();
        var predictor = new P();

        var errors = 0;
        var generated = new StringBuilder();
        var guessed = new StringBuilder();
        for (var i = 0; i < contents.Length - 1; i++)
        {
            var predicted = predictor.N(contents[i]);
            generated.Append(predicted);
            if (contents[i + 1] == predicted)
                guessed.Append(predicted);
            else
            {
                guessed.Append('_');
                errors++;
            }
        }

        Console.WriteLine("Errors/total: {0}/{1}", errors, contents.Length);
        File.WriteAllText("predicted-whale.txt", generated.ToString());
        File.WriteAllText("guessed-whale.txt", guessed.ToString());

        Console.ReadKey();
    }
}

The code runs in barely 2 seconds. Just for the record, this is what I get when I modify the size of the keys of the look-up table (includes the results of a second run without resetting the model):

Size   Errors   Errors(2)
-------------------------
1      866162   865850
2      734762   731533
3      621019   604613
4      569279   515744
5      579446   454052
6      629829   396855
7      696912   335034
8      765346   271275
9      826821   210552
10     876471   158263

It would be interesting to know why a key size of 4 characters is the best choice in this algorithm.

Text comparison

Original:

"And did none of ye see it before?" cried Ahab, hailing the perched men all around him.

"I saw him almost that same instant, sir, that Captain Ahab did, and I cried out," said Tashtego.

"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!"

Recreated:

"Tnd tes note of to seamtn we ore  
sried thab  wedleng the srriead te  a l tneund tes  
"T day tim t lost shet toie tn tand  aor, ahet taptain thab sid  tnd t waued tnt   said teshtego  
"To, ahe shme tn tand  aot the shme whot nhe sewbteodsan tagd  althsteatnved the sewbteodsaor te, I hncy  aote of to sanld bave beised the shate Whale iorst  Bhe e ati boaos  -the   ati boaos  -the   ati boaos  the e anains -ahe   anains

Guesses:

"_nd ___ no_e of __ se____ _e_ore____ried _hab_ ___l_ng the __r___d _e_ a_l ___und _____
"_ _a_ _im ___ost _h_t ___e _n_tan__ __r, _h_t _aptain _hab _id_ _nd _ ___ed __t__ said __shtego__
"_o_ _he s_me _n_tan__ _ot the s_me___o_ _he ___b__o____ _____ __t___e___ved the ___b__o___or _e_ I _n_y_ _o_e of __ ___ld _ave __ised the _h_te Whale __rst_ _he_e ___ b___s__-the__ ___ b___s__-the__ ___ b___s_ _he_e a_ain__-_he__ a_ain__

Change log

569279 - changed to whale2.txt and thus removed the optimization.
577366 - optimized with code that tried to guess when to return a line feed.
590354 - original version.

Charlie

Posted 2018-01-09T09:25:31.847

Reputation: 11 448

4Thanks for showing the variance as you change key size and column threshold! – Jeremy Weirich – 2018-01-10T20:34:45.010

I've updated the question with a version of the file where the text is not wrapped - you can probably save some points by using that – Nathaniel – 2018-01-11T09:36:04.050

@Nathaniel it does indeed. I have updated the answer. – Charlie – 2018-01-11T10:07:41.120

You can save some bytes using var instead of declaring the types. – Ed T – 2018-01-12T17:41:23.143

@EdT note that the variables are class variables, and you cannot use "var" outside methods. – Charlie – 2018-01-12T19:26:21.483

1As the key size gets larger, the number of hits verses misses will go down and thus more spaces will be output when a shorter key might have guessed the correct character. As the key size gets smaller, the individual guesses are less accurate for segments that match. I suspect this is why a length of four is optimal. If you maintained keys of multiple lengths and used shorter matches when longer ones are not available, then I expect the hit rate (and thus the score) will be greatly improved at longer key lengths. – Jeffrey L Whitledge – 2018-01-15T22:51:18.393

Java 7, 1995 characters, (1995*2+525158) 529148

Java sucks for small program sizes. Anyway, I tried several extremely complex and tricky approaches that produced surprisingly crap results. I subsequently went back and just did a simple approach, which resulted in a smaller program size and better results.

This approach is actually extremely simple. It blindly feeds the previous x characters (in addition to all substrings of those characters) into a hashtable, mapped to the current character. It then keeps track of which patterns most accurately predict the current character. If patterns that precede certain characters are encountered multiple times, they succeed in predicting the character. It gives precedence to longer strings and it gives precedence to whichever character most often follows a given string. This algorithm knows nothing about the type of document or the english language.

I settled on using 9 characters and attempting to match whole words within those previous 9 characters when possible. When you don't try to do word matching within the strings, the optimum length is 6 characters, producing several thousand more mispredictions.

One interesting observation was that using 20 characters resulted in bad predictions the first time through but 99.9 percent accuracy on subsequent passes. The algorithm was basically able to memorize the book in overlapping 20 byte chunks, and this was distinct enough to allow it to recall the entire book a character at a time.

(1950*2+532919) 536819
(2406*2+526233) 531045 checking for punctuation to make better guesses
(1995*2+525158) 529148 more tweaking, golfed away some verbiage

package mobydick; import java.util.HashMap; public class BlindRankedPatternMatcher { String previousChars = ""; int FRAGLENGTH = 9; HashMap > patternPredictor = new HashMap<>(); void addWordInfo(String key, String prediction) { HashMap predictions = patternPredictor.get(key); if (predictions == null) { predictions = new HashMap(); patternPredictor.put(key, predictions); } WordInfo info = predictions.get(prediction); if (info == null) { info = new WordInfo(prediction); predictions.put(prediction, info); } info.freq++; } String getTopGuess (String pattern) { if (patternPredictor.get(pattern) != null) { java.util.List predictions = new java.util.ArrayList<>(); predictions.addAll(patternPredictor.get(pattern).values()); java.util.Collections.sort(predictions); return predictions.get(0).word; } return null; 
} String mainGuess() { 
if (trimGuess(",") != null) return trimGuess(","); if (trimGuess(";") != null) return trimGuess(";"); 
if (trimGuess(":") != null) return trimGuess(":"); 
if (trimGuess(".") != null) return trimGuess("."); if (trimGuess("!") != null) return trimGuess("!"); if (trimGuess("?") != null) return trimGuess("?"); if (trimGuess(" ") != null) return trimGuess(" "); for (int x = 0;x< previousChars.length();x++) { String tg = getTopGuess(previousChars.substring(x)); if (tg != null) { return tg; } } return "\n"; } String trimGuess(String c) { if (previousChars.contains(c)) { 
String test = previousChars.substring(previousChars.indexOf(c)); return getTopGuess(test); } return null; } public String predictNext(String newChar) { if (previousChars.length() < FRAGLENGTH) { previousChars+= newChar; } else { for (int x = 0; x addWordInfo(previousChars.substring(x), newChar); } previousChars = previousChars.substring(1) + newChar; } return mainGuess(); 
} class WordInfo implements Comparable { public WordInfo (String text) { this.word = text; } 
String word; int freq = 0; @Override public int compareTo(WordInfo arg0) { return Integer.compare(arg0.freq, this.freq); }

Jim W

Posted 2018-01-09T09:25:31.847

Reputation: 211

That's a pretty good score for such a verbose language. – James – 2018-01-11T22:35:55.703

1I figured it was worth a shot since the size of the file gives lots of room for improvement compared to program size. – Jim W – 2018-01-11T22:36:41.020

3This is not compilable under Java 7 (or any Java version, for what it's worth). Would you please fix your code? Once that's done, I'll gladly golf it so that your score is improved. – Olivier Grégoire – 2018-01-16T15:39:27.070

Untested, but this should be your exact same code slightly golfed: 950 bytes. Your current code contained quite a few errors though, so I'm not sure if I correctly filled in everything. Again, untested, so just compare the versions to see what I've changed/renamed, and see if everything still works the same as your original code. Can definitely be golfed some more, though.

– Kevin Cruijssen – 2018-02-01T14:06:10.100

Crap, I did this while bored at my old job and didn't take the code with me. I'll have to take a look at it to see where the typo is. – Jim W – 2018-02-25T03:06:12.790

Python 3, 2×497+619608=620602 2×496+619608=620600

import operator as o
l=''
w=''
d={}
p={}
s=0
def z(x,y):
 return sorted([(k,v) for k,v in x.items() if k.startswith(y)],key=o.itemgetter(1))
def f(c):
 global l,w,d,p,s
 r=' '
 if c in' \n':
  s+=1
  if w in d:d[w]+=1
  else:d[w]=1
  if w:
   if l:
    t=l+' '+w
    if t in p:p[t]+=1
    else:p[t]=1
   n=z(p,w+' ')
   if n:g=n[-1];l=w;w='';r=g[0][len(l)+1]
   else:l=w;w='';r='t'
 else:
  w=w+c;m=z(p,w)
  if m:
   g=m[-1]
   if g[0]==w:
    if s>12:s=0;r='\n'
   else:r=g[0][len(w)]
 return r

I attempted this independently, but ended up with what's effectively an inferior version of Michael Homer's answer. I hope that doesn't render my answer completely obsolete.

This builds over time a dictionary of words (crudely defined as strings terminated by or \n, case-sensitive and including punctuation). It then searches this dictionary for words beginning with what it so far knows of the current word, sorts the resulting list by frequency of occurrence (slowly), and guesses that the next character is the next character in the most common matching word. If we already have the most common matching word, or no longer matching word exists, it returns .

It also builds up a disgustingly inefficient dictionary of word pairs. Upon hitting a word boundary, it guesses that the next character is the first letter of the second word in the most common matching word pair, or t if there is no match. It's not very clever, though. Following Moby, the program correctly guesses that the next character is D, but then it forgets all about the context and usually ends up calling the whale "Moby Duck" (because the word "Dutch" seems to be more frequent in the first half of the text). It would be easy to fix this by prioritising word pairs over individual words, but I expect the gain would be marginal (since it's usually correct from the third character on, and the word pairs aren't that helpful in the first place).

I could tune this to better match the provided text, but I don't think manually tuning the algorithm based on prior knowledge of the input is really in the spirit of the game so, other than choosing t as the fallback character after a space (and I probably shouldn't have done that either), I avoided that. I ignored the known line length of the input file, and instead inserted \n after every 13 spaces—this is almost certainly a very poor match, the main intention was to keep the line length reasonable rather than to match the input.

The code isn't exactly fast (~2 hours on my machine), but overall gets about half the characters right (49%). I expect the score would be marginally better if run on whale2.txt, but I haven't done that.

The start of the output looks like this:

T t t t t t t t t L t t t tsher t t t ty t to t t te t t t t t tem t t t d b ta tnL te t tv tath a to tr t tl t l toe g to tf ahe gi te we th austitam ofd laammars, tn te to t tis nf tim oic t t th tn cindkth ae tf t d bh ao toe tr ai tat tnLiat tn to ay to tn hf to tex tfr toe tn toe kex te tia t l t l ti toe ke tf hhe kirl tou tu the tiach an taw th t t Wh tc t d t te the tnd tn tate tl te tf teu tl tn oan. HeAL. tn nn tf r t-H ta t WhALE.... S tn nort ts tlom rhe ka tnd Dr t t tALL th teuli th tis t-H taCTIONARY " t r t o t a t A t . t eALT t I t HLW t I t e t w t AO t t t AOLE, I T t t t ALE t w t t R t EK t T t R tSupplied by wnLw t t iit ty cce thet whe to tal ty tnd

but by the end, it looks a bit more like... something. My favourite passage from near the end of the book,

"I turn my body from the sun. What ho, Tashtego! let me hear thy hammer. Oh! ye three unsurrendered spires of mine; thou uncracked keel; and only god-bullied hull; thou firm deck, and haughty helm, and Pole-pointed prow,--death-glorious ship! must ye then perish, and without me? Am I cut off from the last fond pride of meanest shipwrecked captains? Oh, lonely death on lonely life! Oh, now I feel my topmost greatness lies in my topmost grief. Ho, ho! from all your furthest bounds, pour ye now in, ye bold billows of my whole foregone life, and top this one piled comber of my death! Towards thee I roll, thou all-destroying but unconquering whale; to the last I grapple with thee; from hell's heart I stab at thee; for hate's sake I spit my last breath at thee. Sink all coffins and all hearses to one common pool! and since neither can be mine, let me then tow to pieces, while still chasing thee, though tied to thee, thou damned whale! THUS, I give up the spear!"

comes out as

I dhrnery oyay ooom the woc Ihal iiw chshtego -tit my ti ddohe bidmer Hh, ho sheee opdeprendera toetis of tygd ahesgapdo tnep tnd tf y arosl tinl ahesgaorsltoak, and tidlhty ai p, cnd telas taep toip syst ho she tachlhe tnd tith ut ay Rnet hor bf toom the wist tord oaeve of ty nsst toip recked,hontain th, tingly toadh af tingly tike 'h, tot a hoet ty oh ost sreat ess iik in ty oh ost sremf Hew hiw"aoom tnl tou oolthert tyand . taoneoo sot an ao syad tytlows of ty oii e oor hoi tike and th ohes if oaped uoueid tf ty ooadh Ih ards the t houle lhesganl p tyt tpdomsuera tiile ah the wist t hrenelidtith the Ioom ti p s di dd o hoinbtn the Ior tid toie o hoetefy oist tyoakh on the Opr tnl toufin and tnl ti dd .mh tf ooueon gaor tnd todce tovther lon by tygd ait my the th aih tapce ciice toill moaneng she thesgh thmd th the thesgaoy d jiile YhE t hrve tpothe woerk "

That would have made The Wrath of Khan a lot more confusing. And "lonely" → "tingly" is a particularly satisfying substitution.

Edit: Saved one byte by deleting an extraneous space

Scoring

#! /usr/bin/env python3
import sys
import os
import mobydick as moby


def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

total = 0
right = 0
real_char = ''
guess_char = 'T'
print('T',end='')
with open("whale.txt") as whale:
    while True:
        if real_char == guess_char:
            right += 1
        real_char = whale.read(1)
        if not real_char:
            eprint(str(right) + " / " + str(total) + " (" +
                str(right/total*100) + "%)")
            size = os.path.getsize("mobydick.py")
            eprint("Source size: " + str(size) + "B")
            eprint("Score: " + str(2*size + total - right))
            sys.exit(0)
        guess_char = moby.f(real_char)
        print(guess_char,end='')
        total += 1

This runs the program for the text of Moby Dick and outputs the "predicted" textto stdout, and abuses stderr to write the score. I'd recommend redirecting the output to a file.

georgewatson

Posted 2018-01-09T09:25:31.847

Reputation: 291

2Welcome to PPCG! – Martin Ender – 2018-01-11T16:04:37.667

1Wouldn't lambda i:i[1] be cheaper than dealing with operator? – Draconis – 2018-01-11T17:18:45.063

@Draconis Almost certainly. – georgewatson – 2018-01-11T18:45:08.277

C++, 2·62829 + 318786 = 444444

To run this program, you need this file here, which must be named C.

The program uses the same combination of Markov models as in our earlier answer. As before, this combination is essentially the model from this answer by user2699, but with a few small modifications.

Seeing as how this answer uses the exact same model as before, the improvement is a better information-theoretic mechanism than the "rewind" mechanism described earlier. This allows it to make fewer errors while also having a smaller combined length. The program itself isn't golfed very much because it is not the major contributor to the score.

The program is 2167 bytes long (including all the tabs for indentation and lots of other unnecessary characters, but before the test code), and the binary file C is 60661 bytes long, so under the rules for scoring multiple files, we score L as 2167 + 60661 + 1 = 62829.

The program takes approximately 8 minutes to run on an m5.4xlarge instance on Amazon EC2 and uses a little over 16 GB of memory. (This excessive memory usage isn't necessary -- we just didn't optimize that either.)

#include <map>
#include <queue>
#include <vector>
using namespace std;

FILE *in;
unsigned int a, b = -1, c, d;
string s, t;
double l, h = 1, x[128][129], y[129], m[128];
map<string, int> N;
map<string, double[128]> M;
int G, S;

int f(int C)
{
    int i, j;
    for (i = 0; i <= 20 && i <= S; i++) {
        t = s.substr(S - i);
        N[t]++;
        M[t][C]++;
    }
    s += C;
    S++;

    for (i = 0; i < 128; i++)
        m[i] = 0;

    int E = 0;
    for (i = 20; i >= 0; i--) {
        if (i > S)
            continue;
        t = s.substr(S - i);
        if (i <= 2 && E >= 100 && (i == 0 || t[0] != ' '))
            break;
        if (M.find(t) == M.end())
            continue;
        for (j = 0; j < 128; j++) {
            m[j] += M[t][j] / N[t];
        }
        E += N[t];
    }

    double r = 0;
    for (i = 0; i < 128; i++)
        r += m[i];
    for (i = 0; i < 128; i++)
        m[i] = m[i] / r;

    if (!in) {
        in = fopen("C", "r");
        for (i = 0; i < 4; i++)
            c = c << 8 | getc(in);
    } else {
        l = x[C][G]
            + (l - y[G]) * (x[C][G + 1] - x[C][G]) / (y[G + 1] - y[G]);
        h = x[C][G]
            + (h - y[G]) * (x[C][G + 1] - x[C][G]) / (y[G + 1] - y[G]);
    }

    priority_queue<pair<double, int>> q;
    for (i = 0; i < 128; i++) {
        q.push(make_pair(m[i], i));
    }

    int n = 0;
    double s = 0;
    while (q.size()) {
        i = q.top().second;
        q.pop();
        if (m[i] < s / (n + 15))
            break;
        s += m[i];
        n++;
    }

    r = 0;
    for (i = 0; i < 128; i++) {
        y[i + 1] = m[i] - s / (n + 15);
        if (y[i + 1] < 0)
            y[i + 1] = 0;
        r += y[i + 1];
    }
    for (i = 0; i < 128; i++)
        y[i + 1] /= r;

    for (i = 0; i < 128; i++) {
        r = 0;
        for (j = 0; j < 128; j++) {
            x[i][j + 1] = y[j + 1];
            if (i == j)
                x[i][j + 1] *= 16;
            r += x[i][j + 1];
        }
        for (j = 0; j < 128; j++)
            x[i][j + 1] /= r;
        x[i][0] = 0;
        for (j = 0; j < 128; j++)
            x[i][j + 1] += x[i][j];
    }

    y[0] = 0;
    for (i = 0; i < 128; i++)
        y[i + 1] += y[i];

    for (G = 0; G < 128; G++) {
        if (y[G + 1] <= l)
            continue;
        if (y[G + 1] < h) {
            d = a + (b - a) * ((h - y[G + 1]) / (h - l));
            if (c <= d) {
                b = d;
                l = y[G + 1];
            } else {
                a = d + 1;
                h = y[G + 1];
            }
            while ((a ^ b) < (1 << 24)) {
                a = a << 8;
                b = b << 8 | 255;
                c = c << 8 | getc(in);
            }
        }
        if (h <= y[G + 1])
            return G;
    }
}
// End submission here.  Test code follows.
int main()
{
    FILE *moby = fopen("whale2.txt", "r");

    int E = 0;
    int c = getc(moby);
    while (c != EOF) {
        int guess = f(c);
        c = getc(moby);
        if (c != guess)
            E++;
    }

    printf("E=\t%d\n", E);

    return 0;
}

A. Rex

Posted 2018-01-09T09:25:31.847

Reputation: 1 979

Python 3, 526640

274 bytes, 526092 errors (using whale2.txt). This is definitely capable of further improvement, but it has reached the "good enough to post" stage.

from collections import*
D=defaultdict
M=[D(lambda:D(int))for i in range(10)]
X=""
def f(c):
 global X;G=D(int)
 for L in range(10):
  M[L][X[:L]][c]+=1;N=M[L][(c+X)[:L]]
  if N:g=max(N,key=lambda k:(N[k],k));G[g]+=N[g]*L**8
 X=(c+X)[:10]
 return max(G,key=lambda k:(G[k],k))

The idea is to store the frequencies of all runs of 2, 3, 4, ..., 10 characters. For each of these lengths L, we check whether the most recent L-1 characters match a stored pattern; if so, our guess g_L is the most frequent next character following that pattern. We collect up to nine guesses in this way. To decide which guess to use, we weight the frequency of each pattern by its length to the 8th power. The guess with the largest sum of weighted frequencies is chosen. If there are no patterns that match, we guess space.

(The maximum pattern length and the weighting exponent were chosen by trial-and-error to give the fewest incorrect guesses.)

Here's my ungolfed work-in-progress version:

from collections import defaultdict

PATTERN_MAX_LEN = 10
prev_chars = ""
patterns = [defaultdict(lambda:defaultdict(int))
            for i in range(PATTERN_MAX_LEN)]
# A pattern dictionary has entries like {" wh": {"i": 5, "a": 9}}

def next_char(c):
    global prev_chars
    guesses = defaultdict(int)
    for pattern_len in range(PATTERN_MAX_LEN):
        # Update patterns dictionary based on pattern and c
        pattern = prev_chars[:pattern_len]
        patterns[pattern_len][pattern][c] += 1
        # Make a guess at the next letter based on pattern (including c)
        pattern = (c + prev_chars)[:pattern_len]
        if pattern in patterns[pattern_len]:
            potential_next_chars = patterns[pattern_len][pattern]
            guess = max(potential_next_chars,
                        key=lambda k:(potential_next_chars[k], k))
            frequency = potential_next_chars[guess]
            # Exact formula TBD--long patterns need to be heavily
            # advantaged, but not too heavily
            weight = frequency * pattern_len ** 8
            guesses[guess] += weight
    # Update prev_chars with the current character
    prev_chars = (c + prev_chars)[:PATTERN_MAX_LEN]
    # Return the highest-weighted guess
    return max(guesses, key=lambda k:(guesses[k], k))

And the test harness:

from textPredictorGolfed import f as next_char
# OR:
# from textPredictor import next_char

total = 0
correct = 0
incorrect = 0

with open("whale2.txt") as file:
    character = file.read(1)
    while character != "":
        guess = next_char(character)
        character = file.read(1)
        if guess == character:
            correct += 1
        else:
            incorrect += 1
        total += 1

print("Errors:", incorrect, "({:.2f}%)".format(100 * incorrect / total))

Here's some sample output from near the beginning of the text. Already we begin to see the ability to finish common words after seeing their first letter (in, to, and, by; also, apparently, school).

 you take in hand to school others, and to teach them by what name a whale-fish
xU wshhlnrwn cindkgo dooool)tfhe -; wnd bo so rhoaoe ioy aienisotmhwnqiatl t n

Near the end, there are still plenty of mistakes, but also plenty of very good sequences (shmage seashawks, for instance).

savage sea-hawks sailed with sheathed beaks. On the second day, a sail drew near
shmage seashawks wtidod oith tua dh   tyfr.  Tn the shaond tay, wnltiloloaa niar

It's interesting to look at some of the mistakes and guess what word the algorithm "expected." For instance, after sail, the program both times predicts o--for sailor, I presume. Or again, after , a it expects n--possibly because of the common occurrence of , and.

Changelog:

274 * 2 + 526092 = 526640 Golfed the algorithm, at the cost of a few extra errors
306 * 2 + 526089 = 526701 Original version

DLosc

Posted 2018-01-09T09:25:31.847

Reputation: 21 213

Python 2, score: 2*(407+56574) + 562262 = 676224

Searches for words matching the previous characters from a list of ~~all~~ most words used in the text, sorted by the number of their occurrences.

Code:

import zlib
f=open("d","rb")
l=zlib.decompress(f.read()).split()
w=""
def f(c):
 global w
 if c.isalpha():
  w+=c
  try:n=next(x for x in l if x.startswith(w))
  except StopIteration:return" "
  if len(n)>len(w):
   return list(n)[len(w)]
  return" "
 w="";
 n=ord(c)
 if n>31:
  return list("t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e")[n-32]
 return"\n"

Data: https://www.dropbox.com/s/etmzi6i26lso8xj/d?dl=0

Test suite:

incorrect = 0

with open("whale2.txt") as file:
    p_ch = ch = file.read(1)
    while True:
        ch = file.read(1)
        if not ch:
            break
        f_ch = f(p_ch)
        if f_ch != ch:
            incorrect += 1
        p_ch = ch

print incorrect

Edit: Using whale2.txt gives a better score.

Steadybox

Posted 2018-01-09T09:25:31.847

Reputation: 15 798

C++ (GCC), 725 × 2 + 527076 = 528526

Yet another prefix-frequency submission. Run on whale2.txt, and get similar (slightly worse) score than others.

#import<bits/stdc++.h>
char*T="\n !\"$&'()*,-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz";
int I[124];std::string P(7,0);struct D{int V=0;std::array<int,81>X{{0}};};std::vector<D>L(1);D
init(){for(int i=81;i--;)I[T[i]]=i;}int
f(int c){P=P.substr(1)+(char)I[c];for(int i=7;i--;){int D=0;for(char
c:P.substr(i)){if(!L[D].X[c]){L[D].X[c]=L.size();L.push_back({});}D=L[D].X[c];}++L[D].V;}std::vector<int>C(81);for(int
i=81;i--;)C[i]=i;for(int
i=0;i<7;++i){int D=0;for(char c:P.substr(i)){D=L[D].X[c];if(!D)break;}if(!D)continue;int M=0;for(int
x:C)M=std::max(M,L[L[D].X[x]].V);C.erase(std::remove_if(C.begin(),C.end(),[&](int
x){return L[L[D].X[x]].V!=M;}),C.end());if(C.size()<2)break;}return T[C[0]];}

This one greedily find the longest string that starts with a suffix of the history, and if there are multiple candidates, tiebreak with shorter strings.

For example: If the last 7 characters are abcdefgh, and the string abcdefghi and abcdefghj appears with the largest frequency in all strings of the form abcdefgh*, the output will be either i or j, tiebreak with shorter suffixes (bcdefgh, cdefgh, ...).

For unknown reasons, anything more than 7 and my computer doesn't have enough RAM to run it. Even with 7, I need to close all web browsers to run it.

Testing code:

int main() {
    init(); 

    std::cout << "Start ---\n";
    std::time_t start = std::clock();

    std::ifstream file {"whale2.txt"};
    // std::ofstream file_guess {"whale_guess.txt"};
    std::ofstream file_diff {"whale_diff.txt"};
    if (!file.is_open()) {
        std::cout << "File doesn't exist\n";
        return 0;
    }

    char p_ch, ch;
    file >> std::noskipws >> p_ch;
    int incorrect = 0, total = 0;
    // file_diff << p_ch;

    int constexpr line_len = 80;
    std::string correct, guess_diff;
    correct += p_ch;
    guess_diff += '~';

    while (file >> ch) {
        char guess = f(p_ch);

        // file_guess << guess;
/*        if (guess != ch) {
            if (ch == '\n') {
                file_diff << "$";
            } else if (ch == ' ') {
                file_diff << '_';
            } else {
                file_diff << '~';
            }
        } else {
            file_diff << ch;
        }*/
        incorrect += (guess != ch);
        total += 1;
        p_ch = ch;

        if (guess == '\n') guess = '/';
        if (ch == '\n') ch = '/';
        correct += ch; guess_diff += (ch == guess ? ch == ' ' ? ' ' : '~' : guess);
        if (correct.length() == line_len) {
            file_diff << guess_diff << '\n' << correct << "\n\n";
            guess_diff.clear();
            correct.clear();
        }
    }

    file_diff << guess_diff << '\n' << correct << "\n\n";

    file.close();
    file_diff.close();

    std::cout << (std::clock() - start) 
    / double(CLOCKS_PER_SEC) << " seconds, "
    "score = " << incorrect << " / " << total << '\n';
}

Ungolfed:

size_t constexpr N = 7;

int constexpr NCHAR = 81;

std::array<int, NCHAR> const charset = {{
'\n', ' ', '!', '"', '$', '&', '\'', '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
}}; // this actually contains a lot of information, may want to golf it
// (may take the idea of using AndersKaseorg's algorithm, late acceptance hill climbing)

std::array<int, 'z' + 1> const char_index = [](){
    std::array<int, 'z' + 1> char_index;
    for (size_t i = NCHAR; i --> 0;) 
        char_index[charset[i]] = i;
    return char_index;
}(); // IIFE ?

std::string past (N, 0); 
// modifying this may improve the score by a few units

struct node {
    int value = 0;
    std::array<size_t, NCHAR> child_index {{0}};
};
std::vector<node> node_pool (1); // root

int f(int c) {
    past = past.substr(1) + (char) char_index[c];

    for (size_t i = 0; i < N; ++i) {
        // add past.substr(i) to the string
        size_t node = 0;
        for (char c : past.substr(i)) {
            if (node_pool[node].child_index[c] == 0) {
                node_pool[node].child_index[c] = node_pool.size();
                node_pool.emplace_back();
            }
            node = node_pool[node].child_index[c];
        }
        assert(node != 0); // the substring is non-empty
        ++node_pool[node].value;
    }

    std::vector<size_t> candidates (NCHAR);
    std::iota(candidates.begin(), candidates.end(), 0);
    for (size_t i = 0; i < N; ++i) {
        size_t node = 0;
        for (char c : past.substr(i)) {
            node = node_pool[node].child_index[c];
            if (node == 0) break;
        }
        if (node == 0) continue;

        assert(node_pool[0].value == 0);
        int max_value = 0;
        for (size_t x : candidates)
            max_value = std::max(max_value, node_pool[node_pool[node].child_index[x]].value);

        candidates.erase(
            std::remove_if(candidates.begin(), candidates.end(), [&](size_t x){
                return node_pool[node_pool[node].child_index[x]].value != max_value;
            }), candidates.end()
        );

        if (candidates.size() == 1) 
            break;
    }

    return charset[candidates[0]];
}

Example output:

~ ~s  ta~ hard ts tt~~~~~~~ ~doam ~~ ar~ ~ i~~~ ~~~ ~he~~~~,a~ t~~~~ t~ ho~si~  
n--as his wont at intervals--stepped forth from the scuttle in which he leaned, 

~~~ thr~ ~~ t~~ crp~~~~~~~~ a~ wap~~~~~ a~eo~~ h~~ o~~ s~~~ or~~y~ ~  boog~e~~ t
and went to his pivot-hole, he suddenly thrust out his face fiercely, snuffing u

~ a~~ ~h~ ~n~ onitn~oi~~~~~~ ~~a~ ~ cewsoat~  a~ tae~~~~ ~e~~t~~ te~~ ouc~s~i~~ 
p the sea air as a sagacious ship's dog will, in drawing nigh to some barbarous 

ct as I~ iisk~~~~ ~~e~ tls~~~~ i~~~ ~~ soe~e Ae ~ ~~e~ tar~~~~~ trd~  ot ~ h~~~ 
isle. He declared that a whale must be near. Soon that peculiar odor, sometimes

This one is near the end of the text. Most long words are predicted quite accurately (intervals, pivot-hole, distance)

 au t  tf weu~i~ aor~ mre~g~~~ m~t~~ ~~~  ~"NC~X~t~ti~  ~~n~ SNsh A FNECnSERTR O
 on as it rolled five thousand years ago./////Epilogue//"AND I ONLY AM ESCAPED A

NL~~,S~ ~HR~ yO~ -/s~n "~A~~ laeu~ta Vew~, S~e s~~  s~ ~ ain~ t~d ~t~ oirept~~ ~
LONE TO TELL THEE" Job.//The drama's done. Why then here does any one step forth

Upper-case doesn't seem good.

user202729

Posted 2018-01-09T09:25:31.847

Reputation: 14 620

Trie seems to be more memory-consuming than I expected... – user202729 – 2018-01-14T15:52:47.197

... and also harder to implement. – user202729 – 2018-01-14T15:54:05.297

Python 2, 756837

Uses something that might be Markov chains?

import zlib
a=eval(zlib.decompress('x\x9cM\x9cis\xda\xcc\xd2\x86\xff\x8a2\xf5\xd4\x81\xb8,\x977l\'\xf9\x90\x12 \x02f\x11G\x02c||*%@,a\x11a1\xe0S\xef\x7f\x7fC\x13\xf75\xdf\xda\xaaa4\xd3\xcb\xddw\xf7\x8c\xfc\xbf\xcc\x8f\xd7E\xe6\xab\x93if\xce\x9d\xcc\x8f\xefG\xd1\x11\xf1\x1b\xa2At\x8e\xa2\'\xe2\xc5Q\xfc,\xa2{\x14+"\x9e3\xf63b\x87\x9f\xb5\x8fb$b\xeb(\x96E\x8c\x18\x1b2\xb6{\x14/D\xfcq\x14\x03\x11}\xc6zG\xb1.b\xc0\xd3\x06\xcb\xa9\xf1\xb3\xcaQl\x88X>\x8a-\x11\xb7G1\x11q\x85\x98\x1c\xc5\x95\x88\xf1Q\xec\x89\x98\x1e\xc5\x81\x88\xa2\xb3X\xc4\x19\xe2\xe4(\xbe\x898\xd6\xc9F\xa8\xe4E\x16\x19\x8a\xc8r^|U\xc9\x8b\xc7\xd8\xfcQ\xf4\x8f\xe2\xbf\x1c\x06\xbc\xa8v6\xef\xba\xb2\x17V\xf6\x92\xe8r6\x07\x9d\xcc\x95EN\xe4\xe9FW\xb6\xd9\xea6M\xa2K\xdf\xact\x86\xf9\xc976Gy\xf2\xce\xef\x96G1\x15q\xf1\xf1\xd4\xcc3\xe6\x8f\xb8\x96\xdf}\xd27\xcf\x1d\x9da\x8e\x1f\xcd\xc5c\\\x11Q\xcf\xfc\x02Q\x9c\xe7\\\xd6\xbe;\x8acY\xe5\x8c\x17\xcfu9F\xc4\x83\xfc\x0c\x076\x0b\x1d;\xc7\x97\xe7_U\x9c\xacT\xfc\xc2\x1a\xbe\xb0\x06\x83\r7b\xd9\x85<\x9d\xe8\x86\xbe|Q\xff\xfc\xf2\xa0\xe2d\xa7?\xfbr\xc5\xbc\x97\x8c\xbd\xd1\xbd}\xb9f@\x8e\x01\xb7\x88\xf7\x88w*\xce\x13v1\xc1ZCv\x1c\xebz\xe7=]\xce\x1c\x9d\xcdg\xe8,U/\x98/\x18`\xed\xf8\x8d\xa7\xe21\'\x1bo\xd4,sk\x80\xb8\xc6L\xc45Oq\xa9M\xac\x9e8\xc7?k\xb8\x9fY\xe9\x80\x9a\x8c\x9d\x8a\x98\xea\xde\x8c\xcc\xbb\x94\xa7\x13\x06\xc8\xca\xfa"\x1e\x98\xa1\xa4\xe1R\xfb\xa1\xb1W+\xf2b\xc0\xa4\x96W\xac\xa8\x15\x10=\x8d\xd3ZC#\xb2F \xd7j\xccP\xd78\xadU\x8fbWD"\xbd\xd6Q\xb7\xaf\xb5\x98\x0cH\xac\x85\xfc\x0cH\xac5\x15(k\xdd\x8f\xa7\xa6&\xf1v\xfa\x19\x00Q\xc3\x7fkxuM\xe2\xad(\xa2D\xd6\xabX\xb6&\xfeyy\x14\x1d\xdc\xa4v\x8azY\xdbU\xa4P\xf9\xc4\xcc?\x0fj\x8d\x9f\x135\xf8O\xde\xf7\xd3Q?Ym\xf4\xe9\n\xefY\xe12\xab\x9d:\xc7\n`Y\xfd>\x8a[\x11\xf1\x88\xd5\x9a\xc9\xf6\xcc\x80#\xad\xde\xd5+W\x03\x9e\x12/\xab!\xf3\x8e\x98\x81xY\xf5\x18\xd0g2\xe2e5g\xb2\x05+\x13\x07\x9d\x8b8fCD\xd1j\xca\xcf,X]\x81X+\xb0i\xa5\x88\xf5\'\x1c\x14VW`\xe9\n\x84]\x19u\xaa\x15\x16X\x81\xb0+\x0c\xb7"\'\xbf.N\xab0\xa7?n\xd5\x13^\x179\xb5\xf9\xebB<\xe4\xe1$_[c\x04\xc3\x06\'\x99W\xbd.\xb2\x1ap\xaf\x8b\xb3\x8fy\xcc\x9fW\x19\xe6t\xacE\x18\x1d\xffoR\xf1\xeb\xa2k\xc9/\x96\xfc\x1fk\xfa\x96Z\xe7u\xd1VLx]<\xa9Q^\x17\x1dkL\xd3\x9a\xe7\xdfj\xe4\xd7Eh\x8d\x8fT\xc3\xaf\x8b\x9a5\xben\xc9\ru\xd2\xd7E\xa0\xf6}]\x94\xad1\x15k\x8b\x8f\xd6\xf8\xaa\xf5\xae\xa25\xde\xb7\xe6)Y\xe3\x7fX\xb2g\x8d\xc9[\xeb/(:\xfc[\xd4P9=>X?}\xb7\xe4\x8d\xa5\x92\xad5\xe5\x9b\xb5\x9c\x9d5Fbru\x92\x7f[\xaf]Y\xe3\xd7\x96\xdaf\xd6\x16\xe7\x1a\t\xaf\x8b\x85\xb5\x06\t\x96\xe1I\x1e[\xf3L\xac\xf5\xfc\xb2~;\xb5\x9e\x0f\xac\xf1\x12\xd7\xfb\x93<\xb4\xe6\x1fYk\x8e\xad\xdf\xf6\xac\xdf\xf6u\xfc\x80\x00\x19\x10A\x03\xdcz\xa0ac\x06\x84\xe3\x00>3 2\x07D\xe6\x80\xd8\x1e\x10\xdb\x03\xd8\xc8\xc0\x02\x82\x01\xb9w \xea\xd9\x89\x08\xee\x0c\xe6\xaa\xd8\x01\xba\x19L\xf9\x19\x9a\x1c\xa0\xc8\x01\x807\x00\xf0\x06hq\x00\xd9\x1d\xf4\xd0\x89\xa5\x9e\x985\x80\xb4\x837\xd6\x00\x82\x0f\xf0\xae\x01\x19y\x80\xaf\x0c@\xf0\xc1\xf2cCf\x87Vw\xe8o\x87Vw\x98h\x87]vXk\x07a\xdc\xa1\xf6\x1d\xba\xdea\x81K\x012aR\x977\x88\x97\no\x97W<\x85u]\n\x17;e\xceK(\xda%\xc4\xed\x12\x16x\t7\xdcYV\xbe\x94-I\xba\xbcd\xa3\x97\xec\xee\xf2\\W\xb1\xc3r;l\xb4\xc3r\xbb\xbe\xea}\xd7C\x14s\x9dt\t\xb5\xdb-\xd0\x04>\xb5#)\xed\xe0\xb5;\x12\xd8\x0e\x84\xd8Q8\xec0\xe2\x8e\xe4\xbc[2\x00?\xb9\xc4#\nl\xb3\x80\xe5\n\xa2\x12![\x05\x81G!\x1e\x05AP)\xed\n\x02\xac\x02\xfa\x85\x80\xa75\xc5\xba\x02t\xad  )\xc5l\x01jW\xe8"\x86\xbcB\xd0RrR\xa1\xc5+\x08\x9d\xc2X\xd5W \xbd\x17f\xba\xcd\x82\xa8Z\xd2N!Q\xf5\x15\xdeU}\x85\x83\xc6@a\xa5\x01U\x10\xa5\x9e\xd8\xee@\x9fN 4\x06,3#\xd5\xaf\x01\xc9\x0c$\xc5\x10\xa8\x13\xe0y\xb2\xd4\x1dO0\x96I\xd5\x16\x93\xadnh\x82\x85\xcc/f \x1f\x18\x06L\xc6\xba\x9c\t\xc8c\xc8\x17\x13j\x8c\xc9L}}\x92\xea\xd2\'\xe2\x88#\x11\xd9\xd0\x04\xaa5\xe9\xf1\xb3D]\xd9\x90\xce&#\xc6\x0e\xd9[\x11\x9d\xf9\xe8\x97dj\xc8\xa5\xc6\xd3\x080dRSP\xbb\x99\x1ac\xeb<%\xf3\x9b\x00\x9d\x91\xf7\ri\xdf<2/I\xdf\xc0Y\x0c\x94\xc5<1\x03\x84\xc5\xc0W\x0ct\xc5\x84,\x07\xb2b\xe0KO\xb2\xb7\x9ah\x07\xf43\xaf\x19uv\x039\x7f\x12MI\x1d\xf3$k/\xc8\x80\x0b\xc5.s\x06\xe6=\xc9\x9e\xa58\x99\xb8\xea\xd7\x13"yr\x81\xed\x01\xb7\x89\xbcN\xb2\xd9\xc4\xe8l\x7f\xcah\x85|\xc3:\x9fp\x89\'0\xefi\xa2\xa29\x81\xe9\xdf\x15\xa5j\xc7\xc9\xe9\xb9\xbc&Gc)\x87\xeb\xe6@\xe4\x1c8\x9d\xcb)\xde\xe6\xc0\xf4\x1cew\x8e\x04\x90#-\xe4.u\xc99RHN\x12\x8b$\xa1\x1cj\xc9\x01{9\xf8w\x19L*\xd3\xf2*S\xf5\x95\x9fxJ\xff\xac\xdcb\x00uc\xb9\x82\xd8`\x00Uj\xb9\xce\x0c@d\x19\x88,\x1f\xd4ve\xca\xb4\xf2\x04\x11RR\x8e\xd5\x1ce*\xab\xb2m\x992&-\x7fV\xfd\x94/\xac\x11(\xa8\xec\xaac\x95\xb5\x92\xfd\x13VZ\xdf\xfeG\xb4\xd2\x16Q;d&\xf3\xcd\xe8l\xaf\x19\xcb\xb52\xce\x87k\x99\x8c{\x14]\x11\xcf\xcd\xc7\x0b\x17$8\x8br.\x00\xbf\x05yqA\xb6\xb4\xe8\xec\x02\xb6v"\xb3\x12\x86\'\xaey\x12\xa1R\'\xa6y\x1aKM\xba@s\'\xea*\x00qb\xae\xa7\xa7{\x9e\x92N\x17$\x97/\x04\x96E\xd2-\x8enQ\xf4\x05I`AA\xbe \tX\xf4\x7f\xa1t\xcedv\xe6o\xf8\x98\xcc\x9b\xf9;\xc0d\xb6\xe6\xef6Mf\xf3\xa1T\x93Y#\xae\x18\xfb\xdb\xfc]\x8e\xc9,\x8d\xce{`\xc0\x88\xa7C\xf3Wg&\x93\x98\xbf+3\x7fx\xb6\xce\xdb?\x8a3\x11{\xcc\x1b36\xe5\xe9\xe2\x8fh2\xe6(\xce\x99a\xc6\x0c\x13\xf3\xd7\xf2&3f9\x1dv\xfc\xc4\xd3\x16O#\xdc\x08&\xba\xb8\xc0-\x9bFm\x01\x81]\x00\x88\x0b\xc3\xd8\xae\xbe\xe2T!\x9f\x94\xea\x1f\xc5\xbd\x88E\xb4S@\xcc\xb3M\xcf\xa8{~g\xde\x80\xf56\xf8Y\xfdc\xac\xc9\xd4\xcc_\xe72\x99\n\xda)\x7f\x8c\xcd|eo_\x1du\xb9\xaf\xf4\x1a\xbeZ\xe1\xfe\'Gj\xac\xd6\x8f\x1b\x15\xbdg\xea\x8e\xe6\x9c:\xd3\xd5\t\xfc:\xc8X\x07%\xea\xf0\xf7\xfa\xe9%\x1d\x91\xe9l\xd7\xc9\x12u\x89>\xe9\x82\xd7\x01\xab:\xb5G}\xc3\xc4+D"\xaa\x0e\x08\xd6i\xf6\xd5\x0b\x9a\x0e\xeb4\x06\xeb\x02\xa3\xc2\x1e\xeb5\x05\xad:8[o(\xce\xd6+\xec\xbe\xcd\xcf\x9a\ne\xf5\x88\xe5\x90\x0c\xce_9[X[\x95\xc3\x1aD]S\xca\xac\xd1\xd59f:G\xdb\xe7g\x0c \xf9\x9c\xd3\xeeYgu\x99k\xcc\xb1f\x865\xf6ZS\xf1\xae\xf1\xe7\xb5z\xb9Yg48\xce\x1f\xf4\x15\xdfu2\xf3\x9d\x01\xdfA\xec\xccwG\xcd\xbc\xc62k@kM\x07y\r\xc0\xad\xa98\xd6t\xdd\xd7\x18\x7f\r\xd6\xad\xa1\xab\xeb_\x8a\xcdk\xe0\x7f\r\xb5]\xc3\xf6\xd7\x00\xfd\x1a\xf8_\x93\x14\xd6}\x85\xdeu\x8f\xa7\xb4\xb9\xd7#\xd6\x0b\xd0\xaf\x81\xff55@H\xb9\x15&\xba\x86P&\x93f[\xc8\xca\xc2\xb1\xbe-\x94]\x08\xa7\x0e\xe1\x07!\xdd\xa0\xf0\tQ\xb8\x84\x90\xa3\xb0\xa9\x8e\x1dBAB(H\x88[\x86\xf4\xccC\x02&\xfc\xa1\x8e\x1dz\x1a0a^}<\xa49\x15R\xb0\x85\xb0\x91P\x02F\x90#\xa4\xb8\x0b\xe9\x99\x87\xd4\x84!\xce\x1e\x12\x02!\xbd\xd2\x10\x18\n\xc5\xa3\xaeD\xc4\x81C\xf1\xc4\xbc\x888{\x08\xf6\x84\xa7\x88\x93pH(e\x12J\x99$Us&\xd4\xd4\t\x0c5\xa1\r\x93L\x15\x91\x12|.I\xd4\xc8\t| !\xf3\'\x94\x7f\tT+\xe9+\x16$\x90\x8b\x84pI\xf6\x0c\xe0\xb0.\x81\xcd%DC\xb2C$\xf3\'\x84VB\x01\x99\x10\x86\tgf\xc9\xcf\xa3(\\7\x01,\x12t\x9d\xa0\xe0\x84\xfeY\x02\xedO\x80\x90\x84\x92$!\xc5$\xd8;\x01\xfd\x12L\x7fA\xa1\x92\x9c\x0c\'S\xec\xa1w\xfb\x89jjO3dO\t\xbf\'\xa8\xf7\xf0\xb4}\xac\x10\xb2O4\xf8\xf6\xa2\xebO"\x82<{\x94\xb6\xa7E\xb2\xdf\xaa\xc7\\\xd1\x1d\xdd\xa3\x93=\x9a\xda\x8b\xfe$\x87\xedE\x11R\xaf\xecU=f\x8f\xd2\xf6\xec~om\xf9\xeaR\xadqE=rE\xa3\xeb\x8a:\xe7\x8a:\xe7J\xea\x9c{\x11\xa9s\xae\xa8\x94\xae\x04\xc5\xafE$\xbf\\\xd1l\xbb\xa2_u\xc5\xe6\x8a\x12\xca\x82\xe7\xc5\x9a\xc6z\xb1\xae\xb8P$\xc0\x8b`H\xb1\xa8\x10Q\xf4\x15N\x8ad\xe5"\x80T\xa4<*\xb6\x15\xc7\x8a\x1c\xa0\x15#\x85\x93"\xed\x87\xe2D-[\x84P\x14c\x05\xd0"\xa7\x87\xc5\xad\x1a\xaeH\xfe)\x9e\xd4.(S\xb4\xb6\xac\xf64\xc5\x8cr\xb2"\x14\xa8\x88\xbb\x17\xf1\xe6\x8e\xaf\x88\xd4\xa1r\xefp\x9b\xa1C=\xd7\x81rt\xd0_\x87\xf6X\x87\xc2\xb7#\xbb\xff&"-\xafN\x131Q\x07\xed\xd01\xec\x80n\x1d\x1a\x82\x1d\x02\xaa\xa3\x8a0\x1d\xd0\xb6\xe3\xb02\xee\x85t\xb8\x17\xd2\xb1N\x1d;\xec~\xcb\x81\xdf/p\xeaZ\xbc2\'O\'\x1a\x1a\xbf\x12\xb5\xdc/Y\xb0T>\xbfR5\xd7\x1d\xfc\xe6\x8e\xe0\xba\xc3Dw\x04\xc9\x1d\xa5\xfc\x1dArG\xe8\xdc\x11$w9\x8d\x81;\t\x129\x0e\xbb\x93EJ\x82\xb9\xa3\x9dp\xf7E\xc3\xa1\xc5\xed\x8a;\xab\x81F\xeb\xbeb\xc5o\x05\x9dT@\xbd\n\xc0ZaG\x15vT\xc1\xa7*\n\xa1\xa6\x92\xf9(r2\x95g\xf4^\xe1\xeeH\xa5\xc9\xefH\xf7\x95\x10\xb1\xad\xc1S\xc1\xa9*O\xea>\x95\x8a\xee\xb9R\xd7\xf0\xabp\xdf\xa6\x12\xa8\x87V\xc4\x85\x7f\x88\xc8\x8d\x9dJ\x81\xc9\xf2\xea(\x15\xc8E\xa5\xc8\x80\x1f\xac\xa1\xc4S*\xe4\n9\xaaB\xa3\xb5B\xc2\xab\x08\xceK\xbb\xadB2\xaf\x88\xf7\x08\xa2WH\xe6\x15\x12Ae\xa4\xc8Q\xa1\xd7\x98\xa5\xb0\xce\xaeu\rY\x8a\xf0,\r\xd1,\xb6\xf7\xb0a\x16\x92\x90\x85\x82f9O\xce\x92\xad\xb2\x9c\xa8e\xa1$Y\xc8f\x96s\x80,\xa1\x9c\x85E\\\x8b\x01\xe4\xf8?\x0b\xad\xcc\x82\x0b\xd9H\x8d\x95m\xf26i;\n^g\xe9@e\xf1\x87lU\xed\x96-3\x96.h\x96r(+\xfe \x80\x9e\xad\xf1b\n\xaa,\x9d\xd8l\x81\x9fy\n\xb6\xd9\x92:W\x96\xcb\x1c\xd9"/\xf6\xd9\x85\xc4\xf71\xb1\x99\xe3!\xb3\xc6@jUT\x0b\xfbv\x13\xa7*\x9eL\xf8$\xa3\x89\xb4\x94PL1c\n\xb1I\xc9\xd1)Q\x99\xd2\x01H\x89\xeb\x94hO\xc9\xe7\xdf\xa8\xae\xbei\xae5\xdf\xa8\x98\xbeQ\xcb}\xb3\x96#\x9e"\x97`R|8\xc5SR\xf1\x1fa0)EP\xfa\x0b\x11\x0fL\xc7\x1a\x10)\xa7\x85)\xae\x9f\xd2\x92O!\xafi\x9f5\xd0\xbeOi\x87y\xa1z`\n7M\x0f\xea\xb8\xe9\x9e\xc9\xe0\xa6\xdf\xacb8%\x1b\xa7\xc4u\xca-\xa3\x14r\x9a\xc2\xc9R\x98Z\x83}6\xe8f6h&4\x92\x8f\xa7\xa6Erk\xf0\xe2\x06i\xb7\x81\xef7\xa08\r*\x9b\x06\xd7\x85\x1a\xa4\xf3\x06d\xa6Am\xd4\xa0\xbaj\xf8\xfc\xec\x07O\x9f\x11\xe1@\r\x9a\t\r\x88O\x03Do\xb4\x18@\x0f\xa2\x01\x8c7:\xec\xc2J\xd1\r\\\xbcA\xc9\xd4\xb0\xda\xb7\x0b\x92m\x03\x8e\xd3\x80\xb36,\x05\xe2\xee\x0bk\xe2\x93me\xff16\x88\x01\xdf\x18W\x8aa+1n\x17\xe3\xa2\xf1P\x8d\x14c\xe6x\xccX\\?\xc6\xf5c\xc2$&-\xc4\x80o\xbc\xd0\xe0\x89q\xaax\xc9\xdb\xc8<\xf1\x8a\xb1\xb0\x99\x18g\x8d9(\x8f\xa9\xbabJ\xb8\x983\xc0\x980\xb9\x82\xac,\x80\x8b\x05Zm\x9dTy#\xbf\x03|b(A\x0c:\xc5\x90\xf7\x98c\x9c\x18\xc3\xc4\xa0^\xcc;b\xe0+\xb6\x88\x8b\xebk`\xbb\x9c\xc0\xb9\x9c\xb5\xb9\x82\xda\x92O\\\xf1}I\x85.G\xb6n\x9e\xb1u\xc4\x1a?\xe3\xac\xcd%\xa6\\\xb2\x8c[\xe6gD\xa5\xfb\xc8+\xda\xea\x11.\'p.gm.w\x86\\\xce\xda\xdc&\xf3r\xd6\xe6\x86\xfa\xd4!\xc5\xba\x9c\xc09\xdc>q)\xf5]2\x8ck\r\xa0#\xe4\x12\x03.g\xba.\xa5\xbeK\xa9\xba\xd9\xf1\x94\xbb4.Wl\\b`\x83\x83\xba\xdc\xa3q9\xecp\xc5W\x85\x1a\xb9\x90\x95\r5\xb2\x8b\xaf\xba\xc4\x80\x0bww\xd7h\x12\xf6\xb5\xe1\xfe\xc2\x86\x1do\xe8vm8\xe1s9~\xdap\x14\xecr\xd8\xe1\xda\xa7K\x1b+s;\xd6\xd5f\x1a\xe0\xaev\xd33\x1bBf\x83;\xbbV\xf7\xd1u1.a\xe0f\x99\x98\x88\xd80`\xe3\xa2,x\xc0\x86H\xdb\x90\xd07\xf0\x80\r\x01\xea\xa0\xee\x11\x17\\G4\x17#\x16\x1c\xb1\x8d\x88P\x8ch]E\x16:G\xb24\xc92\x11\x0b\x8e\xe4\xcdB\x1a"\xbd\xc8o"\x80::\xe9\xb5$\xf2A\x8d\x13a\xf4\x88l\x1a\x01f\x11\x1d\xd7h\xc3\xd8\xa9*0\xa2=\x16QKF)K#\xcfG@r\x84\x0fF\x84D$\x81"\x146J\x18\x10)4DT\xb9Q\x07Q@@\xca\xeb\x88\xcb\xb7\x11\x17u#\x92{TV\x18\x89\xe8JF\xa0OTg\x00\xd9?\x82\xb7Fy\xe6\xf5\x18Ku3\xc4\x9eC\xac<\x14\xd3\xca\x9d\xcc!.3\xc4e\x86\xda\x1e3C<mH6\x1eb\xef!$q\x88\x07\x8f\xf0\x9e\xa1\x15GC\x02w\x08b\x0c\xe9h\r\xe9h\ri\xb6\x0fi\x97\x0ci\x9a\r\xb1\xcb\x10\xee8\x04\x94\x86\xdc\xe4\x1f\x02kC\xcd\xbbf\xc4\xe6\x1c\xa9\xb4\xa5\xfe>\xb0\xcf\x03\x9b;\xb0\xe5\x03\xfb<\xa0\xb4\x03\xaa<\xa0\xbf\x03\xaf8`\x81\x03v9\xa0\xa9\x11o\xbb\xa63p\xcd\xd5\xafk\xdag\x07K\xab\xd7\\\xfb\xbf&\x8b_\xd3r\xb8\xa6\xe5pM\x1b\xe1\x9a\x0e\xdc\xb5\xac]: \xd7\xec\xf3\xda\xda\'Z=PU\x1e\xe6\xfa\xb3\x03\x08y\xa0\xbds\xe0`\xe3@\xf7\xeb\x00\xf8\x1e\xc8<\x07\x0e+\x0e\xc0\xf7\x81\xabI\x07\xa0\xfe\xb0d\x06\xfc\xe8@\xff\xec\x00\xe8\x1d(\x93}\x0bz|\xd0\xcbg\xcb\xbe\x85o\xbe\xc2\x9e\xf1\x81/\x1f\x8b\xfb\xdc\x88\xf7Aa\x1f\x83\xfaX\xdc\xa7\x7f\xe1\x13\xcb~\xa0p\xe1K\xdcK\xe9\xea\x83\x11~Y\xd1\xc0\x87u\xf8\x12\xe1/"B\xea}>_\xf2\xa9b}j\x01\xbf\xc0\x0cy\x96\x0e\xd5\xf7\xa5\x00\x10\x92\xed\xbf\xf0bN{\xfc\x0e?\x83\xdf\xfb\x94\xf0>=\x1f\x9f\n\xc1\xa7\xe7\xe3\xd3"\xf1q\x19\x9f\xfbZ>\xc7L>W\xe3|\xf1\x08a\xbd\xbex\x84d.\x9fF\x84Oq\xe8\xe3S\xfe\x9e\xb7Au}\x9af>\xd0\xe3C@|r\x91\xbfd\x91\xe2i\xbfE\xa47\xf3|\xf2)1\xe73\x01\xf3\x8co<\x8b9\x9fE\xa4_\xf5La\xf6\x0c\xbd}~V\x13\xfd#\x88$\x14\xfa\x1f.\xc5?\x8b1\xa4)\xf1\x0c\xb3\x99Zh0\xe5lc\x8a\xafN9?\x9d\x02ISh\xfa\x94\xb5O\xc1\xa1)\xa11\xc5\x99\xa7\xc0\xd7\x14o\xbfg\x86{\x1a\xf6\xf7\xf4Y\xef\xef\xf4m\xf79]\xef=Pw\x0fN\xdd\x83^\xf7|\xe0t\x0f\xd2\xdd\x0bzIk\xf4\x1eL\x9bb\xfb)\x1f\xd5Ma\x86\xd3\xa1b\xc4\x14\xc0\x99\x02oS\xe0mJG\x7f\n\xeb\x9d\x92J\xa6P\x87)04\xe5\xb6\xea\x14\xef\x99\xc2d\xa6$\xb9)e\xd9c\xa0\x0e\xf1\xe8+L=J\xf8J[\xf3\x99\xf3\xd5GV\xf6(K\x17\xa2\xf2\x88C<ri\xf4\x11k>b\xa1,*1\x0c\xf8\xafM\x80?c\xf0\xcf\x18\xfc3\xa3?\xe3\x1c\x9f/x\xca\x8d\xa1\xcf\xa0\xe2\x92\x88Y\xa2\xaa%Lo\x89~\x96\x1bDBu\x89\xaa\x96\\D^\xd2\x96\xfcl/~I\xd5\xb4D-K\xd8\xe2\x12;/\xb1\xfe\x92\x84\xb5D\xc7K>\xbf\\b\xfd\x1b\xf2\xe7\xd2\x8a\xbf%j[\x12\x1cK\xd8\xc1\x92\xfe\xc5\x92P\\\xc2:\x96\x98i\x89\x8a\x97(\xfe\x86\xa7\x01c\x03W!\'\xb0\x06h\x88\x9b\x80,\x16\x80\x0c\x01\x9d\x95\xe0\xb4\r\xf1\xb6\x806_@\x9a\x0fh\xf3\x05c\x8d\xe6\x00\xfa\x15\xd0Y\t\xf8\x10"\xe0\x849\x80\xd6\x05 n@\xfb+ u\x07DR@\xc6\x0f$P\xaa"rn\x15\xd4\x11\xb9\x04\x10Ty\xca\xf5\xc5\xa0\xac0\x1cH\xd2\x14\n\x1d\x94\x18\xcb\xd7\xb2\x01\x07\x04A\x01M\xf1\xe1l\xe0\xf1TR\xa9\xa4\x82\xa0\xc3+\xc8\x94\x01\xb7\xc1\x03:\xdc\x01UE\x10\xaaO\x05Z`\x98\x1en\xd2\xe3\x10\xbb\x87\r{\xd8\xbb\x87\x9b\xf4\xf0\x8d\x1e\xde\xd5\x83\xfd\xf7\xbe2\x16\xaf\xed\xbd\x02v\xbd\x81Z\xa0\x07\\\xf6F\x0c\x80\x8f\xf7z\x0c\x00\x18{TZ=\x82\xab\x97j\x18\xf5\xc6LF \xf6h\x9f\xf56\n\x97=\xdc\xa4\xf7\xc6\xcap\xa9\x1e\x05F\x8f\xa6m\x0f\xe8\xb8\xb0Ab{\xfaC\xc0\xd3\xa13ra5)\xb7\x84\xf0\x05J\xbe@\xc9[\x14wA$]X7E/2\x1c\rl\xad\x1f2\xdd\x96\x8b}[\x8e\xd5\xb6\xd8w\x0b\xa6n\x7f\xf2\xbe\xba:\xcbE\x11\xd1G,!\xfe\x97=]p\'\xec\xa2\xa3\xe2\x16%m\x856\t\xff\xd9\nmz\x17\x91\x8b\x9c[\xda\x8d[\x94\xbf\xc5$\x17\t\xf3\x02\xf7[\x92\xc0\x16\x1e\xb8\x05S\xb6|c\xbe\xa5\'\xba\xe5\x90xK\x83uK\xf9\xb7\xa5\xed\xb5\xe5\xde\xfeVPI\x9aV\xdbX]hK\xf1\xb1\xed)\xae\xb5\x0e\xba\x9c\x16m/\xcf\xeaA\xb6V\xaa\x93{\x0b\xed[\xb4\x17Zd\x94\x16I\xb9ES\xb9\x05]\xf5\x08\xe3\x960\xedc\xef\xdbx\x1c\xc3\xb4\xba\x8a\t-\xb1\x91\x90\xf9\x96\x80\x86\xd4\x0b-\x81\x12\xa9\x17<q*\xb9l\xdd\x82t{\xe2T\xc2*[\xfc\xb3\x82\x16\xa7\x04-N\xc8Z\x94\x19\xad\no\xa3\xa0hq\x87\xbf\x05qm\t\xf4\xc9)\x96WPP\xf6\xf2\xac\xc1\xfa\x19q\xe2q\x19\xc3\x13\x0f\x15\xa6\xe3Uto\x1e\xb7\r<\xaa\x1e\x0f\x84\xf7X\xba\xc7\xb1c\xcb*\xde\xbc\xa6\xc6\xa2\x17\xb1`\xce\x19<\xa0\xd8\xa3\xc0\xf1:<}\xd2\xdd{\x94H\xde3O_P\x8f\xa3\x9e\xdf"j\xbd\xbeb\xa3\x07/\xf5\x06\n}\xde\x08\x91\xa3\x05\x0f\x14\xf4\xe8cyP\x97\x16\xf7\xe8<\xd0\xd5\xe3h\xc1#v<J\x19\x8f\xa3c\x8f\x98\xf4V,\x92\xf3\x04\x8f\x00\xf7 f\x1e\x9f\xe3y\xf4R=>\xfc\x1c1\xd6\xa1\x976\x82\xef\x8e\xacf$k\x18\x81\x0b\x0e\xa1\xec\xf0\xbd\xbeC#\xd9\xa1\xbd\xecp\x99\xd2Ag\x0e\xd9\xcb\xa1m=\x02\xdd\x1c(\xdc\x88\xb3\x9d\xd1P\xb53"\xd3\x8d\xe8D8\xb0\x15\x87\x96\xc2\x88;\x98\x0e-n\xc7R\t\xc7\xed#\x8c\xe5\xf0\xa5\xd1\x88\xa5\x8f\xc6\xea\x04\x0e\x07\xd5\x0e\x9f\x0c9\x1cn8|t\xe4p\x10\xe2p<\xe2\xf0\xb9\xaf\xc3\xd7\xc1\x0e\xdf\t9|S\xe4p\xce\xe1\xf0\xfd\x91\xc3\x99\x88\xc3\xb7J\x0e\xe7\'\x0e\xdf\t9\x9c]8|S\xe4p\xce\xe1p\xfa\xe1p&\xe2pR\xe2\xf0\xad\x92\xf3\xc2+\x9e\x99\x8c\xd3\x8f\x11\xe1\xe4H>\x94v\x80c\x14+\x1c>\xffv\xfe\xf5!\x1a\'ct\xb2\x7f\x8eO\xa5\xdf\xe7\xc8\x89\xb7\x90=\'\x8b\xc8\xb5\xbf\x11\xd5\x8fC\xfev\xa4B\x95km\x0eu\xab\xc3\xb7\xec\x8e\x94\xbbR\x04\x8f(\x84\x1c)w\x856;R\x04Ki<\x82\xaa9R\xcd~\x11\x91\nc\x04\x81\x1bY\xe9\xe7\x1d\xa2\xf5N\xbd\xf2N&z\xc7\xbb\xde\xb9d\xf8\x0e\x1f\x7f\x87\xa5\xbf\x13#\xef\xef\x1a\xb2\xef\x94`74\x9b\x1cB\xf6f\xa0;z\x87\xd3\xbc\xbb\xbc\xcd\xda\xdcZ\r\xf7\x0ef\xbe\x83\x99m\x0e|\x1c\xf0\xea\x86\n\xff\x06]\xdf\xd0#\xb8\xa1\xefyC\x8f\xe0\x86/\xacnh\x9d\xde\xd0P\xbd\xa1\xf7pC+\xe4\x86\xf5>nu\x17\x0eHZ\x12\xbf\x17\xe4/\xd1\xe5/\xd1\xfb/q\x03\xa9D7\xbeTR\xff,q\xd7\xa8D]R\xa23X\xe2\xba\x7f\tU\x97\xb0E\x89{\x0f%\x0c[\xe2\xf3\x84\x12Ek\x89\xa3\xe6\x92u ^\x82\xaf\x96\xc4\x02R\x14\x948\xed)\xb9\xcc\xc6\x8d\xbb.\xed\xc9.]\xcd\xae,X\x9a\x80]z\x16]v\xdf\xa5\x90\xea\xc2R\xba\xa2\xbfS\xce\xee\xd28\xee\xe2\xa0].\x83t\xed\xcfA\xce!K)\xd0|N\xa4u\t\x99\xae\xab\xf6\xe8\xe2\xa2]\x8b/t\xf5\x03a\xd3\xa5L\xeeBZ\xba\x14\x02c\x9e\xce\xa8|g\xe4\x92\x19\xb7\x07f\xe4\x92\x19]\x8bY_w:\xa3\xee\x98Q\x1f\xcd\xb8:2\x9b1\xc3\\\x83c\xcd\xe6f\x84\xf8\x0cE\xccH\xc53\x92\xf9\x0c\x7f\x9e\xe1V3R\xf1\x8c+\xd93:\xa63\x90\xe1\x9c/\xd8g\x00\x91\x99Q\xa2\xce0\xc1\x8c\xae\xc7\x8c\x18\x9f\x11_3\xac1\x03Zg\xd6\xe6P\xfb\x0c\x18\x9ea\x81\x07&{`\xb2\x07y\xb1$\x93\x87\x07\x9erq\xf2\xe1Zq\xfa\xe1F\x01\xf7\x81\xcd=\\\xf1\x14\xecx\x00Q\x1e\x04;$\x83<\x08\xa2H/\xb2\xea|\xc4\xb8\xa9\xe2GUb\xaaj9]\x95\x05W\xd9Q\xf5\xa4V\x89\xaaj\xacJ\xa9R\xefT\xb1x\x15\x86X%\xca\xab\x90\x8e*uK\xd5\xd7x\xaf\x12\xc3\xd5\x9a\x06n\x95\xb8\xac\x86\x8aUU\xae\xe5U\xb9\xb1Y\x85\x13\x9f\x91\xc4\xcf:\xfa\xe2\xb3\xa6\xae\xec\x0c\x1ap\x161\x00\xd2q\xc6\xbf$;\xcb\xeb\x80\xefv\xad~\x86{\x9cQ\r\x9f\xd9C.\xf1\x95\xdfh\xb6\x85\xf8\x9b\xff\xfe\xd2\xa4Q\xd0\xdc \xc2T\x9b\x07u\xdd&`\xd4\x14#\xc8\x19@\x13\xf6\xd9\x9c\xa8\xb75Sf\x00\x80\x9b\xdc\x82lF\xaa\xcd\xa6hH0\xbe\xd9A$\xa34\xf9\xf8\xb6\xd9U\xfcmr\xa2\xd3\xa4\xbejr7\xb2)\x8a\x95z\xb0I\x1ai\xd2\x15kr\x81\xac\xe9\xf06"\xa9\x89\xce\x9a\x94LM\xeb\xf8\xac\xcf\xc7\xab\xfd\x89j\xb5\xcfU\xa8>t\xa4\x0fI\xe9S\x15\xf4\xa9\xc9\xfb\x16HR\xe6\xf4\xb9\x98\xd1\x07\x7f\xfa`U\x1f\x04\xeb\x93\x9c\xfb\xd8\xb0\xbfa26\xd7\'\xab\xf5\xd9g\x1f|\xeaS\x9c\xf7\t\xcb>\xf0\xd3\xc7\xd1\xfaV\x8b\xe0\x8d\x1d\xbd\xd1s~#X\xdf\xf8\x94\xfc\x8d\xb5\xbf\xb1\xe07\xdd\xa7y\xcb\x18\xfd\x19k\xcfc\xf0<\xdfB\xe5\xa9\xb8\xf3T\xc6\xf9@a$O\xb8\xe7\xdb\xcc\x00\x8d\xc9\x13\xf9y\x02;O\xea\xcd\xd3\xe7\xcb\xe3\xd7y6\x94\xe7\x7ft\xe5\xe9\xd2\xe5\xe9\xe0\xe6\xb1\xe1F\x9b&&\x0fH\xe692\xcbc\x97\xbc\x85\x97yL\xd0fD\x1b\xf5\xb4\x15}3#,\xd7\xde\xe8z\\\x98q\x9b\xfbDm\xc9\xab\xc2\xfd\xda3\x1d\xdb\x06D7\xd6\xcf\xba\n\xa2m)S\xe4\x18\xb6M7\xb7\xcd1M\x9bo\xdf\xda(\xb8\r\x18\xb4\xeb\x1a\xa9m1\x9c\xb0\xc7\xb6\x18NZ\x1am\xba\x1bmxb\x9b\xeb\x9b\xed\xa2\x86r\xfb\x87"@\xdbS#\xb7i\xcc\xb4\xf3\x1a\xcac4\xf9\x89\x1c\xfd\xc9\xba\xaf4\xe6\x9e\xd3\'\x98\xd6\'2\xf3\'\xeb\xbf6|\x02\x9c\xc7\xf0\xe81\x86\x19c\xae\xb15\x96W\x8f9\x14\x19C%>\xd9\xf0>\xb6\x0fY\x80\xe41~5\x06\xd4\xc7\xc0\xc4\x98\x92b\x0cL\x8c\xe1Gc\xf8\xd1\x98o#\xc7\xf4\xa5\xc7\xb0\xea1\x1cm\x0c]\x1ds\x9bjLwaL\x95:\x86\xad\x8f\xb9\xc60\x16\xca(g\xdd\xe3\x01\x1b\x02\r7P\xc6[J\xa0[\xa11\xc2<n\xa1&\xb7P\x93[\xbe\xbc\xbd\xcd\xa99n\xf9\xc7\x11\xb7\x14Q\xb7\xfc\x93\x89[\x8a\xa8[Lw\xcbY\xee\x85e\xf2[<~\x04t\x8e\xfeZ\xf4\xff\xfe\x1f\xfa\xddI\x97'))
global t
t=' '
def f(k):
 global t
 r=a[t+k]if t+k in a else'e';t=k
 return r

Skyler

Posted 2018-01-09T09:25:31.847

Reputation: 897

Quick explanation: The zlib.decompress('...') evaluates to {'G?':' ', 'G;':' ','G"':' ',.......}, and a is a dictionary which maps from 2 character to 1 character. Basically 2-character variant of Steadybox's answer.

– user202729 – 2018-01-10T08:53:52.580

1As I can see, the literal is 17780 bytes. You can reduce it to 11619 chars by removing whitespaces in the decompressed content, which saves 12322 bytes. (if I counted correctly) Also... converting hex escape codes to actual raw characters may save even more bytes. – user202729 – 2018-01-10T14:18:18.957

How do I post something here if it is raw bytes? – Skyler – 2018-01-10T15:13:40.123

1xxd, hexdump, uuencode, or similar – Peter Taylor – 2018-01-10T15:29:02.517

@user202729 Just note that Python code cannot contain actual raw NUL bytes. – mbomb007 – 2018-01-26T21:03:24.287

Haskell, (1904 + 1621 + 208548 + 25646) * 2 + 371705 = 847143

{-# LANGUAGE FlexibleInstances, DeriveGeneric #-}

import Control.Arrow
import Control.Monad
import Control.Monad.Trans.State
import Data.List

import System.IO
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Char8 as BC8
import Data.Ord
import Data.Char
import Data.Monoid
import Data.Maybe (fromJust, catMaybes)
import Data.Function
import qualified Data.Map as Map

import Codec.Compression.Lzma

import Data.Flat

import GHC.Word

maxWordLen :: Integral n => n
maxWordLen = 20

wordSeqDictSize :: Integral n => n
wordSeqDictSize = 255

predict :: [Trie] -> Char -> State ([Either Char Int], String) Char
predict statDict c = do
   (nextChar:future, begunWord) <- get
   case nextChar of
     Left p -> do
       put (future, [])
       return p
     Right lw -> do
       let wpre = begunWord++[c]
       put (future, wpre)
       return $ trieLook (tail wpre) (case drop lw statDict of{(t:_)->t;_->Trie[]})

newtype Trie = Trie [(Char,Trie)] deriving (Show, Generic)
instance Flat Trie

trieLook :: String -> Trie -> Char
trieLook [] (Trie ((p,_):_)) = p
trieLook (c:cs) (Trie m)
 | Just t' <- lookup c m  = trieLook cs t'
trieLook _ _ = ' '

moby :: IO (String -> String)
moby = do
    approxWSeq <- BSL.unpack . decompress <$> BSL.readFile "wordsseq"
    Right fallbackTries <- unflat <$> BS.readFile "dicttries"
    seqWords <- read <$> readFile "seqwords"
    let rdict = Map.fromList $ zip [maxWordLen..wordSeqDictSize] seqWords
    return $ \orig ->
      let reconstructed = approxWSeq >>= \i
             -> if i<maxWordLen then let l = fromIntegral i+1
                                     in replicate l $ Right l
                                else Left <$> rdict Map.! i
      in (`evalState`(reconstructed, ""))
              $ mapM (predict fallbackTries) (' ':orig)

Example:

Call me Ishmael. Some years ago--never mind how long precisely--having
 ap  me ,nhmael.  Hme ?ears |ce--never  usd how long .aacesely--|ubing
little or no money in my purse, and nothing particular to interest me on
little or no ?ivey in my ?efse, and ,uwhing .hrticular to Bdaenest me on
shore, I thought I would sail about a little and see the watery part of
?neae, I thought I would  cfl about a little and see the |rkers part of
the world. It is a way I have of driving off the spleen and regulating
the world. It is a way I have of ,uiving off the |kli   and .ia       
the circulation. Whenever I find myself growing grim about the mouth;
the Ca         . B        I  rtd |yself ,haoing  eom about the ?ivlh;
whenever it is a damp, drizzly November in my soul; whenever I find
Baieever it is a  'mp, ,uiv    Bar      in my  cfl; Baieever I  rtd

Uses three pre-computed auxiliary files:

seqwords contains the 236 most common words.
wordsseq contains an LZMA-compressed sequece of these words, and for all words not among the 236 most common, the length.
dicttries contains, for each word-length, a decision tree that contains all remaining words. From these tries, entries are picked as we go.

This way, we achieve significantly lower error rate than all the other lossy schemes; unfortunately, the wordsseq file is still too big to be competitive.

Here's a completed version that creates the files and does the analysis:

depunct :: String -> [String]
depunct (p:l) = (p:take lm1 wordr) : depunct (drop lm1 wordr ++ srcr)
 where lm1 = maxWordLen-1
       (wordr, srcr) = (`span`l) $ if isAlpha p
                 then \c -> isLetter c || c=='\''
                 else not . isAlpha
depunct []=[]

mhead :: Monoid a => [a] -> a
mhead (h:_) = h
mhead [] = mempty

limit :: [Int] -> [Int]
limit = go 0
 where go z (n:l) | z<100 = n : go (z+n) l
       go _ l = take 1 l

packStr :: String -> Integer
packStr = go 0
 where go n [] = n
       go n (c:cs)
        | c>='a' && c<='z'  = go (28*n + fromIntegral
                                   (1 + fromEnum c - fromEnum 'a')) cs
        | otherwise         = go (28*n) cs


mkTrie :: [String] -> Trie
mkTrie [] = Trie []
mkTrie strs = Trie [ (c, mkTrie . filter (not . null) $ tail<$>l)
                   | l@((c:_):_) <- sortBy (comparing length)
                                  . groupBy ((==)`on`head)
                                  $ sortBy (comparing head) strs ]

mkTries :: [String] -> [Trie]
mkTries rsrc = [ mkTrie $ filter ((==l) . length) rsrc
               | l <- [0..maximum (length<$>rsrc)] ]

main :: IO ()
main = do
    orig <- readFile "whale.txt"
    let wordchopped = depunct orig
        dictRes
          = take 5000
          . map mhead
          . sortBy (comparing $ negate . length)
          . group . sort
          $ wordchopped
        dict = Map.fromList $ zip dictRes [maxWordLen..wordSeqDictSize]
        rdict = Map.fromList $ zip [maxWordLen..wordSeqDictSize] dictRes
        approxWSeq = [ case Map.lookup w dict of
                        Just i -> i
                        Nothing -> fromIntegral (length w - 1) :: Word8
                     | w <- wordchopped ]
        fallbackTries = mkTries . drop (wordSeqDictSize-maxWordLen) $ dictRes
        reconstructed = approxWSeq >>= \i
             -> if i<maxWordLen then let l = fromIntegral i+1
                                     in replicate l $ Right l
                                else Left <$> rdict Map.! i
        predicted = (`evalState`(reconstructed, ""))
              $ mapM (predict fallbackTries) (' ':orig)
        incorrects = length . filter id $ zipWith (/=) orig predicted
    putStrLn $ "longest word: "++show(maximum $ length<$>wordchopped)
    putStrLn $ show incorrects++" errors / "++show (length orig)++" chars"
    BSL.writeFile "wordsseq" . compress $ BSL.pack approxWSeq
    BS.writeFile "dicttries" $ flat fallbackTries
    writeFile "seqwords" . show $ take (256-maxWordLen) dictRes
    writeFile "whale-approx.txt" . unlines $ coLines orig predicted

coLines :: String -> String -> [String]
coLines [] _ = [[],[]]
coLines ('\n':l) (_:m) = []:[]:coLines l m
coLines l ('\n':m) = coLines l ('|':m)
coLines (c:l) (d:m) = case coLines l m of
   (lt:mt:r) -> (c:lt):(d:mt):r

ceased to turn counterclockwis

Posted 2018-01-09T09:25:31.847

Reputation: 5 200

C++ (WIP), 1923*2 + 1017344 = 1021190

#include <map>
#include <random>
#include <string>
#include <type_traits>
#include <vector>

using namespace std;

constexpr minstd_rand::result_type seed = 10087702;

template<typename T>
class discrete_mapped_distribution {
private:
    discrete_distribution<size_t> distr;
    vector<T> values;

public:
    discrete_mapped_distribution() :
            distr(), values() {
    }
    template<typename I, typename = typename enable_if<is_arithmetic<I>::value,
            I>::type>
    discrete_mapped_distribution(map<T, I> distribution) :
            values() {
        vector<I> counts;

        values.reserve(distribution.size());
        counts.reserve(distribution.size());

        for (typename map<T, I>::const_reference count : distribution) {
            values.push_back(count.first);
            counts.push_back(count.second);
        }

        distr = discrete_distribution<size_t>(counts.cbegin(), counts.cend());
    }

    discrete_mapped_distribution(const discrete_mapped_distribution&) = default;
    discrete_mapped_distribution& operator=(const discrete_mapped_distribution&) = default;

    template<typename URNG>
    T operator()(URNG& urng) {
        return values.at(distr(urng));
    }
};

class generator2 {
private:
    static map<char, discrete_mapped_distribution<char>> letters;

    minstd_rand rng;

public:
    static void initDistribution(const string& text) {
        map<char, map<char, uint64_t>> letterDistribution;

        string::const_iterator it = text.cbegin();
        char oldLetter = *it++;

        for (; it != text.cend();) {
            ++(letterDistribution[oldLetter][*it]);
            oldLetter = *it++;
        }

        generator2::letters = map<char, discrete_mapped_distribution<char>>();

        for (map<char, map<char, uint64_t>>::const_reference letter : letterDistribution) {
            generator2::letters[letter.first] = discrete_mapped_distribution<char>(letter.second);
        }
    }

    generator2() :
            rng(seed) {
    }

    char getNextChar(char in) {
        return letters.at(in)(rng);
    }
};

map<char, discrete_mapped_distribution<char>> generator2::letters;

As it stand this solution is WIP and therefore ungolfed. Also considering that the actual code size barely has any impact on the score I thought I post my answer first before starting to micro optimize it.
(Full code available here: https://github.com/BrainStone/MobyDickRNG - Includes full program and seed search)

This solution is based on a RNG. First I analyze the text. I create a map that counts the occurrences of two consecutive characters. Then I create a distribution map. This is all done statically so should be in accordance of the rules.

Then while trying to print the text I do a lookup and pull a random character of the possible ones. While this usually produces worse results than just outputting the most common following letter, there could are likely god seeds that will produce better results. That's why the seed is hard coded. I'm currently searching for the best seed. And I will update this answer once I find better seeds. So stay posted!

If anyone wants to search for seeds themselves or use different RNGs feel free to fork the repo.

Method used to calculate score: https://github.com/BrainStone/MobyDickRNG/blob/master/src/search.cpp#L15

Note even though the total score is the worst at the moment, it beats the error count of just outputting spaces. And the chances are good that the score will drop, by checking more seeds.

Changelog

2018/01/24: Posted inital answer.
Checked seeds: 0-50000. Score: 2305*2 + 1017754 = 1022364
2018/01/24: Did some minimal golfing. Added link to score calculation method.
Checked seeds: 0-80000. Score: 1920*2 + 1017754 = 1021594 (-770)
2018/02/02: New seed (10087702) (didn't find the time to fix the submission)
Checked seeds: 0-32000000. Score: 1923*2 + 1017344 = 1021190 (-404)

BrainStone

Posted 2018-01-09T09:25:31.847

Reputation: 1 501

Could you include a test harness in your answer that evaluates the score? – Nathaniel – 2018-01-24T10:58:58.980

@Nathaniel I linked the score code directly. In addition with the repository would you consider this enough? – BrainStone – 2018-01-24T11:10:44.120

Upon reviewing the rules I have noticed that I violate some of them. I naturally will update my answer once I fix the issues – BrainStone – 2018-01-24T12:08:04.693

You will then end up encoding the text into the random seed. See esoteric programming language Seed, and you may want to reverse engineer the MT19937 program and beat this answer (if you can).

– user202729 – 2018-01-24T13:14:49.903

Nice idea, but won't help getting a good score. +1 anyway. – user202729 – 2018-01-24T13:41:01.953

@user202729 my solution actually got inspired by seed. – BrainStone – 2018-01-24T14:21:01.823

@Nathaniel is my approach of analyzing the text during startup acceptable (I know I need to add line that actually initializes it) or would I need to hardcore the matrix? – BrainStone – 2018-01-24T14:22:49.903

@BrainStone I'm afraid I don't know what "analyzing the text during startup" means. The important thing is that your code cannot make multiple passes through the text - it can't have access to any character of the text before it makes a guess about that character. (This is why I asked for the score calculating code. If I can see that then it should be more clear what's actually happening. I did look for it in the git repo, but it's not clear to me which part of the code does what job. If you could make a self-contained piece of code for that task it would help.) – Nathaniel – 2018-01-25T00:40:53.633

@Nathaniel I see. So then I have to hardcode my map. Good to know. Shouldn’t be too much of an issue. – BrainStone – 2018-01-25T04:07:01.040

Ruby, 1164418 (ouch)

I just wanted to see how well I could do without checking any other answers.
I'm not sure if this is allowed because it includes a literal I generated via analyzing the file, but even if it wasn't it's not like it was in danger of beating anyone.

x="\"ect,htabsdd,in,\\nodniwlrfydbulkm;f?ckgwvi0,.*pr;\\\"uz17klI\\n-c'WSpA\\nTwqu8.77!-BeWO5.4.CoP\\n\\\"UHEFu2.?-9.jo6.NI3.MaLYDOGoOAR'QUECziJoxp(\\nYa:\\nVI);K\\nUS*IZEX\\n&\\n$\\n_y[S\""
f=->n{(x.include? n)? x[x.index(n)+1] : ' '}

How I generated `x`

First, I generated a.txt with the following:

grep -o ".." whale2.txt | sort | uniq -c|sort -bn>a.txt

Then I generated a.csv:

cat a.txt | awk '{ print $1","$2 }'|sort -n|tac>a.csv

Then I parsed it into x with the following Ruby script:

f={}
File.open('./a.csv').each{|l|x=l.partition(',')
f[x.last[0..1]]=x.first}
n={}
r={}
f.each{|k,v|if((r.include? k[0]and v>n[k[0]])or not r.include? k[0])and not k[1].nil?
r[k[0]]=k[1]
n[k[0]]=v
end}
s=''
r.each{|k,v|s+=k+v}
puts s.inspect

How I scored

w=File.read('whale2.txt')
x="ect,htabsdd,in,\nodniwlrfydbulkm;f?ckgwvi0,.*pr;\"uz17klI\n-c'WSpA\nTwqu8.77!-BeWO5.4.CoP\n\"UHEFu2.?-9.jo6.NI3.MaLYDOGoOAR'QUECziJoxp(\nYa:\nVI);K\nUS*IZEX\n&\n$\n_y[S"
f=->n{(x.include? n)? x[x.index(n)+1] : ' '}

score = 235
w.each_line{|l|v=l[0];l[0..-3].each_char{|n|v+=f[n]};v.split(//).each_with_index{|c,i|if l[i]==c
print c
else
print '_'
score+=1

end}}

puts "FINAL SCORE: #{score}"

NO_BOOT_DEVICE

Posted 2018-01-09T09:25:31.847

Reputation: 419

I'm sure that's allowed; if you analyzed the file, good job! Only if the program does so is this invalid. – Erik the Outgolfer – 2018-11-01T18:19:45.463

@EriktheOutgolfer >_> (silently slides a "(non-competing)" into the title) – NO_BOOT_DEVICE – 2018-11-01T18:55:20.327

Why? If this is valid, it's competing, even though it may not beat much. If it's invalid (that is, your solution reads from the file, and doesn't just contain a literal), it should be deleted. – Erik the Outgolfer – 2018-11-01T18:57:00.530

Hmmm. I thought you meant if any program analyzed the file, and not just the solution. – NO_BOOT_DEVICE – 2018-11-01T23:42:56.753

1I can't read Ruby, but I think this is valid. Having the literal inside the program is completely fine, it's no problem at all. – Nathaniel – 2018-11-02T04:57:37.403

Python 3, (146*2+879757) 880049 bytes

def f(c):return"\n                     t \n 2  sS \n  -  08........       huaoRooe oioaohue thpih eEA \n   neo    enueee neue hteht e"[ord(c)-10]

Try it online!

Pretty straightforward frequency table. Each position in the string corresponds to the current character's ascii code (minus 10 = 0x0a = '\n', the lowest character in the file), and the character at each index is the most frequent next character. Assuming I calculated the frequencies right…

Tested with the code from user202729's test

Kevin

Posted 2018-01-09T09:25:31.847

Reputation: 501

Can you save some bytes by using def f(c):return(" ">c)*c or"t ... e"[ord(c)-32]? – Neil – 2018-01-11T11:32:32.183

[Python 3] (644449*2+0) 1288898 points

Perfect accuracy in only 644449 bytes

import zlib,base64 as s
t=enumerate(zlib.decompress(s.b64decode(b'###')).decode());a=lambda c:next(t)[1]

The full code cannot fit in an answer, so I have put it here and replaced the large binary string literal with b'###' in the answer text.

This is generated with the following code, where "modified.py" is the generated file, and "cheatsheet.txt" is the whale2.txt file starting at the second character.

import zlib, base64
with open("modified.py","w") as writer:
    writer.write("import zlib,base64 as s\nt=enumerate(zlib.decompress(s.b64decode(")
    with open("cheatsheet.txt","rb") as source:
        text = source.read()
        writer.write(str(base64.b64encode(zlib.compress(text,9))))
    writer.write(')).decode());a=lambda c:next(t)[1]')

The code can be executed by adding the following to the end of "modified.py". "whale2.txt" must be in the same directory as "modified.py", and the output will be written to "out.txt".

with open("out.txt","w") as writer:
    with open("whale2.txt","r") as reader:
        text = reader.read()
        for b in text:
            c = a(b)
            writer.write(c)

This answer does not directly access either whale.txt or whale2.txt. It makes use of existing standard compression libraries as explicitly allowed in the rules.

Legorhin

Posted 2018-01-09T09:25:31.847

Reputation: 131

there might be a "\r\n" in there that i couldn't get rid of in Windows when I was counting them – Legorhin – 2019-09-26T21:41:29.187

2yeah that was a typo that propagated – Legorhin – 2019-09-26T22:02:32.617

Write Moby Dick, approximately

Leaderboard

Bounties

Answers

///, 2*1 + 1020874 = 1020876

Node.js, 2*224 + 524279 = 524727

Notes

Test code

Change log

Perl, 2·70525 + 326508 = 467558

Predictor

Test code

Text comparison

Generating hints

Explanation

Model

Overall architecture

Why use a hint file?

Arithmetic coding

Why make errors?

The "rewind" mechanism

A little theorem

Putting it all together

Python 3, 2·267 + 510193 = 510727

Predictor

Test code

Python 3, 2*279+592920=593478 2*250 + 592467 = 592967 2 * 271 + 592084 = 592626 2 * 278 + 592059 = 592615 2 * 285 + 586660 = 587230 2 * 320 + 585161 = 585801 2 * 339 + 585050 = 585728

C++, score: 2*132 + 865821 = 866085

Python, 2*516 + 521122 = 522154

Algorithm:

Results:

Test code:

C (gcc), 679787 652892

Explanation

sh+bzip2, 2*364106 = 728212

Python 3, 879766

C#, 378*2 + 569279 = 570035

Java 7, 1995 characters, (1995*2+525158) 529148

Python 3, 2×497+619608=620602 2×496+619608=620600

Scoring

C++, 2·62829 + 318786 = 444444

Python 3, 526640

Python 2, score: 2*(407+56574) + 562262 = 676224

C++ (GCC), 725 × 2 + 527076 = 528526

Python 2, 756837

Haskell, (1904 + 1621 + 208548 + 25646) * 2 + 371705 = 847143

C++ (WIP), 1923*2 + 1017344 = 1021190

Changelog

Ruby, 1164418 (ouch)

How I generated x

How I scored

Python 3, (146*2+879757) 880049 bytes

[Python 3] (644449*2+0) 1288898 points

Python 3, 2279+592920=593478 2250 + 592467 = 592967 2 * 271 + 592084 = 592626 2 * 278 + 592059 = 592615 2 * 285 + 586660 = 587230 2 * 320 + 585161 = 585801 2 * 339 + 585050 = 585728

How I generated `x`