301
68
Here is a 1.2Mb ASCII text file containing the text of Herman Melville's Moby-Dick; or, The Whale. Your task is to write a program or function (or class, etc. -- see below) which will be given this file one character at a time, and at each step must guess the next character.
This is code-challenge. Your score will be
2*L + E
where L
is the size of your submission in bytes, and E
is the number of characters it guesses incorrectly. The lowest score wins.
Further particulars
Your submission will be a program or function (etc.) that will be called or invoked or sent data multiple times. (1215235 times to be exact.) When it is called for the nth time it will be given the nth character of whale.txt
or whale2.txt
and it must output its guess for the (n+1)th character. The E
component of its score will be the total number of characters that it guesses incorrectly.
Most submissions will need to store some state in between invocations, so that they can track how many times they have been called and what the previous inputs were. You can do this by writing to an external file, by using static
or global variables, by submitting a class rather than a function, using a state monad, or whatever else works for your language. Your submission must include any code required to initialise its state before the first invocation.
Your program should run deterministically, so that it always makes the same guesses given the same input (and hence always gets the same score).
Your answer must include not only your submission, but also the code you used to calculate the E
part of its score. This need not be written in the same language as your submission, and will not be counted towards its byte count. You are encouraged to make it readable.
Regarding the interface between your submission and this score-calculating program, anything is fine, as long as your program always gives one byte of output before receiving its next byte of input. (So, for example, you can't just pass it a string containing all of the input and get a string back containing all of the output.)
You must actually run your test program and calculate/verify your score before submitting your entry. If your submission runs too slowly for you to verify its score then it is not qualified to compete, even if you know what its score would be in principle.
The L
component of your score will be calculated according to the usual rules for code golf challenges. If your submission will contain multiple files, please take note of the rules on scoring and directory structure in that case. Any data that your code uses must be included in your L
score.
You may import existing libraries but may not load any other external files, and your code may not access the whale.txt
or whale2.txt
file in any way other than described above. You may not load any pre-trained neural networks or other sources of statistical data. (It's fine to use neural networks, but you have to include the weight data in your submission and count it towards your byte count.) If for some reason your language or libraries include a feature that provides some or all of the text of Moby Dick, you may not use that feature. Aside from that you can use any other built-in or library features that you like, including ones relating to text processing, prediction or compression, as long as they're part of your language or its standard libraries. For more exotic, specialised routines that include sources of statistical data, you would have to implement them yourself and include them in your byte count.
It is likely that some submissions will include components that are themselves generated by code. If this is the case, please include in your answer the code that was used to produce them, and explain how it works. (As long as this code is not needed to run your submission it will not be included in your byte count.)
For historical reasons, there are two versions of the file, and you may use either of them in an answer. In whale2.txt
(linked above) the text is not wrapped, so newlines appear only at the end of paragraphs. In the original whale.txt
the text is wrapped to a width of 74 characters, so you have to predict the end of each line as well as predicting the text. This makes the challenge more fiddly, so whale2.txt
is recommended for new answers. Both files are the same size, 1215236 bytes.
To summarise, all answers should include the following things:
- Your submission itself. (The code, plus any data files it uses - these can be links if they're large.)
- An explanation of how your code works. Please explain the I/O method as well as how it predicts the next character. The explanation of your algorithm is important, and good explanations will earn bounties from me.
- The code you used to evaluate your score. (If this is identical to a previous answer you can just link to it.)
- Any code you used to generate your submission, along with an explanation of that code. This includes code that you used to optimise parameters, generate data files etc. (This doesn't count towards your byte count but should be included in your answer.)
Leaderboard
var QUESTION_ID=152856,OVERRIDE_USER=21034;function answersUrl(e){return"https://api.stackexchange.com/2.2/questions/"+QUESTION_ID+"/answers?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+ANSWER_FILTER}function commentUrl(e,s){return"https://api.stackexchange.com/2.2/answers/"+s.join(";")+"/comments?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+COMMENT_FILTER}function getAnswers(){jQuery.ajax({url:answersUrl(answer_page++),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){answers.push.apply(answers,e.items),answers_hash=[],answer_ids=[],e.items.forEach(function(e){e.comments=[];var s=+e.share_link.match(/\d+/);answer_ids.push(s),answers_hash[s]=e}),e.has_more||(more_answers=!1),comment_page=1,getComments()}})}function getComments(){jQuery.ajax({url:commentUrl(comment_page++,answer_ids),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){e.items.forEach(function(e){e.owner.user_id===OVERRIDE_USER&&answers_hash[e.post_id].comments.push(e)}),e.has_more?getComments():more_answers?getAnswers():process()}})}function getAuthorName(e){return e.owner.display_name}function process(){var e=[];answers.forEach(function(s){var r=s.body;s.comments.forEach(function(e){OVERRIDE_REG.test(e.body)&&(r="<h1>"+e.body.replace(OVERRIDE_REG,"")+"</h1>")});var a=r.match(SCORE_REG);a&&e.push({user:getAuthorName(s),size:+a[2],language:a[1],link:s.share_link})}),e.sort(function(e,s){var r=e.size,a=s.size;return r-a});var s={},r=1,a=null,n=1;e.forEach(function(e){e.size!=a&&(n=r),a=e.size,++r;var t=jQuery("#answer-template").html();t=t.replace("{{PLACE}}",n+".").replace("{{NAME}}",e.user).replace("{{LANGUAGE}}",e.language).replace("{{SIZE}}",e.size).replace("{{LINK}}",e.link),t=jQuery(t),jQuery("#answers").append(t);var o=e.language;/<a/.test(o)&&(o=jQuery(o).text()),s[o]=s[o]||{lang:e.language,user:e.user,size:e.size,link:e.link}});var t=[];for(var o in s)s.hasOwnProperty(o)&&t.push(s[o]);t.sort(function(e,s){return e.lang>s.lang?1:e.lang<s.lang?-1:0});for(var c=0;c<t.length;++c){var i=jQuery("#language-template").html(),o=t[c];i=i.replace("{{LANGUAGE}}",o.lang).replace("{{NAME}}",o.user).replace("{{SIZE}}",o.size).replace("{{LINK}}",o.link),i=jQuery(i),jQuery("#languages").append(i)}}var ANSWER_FILTER="!t)IWYnsLAZle2tQ3KqrVveCRJfxcRLe",COMMENT_FILTER="!)Q2B_A2kjfAiU78X(md6BoYk",answers=[],answers_hash,answer_ids,answer_page=1,more_answers=!0,comment_page;getAnswers();var SCORE_REG=/<h\d>\s*([^\n,]*[^\s,]),.*?(\d+)(?=[^\n\d<>]*(?:<(?:s>[^\n<>]*<\/s>|[^\n<>]+>)[^\n\d<>]*)*<\/h\d>)/,OVERRIDE_REG=/^Override\s*header:\s*/i;
body{text-align:left!important}#answer-list,#language-list{padding:10px;width:380px;float:left}table thead{font-weight:700}table td{padding:5px}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/codegolf/all.css?v=83c949450c8b"> <div id="answer-list"> <h2>Leaderboard</h2> <table class="answer-list"> <thead> <tr><td></td><td>Author</td><td>Language</td><td>Score</td></tr></thead> <tbody id="answers"> </tbody> </table> </div><div id="language-list"> <h2>Winners by Language</h2> <table class="language-list"> <thead> <tr><td>Language</td><td>User</td><td>Score</td></tr></thead> <tbody id="languages"> </tbody> </table> </div><table style="display: none"> <tbody id="answer-template"> <tr><td>{{PLACE}}</td><td>{{NAME}}</td><td>{{LANGUAGE}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table> <table style="display: none"> <tbody id="language-template"> <tr><td>{{LANGUAGE}}</td><td>{{NAME}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table>
Bounties
From time to time I'll offer bounties to encourage different approaches.
The first one, 50 points, was awarded to A. Rex for the best-scoring answer at the time.
The second, 100 points, was also awarded to A. Rex, for the same answer, because they added a very good explanation to their existing answer.
The next bounty, 200 points, will be awarded to either
- A competitive answer that uses a new technique. (This will be based on my subjective judgment since it's my rep that goes into the bounty, but you can trust me to be fair. Note that your answer needs to contain sufficient explanation for me to understand how it works!) Such an answer needn't take the top score, it just needs to do reasonably well compared to existing answers. I'm particularly keen to see solutions based on recurrent neural networks, but I'll award the bounty to anything that seems different enough from the Markov models that dominate the current top scores.
Or:
- Anyone else who beats A. Rex's top score (currently 444444), using any method.
Once the 200 point bounty is claimed I will most likely offer a 400 point one, updating the requirements accordingly.
Comments are not for extended discussion; this conversation has been moved to chat.
– Dennis – 2018-01-30T02:29:47.7879https://xkcd.com/1960/ appears to be a reference to this challenge! – A. Rex – 2018-02-26T13:49:55.810
I thought of compressing this... but it is a bit too long that my computer crashed shrug – Naruyoko – 2019-09-25T00:55:10.507