Everyone knows that the content makes the question. But a good title helps too, and that's the first thing we see. It's time to turn that first impression into a program, and figure out what kinds of titles get more upvotes.
You are hereby challenged to write a program or function that takes the title of a PPCG question as input, and returns a prediction of its score.
For instance, you might receive "Counting Grains of Rice" as an input, and you would be trying to return something close to the score, 59 in this case. Non-integer guesses are fine, but guesses at or below -20 are not.
Here is the data, for testing and scoring:
http://data.stackexchange.com/codegolf/query/244871/names-and-upvotes
Scoring: Your program will be run on every question in this site's (PPCG) history, not counting closed questions. The function ln(score + 20) will then be applied to each score, and to each guess. The root-mean-squared-error between the two resultant sets of values is your score. Lower is better.
For instance, a program that guessed 0 every time would score 0.577, while one that guessed 11 every time would score 0.362.
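Written out, the scoring rule described above amounts to the formula below. The symbols are notation introduced here for illustration, not part of the challenge text: N is the number of questions in the data set, s_i is the real score of question i, and g_i is the program's guess for it (assumed to be above -20, as the rules require).

```latex
\[
\text{score} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\ln(s_i + 20) - \ln(g_i + 20)\bigr)^{2}}
\]
```

As a single-term example, guessing 11 for "Counting Grains of Rice" (real score 59) contributes roughly (ln 79 - ln 31)^2 ≈ (4.369 - 3.434)^2 ≈ 0.875 to the sum before averaging.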
Please calculate your score and include it in the title of your answer. Please also include your program's prediction for how many upvotes this question will get.
Restrictions:
To prevent excessive hard-coding, no more than 1000 characters.
Must run on the entire data set above in under a minute on a reasonable machine.
Standard Loopholes are closed.
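A rough way to check an entry against the first two restrictions might look like the sketch below. The function name `check_submission` and its parameters are hypothetical, it counts characters rather than bytes, and what counts as a "reasonable machine" is left undefined by the challenge, so the timing result is only indicative.

```python
import time

MAX_CHARS = 1000     # length limit stated in the restrictions
MAX_SECONDS = 60.0   # "under a minute"


def check_submission(source_text, guesser, titles):
    """Return (within_length, within_time, elapsed_seconds) for a candidate entry.

    source_text: the program's source code as a string (hypothetical input).
    guesser:     a function mapping a title to a predicted score.
    titles:      an iterable of question titles to run the guesser on.
    """
    within_length = len(source_text) <= MAX_CHARS
    start = time.perf_counter()
    for title in titles:
        guesser(title)
    elapsed = time.perf_counter() - start
    return within_length, elapsed < MAX_SECONDS, elapsed
```

Calling it with the source of an entry and the list of titles from the data set gives a quick pass/fail on length and run time on the local machine.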
Here is a tester written in Python, for your use and/or to clear up ambiguities:
import sys
import math
import csv

scores_dict = {}

with open(sys.argv[1], 'r') as csv_file:
    score_reader = csv.reader(csv_file)
    for score, title in score_reader:
        if score == 'Score':
            continue
        scores_dict[title] = int(score)

def rate_guesses(guesser):
    def transform(score):
        return math.log(score + 20) if score > -20 else 0
    off_by_total = 0
    lines_count = 0
    for title in scores_dict:
        guessed_score = guesser(title)
        real_score = scores_dict[title]
        off_by_total += (transform(real_score) - transform(guessed_score)) ** 2
    return (off_by_total/len(scores_dict)) ** .5

def constant11(title):
    return 11

print(rate_guesses(constant11))
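The tester accepts any function that maps a title to a number, so a guesser need not be constant. A minimal sketch of one alternative follows, assuming a tiny hard-coded table with a constant fallback. The names `KNOWN_SCORES` and `lookup_guesser` are hypothetical, and the single table entry is the example quoted in the challenge text.

```python
# Hypothetical table; its only entry is the example given in the challenge text.
KNOWN_SCORES = {"Counting Grains of Rice": 59}


def lookup_guesser(title, fallback=11):
    """Return a hard-coded score when the title is known, else a constant."""
    return KNOWN_SCORES.get(title, fallback)
```

A function like this could be passed to `rate_guesses` in place of `constant11`. The 1000-character limit bounds how many such entries an answer can carry.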
19 Nice idea, but it's a shame the dataset isn't stable, so the scores will become invalid after a while. There's also a minor possibility of strategic voting: anyone who answers this question and earns a vox-populi badge in the same week should be viewed with suspicion! ;-) – Level River St – 2014-11-11T06:17:20.187
1 Will the title include or exclude things like [closed] and [on hold], where applicable? – es1024 – 2014-11-11T06:50:11.207
@es1024 No, it will not. Closed questions and on hold questions are not in the data set, I believe. See the link for details. – isaacg – 2014-11-11T06:56:42.227
4 @steveverrill Well, the flip side of that is that, as time progresses, we'll be able to see whether the programs do well on future posts as well as past ones. – isaacg – 2014-11-11T06:58:52.870
6 It's difficult to defeat hard-coding. Each hard-coded top-voted question can reduce the score by as much as 0.4. And there doesn't seem to be much of a common pattern either, haha. I'm predicting that the answers will just compete on how many hard-coded results they can fit in 1000 bytes. – justhalf – 2014-11-11T09:58:58.727
Are we allowed to use the entire list of titles when processing a single title? – Nathan Merrill – 2014-11-11T18:29:02.413
5 You should not use the complete body of questions as your test set. You should pre-select a certain number (10%-20%) at random, and define them as your test set (but not tell anyone what that is). It's much easier to make an algorithm that predicts past history than one that has future predictive value (i.e., one that works well on any given subset). (It would be even better to remove those 10% from what we can see at all, but that wouldn't really work well.) – Joe – 2014-11-11T20:55:19.020
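The hold-out idea in the comment above could be sketched roughly as follows. The function name `split_holdout` and its parameters are hypothetical, and the 10% default is just the lower end of the suggested range. A challenge host following the suggestion would keep the seed private.

```python
import random


def split_holdout(titles, fraction=0.1, seed=0):
    """Shuffle titles reproducibly and return (train, test) lists.

    fraction is the share of titles reserved for the test set;
    seed makes the split repeatable (an assumption for this sketch).
    """
    shuffled = sorted(titles)              # fixed starting order before shuffling
    random.Random(seed).shuffle(shuffled)  # seeded, so the same split every run
    n_test = max(1, round(len(shuffled) * fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

A guesser would then be tuned on the first list only and scored on the second, which gives some indication of how it behaves on titles it has not seen.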
@NathanMerrill Only if you can fit the entire list of titles in 1000 bytes. – isaacg – 2014-11-11T23:52:42.633
@isaacg: I think Nathan is referring to accessing a file containing the list of titles, like I did. – justhalf – 2014-11-12T06:09:27.623
@justhalf I guess I'll allow it for now, but I'm not sure how I feel about it. – isaacg – 2014-11-12T07:00:32.043