Let's Judge Some Books By Their Covers

Everyone knows that the content makes the question. But a good title helps too, and that's the first thing we see. It's time to turn that first impression into a program, and figure out what kinds of titles get more upvotes.

You are hereby challenged to write a program or function that takes the title of a PPCG question as input, and returns a prediction of its score.

For instance, you might receive Counting Grains of Rice as an input, and you would be trying to return something close to the score, 59 in this case. Non-integer guesses are fine, but guesses at or below -20 are not.

Here is the data, for testing and scoring:

http://data.stackexchange.com/codegolf/query/244871/names-and-upvotes

Scoring: Your program will be run on every question in this site's (PPCG) history, not counting closed questions. The function ln(score + 20) will then be applied to each score, and to each guess. The root-mean-squared-error between the two resultant sets of values is your score. Lower is better.

For instance, a program that guessed 0 every time would score 0.577, while one that guessed 11 every time would score 0.362.

Please calculate your score and include it in the title of your answer. Please also include your program's prediction for how many upvotes this question will get.

Restrictions:

  • To prevent excessive hard-coding, no more than 1000 characters.

  • Must run on the entire data set above in under a minute on a reasonable machine.

  • Standard Loopholes are closed.


Here is a tester written in Python, for your use and/or to clear up ambiguities:

import sys
import math
import csv

scores_dict = {}

with open(sys.argv[1], 'r') as csv_file:
    score_reader = csv.reader(csv_file)
    for score, title in score_reader:
        if score == 'Score':
            continue
        scores_dict[title] = int(score)

def rate_guesses(guesser):
    def transform(score):
        return math.log(score + 20) if score > -20 else 0
    off_by_total = 0
    for title in scores_dict:
        guessed_score = guesser(title)
        real_score = scores_dict[title]
        off_by_total += (transform(real_score) - transform(guessed_score)) ** 2
    return (off_by_total/len(scores_dict)) ** .5

def constant11(title):
    return 11

print(rate_guesses(constant11))

isaacg

Posted 2014-11-11T05:31:45.507

Reputation: 39 268

Nice idea, but it's a shame the dataset isn't stable, so the scores will become invalid after a while. There's also a minor possibility of strategic voting: anyone who answers this question and earns a vox-populi badge in the same week should be viewed with suspicion! ;-) – Level River St – 2014-11-11T06:17:20.187

Will the title include or exclude things like [closed] and [on hold], where applicable? – es1024 – 2014-11-11T06:50:11.207

@es1024 No, it will not. Closed questions and on hold questions are not in the data set, I believe. See the link for details. – isaacg – 2014-11-11T06:56:42.227

@steveverrill Well, the flipside of that is as time progresses, we'll be able to see whether the programs do well on future posts as well as past ones. – isaacg – 2014-11-11T06:58:52.870

It's difficult to defeat hard-coding. Each hard-coded top-voted question can reduce the score by as much as 0.4. And there doesn't seem to be much of a common pattern either, haha. I'm predicting that the answers will just compete on how to fit as many hard-coded results as possible into 1000 bytes. – justhalf – 2014-11-11T09:58:58.727

Are we allowed to use the entire list of titles when processing a single title? – Nathan Merrill – 2014-11-11T18:29:02.413

You should not use the complete body of questions as your test set. You should pre-select a certain number (10%-20%) at random, and define them as your test set (but not tell anyone what that is). It's much easier to make an algorithm that predicts past history, than one that has future predictive value (ie, one that works well on any given subset). (It would be even better to remove those 10% from what we can see at all, but that wouldn't really work well.) – Joe – 2014-11-11T20:55:19.020

@NathanMerrill Only if you can fit the entire list of titles in 1000 bytes. – isaacg – 2014-11-11T23:52:42.633

@isaacg: I think Nathan is referring to accessing a file containing the list of titles, like I did. – justhalf – 2014-11-12T06:09:27.623

@justhalf I guess I'll allow it for now, but I'm not sure how I feel about it. – isaacg – 2014-11-12T07:00:32.043

Answers

Python 2, 991 chars, score 0.221854834221, predict 11

import base64
e={}
d=base64.decodestring('vgAcRAEVDAIsqgQYalYaggcjQKwVXAoZWAsYQg0Ckg4VlWEX9hEDRhMe0hckCBkeuhsW3CAWQiEm\nSiMZMiwgTDAZZjIcSLMZfDQDnjwCe2AVaEQCYWEBIEgnDEoXzk0e3lQb5FYVKlkVZlwB+F0XwmI/\nGmRcuGUXWmYdVGkbzmwafG8eaHMdInkggHonRn5sKoMXgIkpbowVOI4cNJAubpQdYpcydJgVypkA\nZpweMp8ZsqEcRKMghKQYkKVPPXEWMqkWHKwbjrEdzLIBNLMf1LQivrYC99UV9rxNRsABNMEiPzob\npc0ActAhn3gcrNUZYNcWYNov/t8VgOEXAuMYqOUWsqUiCPIefPWNbvtKevwWvP0Cl9UsjQMdWwQb\nfQdpJQgWYwkCZRLBjxMWWdkqHRkWNxwB6x8p2SEZyyICSyYcgysaOS0CUy8hezAaGeEVpTRQ3zUz\nZzkZRzohizwwST4c8UAdF0OqG9AXIuEYYRN6208nU1AktVEVJ1IVWeMkIVQXdY4D2VYYD/cYb1om\nG1xA0zoY3uUaRWAhWpBSHWUXQTxGe+cad20CO3AZX3EBw3IiMXcef3gecXsVGXwhw30VbX4W24BD\n6qyQ45YaYYgZ4YobbYwauY4bMY82HZEdO5YmQ5cV35sVxaMbY6gYNas576ws+bADO7QpN7hdLJ8B\n4Eoks8EYX8VU68cYWfcar82QOdAaxdEfQ9UiW/kXL9k2ddwCW90m694enqUCkeEBE+IYWvsfA1FC\nJ+spMVIjhe4WEP0fAfYax/c3MfgbgfkqP/0DLf4V\n')
for i in range(0,600,3):
 e[ord(d[i])*256+ord(d[i+1])]=ord(d[i+2])*2-8
def p(t):
 return e.get(hash(t)%256**2,11)

Explanation:

This is shameless hardcoding, but trying to do it efficiently.

Preprocessing:

In a separate script, I hashed each title to a value between 0 and 256^2-1. Let's call these values bins. For each bin, I calculated the average score. (The average is needed because for a tiny fraction of the bins there are collisions - more than one title hashes to the same bin. But for the vast majority, each title maps to a bin of its own.)

The idea behind the 2-byte code for each title is that 1 byte is not enough - we get too many collisions, so we don't really know what score to assign to each 1-byte bin. But with 2-byte bins, there are almost no collisions, and we effectively get a 2 byte representation of each title.

Then rank the bins - compute the gain in score if we assign this bin its computed value, instead of just guessing 11. Take the top N bins, and code them into a string (which is d in the actual code).

The encoding: the key of the bin is coded as 2 bytes, and the value is encoded using 1 byte. I found values between -8 and 300-something, so I had to squeeze a bit to fit them into 1 byte: x -> (x+8)/2.

Actual code:

Read d as byte triplets, decoding the encoding explained above. When a title is given, compute its hash (modulo 256^2); if that key is found in the dict, return the value it maps to. Otherwise, return 11.
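The preprocessing can be sketched roughly like this — a hypothetical Python 3 reconstruction, not the actual script used. Note that Python 3 salts `hash()` for strings per process, while the answer's Python 2 `hash()` was deterministic; the real table was built so its keys match the runtime hash. The ranking here uses distance from the fallback guess as a stand-in for the "gain in score" computation described above:

```python
import base64

def build_table(titles_and_scores, n_bins=256**2, top_n=200, default=11):
    # Hash every title into one of 256^2 bins; average colliding scores.
    bins = {}
    for title, score in titles_and_scores:
        key = hash(title) % n_bins
        bins.setdefault(key, []).append(score)
    averaged = {k: sum(v) / len(v) for k, v in bins.items()}
    # Rank bins by distance from the fallback guess (a stand-in for the
    # real "gain in score" ranking), keep the top N, and pack each bin
    # as a (key_hi, key_lo, value) byte triplet.  The value encoding
    # x -> (x + 8) // 2 assumes scores roughly in [-8, 500].
    ranked = sorted(averaged, key=lambda k: -abs(averaged[k] - default))
    packed = b"".join(
        bytes([k // 256, k % 256, (int(averaged[k]) + 8) // 2])
        for k in ranked[:top_n]
    )
    return base64.b64encode(packed)
```

Decoding triplets with `value * 2 - 8` then recovers each bin's score to within the ±1 quantization error of the halving step.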

Ofri Raviv

Posted 2014-11-11T05:31:45.507

Reputation: 206

A suggestion: the average score is not that good. Look at the challenge's scoring function; it's non-linear. – Deduplicator – 2014-11-12T01:04:46.747

@Deduplicator Thanks, I realised that after I was done. The thing is, for 99% of the bins there are no collisions, so the average is actually just the score of the single title that maps to that bin. – Ofri Raviv – 2014-11-12T07:13:25.607

Javascript ES6

Score: 0.245663
Length: 1000 bytes
Predicts: 5

(I guess the question is due for an unexpected avalanche of downvotes. :P)

Minified

E="ABCDEFGHIJKLMNOPQRSTUVWXYZ";E=E+E.toLowerCase()+"0123456789!@#$%;*()";P="RCRHIFMGPGIDQKHMJLLRMLFJGKHEqHPOgJNKGPCHPJNUPOSGJQKKKMELMIJHLKIKNcKDOfSJLFHDILGKIOUKLMLLKMKIHGHJGIIJDGJKIHIIFIGMTIHFJMIKDJGQJKGMKJHPRJPLMGIOPIIIIPBYFMGLEIKIMMRUKFLFGJFKFTHPFODEQTGTHKIJDHNJGDGiHKNYLHHDLJHGILOEViKNEKGQZfHJMIISIJFRHKLJMLPIFJILKKKJKKJESHNLLIKIGKGJJJHKJRGULLSLRONJNEeLKIQGGPQIHPLEIHHIDXjQHNBKOGWWIENRbYoHINHNMKTNKHTGMIPXFJLLMLIHPPLDJJKFUMIQMOKJLJJKIJKNLHRILJIAIbJEZOGIELGHGLOINDPJMJeJWRIIJHSOZDOsOFRRIOIOTIJSGGJKFUIDOINGOcLQEJFEITLMNIIGIGIMG7LPSNLKVOKIFGHJITGOFUJIIRN";K={};"99r1501vz076mip077myv0733it280exx081gt9118i1g279dyx102uho203ejd07433z087uje097kdg1567ft2088rk275dmu1203ez106lih1763ei126f6q101aax084owh088aid161i9y179gvn236ptn3338vf132i55080fke101l4z3789ai281ulm081blm124euz074o5m07513z14117l095qdn092gl30757n5153".replace(/(...)(...)/g,(_,a,b)=>K[a]=1*b);D=40655;N=479;H=(s,T)=>(h=7,[...s].map(c=>h=~~(h*T+c.charCodeAt(0))),h);S=s=>(y=H(s,31),K[((y%D+D)%D).toString(36)]||E.indexOf(P[(H(s,48)%N+N)%N]));

Expanded

E = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
E = E + E.toLowerCase() + "0123456789!@#$%;*()";
K = {};
P = "RCRHIFMGPGIDQKHMJLLRMLFJGKHEqHPOgJNKGPCHPJNUPOSGJQKKKMELMIJHLKIKNcKDOfSJL" +
    "FHDILGKIOUKLMLLKMKIHGHJGIIJDGJKIHIIFIGMTIHFJMIKDJGQJKGMKJHPRJPLMGIOPIIIIP" +
    "BYFMGLEIKIMMRUKFLFGJFKFTHPFODEQTGTHKIJDHNJGDGiHKNYLHHDLJHGILOEViKNEKGQZfH" +
    "JMIISIJFRHKLJMLPIFJILKKKJKKJESHNLLIKIGKGJJJHKJRGULLSLRONJNEeLKIQGGPQIHPLE" +
    "IHHIDXjQHNBKOGWWIENRbYoHINHNMKTNKHTGMIPXFJLLMLIHPPLDJJKFUMIQMOKJLJJKIJKNL" +
    "HRILJIAIbJEZOGIELGHGLOINDPJMJeJWRIIJHSOZDOsOFRRIOIOTIJSGGJKFUIDOINGOcLQEJ" +
    "FEITLMNIIGIGIMG7LPSNLKVOKIFGHJITGOFUJIIRN";

   ("99r1501vz076mip077myv0733it280exx081gt9118i1g279dyx102uho203ejd07433z087u" +
    "je097kdg1567ft2088rk275dmu1203ez106lih1763ei126f6q101aax084owh088aid161i9" +
    "y179gvn236ptn3338vf132i55080fke101l4z3789ai281ulm081blm124euz074o5m07513z" +
    "14117l095qdn092gl30757n5153").
        replace( /(...)(...)/g, (_,a,b) => K[a] = 1*b );

D = 40655;
N = 479;
H = (s,T) => (
    h = 7,
    [...s].map( c => h = ~~(h*T + c.charCodeAt(0)) ),
    h
);

S = s => (
    y = H( s, 31 ),
    K[((y%D + D)%D).toString( 36 )] || E.indexOf( P[(H( s, 48 )%N + N)%N] )
);

The function S accepts a string (title) and returns its score.

Notes on Behaviour:

  • ≤ 70 vote titles are handled separately from > 70 vote titles
  • ≤ 70 vote titles are sorted into bins using a highly sophisticated holonomic keyword tracking potential optimization algorithm that in no way resembles a string hash function
  • after a bit of happy calculus it turns out that the optimal guess for each bin is simply the geometric average of the (votes + 20)'s for all titles in the bin, minus 20
  • the optimal guesses for all 479 bins are then encoded as a 479-character base-70 string
  • for > 70 vote titles, the titles are assigned unique 3-digit base-36 codes generated using a state-of-the-art hashing technique that guarantees no collisions with other > 70 vote titles and no false detections of ≤ 70 vote titles. This state-of-the-art technique in no way resembles trying random bin counts until one produces no collisions.
  • the > 70 vote title codes and their vote counts are encoded in a string (6 bytes per title), which is converted to a simple lookup table. The routine therefore has zero error for all > 70 vote titles.
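The "happy calculus" claim above checks out: minimizing the sum of squared differences in ln(g + 20) over a bin is an ordinary least-squares problem in the transformed variable, so the optimum is the arithmetic mean of the ln(s + 20) values — i.e. the geometric mean of (s + 20), minus 20. A small sketch (not COTO's code; the actual bins aren't reproduced here):

```python
import math

def optimal_bin_guess(scores):
    # Minimizing sum_i (ln(g+20) - ln(s_i+20))^2 over g is least squares
    # in ln(g+20), so the optimum is the mean of the ln(s_i+20) values:
    # the geometric mean of the (s_i + 20)'s, minus 20.
    logs = [math.log(s + 20) for s in scores]
    return math.exp(sum(logs) / len(logs)) - 20
```

Any other constant guess for the bin strictly increases the challenge's RMSE contribution.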

COTO

Posted 2014-11-11T05:31:45.507

Reputation: 3 701

Python 2, Score = 0.335027, 999 chars, Predict 11.34828 for this question

Just to get the ball rolling. This is nowhere optimal.

The fancy SVM thing is just my random idea and I felt like implementing it, so here it is. It does improve the baseline by 0.02 points, so I'm happy enough with that. But to show that hard-coding the input is where the majority of the improvement comes from, I'm hard-coding some answers as well.

Without the hard-coding the score is 0.360 (and actually all the predictions are around 11, haha)

I'm using scikit-learn and nltk

import sys
import math
import csv
from sklearn.feature_extraction.text import TfidfVectorizer as TV
from sklearn.svm import SVR
from nltk.stem.porter import PorterStemmer as PS
sd = {}
lr = None
tv = None
with open(sys.argv[1], 'r') as cf:
    sr = csv.reader(cf)
    for sc, t in sr:
        if sc == 'Score':
            continue
        sd[t] = int(sc)
ts,scs = zip(*sd.items())
s = PS()
def analyzer(st):
    r = []
    for word in st.split():
        if word.isdigit():
            r.append('<NUM>')
        elif not word.isalnum():
            r.append('<PUNCT>')
        else:
            r.append(s.stem(word.lower()))
    return r
tv = TV(min_df=25, stop_words='english', analyzer=analyzer)
ti = tv.fit_transform(ts)
lr = SVR()
lr.fit(ti, scs)
d={'4 w':378,'y 42':333,'eeta':280,'an Got':279,'s 2 ':275,"e I'":208,'r CP':203,'? No':179,'B.O':156}
def c(t):
    for s in d.keys():
        if s in t:
            return d[s]
    t = tv.transform([t])
    r = lr.predict(t)[0]+1.5
    return r

justhalf

Posted 2014-11-11T05:31:45.507

Reputation: 2 070

I'm not sure I understand - you read the scores from an external file? So why not just predict sd[t]? This would give a score of 0... – Ofri Raviv – 2014-11-11T16:37:05.970

Because that would not be fun =p – justhalf – 2014-11-11T23:19:29.177

Python 2, 986 chars, score 0.3480188, predict 12

M,S=14.29,23.02
D=lambda x:[ord(y)-32 for y in x]
w='fiLoNoNuMiNoTiMoShOnLoGoLeThRantexgensuBaSqUnInPeReGaMuLinprOuThInThEvEnClOpExPyThADdiSoLiSiMaTrEqUsImAsCoManstARrePoWoReawhapoLandradeTyPaLOsoReCreprediVeReSpebeTiPrImPladaTrihelmakwhicolpaReValpaTrafoROsoumovfinfunpuzyoufaSeQuiwhecoDeChagivcouchehanpoStrdiGraconjavwricalfrowitbinbrafrabuipoi'
for i in range(26):w=w.replace(chr(65+i),chr(97+i)*2)
w=[w[i:i+3]for i in range(0,372,3)]
m=D("(+),+)+=*...+..++'(*)5)/2)*43++16+0,+33*4*/(0.'+>-)+13+(2?8+6;,3;43+4(+.('(,.*.*+56+6)0((++(B,))+E0,-7/)/*++),+***)2+(3(.*)'")
s=D("))'B)'*j+:51(*3+0')+)^'/<-+MG,-1=),-0:A+T&J&K1%,O*()4Y-%:_A.=A*C]MJ-N%)5(%%-0+).*3Q(M&0'%(+$p*)()a8:-T?%5(-*'/.'+)+@,'J&1'&&")
def G(x,m,s):s=s or 1e-9;return(.4/s)*(2.78**(-(x-m)**2./(2*s*s)))
d={w[i]:(m[i],s[i])for i in range(124)}
def Z(t,c):
 p=1
 for W in t.split():
  if len(W)>3:
   W=W[:3].lower()
   if W in d:p*=G(c,*d[W])
 return p*G(c,M,S)
def J(t):return max([(Z(t,r),r)for r in range(-9,99)])[1]

The relevant function is J.

The program is essentially Naive Bayes using title words as features, but it's extremely restricted thanks to the char limit. How restricted? Well...

  • For each title we convert to lower case and only look at words at least 4 letters long. Then we take the first three letters of each of those words as features. We skip stripping punctuation to save on chars.
  • We choose only letter triplets which are at the start of at least 19 words (these are stored in w above). Compression is done by rearranging the triplets so that as many doubled letters are present as possible, and these pairs are replaced with their corresponding ASCII uppercase (e.g. fiLoNoN... → fil, lon, non, ...)
  • For each triplet, we look at the scores of the titles in which it appears and calculate the mean and standard deviation of the scores. Then we convert those to integers and store them in m, s above, by using the fact that the mean/sd are at most 90 (allowing a direct ASCII encoding, since there are 95 printable ASCII)
  • G is the normal distribution function - we round e to 2dp and the inverse square root of 2 pi to 1 dp to save on chars.
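The doubled-letter trick from the second bullet round-trips like this (a sketch of the pack/unpack pair only, not Sp3000's preprocessing; the harder step of reordering triplets to maximise doubles is omitted):

```python
def pack(triplets):
    # Concatenate the 3-letter features, then replace each doubled
    # lowercase letter with the corresponding single uppercase letter,
    # mirroring the answer's decompressor ('L' expands back to 'll').
    s = "".join(triplets)
    for i in range(26):
        s = s.replace(chr(97 + i) * 2, chr(65 + i))
    return s

def unpack(w):
    # The decompression loop from the answer: expand uppercase letters
    # back into doubled lowercase, then cut into 3-letter triplets.
    for i in range(26):
        w = w.replace(chr(65 + i), chr(97 + i) * 2)
    return [w[i:i + 3] for i in range(0, len(w), 3)]
```

For example, `["fil", "lon", "non", "num"]` packs to `"fiLoNoNum"` — nine characters instead of twelve, which is exactly the shape of the `w` string above.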

Altogether the extreme char limit made this one of the worst ideas I've ever come up with, but I'm pretty happy with how much I managed to cram in (even though it doesn't work very well). If anyone has better ideas for compression, please let me know :)

(Thanks to KennyTM for pointing out my pointless compression)

Sp3000

Posted 2014-11-11T05:31:45.507

Reputation: 58 729

Unless I've executed the code wrongly, your compression code is even longer than the decompressed result... w='grge…scse';w=[w[i:i+2]for i in range(0,len(w),2)] is 165 bytes while yours C=lambda:…;w=C('…') is 179 bytes. – kennytm – 2014-11-11T15:49:46.450

@KennyTM Oh thanks - I've been messing around with the code so much, trying to fit the char limit that I'd lost track of all the compressing. :P – Sp3000 – 2014-11-11T15:54:01.533

Python 2, 535 chars, score 0.330910, predicts 11.35

Average the score for titles containing each word, then use the top and bottom 50 words to possibly modify the BASE score in the guess(title) function.

Python code:

import csv
import re
import sys

BASE = 11.35
word={}
for sc,ti in csv.reader(open(sys.argv[1])):
    if sc == 'Score': continue
    parts = re.findall(r"[-a-zA-Z']+", ti.lower())
    for p in parts:
        a, c = word.get(p, (0.0,0))
        word[p] = (a+int(sc), c+1)

rank = sorted([(sc/ct,wd) for wd,(sc,ct) in word.items() if ct > 2])
adjust = rank[:50] + rank[-50:]

def guess(title):
    score = BASE
    parts = re.findall(r"[-a-zA-Z']+", title.lower())
    for sc,wd in adjust:
        if wd in parts:
            score *= 0.7 + 0.3*sc/BASE
    return score

Logic Knight

Posted 2014-11-11T05:31:45.507

Reputation: 6 622

C

Score: Unknown
Length: 5 bytes
Predicts: 5

Golfed:

int s(char *p){return 5;}

Ungolfed:

int s(char *p)
{
   return 5;
}

A query of the scores gives an average score of 5.

I've not got the ability to test it at the moment, others are welcome to run/edit.

Joshpbarron

Posted 2014-11-11T05:31:45.507

Reputation: 787

Further golfed: int s(){return 5;} – Joshua – 2014-11-12T21:22:35.340

"You are hereby challenged to write a program or function that takes the title of a PPCG question as input, and returns a prediction of its score." - Sorry but nope :0 – Joshpbarron – 2014-11-12T21:38:19.627

I've seen a platform by which if you forgot main(), your first function was main(). Maybe he's depending on that. – Joshua – 2014-11-12T22:08:51.207