Who said that? 2016 Presidential election

In this challenge, your task is to write a program of fewer than 300 characters that takes a short paragraph or a few sentences that a candidate has said and outputs who said it.

Input: Can be taken as a parameter to a function, input to a program etc. It will be a short paragraph, properly punctuated.

Output: The candidate that you think it is. This could be one of

Ben Carson (1)
Ted Cruz (2)
John Kasich (3)
Marco Rubio (4)
Donald Trump (5)
Hillary Clinton (6)
Bernie Sanders (7)

I've left off the names of people who have dropped out as of March 1st. You may output the name itself, or, more conveniently, the number that corresponds to the name.

Scoring: Your score is the percentage of test cases you get right. Highest score wins. Ties (or perfect scores) are broken by code length as in a code golf.

The test cases can be pulled from:

http://www.presidency.ucsb.edu/debates.php

Click on each debate, both Democrat and Republican, that has happened so far (before March 1st). Every paragraph is a test case, unless the "paragraph" is less than 20 characters long.

Here is code that pulls out the test cases from a particular page:

var t = $(".tools").parentNode.querySelectorAll("p");
var categ = {}, cur = 0;
for (var i = 0; i < t.length; ++i) {
  var p = t[i], str = p.innerText;
  if (p.querySelector("b")) {
    cur = p.querySelector("b").innerText.replace(':', '');
    str = str.replace(/^.*?:\s/, '');
  }
  str = str.replace(/\[applause\]/g, '')
  if (str.length < 20) continue;
  if (categ[cur] == null) categ[cur] = [];
  categ[cur].push(str);
}

You can then do categ.SANDERS to get a list of all the paragraphs that Senator Sanders has said.

You can discard anything that isn't said by the candidates listed above (e.g. categ.BUSH or categ.CHRISTIE).

Here is the file with all the test cases: https://drive.google.com/file/d/0BxMn8--P71I-bDZBS2VZMDdmQ28/view?usp=sharing

The file is organized by candidate (a parsing sketch follows the format description below):

CANDIDATE CANDIDATE_LAST_NAME
(empty line)
Series of statements. Each paragraph is separated by (NEW PARAGRAPH)-
(empty line)
CANDIDATE NEXT_CANDIDATE_LAST_NAME
(empty line)
etc.
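A rough sketch of how that file might be parsed in Node.js (the file name and the exact "(NEW PARAGRAPH)" separator are assumptions based on the description above, so adjust them to match the actual file):

// Sketch of a parser for the test-case file format described above.
const fs = require('fs');

function parseTestCases(path) {
  const cases = {};                                  // LAST_NAME -> statements
  let current = null;
  const blocks = fs.readFileSync(path, 'utf8').split(/\r?\n\s*\r?\n/);
  for (const block of blocks) {
    const header = block.match(/^CANDIDATE (\S+)/);
    if (header) {                                    // "CANDIDATE <LAST_NAME>"
      current = header[1];
      cases[current] = cases[current] || [];
    } else if (current) {
      for (const quote of block.split(/\(NEW PARAGRAPH\)-?/)) {
        const q = quote.trim();
        if (q.length >= 20) cases[current].push(q);  // keep the 20-character rule
      }
    }
  }
  return cases;
}

// e.g. parseTestCases('quotes.txt').SANDERS -> all of Senator Sanders' statements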

An example partial submission would be:

if (/ win | wall | great | beautiful/.test(p)) return 5;
if (/ percent | top one | rigged /.test(p)) return 7;
// etc. for all candidates

or

var words = p.split(' ');
// fewer than 4 words have 5 or more characters (i.e. mostly short words)
if (words.length - words.filter(a => a.length < 5).length < 4) evidence[5]++;
// at the end
return /* index with the most evidence */ 
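Filled out, that evidence-based approach might look something like this (a sketch only; the regexes and the short-word heuristic are illustrative, not tuned against the transcripts):

// Sketch of a complete evidence-based classifier. The patterns are placeholders.
function classify(p) {
  const evidence = new Array(8).fill(0);             // indices 1..7 = candidates
  if (/ win | wall | great | beautiful/.test(p)) evidence[5]++;
  if (/ percent | top one | rigged /.test(p)) evidence[7]++;
  const words = p.split(' ');
  // fewer than 4 words of 5+ characters: weak evidence for Trump
  if (words.filter(a => a.length >= 5).length < 4) evidence[5]++;
  let best = 1;                                      // index with the most evidence
  for (let i = 2; i <= 7; i++) if (evidence[i] > evidence[best]) best = i;
  return best;
}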

Here is a place where you can test javascript solutions: https://jsfiddle.net/prankol57/abfuhxrh/

The code uses the parameter p to represent the phrase to classify. Example code that scores around 20% (guessing would get around 11%):

if (/ rigged | top | percent | Wall Street /.test(p)) return 'Sanders';
return 'Trump';
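To compute a score locally, a small harness over the parsed test cases might look like this (a sketch; parseTestCases is the hypothetical helper sketched earlier, and classify is assumed to return a number from 1 to 7):

// Sketch of a local scoring harness: percentage of quotes classified correctly.
const NUMBERS = { CARSON: 1, CRUZ: 2, KASICH: 3, RUBIO: 4,
                  TRUMP: 5, CLINTON: 6, SANDERS: 7 };

function scoreClassifier(cases, classify) {
  let right = 0, total = 0;
  for (const [name, quotes] of Object.entries(cases)) {
    if (!(name in NUMBERS)) continue;                // ignore dropped-out candidates
    for (const q of quotes) {
      total++;
      if (classify(q) === NUMBERS[name]) right++;
    }
  }
  return 100 * right / total;
}

// e.g. scoreClassifier(parseTestCases('quotes.txt'), classify)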

Exactly what I'm asking: Write a program/function in less than 300 characters that takes as input a phrase that a candidate has said and returns as output which candidate said it. Your score is the percentage of test cases you get right. Highest score wins.

Yes, I know that a lot of lines have [laughter] or [cheering] in them. These will not be removed. At worst, they are extra information you can ignore; at best, they are extra information you can use (e.g. I made this up, but maybe laughter is evidence that Marco Rubio is speaking). The test cases are as they appear in the text file.

soktinpk

Posted 2016-03-02T01:58:19.510

Reputation: 4 080

I have a suggestion. How about you make it code-golf, but you have to get all the quotes right? Also, you might want to make the quotes a lot shorter, as this is a little ridiculous to solve as-is. – Cyoce – 2016-03-02T03:07:09.760

@Cyoce The point of the challenge is you look for key words and phrases that are associated with the candidates. You aren't supposed to get all of them right. – soktinpk – 2016-03-02T03:07:48.267

@Cyoce getting all the quotes right would be ridiculous (I think) considering the sheer number of quotes. – soktinpk – 2016-03-02T03:13:32.413

which is why I suggested reducing the quotes. Paired together, it would make this challenge a lot more manageable – Cyoce – 2016-03-02T03:15:10.050

@Cyoce The reasons I made this a code challenge is because a) if it is code golf, I have to limit the number of quotes, and I feel like the answer won't be interesting. In other words, the best answer will be some sort of hash that happens to work but isn't very general. b) The candidates have pretty distinct personalities and word choice so it can't be too hard. The point of the challenge is to figure out what kinds of things candidates say and apply them. – soktinpk – 2016-03-02T03:21:10.160

ok. I see what you're going for. I'll have to run a word-frequency count on the quotes to determine my approach – Cyoce – 2016-03-02T03:30:04.603

Clever challenge idea, may need some refining though. Have you considered posting in Sandbox for some feedback? – Ashwin Gupta – 2016-03-02T06:19:47.113

What is the winning criterion? (And why do you think that no-one will get a perfect score?) – Peter Taylor – 2016-03-02T11:00:32.870

@PeterTaylor The submission that correctly classifies the most of the candidates' statements wins. – user48538 – 2016-03-02T11:03:17.510

@PeterTaylor isn't [tag:test-battery] an objective winning criterion in itself? – plannapus – 2016-03-02T12:00:33.103

@soktinpk it would be nicer if you cleaned up the test cases (like one file for each candidate containing all test-cases clearly separated). Are elements between brackets such as [laughter] etc. part of the test-cases? – plannapus – 2016-03-02T12:06:45.243

@plannapus, I don't think so. It's like code-challenge: it tells you something about the winning criterion, but doesn't fully specify it. – Peter Taylor – 2016-03-02T13:29:27.277

@PeterTaylor fair enough, I see your point. – plannapus – 2016-03-02T14:12:11.597

@PeterTaylor I edited it, is it clearer now? Sorry for the delay. – soktinpk – 2016-03-06T00:33:34.423

Interesting challenge. Inspired by this perhaps? http://googleresearch.blogspot.co.uk/2016/02/on-personalities-of-dead-authors.html – Dave – 2016-03-06T08:51:27.837

The source data you provided is a little messy (difficult to parse automatically), which I think takes away some of the spirit of the challenge. I've made a cleaned-up version which uses one line per quote, with a blank line separating the next candidate name. This is much easier to parse in most languages. I've uploaded it here: https://drive.google.com/file/d/0B3uyVnkMpqbVSnVrZkVwTUhDODg (other than changing newlines, I've left the data untouched. That includes what looks like an encoding issue for –) – Dave – 2016-03-07T19:09:35.527

Answers

Polyglot, ~18.6%

This works in: Cjam, Pyth, TeaScript, Japt, Seriously, 05AB1E, GolfScript, Jelly, and probably many more.

6

This outputs Hillary (6) for all inputs, because Hillary said the most. It's not the most ingenious way to do this, but it works ¯\_(ツ)_/¯

Downgoat

Posted 2016-03-02T01:58:19.510

Reputation: 27 116

I like how this gets flagged as a low-quality post. :P – Denker – 2016-03-05T01:24:04.200

@DenkerAffe probably for being short – Downgoat – 2016-03-05T01:27:46.860

Any reason for using JavaScript? You could've golfed it down to one character in some other language :P – ghosts_in_the_code – 2016-03-07T07:14:37.063

@ghosts_in_the_code fixed – Downgoat – 2016-03-07T22:58:59.223

Pyth, 34.16% (297 bytes)

FNc"7creta
6enato
3ohio
2donal
7 major 
6o try t
5tot
5se me
7nai
4m pres
2he ob
3 bala
5jeb
6e aff
5mendous 
2mnest
5. we'r
7ave got to
2c ter
4ntur
7 campaign 
2flat
5obo
4is pre
4-here'
2note
2m el
4 issue 
5, very
6o af
1fact o
6en's
5pany
6he republicans
7 -- 
4meon
5bea
4ory o
7"bI}tNrzZhNB

(note that some lines end in spaces)

I've gone with the simplest option I could think of: check a list of patterns, and as soon as you find a match, output the corresponding candidate. If all else fails, output the most likely candidate from the remainder. After that, it's all about cramming as much data into 300 bytes as possible.

FNc"<data>"bI}tNrzZhNB

Breakdown:

FN                      for N in ...
   "<data>"              the hard-coded data (newline separated)
  c                      split using...
           b             '\n' constant,
            I           if
              tN         tail (all but first char) of current item
             }           is contained within
                rzZ      the input (lowercased),
                        then:
                   hN    print the head (first char) of the current item
                     B   and break out of the loop.
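In JavaScript terms, the same first-match-wins logic is roughly the following (a sketch; data stands for the hard-coded, newline-separated list above):

// Rough JavaScript equivalent of the Pyth program: return the leading digit of
// the first entry whose pattern occurs in the lowercased input. The last entry
// has an empty pattern, so it always matches and acts as the fall-back.
function classify(p, data) {
  const z = p.toLowerCase();
  for (const line of data.split('\n')) {
    if (z.includes(line.slice(1))) return +line[0];
  }
}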

So where does that data come from? Well the structure is simply:

<candidate_number><phrase>
<candidate_number><phrase>
<etc.>

(with an entry at the end with no phrase to act as the final fall-back)

But why those particular items? I wrote a C++ program to analyse the provided dataset (with some manual cleaning of newlines first to make the structure consistent). It looks at all substrings ("tokens") in each quote (1-16 characters), then repeatedly checks for the token which gives the most benefit to go next in the list. Once a pattern is in the list, remove any quotes which match it and repeat (it gets a bit more complicated to keep it fast but that's the basics). The code is probably too long to include here, but I might put it on github later (when I've cleaned it up a little).

I tried a couple of scoring systems. In the end I went with this one:

score = (
    + matching_quote_count_for_most_likely_author * 10
    - matching_quote_count_for_other_authors * 7
    - token_length
)

A stricter approach of only allowing new items which don't introduce incorrect answers seemed to get stuck at about 20-25%, needing a lot of patterns to get higher. This fuzzier approach does much better, and can still reach ~80% accuracy (with 550 items). The submitted program has 38 items, which was the most I could fit in the 300 character limit.
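A rough JavaScript sketch of that greedy selection loop (the original C++ tool isn't shown here; allTokens and countMatches are assumed helpers, and the constants follow the score formula above):

// Greedy pattern selection as described above: repeatedly pick the best-scoring
// token, emit "<candidate digit><token>", drop every quote that token matches,
// and repeat. allTokens() enumerates 1-16 character substrings of the remaining
// quotes and countMatches() counts matching quotes per candidate (both assumed).
function selectPatterns(remaining, maxItems) {
  const patterns = [];
  while (patterns.length < maxItems) {
    let best = null;
    for (const token of allTokens(remaining)) {
      const counts = countMatches(remaining, token);   // [c1, ..., c7]
      let top = 0;
      for (let i = 1; i < counts.length; i++) if (counts[i] > counts[top]) top = i;
      const others = counts.reduce((a, b) => a + b, 0) - counts[top];
      const score = counts[top] * 10 - others * 7 - token.length;
      if (!best || score > best.score) best = { token, top, score };
    }
    if (!best) break;
    patterns.push((best.top + 1) + best.token);        // candidate digits are 1..7
    for (const cand in remaining) {                    // remove matched quotes
      remaining[cand] = remaining[cand].filter(q => !q.includes(best.token));
    }
  }
  return patterns;
}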

The 34% result actually comes from a test C++ program which performs the same steps. It should match, but I haven't got a Pyth test harness to check it with.

This is the first time I've used Pyth, so I imagine some more bytes could be squeezed out, allowing a bit more data.

Dave

Posted 2016-03-02T01:58:19.510

Reputation: 7 519

Also I now know that Sanders loves talking about secretary Clinton, Clinton's obsessed with senator Sanders, Kasich loves Ohio, Cruz is always mentioning Donald Trump, Rubio is obsessed with Centuries, Carson has all the "fact[s] of the matter", and Trump totally loves saying "totally". This feels like the beginnings of a politics-bingo generator. I'll have to try it on some UK personalities… – Dave – 2016-03-07T19:02:08.353

I think you could save some bytes here by packing the string with .". – lirtosiast – 2016-03-14T16:59:42.240

Javascript, 32.87%

299 Characters:

function Q(a,b){return p.toLowerCase().split(' ').join('').includes(a)<<b}z=Q('ink',0)+Q('int',1)+Q('ona',2)+Q('rica',3)+Q('twe',4)+Q("we'",5)+Q('youkn',6);return '55472726464727446676664676767676563641233643334456364233336141745116222136477126111113361611262316263122216111673336225611363276'[z]*1

Strategy:

I did a brute force search on which word segments to include in a "hash". Then a lookup into a hard-coded string, indexed by that hash, picks the most likely candidate for each hash value.

The code itself:

// The Q function checks if a string is present.
// Then left-shifts the true/false result up to e.g. 64,32,16,8,4,2,1
// This way we can combine results into any number 0 to 127.
function Q(a,b){return p.toLowerCase().split(' ').join('').includes(a)<<b}

// Now we check for key string occurrences:
z=Q('ink',0)+Q('int',1)+Q('ona',2)+Q('rica',3)+Q('twe',4)+Q("we'",5)+Q('youkn',6)

// Finally, use this as an index into the lookup string. (Multiply by 1 to convert char to int.)
return '55472726464727446676664676767676563641233643334456364233336141745116222136477126111113361611262316263122216111673336225611363276'[z]*1
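One way the lookup string could be generated (a sketch, not the author's actual search code; quotesByNumber is an assumed input mapping candidate numbers 1-7 to arrays of their quotes):

// Sketch: build the 128-character lookup string by hashing every known quote
// with the same presence-bit scheme and, for each hash value, picking the
// candidate with the most matching quotes.
function buildLookup(quotesByNumber, segments) {
  const counts = Array.from({ length: 1 << segments.length },
                            () => new Array(8).fill(0));
  for (const [cand, quotes] of Object.entries(quotesByNumber)) {
    for (const q of quotes) {
      const s = q.toLowerCase().split(' ').join('');
      let h = 0;
      segments.forEach((seg, i) => { h += s.includes(seg) << i; });
      counts[h][cand]++;
    }
  }
  return counts.map(c => {
    let best = 1;
    for (let i = 2; i <= 7; i++) if (c[i] > c[best]) best = i;
    return best;
  }).join('');
}

// e.g. buildLookup(quotes, ['ink', 'int', 'ona', 'rica', 'twe', "we'", 'youkn'])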

This is my first ever code golf submission, so suggestions are welcome :)

Ru Hasha

Posted 2016-03-02T01:58:19.510

Reputation: 181

Mathematica, 23.7775 %

(l=ToLowerCase@#;Ordering[-StringCount[l,#]&/@{"fact","donald"|"obama","done"|"ohio","issue"|"united"|"why"|"world","great"|"many","senator","american"|"believe"|"campaign"|"secretary"|"street"|"wall"},1])[[1]]&

It counts the occurrences of common keywords unique to each candidate and outputs the number of the candidate with the highest score.

Basically, I found the commonest words of all candidates

t = Import["~/Documents/candidate quotes.txt"];
ts = DeleteCases[StringSplit[t, "\n\n"], ""];
tss = Split[ts, StringLength[#2] > 20 &][[{3, 4, 5, 6, 7, 1, 2}]];
names = StringSplit[#][[2]] & /@ tss[[All, 1]];
quotes = StringSplit[#, "(NEXT PARAGRAPH)"] & /@ StringJoin /@ tss[[All, 2 ;;]];
(* remove the 100 commonest english words *)
wd = WikipediaData["Most common words in English", "ArticleWikicode"];
Flatten[StringSplit[StringCases[wd, 
  Shortest["{| class=\"wikitable\"" ~~ w__ ~~ "}"] -> w], "\n"]];
common100 = 
Alternatives @@ ToLowerCase@DeleteDuplicates@Flatten[StringSplit /@ 
     StringCases[#, 
      "|| " ~~ ("[[" | "") ~~ w : ((WordCharacter | " ") ..) -> 
       w] & /@ %];
commonest = 
  Commonest[
Flatten[StringSplit[
    StringDelete[ToLowerCase[#], 
     PunctuationCharacter | (WordBoundary ~~ (common100) ~~ 
        WordBoundary)]] & /@ #], 20] & /@ quotes;

and chose the common keywords that are unique for each candidate.

keywords = 
 Alternatives @@@ 
  Table[Complement[commonest[[n]], 
    Union[Flatten[Delete[commonest, n]]]], {n, Length[names]}];

After manually deleting some of the keywords, this is the final table:

Carson    fact
Cruz      donald|obama
Kasich    done|ohio
Rubio     issue|united|why|world
Trump     great|many
Clinton   senator
Sanders   american|believe|campaign|secretary|street|wall

With these keywords the total function length is 211 characters. I tested the function over all quotes:

pairs = Flatten[MapThread[Table[q -> #1, {q, #2}] &, {names, quotes}]];
test[q_ -> n_] := Boole[n === names[[p@q]]] (* here p is my function that outputs the predicted candidate's number *)
Total[ParallelMap[test, pairs]]/Length[pairs] // N

which gives an accuracy of 23.7775 %.
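For comparison, the same keyword-count idea in plain JavaScript (a non-golfed sketch built from the table above):

// Sketch: count keyword occurrences per candidate and return the number of the
// candidate with the highest count (ties go to the earlier candidate).
const KEYWORDS = [
  /fact/g,                                                // 1 Carson
  /donald|obama/g,                                        // 2 Cruz
  /done|ohio/g,                                           // 3 Kasich
  /issue|united|why|world/g,                              // 4 Rubio
  /great|many/g,                                          // 5 Trump
  /senator/g,                                             // 6 Clinton
  /american|believe|campaign|secretary|street|wall/g      // 7 Sanders
];

function classify(p) {
  const l = p.toLowerCase();
  const scores = KEYWORDS.map(re => (l.match(re) || []).length);
  let best = 0;
  for (let i = 1; i < scores.length; i++) if (scores[i] > scores[best]) best = i;
  return best + 1;
}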

shrx

Posted 2016-03-02T01:58:19.510

Reputation: 462

Python, 25.677868%

Arbitrarily picked four different characters that would be used to identify the candidates. Each candidate is given a score factor per character based on a hill climbing search that I ran for a few minutes to end up at 25.68%.

I suppose that this at least proves that the concept is better than picking a candidate blindfolded or just picking Clinton, but I would be interested in seeing someone apply a better search algorithm, both for the factors and for the characters used.

w=dict(zip("hr?.",((.847,.491,.821,.54,.744,.765,.234),(.494,.777,.202,.587,.7,.852,.484),(.915,.187,.161,.559,.748,.244,.43),(.11,.013,.628,.974,1.037,.484,.302))))
def f(t,r=(0,0,0,0,0,0,0)):
 s=r
 for c in t:s=map(lambda a,b:a+b,s,w.get(c,r))
 return s.index(max(s))+1
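The hill-climbing search itself isn't shown; the general idea is to nudge one weight at a time and keep the change only when accuracy improves. A generic sketch (in JavaScript for consistency with the rest of the page; score(weights) is an assumed function that runs the classifier over all test cases and returns the fraction correct):

// Generic hill climbing over the per-character weight table: perturb a single
// weight, re-score, and keep the change only if accuracy improved.
function hillClimb(weights, score, iterations) {
  let best = score(weights);
  for (let n = 0; n < iterations; n++) {
    const c = Math.floor(Math.random() * weights.length);      // character row
    const k = Math.floor(Math.random() * weights[c].length);   // candidate column
    const old = weights[c][k];
    weights[c][k] += (Math.random() - 0.5) * 0.1;              // small random nudge
    const s = score(weights);
    if (s > best) best = s;
    else weights[c][k] = old;                                  // revert if not better
  }
  return weights;
}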

boomlinde

Posted 2016-03-02T01:58:19.510

Reputation: 191

Javascript, TBD

a=[...p].reduce((a,b)=>(a<<5)-a+b.charCodeAt(0)|0,0)%1000,alert(a>0?0:1000,a<79?1:a<226?2:a<333?3:a<497?4:a<697?5:a<849?6:7)

Converts each string to a hash code, then uses probabilistic methods to determine the speaker. Would be nice if someone with a good setup could test this for me.

LegionMammal978

Posted 2016-03-02T01:58:19.510

Reputation: 15 731

I count about 16.1%, but I'm not really sure what it does. What does the a+=a?0:1000 do? (I had to replace the alert with a return so I wasn't sure exactly what to do) – soktinpk – 2016-03-06T21:41:26.387

@soktinpk Sorry, the a+= must have been a typo. – LegionMammal978 – 2016-03-06T22:07:14.520