Predict whether a message will be starred or not in 50 bytes

41

6

Given an input of a string consisting of any message from our site chatroom taken from the list described and linked below, output either a truthy or a falsy value attempting to predict whether that message was starred or not in 50 bytes or less.

You may use any truthy or falsy values, but they must be identical (i.e. there should only be two possible outputs, one truthy and one falsy). The input will be given as raw HTML with newlines removed, and it may contain non-ASCII Unicode characters. If you require input in something other than UTF-8, please say so in your answer.

The winning submission to this challenge will be the one that predicts the highest percentage of chat messages correctly, out of the list linked below. If two given submissions have the same success rate, the shorter submission will win.

Please provide instructions for running your code on the entire set of messages and calculating the percentage correct. Ideally, this should be a bit of boilerplate code (not counted towards your 50 bytes) that loops through the positive test cases and outputs how many of them your code got correct and then does the same for the negative test cases. (The overall score can then be calculated manually via (correctPositive + correctNegative) / totalMessages.)

So that your code is reasonably testable, it must complete in 5 minutes or less for the entire list of chat messages on reasonable modern-day hardware.

The full list of chat messages can be found here, and it consists of the 1000 latest starred messages as truthy test cases and the 1000 latest unstarred messages as falsy test cases. Note that there are two files in the gist; scroll about halfway down for the unstarred messages.

Doorknob

Posted 2016-01-17T21:53:52.930

Reputation: 68 138

4Knowing the behaviors of chat, I think the following Pyth would suffice: O2 – Arcturus – 2016-01-17T22:18:10.807

9Considering the history of past starred messages, Regex, 11 bytes: Don'?t star – Downgoat – 2016-01-17T22:26:18.753

What kind of encoding is used in the test files (utf-8 right?)? Whatever I try, I always get more than 1000 lines. Is there something else I'm doing wrong? – flawr – 2016-01-17T22:32:44.573

Ok I think I get it: When there are messages with code (those with <pre> and <br>) all those linebreaks within are carriage returns \r and not newlines. But when we download these files, it seems there will be \ns inserted. Suggestion @Doorknob : Remove the \rs from your testfiles. – flawr – 2016-01-17T22:46:28.527

11This would be much easier if you were also given the user as part of the input. – Mama Fun Roll – 2016-01-18T01:07:43.077

And whether I was online at that time? – None – 2016-01-18T03:14:22.617

3At some point I would've answered Regex, 2 bytes \^ – PurkkaKoodari – 2016-01-18T08:46:17.487

2The title should have been something like: "Don't star this challenge" – James – 2016-01-18T16:37:20.830

1+1 for good challenge, -1 for "Do X in Y bytes". Solid sidevote here. – Mego – 2016-01-18T17:09:47.080

14I think you should run this again on the next 1,000 messages, and see which one really predicted starredness – abligh – 2016-01-18T23:03:15.597

@mego i usually would agree with this but it seems like the best way to do [tag:test-battery] – undergroundmonorail – 2016-01-19T00:31:45.857

Random question: how did you compile these lists? – ETHproductions – 2016-01-19T19:54:42.230

by what starring is defined? and how to access it? BTW -1 for not a real coding but more a "know your api" challange. Since I as embedded developer have not even a chance to create a body that would inspect the input in under 50 bytes (C and C++ here...) – Zaibis – 2016-01-20T11:24:05.363

@Zaibis See the last paragraph. Why can't you use C? There's only a few bytes of boilerplate (int f(char*s){}). – Doorknob – 2016-01-20T11:45:10.370

@ETHproductions I wrote a short Ruby script to scrape the transcript URLs. – Doorknob – 2016-01-20T11:45:45.110

Answers

29

Retina, 50 bytes, 71.8% 72.15%

^.*([[CE;ಠ-ﭏ]|tar|ol|l.x|eo|a.u|pin|nu|o.f|"$)

Tried some regex golfing at @MartinBüttner's suggestion. This matches 704 starred messages and doesn't match 739 unstarred messages.

The ^.*( ... ) is to make sure that there is always either 0 or 1 match, since Retina outputs the number of matches by default. You can score the program on the input files by prepending m` for multiline mode, then running

Retina stars.retina < starred.txt

and likewise for unstarred.txt.


Analysis / explanation

I generated the above snippets (and many more) using a program, then selected the ones I wanted manually. Here's some intuition as to why the above snippets work:

  • C: Matches PPCG, @CᴏɴᴏʀO'Bʀɪᴇɴ
  • E: Matches @ETHproductions, @El'endiaStarman
  • ;: Because the test cases are HTML, this matches &lt; and &gt;
  • ಠ-ﭏ: Matches a range of Unicode characters, most prominently for ಠ_ಠ and @Doorknob冰
  • tar: Matches variations of star, @El'endiaStarman (again) and also gravatar which appears in the oneboxes posted by new posts bots
  • ol: Matches rel="nofollow" which is in a lot of links and oneboxes
  • l.x: Matches @AlexA., @trichoplax
  • eo: Mainly matches people, but also three cases for @Geobits
  • a.u: Mainly matches graduation, status, feature and abuse
  • pin: Matches ping and words ending in ping. Also matches a few posts in a discussion about pineapple, as an example of overfitting.
  • nu: Matches a mixed bag of words, the most common of which is number
  • o.f: Matches golf, conf(irm|use)
  • "$: Matches a double quote as a last character, e.g. @phase He means "Jenga."

The [ is nothing special - I just had a character left over so I figured I could use it to match one more case.

Sp3000

Posted 2016-01-17T21:53:52.930

Reputation: 58 729

(I haven't posted the testing code yet because it seems to be running rather slowly, and I'd like to figure out why. It's too late now though.) – Sp3000 – 2016-01-18T14:41:35.250

1Executing Retina once for each test case will take a long time. Multi-line mode reports the claimed score pretty much instantly. – Dennis – 2016-01-19T23:32:39.513

@Dennis Thanks, I completely forgot I could do that. – Sp3000 – 2016-01-20T09:50:23.953

3LOL, now my name is a star magnet? – ETHproductions – 2016-01-28T20:56:33.693

18

JavaScript ES6, 50 bytes, 71.10%

Correctly identifies 670 starred and 752 non-starred.

x=>/ .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x)

Now across the 70% barrier, and beating everyone except Retina!

Returns true if the message contains any of these things:

  • A word of which the second letter is D, E, R, or v;
  • tar (usually star);
  • a and u with one char in between;
  • l and x with one char in between (usually alex);
  • italic text;
  • eo or ol;
  • a C, a semicolon, or a .

Here's a few more fruitful matches that don't seem to be worth getting rid of others:

  • nf
  • nu
  • yp
  • n.m

This has been growing closer and closer to the Retina answer, but I have found most of the improvements on my own.

Test it out in the console of one of these pages: star texts, no-star texts

var r=document.body.textContent.replace(/\n<br/g,"<br").split("\n").slice(0,-1);
var s=r.filter(function(x){return/ .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x)}).length;
console.log("Total:",r.length,"Matched:",s,"Not matched:",r.length-s);

Here's an alternate version. /a/.test is technically a function, but doesn't satisfy our criteria:

/ .[ERv]|a.u|l.x|<i|eo|yp|ol|nf|tar|[C;ÿ-ff]/.test

This scores 71.90% (697 starred, 741 unstarred).


I've been running some analyses on the lists to see which regex groups match the most starred and the least unstarred posts. The analyses can be found in this Gist. So far, I've checked aa and a.a matches. a.u is down at around #50 with a score of 28, yet it's the most efficient match of its format...

ETHproductions

Posted 2016-01-17T21:53:52.930

Reputation: 47 880

There are only 1000 messages...? – Conor O'Brien – 2016-01-17T22:40:40.600

2@CᴏɴᴏʀO'Bʀɪᴇɴ Some were multi-line, which wasn't accounted for in the snippet. This has been fixed. – ETHproductions – 2016-01-17T22:50:10.993

Why no one uses /regexp/.test()? I think it's possible to squeeze in a few more cases with that. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2016-01-18T02:36:03.787

8Today I learned I can get chat stars just by saying my own name. – Alex A. – 2016-01-18T02:40:57.083

@n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ Thanks, dunno how I didn't think of that – ETHproductions – 2016-01-18T02:43:06.657

Quick question: would f=/t[ao]r|l.x|bo[tx]|a.u|ol|<.>|[C;ÿ-ff]/.test count as a valid answer? – Bojidar Marinov – 2016-01-19T14:38:59.870

@BojidarMarinov Thanks for the suggestion, but that throws an error for me. However, /t[ao]r|l.x|bo[tx]|a.u|ol|<.>|[C;ÿ-ff]/.test seems borderline, since you can still call it like ("test")... – ETHproductions – 2016-01-19T16:40:15.057

I don't think this is valid by our current consensus "The submitted code should either define a named function or evaluate to **an unnamed function which can be captured in a name if desired (or if the language provides other mechanisms which allow the function to be used multiple times without repeating its definition).**"

– Martin Ender – 2016-01-19T18:58:37.517

@MartinBüttner That makes sense. Thanks for letting me know. – ETHproductions – 2016-01-19T19:08:14.993

@n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ cc ^ – Martin Ender – 2016-01-19T19:08:50.187

15

Pyth, 50 bytes, 67.9 %

0000000: 21 40 6a 43 22 03 91 5d d3 c3 84 d5 5c df 46 69 b5 9d  !@jC"..]....\.Fi..
0000012: 42 9a 75 fa 74 71 d9 c1 79 1d e7 5d fc 25 24 63 f8 bd  B.u.tq..y..].%$c..
0000024: 1d 53 45 14 d7 d3 31 66 5f e8 22 32 43 7a              .SE...1f_."2Cz

This hashes the input in one of 322 buckets and chooses the Boolean depending on that bucket.

Scoring

$ xxd -c 18 -g 1 startest.pyth
0000000: 72 53 6d 21 40 6a 43 22 03 91 5d d3 c3 84 d5 5c df 46  rSm!@jC"..]....\.F
0000012: 69 b5 9d 42 9a 75 fa 74 71 d9 c1 79 1d e7 5d fc 25 24  i..B.u.tq..y..].%$
0000024: 63 f8 bd 1d 53 45 14 d7 d3 31 66 5f e8 22 32 43 64 2e  c...SE...1f_."2Cd.
0000036: 7a 38                                                  z8
$ echo $LANG
en_US
$ pyth/pyth.py startest.pyth < starred.txt
[[345, False], [655, True]]
$ pyth/pyth.py startest.pyth < unstarred.txt
[[703, False], [297, True]]

Dennis

Posted 2016-01-17T21:53:52.930

Reputation: 196 637

14

CJam, 45 bytes, 65.55%

l_c"\"#&'(-.19<CEFHIJLMOPSTXY[_qಠ"e=\1b8672>|

This checks if the first character is in a specific list or the sum of all code points is larger than 8,672.

Scoring

$ cat startest.cjam
1e3{l_c"\"#&'(-.19<CEFHIJLMOPSTXY[_qಠ"e=\1b8672>|}*
$ java -jar cjam-0.6.5.jar startest.cjam < starred.txt | fold -1 | sort | uniq -c
    308 0
    692 1
$ java -jar cjam-0.6.5.jar startest.cjam < unstarred.txt | fold -1 | sort | uniq -c
    619 0
    381 1

Dennis

Posted 2016-01-17T21:53:52.930

Reputation: 196 637

+1 for teaching me about the fold command, along with the actual answer. – Doorknob – 2016-01-17T23:42:41.893

6

Matlab/Octave, 17 bytes 60.15%

Correctly classifies 490 messages as stared, 713 messages as unstared

Current version:

Just checking the length.

f=@(w)numel(w)>58

Old version:

Could be translated to any other language. It just checks whether the message contains the words star or not. score: 59/911/52.5%

f=@(w)nnz(strfind(lower(w),'star'))>0 %

Results for testcases using this code:

slCharacterEncoding('UTF-8');

fid = fopen('codegolf_starred_messages_starred.txt');
line = fgetl(fid);
starred = 0;
while ischar(line)
    if f(line);
        starred = starred +1;
    end

    disp(line)
    line = fgetl(fid);
end
fclose(fid);


fid = fopen('codegolf_starred_messages_unstarred.txt');
line = fgetl(fid);
unstarred = 0;
while ischar(line)
    if ~f(line);
        unstarred = unstarred +1;
    end

    disp(line)
    line = fgetl(fid);
end
fclose(fid);

disp(['  correctly classified as *ed: ',num2str(starred)])
disp(['correctly classified as un*ed: ',num2str(unstarred)])
disp(['                  total score: ',num2str((starred+unstarred)/20),'\%'])

flawr

Posted 2016-01-17T21:53:52.930

Reputation: 40 560

3

JavaScript ES6, 0.615 = 61.5%

342 correctly identified as starred, 888 correctly identified as unstarred, (342+888)/2000 = 0.615

x=>-~x.search(/(bo|le)x|sta|ಠ|ツ/i)

Test like this on this or this:

r=document.body.innerHTML.replace(/<\/*pre>/g,"").split`
`.filter(x=>-~x.search`(bo|le)x|sta|ಠ|ツ`).length

I STILL MIGHT GET YOU, MY PRETTY!

Conor O'Brien

Posted 2016-01-17T21:53:52.930

Reputation: 36 228

1I've got you now ;) – ETHproductions – 2016-01-17T23:27:58.310

@ETHproductions GG. I'll look for some more common patterns. – Conor O'Brien – 2016-01-17T23:30:35.350

3

CJam, 32 bytes, Overall score of 0.5605 (56%).

Correctly identifies 428 starred messages and 693 unstarred messages. Total score is (360+730)/2000=0.545.

l_el"sta"/,1>\,)4%!|

Not expecting to win, Ill see how it performs. Above is the code for a single message, to run with multiple use this modified version that returns amount of starred messages:

1000{l_el"star"/,1>\,)6%!|}fA]:+

Just test it with STDIN being the raw text of either file. Returns true if the message contains "star" or if length + 1 mod 4 = 0.

GamrCorps

Posted 2016-01-17T21:53:52.930

Reputation: 7 058

2So... if four divides one more than the length of a message, then it has a chance of being starred? – Conor O'Brien – 2016-01-17T22:47:30.063

2@CᴏɴᴏʀO'Bʀɪᴇɴ Yes, but it provides for a high score – GamrCorps – 2016-01-17T22:50:06.240

3

Retina, 46 bytes, 68.55

^.*([zj_C;&¡-ff]|sta|san|soc|bo|eo|xk|l.x|<.>)

679 star : 692 unstar

Switched to Retina to get some more regexes in... Still not done.

Mama Fun Roll

Posted 2016-01-17T21:53:52.930

Reputation: 7 234

Oh yeah, forgot about that. I'll fix it. – Mama Fun Roll – 2016-01-20T01:43:24.713

1

C# 6.0 (.NET Framework 4.6), 50 Bytes, 63,60%

bool s(string i)=>Regex.IsMatch(i,"ol|tar|l.x|ಠ");

Program i used for testing purposes:

void Main()
{
    var starred = @"C:\starred.txt";
    var unstarred = @"C:\unstarred.txt";

    var linesStarred = File.ReadAllLines(starred);
    var linesUnstarred = File.ReadAllLines(unstarred);

    var cls = linesStarred.Count();
    var clsc = 0;

    foreach (var line in linesStarred)
    {
        if ( s(line) ) clsc++;
    }

    var clu = linesUnstarred.Count();
    var cluc = 0;

    foreach (var line in linesUnstarred)
    {
        if (!s(line)) cluc++;
    }

    $"Starred {clsc}/{cls} correct ({(clsc/cls*100):0.00}%)".Dump();
    $"Unstarred {cluc}/{clu} correct ({(cluc /clu*100):0.00}%)".Dump();
    $"{(((clsc+cluc)/(decimal)(cls+clu))*100):0.00}".Dump();
}

bool s(string i)=>Regex.IsMatch(i,"ol|tar|l.x|ಠ");

Stephan Schinkel

Posted 2016-01-17T21:53:52.930

Reputation: 596