The Onion, or Not The Onion?

The Onion (warning: many articles are NSFW) is a satirical news organization that parodies traditional news media. In 2014, The Onion launched ClickHole (warning: also frequently NSFW), a satirical news website that parodies "clickbait" sites like BuzzFeed. Thanks to Poe's Law, it's fairly common for people to read the headlines of articles from The Onion or ClickHole and believe them to be true, not knowing that they are intended to be satire. The converse also happens with ridiculous-sounding real news stories - people often think that they are satire when they are not.

This confusion naturally lends itself to a game - given a news headline, try to guess whether or not it is satire. This challenge is about doing exactly that with a program.

Given a news headline (a string consisting of only printable ASCII characters and spaces), output 1 if the headline is satire, or 0 if it is not. Your score will be the number of correct outputs divided by the total number of headlines.

As per usual, standard loopholes (especially optimizing for the test cases) are not allowed. To enforce this, I will run your programs on a set of 200 hidden test cases (100 from The Onion, 100 from Not The Onion). For your solution to be valid, its score on the hidden test cases must be no more than 20 percentage points lower than its score on the public test cases.
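Concretely, the scoring and validity rules above amount to a simple accuracy check. A minimal sketch (the function names are mine, not part of the challenge):

```python
# Sketch of the scoring scheme described above (function names are hypothetical).

def score(outputs, labels):
    """Accuracy: number of correct outputs divided by the total number of headlines."""
    correct = sum(o == l for o, l in zip(outputs, labels))
    return correct / len(labels)

def is_valid(public_score, hidden_score):
    """A solution is valid if its hidden score is within 20 percentage
    points of its public score."""
    return hidden_score >= public_score - 0.20

print(score([1, 0, 1], [1, 0, 0]))  # 2 of 3 correct -> 0.666...
print(is_valid(0.78, 0.635))        # True: 63.5% is within 20pp of 78%
```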

Test Cases

To come up with test cases for this challenge, I picked 25 headlines from The Onion subreddit (where articles from The Onion and its child sites, like ClickHole, are posted), and 25 headlines from the Not The Onion subreddit (where real news articles that sound like satire are posted). The only changes I made to the headlines were replacing "fancy" quotes with regular ASCII quotes and standardizing capitalization - everything else is left unchanged from the original article's headline. Each headline is on its own line.

The Onion headlines

Trump Warns Removing Confederate Statues Could Be Slippery Slope To Eliminating Racism Entirely
'No Way To Prevent This,' Says Only Nation Where This Regularly Happens
My Doctor Told Me I Should Vaccinate My Children, But Then Someone Much Louder Than My Doctor Told Me I Shouldn't
Man At Park Who Set Up Table Full Of Water Cups Has No Idea How Passing Marathon Runners Got Impression They Can Take Them
This Child Would Have Turned 6 Today If His Mother Hadn't Given Birth To Him In October
Incredible Realism: The Campaign In The Next 'Call Of Duty' Will Begin At Your Avatar's High School Cafeteria When He's Being Tricked Into Joining The Military By A Recruiter
'Sometimes Things Have To Get Worse Before They Get Better,' Says Man Who Accidentally Turned Shower Knob Wrong Way
Report: Uttering Phrase 'Easy Does It' Prevents 78% Of Drywall Damage While Moving Furniture
Barbara Bush Passes Away Surrounded By Loved Ones, Jeb
Family Has Way Too Many Daughters For Them Not To Have Been Trying For Son
News: Privacy Win! Facebook Is Adding A 'Protect My Data' Button That Does Nothing But Feels Good To Press
Dalai Lama Announces Next Life To Be His Last Before Retirement
Researchers Find Decline In Facebook Use Could Be Directly Linked To Desire To Be Happy, Fully Functioning Person
Manager Of Combination Taco Bell/KFC Secretly Considers It Mostly A Taco Bell
Trump: 'It's My Honor To Deliver The First-Ever State Of The Union'
Daring To Dream: Jeff Bezos Is Standing Outside A Guitar Center Gazing Longingly At A $200 Billion Guitar
Area Dad Looking To Get Average Phone Call With Adult Son Down To 47.5 Seconds
Experts Warn Beef Could Act As Gateway Meat To Human Flesh
Jeff Bezos Named Amazon Employee Of The Month
Dad Suggests Arriving At Airport 14 Hours Early
Report: Only 3% Of Conversations Actually Need To Happen
Delta Pilot Refuses To Land Until Gun Control Legislation Passed
Family Wishes Dad Could Find Healthier Way To Express Emotions Than Bursting Into Full-Blown Musical Number
New Honda Commercial Openly Says Your Kids Will Die In A Car Crash If You Buy A Different Brand
Teacher Frustrated No One In Beginner Yoga Class Can Focus Chakras Into Energy Blast

Not The Onion headlines

Man Rescued From Taliban Didn't Believe Donald Trump Was President
Nat Geo Hires Jeff Goldblum To Walk Around, Being Professionally Fascinated By Things
Mike Pence Once Ratted Out His Fraternity Brothers For Having A Keg
Reddit CEO Tells User, "We Are Not The Thought Police," Then Suspends That User
Trump Dedicates Golf Trophy To Hurricane Victims
Uber's Search For A Female CEO Has Been Narrowed Down To 3 Men
ICE Director: ICE Can't Be Compared To Nazis Since We're Just Following Orders
Passenger Turned Away From Two Flights After Wearing 10 Layers Of Clothing To Avoid Luggage Fee
Somali Militant Group Al-Shabaab Announces Ban On Single-Use Plastic Bags
UPS Loses Family's $846k Inheritance, Offers To Refund $32 Shipping Fee
Teen Suspended From High School After Her Anti-Bullying Video Hurts Principal's Feelings
Alabama Lawmaker: We Shouldn't Arm Teachers Because Most Are Women
Cat Named After Notorious B.I.G. Shot Multiple Times - And Survives
EPA Head Says He Needs To Fly First Class Because People Are Mean To Him In Coach
Apology After Japanese Train Departs 20 Seconds Early
Justin Bieber Banned From China In Order To 'Purify' Nation
Alcohol Level In Air At Fraternity Party Registers On Breathalyzer
NPR Tweets The Declaration Of Independence, And People Freak Out About A 'Revolution'
Man Who Mowed Lawn With Tornado Behind Him Says He 'Was Keeping An Eye On It.'
After Eating Chipotle For 500 Days, An Ohio Man Says He's Ready For Something New
'El Chapo' Promises Not To Kill Any Jurors From Upcoming Federal Trial
After 4th DWI, Man Argues Legal Limit Discriminates Against Alcoholics
Palestinian Judge Bans Divorce During Ramadan Because 'People Make Hasty Decisions When They're Hungry'
Argentinian Officers Fired After Claiming Mice Ate Half A Ton Of Missing Marijuana
'Nobody Kill Anybody': Murder-Free Weekend Urged In Baltimore

Mego

Posted 2018-08-15T23:02:45.850

Reputation: 32 998

Sandbox – Mego – 2018-08-15T23:03:10.663

"Your score will be the number of correct outputs divided by the total number of headlines" - Is bytecount a tie breaker? – Skidsdev – 2018-08-15T23:19:10.093

I'm a little bit confused. What kind of solution do you expect? Every solution is going to have to "optimize for the test cases" somewhat, bar writing an AI that can understand English and has a sense for humor. For example, Arnauld's solution detects /ly\b/ which works only because the 25 Onion headlines you picked happen to have more adverbs, but for all I know you could easily trip it up with a different test battery. And who's to say his coefficients aren't chosen to optimize his score? (Why wouldn't he optimize them?) – Lynn – 2018-08-16T10:41:36.273

1Is querying a search API (for example Google), considered a loophole? – Night2 – 2018-08-16T11:06:53.910

@Night2 Yes, because that's fetching the result from an external source. – Mego – 2018-08-16T13:58:17.253

@Mego I feel like we'd do better if we had the body text as well, since the sarcasm comes off a lot more easily there – Don Thousand – 2018-08-16T14:19:52.573

@RushabhMehta That would be a different challenge – Mego – 2018-08-16T17:02:53.940

@Lynn I think it's a safe assumption that these test cases are representative of the entire set of all headlines. Therefore, finding a pattern like -ly and extrapolating that that would hold with additional test cases seems fair to me. – Nathan Merrill – 2018-08-16T18:11:41.070

This test battery does seem a bit unusual. It's like asking for a classifier that can detect dogs in a photograph, but taking your positive test cases as photos of dogs and your negative test cases from a Buzzfeed article titled "25 Photos of Objects You'll Swear Are Dogs, But Nope, Turns Out They Aren't! (#11 Will Blow Your Mind!)" It makes a hard enough problem harder. – Sophia Lechner – 2018-08-16T19:50:09.320

Not only is the challenge hard, but it's also non-obvious (to me) what's the difference. If I can't solve it, of course my program can't solve it (that is, while convince me that it doesn't hardcode for the test cases) – user202729 – 2018-08-17T09:31:39.363

Well, I spent 36+ hours training an artificial neural network using brain.js and LSTM, with the samples in this question and 100 other samples of each type from the provided links, but the result wasn't good enough with new titles that weren't present in the training sets. I'm done :P – Night2 – 2018-08-18T10:29:10.757

@Skidsdev Tiebreaker is the default - the first answer (by edit time) to reach the tied score. – Mego – 2018-08-18T21:47:28.403

Answers

JavaScript (ES7), 39 / 50 (78%)

63.5% (127/200) on hidden test cases

A simple heuristic based on the length of the title, number of spaces and use of the -ly suffix.

isOnion = str =>
  str.length ** 0.25 +
  str.split(' ').length ** 1.25 * 2 +
  str.split(/ly\b/).length ** 1.75 * 7
  > 76

Try it online!
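For comparison, here is a rough Python port of the same heuristic (my translation, not Arnauld's code; Python's `re.split(r'ly\b', ...)` plays the role of the JS `str.split(/ly\b/)`):

```python
import re

def is_onion(s):
    # Same three signals as the JS version: title length, word count,
    # and the number of "-ly" word endings.
    return (len(s) ** 0.25
            + len(s.split(' ')) ** 1.25 * 2
            + len(re.split(r'ly\b', s)) ** 1.75 * 7) > 76

print(is_onion("Report: Only 3% Of Conversations Actually Need To Happen"))  # True
print(is_onion("Trump Dedicates Golf Trophy To Hurricane Victims"))          # False
```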

Arnauld

Reputation: 111 334

This is absurdly effective for how simple it is. – Don Thousand – 2018-08-16T15:42:19.493

This solution scored 63.5% on the hidden test cases, so it is valid. – Mego – 2018-08-17T00:03:37.220

Not as simple as what was possible at the beginning of the sandbox (100%, by exploiting the capitalization differences before they were standardized), but this is really simple. – Zacharý – 2018-08-17T00:58:14.547

@Mego Just out of curiosity, does this NSFW version improve the score on the hidden test cases? :) – Arnauld – 2018-08-17T22:51:42.500

@Arnauld 66% with that version – Mego – 2018-08-18T03:37:17.037

Python 3, 84%

Untested on hidden test cases.

This uses a Keras LSTM RNN trained on various headlines. To run it, you need Keras, the following dependencies, and the model, which I've made available on GitHub: repo link. You will need both the model weights (.h5) and the word/vector mappings (.pkl).

The dependencies are:

import numpy as np
from pickle import load
from keras.preprocessing import sequence, text
from keras.models import Sequential
from keras.layers import Dense, Embedding, SpatialDropout1D, LSTM, Dropout
from keras.regularizers import l2
import re

The settings are:

max_headline_length = 70
word_count = 20740

The model is:

model = Sequential()
model.add(Embedding(word_count, 32, input_length=max_headline_length))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(64, kernel_regularizer=l2(0.005), dropout=0.3, recurrent_dropout=0.3))
model.add(Dropout(0.5))
model.add(Dense(32, kernel_regularizer=l2(0.005)))
model.add(Dropout(0.5))
model.add(Dense(2, kernel_regularizer=l2(0.001), activation='softmax'))

Now to load the model and the word embeddings:

model.load_weights('model.h5')
word_to_index = load(open('words.pkl', 'rb'))

Finally, here is the code to test whether a string is from 'NotTheOnion' or 'TheOnion'. I've written quick helper functions which convert the string to the respective word indexes:

def get_words(string):
  words = []
  for word in re.finditer("[a-z]+|[\"'.;/!?]", string.lower()):
    words.append(word.group(0))
  return words

def words_to_indexes(words):
  return [word_to_index.get(word, 0) for word in words]

def format_input(word_indexes):
  return sequence.pad_sequences([word_indexes], maxlen=max_headline_length)[0]

def get_type(string):
  words = words_to_indexes(get_words(string))
  result = model.predict(np.array([format_input(words)]))[0]

  if result[0] > result[1]:
    site = 'NotTheOnion'
  else:
    site = 'TheOnion'

  return site

Explanation

This code runs a model which analyzes the relationships between words by representing the words as a 'vector'. You can learn more about word embedding here.
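As a toy illustration of that idea (the vocabulary and vectors here are made up, not the ones distributed in the .pkl file):

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for unknown words (<UNK>).
word_to_index = {'<UNK>': 0, 'king': 1, 'queen': 2, 'duke': 3, 'duchess': 4}

# Each row is the vector for one word (random here; learned during training).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(word_to_index), 8))

def embed(words):
    """Map words to their vectors, falling back to <UNK> for unseen words."""
    idx = [word_to_index.get(w, 0) for w in words]
    return embeddings[idx]

vecs = embed(['king', 'zygote'])  # 'zygote' is unseen -> gets the <UNK> row
print(vecs.shape)  # (2, 8)
```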

This is trained on headlines but the test cases are excluded.

This process is automated after quite a bit of preprocessing. I've distributed the final processed word list as a .pkl, but the first step in word embedding is to analyze the sentence and isolate the words.

Now that we have the words, the next step is to be able to understand the differences and similarities between certain words, e.g. king and queen versus duke and duchess. These embeddings aren't computed between the actual words but between numbers representing the words, which is what is stored in the .pkl file. Words that the machine doesn't understand are mapped to a special token <UNK>, which tells us that a word is there but that its exact meaning is unknown.

Now that individual words can be understood, the sequence of words (the headline) needs to be analyzed. This is what the LSTM does; an LSTM is a type of RNN cell which avoids the vanishing gradient problem. More simply, it takes in a sequence of words and allows us to find relationships between them.

The final layer is Dense, which outputs an array of the form [probability_is_not_onion, probability_is_onion]. By finding which one is larger, we can report the most confident classification for the given headline.
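Picking the larger of the two probabilities is just an argmax over the Dense layer's output; for example:

```python
import numpy as np

# Hypothetical softmax output for one headline: [P(NotTheOnion), P(TheOnion)].
result = np.array([0.12, 0.88])
labels = ['NotTheOnion', 'TheOnion']
print(labels[int(np.argmax(result))])  # -> TheOnion
```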

Downgoat

Reputation: 27 116

Python 3 + Keras, 41/50 = 82%

83% (166/200) on hidden test cases

import json
import keras
import numpy
import re

from keras import backend as K

STRIP_PUNCTUATION = re.compile(r"[^a-z0-9 ]+")


class AttentionWeightedAverage(keras.engine.Layer):
    def __init__(self, return_attention=False, **kwargs):
        self.init = keras.initializers.get("uniform")
        self.supports_masking = True
        self.return_attention = return_attention
        super(AttentionWeightedAverage, self).__init__(**kwargs)

    def build(self, input_shape):
        self.input_spec = [keras.engine.InputSpec(ndim=3)]
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[2], 1),
                                 name="{}_W".format(self.name),
                                 initializer=self.init)
        self.trainable_weights = [self.W]

        super(AttentionWeightedAverage, self).build(input_shape)

    def call(self, x, mask=None):
        logits = K.dot(x, self.W)
        x_shape = K.shape(x)
        logits = K.reshape(logits, (x_shape[0], x_shape[1]))

        ai = K.exp(logits - K.max(logits, axis=-1, keepdims=True))

        if mask is not None:
            mask = K.cast(mask, K.floatx())
            ai = ai * mask

        att_weights = ai / (K.sum(ai, axis=1, keepdims=True) + K.epsilon())
        weighted_input = x * K.expand_dims(att_weights)

        result = K.sum(weighted_input, axis=1)

        if self.return_attention:
            return [result, att_weights]

        return result

    def get_output_shape_for(self, input_shape):
        return self.compute_output_shape(input_shape)

    def compute_output_shape(self, input_shape):
        output_len = input_shape[2]

        if self.return_attention:
            return [(input_shape[0], output_len), (input_shape[0], input_shape[1])]

        return (input_shape[0], output_len)

    def compute_mask(self, input, input_mask=None):
        if isinstance(input_mask, list):
            return [None] * len(input_mask)
        else:
            return None


if __name__ == "__main__":
    model = keras.models.load_model("combined.h5", custom_objects={"AttentionWeightedAverage": AttentionWeightedAverage})
    with open("vocabulary.json", "r") as fh:
        vocab = json.load(fh)

    while True:
        try:
            headline = input()
        except EOFError:
            break

        tokens = STRIP_PUNCTUATION.sub("", headline.lower()).split()

        inp = numpy.zeros((1, 45))

        for i, token in enumerate(tokens):
            try:
                inp[0,i] = vocab[token]
            except KeyError:
                inp[0,i] = 1

        print(model.predict(inp)[0][0] > 0.3)

combined.h5 and vocabulary.json can be retrieved from here (very large) and here.

Fully-connected classifier connected to a pre-trained sentiment analysis model DeepMoji, which consists of stacked bi-directional LSTMs and an attentional mechanism. I froze the DeepMoji layers and took out the final softmax layer, trained just the fully connected layers, then unfroze the DeepMoji layers and trained them together for finetuning. The attentional mechanism is taken from https://github.com/bfelbo/DeepMoji/blob/master/deepmoji/attlayer.py (I didn't want to have to use all their code as a dependency for one class, especially since it's Python 2 and rather unwieldy to use as a module...)
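Stripped of the Keras plumbing, the attention layer above is a (masked) softmax over per-timestep logits followed by a weighted sum over the timesteps. A NumPy sketch of the same computation:

```python
import numpy as np

def attention_weighted_average(x, w, mask=None):
    """x: (batch, time, features); w: (features, 1) learned attention weights."""
    logits = (x @ w).squeeze(-1)                      # (batch, time)
    ai = np.exp(logits - logits.max(axis=-1, keepdims=True))
    if mask is not None:
        ai = ai * mask                                # zero out padded timesteps
    att = ai / (ai.sum(axis=1, keepdims=True) + 1e-7) # softmax weights
    return (x * att[..., None]).sum(axis=1)           # (batch, features)

x = np.ones((2, 5, 4))
w = np.zeros((4, 1))  # zero weights -> uniform attention -> plain mean over time
out = attention_weighted_average(x, w)
print(out.shape)  # (2, 4)
```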

This performs surprisingly poorly on Mego's test set, considering that on my own larger validation set it gets >90%. So I'm not done with this yet.

a spaghetto

Reputation: 10 647

83% on hidden test cases, assuming I ran it correctly – Mego – 2018-08-18T04:25:21.350

JavaScript (Node.js), 98% (49/50)

96% (192/200) on hidden test cases

const words = require('./words');
const bags = require('./bags');

let W = s => s.replace(/[^A-Za-z0-9 ]/g, '').toLowerCase().split(' ').filter(w => w.length > 3);

let M = b => {
    for (let i = 0; i < bags.length; i++) {
        let f = true;
        for (let j = 0; j < bags[i].length; j++) if (!b.includes(bags[i][j])) {
            f = false;
            break;
        }
        if (f) return true;
    }
    return false;
};

let O = s => {
    let b = [];
    W(s).forEach(w => {
        let p = words.indexOf(w);
        if (p >= 0) b.push(p);
    });
    return (b.length > 0 && M(b));
};

This requires two large JSON files which I can't put here or on TIO. Please download them from the following links and save them as words.json and bags.json, in the same folder as the JS file. There is also a link to a JS file with the test cases and result/percentage printing. You can put your hidden test cases in the onions and nonOnions variables.

After saving all 3 files in same directory, run node onion.js.

The O function returns true if the headline is from The Onion and false if it isn't. It uses a big list of word bags (without order) to detect whether the input string is Onion. It's somewhat hard-coded, but it works very well on a variety of random test cases.
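The matching scheme is straightforward to sketch: a headline is classified as Onion if, for at least one bag, every word in that bag appears in the headline. A toy Python version (the words/bags here are made up; the real words.json/bags.json are far larger):

```python
import re

# Toy stand-ins for words.json / bags.json (the real lists are much larger).
words = ['warns', 'confederate', 'statues', 'slippery', 'slope', 'racism']
bags = [[3, 4], [1, 2]]  # {"slippery", "slope"} or {"confederate", "statues"}

def tokenize(s):
    # Mirror the W function: strip punctuation, lowercase, keep words > 3 chars.
    return [w for w in re.sub(r'[^A-Za-z0-9 ]', '', s).lower().split(' ')
            if len(w) > 3]

def is_onion(s):
    b = {words.index(w) for w in tokenize(s) if w in words}
    # A hit if every index in any single bag is present in the headline.
    return any(all(i in b for i in bag) for bag in bags)

print(is_onion("Trump Warns Removing Statues Could Be Slippery Slope"))  # True
print(is_onion("Trump Dedicates Golf Trophy To Hurricane Victims"))      # False
```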

Night2

Reputation: 5 484

This solution gets 96% on the hidden test cases – Mego – 2018-08-20T00:20:27.563

Working off Arnauld's solution

JavaScript (ES6), 41 / 50

64% (128/200) on hidden test cases

isOnion = str =>
  str.includes("Dad") || str.length ** .25 +
  str.split(' ').length ** 1.25 * 2 +
  str.split(/ly\b/).length ** 1.75 * 7
  > 76

JavaScript (ES6), 42 / 50

62.5% (125/200) on hidden test cases (invalid)

isOnion = str =>
  str.includes("Dad") || str.length ** .25 +
  str.split(' ').length ** 1.25 * 2 +
  str.split(' ').filter(w => w.length > 3 && w.split(/ly/).length > 1).length * 23.54 +
  /\d/.test(str) * 8
  > 76

The length + word count + "ly" concept works pretty well. I was able to squeeze out a few more points by checking for the word "Dad" (when do real articles talk about people's dads in the third person in a title?), and an additional point by changing the "ly" search heuristic and checking for the presence of numbers in the title (which might be less valid in the general case outside the test set, so I left both solutions).

TiKevin83

Reputation: 121

I don't know about the dad part...seems a little bit like optimizing the test case to me... – Don Thousand – 2018-08-16T20:38:17.393

And yes, I can find plenty of Not the Onion articles mentioning dads – Don Thousand – 2018-08-16T20:38:48.950

There's probably a better way to do it as part of the heuristic and not just a hard "win" if it contains dad, but I imagine even outside the test database abstractly talking about a specific "Dad" is more common on The Onion – TiKevin83 – 2018-08-16T21:04:05.480

Your first solution scored 64% on the hidden test cases, so it is valid. Your second solution scored 62.5% on the hidden test cases, so it is not valid. – Mego – 2018-08-17T00:05:45.760

@Mego What a close margin... – user202729 – 2018-08-17T06:36:18.503