Turn nouns into plurals


Your task is to take as input an English noun and output its plural. To win the challenge you must trade off code length against accuracy.

Entries will be scored according to this comma-separated list of 95642 English nouns and their plurals. Your score is the length of your code in bytes, plus the number of nouns in this file that it does not correctly pluralise. If you manage to correctly pluralise every single line in the file then you may apply a 50% bonus. (I don't expect anyone to claim this, but you never know.) The lowest score wins.

Every line in the file consists of a word composed of letters [a-z], followed by a comma (,), followed by the plural, which is also composed of letters [a-z].
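As a minimal sketch of the line format described above, the following Python helper splits one line into its noun and plural and rejects anything that does not fit the stated pattern. The name `parse_line` is hypothetical and not part of the challenge.

```python
import re

# One line: lowercase word, a comma, lowercase plural (per the challenge text).
LINE_PATTERN = re.compile(r'([a-z]+),([a-z]+)')

def parse_line(line):
    """Return (noun, plural) for one line of the file, or raise ValueError."""
    match = LINE_PATTERN.fullmatch(line.rstrip('\n'))
    if match is None:
        raise ValueError('unexpected line format: %r' % line)
    return match.group(1), match.group(2)
```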

Please note that the data in the file is pretty terrible. It was scraped from Wiktionary using someone else's script. There are duplicates, there are things that are not real words, and there are entries that are clearly completely wrong. But none of this matters from the point of view of the challenge. The goal is to produce a program that will correctly reproduce the plurals as written in this file; their correspondence or otherwise to correct English plurals is irrelevant. Also please note that, like any dictionary, the provided file contains its fair share of offensive terms.

Input and output may be performed in any way you find convenient. (Function parameters, command-line input, etc.)

To evaluate your score you will probably need to write some testbed code that iterates over the provided file. Please include this in your answer. This code need not be in the same language as your pluralising program, and should not be golfed. (It should be readable, so we can verify that it works.)

A useful tip: while developing your code, you will probably find it helpful to modify your testbed code so that it prints out every line for which your function fails. That way you can easily see which additional cases you need to consider.
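A testbed along the lines of this tip might look like the sketch below. It is only an illustration: the name `failing_lines` is hypothetical, and it accepts any iterable of lines (an open file works) so that it can be tried on a small sample first.

```python
def failing_lines(func, lines):
    """Return (noun, expected, got) for every line that func gets wrong."""
    failures = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        noun, plural = line.split(',')
        got = func(noun)
        if got != plural:
            failures.append((noun, plural, got))
    return failures

# Made-up sample lines, tried against the simplest possible pluraliser.
sample = ["cat,cats\n", "box,boxes\n", "city,cities\n"]
for noun, expected, got in failing_lines(lambda a: a + 's', sample):
    print(noun + ": expected " + expected + ", got " + got)
```

To run it over the real data, pass an open file instead of the sample list, for example `with open("simpleNouns.csv") as file: failing_lines(p, file)`.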

Illustrative examples

The Python code

def p(a):
 return a+'s'

is 24 bytes. However, if we run it through the following evaluation function

def testPlural(func):
    badLines = 0
    with open("simpleNouns.csv") as file:
        for line in file:
            noun, plural = line.rstrip().split(',')
            if plural != func(noun):
                badLines += 1
    return badLines

then we can see that it fails in 11246 cases, giving a score of 24 + 11246 = 11270.

However, if we update the function to

import re
def p(a):
 a+='s'
 return re.sub('ys$','ies',a)

then its length has increased to 58 bytes, but it now fails in only 7380 cases, giving a score of 58 + 7380 = 7438.
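The scoring arithmetic can be sketched in a few lines of Python. The helper name `score` is hypothetical, and the byte counts quoted above are reproduced here only under the assumption that each source file ends with a trailing newline. The failure counts are the ones quoted in the text, not recomputed.

```python
def score(source, failures):
    """Score = length of the source in bytes + number of failed nouns."""
    return len(source.encode('utf-8')) + failures

# The two example programs, assumed to end with a trailing newline.
first = "def p(a):\n return a+'s'\n"
second = "import re\ndef p(a):\n a+='s'\n return re.sub('ys$','ies',a)\n"

print(score(first, 11246))   # 24 bytes + 11246 failures
print(score(second, 7380))   # 58 bytes + 7380 failures
```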

Nathaniel

Posted 2014-10-14T13:07:19.117

Reputation: 6 641

40 of the singulars are mapped to multiple conflicting plurals: uniq simpleNouns.csv | cut -d, -f1 | uniq -d -> axis ban batman bob brit chela ... why woolly yen – ngn – 2018-01-17T05:35:10.810
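A rough Python equivalent of that check, which does not depend on the file being sorted, could look like this. It is a sketch only: `conflicting_singulars` is a hypothetical name, and the sample lines are invented rather than taken from the file.

```python
from collections import defaultdict

def conflicting_singulars(lines):
    """Map each singular that has more than one distinct plural to its plurals."""
    plurals = defaultdict(set)
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        noun, plural = line.split(',')
        plurals[noun].add(plural)
    return {noun: sorted(forms) for noun, forms in plurals.items() if len(forms) > 1}

# Invented sample: "foo" conflicts, while the exact duplicate "bar" line does not.
sample = ["foo,foos\n", "foo,fooes\n", "bar,bars\n", "bar,bars\n"]
print(conflicting_singulars(sample))
```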

@ngn hmm, I thought I had removed those, but this was a very long time ago -- I will have to update the file. (I don't really want to bump this challenge while the Moby Dick one is active, but I will update it in the near future.) – Nathaniel – 2018-01-17T06:46:33.170

As a warning to potential answerers: when I do update the challenge I will remove the 50% bonus for getting all the plurals correct. That bonus is clearly a bad idea in retrospect, as it opens up the possibility of storing a compressed version of the data and just spitting it out. Luckily, the problem @ngn points out makes it impossible to achieve the bonus currently, so no answers will be invalidated by removing it. – Nathaniel – 2018-01-17T06:48:38.000

There is an existing challenge with the same theme, but this version is different in that the number of test cases is far higher, and the scoring system is different, which I think makes it a challenge of quite a different character. – Nathaniel – 2014-10-14T13:08:07.570

Have you run the answers there against this list? If they are competitive here, that still makes this a duplicate. I'm unsure at the moment, so I won't vote just yet. – Geobits – 2014-10-14T13:21:19.693

The input file contains more than 2000 duplicate lines. Is this an intentional strategy to give extra weight to some test cases? – Peter Taylor – 2014-10-14T13:23:24.703

@Geobits the top answer there would make a reasonable answer here. However, the point of this challenge is in optimising the solution for the details of this particular list. By doing that even a little, that answer could easily be beaten. – Nathaniel – 2014-10-14T14:49:22.367

I can see why you're not expecting anyone to get a perfect match. There's some real crap in the data. schwannomatosis,neurofibromatoses? diphosphine,si??!! – Peter Taylor – 2014-10-14T14:54:32.687

I'm not sure that makes it a non-duplicate. It implies I could post another one by just changing the list... – Geobits – 2014-10-14T14:54:49.727

@PeterTaylor I'm in the process of removing the duplicates. Honestly I didn't look at the data that closely. Whether the data is "good" or not isn't really all that relevant to the challenge - its purpose is just to provide an objective winning condition. – Nathaniel – 2014-10-14T14:57:50.733

@Geobits you must vote according to your conscience I suppose. It's not just the list that's different though. The scoring system in the other challenge means that most of the answers just add 's' to the end of the word, which I think made it rather disappointing, and this version definitely fixes that. Unlike other Stack Exchanges, the execution of an idea really makes a difference here, and if you allow a poorly executed challenge to block people from ever using the same idea, that would seem sub-optimal to me. But I'm new here and don't get to make that call. – Nathaniel – 2014-10-14T15:04:47.557

Answers

I'll start us out with something simple.

JavaScript, 232 + 2730 = 2962

w=prompt();r="ay,ays;ey,eys;oy,oys;y,ies;us,i;um,a;is,es;x,xes;man,men;s,ses;sh,shes;ch,ches;z,zes".split(";");for(i=0;i<r.length;i++){b=r[i].split(",");if(w.slice(-b[0].length)==b[0])return w.slice(0,-b[0].length)+b[1];}alert(w+"s")

Uses a small list of replacement rules. Fails 2730 test cases out of 95642.
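For readers who want to experiment with the rule list, here is an ungolfed Python transcription of the same idea: the first suffix that matches is replaced, and anything else gets an "s". This is a sketch for illustration, not the answer's scored code, and the name `pluralise_by_suffix` is hypothetical.

```python
# Suffix rules in the same order as the answer's rule string; the first match wins.
RULES = [
    ("ay", "ays"), ("ey", "eys"), ("oy", "oys"), ("y", "ies"),
    ("us", "i"), ("um", "a"), ("is", "es"), ("x", "xes"),
    ("man", "men"), ("s", "ses"), ("sh", "shes"), ("ch", "ches"),
    ("z", "zes"),
]

def pluralise_by_suffix(word):
    """Apply the first matching suffix rule, or append 's' if none matches."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word + "s"

for word in ["day", "city", "cactus", "box", "woman", "cat"]:
    print(word, "->", pluralise_by_suffix(word))
```

Because the order matters, "ay", "ey" and "oy" must come before the general "y" rule, and "us" and "is" are tried before the plain "s" rule.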

Peter Olson

Posted 2014-10-14T13:07:19.117

Reputation: 7 412

Python, 262+2606 = 2868

A small improvement over Peter Olson's answer. It could perhaps be reimplemented more compactly in another language.

import re
def p(n):
 s=n
 for a in r'fe;ves lf;lves ([iln])um;\1a ([cgil])us;\1i (?<![ar])is;es man;men ea;eae (?<!f)ix;ices ((?<!k)s|x|sh|ch|z|j);\1es ((?<=.)[^aeiou])y;\1ies (?<!s);s'.split():
  b,c=a.split(';')
  n=re.sub(b+'$',c,n)
  if n!=s:break
 return n
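The golfed function above can be read more easily in an ungolfed form. The sketch below keeps the same patterns in the same order, anchors each one at the end of the word, and stops at the first rule that changes the noun. The names `RULES` and `pluralise` are hypothetical, and this version is for reading, not for scoring.

```python
import re

# (pattern, replacement) pairs copied from the rule string, in order.
RULES = [
    (r'fe', 'ves'),
    (r'lf', 'lves'),
    (r'([iln])um', r'\1a'),
    (r'([cgil])us', r'\1i'),
    (r'(?<![ar])is', 'es'),            # "is" not preceded by a or r
    (r'man', 'men'),
    (r'ea', 'eae'),
    (r'(?<!f)ix', 'ices'),             # "ix" not preceded by f
    (r'((?<!k)s|x|sh|ch|z|j)', r'\1es'),
    (r'((?<=.)[^aeiou])y', r'\1ies'),  # non-vowel before y, not at word start
    (r'(?<!s)', 's'),                  # fallback: append s unless the word ends in s
]

def pluralise(noun):
    """Return the result of the first rule that changes the noun."""
    for pattern, replacement in RULES:
        result = re.sub(pattern + '$', replacement, noun)
        if result != noun:
            return result
    return noun

for word in ["knife", "stadium", "axis", "matrix", "city", "day"]:
    print(word, "->", pluralise(word))
```

The lookbehind assertions carve exceptions out of each rule without needing a separate exception list, which is where most of the byte savings over a plain suffix table come from.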

justhalf

Posted 2014-10-14T13:07:19.117

Reputation: 2 070

I wonder if there's any better solution that takes a fundamentally different approach – Claudiu – 2014-10-17T21:27:08.597

@Claudiu I'm sure the bulk of transformations will be done using something similar. I guess the real challenge is to figure out the most efficient way to handle the bad apples. – user0721090601 – 2014-10-17T23:42:35.247

I agree with guifa: the challenge is how to get most of the irregular nouns right, because for the regular ones, transformation by rule looks very natural. – justhalf – 2014-10-18T01:17:41.340