How to count the syllables in a word

22

3

Input: You will be passed a string containing a single English word. All letters will be lowercase, and there will be no non-alphabetic characters in the string.

Output: You will return an integer from 1 to 7 representing how many syllables you think are in the word.

Scoring: Your program will be run against all of the words found on this repository. If you get N words correct, and your program is M bytes large, then your score is N-(M*10). Largest score wins.
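As a quick illustration of the scoring rule, here is a minimal Python sketch. The function name `score` and its parameter names are made up for this example and are not part of the challenge.

```python
def score(correct: int, size_bytes: int) -> int:
    """Challenge score: N correct words minus 10 points per byte of source (M)."""
    return correct - 10 * size_bytes


# Example: 8618 correct words from a 53-byte program
print(score(8618, 53))  # 8088
```

Each extra byte costs as much as ten correctly counted words, so a shorter, slightly less accurate program can still come out ahead.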

To generate my syllable count, I used this as my word list and this to count the syllables.

Nathan Merrill

Posted 2015-03-04T02:35:45.840

Reputation: 13 591

The 3-syllable-words contains "inn" and "ruby". The 2-syllable-words contains these: "irs", "ore", "roy", "yer". Other than that the lists seem accurate enough. – justhalf – 2015-03-04T03:23:00.133

@justhalf thank you for those catches. Creating the lists was definitely the hardest part of the challenge. – Nathan Merrill – 2015-03-04T03:29:03.960

Related: http://codegolf.stackexchange.com/questions/10533/build-a-readability-index

– Digital Trauma – 2015-03-04T04:19:04.517

This challenge is making me realise how silly English can be. Take resume for example... – Sp3000 – 2015-03-04T08:26:35.133

Answers

12

Ruby, 8618 correct (91.1%), 53 bytes, 8618 - 10 * 53 = 8088 score

->s{s.scan(/[aiouy]+e*|e(?!d$|ly).|[td]ed|le$/).size}

This is an anonymous Ruby function which uses regexes to count syllables.

The function adds a syllable for every instance of:

  • A run of non-e vowels, followed by zero or more es
  • An e which is not part of a trailing ed or ely, with the exception of a trailing ted or ded
  • A trailing le

Analysis

The basic idea is to count runs of vowels, but this by itself isn't very accurate ([aeiouy]+ gets 74% correct). The main reason for this is because of the silent e, which modifies the previous vowel sound while not being pronounced itself. For example, the word slate has two vowels but only one syllable.
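For illustration, the vowel-run baseline can be sketched in Python (the answer itself is in Ruby, and the name `naive_count` is hypothetical):

```python
import re


def naive_count(word: str) -> int:
    # Count maximal runs of vowels, treating y as a vowel.
    return len(re.findall(r"[aeiouy]+", word))


print(naive_count("slate"))  # 2, although "slate" has only one syllable
```

The silent e forms its own vowel run, which is why the baseline overcounts words like slate.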

To deal with this, we take e out of the first part of the regex and treat it separately. Detecting silent es is hard, but I found two cases where they occur often:

  • As part of a trailing ed (unless it's a ted or ded like wanted or needed),
  • As part of a trailing ely (e.g. lovely)

These cases are specifically excluded from what would otherwise be e. (an e followed by any character).

The reason for the . in e(?!d$|ly). is to consume the next char if there is a double vowel (e.g. ea or ee), and to ensure that an e at the end of the word is not counted. However, a trailing le is usually pronounced, so that is added back in.

Finally, vowel runs are counted as one syllable. While this may not always be the case (e.g. curious), it's often difficult to work out whether there are multiple syllables. Take the ia of celestial and spatial, as an example.
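To experiment with the pattern outside Ruby, a rough Python port might look like the following. This is a sketch rather than the submitted answer: the names `SYLLABLE` and `count_syllables` are invented here, and it assumes the input is a single lowercase word with no trailing newline, so that `$` behaves the same in both languages.

```python
import re

# Same pattern as the Ruby answer. It has no capturing groups,
# so findall() returns one string per match.
SYLLABLE = re.compile(r"[aiouy]+e*|e(?!d$|ly).|[td]ed|le$")


def count_syllables(word: str) -> int:
    # Each non-overlapping match is counted as one syllable.
    return len(SYLLABLE.findall(word))


for w in ["slate", "lovely", "wanted", "table", "curious"]:
    print(w, count_syllables(w))
```

With this port, slate gives 1, while lovely, wanted and table give 2. curious also gives 2, because the vowel run iou is counted as a single syllable.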

Test program

I don't really know Ruby so I'm not sure how well it can be golfed. I did manage to scrape together a test program by consulting a lot of SO though:

cases = 0
correct = 0

s = "->s{s.scan(/[aiouy]+e*|e(?!d$|ly).|[td]ed|le$/).size}"

f = eval s

for i in 1 ... 8
    filepath = i.to_s + "-syllable-words.txt"
    file = File.open(filepath)

    while (line = file.gets)
        word = line.strip
        cases += 1
        if f.call(word) == i
            correct += 1
        end
    end
end

p "Correct: #{correct}/#{cases}, Length: #{s.length}, Score: #{correct - s.length*10}"

Sp3000

Posted 2015-03-04T02:35:45.840

Reputation: 58 729

Aww, you made the standard so high. In Python the code length is exactly 20 chars longer, so my implementation of your "Vowel followed by a letter that's not e" gives 6638 (7158 correct) – justhalf – 2015-03-04T04:00:39.147

@justhalf It's basically the only reason why I'm using Ruby :P I normally use Python for everything else. – Sp3000 – 2015-03-04T04:02:36.557

5

Python3, 7935 - 10 * 71 = 7225

My quick-and-dirty answer: count runs of consecutive vowels, but remove any final e's first.

lambda w:len(''.join(" x"[c in"aeiouy"]for c in w.rstrip('e')).split())

After stripping off the e's, this replaces vowels with x and all other characters with a space. The result is joined back into a string and then split on whitespace. Conveniently, whitespace at the beginning and end is ignored (e.g. " x xx ".split() gives ["x","xx"]). The length of the resulting list is therefore the number of vowel groups.

The original, 83-byte answer below was more accurate because it only removed a single e at the end. The newer one thus has problems for words like bee; but the shortened code outweighs that effect.

lambda w:len(''.join(" x"[c in"aeiouy"]for c in(w[:-1]if'e'==w[-1]else w)).split())
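A small comparison of the two versions on bee, as a sketch. The names `strip_all` and `strip_one` are labels chosen for this example; the lambda bodies are the two answers above.

```python
# 71-byte version: strips every trailing e
strip_all = lambda w: len(''.join(" x"[c in "aeiouy"] for c in w.rstrip('e')).split())

# 83-byte version: strips at most one trailing e
strip_one = lambda w: len(''.join(" x"[c in "aeiouy"] for c in (w[:-1] if 'e' == w[-1] else w)).split())

print(strip_all("bee"), strip_one("bee"))  # 0 1
```

Stripping every trailing e leaves bee with no vowels at all, so the shorter version returns 0 for it, which is outside the valid range of 1 to 7.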

Test program:

syll = lambda w:len(''.join(c if c in"aeiouy"else' 'for c in w.rstrip('e')).split())

overallCorrect = overallTotal = 0
for i in range(1, 8):
    with open("%s-syllable-words.txt" % i) as f:
        words = f.read().split()
    correct = sum(syll(word) == i for word in words)
    total = len(words)
    print("%s: %s correct out of %s (%.2f%%)" % (i, correct, total, 100*correct/total))
    overallCorrect += correct
    overallTotal += total

print()
print("%s correct out of %s (%.2f%%)" % (overallCorrect, overallTotal, 100*overallCorrect/overallTotal))

Evidently this was too dirty and not quick enough to beat Sp3000's Ruby answer. ;^)

DLosc

Posted 2015-03-04T02:35:45.840

Reputation: 21 213

->s{s.scan(/([aiouy]|e(?!$))+/).size} scores 7583. 84%'s pretty impressive for something so simple. – Sp3000 – 2015-03-04T04:14:57.547

1

Perl, 8145 - 10 * 30 = 7845

Using the lists from before the recent commits.

#!perl -lp
$_=s/(?!e[ds]?$)[aeiouy]+//g

nutki

Posted 2015-03-04T02:35:45.840

Reputation: 3 634

The files have been updated recently. I took a look and didn't see the words you named in the 1 syllable file. – Sp3000 – 2015-03-04T11:06:33.870

@Sp3000, weird. They were updated 7 hours ago according to what I see, and there are still those words under that link: https://github.com/nathanmerrill/wordsbysyllables/blob/master/1-syllable-words.txt

– nutki – 2015-03-04T11:11:36.773

Looks like @NathanMerrill messed up the update 7 hours ago: history.

– Sp3000 – 2015-03-04T11:14:01.907

@Sp3000, thanks. I updated the score to the older version. Those lists still have quite a few errors, but not as severe. – nutki – 2015-03-04T11:21:29.600

0

Python, 5370-10*19 = 5180

This program simply assumes that longer words have more syllables.

lambda x:len(x)/6+1
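The lambda relies on Python 2 integer division. Under Python 3, `/` would return a float, so an equivalent would need floor division. A minimal sketch, with the name `length_guess` chosen only for illustration:

```python
def length_guess(word: str) -> int:
    # Floor division keeps the result an integer on Python 3.
    return len(word) // 6 + 1


print(length_guess("cat"))       # 1
print(length_guess("syllable"))  # 2
```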

The tester program I use is:

correct = 0
y = lambda x:len(x)/6+1
for i in xrange(1,8):
    f = file(str(i)+"-syllable-words.txt")
    lines = f.read().split("\n")
    f.close()
    correct += len([1 for line in lines if y(line)==i])
print correct

Nathan Merrill

Posted 2015-03-04T02:35:45.840

Reputation: 13 591

Should we create a program or a function? Yours is not a program, it outputs nothing when run. – justhalf – 2015-03-04T03:24:53.213

@justhalf I need something that accepts input and produces output (even if that output isn't STDIO) – Nathan Merrill – 2015-03-04T03:28:11.410

Btw I didn't get 5150 for using 7, but 4391. In my test it's better to use len(x)/6 instead (5377-190 = 5187). – justhalf – 2015-03-04T04:19:16.230

@justhalf With the updates, I'm getting 5343, but definitely getting a worse score with len(x)/6. I'll post my test program. – Nathan Merrill – 2015-03-04T04:26:01.180

readlines() includes the newline in the result. So yours is actually (len(x)+1)/7+1. You should use read().split('\n') instead. I got 5352 for that formula, though. – justhalf – 2015-03-04T04:36:05.950