Build a readability index

13

The Flesch-Kincaid readability algorithm depends on word count and syllable count, neither of which is entirely objective or easy to automate. For example, does "code-golf", with the hyphen, count as one word or two? Is "million" two syllables or three? In this task you will need to approximate, as counting exactly would take too much time, space, and, most importantly, code.

Your task is to build the smallest possible program (i.e. a function) in any language that will take an English reading passage (assumed to be in complete sentences), and calculate the Flesch Reading Ease index to a tolerance of eight points (to account for variations in syllable counting and word counting). It is calculated as follows:

FRE = 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)

Your program must be aligned to the reference passages below, whose indices were calculated using manual counting:

I would not, could not, in the rain.
Not in the dark, not on a train.
Not in a car, not in a tree.
I do not like them, Sam, you see.
Not in a house, not in a box.
Not with a mouse, not with a fox.
I will not eat them here or there.
I do not like them anywhere!

Index: 111.38 (64 syllables in 62 words in 8 sentences)

It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape
the vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

Index: 65.09 (74 syllables in 55 words in 2 sentences)

When in the Course of human events, it becomes necessary for one people to
dissolve the political bands which have connected them with another, and to
assume among the powers of the earth, the separate and equal station to
which the Laws of Nature and of Nature's God entitle them, a decent respect
to the opinions of mankind requires that they should declare the causes
which impel them to the separation.

Index: 3.70 (110 syllables in 71 words in 1 sentence)
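
As a quick sanity check, the formula can be applied directly to the counts quoted above. The following is only a minimal Python sketch; the function name `flesch_reading_ease` and the `reference_counts` list are illustrative choices, not part of the task.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts, using the formula given above."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# (syllables, words, sentences) as quoted for the three reference passages.
reference_counts = [(64, 62, 8), (74, 55, 2), (110, 71, 1)]
for syllables, words, sentences in reference_counts:
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
```

With the quoted counts this gives about 65.10 and 3.70 for the second and third passages. For the first passage it gives about 111.64, slightly above the quoted 111.38 but well inside the eight-point tolerance.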

If you have any other passages for which you have manually counted the syllables and words and calculated the index, you may show them as verification.

Joe Z.

Posted 2013-01-25T02:07:43.577

Reputation: 30 589

Can it be a function? Or does it have to take STDIN? – Brigand – 2013-01-25T04:58:18.190

Do you have the syllable count available for the 3 example passages, or just the index? If you have it, the syllable count would be handy for comparison. – Strigoides – 2013-01-25T06:55:48.490

It can be a function. In fact, it should be a function. – Joe Z. – 2013-01-25T12:02:08.713

Answers

6

Perl 120 bytes

#!perl -pa0
s@\w+|([.!?])@$s+=$#-,lc($&)=~s![aeiou]+\B|([aeiouy]$)!$y+=1-$#-/3!ger@ge}
{$_=206.835-1.015*@F/$s-84.6*$y/@F

Sample I/O:

$ perl flesch-kincaid.pl < input1.dat
110.730040322581

$ perl flesch-kincaid.pl < input2.dat
65.6097727272728

$ perl flesch-kincaid.pl < input3.dat
1.71366197183096

Syllables are counted by assuming that each vowel cluster is a single syllable, except that a lone vowel at the end of a word counts as only two thirds of a syllable. The heuristic seems to be fairly accurate.
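
For readers who don't speak Perl, the syllable heuristic can be sketched in Python roughly as follows. This is one reading of the inner regex applied to a single word, not a transcription of the whole program; the names `_SYLLABLE` and `estimate_syllables` are made up for the example.

```python
import re

# Same alternation as the inner Perl substitution: a vowel run that is not
# at the end of the word, or a single trailing vowel (y included).
_SYLLABLE = re.compile(r'[aeiou]+\B|([aeiouy]$)')

def estimate_syllables(word):
    """Fractional syllable estimate for one word."""
    total = 0.0
    for match in _SYLLABLE.finditer(word.lower()):
        # Group 1 only participates for a trailing vowel, which
        # contributes 2/3 of a syllable instead of a full one.
        total += 2 / 3 if match.group(1) else 1.0
    return total

for word in ("rain", "the", "tree", "anywhere"):
    print(word, round(estimate_syllables(word), 2))
```

Note that a word such as "tree" comes out at one and two thirds: the first e is matched as a cluster and the final e as a trailing vowel.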

primo

Posted 2013-01-25T02:07:43.577

Reputation: 30 891

3

K&R c -- 188 196 199 229 characters

With the spec changed to specify a function, I can get a lot of the C overhead out of the count. I have also switched to Strigoides' syllable-counting hack, which is better than my formula tweak, and extended it to deal with overcounting of words.

After I found a shorter way to do the vowel detection, which was sadly based on strchr, I had an incentive to squeeze a few more characters out of the bit-twiddling abomination I've been using so that I didn't have to be boring.

d,a,v,s,t,w;float R(char*c){for(;*c;++c){s+=*c=='.';if(isalpha(*c)){
w+=!a++;d=(*c&30)>>1;if(*c&1&(d==7|((!(d&1))&(d<6|d>8)))){t+=!v++;}
else v=0;}else v=a=0;}return 206.835-1.*w/s-82.*t/w;}

The logic here is a simple state machine. It counts sentences by periods only, words by strings of alphabetic characters, and syllables as strings of vowels (including y).

I had to diddle the constants a little to get it to come out with the right figures, but I've borrowed Strigoides' trick of just undercounting the syllables by a fixed fraction.
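
The state machine itself is easier to see without the golfing. Here is a rough Python sketch of the same flags and counters, assuming ASCII input; the names `count_units` and `readability` are illustrative, and the vowel test is a plain membership check rather than the bit trick.

```python
def count_units(text):
    """Return (sentences, words, syllables) using the same flags as the C loop."""
    sentences = words = syllables = 0
    in_word = in_vowel_run = False
    for ch in text:
        if ch == '.':                      # sentences: periods only
            sentences += 1
        if ch.isascii() and ch.isalpha():
            if not in_word:                # first letter after a non-letter
                words += 1
            in_word = True
            if ch in 'aeiouyAEIOUY':
                if not in_vowel_run:       # first vowel of a vowel run
                    syllables += 1
                in_vowel_run = True
            else:
                in_vowel_run = False
        else:
            in_word = in_vowel_run = False
    return sentences, words, syllables

def readability(text):
    s, w, t = count_units(text)
    # Tweaked constants from the answer (1.0 and 82 instead of 1.015 and 84.6).
    return 206.835 - 1.0 * w / s - 82.0 * t / w

print(count_units("I do not like them, Sam, you see."))
```

Because any non-letter ends a word, a contraction such as "Nature's" is seen as two words here, just as in the C version.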

Un-golfed, with comments and some debugging tools:

#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
d,a,/* last character was alphabetic */
  v,/* last character was a vowel */
  s, /* sentences counted by periods */
  t, /* syllables counted by non-consecutive vowels */
  w; /* words counted by non-letters after letters */
float R/*eadability*/(char*c){
  for(;*c;++c){
    s+=*c=='.';
    if(isalpha(*c)){ /* a letter might mark the start of a word or a
               vowel string */
      w+=!a++; /* It is only the start of a word if the last character
              wasn't a letter */
      /* Extract the four bits of the character that matter in determining
       * vowelness because a vowel might mark a syllable */
      d=(*c&30)>>1;
      if( *c&1  & ( d==7 | ( (!(d&1)) & (d<6|d>8) ) )
      ) { /* These bits 7 or even and not 6, 8 make for a
             vowel */
        printf("Vowel: '%c' (mangled as %d [0x%x]) counts:%d\n",*c,d,d,!v);
        t+=!v++;
      } else v=0; /* Not a vowel so set the vowel flag to zero */
    }else v=a=0; /* this input not alphabetic, so set both the
            alphabet and vowel flags to zero... */
  }
  printf("Syllables: %3i\n",t);
  printf("Words:     %3i       (t/w) = %f\n",w,(1.0*t/w));
  printf("Sentences: %3i       (w/s) = %f\n",s,(1.0*w/s));
  /* Constants tweaked here due to bad counting behavior ...
   * were:       1.015     84.6 */
  return 206.835-1.   *w/s-82. *t/w;
}
main(c){
  int i=0,n=100;
  char*buf=malloc(n);
  /* Suck in the whole input at once, using a dynamic array for storage */
  while((c=getc(stdin))!=-1){
    if(i==n-1){ /* Leave room for the termination */
      n*=1.4;
      buf=realloc(buf,n);
      printf("Reallocated to %d\n",n);
    }
    buf[i++]=c;
    printf("%c %c\n",c,buf[i-1]);
  }
  /* Be sure the string is terminated */
  buf[i]=0;
  printf("'%s'\n",buf);
  printf("%f\n",R/*eadability*/(buf));
}

Output: (using the scaffold from the long version, but the golfed function.)

$ gcc readability_golf.c
readability_golf.c:1: warning: data definition has no type or storage class
$ ./a.out < readability1.txt 
'I would not, could not, in the rain.
Not in the dark, not on a train.
Not in a car, not in a tree.
I do not like them, Sam, you see.
Not in a house, not in a box.
Not with a mouse, not with a fox.
I will not eat them here or there.
I do not like them anywhere!
'
104.074631    
$ ./a.out < readability2.txt
'It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape
the vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.
'
63.044090
$ ./a.out < readability3.txt 
'When in the Course of human events, it becomes necessary for one people to
dissolve the political bands which have connected them with another, and to
assume among the powers of the earth, the separate and equal station to
which the Laws of Nature and of Nature's God entitle them, a decent respect
to the opinions of mankind requires that they should declare the causes
which impel them to the separation.
'
-1.831667

Deficiencies:

  • The sentence counting logic is wrong, but I get away with it because only one of the inputs has a ! or a ?.
  • The word counting logic will treat contractions as two words.
  • The syllable counting logic will treat those same contractions as one syllable. But it probably overcounts on average (for instance, "there" is counted as two syllables, and many words ending in e will be counted one too many), so I've applied a constant correction factor of 96.9%.
  • Assumes an ASCII character set.
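  • Taken on its own, the bit test for vowels also accepts a few non-letters (] and }, for example), which clearly isn't right; only the isalpha check keeps them from being counted.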
  • Lots of reliance on K&R semantics makes this ugly, but hey, it's code golf.

Things to look at:

  • I am (momentarily) ahead of both Python solutions here, even if I am trailing the Perl.

  • Get a load of the horrible thing I did for detecting vowels. It makes some sense if you write the ASCII representations out in binary and read the comment in the long version.
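
To see what the vowel test actually accepts, the expression can be mirrored in Python and run over the ASCII letters. This is a sketch of one reading of the C condition, with `is_vowel_bits` as a made-up name.

```python
import string

def is_vowel_bits(ch):
    """Mirror of the C test: *c&1 & (d==7 | (!(d&1) & (d<6 | d>8)))."""
    code = ord(ch)
    d = (code & 30) >> 1   # bits 1..4 of the character code
    odd = code & 1         # a, e, i, o, u and y all have odd ASCII codes
    return bool(odd and (d == 7 or (d % 2 == 0 and (d < 6 or d > 8))))

accepted = ''.join(c for c in string.ascii_letters if is_vowel_bits(c))
print(accepted)
```

Among the letters, only a, e, i, o, u and y (in either case) pass. Applied to punctuation, the same test rejects [ and { but accepts ] and }, which is why the isalpha guard in front of it matters.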

dmckee --- ex-moderator kitten

Posted 2013-01-25T02:07:43.577

Reputation: 2 726

"I had to alter the formula a little by hand to get acceptable results." This may be bad form. – Joe Z. – 2013-01-25T14:07:36.857

I have now at least followed Strigoides' lead and made the adjustments on the basis of how the text comprehension makes errors, rather than a purely ad hoc tweak to bring the three test cases into agreement. – dmckee --- ex-moderator kitten – 2013-01-25T22:22:09.287

2

Python, 202 194 188 184 171 167 characters

import re
def R(i):r=re.split;w=len(r(r'[ \n]',i));s=r('\\.',i);y=r('[^aeiou](?i)+',i);return 206.835-1.015*w/(len(s)-s.count('\n'))-84.6*(len(y)-y.count(' ')-2)*.98/w

First, get the total number of words by splitting along spaces and newlines:

w=len(r(r'[ \n]',i))

Then, the formula. Sentence and syllable counts are only used once, so they are embedded in this expression.

Sentences are simply the input split along ., with newlines filtered out:

s=r('\\.',i);s=len(s)-s.count('\n')

Syllables consist of the input split along non-vowels, with spaces removed. This seems to consistently slightly over-estimate the number of syllables, so we need to adjust it down (about .98 seems to do it):

y=r('[^aeiou](?i)+',i);y=len(y)-y.count(' ')-2;
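
Pulled apart, the same three counts can be sketched in Python 3 as below. This is an approximation of the idea rather than a transcription of the golfed function: it counts periods directly, splits words on any whitespace, and uses re.findall on vowel runs instead of splitting on non-vowels. The name `approx_fre` is made up for the example.

```python
import re

def approx_fre(text):
    words = len(text.split())        # whitespace-separated tokens
    sentences = text.count('.')      # periods only
    # Runs of vowels (y excluded), case-insensitive.
    syllables = len(re.findall(r'[aeiou]+', text, re.I))
    # 0.98 is the downward correction for over-estimated syllables.
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables * 0.98 / words

print(approx_fre("The cat sat."))
```

Because the counting details differ, this sketch will not reproduce the golfed function's output digit for digit.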

202 -> 194: len(x)-2 rather than len(x[1:-1]). Removed unnecessary brackets. Made syllable regex case-insensitive.

194 -> 188: The file was previously saved as dos rather than unix file format, causing wc -c to count newlines as two characters. Whoops.

188 -> 184: Get rid of those nasty x for x in ... if x!=...s by storing the intermediate result and subtracting x.count(...)

184 -> 171: Remove input/output, and convert to function

171 -> 167: Insert the len(x)-x.count(...)s into the formula

Strigoides

Posted 2013-01-25T02:07:43.577

Reputation: 1 025

Your answer does not have to include the input and output procedures. – Joe Z. – 2013-01-28T22:11:40.310

@JoeZeng Oh, okay. I'll turn it into a function then. – Strigoides – 2013-01-29T03:23:09.627

1

Python 380 characters

import re
def t(p):
 q=lambda e: e!=''
 w=filter(q,re.split('[ ,\n\t]',p))
 s=filter(q,re.split('[.?!]',p))
 c=len(w)*1.0
 f=c/len(s)
 return w,f,c
def s(w):
 c= len(re.findall(r'([aeiouyAEIOUY]+)',w))
 v='aeiouAEIOU'
 if len(w)>2 and w[-1]=='e'and w[-2]not in v and w[-3]in v:c-= 1
 return c
def f(p):
 w,f,c=t(p)
 i=0
 for o in w:
  i+=s(o)
 x=i/c
 return 206.835-1.015*f-84.6*x

This is a rather long solution, but it works well enough, at least on the three test cases provided.
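
The syllable rule in s() does most of the work: one syllable per run of vowels (y included), minus one when the word ends in a vowel, a consonant, and then e, which is usually a silent e. A Python 3 sketch of just that rule, with longer names chosen for the example:

```python
import re

VOWELS = 'aeiouAEIOU'

def syllables_in(word):
    # One syllable per run of vowels (y counts as a vowel here).
    count = len(re.findall(r'[aeiouyAEIOUY]+', word))
    # Subtract one for a likely silent final e: vowel, consonant, then 'e'.
    if (len(word) > 2 and word[-1] == 'e'
            and word[-2] not in VOWELS and word[-3] in VOWELS):
        count -= 1
    return count

for word in ("like", "here", "tree", "the", "anywhere"):
    print(word, syllables_in(word))
```

Words like "the" keep their single syllable because the letter two places before the final e is not a vowel.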

Testing code

def test():
 test_cases=[['I would not, could not, in the rain.\
        Not in the dark, not on a train.\
        Not in a car, not in a tree.\
        I do not like them, Sam, you see.\
        Not in a house, not in a box.\
        Not with a mouse, not with a fox.\
        I will not eat them here or there.\
        I do not like them anywhere!', 111.38, 103.38, 119.38],\
        ['It was a bright cold day in April, and the clocks were striking thirteen.\
        Winston Smith, his chin nuzzled into his breast in an effort to escape\
        the vile wind, slipped quickly through the glass doors of Victory Mansions,\
        though not quickly enough to prevent a swirl of gritty dust from entering\
        along with him.', 65.09, 57.09, 73.09],\
        ["When in the Course of human events, it becomes necessary for one people to\
        dissolve the political bands which have connected them with another, and to\
        assume among the powers of the earth, the separate and equal station to\
        which the Laws of Nature and of Nature's God entitle them, a decent respect\
        to the opinions of mankind requires that they should declare the causes\
        which impel them to the separation.", 3.70, -4.30, 11.70]]
 for case in test_cases:
  fre= f(case[0])
  print fre, case[1], (fre>=case[2] and fre<=case[3])

if __name__=='__main__':
 test()

Result -

elssar@elssar-laptop:~/code$ python ./golf/readibility.py
108.910685484 111.38 True
63.5588636364 65.09 True
-1.06661971831 3.7 True

I used the syllable counter from here - Counting syllables

A more readable version is available here

elssar

Posted 2013-01-25T02:07:43.577

Reputation: 579

if len(w)>2 and w[-1]=='e'and w[-2]not in v and w[-3]in v:c-= 1 Simple minded but a good approximation. I like it. – dmckee --- ex-moderator kitten – 2013-01-25T11:06:05.460

0

Javascript, 191 bytes

t=prompt(q=[]);s=((t[m="match"](/[!?.]+/g)||q)[l="length"]||1);y=(t[m](/[aeiouy]+/g)||q)[l]-(t[m](/[^aeiou][aeiou][s\s,'.?!]/g)||q)[l]*.33;w=(t.split(/\s+/g))[l];alert(204-1.015*w/s-84.5*y/w)

First test case gives 112.9 (correct answer is 111.4, off by 1.5 points)

Second test case gives 67.4 (correct answer is 65.1, off by 2.3 points)

Third test case gives 1.7 (correct answer is 3.7, off by 2.0 points)

SuperJedi224

Posted 2013-01-25T02:07:43.577

Reputation: 11 342