Count spelling errors in text; minimize the number of spelling errors in your code

Write a program or function that takes two inputs:

  1. A text message
  2. A dictionary of the English language, as it appears in this GitHub file (containing about 60000 words)

and outputs the number of spelling errors in the message (see below for definition and test cases).

You can receive the dictionary as a parameter to your function, as a pre-defined file that your program expects to find, as hard-coded data in your code, or in any other sensible manner.


Your code should itself look like a text message, with a minimal number of spelling errors. So, you will calculate the score of your code by feeding it to itself as input.

The winner is the code with the lowest score (the minimum possible score is 0). If several answers have the same score, the winner is decided by code size (in characters). If two answers are still tied, the earlier one wins.
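The tie-breaking order above amounts to comparing tuples. Here is a minimal sketch in Python, assuming each answer is a record with hypothetical `score`, `size`, and `posted` fields (these names are made up for illustration and are not part of the challenge):

```python
def pick_winner(answers):
    """Return the winning answer: lowest score, then smallest size, then earliest post.

    Each answer is a dict with hypothetical keys "score" (spelling errors),
    "size" (characters) and "posted" (an ISO 8601 timestamp string, which
    sorts chronologically when compared as text).
    """
    return min(answers, key=lambda a: (a["score"], a["size"], a["posted"]))


# Made-up entries, purely for illustration.
answers = [
    {"name": "first", "score": 0, "size": 900, "posted": "2017-01-23T10:00:00"},
    {"name": "second", "score": 0, "size": 450, "posted": "2017-01-25T08:30:00"},
    {"name": "third", "score": 3, "size": 120, "posted": "2017-01-22T18:00:00"},
]
print(pick_winner(answers)["name"])
```

Tuples compare element by element, so size only matters when the scores are equal, and the timestamp only when both score and size are equal.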


If required, you can assume the input message to be ASCII (bytes 32...126) with newlines encoded in a conventional manner (1 byte "10" or 2 bytes "13 10"), and non-empty. However, if your code has non-ASCII characters, it should also support non-ASCII input (so it can calculate its own score).

Characters are subdivided into the following classes:

  • Letters a...z and A...Z
  • Whitespace (defined here as either the space character or the newline character)
  • Punctuation . , ; : ! ?
    • Sentence-ending . ! ?
  • Garbage (all the rest)
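The classes above can be sketched as a small Python function. The function name and the returned labels are my own choices, not part of the challenge; sentence-ending characters get their own label here even though they are also punctuation.

```python
def classify(ch):
    """Return the class of a single character, following the list above."""
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "letter"
    if ch in " \n":
        return "whitespace"
    if ch in ".!?":
        return "sentence-ending"  # a sub-class of punctuation
    if ch in ",;:":
        return "punctuation"
    return "garbage"  # digits, quotes, brackets, tabs, non-ASCII, ...


for ch in "a?;1":
    print(repr(ch), classify(ch))
```

Note that digits, apostrophes, and tabs all fall into the garbage class under this definition.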

A word is defined as a sequence of letters, which is maximal (i.e. neither preceded nor followed by a letter).

A sentence is defined as a maximal sequence of characters that are not sentence-ending.
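To illustrate the two definitions, here is a rough Python sketch using regular expressions. The helper names `words` and `sentences` are hypothetical. Note that the sentence-ending characters themselves belong to no sentence, and that two of them in a row produce an empty sentence.

```python
import re


def words(text):
    """Maximal runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


def sentences(text):
    """Maximal runs of characters that are not sentence-ending."""
    return re.split(r"[.!?]", text)


message = "What ? No apostrophe's??"
print(words(message))      # ['What', 'No', 'apostrophe', 's']
print(sentences(message))  # ['What ', " No apostrophe's", '', '']
```

The apostrophe is not a letter, so it splits `apostrophe's` into two words, and the message ends with an empty last sentence.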

A character is a spelling error if it violates any of the spelling rules:

  1. A letter must belong to a dictionary word (or, in other words: each word of length N that doesn't appear in the dictionary counts as N spelling errors)
  2. The first character in a sentence, ignoring any initial whitespace characters, must be an uppercase letter
  3. All letters must be lowercase, except those specified by the previous rule
  4. A punctuation character is only allowed after a letter or garbage
  5. A newline character is only allowed after a sentence-ending character
  6. Whitespace characters are not allowed at the beginning of the message or after other whitespace characters
  7. There should be no garbage (or, in other words: each garbage character counts as a spelling error)

In addition, the last sentence must be either empty or consist of exactly one newline character (i.e. the message should end with a sentence-ending character and an optional newline - let's call it rule 8).
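For reference, here is an ungolfed Python sketch of rules 1 to 8 as I read them. It is not a competing answer. It rests on a few assumptions that the text does not spell out: dictionary lookup is case-insensitive, a character that breaks several rules counts once, and a rule 8 violation adds exactly one error (the last two are inferred from the test cases). The function name and signature are my own.

```python
import re

WHITESPACE = " \n"
PUNCTUATION = ".,;:!?"
SENTENCE_END = ".!?"


def is_letter(ch):
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


def count_spelling_errors(message, dictionary):
    """Count spelling errors; `dictionary` is any iterable of words."""
    text = message.replace("\r\n", "\n")  # treat "13 10" as one newline
    known = {word.lower() for word in dictionary}  # assumption: ignore case
    bad = set()  # indices of offending characters, so each counts once

    # Rule 1: every letter of a word that is missing from the dictionary.
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() not in known:
            bad.update(range(match.start(), match.end()))

    at_sentence_start = True  # no non-whitespace seen in this sentence yet
    prev = None
    for i, ch in enumerate(text):
        if ch in WHITESPACE:
            # Rule 6: not at the start of the message, not after whitespace.
            if prev is None or prev in WHITESPACE:
                bad.add(i)
            # Rule 5: a newline must follow a sentence-ending character.
            if ch == "\n" and (prev is None or prev not in SENTENCE_END):
                bad.add(i)
        else:
            if ch in PUNCTUATION:
                # Rule 4: punctuation must follow a letter or garbage.
                if prev is None or prev in WHITESPACE or prev in PUNCTUATION:
                    bad.add(i)
            elif not is_letter(ch):
                bad.add(i)  # Rule 7: garbage
            if ch in SENTENCE_END:
                at_sentence_start = True  # the next character opens a sentence
            elif at_sentence_start:
                # Rule 2: first non-whitespace character must be uppercase.
                if not "A" <= ch <= "Z":
                    bad.add(i)
                at_sentence_start = False
            elif "A" <= ch <= "Z":
                bad.add(i)  # Rule 3: uppercase letter elsewhere
        prev = ch

    # Rule 8: whatever follows the last sentence-ending character must be
    # empty or a single newline; a violation adds one error.
    last_end = max(text.rfind(c) for c in SENTENCE_END)
    tail = text[last_end + 1:]
    return len(bad) + (tail not in ("", "\n"))
```

With a toy dictionary, `count_spelling_errors("What ? No apostrophe's??", ["what", "no", "apostrophe"])` gives 4. Collecting character indices in a set is a design choice that keeps a character such as a leading `"` (both garbage and a rule 2 violation) from being counted twice.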

Test cases (below each character is a rule that it violates; after => is the required answer):

Here is my 1st test case!!
           711           4                => 4

main(){puts("Hello World!");}
2   777    883     3     77 78            => 12

  This message starts with two spaces
66                                   8    => 3

What ? No apostrophe's??
     4              71 4                  => 4

  Extra   whitespace   is   BAD!
66      661111111111 66   66333           => 21

Several
lines?
Must be used only to separate sentences.
                                          => 1 (first linebreak is an error: rule 5)

"Come here," he said.
73         7                              => 3 (sentence starts with '"', not 'C')

anatolyg

Posted 2017-01-22T15:53:01.247

Reputation: 10 719

I was expecting a bunch of loopholes, but you seem to have covered them all. +1 from me. – Nathan Merrill – 2017-01-22T16:06:49.787

I think SPL is the winner here. – Gurupad Mamadapur – 2017-01-22T16:46:34.803

.Gertrude is even better. Commands are arbitrary sentences; only word count and average word length matter. – Rainer P. – 2017-01-22T18:03:12.253

I thought "Applescript" when I saw this. Don't have a Mac, though. – PurkkaKoodari – 2017-01-22T20:26:43.303

I think the second test case has two errors: the H and W are wrong because of rule 3, not rule 2. And it would be nice to have some test cases covering the less intuitive aspects of the definition of sentence. E.g. in "Come here," he said. the C should be lower-case because it isn't the first character in the sentence. (IMO that's an error in the definition of sentence; it might be worth deleting the question before anyone answers so that you can sandbox and see whether anyone picks any more holes). – Peter Taylor – 2017-01-22T22:51:29.227

Shouldn't the fourth testcase be 21 instead of 17? – smls – 2017-01-23T00:29:46.383

@PeterTaylor I don't want the rules to become too complicated. Your test case is fine; I added it to my post. – anatolyg – 2017-01-23T09:14:03.960

@smls That was a bug in counting. I don't have a solution to this challenge, so was counting stuff manually. – anatolyg – 2017-01-23T09:14:53.090

@anatolyg Why is the word whitespace full of 1s? What am I missing? – None – 2017-01-23T15:31:38.127

@Masterzagh Rule 1 says "a letter must belong to a dictionary word", and "whitespace" isn't in the dictionary. – ETHproductions – 2017-01-23T15:46:06.263

@ETHproductions So that rule means that for every group of letters that doesn't form a dictionary word you count group length errors? I just got really confused by the rule. – None – 2017-01-23T15:58:14.273

@Masterzagh Yes. I have updated the post to clarify this. – anatolyg – 2017-01-23T16:01:29.907

Oh, this is going to be fun. Here's my first line: Def check message, words out errors? (Sadly DEF isn't a word...) – 12Me21 – 2018-05-02T13:51:16.623

Since Lenguage can easily reach score 0, can other languages reach score 0 without such long code? – l4m2 – 2018-05-21T08:00:02.447

Answers

Perl 6, 134 spelling errors

my token punctuation {<[.,;:!?]>}
my \text = slurp; my \mistakes=[]; for split /\.|\!|\?/, text { for .trim.match: :g, /<:letter>+/ -> \word { (append mistakes, .comb when none words slurp pi given lc word) or (push mistakes, $_ if ((.from or word.from) xor m/<[a..z]>/) for word.match: :g, /./) }}
append mistakes, comb / <after \s | <punctuation>> <punctuation> | <!before <punctuation> | <:letter> | \s> . | <!after \.|\!|\?> \n | [<before ^> | <after \s>] \s /, text; say mistakes.Numeric

With extra whitespace for readability:

my token punctuation {<[.,;:!?]>}
my \text = slurp;
my \mistakes=[];
for split /\.|\!|\?/, text {
    for .trim.match: :g, /<:letter>+/ -> \word {
        (append mistakes, .comb when none words slurp pi given lc word)
        or
        (push mistakes, $_ if ((.from or word.from) xor m/<[a..z]>/) for word.match: :g, /./)
    }
}
append mistakes, comb /
  <after \s | <punctuation>> <punctuation>
  | <!before <punctuation> | <:letter> | \s> .
  | <!after \.|\!|\?> \n
  | [<before ^> | <after \s>] \s
/, text;
say mistakes.Numeric

Notes:

  • Expects the dictionary in a file called 3.14159265358979 in the current working directory.
  • The only inspired part is the line
    append mistakes, .comb when none words slurp pi given lc word,
    the rest is pretty bad. But maybe it can at least serve as a baseline for better solutions... :)

smls

Posted 2017-01-22T15:53:01.247

Reputation: 4 352

The most readable perl code ever – user41805 – 2017-01-23T09:22:14.393