Determining part of speech of most words

2

0

Write a program that outputs the part of speech of any word given to it as input. I doubt that anyone can produce a perfect algorithm, so I will run each program on a batch of words to determine which algorithm is best.

Weighted Scoring:
+10pts for each word identified correctly: http://www.englishclub.com/grammar/parts-of-speech_1.htm (out of 100 randomly selected words)
-1pt for each byte
+5pts for each upvote on your answer
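
As a worked example of the scoring formula above, here is a minimal sketch. The function name `weighted_score` and the sample numbers are illustrative assumptions, not part of the challenge.

```python
def weighted_score(correct_words: int, code_bytes: int, upvotes: int) -> int:
    """Score = +10 per correctly identified word, -1 per byte, +5 per upvote."""
    return 10 * correct_words - code_bytes + 5 * upvotes


# Hypothetical entry: 40 of 100 words correct, 120 bytes of code, 3 upvotes.
print(weighted_score(40, 120, 3))  # 400 - 120 + 15 = 295
```

Since each byte costs one point and each correct word earns ten, an extra rule only pays off if it costs fewer than ten bytes per additional word it identifies correctly.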

Rules:
No external resources (unless they are counted in the size).

UPDATE: Because a standardized word list is needed for testing, I will use the list provided in an answer to this question: http://www-personal.umich.edu/~jlawler/wordlist. It seems to be a fairly complete list, and it will make this more of a challenge than a guessing game.

Stack Tracer

Posted 2014-02-28T02:26:23.977

Reputation: 259

Question was closed 2014-03-03T09:24:54.320

1"+style points (tie-breakers, mostly.)" How are these determined? "no golfscript or other languages designed for code golf." Why not? – Doorknob – 2014-02-28T02:36:44.240

And they are gone. Now that I think of it, the chance of a tie is so astronomically small that it is irrelevant. As to golfscript, – Stack Tracer – 2014-02-28T02:37:20.877

What's wrong with GolfScript? How do you determine "languages designed for code golf"? – Doorknob – 2014-02-28T02:37:45.857

Why block GolfScript &c.? – None – 2014-02-28T02:37:54.207

There is always the sandbox for your future questions: http://meta.codegolf.stackexchange.com/questions/954/proposed-questions-sandbox-mark-viii?cb=1 – None – 2014-02-28T02:38:58.353

fine, use golfscript... I was just hoping to have answers that resemble program code more than machine code... Evidently, people around here love their golfscript just as much as I don't. – Stack Tracer – 2014-02-28T02:39:59.497

Seems good now, although somebody will probably find some way to abuse it. – None – 2014-02-28T02:40:57.017

2"(out of 10 randomly selected words)": Randomly selected from what set? – DavidC – 2014-02-28T02:52:36.173

1His scoring is alright, the answer with the most point wins. – Zero Fiber – 2014-02-28T03:00:36.910

@SampritiPanda Oh, sorry, somehow I got confused. – Victor Stafusa – 2014-02-28T03:22:05.673

2

I believe this is near impossible: http://programmers.stackexchange.com/a/133789/115106 – Zero Fiber – 2014-02-28T03:32:16.717

1You don't lose points for an incorrect identification, so always printing 'verb' should give a decent score. – VBCPP – 2014-02-28T04:23:33.190

Does the target language need to be English? Certain "more logically structured" languages, such as German, would be quite a bit easier. – primo – 2014-02-28T05:06:53.730

@primo, German, or especially Latin, would be easier. It is called a challenge for a reason. – Stack Tracer – 2014-02-28T05:54:25.700

@DavidCarraher, I will pull them from the English language. Probably from some news articles, as those cannot be predicted as easily as other things. – Stack Tracer – 2014-02-28T06:02:25.530

3@StackTracer It may be better to use 1000 randomly selected words, granting one point each, rather than 10 words each worth 100. – primo – 2014-02-28T06:04:55.600

@SampritiPanda, an exact, perfect solution is more or less impossible. a "best-guess" type scenario (not perfect) seems quite possible. – Stack Tracer – 2014-02-28T06:05:42.603

@primo, good idea. Increase Sample Size. And so it is done. – Stack Tracer – 2014-02-28T06:07:09.937

Should it be case sensitive and use punctuation? In that case, interjections would be pretty simple to find – Mark Jeronimus – 2014-02-28T06:33:11.597

3This isn't a spec: it's an invitation to write a mind-reading program. – Peter Taylor – 2014-02-28T09:29:29.643

You really need to add the list of test words to the spec. Prior knowledge of the test cases is not problematic if they are sufficiently diverse to thoroughly cover the problem domain. Moreover, you really need to decide what kind of challenge this is. A popularity contest is decided by upvotes, not by a scoring system you dictate. If you're dictating a scoring system that includes upvotes as part of the formula, it is still not a popularity contest. – Jonathan Van Matre – 2014-02-28T19:07:30.880

Answers

4

Mathematica

Mathematica has a built-in database of words (accessed through the WordData function) that includes, among other properties, each word's parts of speech. Without this database, it would be impossible for me to answer the question.

The word list at http://www-personal.umich.edu/~jlawler/wordlist was downloaded and saved as wordlist.txt. Once stored, it is loaded as follows:

words = Import["wordlist.txt", "Table"];
Length[words]

69905

Forty words are randomly chosen from the list:

list = Flatten@RandomSample[words, 40]

{"mealtime", "deride", "cricketground", "torpedinidae", "rosefish", "akron", "sloppy", "deaden", "assaultive", "mick", "unperceptiveness", "skeg", "pad", "mane", "carvel-built", "night-stop", "monarchy", "corking", "backlog", "scaphopoda", "bagman", "phrthe", "wholesome", "proprietress", "hortus", "unreliable", "farthermost", "fearfulness", "malposed", "stoke", "affiche", "lunatic", "holcus", "devoutness", "macaque", "zenithal", "poetry", "precipitate", "agoraphobic", "snapper"}


The following asks Mathematica to return the parts of speech for the list of 40 words. The parts of speech of 31 words are identified; the remaining 9 words are not recognized.

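(* Pair each word with its parts of speech; drop entries where WordData was left unevaluated, i.e. the word was not recognized *)
(s = Select[Flatten@{{#}, WordData[#, "PartsOfSpeech"]} & /@ 
     list, ! MatchQ[#, {_, WordData[_, _]}] &]) // TableForm
(* Count the recognized words *)
Length[s]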


DavidC

Posted 2014-02-28T02:26:23.977

Reputation: 24 524

1Seems like cheating, using an internal dictionary.. – TheDoctor – 2014-02-28T23:22:35.937

I agree. But it is essentially impossible to determine the part of speech of a word by raw computing power. Some sort of database is required. – DavidC – 2014-02-28T23:48:02.847

1"no using external resources (unless they are counted in the size)" What's your score? -1,000,000? – qwr – 2014-03-01T00:44:56.303

It wasn't an external resource. And the word list did not contain information about parts of speech. It was used because the OP did not provide words to use as test cases. – DavidC – 2014-03-01T01:38:35.417

This is currently the best solution. I do admit that I was hoping for something less cheaty than a database lookup, but there is no rule against it. I could see a way to do it based on word structure, but that gets rather fuzzy due to the nature of English. – Stack Tracer – 2014-03-01T04:31:24.120

I understand your misgivings about the approach. I share them. The only alternative I see to a database-lookup is to create one's own database. That is a challenge that few people are skilled and motivated enough to undertake. – DavidC – 2014-03-01T09:39:06.970

This compares unfavorably with the "always return verb" approach. A heuristic's the way to go for this. (I'd post "always return verb" as an answer, but the question's closed.) – Brilliand – 2014-03-06T23:28:39.343
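
The heuristic approach raised in the comments (guessing from word structure, with a fixed fallback such as "always return verb") might be sketched roughly as follows. This is only an illustration under stated assumptions: the suffix table, the tag names, and the function `guess_part_of_speech` are hypothetical choices, not a tested or tuned rule set, and nobody in the thread posted such a program.

```python
# Hypothetical suffix rules, checked in order; the first match wins.
# These are illustrative guesses about English morphology, not validated rules.
SUFFIX_RULES = [
    ("ness", "noun"),
    ("tion", "noun"),
    ("ment", "noun"),
    ("able", "adjective"),
    ("ous", "adjective"),
    ("ive", "adjective"),
    ("ful", "adjective"),
    ("ize", "verb"),
    ("ify", "verb"),
    ("ly", "adverb"),
]


def guess_part_of_speech(word: str, default: str = "verb") -> str:
    """Guess a part of speech from the word's ending, else return the default."""
    lowered = word.lower()
    for suffix, tag in SUFFIX_RULES:
        # Require some stem before the suffix so a bare "ly" is not tagged.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return tag
    return default


for sample in ["fearfulness", "assaultive", "unreliable", "mane"]:
    print(sample, "->", guess_part_of_speech(sample))
```

Words with no matching suffix, such as "mane", fall through to the default, so the choice of fallback tag matters as much as the rules themselves. Under the byte penalty in the question, a table like this would also have to earn back its own length in correct answers.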