Truecasing

Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages).

Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This includes all languages not written in the Latin, Greek, Cyrillic or Armenian alphabets, such as Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.

Techniques

Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
Part-of-speech tagging can be used to identify proper nouns, which must be capitalized. In some cases, the same word can be used as different parts of speech, and is capitalized differently. For example, Xerox the company, as a noun, is capitalized, but to xerox a document, as a verb, is not capitalized. A xerox, as in the copy of a document, can be recognized by the presence of a determiner, which is not used for proper nouns.
Named entity recognition can be used to identify proper nouns, which must be capitalized.
A spell checker can be used to identify words that are always capitalized.

Applications

Truecasing aids in other NLP tasks, such as named entity recognition, automatic content extraction, and machine translation.[1] Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.

gollark: I only have 10000 pages, this should really be trivial to search.

gollark: Or work out why this is so slow.

gollark: It looks like I MAY need to find an actually good search backend somehow?!

gollark: Oh apioforms, my search thing has horrible performance problems in an entirely possible-to-anticipate way.

gollark: That's totally possible.

References

Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[1] Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.

Natural language processing
General terms	Natural language understanding Text corpus Speech corpus Stopwords Bag-of-words AI-complete n-gram (Bigram, Trigram)
Text analysis	Text segmentation Part-of-speech tagging Text chunking Compound term processing Collocation extraction Stemming Lemmatisation Named-entity recognition Coreference resolution Sentiment analysis Concept mining Parsing Word-sense disambiguation Ontology learning Terminology extraction Textual entailment Truecasing
Automatic summarization	Multi-document summarization Sentence extraction Text simplification
Machine translation	Computer-assisted Example-based Rule-based Neural
Automatic identification and data capture	Speech recognition Speech synthesis Optical character recognition Natural language generation
Topic model	Pachinko allocation Latent Dirichlet allocation Latent semantic analysis
Computer-assisted reviewing	Automated essay scoring Concordancer Grammar checker Predictive text Spell checker Syntax guessing
Natural language user interface	Virtual assistant Chatbot Interactive fiction Question answering Voice user interface