Czech National Corpus

The Czech National Corpus (CNC; Czech: Český národní korpus) is a large electronic corpus of written and spoken Czech, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics.[1] The ICNC collaborates with over 200 researchers and students (mainly for spoken and parallel data acquisition), 270 publishers (as text providers), and other similar research projects.

Areas of focus

The Czech National Corpus focuses systematically on the following areas:[2]

Synchronic written corpora: the SYN-series corpora map the Czech language of the 20th and 21st centuries (especially the last twenty years) and form the core of the project. Texts are enriched with metadata, lemmatization, and morphological tagging.[3]

Contemporary spontaneous spoken Czech: the ORAL-series corpora contain contemporary, spontaneous spoken language used in informal situations throughout the Czech Republic (as opposed to the prepared, broadcast, or scripted texts generally found in spoken corpora).[4]

Multilingual parallel corpus: InterCorp is a large corpus of Czech texts aligned at the sentence level with translations to or from more than 30 languages. The core of the corpus consists of manually aligned and proofread fiction texts.[5]

Diachronic corpus of Czech: the DIAKORP corpus of historical Czech includes texts from the 14th century onwards. The current focus of DIAKORP is on the 19th century. The long-term goal of DIAKORP is to create a corpus covering the period from 1850 to the present and to interconnect its data with the SYN series.[6]

Specialised linguistic data: the ICNC is also involved in the collection of language data for specific research purposes, including DIALEKT (dialectal speech), CzeSL (texts written by non-native learners of Czech), DEAF (Czech texts written by the deaf), and Jerome (translated and non-translated Czech).

References

[1] "Institute of the Czech National Corpus". Institute of the Czech National Corpus. Retrieved 8 January 2019.
[2] Křen, Michal. "Recent Developments in the Czech National Corpus" (PDF). Publication Server of the Institute for German Language. Retrieved 8 January 2019.
[3] M. Hnátková, M. Křen, P. Procházka, and H. Skoumalová (2014). "The SYN-series corpora of written Czech". Proceedings of LREC2014: 160–164. S2CID 2586912.
[4] L. Válková, M. Waclawičová, and M. Křen (2012). "Balanced data repository of spontaneous spoken Czech" (PDF). Proceedings of LREC2012: 3345–3349. Retrieved 9 January 2019.
[5] F. Čermák and A. Rosen (2012). "The case of InterCorp, a multilingual parallel corpus" (PDF). International Journal of Corpus Linguistics. 17 (3): 411–427. doi:10.1075/ijcl.17.3.05cer. Retrieved 9 January 2019.
[6] K. Kučera and M. Stluka (2014). "Corpus of 19th century Czech texts: Problems and solutions" (PDF). Proceedings of LREC2014: 165–168. Retrieved 9 January 2019.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
