Cambridge English Corpus

The Cambridge English Corpus (CEC), formerly the Cambridge International Corpus, is a multi-billion-word corpus of the English language containing both written and spoken data. It draws on a number of sources and covers both British and American English. The CEC also contains the Cambridge Learner Corpus, a 40-million-word corpus made up of English exam responses written by English language learners.

The Cambridge English Corpus is used to inform Cambridge University Press English Language Teaching publications as well as for research in corpus linguistics. Access is currently restricted to authors and researchers working on projects and publications for Cambridge University Press, and researchers at Cambridge English Language Assessment.[1]

Written data

The Cambridge English Corpus contains instances of modern written English, taken from newspapers, magazines, novels, letters, emails, textbooks, websites, and many other sources.

Spoken data

The Cambridge English Corpus contains a wide variety of spoken English taken from many sources, including everyday conversations, telephone calls, radio broadcasts, presentations, speeches, meetings, TV programmes and lectures.

Cambridge Learner Corpus

The Cambridge Learner Corpus (CLC) is a collection of exam scripts written by students learning English, built in collaboration with Cambridge English Language Assessment. The CLC contains scripts from over 180,000 students from around 200 countries, who between them speak 138 different first languages, and it continues to grow.[2] The exams currently included are:

KET Key English Test (and KET for schools)
PET Preliminary English Test (and PET for schools)
FCE First Certificate in English
CAE Certificate in Advanced English
CPE Certificate of Proficiency in English
BEC Business English Certificate (all levels)
IELTS International English Language Testing System (academic and general training)
CELS Certificates in English Language Skills
ILEC International Legal English Certificate
ICFE International Certificate in Financial English
Skills for Life

A unique feature of the Cambridge Learner Corpus is its error coding system: language specialists identify and annotate errors in the exam scripts. The Corpus can therefore be used to find out how frequent different types of error are, the contexts in which errors are made, and which groups of students find particular language areas difficult.[3]
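The kind of query that error annotation makes possible can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the error-type labels, and the sample data are invented and do not reproduce the actual CLC coding scheme or its contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedError:
    """One annotated learner error (hypothetical record layout)."""
    error_type: str       # invented label, not a real CLC error code
    first_language: str   # learner's first language
    exam: str             # exam the script was written for


# Invented sample records, for illustration only.
errors = [
    AnnotatedError("missing_determiner", "Polish", "FCE"),
    AnnotatedError("missing_determiner", "Polish", "PET"),
    AnnotatedError("wrong_preposition", "Spanish", "FCE"),
    AnnotatedError("missing_determiner", "Spanish", "FCE"),
    AnnotatedError("spelling", "Polish", "FCE"),
]


def count_by(records, *fields):
    """Count records grouped by the given attribute names."""
    return Counter(tuple(getattr(r, f) for f in fields) for r in records)


# Frequency of each error type across all scripts.
by_type = count_by(errors, "error_type")

# Frequency of each error type within each first-language group.
by_type_and_l1 = count_by(errors, "error_type", "first_language")

print(by_type.most_common())
print(by_type_and_l1.most_common())
```

Grouping by additional fields, such as the exam, would give a rough view of how error patterns differ by level in the same way.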

Authors of Cambridge English Language Teaching resources can use this information to target common errors – for example, the Cambridge Advanced Learner’s Dictionary contains ‘Common mistake’ features which highlight frequent learner errors.

Conversely, the error coding system also reveals what students can achieve at each level. This is central to the work of English Profile, a collaborative programme to enhance the learning, teaching and assessment of English worldwide.[4] The founding partners are Cambridge University Press, Cambridge English Language Assessment, the University of Cambridge, the University of Bedfordshire, the British Council and English UK.[5] The project’s aim is to describe what learners know and can do in English at each level of the Common European Framework of Reference (CEFR).[6]

Specialized corpora

The Cambridge English Corpus contains a number of specialized corpora:

Cambridge Business English Corpus

The Cambridge Business English Corpus is a large collection of British and American business language, including reports and documents, books relating to different aspects of business, and the business sections from many national newspapers.

The Cambridge Business English Corpus also includes the Cambridge and Nottingham Spoken Business English Corpus (CANBEC), the result of a joint project between Cambridge University Press and the University of Nottingham. This is a collection of recordings of English from companies of all sizes, ranging from big multinational companies to small partnerships. It contains formal and informal meetings, presentations, telephone conversations, lunchtime conversations, and spoken language from other business situations.

Cambridge Legal English Corpus

The Cambridge Legal English Corpus contains books, journals and newspaper articles relating to the law and legal processes.

Cambridge Financial English Corpus

The Cambridge Financial English Corpus contains texts relating to economics and finance, including leading financial magazines and newspapers.

Cambridge Academic English Corpus

The Cambridge Academic English Corpus contains written and spoken academic language at undergraduate and postgraduate level from a range of US and UK institutions, including lectures, seminars, student presentations, journals, essays and textbooks.

CANCODE

The Cambridge and Nottingham Corpus of Discourse in English (CANCODE) is a collection of spoken English recorded at hundreds of locations across the British Isles in a wide variety of situations (e.g. casual conversation, socialising, finding out information, and discussions). The CANCODE corpus is the result of a joint project between Cambridge University Press and the University of Nottingham.

The CANCODE corpus contains about five million words and is a very rich resource for researchers of spoken English. However, the data does have some limitations. Most speakers knew they were being recorded, and were chatting in informal situations, such as while relaxing at home, with others of fairly equal social status. As a result, the interactions are generally consensual and collaborative, and the corpus contains minimal evidence of conflict or adversarial exchanges.[7]

Cambridge-Cornell Corpus of Spoken North American English

The Cambridge-Cornell Corpus of Spoken North American English is a large collection of informal, highly interactive, multiparty conversations between family and friends in North America. It is the result of a joint project between Cambridge University Press and Cornell University.

CAMSNAE

The Cambridge Corpus of Spoken North American English (CAMSNAE) is a large collection of spoken American English. It includes recordings of people going about their everyday life – at work, at home with their families, going shopping, having meals, etc.

References

[1] Cambridge International Corpus, http://www.cambridge.org/us/esl/catalog/subject/custom/item3637700/Cambridge-International-Corpus-Cambridge-International-Corpus/?site_locale=en_US
[2] Cambridge Learner Corpus, http://www.cambridge.org/us/esl/catalog/subject/custom/item3646603/Cambridge-International-Corpus-Cambridge-Learner-Corpus/?site_locale=en_US
[3] Diane Nicholls, http://ucrel.lancs.ac.uk/publications/CL2003/papers/nicholls.pdf
[4] English Profile project, http://www.englishprofile.org/index.php?option=com_content&view=article&id=11&Itemid=2 Archived 2011-09-14 at the Wayback Machine
[5] English Profile, http://www.englishprofile.org/index.php?option=com_content&view=article&id=24&Itemid=22 Archived 2011-05-07 at the Wayback Machine
[6] Council of Europe, CEFR levels. Archived from the original on 2009-10-30. Retrieved 2009-11-05.
[7] Carter (2004) Language and Creativity: The Art of Common Talk. London: Routledge.

External links

cambridge.org/corpus

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
