Russian National Corpus
The Russian National Corpus (Russian: Национальный корпус русского языка, lit. 'National Corpus of the Russian Language') is a corpus of the Russian language that has been partially accessible through an online query interface since April 29, 2004. It is being compiled by the Institute of the Russian Language of the Russian Academy of Sciences.
It currently contains more than 1 billion word forms[1] that are automatically lemmatized and tagged for part of speech (POS) and grammemes; that is, every possible morphological analysis of each orthographic form is assigned to it. Lemmata, POS, grammatical features, and their combinations are searchable. In addition, a subcorpus of 6 million word forms has manually resolved homonymy.
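The non-disambiguated tagging described above can be illustrated with a minimal Python sketch. It does not reproduce the RNC's actual data format or query engine: the `Analysis` and `Token` classes, the tag labels, and the `matches` function are all hypothetical names chosen for illustration. Each word form keeps every candidate analysis, and a query succeeds if at least one single analysis satisfies all of its conditions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Analysis:
    """One candidate morphological analysis of a word form (hypothetical schema)."""
    lemma: str
    pos: str
    grammemes: frozenset = frozenset()


@dataclass
class Token:
    """An orthographic form together with all of its possible analyses."""
    form: str
    analyses: list = field(default_factory=list)


def matches(token, lemma=None, pos=None, grammemes=()):
    """Return True if any single analysis satisfies every given condition."""
    required = frozenset(grammemes)
    return any(
        (lemma is None or a.lemma == lemma)
        and (pos is None or a.pos == pos)
        and required <= a.grammemes
        for a in token.analyses
    )


# The form "стали" is ambiguous: a past-tense form of the verb "стать"
# or a case form of the noun "сталь". Tag labels here are illustrative only.
stali = Token("стали", [
    Analysis("стать", "V", frozenset({"past", "pl"})),
    Analysis("сталь", "S", frozenset({"gen", "sg", "f"})),
])
```

In this sketch, `matches(stali, pos="V")` and `matches(stali, lemma="сталь", grammemes=["gen"])` both succeed, while a query for a verb in the genitive fails because no single analysis carries both tags. A manually disambiguated subcorpus would correspond to tokens whose `analyses` list has been reduced to one entry.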
The subcorpus with resolved morphological homonymy is also automatically accentuated (marked for stress). The whole corpus carries searchable lexical-semantic (LS) tagging,[2] covering morphosemantic POS subclasses (proper noun, reflexive pronoun, etc.), LS characteristics proper (thematic class, causativity, evaluation), and derivation (diminutive, adverb formed from an adjective, etc.).
The RNC also includes the following subcorpora:
- a treebank of syntactic dependencies (largely based on Igor Mel'čuk's Meaning-Text Theory);
- English⇔Russian, German⇒Russian, Ukrainian⇔Russian and Belarusian⇔Russian parallel corpora;
- a large (100+ million words) separate corpus of modern newspapers (2001–2011);
- a corpus of Russian poetry, in which rhyming words and poetic prosody (including meter, stanzas, etc.) are additionally tagged;
- a corpus of Russian dialects with specific dialect grammar tagging;
- a multimedia corpus with searchable tagged fragments of Russian-language movies;
- a corpus showing the history of Russian stress;
- an educational subcorpus reflecting school standards.
All texts carry metatextual tags: the author, the author's date of birth, the date of creation, the text size, and the genre (general fiction, detective story, newspaper article, etc.). All of these categories can be browsed and searched separately. Users can define a custom subcorpus and restrict searches for combinations of lemmata, POS and grammeme tags, and semantic tags to that subset.
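Defining a subcorpus from metatextual tags amounts to filtering texts by their metadata before searching. The following Python sketch shows one way this could work; the `TextMeta` fields, the `build_subcorpus` function, and the sample records are hypothetical placeholders, not the RNC's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class TextMeta:
    """Metatextual tags for one text (hypothetical field names)."""
    title: str
    author: str
    author_birth_year: int
    created_year: int
    genre: str
    size_words: int


def build_subcorpus(texts, genre=None, author=None, created_from=None, created_to=None):
    """Return the texts whose metadata satisfy every criterion that was given."""
    selected = []
    for text in texts:
        if genre is not None and text.genre != genre:
            continue
        if author is not None and text.author != author:
            continue
        if created_from is not None and text.created_year < created_from:
            continue
        if created_to is not None and text.created_year > created_to:
            continue
        selected.append(text)
    return selected


# Made-up placeholder records, used only to exercise the filter.
texts = [
    TextMeta("Text A", "Author X", 1950, 1995, "detective story", 80_000),
    TextMeta("Text B", "Author Y", 1970, 2005, "newspaper article", 1_200),
    TextMeta("Text C", "Author X", 1950, 2008, "detective story", 95_000),
]

# Detective stories created in 2000 or later.
subcorpus = build_subcorpus(texts, genre="detective story", created_from=2000)
```

A lexical or grammatical query would then be run only over the texts in `subcorpus` rather than over the whole collection.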
References
- [1] "Национальный корпус русского языка" [Russian National Corpus]. Национальный корпус русского языка (in Russian). Archived from the original on March 5, 2022. Retrieved August 28, 2022.
- [2] Apresjan, Ju.; Boguslavsky, I.; Iomdin, B.; Iomdin, L.; Sannikov, A.; Sizov, V. (2006). A Syntactically and Semantically Tagged Corpus of Russian: State of the Art and Prospects. Proceedings of LREC. Genova, Italy. pp. 1378–1381. CiteSeerX 10.1.1.111.8165.
External links
- Russian National Corpus