Content and structure of the Corpus

The RNC consists primarily of original prose representing standard Russian (from the middle of the 18th century onward). It also includes, in smaller volumes, translated works (parallel with the original texts) and poetry, as well as texts representing non-standard forms of modern Russian: spoken (recordings of oral speech, both spontaneous and public) and dialectal.

The main corpus

The main corpus, which includes texts representing standard Russian, can be subdivided into three parts, each with its own distinguishing features: modern written texts (from the 1950s to the present day), a subcorpus of real-life Russian speech (recordings of oral speech from the same period), and early texts (from the middle of the 18th to the middle of the 20th century). By default, the search is carried out in all three parts. It is possible to choose one of them and add search parameters on the “customize your corpus” page.

Every text included in the main corpus is subject to meta tagging and morphological tagging. Morphological tagging is carried out by computer programs for automated morphological analysis. In a small part of the main corpus (currently around 5 million tokens; this figure is set to increase over time), homonymy is resolved by hand and the results of automated morphological analysis are corrected. This part is the model morphological corpus; it serves as a testing ground for various search algorithms and for programs of morphological analysis and automated processing. It can also be used for research on modern Russian morphology that requires particular precision. Examples from this subcorpus are marked as “disambiguated” (“омонимия снята”). Disambiguated texts are automatically supplied with stress marks (from the Grammatical dictionary of Russian). Stress annotation may be turned off for printing or saving the search results.
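The difference between automatically tagged and hand-disambiguated text can be pictured with a minimal Python sketch. It is only an illustration: the class names, the tag labels and the `chosen` field are assumptions made for this example and do not reproduce the Corpus's actual storage format or tag set.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Analysis:
    """One candidate reading proposed by the automated analyzer."""
    lemma: str
    tags: Tuple[str, ...]  # illustrative tag names, not the Corpus's tag set


@dataclass
class Token:
    form: str
    analyses: List[Analysis]
    chosen: Optional[int] = None  # index picked by a human annotator, if any

    def is_disambiguated(self) -> bool:
        # A token is unambiguous if it has a single reading
        # or if an annotator has selected one of several.
        return len(self.analyses) == 1 or self.chosen is not None

    def reading(self) -> Analysis:
        if len(self.analyses) == 1:
            return self.analyses[0]
        if self.chosen is None:
            raise ValueError(f"homonymy not resolved for {self.form!r}")
        return self.analyses[self.chosen]


# The form "стали" can be a past-tense verb ("стать") or a noun ("сталь").
token = Token(
    form="стали",
    analyses=[
        Analysis("стать", ("V", "past", "pl")),
        Analysis("сталь", ("S", "gen", "sg")),
    ],
)
print(token.is_disambiguated())  # ambiguous after automated analysis
token.chosen = 1                 # manual decision by an annotator
print(token.reading().lemma)
```

In the non-disambiguated part of the corpus a query would have to consider every candidate reading of such a token; in the model morphological corpus only the selected reading remains.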

Modern written texts

The representative corpus of morphologically tagged modern texts is the main and largest subcorpus. Its planned volume is 100 million tokens. The corpus includes various types of texts representing modern standard (written) Russian:

Modern fiction of various genres

Modern drama

Memoirs and biographies

Journalism and literary criticism

Scientific, popular scientific and teaching texts

Religious and philosophical texts

Technical texts

Business and jurisprudence texts

Day-to-day life texts, including texts not intended for publication (letters, diaries, etc.)

Texts are represented in proportion to their share in real-life usage. For example, the share of fiction (including drama and memoirs) does not exceed 40%.

The book, magazine and newspaper texts included in the Corpus usually come from proofread electronic versions supplied by their respective publishers and are used with the publishers' permission.

The search can be limited to modern texts in the Date of creation field of the Customize your corpus page.

Mid-18th to mid-20th century texts

Texts from the middle of the 18th century to the middle of the 20th century are also included in the Corpus and represent various genres (fiction, scientific texts, journalism, letters). However, due to the limited availability of such texts in electronic form or in modern reprints, the proportion of fiction for this period is much higher than for the main corpus. Pre-1918 texts are given in modern orthography; peculiarities of their original orthography that are preserved in modern academic editions are also preserved in the Corpus.

Deeply Annotated Corpus

This subcorpus of the RNC contains texts augmented with morphosyntactic annotation. Besides the morphological information ascribed to each word in the text, every sentence has its syntactic structure marked up.

The Deeply Annotated Corpus (DAC) uses dependency trees as its annotation formalism. The nodes of such a tree are the words of the sentence, while its edges are labeled with the names of syntactic relations. This way of representing syntactic structure originates from the “Meaning ⇔ Text” linguistic model by Igor A. Mel’čuk and Alexander K. Zholkovsky. The inventory of syntactic relations for the DAC, as well as other specific linguistic decisions on how to represent the syntax of Russian sentences, has been developed in the Laboratory for Computational Linguistics, Institute for Information Transmission Problems, Russian Academy of Sciences, which compiled the DAC.
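A dependency tree of the kind described above can be sketched in a few lines of Python. This is a hedged illustration only: the `Node` class, the helper functions and the relation labels (`predicative`, `1st-completive`) are placeholders chosen for the example and are not taken from the DAC's actual relation inventory or file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    index: int               # position of the word in the sentence
    form: str                # the word itself
    head: Optional[int]      # index of the governing word; None for the root
    relation: Optional[str]  # label of the edge from the head (placeholder names)


def root(nodes: List[Node]) -> Node:
    roots = [n for n in nodes if n.head is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree must have exactly one root")
    return roots[0]


def dependents(nodes: List[Node], index: int) -> List[Node]:
    return [n for n in nodes if n.head == index]


def is_tree(nodes: List[Node]) -> bool:
    """Check for a single root and that every word reaches it without cycles."""
    by_index = {n.index: n for n in nodes}
    if sum(n.head is None for n in nodes) != 1:
        return False
    for n in nodes:
        seen = set()
        cur = n
        while cur.head is not None:
            if cur.index in seen or cur.head not in by_index:
                return False
            seen.add(cur.index)
            cur = by_index[cur.head]
    return True


# "Мама мыла раму" ("Mom washed the frame"): the verb governs both nouns.
sentence = [
    Node(1, "Мама", 2, "predicative"),
    Node(2, "мыла", None, None),
    Node(3, "раму", 2, "1st-completive"),
]
print(root(sentence).form)
print([(n.form, n.relation) for n in dependents(sentence, 2)])
print(is_tree(sentence))
```

Storing only the head index and the edge label per word is enough to recover the whole tree, which is why dependency annotation is commonly kept as one record per word.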

Unlike the morphologically annotated portion of the RNC, the DAC only contains fully disambiguated annotations (i.e. both morphological and syntactic ambiguity is resolved).

Parallel text corpus

The parallel text corpus is a special type of corpus in which a Russian text is paired with its translation into another language, or a foreign-language text with its Russian translation. The units of the original and the translated texts (usually, a unit is a sentence) are matched through a procedure known as “alignment”. An aligned parallel corpus is an important tool for various types of research, including studies on the theory of translation; it can also be used as a language teaching tool.
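The result of matching units between an original and its translation can be pictured as a list of sentence pairs. The Python sketch below is an assumption-laden illustration: the `AlignedUnit` class, the `find_pairs` helper and the two sample sentences are invented for this example and are not drawn from the Corpus or its interface.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlignedUnit:
    """One matched unit: a Russian sentence and its counterpart."""
    russian: str
    translation: str


def find_pairs(corpus: List[AlignedUnit], word: str) -> List[AlignedUnit]:
    """Return the units whose Russian side contains the exact word form."""
    needle = word.lower()
    return [
        unit for unit in corpus
        if needle in re.findall(r"\w+", unit.russian.lower())
    ]


# Invented sample data, used only to show the shape of the structure.
corpus = [
    AlignedUnit("Я читаю книгу.", "I am reading a book."),
    AlignedUnit("Книга лежит на столе.", "The book is on the table."),
]

for unit in find_pairs(corpus, "книгу"):
    print(unit.russian, "|", unit.translation)
```

Because each hit carries its counterpart, a search on one side immediately shows how the unit was rendered on the other, which is what makes such a corpus useful for translation studies and teaching.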

Dialectal corpus

The dialectal corpus contains recordings of dialectal speech (presented in loosely standardized orthography) from different regions of Russia. Phonetic variation is not represented, but the morphological, syntactic and lexical peculiarities of these texts are preserved. The subcorpus employs special tags for specifically dialectal morphological features (including those absent from the standard language); moreover, purely dialectal lexemes are supplied with commentary.

Poetry corpus

At the moment the poetry corpus covers the period from 1750 to the 1890s, but it also includes some poets of the 20th century; currently, dramatic works written in verse are not included. Apart from the usual morphological tagging (identical to that available for the non-disambiguated corpus), there are a number of tags adapted for poetry. For example, it is possible to search for texts written in various poetic meters, such as amphibrach.
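To make the notion of a meter tag concrete: an amphibrach is a three-syllable foot with the stress on the middle syllable. The Python sketch below checks a line's stress pattern against that scheme. It is a deliberately simplified illustration; the function name and the string encoding of stresses are assumptions for this example, real verse allows skipped or extra stresses, and nothing here describes how the poetry corpus actually assigns its tags.

```python
def fits_amphibrach(pattern: str) -> bool:
    """Check a strict amphibrachic stress pattern.

    pattern holds one character per syllable: '1' for a stressed
    syllable, '0' for an unstressed one. Stresses must fall on the
    2nd, 5th, 8th ... syllables; the final foot may be truncated.
    """
    if not pattern or set(pattern) - {"0", "1"} or "1" not in pattern:
        return False
    return all((ch == "1") == (i % 3 == 1) for i, ch in enumerate(pattern))


# Nekrasov's line "Однажды в студёную зимнюю пору" has twelve syllables
# with stresses on the 2nd, 5th, 8th and 11th.
print(fits_amphibrach("010010010010"))
print(fits_amphibrach("100100100"))  # dactylic pattern
print(fits_amphibrach("01010101"))   # iambic pattern
```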

Educational corpus

The educational corpus is a small disambiguated corpus adapted for the Russian educational program. It includes works of fiction from the school reading list and is annotated with several additional morphological features.

Corpus of Spoken Russian

The Corpus of Spoken Russian includes recordings of public and spontaneous spoken Russian and transcripts of Russian films. The spoken samples are transcribed in standard orthography. Lexical, morphological and semantic queries are supported. Users can also build their own subcorpora, including on the basis of sociological parameters. The corpus contains samples of different genres and types and of different geographic origins (Moscow, Saint Petersburg, Saratov, Ulyanovsk, Taganrog, Yekaterinburg, and so on). The corpus covers the period from 1930 to 2007.