Meta tagging

An important part of the Corpus' search engine is the meta tagging (or meta description) of the texts. Meta tagging is the process of supplying each text with a set of attributes that characterize the circumstances of its creation, its author, genre, etc. The main purpose of meta tagging is to let users of the Corpus restrict the search by external parameters: for example, to search memoirs only, texts authored by women, texts by authors born between 1940 and 1960, and so on.

Given the size of the Corpus and the variety of texts it contains, such differentiation is a necessity: few researchers will work with the whole Corpus; most will work with the texts that are most important for their research. Another interesting type of research that can be carried out with the help of meta tagging is the study of correlations between various metatextual parameters, such as the author's age or gender, and linguistic peculiarities.

Users can create their own subcorpora on the Customize your corpus page and search only those texts that match the chosen parameters. The parameters that can be used to create subcorpora are described below.
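
As a rough illustration of how such a selection works, the sketch below filters a small in-memory list of text records by metadata. The TextMeta record, its field names, the build_subcorpus helper and the sample entries are all hypothetical; they are not the Corpus' actual data model or interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TextMeta:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    author_gender: str       # "female" or "male"
    author_birth_year: int
    genre: str
    word_count: int


Predicate = Callable[[TextMeta], bool]


def build_subcorpus(texts: Iterable[TextMeta], *conditions: Predicate) -> List[TextMeta]:
    """Keep only the texts that satisfy every condition."""
    return [t for t in texts if all(cond(t) for cond in conditions)]


# Invented sample records, not real Corpus entries.
SAMPLE = [
    TextMeta("Text A", "female", 1948, "memoir", 12000),
    TextMeta("Text B", "male", 1935, "memoir", 8000),
    TextMeta("Text C", "female", 1972, "novel", 95000),
]

# Memoirs written by women born between 1940 and 1960.
memoirs_by_women_1940_1960 = build_subcorpus(
    SAMPLE,
    lambda t: t.genre == "memoir",
    lambda t: t.author_gender == "female",
    lambda t: 1940 <= t.author_birth_year <= 1960,
)
print([t.title for t in memoirs_by_women_1940_1960])
```

Each condition is an independent predicate, so narrowing the subcorpus further only means passing one more condition; with no conditions the whole sample is returned.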

The structure of meta tagging

The RNC uses a relatively simple system of tags, aimed at the average user rather than the specialist in corpus linguistics. This type of meta tagging is reflected in the interface used for customizing the search output.

The interface unites certain parameters into blocks:

I. Passport

Author: name, gender, year of birth or approximate age
Text title
Date of creation (can be given as an exact or an approximate date, and as after or before a certain date)
Size (word count; can be given as “<not> more than” or “<not> less than”)

II. Block 2 consists of three sub-groups: non-fiction, fiction and drama. The first two sub-groups have different parameter structures.

1. Fiction

Genre (including the “no genre” tag)
Text type (the text's self-identification is widely used for this tag; the list of tags is given in alphabetical order)
Text chronotope (an approximation of the time and place of events described in the text, including the “no chronotope” tag); in particular, the distinguished periods include prehistory, classical antiquity, the Middle Ages, Early Modern period, Russia: 19th century, Russia: pre-1914 20th century, Russia/USSR: First World War, Civil War, 1920's, 1930's, Second World War (1941-1945), post-war period (to 1952), 1950's, 1960's-1980's, perestroika, Russia: post-Soviet period. The chronotope tag is used for fiction instead of the “theme” tag as more informative.

2. Non-fiction

Sphere of functioning (this parameter is primarily used to distinguish linguistic features): day-to-day life, business and official, technical, journalistic, scientific and teaching, theological.
Text type (the text's self-identification is widely used for this tag; the list of tags is given in alphabetical order; includes the “no type” tag)
Text theme (one text can have several themes)
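
The block structure above can be sketched as a small set of record types. This is only one possible reading of the description: all class and field names are hypothetical, dates are kept as plain text, and the drama sub-group is left out because its parameters are not listed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Passport:
    """Block I: who wrote the text, what it is called, when, how long."""
    author_name: str
    author_gender: str
    author_birth_year: Optional[int]   # None when only an approximate age is known
    title: str
    created: str                       # exact or approximate date, kept as text
    word_count: int


@dataclass
class FictionTags:
    """Block II parameters for fiction."""
    genre: str = "no genre"
    text_type: str = ""
    chronotope: str = "no chronotope"


@dataclass
class NonFictionTags:
    """Block II parameters for non-fiction."""
    sphere: str = ""
    text_type: str = "no type"
    themes: List[str] = field(default_factory=list)  # one text can have several themes


@dataclass
class TextRecord:
    passport: Passport
    tags: Union[FictionTags, NonFictionTags]


def describe(record: TextRecord) -> str:
    """Summarize Block II; the available fields depend on the sub-group."""
    if isinstance(record.tags, FictionTags):
        return f"fiction / {record.tags.genre} / {record.tags.chronotope}"
    return f"non-fiction / {record.tags.sphere} / {', '.join(record.tags.themes)}"


# Invented example values; only the sphere name comes from the list above.
passport = Passport("Example Author", "female", 1950, "Example Title", "1995", 20000)
article = TextRecord(passport, NonFictionTags("journalistic", "article", ["politics", "economy"]))
story = TextRecord(passport, FictionTags())
print(describe(article))
print(describe(story))
```

Keeping the fiction and non-fiction parameters in separate types mirrors the point that the two sub-groups do not share one structure: a chronotope only exists for fiction, while sphere and theme only exist for non-fiction.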

In devising the meta tagging system for the RNC, its creators drew on the practice of other major corpora, in particular the British National Corpus. A number of text classification schemes have been proposed; the system used in the RNC is based on the work of J. Sinclair, known as EAGLES. These recommendations were adapted to the specifics of Russian texts by S. A. Sharov and formed the basis of the first system. This system is currently being implemented in the Corpus and is expected to bring the RNC closer to international standards and make it easier for foreign specialists in corpus linguistics to use.