Mark-up and annotation

Each corpus file begins with a header, <teiheader>. The header records the name of the text(s) in the file and the size of the file in words and in bytes, and lists the tags used.

The text itself begins with the tags <text><body>. In every text, at least the headings <head>, paragraphs <p>, and sentences <s> are marked. The rest of the annotation may differ between subcorpora.
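To make the basic layout concrete, here is a minimal Python sketch that reads the three guaranteed element types out of a file. The sample text, the placeholder header content, and the function names are invented for illustration; real corpus files carry a much fuller header and may contain further tags. Regular expressions are used instead of an XML parser because the files are SGML and are not assumed to be well-formed XML.

```python
import re

# Hypothetical sample file; the header content is only a placeholder.
SAMPLE = """<teiheader>(header omitted)</teiheader>
<text><body>
<head>Example heading</head>
<p><s>First sentence.</s> <s>Second sentence.</s></p>
<p><s>Third sentence.</s></p>
</body></text>"""


def elements(doc: str, tag: str) -> list[str]:
    """Return the content of every <tag>...</tag> element, in document order.

    Assumes the elements of this type are not nested inside each other
    and carry no attributes.
    """
    pattern = rf"<{tag}>(.*?)</{tag}>"
    return re.findall(pattern, doc, flags=re.DOTALL)


headings = elements(SAMPLE, "head")
paragraphs = elements(SAMPLE, "p")
sentences = elements(SAMPLE, "s")
```

With this sample, the sketch finds one heading, two paragraphs, and three sentences.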

How do we do it?

Texts that are available electronically (e.g. on the Internet) are the easiest to collect. Among such texts, journalism is the most common genre, but the Internet is also a suitable source for texts of other genres, such as fiction and scientific writing.

When handling large amounts of text, it is important to automate the process as much as possible. When the Estonian Reference Corpus was being compiled, the initial plan was to develop a program that would extract texts from the web, convert them from HTML to TEI (Text Encoding Initiative) format, annotate parts of the texts (headings, paragraphs, and sentences), and check conformance with the SGML (Standard Generalized Markup Language) standard. After that, the texts could be lemmatized and disambiguated with a morphological analyzer. The final result would have been parsed texts that can be searched for lemmas, word forms, and arbitrary strings. Currently, the corpus is parsed but not lemmatized.

However, the texts available on the Internet (especially journalistic texts) turned out to vary tremendously in format. Therefore, no single program is capable of converting them all.
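Even though no single converter fits every source, the annotation step of the planned pipeline (marking headings, paragraphs, and sentences) can be illustrated in isolation. The Python sketch below is not the program used for the corpus. It assumes the heading and paragraphs have already been extracted from the HTML as plain strings, and it uses a deliberately naive sentence splitter that breaks after ".", "!" or "?" followed by whitespace, so abbreviations and similar cases would be split incorrectly. The function names are invented for illustration.

```python
import re
from html import escape


def split_sentences(paragraph: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def annotate(heading: str, paragraphs: list[str]) -> str:
    """Wrap a heading and paragraphs in <head>, <p> and <s> tags."""
    lines = ["<text><body>", f"<head>{escape(heading, quote=False)}</head>"]
    for paragraph in paragraphs:
        marked = " ".join(
            f"<s>{escape(sentence, quote=False)}</s>"
            for sentence in split_sentences(paragraph)
        )
        lines.append(f"<p>{marked}</p>")
    lines.append("</body></text>")
    return "\n".join(lines)


annotated = annotate("Uudised", ["Tere. Kuidas läheb?"])
```

Escaping "&", "<" and ">" in the source text keeps stray characters from being mistaken for markup. A real converter would also need a source-specific extraction step in front of this one, which is exactly where the format differences between websites cause trouble.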

Representativeness and balance are key notions in corpus linguistics. A representative and balanced corpus should include all (or most) text classes that are prominent in a given culture in a given period, and these text classes should be represented in the corpus in proportion to their prominence. In practice, representativeness and balance are becoming less important as corpora keep growing in size. Truly large representative corpora are quite rare; the British National Corpus is one example.

Based on the Estonian Reference Corpus, a smaller, more balanced corpus has been compiled. It is called the Balanced Corpus of Estonian and consists of three subcorpora (journalistic texts, fiction, and scientific texts) of 5 million words each.

Because the Balanced Corpus of Estonian is a part of the Estonian Reference Corpus, a single query can return the same sentence twice. To avoid this, the Balanced Corpus of Estonian can be searched in a separate query.
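If results from both corpora have already been collected, the duplicates can also be removed afterwards. The following Python sketch is one possible approach and not a feature of the corpus search interface. It assumes each result is a plain sentence string, and the function name is invented for illustration. Note that it also collapses identical sentences that genuinely occur in two different texts.

```python
def merge_hits(*result_lists: list[str]) -> list[str]:
    """Merge result lists, keeping only the first occurrence of each sentence."""
    seen: set[str] = set()
    merged: list[str] = []
    for hits in result_lists:
        for sentence in hits:
            if sentence not in seen:
                seen.add(sentence)
                merged.append(sentence)
    return merged


# Hypothetical results from the reference corpus and the balanced corpus.
reference_hits = ["Sentence A.", "Sentence B."]
balanced_hits = ["Sentence B.", "Sentence C."]
unique_hits = merge_hits(reference_hits, balanced_hits)
```

The order of first appearance is preserved, so hits from the first list come before any new hits from the second.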

The Estonian Reference Corpus is no longer the largest Estonian corpus. Today the largest corpus available for the Estonian language is etTenTen, a corpus collected from the Internet and compiled in cooperation between the Institute of Estonian Language and Lexical Computing Ltd. etTenTen can be found on the Keeleveeb website.