Estonian Open Parallel Corpus 2014. Estonian-English


Eesti avatud paralleelkorpus 2014. Eesti-Inglise

This corpus was collected within the framework of the “Estonian Open Parallel Corpus” project. The purpose of the project is to create a significant amount of language resources for improving statistical machine translation systems.

The material collected, organized, aligned and published in the META-SHARE infrastructure comprises:
1) Estonian-English parallel corpus with more than 11 million words,
2) Estonian-Russian parallel corpus with two million words,
3) Estonian-Latvian parallel corpus with 1.5 million words.

The corpus has been applied in the following ways:
1) During the EKT63 project, statistical machine translation systems were trained for the English-Estonian and Estonian-English directions; the translation quality of these systems exceeds that of other general-domain machine translation systems, including Google Translate.
2) Estonian-Latvian parallel texts were collected in cooperation with the Institute of the Estonian Language (EKI). In addition to training machine translation systems, the corpus is used for compiling Estonian-Latvian and Latvian-Estonian dictionaries in a cross-border project of EKI and the Latvian Language Center.
3) The outcomes of the project have been used by several parties (e.g. researchers at the University of Tartu in their terminology work).
4) Taking copyright restrictions into account, the fiction (books) processed during the project has been made available in the DIGAR system of the National Library of Estonia, where it can be read on e-readers.

During the 2012–2014 project period, an Estonian-English parallel corpus of over 11 million words was collected (2.5 million in 2012, 3.75 million in 2013 and 5 million in 2014), along with an Estonian-Russian parallel corpus of approximately 2 million words and an Estonian-Latvian parallel corpus of 1.5 million words. The results have been made available via the META-SHARE infrastructure. The corpus was compiled from texts that had not previously been used to create parallel corpora; it does not overlap with existing parallel corpora such as DGT, JRC-Acquis or the “Estonian-French parallel corpus”.

When selecting sources for the parallel corpora, the focus was on bilingual news flows (e.g. stock exchange news), magazines and fiction, in order to obtain a balanced corpus (fiction, for example, contains significantly more adjectives than legal texts). Because of copyright restrictions, the outcome of the work was arranged and published as full sentences or n-grams, in a form that does not allow the original work to be recovered from the parallel corpus. Within the legal framework, this approach makes it possible to collect the best available training material for machine translation systems: material that is closest to everyday professional language and creates the most value for the end users of those systems.
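As a rough illustration of publishing a corpus in a form from which the original work cannot be read back, the sketch below shuffles aligned sentence pairs and extracts word n-grams. The function names, the whitespace tokenization and the use of random shuffling are assumptions made for this example; the description does not state how the project actually arranged its output.

```python
import random
from typing import List, Tuple

# A sentence pair is (source sentence, target sentence); hypothetical type alias.
SentencePair = Tuple[str, str]


def shuffle_pairs(pairs: List[SentencePair], seed: int = 0) -> List[SentencePair]:
    """Return the aligned pairs in random order (assumed approach).

    Each pair stays intact, so alignment is preserved, but the running
    order of the source document is lost.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)  # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled


def word_ngrams(sentence: str, n: int) -> List[Tuple[str, ...]]:
    """Split a sentence on whitespace and return its word n-grams."""
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Shuffling keeps every sentence pair usable for training while discarding document order; n-grams go further by breaking up the sentences themselves.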

In general, the methodology of collection consists of the following activities.
Activity 1: Collecting the material for parallel corpus.
Activity 2: Removing text formatting, scanning paper books, OCR, error correction.
Activity 3: Sentence-level alignment, quality checks.
Activity 4: Disseminating outcomes. The collected corpora are made available for machine translation systems. The owners of the systems are asked for feedback on the quality of the corpora and its effect on translation quality.
Each iteration ends with an evaluation of the iteration, planning of the next iterations and corrections to the methods.
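The quality checks in Activity 3 are not specified further. One common, simple check for sentence-aligned data is a length-ratio filter, sketched below under that assumption; the function names and the threshold of 2.5 are hypothetical and not taken from the project.

```python
from typing import List, Tuple


def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.5) -> bool:
    """Accept a pair if neither side is empty and the word counts are comparable.

    max_ratio is an illustrative threshold, not a project setting.
    """
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


def filter_pairs(pairs: List[Tuple[str, str]], max_ratio: float = 2.5) -> List[Tuple[str, str]]:
    """Keep only the aligned pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if length_ratio_ok(s, t, max_ratio)]
```

A pair whose two sides differ greatly in length is often a misalignment, so a filter of this kind can flag candidates for manual review before the corpus is released.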