Monolingual versus multilingual corpora

Many corpora are monolingual – they contain data in only one language.
There are also multilingual corpora, which fall into two types: comparable and parallel.

Comparable corpora

A comparable corpus contains components in two or more languages
that have been collected using the same sampling method: for example, the same proportions
of texts from the same genres and domains, gathered over the same sampling period
in each language.
The subcorpora of a comparable corpus are not translations of each other.
Rather, their comparability lies in the similarity of their sampling frames.
For example, the Lancaster Corpus of Mandarin Chinese (McEnery et al. 2003)
was built using the sampling frame of the LOB corpus, which makes the two corpora comparable.
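
One rough way to check that two subcorpora follow the same sampling frame is to compare the proportion of texts in each genre. The sketch below is only an illustration: the function names and the genre labels are invented for this example, and it assumes each subcorpus is given as a non-empty list with one genre label per text.

```python
from collections import Counter

def genre_proportions(genres):
    """Return the share of texts in each genre (input must be non-empty)."""
    counts = Counter(genres)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}

def max_proportion_gap(genres_a, genres_b):
    """Largest absolute difference in genre share between two subcorpora."""
    props_a = genre_proportions(genres_a)
    props_b = genre_proportions(genres_b)
    all_genres = set(props_a) | set(props_b)
    return max(abs(props_a.get(g, 0.0) - props_b.get(g, 0.0)) for g in all_genres)

# Hypothetical genre labels, one per text; not taken from any real corpus.
english = ["press", "press", "fiction", "fiction"]
chinese = ["press", "fiction"]
skewed = ["press", "press", "press", "fiction"]

gap_matched = max_proportion_gap(english, chinese)
gap_skewed = max_proportion_gap(english, skewed)
```

A gap of zero means the two subcorpora have identical genre proportions, even if they differ in size. The larger the gap, the further the two are from sharing a sampling frame.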

Parallel corpora

By contrast, a parallel corpus contains native-language
(L1) source texts and their (L2) translations.
In this case, the sampling frame is automatically the same for all the languages in
the corpus. Examples include the Canadian Hansard corpus
(Brown et al. 1991) and the CRATER corpus (McEnery and Oakes 1995).

For a parallel corpus to be useful, the source texts and their translations
must be aligned: the correspondences between the two are annotated
at the sentence or word level (see Oakes and McEnery 2000
for an overview). Automatic alignment of parallel corpora is possible for some language pairs,
but for others it remains a considerable challenge.
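
To give a feel for what automatic sentence alignment involves, here is a much-simplified sketch based on sentence length. It assumes that a sentence and its translation tend to have similar lengths in characters, and uses dynamic programming to find the cheapest sequence of alignment "beads" (one-to-one, two-to-one, one-to-two, or an unmatched sentence). The penalty values and the function name are assumptions made for illustration. Real aligners use statistically motivated cost functions, and this is not the method used for any corpus named above.

```python
from typing import List, Tuple

# Allowed bead shapes: (source sentences, target sentences) -> penalty.
# The penalties are illustrative guesses, not tuned values.
BEADS = {(1, 1): 0, (2, 1): 10, (1, 2): 10, (1, 0): 50, (0, 1): 50}

def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Align two sentence lists by minimising character-length mismatch."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                new_cost = cost[i][j] + abs(src_len - tgt_len) + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads

# Invented example sentences (not drawn from a real corpus).
english = ["It is late.", "We must vote now.", "Thank you."]
french = ["Il est tard, nous devons donc voter maintenant.", "Merci."]
beads = align(english, french)
```

With these penalties, the first two English sentences are grouped against the single long French sentence, and "Thank you." is paired with "Merci.". Length alone is a crude signal, which is one reason alignment is harder for some language pairs than for others.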

Trying it out!

The best way to see the benefits of an aligned corpus is to run some searches in one.
The EUROPARL corpus, which contains European Union documents in
English, German, French, Italian, Spanish and Dutch, is aligned throughout at the sentence
level. An online interface to this corpus is available. Before running a search, make sure to
click on the Simple Query option – otherwise the website will expect you to use a
much more complex query language called CQP.

When you run a search on EUROPARL, you will see that every concordance line is followed by
a table of the equivalent sentences in all the other languages. This is very useful if, for example,
you are interested in a particular word or concept and want to find out whether it is always
translated in the same way.
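
If you copy a set of aligned sentence pairs out of such a search, the same question can be asked with a few lines of code. The sketch below counts how often each of several candidate translations appears in the sentences aligned with a given source word. The function name, the candidate list and the example sentences are all invented for illustration and are not EUROPARL data. Matching is on whole lower-cased word forms only, so inflected variants would need to be listed separately.

```python
import re
from collections import Counter

def translation_counts(pairs, source_word, candidates):
    """Count candidate translations in sentences aligned with source_word."""
    counts = Counter()
    for src, tgt in pairs:
        src_tokens = set(re.findall(r"\w+", src.lower()))
        if source_word.lower() not in src_tokens:
            continue  # the source word does not occur in this sentence
        tgt_tokens = set(re.findall(r"\w+", tgt.lower()))
        matched = [c for c in candidates if c.lower() in tgt_tokens]
        if matched:
            counts.update(matched)
        else:
            counts["(other)"] += 1  # none of the listed candidates was used
    return counts

# Made-up English-German sentence pairs for demonstration only.
pairs = [
    ("We need a fair policy.", "Wir brauchen eine faire Politik."),
    ("The policy was adopted.", "Die Politik wurde angenommen."),
    ("This policy is new.", "Diese Strategie ist neu."),
    ("I thank the President.", "Ich danke dem Präsidenten."),
]
result = translation_counts(pairs, "policy", ["Politik", "Strategie"])
```

Here `result` records two sentences using "Politik" and one using "Strategie", so "policy" is not always rendered the same way in this toy sample.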