Use of corpora in translation studies

The Centre for Translation Studies at the University of Leeds develops and hosts a range of large representative corpora in a variety of languages (including English, Arabic, Chinese, French, German, Italian, Japanese, Spanish, Polish and Russian). Some corpora are available in-house only because of copyright restrictions, while others can be accessed freely. The list of all corpora is available from a separate page.

IntelliText

IntelliText is a recent project funded by the AHRC. It produced a versatile and intuitive interface that offers a simple step-by-step approach to performing a corpus search. First-time and inexperienced corpus users can use the IntelliText Search Builder and Part-of-Speech Editor to build multi-word phrases and add grammatical information to their corpus queries, without having to enter complex string codes. Users may choose from seven search options, including:

Concordance search [all languages]

Collocation search [all languages]

Affix search [all languages]

Comparison of the frequency of two or more competing words or phrases [all languages]

The comparable corpus of English and Russian news texts

The website was originally created to provide a query interface to the comparable corpus of English and Russian news texts.

The description of the corpus content

The English corpus is based on a subset of the
corpus of Reuters news, a collection of newswires from Reuters
covering one year, from 1996-08-20 to 1997-08-19. You can search through
the subset of the corpus consisting of texts annotated with general topic codes
(prefixed with 'G' in the Reuters classification). This includes
newswire texts concerning political events (GPOL), crime (GCRI),
entertainment (GENT), etc., but excludes news from markets, unless they
were explicitly annotated with general topic codes by the Reuters corpus
developers. The corpus has been POS tagged and lemmatised using Helmut Schmid's
TreeTagger.
There is some redundancy in the Reuters corpus. Some articles
(my rough estimate is about 10-15%) reuse much of their content from
other articles. This results in identical or almost identical lines in
the output concordance. Take this into account when analysing
results.
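One way to take repeated lines into account is to collapse them before counting. The following is a minimal sketch, assuming the concordance lines have been saved as plain strings; the function names and the sample lines are hypothetical and are not part of the query interface. It only catches lines that become identical once case and punctuation are ignored, so articles that reuse content with small rewordings will still appear more than once.

```python
import re


def normalise(line: str) -> str:
    """Lower-case a concordance line and keep only its word tokens."""
    return " ".join(re.findall(r"\w+", line.lower()))


def drop_repeated_lines(lines):
    """Keep the first occurrence of each line, comparing normalised forms."""
    seen = set()
    unique = []
    for line in lines:
        key = normalise(line)
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# Hypothetical concordance lines, invented for illustration only.
sample = [
    "the prime minister said on Tuesday that talks would resume",
    "The prime minister said on Tuesday that talks would resume.",
    "the finance minister said on Monday that talks had stalled",
]

unique_lines = drop_repeated_lines(sample)
print(len(sample), "lines,", len(unique_lines), "after removing repeats")
```

Comparing the number of lines before and after this step gives a rough indication of how much a particular result set is inflated by reused content.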

The Russian corpus is based on articles from
Izvestia, a national broadsheet newspaper, and covers the period
from 2000 to 2001. The corpus was POS tagged and lemmatised
using mystem.

The language of Russian newspapers can be compared against the first
version of the Russian Reference Corpus, which consists of about 50
million words and represents a variety of genres in Russian. The Russian
Reference Corpus was also used as the basis for developing the frequency
dictionary of modern Russian. Its description and information for download
are available from a separate page.

The size of the corpora is summarised in the following table:

Corpus                      Size (in words)
Reuters subset              83,491,119
Izvestia                    14,564,884
Russian Reference Corpus    50,512,584

The interface allows you to compare word usage between English and Russian, as well as across two registers
in Russian (the language of newspapers vs. the language of fiction). Because the corpora differ in size,
the first line of the output shows the relative frequency of
your search term in the corpus you have selected, expressed as the
number of occurrences of the term per million words.
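The per-million figure can be reproduced by hand from a raw count and the corpus sizes listed above. The sketch below illustrates the arithmetic only and is not the code used by the interface; the function name and the raw counts are hypothetical.

```python
# Corpus sizes in words, as given in the table on this page.
CORPUS_SIZES = {
    "Reuters subset": 83_491_119,
    "Izvestia": 14_564_884,
    "Russian Reference Corpus": 50_512_584,
}


def per_million(raw_count: int, corpus: str) -> float:
    """Convert a raw number of occurrences into occurrences per million words."""
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical raw counts for one search term, invented for illustration.
hypothetical_counts = {
    "Izvestia": 1_000,
    "Russian Reference Corpus": 1_000,
}

for corpus, count in hypothetical_counts.items():
    print(f"{corpus}: {per_million(count, corpus):.1f} per million words")
```

The same raw count yields a much higher relative frequency in the smaller Izvestia corpus than in the Russian Reference Corpus, which is why the normalised figure, rather than the raw count, should be used when comparing corpora.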