1. Introduction

CzEng 0.5 (http://ufal.mff.cuni.cz/czeng/) is a Czech-English parallel corpus compiled at the Institute
of Formal and Applied Linguistics, Charles University, Prague in
2005-2006. The corpus contains no manual annotation. It
is limited to texts that were already available in
electronic form and are not protected by authors' rights in
the Czech Republic. The main purpose of the corpus is to support
Czech-English and English-Czech machine translation research with the
necessary data. CzEng 0.5 is available free of charge for educational and
research purposes; users should, however, first read
the license agreement (http://ufal.mff.cuni.cz/czeng/license.html).

CzEng 0.5 consists of a large set of parallel textual documents mainly from the fields
of European law, information technology, and fiction, all of them converted
into a uniform XML-based file format and provided with automatic
sentence alignment. In total, the corpus contains 7,743 document
pairs. Full details on the corpus size are given in the table below.

2. Download

3. Sources of Parallel Texts

We have used texts from the following publicly available sources:

Acquis Communautaire Parallel Corpus (prefix celex) available at
http://wt.jrc.it/lt/Acquis/. It
contains a huge body of EU legislative texts written between the 1950s
and 2005 (CzEng uses only two of the 20 languages covered by the Acquis
Communautaire Corpus).

Corpus OPUS available at http://logos.uio.no/opus/. It
is an open source collection of freely available corpora; two of them
are used in CzEng:

E-books (prefix books) freely available on the
Internet both in English and Czech (especially at http://www.gutenberg.org
and http://www.palmknihy.cz),
namely:
Jack London: The Star Rover / Tulák po hvězdách,
Franz Kafka: Trial / Proces,
E.A. Poe: The Narrative of Arthur Gordon Pym of Nantucket / Dobrodružství A.G.Pyma,
E.A. Poe: A Descent into the Maelstrom / Pád do Malströmu,
Jerome K. Jerome: Three Men in a Boat / Tři muži ve člunu.

The quantitative properties of the individual sources (after
performing the necessary preprocessing, as described in the next
section) are summarized in the following table:

For each source, the first row gives absolute counts and the second row the share of the corpus total.

Source                         Document   Sentences              Words+Punctuation
                               pairs      Czech      English     Czech       English

Total                          7,743      1,418,721  1,295,647   18,517,624  20,994,274
                               100.0%     100.0%     100.0%      100.0%      100.0%

Acquis Communautaire           6,272      1,101,610  930,626     14,619,572  16,079,043
                               81.0%      77.6%      71.8%       78.9%       76.6%

European Constitution          47         11,506     10,380      138,853     176,096
                               0.6%       0.8%       0.8%        0.7%        0.8%

Samples from European Journal  8          5,777      4,993       104,560     133,136
                               0.1%       0.4%       0.4%        0.6%        0.6%

Readers' Digest                927        121,203    128,305     1,794,827   2,234,047
                               12.0%      8.5%       9.9%        9.7%        10.6%

Kačenka                        5          62,696     69,951      1,034,642   1,188,029
                               0.1%       4.4%       5.4%        5.6%        5.7%

E-Books                        5          17,140     17,495      330,118     399,607
                               0.1%       1.2%       1.4%        1.8%        1.9%

KDE                            479        98,789     133,897     495,052     784,316
                               6.2%       7.0%       10.3%       2.7%        3.7%
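The percentages in the table can be reproduced from the absolute counts. As a quick consistency check, the following sketch recomputes the document-pair shares. The dictionary only restates the counts given in the table; the variable names are our own and are not part of CzEng.

```python
# Document-pair counts per source, copied from the table above.
doc_pairs = {
    "Acquis Communautaire": 6272,
    "European Constitution": 47,
    "Samples from European Journal": 8,
    "Readers' Digest": 927,
    "Kačenka": 5,
    "E-Books": 5,
    "KDE": 479,
}

# The per-source counts should add up to the reported total of 7,743.
total = sum(doc_pairs.values())

# Share of each source in the whole corpus, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in doc_pairs.items()}

for name, share in shares.items():
    print(f"{name:30s} {doc_pairs[name]:>6,d} {share:>5.1f}%")
```

The same calculation applies to the sentence and word columns by swapping in the corresponding counts.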

4. Text Preprocessing

Since the individual sources of parallel texts differ in many aspects,
a lot of effort was required to integrate them into a common
framework. Depending on the type of the input resource, some or all of the following
steps were applied to the Czech and English documents:

removing long text segments having no counterpart in the corresponding document,

adding sentence and token identifiers,

conversion to a common XML format.
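The last two steps can be illustrated with a minimal sketch. It assumes the text is already segmented into sentences and tokens, and it uses hypothetical element names (`doc`, `s`, `tok`) and a hypothetical identifier scheme; the actual CzEng file format is not reproduced here and may differ.

```python
import xml.etree.ElementTree as ET


def to_xml(doc_id, sentences):
    """Wrap tokenized sentences in XML and attach sentence/token identifiers.

    `sentences` is a list of sentences, each a list of token strings.
    Element names and the id scheme are illustrative assumptions only.
    """
    doc = ET.Element("doc", id=doc_id)
    for s_idx, tokens in enumerate(sentences, start=1):
        s_id = f"{doc_id}-s{s_idx}"
        sent = ET.SubElement(doc, "s", id=s_id)
        for t_idx, token in enumerate(tokens, start=1):
            tok = ET.SubElement(sent, "tok", id=f"{s_id}-t{t_idx}")
            tok.text = token
    return doc


# "books-1" is a made-up document name using the "books" prefix.
root = to_xml("books-1", [["Three", "men", "."], ["A", "boat", "."]])
print(ET.tostring(root, encoding="unicode"))
```

Giving every sentence and token a stable identifier is what lets an alignment refer to units in both the Czech and the English document without copying the text itself.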

5. Known Limitations of Preprocessing

The tokenization and segmentation rules were kept as simple as possible:

a different character class (digit, letter, punctuation) always starts a new
token,

adjacent punctuation characters are encoded as separate tokens.

This decision leads to some unpleasant differences in tokenization and segmentation compared to the "common standard" of Penn-Treebank-like annotation.
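The two rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool used to build CzEng: the function names are our own, and the treatment of whitespace as a plain separator is an assumption that the rules do not spell out.

```python
def char_class(ch):
    """Classify a character as digit, letter, whitespace, or punctuation."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    if ch.isspace():
        return "space"
    return "punct"


def tokenize(text):
    """Split text so that every change of character class starts a new token
    and every punctuation character forms a token of its own."""
    tokens = []
    current, current_class = "", None
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            # Assumption: whitespace only separates tokens and is dropped.
            if current:
                tokens.append(current)
            current, current_class = "", None
        elif cls == "punct" or cls != current_class:
            # A class change, or any punctuation character, opens a new token.
            if current:
                tokens.append(current)
            current, current_class = ch, cls
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize("Dr. O. Bojar's 2nd try..."))
# ['Dr', '.', 'O', '.', 'Bojar', "'", 's', '2', 'nd', 'try', '.', '.', '.']
```

The output shows where such rules diverge from Penn-Treebank-like tokenization: the possessive is broken into three tokens instead of being split off as `'s`, the ordinal `2nd` is cut in two, and the ellipsis becomes three separate periods.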

Abbreviations were not detected. This hurts especially with titles (Dr.) and abbreviated names (O. Bojar), because a period followed by an upper-case letter is treated as a sentence boundary. All such expressions are thus split into several sentences.
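The effect of the missing abbreviation handling can be seen in a minimal sketch of the boundary rule. It assumes the input is already a list of tokens and implements only the rule mentioned above (a period followed by an upper-case letter); the function name is our own, and any other boundary rules that may have been applied are left out.

```python
def split_sentences(tokens):
    """Close a sentence after every '.' that is followed by a token
    starting with an upper-case letter. No abbreviation list is consulted."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if token == "." and following is not None and following[0].isupper():
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = ["Dr", ".", "O", ".", "Bojar", "arrived", "."]
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
# Dr .
# O .
# Bojar arrived .
```

A single sentence is cut into three here, which is exactly the kind of over-segmentation described above for titles and abbreviated names.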

10. License

By using CzEng 0.5, the user agrees to be bound by the license
agreement. In brief, the license

follows the restrictions specified in the individual
licenses of the sources of parallel texts,

allows the user to use the data only for non-commercial research or
educational purposes,

allows the user to extract statistical information from the texts
and/or to make short citations,

requires the user to make a reference
to CzEng in any published work that uses the CzEng data.

11. Disclaimer

The user of CzEng 0.5 should be aware of the following properties:

CzEng is not claimed to be a balanced corpus (whatever that means).

CzEng does not indicate which text is the original and which is the
translation (English is usually the original language; in some cases,
however, both the English and the Czech texts
are translations from a third language).

The quality of the data (including grammatical correctness,
translation accuracy, alignment quality, etc.) is not guaranteed and
can in fact vary widely, depending especially on the type of the
input resource.

CzEng does not contain all the information present in the input resources,
and thus they cannot be reconstructed from CzEng. Some text segments as
well as parts of the original annotation might be missing (for
instance, all the resources have been (re-)segmented and (re-)aligned).