The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Data

IIIT-TIDES

Daniel Pipes

EMILLE

Agrocorpus

Shabdanjali

Wikipedia Named Entities

Tools and their settings

Tokenization and normalization of the data

Hunalign

GIZA++

makecls

SRILM

Moses

Joshua

Mumbai Tagger

Affisix

Hindomor

HiTBSuf

BLEU score counter

Link tables from the paper to concrete settings

Link to the PDF version of the paper; link to Biblio?

Data

IIIT-TIDES

A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy has the same number of sentences and tokens as listed below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with the UTF-8 version.

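| Part | Sentences | Tokens en | Bytes en | Tokens hi | Bytes hi |
|-------|-----------|-----------|-----------|-----------|------------|
| train | 50,000 | 1,195,436 | 6,496,995 | 1,287,174 | 15,917,598 |
| dev | 1,000 | 21,842 | 118,239 | 23,851 | 291,147 |
| test | 1,000 | 26,537 | 145,376 | 27,979 | 348,221 |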

The test data contain 1 reference translation per sentence.
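The check against the counts above can be scripted. The following is a minimal sketch, not the procedure we used: it assumes one sentence per line and whitespace-separated tokens (roughly what `wc -l`, `wc -w` and `wc -c` report), and the names `corpus_stats`, `check_part` and the file paths passed to them are hypothetical.

```python
# Counts from the table above:
# part -> (sentences, tokens en, bytes en, tokens hi, bytes hi)
EXPECTED = {
    "train": (50000, 1195436, 6496995, 1287174, 15917598),
    "dev": (1000, 21842, 118239, 23851, 291147),
    "test": (1000, 26537, 145376, 27979, 348221),
}


def corpus_stats(path):
    """Return (sentences, tokens, bytes) for a one-sentence-per-line file.

    Assumption: tokens are separated by ASCII whitespace and the byte count
    includes the newline characters.
    """
    sentences = tokens = size = 0
    with open(path, "rb") as handle:
        for raw in handle:
            sentences += 1
            tokens += len(raw.split())  # bytes.split() splits on ASCII whitespace
            size += len(raw)
    return sentences, tokens, size


def check_part(part, en_path, hi_path):
    """Return True if both sides of a part match the counts in EXPECTED."""
    sent_en, tok_en, bytes_en = corpus_stats(en_path)
    sent_hi, tok_hi, bytes_hi = corpus_stats(hi_path)
    found = (sent_en, tok_en, bytes_en, tok_hi, bytes_hi)
    return sent_en == sent_hi and found == EXPECTED[part]
```

For example, `check_part("dev", "dev.en", "dev.hi")` would return True only if both files have 1,000 lines and the token and byte counts agree with the table; the two file names are placeholders for wherever your copy is stored. If the byte counts differ while the token counts agree, a different line-ending convention is one possible cause.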

Our preprocessing of the data included the following steps:

Further tokenization. Although the Tides data as we received it is roughly tokenized, some tokens (such as “anglo-american”) remained that we wished to split into smaller tokens.
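As an illustration of this kind of splitting, the sketch below separates a hyphen that stands between two word characters. It is only an assumption about what such a rule could look like: the actual rules we applied are not reproduced here, the name `split_hyphenated` is hypothetical, and the sketch has only been considered for the English example from the text.

```python
import re

# Assumption: a hyphen directly between two word characters is split off
# as a separate token; other hyphens and dashes are left untouched.
INNER_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(line):
    """Insert spaces around word-internal hyphens in an already tokenized line."""
    return INNER_HYPHEN.sub(" - ", line)
```

With this rule, `split_hyphenated("anglo-american relations")` gives `"anglo - american relations"`. Whether the hyphen is kept as its own token or dropped is a design choice that this sketch does not settle.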