Sebastian Padó

Cross-lingual projection of semantic roles

In two studies (Padó and Lapata, EMNLP 2005 and ACL/COLING
2006), we have proposed the use of annotation projection for
the task of creating corpora with role-semantic annotation for new
languages. To evaluate this approach, we have annotated a
1000-sentence sample from the English-German EUROPARL bitext, which is
now available for download.

Our sample selection procedure was informed by two existing resources,
FrameNet for English and SALSA for German. Inter-annotator agreement
(on an additional, but comparable, calibration set) was 0.87. Details
can be found in Padó and Lapata (EMNLP 2005).

This page makes available:

Corpora with manual semantic role annotation on automatic syntactic analyses (for German, from Amit Dubey's Sleepy parser, and for English, from Michael Collins' parser), as used in the EMNLP and ACL/COLING papers

Corpora with manual semantic role annotation on hand-corrected syntactic analyses (for German, according to the TIGER guidelines, and for English, according to the Penn Treebank guidelines)

The GIZA++ word alignment and the manual gold alignment (produced according to the BLINKER guidelines), which are compared in Padó and Lapata (ACL/COLING 2006).

Download

The corpus and the word alignments are freely available for research
purposes. However, we'd be grateful to hear from you if you use this
corpus in your research.

The corpora use the SALSA/TIGER XML format, which can be visualised
directly using the SALTO tool.

The alignments are stored in a simple text file format with four lines
per sentence: the ID (corresponding to the sentence ID in the XML
file), the two sentences as sequences of tokens (one sentence per
line), and the word alignment as pairs of token indices.
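
As a rough illustration of the four-line alignment format, the following Python sketch reads such a file into records. Several details are assumptions rather than documented facts: that tokens are separated by whitespace, that each alignment pair is written as two integers joined by a hyphen (e.g. "0-0"), and that records follow one another without blank separator lines. The names AlignedSentence and parse_alignments are hypothetical, and the sample text is invented for illustration. Adjust the pair_sep parameter and the splitting logic to match the actual files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedSentence:
    sentence_id: str                   # line 1: ID matching the XML sentence ID
    first_tokens: List[str]            # line 2: first sentence as tokens
    second_tokens: List[str]           # line 3: second sentence as tokens
    alignment: List[Tuple[int, int]]   # line 4: pairs of token indices


def parse_alignments(text: str, pair_sep: str = "-") -> List[AlignedSentence]:
    """Parse records of exactly four lines each.

    Assumption: no blank lines between records, and index pairs look like
    "i<pair_sep>j" separated by whitespace. Neither is confirmed by the
    format description.
    """
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    records = []
    for start in range(0, len(lines), 4):
        sid, first, second, align = lines[start:start + 4]
        pairs = []
        for item in align.split():
            left, right = item.split(pair_sep)
            pairs.append((int(left), int(right)))
        records.append(
            AlignedSentence(sid.strip(), first.split(), second.split(), pairs)
        )
    return records


# Invented sample data, not taken from the corpus.
sample = "s1\nthe house\ndas Haus\n0-0 1-1\n"
records = parse_alignments(sample)
```

Keeping the index pairs as plain integer tuples leaves open whether the indices are zero-based or one-based; check this against the token sequences before using the pairs to look up words.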