Data

Sentences were selected at random, and ill-formed or short sentences were removed prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were performed automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Human annotators manually annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in CoNLL-X format, encoded in UTF-8. Each line presents the information for one word, and an empty line indicates the end of a sentence. Each line contains 10 tab-separated columns.
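
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not a tool distributed with the data. The function name read_conllx and the column names are assumptions: the names follow the usual CoNLL-X convention (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), whereas this description only states that each line has 10 columns. All values are kept as strings.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed column names, taken from the usual CoNLL-X convention.
# The corpus description itself only says there are 10 columns.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line marks the end of a sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(COLUMNS)} columns, "
                f"found {len(fields)}"
            )
        sentence.append(dict(zip(COLUMNS, fields)))
    # Emit the last sentence if the input lacks a trailing empty line.
    if sentence:
        yield sentence
```

A file would typically be opened with open(path, encoding="utf-8") and passed directly to read_conllx, since a text file object iterates over its lines. Because POS tags were not manually corrected, a consumer may wish to treat the tag columns as less reliable than the segmentation and syntactic columns.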