Introduction

LDC's CALLHOME Mandarin Chinese collection includes telephone speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese Speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. All calls, which lasted up to thirty minutes, originated in North America and were placed to locations overseas; most participants called family members or close friends. CALLHOME Mandarin Chinese Transcripts covers a contiguous five- or ten-minute segment from each of the telephone speech files. The transcripts are in tab-delimited format with GB2312 encoding, are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. CALLHOME Mandarin Chinese Lexicon comprises over 40,000 words from twenty CALLHOME Mandarin transcripts.
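
As a rough illustration of working with the transcript format described above, the following Python sketch decodes GB2312 bytes and splits tab-delimited speaker turns. The column order (start time, end time, speaker, text), the sample line and the names Turn and parse_turns are assumptions made for this example; they are not taken from the corpus documentation and should be adjusted to the actual files.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    start: float   # turn start time in seconds (assumed column 1)
    end: float     # turn end time in seconds (assumed column 2)
    speaker: str   # speaker label (assumed column 3)
    text: str      # transcript text (assumed column 4)


def parse_turns(raw: bytes, encoding: str = "gb2312") -> List[Turn]:
    """Decode raw transcript bytes and split each tab-delimited line into a Turn.

    The four-column layout is hypothetical; adapt it to the real files.
    """
    turns = []
    for line in raw.decode(encoding).splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        start, end, speaker, text = fields
        try:
            turns.append(Turn(float(start), float(end), speaker.strip(), text.strip()))
        except ValueError:
            continue  # skip lines whose timestamps are not numeric
    return turns


# Round-trip a made-up sample line through GB2312 to show the decoding step.
sample = "0.00\t2.35\tA\t你好\n".encode("gb2312")
turns = parse_turns(sample)
utf8_text = turns[0].text.encode("utf-8")  # re-encode the text as UTF-8
```

Decoding once at the file boundary and working with Python strings afterwards keeps the GB2312 to UTF-8 conversion in a single place.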

CALLHOME Mandarin Chinese Transcripts - XML Version, the latest addition to this collection, presents the entire original corpus of 120 transcripts in XML format with UTF-8 encoding, retokenization and part-of-speech (POS) tagging. The retokenization and POS information were supplied using the Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown word recognition into a single theoretical framework using multi-layered hierarchical hidden Markov models.
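
To give a feel for how a hidden Markov model assigns tags, the sketch below runs Viterbi decoding over a plain first-order HMM. This is a deliberately simplified stand-in, not the hierarchical model used by ICTCLAS; the tag labels (PRON, VERB, PART) and all probabilities are invented for illustration only.

```python
from typing import Dict, List


def viterbi(
    tokens: List[str],
    tags: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    floor: float = 1e-6,
) -> List[str]:
    """Return the most probable tag sequence for tokens under a first-order HMM."""
    if not tokens:
        return []
    # best[i][t]: probability of the best tag path that ends with tag t at position i
    best = [{t: start_p[t] * emit_p[t].get(tokens[0], floor) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            # pick the predecessor tag that maximizes the path probability into t
            prev = max(tags, key=lambda p: best[i - 1][p] * trans_p[p][t])
            emission = emit_p[t].get(tokens[i], floor)  # floor for unseen words
            best[i][t] = best[i - 1][prev] * trans_p[prev][t] * emission
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with invented numbers (not derived from any corpus).
TAGS = ["PRON", "VERB", "PART"]
START = {"PRON": 0.6, "VERB": 0.3, "PART": 0.1}
TRANS = {
    "PRON": {"PRON": 0.1, "VERB": 0.8, "PART": 0.1},
    "VERB": {"PRON": 0.3, "VERB": 0.2, "PART": 0.5},
    "PART": {"PRON": 0.4, "VERB": 0.4, "PART": 0.2},
}
EMIT = {"PRON": {"我": 0.9}, "VERB": {"吃": 0.8}, "PART": {"了": 0.9}}

tagged = viterbi(["我", "吃", "了"], TAGS, START, TRANS, EMIT)
```

With these toy numbers the decoder tags the three tokens as PRON, VERB and PART. A production system would estimate the probabilities from annotated data and work in log space to avoid numerical underflow on long inputs.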

In addition to the original applications for Mandarin Chinese CALLHOME data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML Version will be useful in the grammatical study of spoken Mandarin.

Data

This XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken features and proper nouns) from the original transcripts release, but the mnemonics used in the original release were migrated into XML markup following the mapping rules described below:

All analyses in the original release were retained at some cost to tokenization and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features may split a word, which can affect the tagging accuracy). However, the results of the automated processing were substantially post-edited. For example, four aspect markers in Chinese (-le, -guo, -zhe and zai) were disambiguated and corrected by hand, and all of the classifiers (also called "measure words") were re-tagged using a more fine-grained annotation scheme developed in a Lancaster University project. In addition, a large number of obvious typographical errors in the original release were corrected in the process of post-editing.