Thesis Proposal

David McClosky

Tuesday, April 14, 2009 at 4:00 P.M.

Room 368 (CIT 3rd floor)

Current efforts in syntactic parsing are largely data-driven. These methods require labeled examples of syntactic structures and attempt to learn statistical patterns governing those structures. Labeled data typically requires expert annotators, which makes it both time-consuming and costly to produce. Furthermore, once training data has been created for one textual domain, it has only limited portability to similar domains. Since a major goal of syntactic parsing is to capture syntactic patterns across an entire language rather than just those in a specific domain, this domain dependence has inspired a large body of work.

The simplest approach is to assume that the target domain is essentially the same as the source domain; no additional knowledge about the target domain is included. Naturally, this works best when the target domain is the same as, or relatively close to, the source. A more realistic approach assumes that we have only raw text from the target domain. This assumption lends itself well to semi-supervised learning methods, since such methods utilize both labeled and unlabeled examples.

This proposal focuses on a specific family of semi-supervised methods which we refer to as self-training. Self-training allows one to turn an existing supervised learner into a semi-supervised learner with minimal effort. We first show results on self-training for syntactic constituency parsing within a single domain. While self-training has failed for this task in the past, we present a simple modification which allows it to succeed, producing state-of-the-art results on English constituency parsing. Next, we show how self-training is beneficial when parsing across domains and helps further when raw text is available from the target domain. One remaining issue is that one must choose a training corpus appropriate for the target domain, or performance may be severely impaired. Humans can do this on a small scale, but the strategy becomes less practical as data sets grow. We plan to investigate methods for automatically detecting useful source domains and adjusting our model to incorporate them effectively. As a result, we aim to build a fully automatic syntactic constituency parser which produces high-quality parses for all types of text, regardless of domain.
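The self-training recipe described above can be sketched as follows. This is a minimal, generic illustration, not the proposal's actual parser or training pipeline: the `MajorityModel` class and all function names are hypothetical stand-ins for a supervised learner with `fit` and `predict` methods.

```python
class MajorityModel:
    """Toy stand-in for a supervised learner (e.g. a statistical parser).

    This class is purely illustrative; a real self-training setup would
    plug in the actual parser here.
    """

    def fit(self, examples):
        # examples: list of (input, label) pairs
        labels = [label for _, label in examples]
        self.majority = max(set(labels), key=labels.count)

    def predict(self, x):
        return self.majority


def self_train(model, labeled, unlabeled, rounds=1):
    """Generic self-training loop.

    Train on the labeled data, use the resulting model to label the raw
    (unlabeled) examples, then retrain on the union of gold and
    automatically labeled data.
    """
    training_set = list(labeled)
    for _ in range(rounds):
        model.fit(training_set)
        auto_labeled = [(x, model.predict(x)) for x in unlabeled]
        training_set = list(labeled) + auto_labeled
    model.fit(training_set)
    return model
```

The key design point is that the supervised learner is used unchanged: self-training wraps it from the outside, which is why it converts a supervised system into a semi-supervised one with minimal effort.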