Models for Natural Language Learning using Unlabeled Data

Here is a lightly-annotated bibliography of papers on
learning from labeled and unlabeled data. It
focuses on methods especially relevant to bootstrap learning for
natural language analysis, and on theoretical models for how and when
we should expect unlabeled data to be helpful.


Natural language bootstrap learning:

Yarowsky wrote an early paper describing how to learn to disambiguate
word senses from mostly unlabeled text. It relies on the assumption that
every occurrence of a word (e.g., "bank") within a single document has
the same sense (e.g., river bank or financial bank). Abney's paper is a
more recent formal analysis of why Yarowsky's algorithm works.
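For concreteness, here is a minimal self-training sketch in Python in the
spirit of Yarowsky's bootstrapping, not his exact algorithm: it omits his
decision-list learner and the one-sense-per-discourse step, and the seed
words, scoring, and threshold are illustrative assumptions. Starting from a
few seed collocations per sense, it repeatedly labels the contexts it can
classify confidently and recounts.

```python
# A minimal self-training sketch in the spirit of Yarowsky's word-sense
# bootstrapping (illustrative; the decision-list learner and the
# one-sense-per-discourse constraint of the actual algorithm are omitted).
from collections import Counter

def bootstrap_senses(contexts, seeds, rounds=10, margin=2.0):
    """contexts: list of lists of words surrounding the ambiguous word.
    seeds: dict mapping a few seed collocates to a sense, e.g.
           {"river": "WATER", "money": "FINANCE"} (illustrative seeds)."""
    labels = [next((seeds[w] for w in ctx if w in seeds), None) for ctx in contexts]
    for _ in range(rounds):
        # Count how often each context word co-occurs with each labeled sense.
        cooc = {}
        for ctx, lab in zip(contexts, labels):
            if lab is not None:
                for w in ctx:
                    cooc.setdefault(w, Counter())[lab] += 1
        changed = False
        for i, ctx in enumerate(contexts):
            if labels[i] is None:
                score = Counter()
                for w in ctx:
                    score.update(cooc.get(w, {}))
                if score:
                    top = score.most_common(2) + [(None, 0)]
                    # Label the context only when one sense clearly dominates.
                    if top[0][1] >= margin * max(top[1][1], 1):
                        labels[i] = top[0][0]
                        changed = True
        if not changed:            # stop once no new contexts can be labeled
            break
    return labels
```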

Co-training uses unlabeled data together with labeled data to learn
f(x)=y when x can be expressed as a pair of features
x=<x1,x2> such that x1 and x2 are each individually
sufficient to predict y. This has been used to train web page
classifiers, named entity recognizers, image classifiers, and more.
The idea was introduced in 1998 by Blum & Mitchell
and has since been applied and extended in several directions.
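As an illustration, here is a minimal co-training sketch in Python, not
Blum & Mitchell's exact procedure: two classifiers are trained, one per
view, and each repeatedly adds its most confident self-labeled examples to
the shared labeled set. The choice of naive Bayes and the per-round growth
size are assumptions made for brevity.

```python
# A minimal co-training sketch (illustrative, not Blum & Mitchell's exact
# procedure).  Each example has two views x1, x2, each assumed to be
# individually sufficient to predict the label.
from sklearn.naive_bayes import MultinomialNB

def cotrain(L1, L2, y, U1, U2, rounds=10, per_round=5):
    """L1, L2: the two views of the labeled examples; y: their labels.
    U1, U2: the two views of the unlabeled pool (parallel lists)."""
    L1, L2, y = list(L1), list(L2), list(y)
    pool = list(range(len(U1)))              # indices of unlabeled examples
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        clf1.fit(L1, y)                      # one classifier per view
        clf2.fit(L2, y)
        for clf, view in ((clf1, U1), (clf2, U2)):
            if not pool:
                break
            probs = clf.predict_proba([view[i] for i in pool])
            # Take the examples this view's classifier is most confident
            # about, self-label them, and add them (in both views) to the
            # labeled set used by both classifiers in the next round.
            ranked = sorted(zip(list(pool), probs),
                            key=lambda t: t[1].max(), reverse=True)
            for i, p in ranked[:per_round]:
                L1.append(U1[i]); L2.append(U2[i])
                y.append(clf.classes_[p.argmax()])
                pool.remove(i)
    return clf1, clf2
```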

The above papers contain a number of theoretical models, especially
the Blum & Mitchell 1998 paper and the Abney paper. The following are
more recent theoretical models of how and when unlabeled data can
improve learning.

These papers provide PAC-style bounds on co-training and related
learning settings that go beyond those provided in the original
co-training paper.

Whereas the above papers focus on PAC bounds, the following paper has a
very different focus. It presents a statistical model for
estimating the accuracy of bootstrap-learned named entity and relation
extractors, under the assumption that correct entities and relations
will be extracted repeatedly from a large corpus, and more frequently
than incorrect extractions. This model is used by Etzioni's system
described above.
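As a rough illustration of the redundancy intuition only (not the paper's
actual model), the sketch below scores each distinct extraction by how
often it was extracted, combining mentions with a noisy-or; the per-mention
precision is an assumed parameter.

```python
# A minimal, assumed sketch of redundancy-based confidence scoring, in the
# spirit of (but not identical to) the statistical model mentioned above.
# The noisy-or combination and the per-mention precision p_hit are
# illustrative assumptions, not parameters of the actual model.
from collections import Counter

def extraction_confidence(extractions, p_hit=0.3):
    """extractions: list of extracted entity/relation strings, with repeats
    when the same fact is extracted from multiple sentences.
    Returns a confidence in [0, 1) for each distinct extraction."""
    counts = Counter(extractions)
    # Noisy-or: probability that at least one of the n mentions is a correct
    # extraction, assuming each mention is independently correct with p_hit.
    return {fact: 1.0 - (1.0 - p_hit) ** n for fact, n in counts.items()}

# A fact extracted five times scores much higher than a one-off error.
scores = extraction_confidence(["Paris capital-of France"] * 5 +
                               ["Paris capital-of Texas"])
```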