syntactic analyzers depend largely on POS and closed-class words,
and can be effectively trained from corpora of 100K to 1M words

these tasks are largely domain independent ... the grammar is largely the same across domains (although some adaptation is needed for different genres)

Supervised methods are less successful for semantic tasks

for some tasks (a few classes of named entities), classifiers are easily trained, but must be retrained for each domain

some tasks (lexical semantic relations, coreference, discourse
structure) are lexically specific and require large amounts of training
data ... more than we can practically annotate

our ideal: can we learn from unannotated text? ... we have
lots of it, thanks to the web

Types of learning

classification tasks: most of the tasks we will consider can be described as classification problems ... the data points to be classified are characterized by a set of features and values (usually with a finite set of possible values for each feature) ... the task is to assign each data point in a test set to one of a finite set of classes (we label the data points)
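for instance, a data point for a word-sense classification task might look like this (a hypothetical sketch; the feature names are illustrative, not taken from any particular paper):

    # one data point: features and values, plus the class we want to assign
    features = {"word": "bank", "prev_word": "river",
                "pos": "NN", "capitalized": False}   # finite value set per feature
    label = "bank/river-side"                        # one of a finite set of classes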

part-of-speech tagging: assign a part of speech to each token

chunking and named entity tagging: these units span several tokens, but we can use BIO tagging to transform them into a token tagging task
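for example (a hypothetical sentence; B-X begins an entity of type X, I-X continues it, and O marks tokens outside any entity):

    tokens = ["John",  "visited", "New",   "York",  "City",  "yesterday"]
    bio    = ["B-PER", "O",       "B-LOC", "I-LOC", "I-LOC", "O"]
    # the three-token name "New York City" becomes three per-token decisions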

hyponymy: X is a Y or X is not a Y

the classifier is trained using a training set of data points

supervised learning: every data point in the training set is labeled

unsupervised learning: none of the data points in the training set are labeled

semi-supervised learning: some of the data points in the training set are labeled

adaptation: we have labeled data for one task and
unlabeled data for a related task for which we want to train a
classifier

active learning: some of the data points in the training set are labeled, and the learning procedure can ask for the labels of some of the initially unlabeled points

with the hope that we can learn more efficiently ... get the
same performance with fewer labeled points than a supervised learner

We will look at a number of these tasks and see how they have been addressed by supervised and semi-supervised learning. In particular, we will consider what properties of the data allow us to learn from unannotated data ... the distributional hypothesis, parallel data.

Course organization:

mixture of lectures and student reports on papers

each student will report on 3-4 papers over the course of 12 weeks (approx. 5% of grade for each report)

References

Almost all of the papers we will look at have been published through the Assn. for Computational Linguistics (ACL). It maintains a fairly complete on-line archive of proceedings going back 30 years, and in some cases 40 years, at http://aclweb.org/anthology-new/. Studying the original papers will give some historical perspective on how the field is developing. However, I will also give references to two excellent texts which are suitable for background reading in natural language processing and basic statistical methods:

Daniel Jurafsky and James Martin, Speech and Language Processing. Prentice Hall, first edition, 2000; second edition, 2008. (The textbook for the basic natural language processing course; cited as J&M.)

A recent book on some of the topics of this course is: Steven Abney, Semisupervised Learning for Computational Linguistics, Chapman and Hall, 2008.

Standard abbreviations for citations (all available through the ACL
archive):

ACL = Proc. of the Annual Conference of the Assn. for Computational Linguistics
NAACL = Proc. of the Conf. of the North American Chapter of the ACL
EACL = Proc. of the Conf. of the European Chapter of the ACL
ANLP = Proc. of the Conf. on Applied Natural Language Processing (ended 2000)
COLING = Proc. of the Int'l Conf. on Computational Linguistics
EMNLP = Proc. of the Conf. on Empirical Methods in Natural Language Processing
CL = Computational Linguistics (journal)

Perspectives

Classifiers:

we will use a variety of classification models:

decision lists

Naive Bayes

maximum entropy

SVM [support vector machine]

if we are tagging a sequence of items (most often, a sequence of tokens), we don't want to treat each item as a separate classification task, but rather want to take into account that some sequences of classes are preferred, by using a sequential model (a decoding sketch follows the list):

HMM [Hidden Markov Model]

MEMM [Maximum Entropy Markov Model]

CRF [Conditional Random Field]
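
to make the sequential idea concrete, here is a minimal Viterbi decoder for an HMM tagger (a sketch; the log-probability tables log_init, log_trans, and log_emit are assumptions standing in for counts estimated from a tagged corpus):

    def viterbi(tokens, tags, log_init, log_trans, log_emit):
        # log_trans[p][t] scores tag t following tag p -- this is where
        # preferences among sequences of classes enter, which a per-token
        # classifier would ignore; unseen words get a large penalty
        V = [{t: log_init[t] + log_emit[t].get(tokens[0], -1e9) for t in tags}]
        back = []
        for tok in tokens[1:]:
            scores, ptrs = {}, {}
            for t in tags:
                prev = max(tags, key=lambda p: V[-1][p] + log_trans[p][t])
                scores[t] = (V[-1][prev] + log_trans[prev][t]
                             + log_emit[t].get(tok, -1e9))
                ptrs[t] = prev
            V.append(scores)
            back.append(ptrs)
        best = max(tags, key=lambda t: V[-1][t])
        path = [best]
        for ptrs in reversed(back):   # follow back-pointers to recover the path
            path.append(ptrs[path[-1]])
        return list(reversed(path))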

Semi-supervised learning strategies: self-training

One of the most common strategies for semi-supervised training is "self-training". We initially train a (base) classifier from the labeled data, and then use this classifier to label the unlabeled data. We then select a subset of the newly labeled data in which we have the greatest confidence, add it to the labeled training data, and repeat...
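
a minimal sketch of this loop, assuming a scikit-learn-style classifier with fit and predict_proba (the helper names match the description below):

    import numpy as np

    def train(base_learner, X, y):
        # the base learner: fit a classifier on the current labeled data
        return base_learner.fit(X, y)

    def label(clf, X):
        # label data with a classifier, assigning a confidence to each label
        probs = clf.predict_proba(X)
        return clf.classes_[probs.argmax(axis=1)], probs.max(axis=1)

    def select(conf, threshold=0.95):
        # select the newly labeled points we are most confident about
        return conf >= threshold

    def self_train(base_learner, X_lab, y_lab, X_unlab, rounds=10):
        clf = train(base_learner, X_lab, y_lab)
        for _ in range(rounds):
            if len(X_unlab) == 0:
                break
            y_hat, conf = label(clf, X_unlab)
            keep = select(conf)
            if not keep.any():        # one possible stopping criterion
                break
            # move the confidently labeled points into the training data
            X_lab = np.vstack([X_lab, X_unlab[keep]])
            y_lab = np.concatenate([y_lab, y_hat[keep]])
            X_unlab = X_unlab[~keep]
            clf = train(base_learner, X_lab, y_lab)   # ... and repeat
        return clf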

Here train is the base learner; label is a procedure for labeling data with a classifier and assigning a confidence to each label; select selects a subset of the data based on the confidence.

There are many design decisions in applying this to particular
tasks (Abney, pp. 20-22), such as how to compute confidence, how
to select data and how to
stop. For particular tasks, the results may be quite sensitive to
these decisions. We will discuss this as part of several papers
in the coming weeks.

We will also discuss a special case of self-training, co-training, in which we alternately
train two classifiers using disjoint sets of features.
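
a simplified sketch, reusing train, label, and select from above; X1/X2 are the two disjoint feature views of the same data points (an assumption for illustration; here both classifiers add their confident labels each round, a simplification of the alternating scheme):

    def co_train(learner1, learner2, X1, X2, y, X1_un, X2_un, rounds=10):
        for _ in range(rounds):
            clf1 = train(learner1, X1, y)   # view 1 features only
            clf2 = train(learner2, X2, y)   # view 2 features only
            if len(X1_un) == 0:
                break
            y1, c1 = label(clf1, X1_un)     # each classifier labels the pool
            y2, c2 = label(clf2, X2_un)     # in its own view
            keep = select(c1) | select(c2)
            if not keep.any():
                break
            # a point confidently labeled in either view becomes training
            # data for BOTH classifiers; trust the more confident view
            y_new = np.where(c1 >= c2, y1, y2)
            X1 = np.vstack([X1, X1_un[keep]])
            X2 = np.vstack([X2, X2_un[keep]])
            y = np.concatenate([y, y_new[keep]])
            X1_un, X2_un = X1_un[~keep], X2_un[~keep]
        return clf1, clf2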

Looking ahead: papers targeted for student presentations for next week