Chunking

Text chunking consists of dividing a text into syntactically correlated
groups of words.
For example, the sentence
He reckons the current account deficit will narrow to only # 1.8
billion in September .
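can be divided as follows:

  [NP He ] [VP reckons ] [NP the current account deficit ]
  [VP will narrow ] [PP to ] [NP only # 1.8 billion ]
  [PP in ] [NP September ] .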

Text chunking is an intermediate step towards full parsing.
It was the shared task of CoNLL-2000.
Training and test data for this task are available.
This data consists of the same partitions of the Wall Street Journal
corpus (WSJ) as the widely used data for noun phrase chunking:
sections 15-18 as training data (211727 tokens) and section 20 as
test data (47377 tokens).
The annotation of the data has been derived from the WSJ corpus
by a program written by Sabine Buchholz from Tilburg University,
The Netherlands.

The goal of this task is to put forward machine learning methods
which, after a training phase, can recognize the chunk segmentation
of the test data as well as possible.
The training data can be used for training the text chunker.
The chunkers will be evaluated with the F rate, which is a combination
of the precision and recall rates:
F = 2*precision*recall / (recall+precision) [Rij79].
The precision and recall numbers will be computed over all types of
chunks.
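
As a minimal sketch of the evaluation measure above, the following Python function computes precision, recall and F over two collections of chunks. Representing a chunk as a (type, first token, last token) tuple is an assumption made for this illustration; it is not the official evaluation script of the task.

```python
def f_score(gold_chunks, predicted_chunks):
    """Return (precision, recall, F) for two collections of chunks.

    Each chunk is assumed to be a hashable tuple such as
    (chunk_type, first_token_index, last_token_index). A predicted
    chunk counts as correct only if both its type and its boundaries
    match a gold chunk exactly.
    """
    gold = set(gold_chunks)
    predicted = set(predicted_chunks)
    correct = len(gold & predicted)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (recall + precision)
    return precision, recall, f


# Hypothetical example: the predicted chunker cuts one NP too short.
gold = [("NP", 0, 0), ("VP", 1, 1), ("NP", 2, 5), ("VP", 6, 7)]
predicted = [("NP", 0, 0), ("VP", 1, 1), ("NP", 2, 4), ("VP", 6, 7)]
precision, recall, f = f_score(gold, predicted)
print(precision, recall, f)
```

Because all chunk types are pooled into one set, the numbers are computed over all types of chunks at once, as described above.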

Background Information

In 1991, Steven Abney proposed to approach parsing by starting with
finding correlated chunks of words [Abn91].
Lance Ramshaw and Mitch Marcus have approached chunking by using a
machine learning method [RM95].
Their work has inspired many others to study the application of
learning methods to noun phrase chunking.
Other chunk types have not received the same attention as NP chunks.
The most complete work is [BVD99] which presents results for NP, VP,
PP, ADJP and ADVP chunks.
[Vee99] works with NP, VP and PP chunks.
[RM95] have recognized arbitrary chunks but classified every non-NP
chunk as a VP chunk.
[Rat98] has recognized arbitrary chunks as part of a parsing task but
did not report on the chunking performance.

Software and Data

The training and test data consist of three columns separated by spaces.
Each word has been put on a separate line and there is an empty line
after each sentence.
The first column contains the current word, the second its
part-of-speech tag as derived by the Brill tagger and the third its
chunk tag as derived from the WSJ corpus.
The chunk tags contain the name of the chunk type, for example I-NP
for noun phrase words and I-VP for verb phrase words.
Most chunk types have two types of chunk tags, B-CHUNK for the first
word of the chunk and I-CHUNK for each other word in the chunk.
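Here is an example of the file format:

   He        PRP  B-NP
   reckons   VBZ  B-VP
   the       DT   B-NP
   current   JJ   I-NP
   account   NN   I-NP
   deficit   NN   I-NP
   will      MD   B-VP
   narrow    VB   I-VP
   to        TO   B-PP
   only      RB   B-NP
   #         #    I-NP
   1.8       CD   I-NP
   billion   CD   I-NP
   in        IN   B-PP
   September NNP  B-NP
   .         .    O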

The O chunk tag is used for tokens which are not part of any chunk.
Instead of using the part-of-speech tags of the WSJ corpus, the data
set uses tags generated by the Brill tagger.
Performance with the corpus tags would be better, but it would be
unrealistic, since no perfect part-of-speech tags are available for
novel text.
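
The format described above can be read with a few lines of Python. The sketch below is only an illustration: the function names `read_sentences` and `extract_chunks` are hypothetical, and the sample is an illustrative annotation of the example sentence from the introduction. An I- tag that does not continue a chunk of the same type is treated as starting a new chunk, which is an assumption of this sketch.

```python
def read_sentences(lines):
    """Group 'word POS chunk' lines into sentences (lists of triples).

    An empty line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk = line.split()
        current.append((word, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


def extract_chunks(chunk_tags):
    """Turn a sequence of B-/I-/O tags into (type, first, last) tuples."""
    chunks = []
    start, chunk_type = None, None
    # A final "O" sentinel closes a chunk that runs to the end.
    for i, tag in enumerate(list(chunk_tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == chunk_type
        if chunk_type is not None and not continues:
            chunks.append((chunk_type, start, i - 1))
            chunk_type = None
        if prefix in ("B", "I") and chunk_type is None:
            start, chunk_type = i, name
    return chunks


SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O

"""

sentences = read_sentences(SAMPLE.splitlines())
words = [word for word, _pos, _chunk in sentences[0]]
for chunk_type, first, last in extract_chunks(c for _w, _p, c in sentences[0]):
    print(chunk_type, " ".join(words[first:last + 1]))
```

The (type, first, last) tuples produced here are a convenient unit for comparing a chunker's output with the annotated data, since a chunk is only correct when both its type and its boundaries match.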

Results

Eleven systems have been applied to the CoNLL-2000 shared task.
The systems used a wide variety of techniques.
Here is an overview of the performance of these 11 systems on the
test set together with other results (*) on this data set published
after the workshop:

The baseline result was obtained by selecting the chunk tag which
was most frequently associated with the current part-of-speech tag.
At the workshop, all 11 systems outperformed the baseline.
Most of them (six of the eleven) obtained an F-score between 91.5
and 92.5.
Two systems performed a lot better:
Support Vector Machines used by Kudoh and Matsumoto [KM00] and
Weighted Probability Distribution Voting used by Van Halteren [Hal00].
The papers associated with the participating systems can be found in
the reference section below.
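
The baseline described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the script used for the shared task: the function names are hypothetical, the training data are assumed to be lists of (word, part-of-speech tag, chunk tag) triples, and assigning O to part-of-speech tags never seen in training is a choice made here.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each part-of-speech tag to its most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for _word, pos, chunk in sentence:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def predict(model, pos_tags, default="O"):
    """Assign a chunk tag to every part-of-speech tag in a sentence.

    Tags unseen in training receive `default` (an assumption of this
    sketch).
    """
    return [model.get(pos, default) for pos in pos_tags]


# Hypothetical toy training data, not taken from the corpus.
training = [
    [("He", "PRP", "B-NP"), ("reckons", "VBZ", "B-VP"),
     ("the", "DT", "B-NP"), ("deficit", "NN", "I-NP")],
    [("the", "DT", "B-NP"), ("current", "JJ", "I-NP"),
     ("account", "NN", "I-NP"), ("deficit", "NN", "I-NP")],
    [("deficit", "NN", "B-NP"), ("narrows", "VBZ", "B-VP")],
]
model = train_baseline(training)
print(predict(model, ["DT", "NN", "VBZ", "CD"]))
```

The model ignores the words themselves and all context, which is why it serves as a lower bound for the learning systems.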

[CM03]
Xavier Carreras and Lluís Màrquez,
Phrase Recognition by Filtering and Ranking with Perceptrons.
In "Proceedings of the International Conference on Recent Advances
in Natural Language Processing, RANLP-2003", Borovets, Bulgaria, 2003.
http://www.lsi.upc.es/~nlp/papers/2003/ranlp2003-cm.ps.gz