What is Text Classification?

Text classification typically involves assigning a document to a
category by automated or human means. LingPipe provides a
classification facility that takes examples of text
classifications--typically generated by a human--and learns how to
classify further documents using what it learned with language
models. There are many other ways to construct classifiers, but
language models are particularly good at some versions of this task.

20 Newsgroups Demo

A publicly available data set to work with is the 20 newsgroups
data set.

4 Newsgroups Sample

We have included a sample of 4 newsgroups with the LingPipe
distribution in order to allow you to run the tutorial out
of the box. You may also download and run over the entire
20 newsgroups data set. LingPipe's performance over the whole
data set is state of the art.

Quick Start

Once you have downloaded and installed LingPipe, change
directories to the one containing this read-me:

> cd demos/tutorial/classify

You may then run the demo from the command line (placing all of the
code on one line):

On Windows:

java
-cp "../../../lingpipe-4.1.0.jar;
classifyNews.jar"
ClassifyNews

On Linux, Mac OS X, and other Unix-like operating systems:

java
-cp "../../../lingpipe-4.1.0.jar:
classifyNews.jar"
ClassifyNews

or through Ant:

ant classifyNews

The demo will then train on the data in
demos/fourNewsGroups/4news-train/ and evaluate on
demos/fourNewsGroups/4news-test. The results of scoring are
printed to the command line and explained in the rest of this
tutorial.

The Code

The entire source for the example is ClassifyNews.java. We will be using
the API from Classifier
and its subclasses to train the classifier, and Classification
to evaluate it. The code should be pretty self-explanatory in terms of
how training and evaluation are done. Below I go over the API calls.

Training

We are going to train up a set of character-based language models (one
per newsgroup, as named in the static array CATEGORIES) that process
data in 6-character sequences, as specified by the NGRAM_SIZE
constant.

Generally, the smaller your data, the smaller the n-gram size
should be, but you can play around with different values--reasonable
ranges are from 1 to 16, with 6 being a good general starting
place.
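To make the n-gram notion concrete, here is a plain-Java sketch
(independent of LingPipe) of the 6-character sequences a process
model would see in a piece of text:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Extract every character n-gram of the given size from a text.
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++)
            grams.add(text.substring(i, i + n));
        return grams;
    }

    public static void main(String[] args) {
        // With NGRAM_SIZE = 6, "newsgroup" yields 4 six-character windows.
        System.out.println(charNGrams("newsgroup", 6));
        // prints [newsgr, ewsgro, wsgrou, sgroup]
    }
}
```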

The actual classifier involves one language model per category. In
this case, we are going to use process language models (LanguageModel.Process). There is a factory method in DynamicLMClassifier
to construct the actual models.

There are two other kinds of language model classifiers that may
be constructed, for bounded character language models and
tokenized language models.

Training a classifier simply involves providing examples of text
for the various categories. This is done through the handle
method, after first constructing a classification from the category
and a classified object from the classification and text.

That's all you need to train up a language model classifier. Now we can
see what it can do with some evaluation data.

Classifying News Articles

The DynamicLMClassifier
is pretty slow when doing classification, so it is generally worth
going through a compile step to produce a more efficient compiled
version, which will classify character sequences into joint
classification results. A simple way to do that is shown in the code.

Now the rubber hits the road and we can see how well the machine
learning is doing. The example code both reports classifications to
the console and evaluates the performance; the crucial lines are in
ClassifyNews.java.

The text is an article that was not trained on, and the JointClassification
is the result of evaluating the text against all the language
models. It contains a bestCategory() method that
returns the name of the highest-scoring language model for the text. Just to
be sure that some statistics are involved, the toString()
method dumps out all the results.
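As a plain-Java illustration of what bestCategory() conceptually
does--not LingPipe's implementation--the sketch below picks the
category whose model scored the text highest; the scores are made-up
log probabilities for the four demo newsgroups:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BestCategoryDemo {
    // Return the category whose language model assigned the highest
    // score (less negative log probability = better fit).
    static String bestCategory(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-category scores for one article.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("soc.religion.christian", -1.56);
        scores.put("talk.religion.misc", -2.68);
        scores.put("alt.atheism", -2.70);
        scores.put("misc.forsale", -3.24);
        System.out.println(bestCategory(scores));
        // prints soc.religion.christian
    }
}
```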

Scoring Accuracy

The remaining API of note is how the
system is scored against a gold standard--in this case, our testing
data. Since we know which newsgroup each article came from, we can
evaluate how well the software is doing with the JointClassifierEvaluator class.

This class wraps the compiledClassifier in an evaluation framework
that provides very rich reporting of how well the system is doing. Later in the
code it is populated with data points via the method addCase(),
after first constructing a classified object as for training.

This will get a JointClassification for the text and then keep track
of the results for later reporting. After all the data has been run,
many methods exist to see how well the software did. In the demo code
we just print out the total accuracy via the ConfusionMatrix
class, but it is well worth looking at the relevant Javadoc to see what
reporting is available.
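To make the accuracy computation concrete, here is a plain-Java
sketch over a hypothetical 4-category confusion matrix; LingPipe's
ConfusionMatrix class computes this and much more:

```java
public class ConfusionDemo {
    // Total accuracy = sum of the diagonal (correct responses)
    // divided by the total number of cases.
    static double accuracy(int[][] confusion) {
        int correct = 0, total = 0;
        for (int i = 0; i < confusion.length; i++)
            for (int j = 0; j < confusion[i].length; j++) {
                total += confusion[i][j];
                if (i == j) correct += confusion[i][j];
            }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical matrix: rows = reference category, cols = response.
        int[][] confusion = {
            { 9, 1, 0, 0 },
            { 0, 10, 0, 0 },
            { 1, 0, 9, 0 },
            { 0, 0, 2, 8 }
        };
        System.out.println(accuracy(confusion));
        // prints 0.9 (36 correct out of 40)
    }
}
```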

Cross-Validation

Running Cross-Validation

There's an Ant target, crossValidateNews, which cross-validates
the news classifier over 10 folds and prints a report for each fold.

This reports that there are 250 training examples. With 10 folds,
that'll be 225 training and 25 test cases each. The accuracy for each
fold is reported along with the 95% normal approximation to the
binomial confidence interval per run (with no smoothing on the
binomial estimate, hence the 0.00 variance for fold 4). The moral of
this story is that small training sizes lead to large variance.
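The per-fold interval arithmetic is easy to check by hand. Here is a
plain-Java sketch (independent of LingPipe) of the 95% normal
approximation, using the 25-case fold size from this demo:

```java
public class ConfidenceDemo {
    // 95% normal approximation to the binomial confidence interval
    // for accuracy p over n test cases: p +/- 1.96 * sqrt(p(1-p)/n).
    static double[] interval95(int correct, int n) {
        double p = (double) correct / n;
        double halfWidth = 1.96 * Math.sqrt(p * (1.0 - p) / n);
        return new double[] { p - halfWidth, p + halfWidth };
    }

    public static void main(String[] args) {
        // A fold with 20 of 25 correct: p = 0.8, half-width 0.1568.
        double[] ci = interval95(20, 25);
        System.out.printf("(%.3f, %.3f)%n", ci[0], ci[1]);
        // prints (0.643, 0.957)

        // A fold with 25 of 25 correct: p(1-p) = 0, so the unsmoothed
        // estimate has zero variance and a zero-width interval.
        double[] perfect = interval95(25, 25);
        System.out.printf("(%.3f, %.3f)%n", perfect[0], perfect[1]);
        // prints (1.000, 1.000)
    }
}
```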

More Cheating Possibilities

Reading cross-validation results can be challenging, because
they have the characteristics of results on development sets.
Researchers often report cross-validation results for the
best parameter settings they found, which typically overestimates
accuracy on truly held-out data.

Cross-validation is a means of using a single corpus to train and
evaluate without deciding ahead of time how to carve the data into
test and training portions. This is often used for evaluation, but
more properly should be used only for development.

Cross-Validation for Development

Another common approach is to use cross-validation during
development. Then, with a truly held-out test set, there is
no bias in the reports.

How Cross-Validation Works

Cross-validation divides a corpus into a number of evenly sized
portions called folds. Then for each fold, the data not in the fold
is used to train a classifier which is then evaluated on the current
fold. The results are then pooled across the folds, which greatly
reduces the variance in the evaluation, reflected in narrower confidence
intervals.
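The fold arithmetic above can be sketched in plain Java; this is
only an illustration of the partitioning, not LingPipe's
implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class FoldDemo {
    // Items whose index falls in the fold's slice are the test cases.
    static <E> List<E> testFold(List<E> items, int numFolds, int fold) {
        int start = fold * items.size() / numFolds;
        int end = (fold + 1) * items.size() / numFolds;
        return new ArrayList<>(items.subList(start, end));
    }

    // Everything outside the slice is training data for that fold.
    static <E> List<E> trainFold(List<E> items, int numFolds, int fold) {
        int start = fold * items.size() / numFolds;
        int end = (fold + 1) * items.size() / numFolds;
        List<E> train = new ArrayList<>(items.subList(0, start));
        train.addAll(items.subList(end, items.size()));
        return train;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 250; i++) items.add(i);
        // 250 items over 10 folds: 25 test and 225 training cases apiece.
        System.out.println(testFold(items, 10, 3).size());   // prints 25
        System.out.println(trainFold(items, 10, 3).size());  // prints 225
    }
}
```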

Implementing a Cross-Validating Corpus

Corpus Implementations for Training

An instance of Corpus is required for
training batch-oriented classifiers which make multiple passes over the data,
such as logistic regression and perceptron classifiers.

LingPipe supplies a convenient corpus.Corpus
class which is meant to be used for generic training and testing
applications like cross-validation. The corpus class is typed based
on the handler type H intended to handle its data.
The basis of the corpus class is
a pair of methods, visitTrain(H) and visitTest(H), which
send the handler every training instance or every testing instance respectively.

LingPipe implements cross-validation for evaluation with the class
corpus.XValidatingObjectCorpus.
This corpus implementation just stores the data in parallel lists and
uses them to implement the corpus's visitTest() and visitTrain()
methods.
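A minimal plain-Java sketch of such a corpus (a hypothetical class,
not LingPipe's XValidatingObjectCorpus itself) shows the
store-then-slice idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Items are stored in a list; the current fold determines which slice
// visitTest() sends to the handler, with the rest going to visitTrain().
public class MiniXValCorpus<E> {
    private final List<E> items = new ArrayList<>();
    private final int numFolds;
    private int fold;

    MiniXValCorpus(int numFolds) { this.numFolds = numFolds; }

    void handle(E item) { items.add(item); }

    void setFold(int fold) { this.fold = fold; }

    private int start() { return fold * items.size() / numFolds; }
    private int end() { return (fold + 1) * items.size() / numFolds; }

    // Send every item outside the current fold to the training handler.
    void visitTrain(Consumer<E> handler) {
        for (int i = 0; i < items.size(); i++)
            if (i < start() || i >= end()) handler.accept(items.get(i));
    }

    // Send every item inside the current fold to the testing handler.
    void visitTest(Consumer<E> handler) {
        for (int i = start(); i < end(); i++) handler.accept(items.get(i));
    }

    public static void main(String[] args) {
        MiniXValCorpus<String> corpus = new MiniXValCorpus<>(5);
        for (int i = 0; i < 10; i++) corpus.handle("doc" + i);
        corpus.setFold(0);
        List<String> test = new ArrayList<>();
        corpus.visitTest(test::add);
        System.out.println(test);  // prints [doc0, doc1]
    }
}
```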

Permuting Inputs

User-Specified Randomizer

Always allow users to specify their own instance
of java.util.Random in any method that
does randomization. There are two reasons. First,
it's the only way to guarantee repeatability during
testing. Second, it's the only way to allow users
to implement a different or better randomizer than the one built
into Java's Random implementation.

It is critical in evaluating classifiers to pay attention to
correlations in the corpus. In the case of the 20 newsgroups data,
which is organized by category, a naive 10% cross-validation fold
could remove most or all of a category's training data, since each
category's examples lie in a contiguous run.

To solve this problem, the cross-validating corpus implementation
includes a method to permute the corpus using a supplied instance
of java.util.Random.

We implemented the randomizer with a fixed seed so that
experiments would be repeatable. Change the seed to get a
different set of runs. You should see the variance even more
clearly after more runs.
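The repeatability point is easy to demonstrate in plain Java with
Collections.shuffle and a caller-supplied java.util.Random (the
corpus contents here are made up):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PermuteDemo {
    // Build a corpus ordered by category: a contiguous run per newsgroup.
    static List<String> orderedCorpus() {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 5; i++) items.add("alt.atheism/" + i);
        for (int i = 0; i < 5; i++) items.add("misc.forsale/" + i);
        return items;
    }

    // Permute with a caller-supplied seed; a fixed seed makes the
    // permutation (and hence the folds) repeatable across runs.
    static List<String> permuted(long seed) {
        List<String> items = orderedCorpus();
        Collections.shuffle(items, new Random(seed));
        return items;
    }

    public static void main(String[] args) {
        System.out.println(permuted(42L));
        // The same seed always yields the same order.
        System.out.println(permuted(42L).equals(permuted(42L)));  // prints true
    }
}
```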

Cross-Validation Implementation

The command-line implementation for cross-validating is in
src/CrossValidateNews.java.
The code mostly repeats the simple classifier code. First, we create
a cross-validating corpus, then store all of the data from both the
training and test directories.

For each fold, the fold is first set on the corpus. Then a trainable
classifier is created and the corpus is used to train it through the
visitTrain() method. Then the classifier is compiled
and used to construct an evaluator. The evaluator is then run over
the test cases by the corpus method visitTest(). Finally,
the resulting accuracy and 95% confidence interval are printed.

Leave-One-Out Evaluations

Efficient Leave One Out

Leave-one-out evaluations can be very expensive in general
implementations that literally retrain on all but one example.
It's much more efficient if there is a way to untrain an
example. Then, the whole corpus can be trained, and each
example can be visited, untrained, evaluated, and added back
in.

The limit of cross-validation is when each fold consists of a
single example. This is called "leave one out" (LOO). This is easily
achieved in the general corpus implementation by setting the number of
folds equal to the number of data points. The only potential problem
is rounding errors in the arithmetic. Because retraining for every
example is expensive, leave-one-out evaluations are typically done
with specialized implementations. Also, in doing leave-one-out, there
is no point in compiling the classifier before running it, since each
trained model is used for only a single evaluation.
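The train-everything-then-untrain approach from the sidebar can be
sketched with a toy count-based classifier; the class and its
untrain() method are hypothetical stand-ins for a model that
supports untraining, not LingPipe APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Train once on the whole corpus, then for each example: untrain it,
// classify it, and retrain it. The "classifier" here is just a
// majority-class baseline with count-based train/untrain.
public class LooDemo {
    private final Map<String, Integer> counts = new HashMap<>();

    void train(String category) { counts.merge(category, 1, Integer::sum); }
    void untrain(String category) { counts.merge(category, -1, Integer::sum); }

    // The toy classifier predicts the most frequent remaining category.
    String bestCategory() {
        String best = null;
        int bestCount = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        return best;
    }

    static int looCorrect(String[] labels) {
        LooDemo model = new LooDemo();
        for (String label : labels) model.train(label);  // train on everything once
        int correct = 0;
        for (String label : labels) {
            model.untrain(label);                        // remove the held-out example
            if (model.bestCategory().equals(label)) correct++;
            model.train(label);                          // add it back
        }
        return correct;
    }

    public static void main(String[] args) {
        System.out.println(looCorrect(new String[] { "a", "a", "a", "a", "b" }) + "/5");
        // prints 4/5: the lone "b" is misclassified once it is held out
    }
}
```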