What are Significant Phrases?

LingPipe provides a simple way to find statistically significant phrases
in a document collection. There are two types of significance
of interest.

Collocations

Collocations are phrases whose tokens are seen together more often than
you would expect given how frequent each token is on its own. For
example, in 600 posts from the rec.sport.hockey newsgroup, the
collocations of length two found by LingPipe are:

These are two-token phrases ranked by how often the tokens are seen
together as opposed to how often each is seen alone. For example,
'Los Angeles' has a higher score than 'Tie Breaker' because we see
'Los' 67 times, 'Angeles' 67 times, and 'Los Angeles' 67 times. So
'Los' and 'Angeles' always occur as part of the larger phrase--a high
correlation. On the other hand, 'Tie' occurs 15 times, 'Breaker' 8
times, and 'Tie Breaker' 8 times, so 'Tie' occurs with the larger
phrase only about half the time--less of a correlation.
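The intuition can be checked with a little arithmetic. The toy score
below--the fraction of each token's occurrences that fall inside the
phrase, multiplied together--is not LingPipe's actual collocation
statistic, just an illustration of why 'Los Angeles' outranks
'Tie Breaker':

```java
// Toy association score: how strongly two tokens stick together as a
// phrase. NOT LingPipe's actual statistic -- just an illustration of
// how phrase count relative to individual token counts drives ranking.
public class TokenAssociation {

    // fraction of a's occurrences in the phrase times fraction of b's
    static double stickiness(int countA, int countB, int countAB) {
        return ((double) countAB / countA) * ((double) countAB / countB);
    }

    public static void main(String[] args) {
        // 'Los' 67x, 'Angeles' 67x, 'Los Angeles' 67x -> score 1.0
        System.out.println("Los Angeles: " + stickiness(67, 67, 67));
        // 'Tie' 15x, 'Breaker' 8x, 'Tie Breaker' 8x -> well under 1.0
        System.out.println("Tie Breaker: " + stickiness(15, 8, 8));
    }
}
```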

Relatively New Terms

This technique evaluates the significance of phrases in one collection
versus another, finding phrases that occur significantly more often in
the foreground corpus than the background corpus would lead you to
expect. This is a way to tell whether there is more news than usual
about a phrase--if counts for 'George Bush' go way up, there are
likely some big stories about him. It is the same idea behind a
feature like Google's "In the News" list.

In our hockey data set, 400 chronologically later articles turn out to
mention the following capitalized phrases surprisingly often:

Google's "In the News"

Running an example

If you have not already, download and install LingPipe. That done,
change directory to
demos/tutorial/interestingPhrases. You will also need to
unpack the data in demos/rec.sport.hockey.tar.gz and
type the following on a single line (replacing the colon ":" with a semicolon ";" if using Windows):
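The exact command is not reproduced here and depends on your LingPipe
release; a typical invocation (the jar names and data path below are
assumptions, so adjust them to your install) would look something like:

```shell
# Assumed jar names and paths -- substitute those shipped with your
# LingPipe release; on Windows use ';' instead of ':' in the classpath.
java -cp "interestingPhrases.jar:../../../lingpipe.jar" InterestingPhrases ../../rec.sport.hockey
```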

The resulting output will be collocations for the background model and
new terms from the foreground model given the background model. See
below for more details on how the software was configured.

The Code

First we assemble the background model by visiting a directory of text
files and training up a tokenized language model (a model over tokens
rather than over characters, as in our classifier demo). The rubber
hits the road in just a few lines of the InterestingPhrases.java class:
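That listing is not reproduced here; pieced together from the steps
described in the rest of this section, its shape is roughly (pseudocode,
not the original source):

```
backgroundModel = buildModel(backgroundDirectory)
backgroundModel.sequenceCounter().prune(3)
collocations = backgroundModel.collocations(nGram, minCount, maxReturned)
report(collocations)
```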

We need to specify what kind of tokenization to use in training, and
for English a reasonable choice is the supplied IndoEuropeanTokenizerFactory,
which provides tokenizers. Recall that this will take a text like

"So many morons...

and produce an array of tokens:

{ "\"", "So", "many",
"morons", "..." }

If you need a different tokenization--for instance, if you don't want
punctuation to be a token--then this class is the place to start
digging around.

Next we have the method buildModel, which takes a directory of files and returns a tokenized language model. That method is:
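The listing is omitted here, but its shape can be sketched in plain
Java, with a token-count map standing in for LingPipe's TokenizedLM
(the train method below is a hypothetical stand-in for the real
model's train()):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Sketch of buildModel's shape: visit every file in a directory and
// feed its text to the model. A HashMap of token counts stands in for
// LingPipe's TokenizedLM; the real method calls model.train(text).
public class BuildModelSketch {

    static Map<String, Integer> buildModel(Path dir) {
        Map<String, Integer> model = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files)
                train(model, new String(Files.readAllBytes(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return model;
    }

    // stand-in for the model's train(): count whitespace-separated tokens
    static void train(Map<String, Integer> model, String text) {
        for (String token : text.split("\\s+"))
            if (!token.isEmpty())
                model.merge(token, 1, Integer::sum);
    }
}
```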

We create a new TokenizedLM,
which requires us to specify the length of the token sequences it
counts. Once that object is created, we do a bit of training with the
train()
method, which takes the text of each file. Pretty simple.

Popping back to where buildModel() is called, the next
interesting line is
backgroundModel.sequenceCounter().prune(3); which makes
the collocation calculation run much faster because rarely seen
sequences are removed from the internal data structures.
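A toy counterpart of that pruning step--dropping every sequence seen
fewer than minCount times--looks like this (plain Java over a count
map, not LingPipe's actual sequence counter):

```java
import java.util.HashMap;
import java.util.Map;

// Toy counterpart of sequenceCounter().prune(3): remove every entry
// whose count is below the threshold so later passes touch less data.
public class PruneSketch {

    static void prune(Map<String, Integer> counts, int minCount) {
        counts.values().removeIf(count -> count < minCount);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 40);
        counts.put("zamboni", 2);
        prune(counts, 3);
        System.out.println(counts.keySet()); // only "the" survives
    }
}
```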

Next we get the goodies with the method call collocations(int,int,int),
which pulls out phrases of a specified token length nGram,
with a minimum number of instances minCount, returning at
most maxReturned phrases in the ranking. Some
experimenting is required to find a good minCount: if it
is set too low you get a lot of noisy low-count phrases, and if it is
set too high you may lose interesting examples. The method returns an
array of ScoredObject instances, a convenient class to use when there
is a double value associated with an object.

Now that we have collocations--101 of them to be exact, of length 2
given the constant definitions--let's do something with them, like
print them out in order. The method report() does just that:

The array is iterated over in the obvious fashion, with the
score and accum variables built up in
report_filter. We only look at words which start with a capital
letter, and it all ends in a glorious call to println. In
an actual application you might want to find the phrases in the
documents and provide hyperlinks from a web page, compare them to
yesterday's collocations, or populate a database for a trend-spotting
data-mining application.
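The filtering-and-printing logic can be sketched as follows; the
score-to-phrase map below plays the role of the ScoredObject array,
and the method names are illustrative, not the tutorial's actual code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the report logic: rank scored phrases from best to worst
// and keep only those whose tokens all start with a capital letter.
public class ReportSketch {

    static List<String> report(Map<String, Double> scoredPhrases) {
        List<String> kept = new ArrayList<>();
        scoredPhrases.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .filter(e -> allCapitalized(e.getKey()))
            .forEach(e -> kept.add(e.getValue() + " " + e.getKey()));
        return kept;
    }

    static boolean allCapitalized(String phrase) {
        for (String token : phrase.split(" "))
            if (!Character.isUpperCase(token.charAt(0)))
                return false;
        return true;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("Los Angeles", 1.0);
        scores.put("the puck", 0.9);   // filtered out: not capitalized
        scores.put("Tie Breaker", 0.53);
        for (String line : report(scores))
            System.out.println(line);
    }
}
```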

Finding New Terms in Comparison with Another Data Set

Sometimes a more useful definition of interesting phrase is "Am I
seeing phrases in one data set more often than I would expect given
another data set?" For example, "What is in today's news
that is more common than expected given the past week of news?" With
very little additional work we can get just such a capability with
LingPipe.

All we need to do is train up a foreground model and compare it to the
background model we already have from the collocations demo
above. Continuing in the source for InterestingPhrases.java we
have:

We simply build a model with some different data--in this case more,
and chronologically later, data from rec.sport.hockey--prune it,
and apply the method newTerms
with appropriate parameters. Then we use the same report
method as with the collocations.
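The comparison newTerms performs can be illustrated with a toy
surprise score: how far a phrase's foreground count sits above what
the background rate predicts, measured in binomial standard
deviations. This is not LingPipe's exact statistic, just the idea:

```java
// Toy "new term" score: standard deviations by which the foreground
// count of a phrase exceeds the count predicted by the background
// rate. LingPipe's newTerms uses its own statistic; this just
// illustrates the foreground-versus-background comparison.
public class NewTermSketch {

    static double surprise(int fgCount, long fgTotal,
                           int bgCount, long bgTotal) {
        double p = (double) bgCount / bgTotal;          // background rate
        double expected = p * fgTotal;                  // expected fg count
        double sd = Math.sqrt(fgTotal * p * (1 - p));   // binomial std dev
        return (fgCount - expected) / sd;
    }

    public static void main(String[] args) {
        // a phrase seen 5x per 10,000 background tokens but 40x in
        // 8,000 foreground tokens is far above its expected count
        System.out.println(surprise(40, 8000, 5, 10000));
        // one appearing at roughly the background rate is not
        System.out.println(surprise(4, 8000, 5, 10000));
    }
}
```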

References

For a survey of research on collocations and novel phrase finding, see: