THE TERM-DOCUMENT MATRIX

Doing the Numbers

As we mentioned in our discussion of LSI, the term-document matrix
is a large grid representing every document and content word in a
collection. We have looked in detail at how a document is converted
from its original form into a flat list of content words. We prepare a
master word list by generating a similar set of words for every
document in our collection, and discarding any content words that
either appear in every document (such words won't let us discriminate
between documents) or in only one document (such words tell us nothing
about relationships across documents). With this master word list in
hand, we are ready to build our TDM.

We generate our TDM by arranging our list of all content words along
the vertical axis, and a similar list of all documents along the
horizontal axis. These need not be in any particular order, as long as
we keep track of which column and row corresponds to which keyword and
document. For clarity we will show the keywords as an alphabetized list.

We fill in the TDM by going through every document and marking the
grid square for all the content words that appear in it. Because any
one document will contain only a tiny subset of our content word
vocabulary, our matrix is very sparse (that is, it consists almost
entirely of zeroes).
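The construction described above can be sketched in a few lines of Python. The documents, words, and counts here are invented for illustration; the article's wire-stories collection is of course far larger and sparser:

```python
from collections import Counter

# Assume each document has already been reduced to a flat list of
# content words (stop words removed, words stemmed).
docs = {
    "doc1": ["wrestler", "match", "title"],
    "doc2": ["project", "budget", "title"],
    "doc3": ["project", "wrestler", "budget"],
}

# Master word list: discard words appearing in every document
# (no discriminating power) or in only one (no cross-document info).
doc_freq = Counter(w for words in docs.values() for w in set(words))
vocab = sorted(w for w, df in doc_freq.items() if 1 < df < len(docs))

# Rows = terms, columns = documents; 1 marks "term appears in document".
columns = list(docs)
tdm = [[1 if term in set(docs[d]) else 0 for d in columns]
       for term in vocab]

for term, row in zip(vocab, tdm):
    print(f"{term:10s}", row)
```

Note that "match" is dropped from the vocabulary because it appears in only one document.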

Here is a fragment of the actual term-document matrix from our wire stories database:

We can easily see whether a given word appears in a given document by
looking at the intersection of the appropriate row and column. In this
sample matrix, we have used ones to mark the document/keyword pairs
where the keyword occurs. With such a binary scheme, all we can tell
about any given document/keyword combination is whether the keyword
appears in the document.

This approach will give acceptable results, but we can significantly
improve them by applying a kind of linguistic favoritism called term
weighting to the value we use for each non-zero term/document pair.

Not All Words Are Created Equal

Term weighting is a formalization of two common-sense insights:

Content words that appear several times in a document are probably more meaningful than content words that appear just once.

Infrequently used words are likely to be more interesting than common words.

The first of these insights applies to individual documents, and we refer to it as local weighting.
Words that appear multiple times in a document are given a greater
local weight than words that appear once. We use a formula called logarithmic local weighting to generate our actual value.

The second insight applies to the set of all documents in our collection, and is called global term weighting.
There are many global weighting schemes; all of them reflect the fact
that words that appear in a small handful of documents are likely to be
more significant than words that are distributed widely across our
document collection. Our own indexing system uses a scheme called inverse document frequency to calculate global weights.

By way of illustration, here are some sample words from our
collection, with the number of documents they appear in, and their
corresponding global weights.

You can see that a word like wrestler, which appears in only seven documents, is considered twice as significant as a word like project, which appears in over a hundred.

There is a third and final step to weighting, called normalization.
This is a scaling step designed to keep large documents with many
keywords from overwhelming smaller documents in our result set. It is
similar to handicapping in golf - smaller documents are given more
importance, and larger documents are penalized, so that every document
has equal significance.

These three values multiplied together - local weight, global
weight, and normalization factor - determine the actual numerical value
that appears in each non-zero position of our term/document matrix.
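Putting the three steps together for a single document column might look like the sketch below. The specific formulas (1 + log tf for local weighting, log N/df for global weighting, cosine normalization) are common choices rather than the article's exact recipe:

```python
import math

def weighted_column(term_counts, doc_freqs, n_docs):
    """Compute the weighted values for one document column.

    Each non-zero entry is local weight x global weight, then the
    whole column is scaled to unit length (cosine normalization) so
    long documents don't overwhelm short ones. Formula variants are
    illustrative; the article does not specify them exactly.
    """
    raw = [(1.0 + math.log(tf)) * math.log(n_docs / df) if tf > 0 else 0.0
           for tf, df in zip(term_counts, doc_freqs)]
    norm = math.sqrt(sum(v * v for v in raw))  # normalization factor
    return [v / norm for v in raw] if norm else raw

# One document: term counts per vocabulary word, and each word's
# document frequency in a hypothetical 1,000-document collection.
print(weighted_column([3, 0, 1], [7, 100, 50], 1000))
```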

Although this step may appear language-specific, note that we are
only looking at word frequencies within our collection. Unlike the stop
list or stemmer, we don't need any outside source of linguistic
information to calculate the various weights. While weighting isn't
critical to understanding or implementing LSI, it does lead to much
better results, as it takes into account the relative importance of
potential search terms.

The Moment of Truth

With the weighting step done, we have done everything we need to
construct a finished term-document matrix. The final step will be to
run the SVD algorithm itself. Notice that this critical step will be
purely mathematical - although we know that the matrix and its contents
are a shorthand for certain linguistic features of our collection, the
algorithm doesn't know anything about what the numbers mean. This is
why we say LSI is language-agnostic - as long as you can perform the
steps needed to generate a term-document matrix from your data
collection, it can be in any language or format whatsoever.

You may be wondering what the large matrix of numbers we have
created has to do with the term vectors and many-dimensional spaces we
discussed in our earlier explanation of how LSI works. In fact, our
matrix is a convenient way to represent vectors in a high-dimensional
space. While we have been thinking of it as a lookup grid that shows us
which terms appear in which documents, we can also think of it in
spatial terms. In this interpretation, every column is a long list of
coordinates that gives us the exact position of one document in a
many-dimensional term space. When we applied term weighting to our
matrix in the previous step, we nudged those coordinates around to make
the document's position more accurate.

As the name suggests, singular value decomposition breaks our matrix
down into a set of smaller components. The algorithm alters one of
these components (this is where the number of dimensions gets
reduced), and then recombines them into a matrix of the same shape as our
original, so we can again use it as a lookup grid. The matrix we get
back is an approximation of the term-document matrix we provided as
input, and looks much different from the original:
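The decomposition and recombination can be sketched with NumPy. The tiny matrix and the choice of k = 2 retained dimensions are illustrative only; a real collection would have thousands of rows and columns and keep a few hundred dimensions:

```python
import numpy as np

# A toy term-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# Break the matrix into three components: U, singular values, Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                       # number of dimensions to keep
s_k = np.diag(s[:k])        # discard all but the k largest singular values

# Recombine into a matrix of the same shape as the original.
A_k = U[:, :k] @ s_k @ Vt[:k, :]
print(np.round(A_k, 2))     # a dense approximation of A
```

The recombined matrix has the same shape as the input but only rank k, which is why previously zero cells acquire non-zero (and sometimes negative) values.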

The matrix contains far fewer zero values. Each document has a similarity value for most content words.

Some of the similarity values are negative. In our
original TDM, this would correspond to a document with fewer than zero
occurrences of a word, an impossibility. In the processed matrix, a
negative value is indicative of a very large semantic distance between
a term and a document.

This finished matrix is what we use to actually search our
collection. Given one or more terms in a search query, we look up the
values for each search term/document combination, calculate a
cumulative score for every document, and rank the documents by that
score, which is a measure of their similarity to the search query. In
practice, we will probably assign an empirically determined threshold value
to serve as a cutoff between relevant and irrelevant documents, so that
the query does not return every document in our collection.
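Query scoring against the processed matrix might be sketched as follows. The matrix values, vocabulary, and threshold here are all invented for illustration, and cosine similarity is one common choice of cumulative score:

```python
import numpy as np

# A processed (post-SVD) matrix: rows = terms, columns = documents.
# Values are invented for illustration.
vocab = ["budget", "project", "title", "wrestler"]
A_k = np.array([[ 0.9, 0.1, 0.8],
                [ 0.7, 0.2, 0.9],
                [ 0.1, 0.8, 0.2],
                [-0.1, 0.9, 0.1]])

# Represent the query as a vector over the same terms.
query_terms = ["wrestler", "title"]
q = np.array([1.0 if t in query_terms else 0.0 for t in vocab])

# Cumulative score per document: cosine similarity with the query.
scores = (q @ A_k) / (np.linalg.norm(q) * np.linalg.norm(A_k, axis=0))

threshold = 0.2  # empirically chosen cutoff (illustrative)
ranked = sorted(((s, i) for i, s in enumerate(scores) if s > threshold),
                reverse=True)
print(ranked)    # (score, document index), best match first
```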

The Big Picture

Now that we have looked at the details of latent semantic indexing,
it is instructive to step back and examine some real-life applications
of LSI. Many of these go far beyond plain search, and can assume some
surprising and novel guises. Nevertheless, the underlying techniques
will be the same as the ones we have outlined here.