LATENT SEMANTIC INDEXING

Taking a Holistic View

Regular keyword searches approach a document collection with a kind
of accountant mentality: a document contains a given word or it
doesn't, with no middle ground. We create a result set by looking
through each document in turn for certain keywords and phrases, tossing
aside any documents that don't contain them, and ordering the rest
based on some ranking system. Each document stands alone in judgement
before the search algorithm - there is no interdependence of any kind
between documents, which are evaluated solely on their contents.
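
As a rough illustration of that accountant mentality, here is a minimal sketch of a plain keyword search in Python. The function name, the sample documents, and the ranking rule (total count of query words) are all assumptions made for the example; real search engines use far more elaborate ranking systems.

```python
def keyword_search(documents, query):
    """Return ids of documents containing every query word, best match first."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        # A document either contains every keyword or it is tossed aside.
        if all(term in words for term in terms):
            # Hypothetical ranking: total number of keyword occurrences.
            score = sum(words.count(term) for term in terms)
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Made-up sample collection, for illustration only.
docs = {
    "d1": "a manifold is a topological space",
    "d2": "every smooth manifold is a manifold with extra structure",
    "d3": "topology studies continuity",
}

manifold_hits = keyword_search(docs, "manifold")
both_hits = keyword_search(docs, "topology manifold")
```

Each document is judged only on its own contents: the second query finds nothing, because no single document contains both words, even though all three documents are clearly about the same subject.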

Latent semantic indexing adds an important step to the document
indexing process. In addition to recording which keywords a document
contains, the method examines the document collection as a whole, to
see which other documents contain some of those same words. LSI
considers documents that have many words in common to be semantically
close, and ones with few words in common to be semantically distant.
This simple method correlates surprisingly well with how a human being,
looking at content, might classify a document collection. Although the
LSI algorithm doesn't understand anything about what the words mean,
the patterns it notices can make it seem astonishingly intelligent.
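
The idea that shared vocabulary signals semantic closeness can be sketched with a simple word-overlap measure. The example below uses cosine similarity over raw word counts, which is a deliberate simplification and not the LSI algorithm itself; the function name and sample texts are assumptions made for illustration.

```python
import math
from collections import Counter


def shared_word_similarity(text_a, text_b):
    """Cosine similarity of two texts based on raw word counts (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample texts, for illustration only.
math_1 = "manifold topology n-dimensional space"
math_2 = "topology of an n-dimensional manifold"
sports = "golf tournament results"

close = shared_word_similarity(math_1, math_2)
distant = shared_word_similarity(math_1, sports)
```

The two mathematical texts share three words and score well above zero, while the sports text shares none and scores zero. The measure knows nothing about mathematics or golf; it only counts words.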

When you search an LSI-indexed database, the search engine looks at
similarity values it has calculated for every content word, and returns
the documents that it thinks best fit the query. Because two documents
may be semantically very close even if they do not share a particular
keyword, LSI does not require an exact match to return useful results.
Where a plain keyword search will fail if there is no exact match, LSI
will often return relevant documents that don't contain the keyword at
all.

To use an earlier example, let's say we use LSI to index our collection of mathematical articles. If the words n-dimensional, manifold and topology
appear together in enough articles, the search algorithm will notice
that the three terms are semantically close. A search for n-dimensional manifolds
will therefore return a set of articles containing that phrase (the
same result we would get with a regular search), but also articles that
contain just the word topology. The
search engine understands nothing about mathematics, but examining a
sufficient number of documents teaches it that the three terms are
related. It then uses that information to provide an expanded set of
results with better recall than a plain keyword search.
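
The effect described above can be imitated with a much cruder technique than LSI itself. The sketch below simply counts how many documents each pair of terms shares, then widens the query with any term that co-occurs often enough with a query term. It is a stand-in for illustration, not the LSI algorithm; the function names, the min_shared threshold, and the sample articles are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(documents):
    """Count, for each ordered pair of terms, how many documents contain both."""
    counts = defaultdict(int)
    for words in documents.values():
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def expanded_search(documents, query_terms, counts, min_shared=2):
    """Return (exact, related) lists of document ids.

    exact:   documents containing every query term.
    related: other documents containing a term that co-occurs with a
             query term in at least min_shared documents.
    """
    related_terms = {
        other
        for (term, other), n in counts.items()
        if term in query_terms and n >= min_shared
    } - set(query_terms)
    exact, related = [], []
    for doc_id, words in documents.items():
        present = set(words)
        if all(t in present for t in query_terms):
            exact.append(doc_id)
        elif present & related_terms:
            related.append(doc_id)
    return exact, related


# Made-up article collection, already split into content words.
articles = {
    "a1": ["n-dimensional", "manifold", "topology"],
    "a2": ["n-dimensional", "manifold", "topology", "proof"],
    "a3": ["topology", "proof"],
    "a4": ["golf", "tournament"],
}

counts = build_cooccurrence(articles)
exact, related = expanded_search(articles, ["n-dimensional", "manifold"], counts)
```

Here the exact matches are a1 and a2, and a3 is returned as a related article because it contains topology, which appears alongside both query terms in two articles. The golf article is left out. Nothing in the code knows what a manifold is; the connection comes entirely from word use in the collection.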

Ignorance is Bliss

We mentioned the difficulty of teaching a computer to organize data
into concepts and demonstrate understanding. One great advantage of LSI
is that it is a strictly mathematical approach, with no insight into
the meaning of the documents or words it analyzes. This makes it a
powerful, generic technique able to index any cohesive document
collection in any language. It can be used in conjunction with a
regular keyword search, or in place of one, with good results.

Before we discuss the theoretical underpinnings of LSI, it's worth
citing a few actual searches from some sample document collections. In
each search, a red title or asterisk indicates that the document doesn't
contain the search string, while a blue title or asterisk informs the
viewer that the search string is present.

In an AP news wire database, a search for Saddam Hussein
returns articles on the Gulf War, UN sanctions, the oil embargo, and
documents on Iraq that do not contain the Iraqi president's name at all.

Looking for articles about Tiger Woods
in the same database brings up many stories about the golfer, followed
by articles about major golf tournaments that don't mention his name.
Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players.

In an image database that uses LSI indexing, a search on Normandy invasion
shows images of the Bayeux tapestry (the famous tapestry depicting the
Norman invasion of England in 1066) and of the town of Bayeux, followed by
photographs of the Allied invasion of Normandy in 1944.

In all these cases LSI is 'smart' enough to see that Saddam Hussein is somehow closely related to Iraq and the Gulf War, that Tiger Woods plays golf, and that Bayeux has close semantic ties to invasions and England.
As we will see in our exposition, all of these apparently intelligent
connections are artifacts of word use patterns that already exist in
our document collection.
