Does Google Use Latent Semantic Indexing (LSI)?

First and foremost, I can’t take credit for the material here.
This is information I’ve gleaned from the incredible team at StomperNet and specifically Leslie Rohde in this case, who in my opinion is the BadAss of bad asses when it comes to SEO and absolutely one of my internet marketing heroes.

If you don’t know who they are and you own a website, or are trying to learn how to make money online, then it’s time to crawl out from under your rock and get some real education. Let’s get started!

Google Patents & LSI

First, let’s look at some basic facts.
There has been much debate over Google’s use of LSI.
Does Google implement Latent Semantic Indexing in its algorithm? True? False? Sometimes? Maybe?

Google doesn’t have a single patent filed for LSI. They don’t even have one that lists LSI as being part of any prior technology.

Practically all of the patents listed in the argument have nothing to do with LSI; they discuss word or phrase co-occurrence instead. All of these were filed in 2006 by a single researcher and haven’t been heard of since.

Verify this for yourself here.
Google patent search: www.google.com/patents
* No company that Google owns, and no patent it has filed, is based on LSI.

Actually, something I find even more intriguing is that a search on Google for the term LSI doesn’t produce a single result in the top 10 referencing the technology. If it’s so important, why isn’t it there?

Latent Semantic Indexing – the Technology

There is a bunch of complex mathematics involving matrix computing and so forth to explain exactly how LSI works. So for the sake of time I’ll try to keep this as simple as possible.

By using a very large group of documents, every common meaning and usage of every single word is indexed by reference to every other word.

That is the true power of LSI.
Working at the level of “meaning” in this way actually handles a bunch of language problems that would be difficult to handle any other way.

As humans we take this ability for granted. But for computers it’s really hard work.
The simplest example would be dealing with singular versus plural.

Cat / Cats

Octopus / Octopi

Dress / Dresses

Supply / Supplies

The possessive of words is even harder, but all of it is nothing compared to changes in tense. Talk about chaos!

Grow, growing, grown, grew

Start, starting, started

Die, dying, died

For example:
Grower <– is related to –> Grow
Grower is someone who grows stuff.

however…

Burg – as in a town, is not related to – Burger
Although burger may appear to be a person who burgs.
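
To see why spelling alone gets you into trouble, here’s a minimal Python sketch of a suffix stripper. The `naive_stem` function and its three-suffix list are made up for illustration (this is not a real stemming algorithm and not anything Google or LSI is known to use). It correctly links grower to grow, but it also wrongly links burger to burg.

```python
def naive_stem(word):
    """Strip a few common English suffixes; deliberately simplistic."""
    word = word.lower()
    for suffix in ("ing", "er", "s"):
        # Only strip when at least three letters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Each pair is reported as "related" when both words reduce to the same stem.
pairs = [("grower", "grow"), ("growing", "grow"), ("burger", "burg")]
for a, b in pairs:
    print(a, b, naive_stem(a) == naive_stem(b))
```

All three pairs come out as “related,” including the burger / burg pair that a human would reject straight away. That is the gap a meaning-based approach is supposed to close.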

But by far the most impressive component of LSI is its ability to index the relation between synonyms. Word pairs that have very similar meanings yet very different spellings.

Pupil / Student

Buy / Purchase

These are just a couple of examples out of hundreds that mean nearly the same thing. There are thousands more that mean almost the same thing to varying degrees…

e.g. Windows Vista / death

LSI is the only formal theoretical approach that handles all of these cases in a meaningful way.

So how does it all work?
Here’s where all the matrix math comes into play, which again for the sake of time I will skip and give a somewhat watered down version. The math is extremely complex and lengthy, but the concept is quite simple…

First we start with a group of documents that we want to index and later find using a search engine. In Google’s case this would be well over 50,000,000,000 pages.

Now we create a list of every single word that is found anywhere on any one of those pages.

Both of those lists are combined in a matrix, or table, where each column is one of our pages, each row is one of the words we found, and each cell is the intersection of a word and a page. Everything is scored using what’s called “Term Frequency – Inverse Document Frequency,” where more common words like a, and, the, etc., are devalued so as not to detract from LSI’s ability to identify a page’s true meaning.
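
To make the word-by-page table concrete, here’s a minimal Python sketch. The three sample pages, the `build_tfidf` name, and the exact weighting (term frequency times the log of total pages divided by pages containing the word) are assumptions for illustration; there are many TF-IDF variants and nothing here is taken from Google.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the indexed pages.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the student bought a car",
]

def build_tfidf(docs):
    """Return (vocabulary, matrix) with one row per word and one column per page."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: on how many pages does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    matrix = []
    for word in vocab:
        idf = math.log(n_docs / df[word])  # 0 for a word found on every page
        row = []
        for toks in tokenized:
            tf = toks.count(word) / len(toks)  # share of the page taken up by this word
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

vocab, matrix = build_tfidf(docs)
for word, row in zip(vocab, matrix):
    print(f"{word:8s}", [round(x, 3) for x in row])
```

Notice that “the” appears on every page, so its whole row is weighted down to zero, while a word like “cat” keeps a positive score only on the pages that actually contain it.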

But this matrix is simply far too large to do anything with, so we use an algorithm called “Singular Value Decomposition” to reduce it to about 300 dimensions. But there’s a problem there…

Remember, the rows represent the words. If you only keep about 300 dimensions, what happens to all the other words? This is where LSI claims to do magic.

The THEORY states that all of the relationships identified in the larger matrix are still represented, latently, in the smaller matrix, even though only about 300 dimensions remain (each one a weighted blend of many words rather than a single word). It also claims that the smaller matrix can PREDICT the word combinations that were removed. However, this is greatly dependent on the source, or “corpus,” used to create the full matrix; a limitation I’ll get into in a minute.
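
The “latent” claim is easier to picture on a toy scale. The sketch below is a hedged illustration, not how any production system does it: it uses a hypothetical 5-word, 4-page count matrix, keeps only the 2 strongest components instead of roughly 300, and finds them with a simple power-iteration routine written in plain Python (real implementations use optimized linear algebra libraries). All function names are invented for this example.

```python
import math

def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_singular_triplet(A, iters=500):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    At = [list(col) for col in zip(*A)]
    n = len(At)
    v = [1.0 / math.sqrt(n)] * n  # uniform start: fine for this toy, not robust in general
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av] if sigma > 1e-12 else [0.0] * len(A)
    return sigma, u, v

def rank_k_approximation(A, k):
    """Add up the k strongest singular components, found one at a time by deflation."""
    residual = [list(map(float, row)) for row in A]
    approx = [[0.0] * len(A[0]) for _ in A]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(len(A)):
            for j in range(len(A[0])):
                component = sigma * u[i] * v[j]
                approx[i][j] += component
                residual[i][j] -= component
    return approx

def cosine(a, b):
    """Cosine similarity between two rows (0.0 if either row is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy data: rows are words, columns are four tiny pages.
words = ["car", "automobile", "engine", "grape", "wine"]
A = [
    [1, 0, 0, 0],  # car:        page 1
    [0, 1, 0, 0],  # automobile: page 2
    [1, 1, 0, 0],  # engine:     pages 1 and 2
    [0, 0, 1, 1],  # grape:      pages 3 and 4
    [0, 0, 1, 0],  # wine:       page 3
]

reduced = rank_k_approximation(A, k=2)
print("car vs automobile, full matrix:   ", cosine(A[0], A[1]))
print("car vs automobile, reduced matrix:", round(cosine(reduced[0], reduced[1]), 3))
```

In the full matrix, car and automobile never share a page, so their similarity is zero. In the reduced matrix their rows come out essentially identical, because both words keep company with “engine.” That is the kind of relationship the theory says survives the reduction, at least when the corpus cooperates.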

LSI only works IF you have a good data set to work from. So now you have this matrix, or database, with all these columns, easily as many as we have pages… How do we create a search engine from that?

Here’s how…
We treat a query, or “search,” as a micro document, which we cross-reference against all of the documents in our matrix “index,” scoring each one by how “similar” it is to our query document. All that’s left is to present the “search results” in order from high score to low, or High Similarity to Low Similarity.
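
Here’s a minimal sketch of the “query as a micro document” idea in Python. To keep it short, it compares raw word counts with cosine similarity rather than LSI-reduced vectors, and the pages, the `search` function, and the scoring choice are all assumptions for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, pages):
    """Score every page against the query; return (score, page) pairs, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(p)), p) for p in pages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical pages, purely for illustration.
pages = [
    "used cars for sale",
    "grape growing guide",
    "cheap car insurance quotes",
]
for score, page in search("car insurance", pages):
    print(round(score, 3), page)
```

The insurance page ranks first. The “used cars for sale” page scores zero, because plain word matching has no idea that car and cars are the same thing. A system working on meaning is supposed to close exactly that gap.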

Sounds kind of familiar right? I mean we all hear about Google and the weight it places on relevancy right? But is it the same thing? And is Google using LSI? Maybe. There’s really only one solid way to find out…

LSI CASE STUDY #1

Let’s do a search.
If LSI is in play then we should be able to see Google doing what LSI would do by searching various instances and comparing search results.

The simplest should be singular vs. plural.
Remember, LSI claims to handle the endings of words automatically, and it is the meaning, not the spelling, of words that is being compared in the algorithm. So the search results, index size and document scores should be identical when comparing singular and plural searches of the same word, because their MEANINGs are identical.

Make sense? Now test it yourself…
To set up the experiment, first log out of your Google account. Yes, your search results are affected by whether or not you are logged into your account. But that’s not what we’re discussing here. Just log out and take my word for it for now.

Now open 2 tabs in your browser. Or 2 windows if you’re surfing old school. In either case go to www.google.com in each tab.

Now search for car in one and cars in the other.
The search results differ by almost 50% and only 1 result in the top 5 is the same.

Do the same with grape and grapes.
In my case, only 1 of the top 5 results matched and the search results were off by 29%.

Try one more using blanket and blankets.
I see a 34% difference in search results and not one similarity in the top 5 results.

So if this is LSI at work, why would we see such varying search results? But let’s keep going.
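
If you want to keep score while you run these comparisons, a small helper like the one below does the job. It’s a sketch under stated assumptions: the post doesn’t say exactly how the percentages were worked out, so this version measures the gap between two result counts against the larger one. The URLs and counts are placeholders, not real search data.

```python
def top_k_overlap(results_a, results_b, k=5):
    """Number of URLs that appear in the top k of both result lists."""
    return len(set(results_a[:k]) & set(results_b[:k]))

def percent_difference(count_a, count_b):
    """Gap between two result counts as a percentage of the larger one."""
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0
    return abs(count_a - count_b) * 100 / larger

# Placeholder values only; substitute what you see in your own two tabs.
singular_top = ["example.com/a", "example.com/b", "example.com/c",
                "example.com/d", "example.com/e"]
plural_top = ["example.com/c", "example.com/x", "example.com/y",
              "example.com/z", "example.com/w"]

print("shared top-5 results:", top_k_overlap(singular_top, plural_top))
print("result count difference (%):", percent_difference(1_000_000, 700_000))
```

Paste in the top 5 URLs and the reported result counts from each tab, and you get the same two numbers used throughout these case studies.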

LSI CASE STUDY #2

Now let’s look at the tense of a word.
This is an area where LSI will return very similar results. Not identical mind you, but similar. And for this part of the experiment you’ll need to open a 3rd tab to account for the variables of word tense: past, present and future.

Now let’s search for grow, growing and grown.
My results show pages indexed ranging from 106 to 244 million and not a single common search result in the top 5 across all three searches.

Based on what we’ve seen so far this last test is surely doomed to failure but for the sake of argument let’s test one last variable by searching for a set of synonyms closely related in meaning.

LSI CASE STUDY #3

Car and Automobile

Instead of the similar results Latent Semantic Indexing would provide, Google’s results vary by as much as 88%, with only 1 match in the top 5; in previous tests done at StomperNet labs, there were none.

Conclusion

So the question is simple… Do You See LSI at Work Here At All?
LSI provides quite predictable results, which simply aren’t there at Google.

So some may argue at this point, well, maybe you can’t see it. Maybe they’re using it and it’s just invisible. Who cares! As it relates to SEO and getting your pages ranked in Google, why would you bother chasing a white rabbit that no one sees anyway? What you should be focusing on are factors that will move your ranking with concrete evidence, and LSI clearly isn’t it.

In fact, researchers directly involved with LSI state that there is no known instance of successful use of LSI on document loads greater than 1,000,000 documents (LSI corpus size). The process simply scales poorly. Google, in contrast, does case studies on 1+ million pages, with over 50 billion pages indexed. LSI just isn’t in the same class as Google’s technology.

So why all the buzz around LSI? And if LSI is so great why doesn’t Google use it? Simple. They do something better. And in this case LSI, compared to what Google ACTUALLY does is really nothing more than a….

L.ame S.earch I.dea

Well, this is quite a long post already.
So stay tuned for more, where I’ll share what that “special thing” Google does… actually is.

~Richard

PS:
This post was inspired by a question posed to me in the forum over at BetterNetworker.com, where we’ll be discussing the topic (hopefully) if people find it interesting enough. So jump over and check it out. And while you’re there join in the conversation.

Hey Wait! Before you go…
If you found this article useful, hated it or whatever… drop me a comment below.
Let me know what your thoughts are on the topic.
You can Tweet it too. Just click the little green button to the right >>>