The YouMoz Blog

A Philosophical Look at LDA

This entry was written by one of our members and submitted to our YouMoz section. The author's views below are entirely his or her own and may not reflect the views of Moz.

I’m going to make a comment here that may shock a lot of people. I hate to say it, but it may especially shock people who are very good at using statistics.

Statistical models are not answers – they are models.

Models are built as replicas of a system. They may look similar to the real thing, but they are not perfect, and the resemblance can always be improved.

Now that I have got that shock out of the way, here is my motivation. Rand recently posted on a new correlation value for Latent Dirichlet Allocation (LDA) to Google ranking. The first value was 0.32, then recognition of a mistake in the calculations led to a figure of 0.17. Now, it seems, the number should be around 0.06 - 0.1. Meanwhile, other datasets have given other people different results. As SEOs we are after a clear number - we want the answer "LDA correlates to Google's rankings with a score of 0.XX". Unfortunately, we are unlikely to ever get that answer.
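For readers who have not followed the original posts, those numbers are rank correlation coefficients. As a rough illustration only - the scores below are invented, and this is not SEOmoz's pipeline - here is how a Spearman correlation between LDA scores and ranking positions might be computed:

```python
# Illustrative sketch: Spearman rank correlation between hypothetical LDA
# relevance scores and the positions those pages hold in a SERP.
from scipy.stats import spearmanr

lda_scores = [0.91, 0.62, 0.75, 0.40, 0.88, 0.55, 0.33, 0.47, 0.29, 0.51]
positions = list(range(1, 11))  # 1 = top of the results page

# Spearman's rho compares the ordering of the two lists, not the raw values.
# A negative rho here means higher LDA scores sit at better (lower) positions.
rho, p_value = spearmanr(lda_scores, positions)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.2f})")
```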

A Binary View on Probability

Rand and Danny Sullivan’s debate in the comments on Rand’s original post about LDA was caused by the fact that people, especially those for whom mathematics is not an adopted language, have pretty binary views on the subject: they either trust statisticians implicitly or are highly cynical of the results they produce. Both responses are due to a lack of understanding and a fundamental position on the relevance of things that we do not understand. When we are uncertain about something, we like to probe it to find out whether it is good or bad. Think of all the questions you asked last time you bought an oven, or a car. You probed each one to find out whether it was suitable for the task it claimed to perform and efficient in performing that task. The problem is, with physical objects like these we can see examples, road tests and samples; with something abstract like statistical methodology we have to rely on those who do understand it – and they were the people we didn’t trust in the first place.

It's easy enough to find out whether a car does what it's supposed to, but statistics can be a little more difficult. Image credit, Cartoon Stock

String and Sealing Wax

What a lot of people forget is that these models are not perfect and they weren’t built to be perfect. The models of a Spitfire or Mustang you might have built when you were young could not fly; they were just representations of the system and the best you could make at the time. You might then have improved them – add an engine and some RF wizardry and you can make them fly. But you still can’t fit in them – they’re just models. To be perfect models of a plane, they would have to be a plane.

Another good example is gravity. Isaac Newton published his laws of gravity, which said that all objects with mass attract each other in a certain way. This model worked then and it works now – we send satellites into orbit and predict the motions of galaxies with it. But it had problems: it could not explain the observed changes in Mercury’s orbit and it did not say what gravity actually was, only what it did. Albert Einstein refined the model with his famous General Theory of Relativity. This told us what gravity is and its relationship with mass – mass tells space-time how much to bend, and that bending is what we experience as gravity – and it solved the problem of Mercury’s orbit. It even predicted that gravity causes light to bend. It was a much better model, one that we can use to look out far beyond the solar system. But it’s still not perfect – the satellite Pioneer 10 is nowhere near where it should be and General Relativity alone gets the age of the universe completely wrong. It’s a flawed model. My opinion – I wrote my master’s dissertation on this – is that this is because Einstein’s model of space-time is wrong. He said it’s flat, or slightly curved, while the model I think improves on it says that it is actually far from flat. This model improves on relativity in many ways and even provides links with another great model, quantum mechanics. It even fulfils Einstein’s dream of describing everything in the universe through geometry. But it’s still not perfect or, I hasten to add, properly verified.

The point here is that the real world is damned complex. Physicists and statisticians, like people who make model aeroplanes, represent it to the best of their abilities, and those abilities are constantly evolving. Take the LDA model, where Rand and Ben started with the assumption that all keywords have the same ranking factors affecting their SERPs in the same way. This model is simplified – it will not be a complete model of reality, just like Newton's model of gravity – but we can still use it in our everyday lives. If we want to be more precise, we must take into account how competitive a term is, refining our model. This is evolution in action.

From Ape to Homo Sapiens

In his comments on Rand's original post on LDA, Danny Sullivan mentions people in the late ‘90s using “theme pyramids” and their association with LDA. He’s completely right in how he’s introduced them, but not right to dismiss LDA because of the association between these two techniques. Theme pyramids are the equivalent of Aristotle or Newton – a sensible starting place – while LDA is crinkled space-time – a very good idea but something we’re pretty sceptical of because it’s unproven. We can even chart how the model of content relevance has evolved over the years:

Theme pyramids

Term Frequency - Inverse Document Frequency (TF-IDF)

Latent Semantic Indexing (LSI)

Probabilistic LSI (pLSI)

Latent Dirichlet Allocation (LDA)

Each model improves on the last, but is not perfect. Even Rand, replying to the comment of Danny’s that started their debate on the original LDA post, says "We think it's interesting, given the relatively high correlation (compared to link metrics) to try it out, but we haven't suggested conclusive results." In his latest post, Rand says, "more polished results may still be several months away" and Ben in his update was scathing of his own work, saying "I think 0.17 really might not be the last word.... Just treat all of this as suspect until we know more."
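For anyone who has not played with topic models before, here is a minimal sketch of what the last step in that list actually does, using the open-source gensim library. It is purely illustrative: the three toy "documents" stand in for a real training corpus (SEOmoz trained against Wikipedia), and none of this is their actual code.

```python
# Illustrative sketch: train a tiny LDA model and ask it for the topic
# mixture of an unseen page. Real corpora would be vastly larger.
from gensim import corpora, models

documents = [
    "on page optimisation and topic relevance for search engines",
    "link building and anchor text help search rankings",
    "topic models describe documents as mixtures of hidden topics",
]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words counts

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# The model describes any new document as a probability distribution over
# topics rather than as a list of keywords.
new_page = dictionary.doc2bow("topic relevance of a page".lower().split())
print(lda[new_page])   # e.g. [(0, 0.73), (1, 0.27)]
```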

Aristotle's periodic table. Just because it was wrong doesn't mean all periodic tables are wrong. As a starting place it served us well. Image credit, University of Virginia

A Conclusion - Perhaps

The point here is that in their recent debate, both Rand and Danny are right and both are wrong. The SEOmoz team seem, to my mind at least, to have made a pretty convincing case for LDA. The methodology used in the original SEOmoz study was the same as in the Bing vs Google study and seems sound as a starting point, but:

Only in the case of the ranks between 1 and 10

Only if Wikipedia is taken as the de facto corpus of all English words and their proper, contextual usage

Only if we look at keywords, not key phrases

Only if all phrases are equally competitive.

Beyond these restrictions, who knows?

The problem is that more data is needed and it needs to be analysed in a less naive way. Ideally, we should look at sites ranking 1 to 100. Even Rand said in the Google vs Bing post that “Ben [Hendrickson] & I both feel that... we should gather the first 3-5 pages of results, not just the 1st page.” The problem with this is that while there will be 10 times the number of results to review, the work will be 100 times greater. It would also make more sense to train the algo on a thesaurus and Wikipedia combined, and possibly even the whole Oxford University Press range of specialist dictionaries. But again, all this would take a huge amount of resources, as would expanding the model to include phrases.

We should also define exactly what we mean, numerically, by a "competitive" keyword and see how this affects our end results. We need to figure out stress-tests for the model, how local and universal search are incorporated, and many other factors. Until we do this, we will only have a simplistic model - but it could require the SEO equivalent of Einstein to work it out.

So, the great debate: is LDA a pile of codswallop or a pile of gold? Neither – it’s something in between, as is any statistical model. How much gold and how much fish is in the pile we can only learn from collecting more data and putting in more resource. It is also not the be-all and end-all, it is simply the best that we have. There will be a better model devised, with fewer assumptions and restrictions, and until then LDA is probably the most accurate. But then this is true of any new scientific or mathematical model.

My own view, which I left as a comment on Rand’s first post, is that the next stage of evolution needs to be introducing the Zipf-Mandelbrot law, which models language usage much more accurately than simple Bayesian statistics and allows phrases as well as words to be analysed. We also need to take a website, or a category of a website, and apply LDA to a page as a document within a corpus. Unfortunately, both of these will require a leap beyond my mathematical abilities and a lot of processing power.
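For the curious, the Zipf-Mandelbrot law says that the frequency of the k-th most common word in a corpus falls off roughly as 1 / (k + q)^s for corpus-specific constants q and s. Here is a tiny sketch of that formula - the parameter values are illustrative placeholders, not fitted to any real corpus:

```python
# Illustrative sketch of the Zipf-Mandelbrot law: predicted relative
# frequency of the k-th most common word. q and s would normally be
# fitted to real word counts; the values here are placeholders.
import numpy as np

def zipf_mandelbrot(ranks, q=2.7, s=1.0):
    weights = 1.0 / (ranks + q) ** s
    return weights / weights.sum()   # normalise so the frequencies sum to 1

ranks = np.arange(1, 11)             # the ten most frequent words
print(zipf_mandelbrot(ranks))        # predicted share of usage for each rank
```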

About BenjaminMorel —
Benjamin Morel is an agency-based digital marketeer and project manager working for Obergine in Oxford, UK. He is passionate about inbound marketing, especially building strategies centered around data and communication. Follow him on Twitter @BenjaminMorel or on Google +.

21 Comments

Wow Ben. My head always tends to feel a bit fuzzy when people start talking statistics, but I just managed to read your entire post in one sitting without the slightest bit of fog in my brain. Well done man! Very well communicated.

Even if the LDA tool that SEOmoz has isn't perfect, if it can help even a little bit then it's one more boost for a site - and for my money, any boost for a site is a good thing.

After the lack of enthusiasm for the maths in my last post I decided to rein back a bit - plus, after some of the comments made on Rand and Ben's posts, I thought a bit of a (defensive?) explanation of what we try to achieve when modelling statistical systems was in order.

I completely agree with your point about it still being useful - LDA tells us never to neglect on-page optimisation, but that we have to do it in a way that makes as much sense as possible for people reading the site. A constant reminder not to be spammy, because that will lose us money in the long run.

I really enjoyed this post. I think you did a great job of taking a debate in our small niche and putting it in the context of bigger and, frankly, more impactful ideas. I had no idea you had the background that you do - very refreshing and impressive.

While I don't have any idea what will happen with LDA, I am happy to see Ben Hendrickson continue to make iterations on it and for our community (that includes you, Ben!) to continue to play in the space with us.

For what it is worth, my favorite takeaway from this wasn't directly stated, but it was certainly implied. As an SEO I like to think of the Internet as complex, but really it is child's play compared to the Universe :-)

Cheers!

Danny Dover

Update: I may or may not have been logged into the wrong account when I posted this ;-p Don't tell anyone my secret...

Glad to hear you liked this post. I think anyone who deals with statistics in the real world, no matter what their niche, has had to deal with the misconception dealt with here. People often view science in the same way as religion – that everything we say is dogma. The whole thing about models – and this, in my opinion, is what makes them fun – is that they are not an answer, just a step toward it.

I too look forward to the next iteration – but I disagree about the internet being less complex than the universe. Even the simplest system can be infinitely complex. One of the best examples is the Lorenz attractor (some interesting information here: http://en.wikipedia.org/wiki/Lorenz_attractor), which is governed by three seemingly simple equations yet has incredibly complex behaviour that we cannot predict in advance; the double pendulum is another. The internet, then, is at least as complex as the universe even though it is contained within it. How that is possible is another story for another time, preferably after a few beers to lubricate the mind ;-)
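For anyone curious what "three seemingly simple equations" looks like in practice, here is a small sketch of the Lorenz system using the classic textbook parameter values - nothing SEO-specific, just an illustration of how quickly nearly identical starting points diverge:

```python
# Illustrative sketch: integrate the three Lorenz equations from two starting
# points that differ by one part in a million and watch them diverge.
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x),      # dx/dt
            x * (rho - z) - y,    # dy/dt
            x * y - beta * z]     # dz/dt

sol_a = solve_ivp(lorenz, (0, 40), [1.0, 1.0, 1.0], dense_output=True)
sol_b = solve_ivp(lorenz, (0, 40), [1.000001, 1.0, 1.0], dense_output=True)

# After 40 time units the two trajectories bear no resemblance to each other.
print(sol_a.sol(40))
print(sol_b.sol(40))
```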

Interesting - I have no idea whether Matt Cutts plays dice, though, and he doesn't seem like the kind of guy who would play with loaded dice.

Although your comment is in jest, if we think about how some people hang on every word that is written or said about him, that he is the most knowledgeable person we can think of on the subject, and that some would claim we can feel his influence in everything we do online, even in small ways, the parallel starts to look a lot stronger...

A very interesting post that really makes one think. I would only add that I'm not convinced that binary views on whether to trust a model are based on a misunderstanding of the purpose of models. I think there is certainly room to thoroughly understand the modelling process with its pros and cons, while still having to make a very binary decision on whether the model provides convincing information that should compel the reader to act on the model's outputs. I think that the need to decide whether to change future action based on a model only gives the appearance that the overwhelming majority of people are entirely skeptical or drinking the modelling Kool-Aid. It's probably just human nature to defend your decision to act or not act on a model by being a skeptic or an enthusiastic supporter, even if you know there are no certainties.

In other words, whether a model should impact how I do SEO is binary, but either way I had better understand that there is a measure of probability attached to whether my decision will actually be beneficial to my goals.

Good read, and an important point regarding the actual utility of models. Though I think that this is false: “is LDA a pile of codswallop or a pile of gold? Neither – it’s something in between, as is any statistical model.” On the contrary, there are plenty of probabilistic models, like LDA, that are quite good at or quite bad at predicting phenomena. Some applications of LDA yield high correlation coefficients. Others, such as those performed by SEOMoz so far, do not. I think we are, therefore, justified in asserting that, with respect to the SEOMoz analyses, LDA is a poor model per se.

"On the contrary, there are plenty of probabilistic models, like LDA, that are quite good at or quite bad at predicting phenomena."

This is the nub of my argument: models will always be "quite good" or "quite bad" at predicting and explaining observations - after all these are two prerequisites for a hypothesis to become known as a theory. However, they will not be perfect.

I don't think LDA is a poor model - I think that in its current form it is the foundation of a much more powerful method for modelling content relevance. The fact that it only has a relatively minor correlation with search results is not due to a poor hypothesis or method, but to a lack of computing power and resource, and to the fact that even Google - and, after all, the point is to emulate Google and others - isn't perfect.

I agree that statistical models are not the panacea some believe them to be. However, the nub of my concern has to do precisely with:

"The fact that it only has a relatively minor correlation with search results is not due to a poor hypothesis or method, but with a lack of computing power, resource and the fact that even Google, and after all the point is to emulate Google and others, isn't perfect."

I haven’t seen evidence in favor of this assertion, and it seems as though your claim here begs the question. I believe that we are at present only justified in believing there is either some small or no correlation. Hence, that LDA is a poor explanatory candidate.

On the other hand, if there is some additional reason to believe that a low correlation coefficient for the kind of sample pulled in this case is consistent with a high correlation coefficient for a more representative sample, then there might be good reason to believe that LDA will be useful in this application.

I'm not sure I agree with you. The hypothesis that search engines model using topic vectors rather than individual keywords makes complete sense: it explains the "halo effect" we see on keywords, it is a sensible method for preventing keyword stuffing, and it predicts that in local search, websites optimised for both local and niche vectors will rank well. I also cannot think of another form that a viable model could take for the search engines, as topic modelling covers all the attributes of on-page optimisation comprehensively.
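To make "topic vectors rather than individual keywords" concrete, here is a toy sketch with invented numbers: a page can score well against a query term it never actually uses, because page and query share the same underlying topics - which is exactly the halo effect mentioned above.

```python
# Toy illustration (numbers invented): score pages against a query by
# comparing topic mixtures instead of matching individual keywords.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical topic mixtures over [cars, finance, travel]
query_topics  = [0.9, 0.1, 0.0]   # "cheap hatchback"
page_a_topics = [0.8, 0.2, 0.0]   # a car review that never says "hatchback"
page_b_topics = [0.1, 0.8, 0.1]   # a loans page stuffed with the keyword

print(cosine(query_topics, page_a_topics))  # high - the halo effect
print(cosine(query_topics, page_b_topics))  # low, despite keyword matches
```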

LDA as we currently have it from SEOMoz is a poor explanatory candidate, that I agree on, but I think the limitations in this model noted in the original post are purely technical: not enough of a sample space, not a large enough or cross-checked original corpus, using a page as the test corpus rather than as a test document within a test corpus, not modelling phrases but only individual words. These are all down to not enough time and computing power being spent. At the moment LDA is the best algorithm we have for modelling linguistic topics, but these technical limitations have meant that the model has had to be applied somewhat naively. As we have only had the first stage of investigation so far, this is to be expected and hopefully we will see more results in the future as the project matures.

As I also say in the original post for those who think that LDA may not make sense - where are your alternative models?

The additional reasons you’ve marshaled in defense of why the search engines use topic vectors are good. Furthermore, fleshing out these reasons builds a much more compelling case for why LDA is part of the algorithm than does delineating the current meager correlation between page rank and LDA score. However, unless I am misreading your post, you seem to believe that, based on the results from SEOMoz’s application, the SEOMoz team has made a “pretty convincing case” for LDA. Yet, no good – certainly no convincing – case has been made for LDA using SEOMoz’s results.
With regard to alternative models, I don’t believe you will find any of the type you are looking for. That’s because, given what we know about the complexity of search algorithms, performing these types of bivariate analyses may never produce more than a poor correlation between the two variables measured. And if we regard the results of such statistical analyses as indications of what’s going on in the real world, we risk weighting certain algo candidates in ways that are altogether wrong.

I am highly sceptical of the effects that humans are having on global warming and think that it has been sensationalised hugely by the media, so I approach any data on climate change with a degree of cynicism. Most people would disagree and say that the models being used are highly accurate, embracing with open arms new releases that support this idea. On the other hand, I feel that topic modelling is the most sensible way to analyse on-page factors but that we need to progress with our models, while you feel that although topic modelling makes a certain amount of sense, we will never be able to extract anything useful from it. My point in the first section of this post is that whenever we look at a statistical model we have an opinion based on our preconceptions, and so we have both taken away what we want from the original series of posts.

However, I have tried to remain unbiased in writing the rest of the post. That is why my conclusion is "does it work? I don't know, because the model is oversimplified". At the same time, though, I have tried to invite debate by posing questions to more sceptical people who have, as yet, produced little evidence to counter LDA as a way to measure the weighting and strength of topic modelling. As the title suggests, this is not a piece defending or attacking LDA; it just uses LDA as a good example of how statisticians are humans too and as such don't always get things right first time. That does not mean their ideas are inherently wrong; they may just not have a sophisticated enough model to treat them thoroughly.

"On the contrary, there are plenty of probabilistic models, like LDA,
that are quite good at or quite bad at predicting phenomena."

This is the nub of my argument: models will always be "quite good" or "quite bad" at predicting and explaining observations - after all these are two prerequisites for a hypothesis to become known as a theory. However, they will not be perfect

This was a great look at the topic of LDA from someone who can actually understand the math and basis behind it. Thanks for that.

When I first started learning SEO, one of the first things I began to wonder was "why can't someone run some statistical analyses on things and figure out what Google is doing?" I've progressed a little bit past that stage now, although maybe SEOmoz hasn't ;) Obviously we are never going to understand every calculation that goes into serving up a SERP, but the closer we get, the better SEOs we can aspire to be.

Figuring out what Google is doing is always going to be very difficult - the main thing holding us back is that we need an absolutely vast data set and the time and resources to analyse it. SEOMoz are able to answer these questions because they employ a statistician, as well as people who know a lot about maths and computer science. Most of us don't have that luxury and never will. It's a shame, because modelling the Google/Bing algos is something I would dearly love to do.

You're definitely right in that models are not answers. But I think we all understand that all of these things are attempts at figuring out how the algo works and getting a little bit closer to the answers. Even the fact that LDA is not that highly correlated with rankings is a good discovery, because it reaffirms that we're dealing with something much more complex than many people realize.