
Lies, Damn Lies, & SEO: Statistical Analysis of SERPs

It sounds so seductive… by using advanced statistical methods, you can determine the best mix of on page factors for SEO. Wow, imagine the incredible competitive edge that you’d have. You could use just the right number of bold tags, figure out whether to use bold or strong, and you’d be an unstoppable ranking machine.

The only problem with this approach is that it’s complete bunk. Let’s try a couple examples…

A Statistical Lie I Kind Of Liked: MSN "prefers" sites that run on Microsoft’s own IIS

A while back, someone published a statistical study that appeared to show that MSN’s search results were far more likely to contain pages from sites that run IIS, vs. Google and Yahoo. Did people take this as a sign that they should move their web sites onto IIS? No, of course not… because Google has a lot bigger market share, people actually thought maybe they should switch away from IIS in order to do better on Google!

So… now you wonder: is MSN rewarding you for using IIS while Google doesn’t care, or is Google rewarding you for using Apache while MSN doesn’t care? If Google doesn’t care and MSN does, then you rush to IIS. If Google cares and MSN doesn’t… enough! Spare yourself the circular reasoning before you go mad, and let’s consider some possible root causes.

At the time this study was published, I pointed out that there are many differences between IIS and Apache, aside from the names.

While Apache is the majority choice across the web as a whole, IIS has a much stronger position among larger sites, and in particular in the corporate world. So if Google crawls more of the web’s smaller sites than MSN does, they’re going to have a higher percentage of Apache-delivered pages in their index. Which means, statistically speaking, that you’re likely to see a higher percentage of pages on Google SERPs being served up by Apache.

ASP.Net, whatever else it does, can come with a lot of extra baggage. Such as the "viewstate" form fields that tend to get inserted, with 10-50k of utter gibberish text. So if MSN taught their bot to ignore this junk, and Google didn’t… well, this alone might account for this statistical variation. Since it’s relatively easy to build a site on IIS without adding all that dead weight, it’s hard to blame the search engines either way.

The bottom line: search engines don’t care what kind of server you run. They might care how it behaves, but not about the name.

The Original Statistical Sin: Keyword Density

If you’ve never used "search engine optimization" software to tell you how to optimize your web pages, good for you. If you run keyword density analyzers to do anything other than extract search terms from web pages… stop. You don’t need to. Keyword density isn’t a factor – search engines just don’t work that way.

Keyword density is loosely defined as "the percentage of the words on the page that are your keywords." I can remember endless debates back in the late ’90s about the "right" way to measure it – did you count all the words, did you only count exact phrases? There was only one problem with those debates – we were all wrong. Search engines do not measure the "keyword density" of a web page.
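For what it’s worth, the two camps in those debates were measuring genuinely different things. A quick sketch in Python (the function names and sample text are mine) shows that the two counting methods don’t even agree with each other:

```python
def density_all_words(text, phrase):
    """Count every occurrence of each word in the phrase, anywhere on the page."""
    words = text.lower().split()
    targets = phrase.lower().split()
    hits = sum(words.count(w) for w in targets)
    return hits / len(words)

def density_exact_phrase(text, phrase):
    """Count only occurrences of the exact phrase, as a whole."""
    words = text.lower().split()
    targets = phrase.lower().split()
    n = len(targets)
    matches = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == targets)
    return matches * n / len(words)

page = "cheap flights to cheap hotels and cheap flights again"
all_words = density_all_words(page, "cheap flights")   # 5 of the 9 words match
exact = density_exact_phrase(page, "cheap flights")    # only 4 of 9 are in exact phrases
```

Same page, same phrase, two different "densities" – which is exactly why nobody could ever agree on the "right" way to measure it.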

What they do is, in fact, immensely more complicated. Don’t follow that link unless you want to get hit with a firehose full of math, BTW… it talks about the vector space model, information retrieval theory, linearization, TF/IDF (term frequency / inverse document frequency), and other stuff that can give you tired head real fast. I’ll summarize what it all means in a minute.

So there’s no such thing as keyword density… at least to the search engines. However, the "fact" that keyword density isn’t even measured by search engines hasn’t stopped people from peddling their latest "statistical analysis" of the optimal keyword density.

There are two main approaches that are used to push keyword density:

Take the top 10 pages for a particular search query, measure their keyword density (yes, I know nobody can agree on how), and then take the average score as the "ideal" keyword density. This is the approach that most optimization software uses. Never mind that the #1 result may be 2%, the #2 result 41%, etc. – if the average is precisely 4.67291% then that’s what you shoot for… and while you’re at it, make sure your page matches the average number of words that were used by the top 10.

Dive deeper: categorize the pages into buckets based on their keyword density, then analyze a whole bunch of search results and determine that, statistically speaking, pages falling in a certain range are more likely to rank higher. Depending on the search terms you use, the numbers will vary a bit, but the "magic number" is generally discovered to lie somewhere in the 1-4% range… which, as it turns out, is pretty much where you land when you just write naturally.
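To see why the averaging approach falls apart, here’s a sketch with invented densities for a hypothetical top 10 (the numbers are made up, echoing the 2%-vs-41% point above):

```python
# Hypothetical keyword densities (%) of a top-10 SERP, positions 1 through 10.
densities = [2.0, 41.0, 1.5, 3.0, 2.5, 1.0, 4.0, 2.0, 38.0, 1.8]

average = sum(densities) / len(densities)

# How many of the actual top-10 pages sit anywhere near that "ideal"?
near_average = [d for d in densities if abs(d - average) <= 1.0]
```

The average lands at 9.68% – a "target" that describes none of the ten pages that are actually ranking.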

As you already know, there are other factors in play when it comes to rankings. In fact, "on page text" is probably nowhere close to the most important factor in SEO. What this statistical data should be telling you is that you are wasting your time by worrying about keyword density. If you translate all the technical stuff in Dr. Garcia’s paper on keyword density into English and then summarize, it says "use relevant keywords, but write naturally."

So yes, my friends, there is a magic number for exactly how many times and in what places you want to place your keywords. Unfortunately, without access to the entire search engine index and their ranking algorithm, you don’t stand a snowball’s chance in Hades of discovering what that actually is… and it’s different for every search query.

Even if you could measure the same things the search engines were measuring, you’d still be unable to get there with statistical analysis, because there are a few things that will tend to skew the statistical averages higher or lower than what’s optimal from a pure "vector space search" perspective:

People who are doing SEO work to improve their rankings will probably tend to repeat their keywords just a little more than the average writer… which might actually be WORSE for their rankings than writing naturally, but they also tend to do other things, like building links to their sites and using anchor text to boost their rankings. This will drive the numbers up.

People in general are more likely to enjoy reading pages that are well written, with natural word use… and they tend not to enjoy keyword-stuffed garbage. The general trend is that "over optimized" pages and sites full of keyword-stuffed jibba-jabba receive fewer links. This tends to reward sites that don’t have an extremely high "keyword density" and drives the numbers down.

Blogs make a big difference in the math, because blog posts tend to gather more links over time (increasing rankings) and collect comments as they collect traffic (decreasing keyword density towards the average of natural language). This drives the numbers down, or toward the averages at least.

The bottom line on keyword density is that there is no such thing. Write naturally, write persuasively, write to communicate… because no matter what your keyword density is, you can always fire more links at the page to improve its ranking, but the only way to make your copy do its job is to write well, and forget about the damned search engines.

Have any of the statistical guys done controlled experiments to see if similar pages with different keyword densities are picked up with different results? I haven’t seen the original studies, but it seems that if they haven’t done this sort of experiment, all they are doing is finding correlations. And, as any statistician will tell you, correlation doesn’t indicate causation.

Lisa, the problem with keyword density is that the “black box” these guys are trying to reverse engineer doesn’t measure it.

Imagine you have a black box, and when you put rocks into it, it shines a light out in some color of the spectrum. Let’s say you want to learn what makes it glow green.

You’re doing statistical analysis based on the number of rocks you’re putting in, and you become convinced that the best number of rocks is 17 because that gives you the greenest green output.

The guy who built the box laughs at you, because it’s actually operating based on the weight of what you put in the box, and the number 17 is based on the average weight of the rocks you happen to be using.

That’s what’s going on here. Search terms are measured by weight, not volume, and the statisticians cannot determine the weight.
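The black-box analogy is easy to simulate. In this toy Python model (every number is invented), the box responds only to total weight, yet an analyst who only counts rocks will still "discover" a magic rock count:

```python
import random

def box_output(total_weight):
    # The box's real rule: it glows greenest when the contents weigh ~500g.
    return -abs(total_weight - 500)

def average_greenness(n_rocks, avg_weight=29.4, trials=200, seed=1):
    """Mean glow over many trials of dropping n_rocks random rocks in."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        weights = [rng.gauss(avg_weight, 5) for _ in range(n_rocks)]
        scores.append(box_output(sum(weights)))
    return sum(scores) / trials

# Counting rocks "finds" 17, only because 17 rocks at ~29.4g each
# happens to land near the 500g the box actually measures.
best_count = max(range(1, 31), key=average_greenness)
```

Swap in rocks with a different average weight and the "magic number" changes – the correlation was never about the count at all.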

Dan, last year I listened to a Stomper CD with Brad & Nathan Anderson. Nathan discussed keyword density on MSN & Yahoo and suggested trying the 7% bump for MSN, along with modeling the top sites.

Anyway, I decided to analyze those top sites and model them. I used a density of about 6%. I eventually made it to page 1 on Yahoo & MSN for my main keyword. Obviously, there were many other factors that played a role but it seemed to be useful at the time.

I was always kind of curious…if I wrote naturally and ended up at 2% or 3%, would I have made it to page 1 sooner, later, or never?

Thanks for your continued efforts and giving the best free information on the web!

Hear hear. I find that the people most hung up on keyword density can’t see the wood for the trees.

My “perfect” keyword density is to “actually have the search phrase on the page”. So many forget that simple secret! Then to also have the plurals and the reverse of the search phrase. If you don’t have the phrases actually on the page, then it takes a lot more links to get to the positions you need.

I find that clients can get so hung up on statistics that they miss the basics.

Have the search phrase in the title, meta description, H1, and in an opening paragraph, then “scatter the search phrase around the page”. Now that’s really scientific!

Totally right – has to read well.

Just a pity that the bar has been set that much higher with Google in the past months as regards links.

You start trying to get a client’s website ranking high and you realise how hard it is. You do the SEO basics, on-page and on-site, and get links… and it takes time for the gold of your client’s site to rise above the heap. You do nothing apart from building steady links over months, and all of a sudden you rank well. Google seems to let you in only over time, even if you deserved to rank higher sooner.

Dan,
I understand that keyword density is irrelevant, but what about keyword distribution?

As a library studies student (many moons ago), I learned that terms in titles and headers bear more weight than those in the first paragraph, which bear more weight than those in the middle of a document (since an author is likely to highlight key terms in these places).

We can’t know for sure what’s inside Google’s black box, but it seems like fundamental concepts like these would still bear fruit.

I found this info very useful. My coach is pushing me to write 250 words of "gibberish" just for SEO (my site is new), and I understand what he is teaching, but I find it bad to write words on my front page that will make no sense to visitors wanting to buy. Thanks for the help… C.H.

I think the problem with most SEOs is that, when we think of keyword density, we think of the frequency of the words on the particular page, while IR research books clearly show that the search engines need to take into account the frequency of the words across the whole document set as well.

The important question is: Why?

Search engines need to tell web pages apart (in order to properly organize them). One way to tell them apart is by finding words that are unique to them.

TF/IDF captures the idea that terms which are frequent on the page but rare (hence the "inverse" relation) across the whole index describe the web page better than terms that are also popular across the whole index.

If you go all the way back to Salton’s work, I’d agree that the simple vector space model isn’t going to precisely describe what the search engines’ “black box” is doing. But it’s a much more helpful model for understanding than keyword density.
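A minimal sketch of the TF/IDF weighting described above (all the counts are invented) shows why two terms at the same on-page frequency can carry very different weight:

```python
import math

def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    """Term frequency on the page, scaled by how rare the term is index-wide."""
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)
    return tf * idf

# Two terms, each appearing 5 times in a 500-word page (identical 1% "density"):
rare_term = tf_idf(5, 500, docs_with_term=1_000, total_docs=1_000_000)
common_term = tf_idf(5, 500, docs_with_term=900_000, total_docs=1_000_000)
```

The rare term ends up weighted far more heavily even though both have identical keyword density – which is the sense in which the black box measures weight, not volume.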

Your article is very insightful. In November of last year we got our single biggest SEO client ever – a client who hired our company to provide 15 hours a week in an ongoing SEO / SEM campaign. After extensive research to bolster the paltry knowledge I had at the time regarding SEO, I proposed a massive site overhaul. One of the first things they threw at me was one of their staff members’ findings – she’d used one of the off-the-shelf SEO analysis solutions – you know, the kind that tells you your keyword density in granular, categorized statistics…

My own research back in November had led me to a blog listing 35 key points in order of importance, as determined by the person who’d posted the list. While one of those factors was keyword density (ranked around 10th most important), I chose to take the much bigger approach, because my client’s site had over 150 pages. I was not about to expend 80% of my client’s 15 hours a week on keyword density.

I am quite happy to report that within three months, four of our top phrases got us the #1 position on Google. All these months later, we’re now up to a dozen of our top phrases showing on the first page of results, four more are on page 2, and all the rest (25 phrases in total) are continually moving up in ranking – whereas when I started, the best results they had were on the 17th page.

So how much time have I put into deep density analysis? Zip. Nada. None. I have taken pages that, when I started, were pure garbage and unintelligible to a real site visitor, and rewritten the content on almost all of them. I’ve added RSS news feeds to the site, and several new pages. I’ve focused on back-links, optimizing specific phrases on specific pages of the site, in-page interlinking, strong, bold, and header text on pages, and tight integration of title, keywords, description, and on-page content.

On and on the list of things I’ve focused on goes – and I’ve done it simply by writing as much as I can in natural language. Oh sure, I’ve added a TON of text to a few of the more important pages – some of them are massive pages now. And honestly, I have a LOT of clean-up to do, because there’s still so much text that is very nearly worthless to a site visitor in terms of readability.

When I applied the same concepts to a 2nd client this past spring and summer, we got very similar 1st page ranking results in yet one more fierce SEO saturated industry.

So until Matt Cutts hands out copies of the Google algorithm, and provides us with regular updates every time they change or tweak it, there is no way I’m going to waste my clients’ money on keyword density being as vital as the statistics-software-pushing companies would have us all believe.

I’m still an SEO newbie trying to learn everything about SEO. Before, I believed that keyword density worked, but now I guess I need to agree with you.

I’m currently in an SEO competition here in the Philippines, "Paradise Philippines", and I tried to experiment with keyword density by doing some keyword stuffing, but it doesn’t really work – my entry is still on page three.

I observed that most of the websites on the first page of Google have a lot of backlinks, some of them have thousands of backlinks which my site doesn’t have.

It only shows that links are still the main reason these websites rank well, not keyword density.

Considering you can rank as high as #1 for keywords that aren’t even on your page why is this topic even still around?

Sheesh, you can use link text to shape what the search engines think your site is about without having the keywords on your page.

That little fact by itself kinda makes keyword density irrelevant because you can have a keyword density of zero and still get ranked for your keywords.

I make sure I get my keywords in the title, description, H1, first paragraph, and last paragraph, and the rest of it I don’t worry about. I may bold or italicize related keywords because I think LSI is as important as keywords.

In the end I still think words used in link text are more important than keyword density. Just ask the miserable failure.

Thanks for clearing up this confusion as I have been operating under a false impression for quite some time and achieving decent results!! Hopefully this will save a lot of effort and still achieve decent results.

Goodness! How is one supposed to decide what’s true? Ask the experts they say. But it seems they can’t agree. I think I will make my website for the people, but try to get the search engines to like me at least a little. Sigh.

I remember my early days in SEO, trying to understand this and many other factors. I understand them better now… and the main result of all this understanding is that I don’t care much about most factors anymore. Just build a site the right way and you get pretty much all the factors right without measuring anything. In the end, what makes the difference is marketing your website. There are some things I do look at, though:

Focus of a page – You can do every technical factor correct, but if you don’t focus, it’s a waste of time. (and it often amazes me how good people are at building pages that lack focus.)

Semantic related words – Who cares if a keyword comes up many times in a page if the related words are missing? Search engines don’t!

Anchor texts – with a lot more focus on internal links than on external links.

“because no matter what your keyword density is, you can always fire more links at the page to improve its ranking”

When you say “fire more links” that’s actually more difficult nowadays, if we are to stay well clear of reciprocal link exchanges.

I’m really struggling getting people to link to a travel site which provides a service. I’ve added a blog and write unique content daily, added travel videos just to add content that people will link or bookmark.

Don’t overlook the power of internal links – it doesn’t take much to make fine-grained details of on page text irrelevant.

In terms of building external links, reciprocal links have long been a big time suck vs. the benefits, but there are plenty of other ways to get links. Have you watched any of the videos in my free link building course?

Uncoverthenet seems to have had some problems with Google, but I suspect they’ve fixed the problem (they were presenting SERPs as content, not intentionally spamming AFAIK) so they’ll likely return to the index soon.

There are some directories that really stink from the search engines’ perspective, because they’re really just paid link farms. A real directory would actually decline some submissions, but a lot of them don’t.

Glad to see you’re using nofollow on these commenters’ links, as I’m pretty sure that the hope of getting a backlink from you is the only reason some of these clueless ones continue to post their nonsense here…

I had a question about moving my site from .asp to .net and if this will have any effect on my current seo efforts.

I rank in the top 1-3 on MSN, Yahoo, and Google for just about every phrase (very competitive phrases) that makes me money. I have my guys switching my site over from .asp to .NET. However, I am worried that the extension change at the end of my URLs will affect my results in the SERPs.

Dave, I’m guessing that the change in URLs is from .asp to .aspx? The best-case would be not to change the URLs. If they must change, then you want a 301 redirect from the old version to the new version for every URL.

I’m not the world’s greatest ASP/IIS/.NET expert, so I can’t help much with HOW to get that done. I do know that ASP.NET supports URL rewriting (you don’t need any server plugins) so it shouldn’t be a huge undertaking to keep the same URLs.
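For what it’s worth, the logic you want on the server boils down to this, sketched here in Python rather than ASP.NET (the paths are invented):

```python
def redirect_rule(old_path):
    """Return the (status, location) a request for a URL should get:
    every old .asp URL answers with a 301 pointing at its .aspx replacement."""
    if old_path.endswith(".asp"):
        return 301, old_path[:-len(".asp")] + ".aspx"
    return 200, old_path  # current URLs are served normally

status, location = redirect_rule("/products/widgets.asp")
```

The key point is the permanent (301) status: it tells the engines to carry the old URL’s history over to the new one, instead of treating each .aspx page as a brand-new duplicate.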

Dan,
Great blog – as a self-taught, blundering SEO amateur this all helps. Can’t afford a pro, as the last ‘expert’ I dealt with emptied our coffers and managed to destroy our page rank and positions [all achieved to that point without any "expert" help].

Thanks for the article and the truth on keywords. There is a company out there helping people make their sites more effective; they were former clients of yours and have been offering the 30-day challenge under different site names. I’m sure there are design tips that can help increase sales, like those I found in thermal mapping. Is there any more that can be done with keywords that would make that much difference? Of course, this is after we follow all the procedures in your SEO book.

Dan,
another thought came to mind on validating pages. Is it necessary to be sure our present pages will work in much older browsers? I also noticed that some links from other companies have HTML that is not compatible with older browsers, mostly Netscape and Explorer 5, especially with size tags, for example height. Can we change those without causing a problem in the links?
I’m also trying to work with PHP to download database info. I have little experience with it. Is there a site I can go to that would show me how to use it properly? My web server supports many programs, including PHP. The companies supply a script and site info, but after that I need to stop for directions.

Great post. I’ve never been a fan of keyword density, even if it makes a small difference on MSN. Not worrying about repeating keywords endlessly gives freedom to SEO copywriters, who will then produce much better link-worthy content.

That stats analysis turned up a whole lot more than a preference for IIS. Upon further investigation, it really didn’t point to a preference for IIS servers, it pointed to a preference for Microsoft-owned properties, which all happen to be hosted on IIS servers (duh).

So stats analysis can reveal some pretty shocking stuff. In this case, it’s a search engine seeding their SERPs with their own sites. Not exactly ethical (or legal, if you want to get technical), but certainly understandable.

Stop on by SEO Club again someday if you’d like to give the numbers a scan…

I was actually talking about someone else’s study, but thanks for reminding me about that conversation. I remember you had that explanation in mind when we talked last year. It was sort of the obvious explanation, but obviously slipped my mind completely when I wrote this.

It’s nice to hear what the outcome was when you took a deeper look, too. That’s the difference between looking at the statistics and drawing silly conclusions, and examining the data to get at the truth.

I don’t know about the legality/ethics of the hand edits (or whatever) that some search engines apparently do to boost their own properties in SERPs.

MSN might simply be looking at user data, concluding that their users prefer certain sites, and boosting those popular sites in SERPs. Since they get a lot of their search audience from their own properties there would be a bias toward those sites within their user data.

So it could all be perfectly innocent… and I got a bridge I’d like to sell ya too.

I totally agree with you: "SEOs work to improve their rankings will probably tend to repeat their keywords just a little more than the average writer". Even content writers make this mistake; natural writing is what visitors and search engines like.