W3C HTML Validation and Search Engine Optimization

It has been a while since I posted any of Virante’s research to the blog, and a good friend and former COO, Bob Misita, called me out on it. I figured I would release some of the data from a recent study we did on the relationship between W3C HTML validation and web page rankings. Because validation is quite complex, we took a macro look rather than our traditional methodology of getting individual sites into the SERPs via sitemaps and then tweaking individual independent variables.

In particular, we looked at the W3C validation of approximately 100 separate keywords in Google, Yahoo, MSN Live and Ask. For each keyword, we extracted the top 10 ranking sites, measured the number of errors via a W3C validation check, and used multiple statistical models to determine whether the individual rankings of the sites could be associated with validation error numbers.
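The post doesn’t describe the tooling behind that measurement step, but it can be sketched simply: for each top-10 URL, ask the W3C validator for machine-readable results and count the error-level messages. The sketch below uses today’s Nu HTML Checker JSON interface (`https://validator.w3.org/nu/?doc=<url>&out=json`); the 2008 study would have used the older Markup Validator, so treat the endpoint and the helper names (`count_errors`, `validate_url`) as illustrative assumptions, not the study’s actual code.

```python
# Illustrative reconstruction of the per-URL measurement step:
# fetch the W3C Nu HTML Checker's JSON report for a page and count
# the messages it classifies as errors (ignoring warnings/info).
import json
from urllib.request import Request, urlopen

CHECKER = "https://validator.w3.org/nu/?out=json&doc="  # assumed endpoint

def count_errors(messages):
    """Count checker messages whose type is 'error'."""
    return sum(1 for m in messages if m.get("type") == "error")

def validate_url(url):
    """Return the number of validation errors the checker reports for url."""
    req = Request(CHECKER + url, headers={"User-Agent": "validation-study"})
    with urlopen(req) as resp:
        report = json.load(resp)
    return count_errors(report.get("messages", []))
```

Running `validate_url` over the top 10 results for each of the ~100 keywords would yield the error counts the regressions below are built on.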

Rudimentary statistics were all we needed to dismiss the assumption that validated content performs better in the search engines – that is, in Google, Yahoo, MSN Live, or Ask.

The erratic relationship between the average number of validation errors and ranking position is fairly evident from the graph above. But rather than assume that the averages across all 100 keyword searches told the whole story, we also computed a least-squares regression for each keyword on each engine (400 different result sets).

| Engine   | Avg. Errors | Slope Avg         |
|----------|-------------|-------------------|
| Google   | 155         | 1.61369625672E-19 |
| Yahoo    | 146         | 0.00325581395349  |
| MSN Live | 111         | 0.00418604651163  |
| Ask      | 102         | 0.000714285714286 |

As you can see, the slope of the least-squares regression line is barely positive on every engine (per the table, the largest is actually MSN Live’s at roughly 4/1,000, with Yahoo’s close behind at about 3/1,000). Taking a slope of 3/1,000: if the confidence levels were high, you could assume that for every 333 validation errors removed from your page, your ranking would rise by one position. However, the confidence levels were not sufficient and, perhaps most glaring, fewer than 2% of the sites tested had more than 333 validation errors – meaning the vast majority of sites could not benefit from such a change.
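The slope-to-“errors per position” arithmetic comes from an ordinary least-squares fit of ranking position against error count, then inverting the slope. The study fit one regression per keyword and averaged the slopes; the per-keyword raw data isn’t published, so the single fit below – run against Google’s per-position average error counts from the table further down – is only an illustration of the calculation, not a reproduction of the study’s numbers.

```python
# Ordinary least-squares slope of ranking position (y) on validation
# errors (x). A positive slope s means roughly 1/s errors removed per
# ranking position gained -- the interpretation used in the post.
def least_squares_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Google's average error counts at positions 1-10 (from the table below).
errors = [103, 74, 118, 190, 86, 127, 60, 180, 145, 146]
ranks = list(range(1, 11))

slope = least_squares_slope(errors, ranks)  # positions per validation error
if slope > 0:
    print(f"slope = {slope:.5f}: ~{1 / slope:.0f} errors removed per position gained")
```

Note that even when the slope is positive, the post’s point stands: without statistical significance, and with almost no sites carrying hundreds of errors, the implied trade is meaningless in practice.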

Average validation errors by ranking position (1–10):

| Engine   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  |
|----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| Google   | 103 | 74  | 118 | 190 | 86  | 127 | 60  | 180 | 145 | 146 |
| Yahoo    | 97  | 95  | 78  | 134 | 121 | 91  | 126 | 145 | 133 | 118 |
| MSN Live | 54  | 102 | 78  | 59  | 122 | 79  | 76  | 100 | 128 | 88  |
| Ask      | 98  | 99  | 81  | 94  | 63  | 112 | 105 | 82  | 43  | 89  |

Even though the sites ranking in Live and Ask appear to validate somewhat better than those in Google and Yahoo, the regression slopes above quickly counter any ranking explanation. It is possible that W3C validation plays a role in being indexed (although I think this is unlikely). Importantly, we saw similar variation among the sites the four search engines allowed to rank – meaning there appears to be no threshold error count required to rank in any of these engines.

So, there you have it. One less thing to worry about. While I still think HTML validation is a worthy cause in and of itself, one would be hard-pressed to prove that it is directly and positively correlated with – much less a cause of – one’s search rankings.

11 Comments

Good post. I agree: as much as I support the idea of web standards, when it comes to ranking in search engines I find it has little to no effect whatsoever. It is very useful for other things, but for search ranking I very much doubt it. Technically, search engines are good at working around markup mistakes, and instead concentrate on “social” factors to determine ranking (e.g., the number of quality inbound links).

When “web standards” were being pushed a few years ago, people kept saying they would help with search engine ranking. Validation *might* help with indexing, but only insofar as a page that is not broken is easier for a search engine to parse.

Yes, you have one more piece of evidence that there is little if any correlation between W3C validation and search engine rankings. We have observed countless times that you need to do the basic on-page optimization, develop useful content, and build lots of quality inbound links to rank well in the major search engines.

Also interesting to consider is the possibility that sites with many errors tend to be more poorly developed overall, with less internal linking, poorer on-site SEO factors, and more violations of SEO guidelines such as pop-unders, hidden text, banned MX servers, malware, etc. Another interesting test would be page load time and its effect on rankings. Clearly Google considers this a quality issue for web pages and has begun working the factor into AdWords landing pages.

I have said it before and I will say it again: validation matters most for compatibility across different browsers (a human concern) rather than for SEO purposes. It never hurts to validate, of course, but it has very little to do with your crawlability.

I disagree. Even though your evidence and plotting support your case, you are only taking a few web sites and plotting them on a graph for certain key terms. I recently had an issue with not getting indexed by MSN or Live.com. I contacted MSN directly, and they told me to validate and resubmit and it would improve.

Glenn

Aug 20, 2008

“We are sorry, but StopScraper has determined that your IP address is associated with a scraper.”

Ehh, ok? Normal RSS-feed usage – the default setting in the app is to check every hour. My IP is probably available in the admin interface of this blog. I have now changed it to check every 12 hours. Please remove the block, since I find your stuff interesting. Thank you!

I like that this post, even with measurably relevant data, still leaves the discussion open, but more focused.

However, I consider that if a page looks bad to a visitor, then it will look bad to a bot, and vice versa. Mack asserts this point quite effectively in regards to browser compliance. If you have great rankings but surfing your site in Opera means navigating only with shortcut keys, the odds are that you’ll lose a sale or exposure. That’s just how it goes.

So, validation doesn’t matter, eh? Since HTML validation was not ‘perfect’ for these sites, I’m curious how the script and CSS of these same sites would perform against validation. Identical ratios would convince me that it’s not a *ranking* factor but a *user experience* factor (and aren’t the two equally relevant to organic search?).