Lessons Learned Building an Index of the WWW

The author's views are entirely his or her own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.

Last week I gave the keynote presentation at SMX Munich, Lessons Learned Building an Index of the WWW. In that presentation, I shared a great deal of data from our web index as well as some SEO tips based on our experience replicating many search engine activities (crawling, indexing, building a link graph, de-duplication, canonicalization, etc.). In this blog post, I'd like to first announce that Linkscape's new index, with crawl data from late March to early April (& upon which these data points are calculated), is now live - check it out here - and second, to share the charts, graphs and tips from my presentation.

The Linkscape Index

First off, some basic points about Linkscape's index:

The crawl is intended to imitate what major search engines crawl and keep in their index. Talking to lots of folks from the engines who do this work, we've heard that while tens or hundreds of billions of pages are crawled, there are only "~5-10 billion pages worth keeping in a main index."

Linkscape is a crawler-built index, meaning it uses a seed set and crawls outward via links to discover new URLs.
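To make "seed set, then crawl outward via links" concrete, here is a minimal sketch of breadth-first discovery. It is not Linkscape's crawler: the `crawl` function, the `get_links` callback and the toy URLs are all hypothetical, and an in-memory dictionary stands in for real HTTP fetching and parsing.

```python
from collections import deque

def crawl(seeds, get_links, max_pages):
    """Breadth-first discovery: start from seeds, follow links outward."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for target in get_links(url):
            if target not in seen:  # only enqueue URLs we have not met yet
                seen.add(target)
                queue.append(target)
    return order

# Toy "web" (hypothetical URLs): page -> pages it links to.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/", "d.example/"],
    "c.example/": ["a.example/"],
    "d.example/": [],
    "orphan.example/": ["a.example/"],  # links out, but nothing links to it
}

discovered = crawl(["a.example/"], lambda u: TOY_WEB.get(u, []), max_pages=10)
print(discovered)
```

Note that `orphan.example/` is never found: a link-following crawler can only reach pages that something already in the crawl links to.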

The index currently biases towards pages with external links, meaning we don't crawl as deeply as the major engines do, but we try to crawl very broadly (to reach as many well-connected pages and unique domains as possible).

As we crawl, we see some well-known structural pieces making up the web:

Linkscape, like numerous academic sources (and, almost certainly, the major search engines), collects and stores data about three types of structural components: pages, subdomains and root domains. Link & content metrics, along with crawl parameters and query-independent ranking factors, are stored about each of these.

Linkscape also sees a view of the web that most IR students will be familiar with:

As others have noted in the past, the web's link structure tends to look a bit like a bowtie, with a large number of tightly linked, well connected pages in the center and outliers on the borders with few incoming/outbound links. Linkscape does a relatively good job with the center and the linked-to edge (with few/no outbounds), but struggles more on pages with no incoming links (as these are difficult to discover and often not worthwhile keeping in an index).

Index Statistics

We've found these data points fascinating and I'm excited to be able to share many of them for the first time. While Linkscape is not as comprehensive as Yahoo!/Google, it's far closer to a representation than a sample size. Our latest index update currently contains:

44,410,893,857 (44 Billion) pages

230,211,915 (230 Million) subdomains

54,712,427 (54 Million) root domains

474,779,069,489 (474 Billion) links

For this index, the following data pieces apply:

* Note that for the link distribution chart, this refers to "external, juice-passing links" which excludes links from the same subdomain to itself as well as links on pages with the meta nofollow or those that employ rel=nofollow.

* Note that for the root domains linking chart, this refers only to pages/sites receiving links from unique root domains. For example, with www.seomoz.org, we'd only receive one "linking root domain" from searchengineland.com, even though that site links to ours on many unique pages. Likewise, with links we receive from About.com and their numerous subdomains - in total, it's only one counted "unique root domain."

* Not surprisingly, most links on the web are incestuous to some degree, and thus come from internal links (those on the same subdomain as the target), same IP address (where multiple sites from the same owner are hosted), same root domain and the same c-block of IP addresses. If we can see these relationships with Linkscape, it follows that the search engines have an easy time of it as well - and these links are almost certainly not passing the same kind of value that external links from unique root domains, IP-addresses and C-blocks would.
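The definitions in the notes above (external, followed links, counted once per linking root domain) can be sketched roughly as follows. This is an illustration, not Linkscape's code: the function names and the `(source, target, nofollow)` tuple layout are assumptions, and the root-domain rule is deliberately naive.

```python
from urllib.parse import urlsplit

def subdomain(url):
    """Full hostname of a URL, e.g. 'cars.about.com'."""
    return urlsplit(url).hostname or ""

def root_domain(url):
    """Naive root domain: the last two host labels.

    Assumption: fine for .com/.org style hosts; suffixes such as .co.uk
    need a public-suffix list, which this sketch does not use.
    """
    return ".".join(subdomain(url).split(".")[-2:])

def linking_root_domains(links, target_url):
    """Unique root domains with external, followed links to target_url.

    links: iterable of (source_url, target_url, nofollow) tuples.
    """
    target_host = subdomain(target_url)
    roots = set()
    for source, target, nofollow in links:
        if target != target_url or nofollow:
            continue  # wrong target, or the link is nofollowed
        if subdomain(source) == target_host:
            continue  # internal: same subdomain as the target
        roots.add(root_domain(source))
    return roots

T = "http://www.seomoz.org/"
links = [
    ("http://searchengineland.com/a", T, False),
    ("http://searchengineland.com/b", T, False),  # same root domain again
    ("http://cars.about.com/x", T, False),
    ("http://dogs.about.com/y", T, False),        # another about.com subdomain
    ("http://www.seomoz.org/blog", T, False),     # internal link
    ("http://example.org/z", T, True),            # nofollowed link
]
roots = linking_root_domains(links, T)
print(sorted(roots))
```

Six links come in, but only two unique linking root domains are counted, which is the distinction the root domains chart is drawing.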

Some interesting data points on the above:

2.7% of all links on the web are nofollowed

73% of those are internal (so nofollow is actually far more popular as a link sculpting tool than a spam prevention device)

3 billion out of our 475 billion links (~0.6%) were found in noscript tags - while the engines recommend against this and talk about it as a spam tactic, we suspect that many of these are, in fact, legitimate uses and probably do get counted (due to their value in content discovery).

165,638,731 links (~0.035%) aren't visible on the page (they're hidden off screen using CSS or other tactics). Again, given the numbers, we wonder whether all of these are spam and whether they're all discounted by the engines.

This is our first index supporting the canonical URL tag, and so far we've seen just north of 16 million pages employing the parameter. While this is still a drop in the bucket on a global web scale, we'll be watching closely for how much support it generates over the months to come.
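For readers who haven't seen the canonical URL tag in the wild, here is a small sketch of how a crawler might read it from a page using Python's standard `html.parser`. The class and function names are hypothetical and this is not how Linkscape itself does it; it simply picks the first `<link>` element whose `rel` includes `canonical`.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]

def find_canonical(html):
    """Return the canonical URL declared in html, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

page = (
    '<html><head>'
    '<link rel="canonical" href="http://www.example.com/page" />'
    '</head><body>Hello</body></html>'
)
print(find_canonical(page))
```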

Search Engine & Linkscape Metrics

Like the search engines, we calculate a number of metrics on the pages, subdomains and root domains in our index to help uncover spam and sort by popularity & trustworthiness. The following are distributions of the metrics we currently employ:

* mozRank is our calculation of raw link popularity. Like Google's PageRank, Yahoo!'s WebRank and Live's StaticRank, it's a recursive algorithm that counts links as votes and treats links from more popular pages as more important. We've found that while it's useful for discovering which pages to crawl and index, it's a poor measure of true importance and has significant noise.

* Domain mozRank is calculated in the same fashion as page-level mozRank, but on the domain-level link graph. Thus, it only takes into account unique links that exist from one root domain to another and is agnostic as to whether a site has 1, 100 or 1,000 links to another. We've found this metric exceptionally valuable for identifying the popularity and importance of a root domain - on the subdomain link graph, it's more susceptible to manipulation and spam.

* mozTrust, which we also calculate on both the domain and page level link graphs, has proven highly effective as a spam identifier (particularly in combination with mozRank - the difference between the two is an excellent predictor of manipulative linking). mozTrust relies on the same intuition as Yahoo!'s TrustRank, running a recursive algorithm that passes juice down from trusted seed URLs/domains.
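The exact mozRank and mozTrust calculations aren't spelled out here, so the following is only a generic sketch of the two ideas described above: a PageRank-style recursive score where links act as votes, and a TrustRank-style variant where score can only originate from trusted seeds. The function name, damping value, iteration count and toy graph are all assumptions.

```python
def link_rank(graph, teleport=None, damping=0.85, iterations=50):
    """Power iteration over {node: [nodes it links to]}.

    teleport: optional {node: weight} for the random jump. Uniform (the
    default) gives a PageRank-style score; restricting it to trusted
    seeds gives a TrustRank-style score.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    if teleport is None:
        teleport = {n: 1.0 for n in nodes}
    total = sum(teleport.values())
    jump = {n: teleport.get(n, 0.0) / total for n in nodes}

    rank = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * jump[n] for n in nodes}
        for n in nodes:
            targets = graph.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share  # each link passes an equal share
            else:
                # Dangling node: hand its score back via the jump vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * jump[m]
        rank = nxt
    return rank

graph = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"],
    "spam1": ["spam2"], "spam2": ["spam1"],  # isolated link exchange
}
popularity = link_rank(graph)                   # PageRank-style
trust = link_rank(graph, teleport={"hub": 1.0}) # TrustRank-style, seed = hub
print(popularity["spam1"], trust["spam1"])
```

In this toy graph the two spam pages earn a healthy popularity score just by linking to each other, but zero trust, because no path leads to them from the seed. That gap between the two scores is the intuition behind using them together.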

Measuring Correlation

Possibly the most interesting data I shared from an SEO application standpoint was around our research into the correlation of individual metrics to search engine rankings. Our own Ben Hendrickson has been doing significant data gathering and analysis, trying to answer the question,

How well does any single metric predict higher rankings?

His early results are enlightening:

In this chart, Ben's showing that no metric is particularly good at predicting rankings by itself, but if you had to use something, the number of root domains linking to a URL and that URL's mozRank are both just above the 95% confidence interval. Note that such classic SEO metrics as Yahoo! link counts and Alexa.com counts (which are included in many toolbars and appear in many SEO reports) are very nearly worthless.

The results are much better (though still not excellent) when we instead ask what metrics correlate with ranking 10 positions higher (essentially, what's the difference between page 1 vs. 2, 2 vs. 3, etc). Here, Ben shows that while only a single metric is above the 95% confidence interval (domains linking to a URL), there are several that are 20%+ better than random guessing.

Perhaps the most surprising result of this (for me, at least) was the data showing that Google's link counts actually do have a correlation with rankings, suggesting that they're not completely random (even though they might feel that way given their small sample size).

Out of all the metrics, it's little surprise that # of linking root domains is a favorite (we use it, for example, to sort our Top 500 list). It's one of the most difficult metrics to manipulate effectively and has high correlation with trust, importance and search engine rankings.
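The post doesn't describe how Ben's numbers were computed, so the snippet below is not his method. It is one simple way to frame "how well does a single metric predict higher rankings" against random guessing: for every pair of results in a SERP, check whether the higher-ranked one also has the higher metric value. The function name and sample numbers are made up.

```python
from itertools import combinations

def pairwise_accuracy(results):
    """results: list of (rank_position, metric_value); position 1 is best.

    Returns the fraction of result pairs where the better-ranked result
    also has the larger metric value. Metric ties count as half.
    0.5 means the metric is no better than a coin flip.
    """
    correct = 0.0
    pairs = 0
    for (pos_a, val_a), (pos_b, val_b) in combinations(results, 2):
        if pos_a == pos_b:
            continue
        pairs += 1
        better, worse = (val_a, val_b) if pos_a < pos_b else (val_b, val_a)
        if better > worse:
            correct += 1.0
        elif better == worse:
            correct += 0.5
    return correct / pairs if pairs else 0.5

# Hypothetical SERP: (position, number of linking root domains)
serp = [(1, 120), (2, 300), (3, 80), (4, 15)]
accuracy = pairwise_accuracy(serp)
print(round(accuracy, 3))
```

Here five of the six pairs are ordered the way the metric would predict. Averaged over many queries, a figure like this can be compared with the 0.5 baseline of random guessing.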

Top Tips for SEOs

Based on the work we do crawling and building an index, and the struggles we've encountered (and seen the engines similarly encounter), we've crafted a few short tips. While some of these are obvious and well known, they still pay to keep in mind as high-level recommendations we feel confident the search engines would support:

Don't rely on the search engine to canonicalize anything for you.

Focus on link acquisition from a diverse number of root domains, not necessarily high PageRank pages, or those with high link counts.

Make smart, usable, short URLs. They're far easier to process and have a much better correlation with useful, unique content an engine would want to keep in its index.

If you want to earn lots of links, building a distributed content widget/badge/link that users embed in their sites/pages is an incredibly effective strategy. Just look at how many of the top pages on the web achieved that position employing this strategy.

Don't rely on PageRank or raw link counts as accurate assessments of ranking potential. According to our data, they're not high signal or high rankings correlation metrics.

The social web is rising, as are those employing it effectively (again, check out the top sites list for evidence).

Don't be afraid to use nofollow internally as it's clearly not an outlier on the web. However, do be cautious with its use - you can seriously screw things up if you make mistakes on that front.

Keep content on a single subdomain and root domain wherever possible. The metrics of that domain will go a long way to make that content visible and ranking-worthy.

Avoid doing "strange" things from a technical and link acquisition perspective. The former makes you harder to crawl, process and index while the latter makes you stand out as possible spam/manipulation.
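On the first tip, doing your own canonicalization mostly means making sure each piece of content answers at exactly one URL. As a rough sketch of the kind of normalization involved (the function name and the specific rules chosen are assumptions, and this covers only a few safe cases):

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url):
    """Lowercase scheme and host, drop default ports, fragments and
    userinfo, and use '/' for an empty path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

variants = [
    "HTTP://WWW.Example.com:80/page#top",
    "http://www.example.com/page",
]
print({normalize_url(u) for u in variants})
```

Both variants collapse to a single URL. Whether trailing slashes, parameter order or www vs. non-www should also collapse depends on the site, which is exactly why it's safer to decide that yourself (and redirect accordingly) than to leave it to an engine.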

We hope you enjoy this data - please feel free to share - and enjoy using the new Linkscape index. Again, I'd like to give my congratulations and thanks to both Ben & Nick, who've done a tremendous job with Linkscape. If you have questions, please leave them in the comments and they should be able to provide answers and direction.

p.s. For those keeping track, this index update was almost exactly a month from our last one, and our goal is to maintain approximately 3-4 week intervals between updates for the foreseeable future. We're also doing a lot to improve the quality and focus of our index to capture more good stuff and deep stuff on mid-size and large domains (and less spam). We'd appreciate it if those of you who are producing lots of spam would help us out by ceasing to earn links from trustworthy, respectable sites and pages - thanks! :-)

I second that request. I'm about to start working on some internal link sculpting on a number of my sites and would love to see a Whiteboard Friday that covers best practices and the pitfalls to avoid.

WOW. That's some yummy looking data you all have over there! I would dive into this right now and soak it all in like a sponge, but after working on vector calculus for two hours... I think I'll have to read this all over again tomorrow.

Wow, thanks Rand; great data - and thanks for presenting it in a way that we might stand a chance of being able to explain some of it to less technically-minded people.

The 'number of linking domains' metric is interesting; it supports the advice in SEOMoz's Pro Linkbuilding guide that getting links from a broad number of domains is valuable because links from 10 domains probably involved 10 editorial decisions, but 10 links from 1 domain are unlikely to have.

Also great to have data on how nofollow is not being used for its original purpose.

I may be a little strange, but Rand, would you analyze the long tail of each data graph?

Maybe find some commonalities of each?

I ask because websites with high PR (7, 8, 9, and visibility for that matter) tend to not be a part of the trends of the web. For example, WSJ.com will rarely link to what they are talking about, but typical blogs would.

I too would like to find out more information about the long tail as somebody else previously referenced. Common characteristics of high PR sites is probably fairly obvious, however you may just discover some gems.

I've started advising clients that we have to drop any reference to Alexa. A recent project showed an Alexa ranking double mine, but once I reviewed the web analytics data, it showed that this site got 20 times the traffic of my site.

Typically, as an Alexa number gets higher, further away from #1, it should not mean such a massive difference in traffic.

So after being put in my place, that maybe my site should be getting less traffic or Alexa sucks, I think your score of 1% for relevance is perfect.

It's tough to evaluate links in Linkscape using mozRank when you realize that some obviously-penalized pages are passing a great deal of mozRank. Any chance we might see a "possibly penalized" indicator in Linkscape reports?

Following that is there any likelihood you'll be altering your mozRank calculation to adjust for SPAM signals?

Big THUMBS UP from me, this project is truly one of the most interesting experiments I've seen in a long time.

One thing that would be a great educational tool would be the ability to search your index using your ranking algorithms to determine the actual Google ranking projects that you are putting forward.

This would probably be the single most important tool for all SEOs out there in the professional space, as it would help us to test our theories (and yours) properly using this "sample" data that you are accruing.

It would also be a proactive way of identifying pages that have had a Google penalty, as your algo wouldn't cater for this in the same way, so we would be able to identify pages that, as you put it in the whiteboard session on Friday, are "over optimized".

Anyway, a big thanks guys, really interested in seeing where this goes over the coming months!

Excellent data Rand. Thanks for sharing. One question based on the note below:

"* mozTrust, which we also calculate on both the domain and page level link graphs, has proven highly effective as a spam identifier (particularly in combination with mozRank - the difference between the two is an excellent predictor of manipulative linking)."

How much of a difference between mozTrust and mozRank would you look for to raise the link manipulation alarm?

We usually think a difference of more than 1 point is something that bears investigation. I'm not saying it's spam, but you might want to check out the links at that point as this probably means there's a lot of links that really aren't contributing much value.

As Nick said, a full point difference often bears investigation, and a 2-3 point difference usually means something's fishy. This works with PageRank vs. mozRank, too. If the two numbers are off by more than 2-2.5, we see a high correlation with sites/pages that have been penalized (when PageRank is lower).
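The rule of thumb in the two replies above can be written down as a small check. The cutoffs (one point, two points) come from those comments; the function name and labels are made up for illustration, and comparing the absolute difference is an assumption.

```python
def rank_gap_flag(score_a, score_b):
    """Compare two scores for the same page, e.g. mozRank vs. mozTrust."""
    gap = abs(score_a - score_b)
    if gap >= 2.0:
        return "something's fishy"
    if gap >= 1.0:
        return "bears investigation"
    return "ok"

print(rank_gap_flag(5.4, 5.1))
print(rank_gap_flag(5.4, 4.2))
print(rank_gap_flag(6.0, 3.5))
```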

However, the question is how did you count them - is that just by the standard link operator (which is known to strongly understate the number of links), or are these stats based on the number of backlinks from Google webmaster accounts?

The important point about the link operator, which does not reflect the correct absolute link count, is that it does reflect a good sample, at least in so far as rankings go. So if you've got two sites where one has a substantially lower count from "link:" than the other, it's likely the second will have a better chance ranking.

Wow, can't believe that there are only 23 comments on THIS post so far? I don't think that I have ever gotten to a post with less than 40 within the first couple of hours! :-)

Love this kind of stuff Rand, thanks for sharing it. I also love that it backs every theory I have been preaching for the past year! ;-) It's reassuring that there is hard data to back theories up. Although it's hard to get completely conclusive data in our industry, it's nice to take the data for what it is.

I love the hard data associated with the post. One of my big take-aways was that Google seems more concerned with the number of linking domains, rather than the quality of the links. So my general strategy for link building shifts from pursuing a few high quality links to pursuing a vast variety of mediocre links from a large number of domains.

My question is whether Google counts domains not stored in the index as a unique domain?

Personally I would imagine if Google isn't aware of the domain or has banned it from within the index, links from these would not count as a unique domain.

I would not uniformly go after many mediocre links. I think you need balance here. If you've already got a sizable inventory of highly authoritative links, it's important to diversify with some broader reach. But if you don't have any highly authoritative links, you should pull in some before going after more poor links.

Wow, that's some pretty eye-opening information there. I almost have trouble believing some of it, like trusting Google to reveal decent backlinks - but it's hard to argue when there are stats to back it up.

Question though - how did you guys determine which sites are hiding links off-screen? Was there a certain margin-left or absolute position number that you looked for to come up with that number? Just curious...

I don't think the data here shows that Google reveals good backlinks. Instead what we're saying is that the count that Google reveals is a consistent sampling of the overall count. So if you take two pages and look at their link counts from Google, it turns out that (relatively speaking) that's a good thing to compare when thinking about search engine performance.

But the actual links that they show you are probably not a good set of the most powerful links for the site.