Factual has analyzed data from 4 million web sites and provided a holiday gift for stats junkies. Did you know 5% of sites have either a Twitter or Facebook link? Or that 28% of sites run Google Analytics? Or that 12% of them run Google AdSense? Now you do!

The core data comes from CommonCrawl, a non-profit group set up to crawl the web and provide the data for anyone to use. Gil Elbaz is a founder of both CommonCrawl and Factual, a start-up that creates tables of structured information from data found on the open web (see Factual: Parting The Curtains Of The Invisible Web).

Factual found stats such as those I cited above after examining 4 million web sites. In particular:

28% of sites have Google Analytics on them

12% of sites have AdSense

5% of sites have EITHER a Twitter or Facebook link but…

2% of sites have BOTH a Twitter and a Facebook link

There’s also a chart that shows other interesting stats but without precise percentages. I’ll estimate as best I can:

One thing that's unclear is how the stats break down on a page versus web site basis. A web site might have multiple pages. So when a “web site” is said to have AdSense on it, does that mean each page within the site has AdSense code? Or only some of them? It appears a decision was made on a site-by-site basis, with “site” being defined as all the pages within a set domain or subdomain.
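To show why the page-versus-site question matters, here is a minimal Python sketch of my own. It is not Factual's method. It assumes a "site" is simply the hostname of each page URL, and it compares two possible rules: a site counts if ANY of its pages carries the code, or only if ALL of them do. The function names and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Assumption: a 'site' is the page's hostname (domain or subdomain)."""
    return urlsplit(url).hostname


def site_shares(pages):
    """Roll page-level flags up to site-level shares.

    pages: iterable of (url, has_feature) pairs, one per crawled page.
    Returns (share_any, share_all): the fraction of sites where at least
    one page has the feature, and the fraction where every page has it.
    """
    flags_by_site = defaultdict(list)
    for url, has_feature in pages:
        flags_by_site[site_of(url)].append(has_feature)

    total = len(flags_by_site)
    if total == 0:
        return 0.0, 0.0

    share_any = sum(any(flags) for flags in flags_by_site.values()) / total
    share_all = sum(all(flags) for flags in flags_by_site.values()) / total
    return share_any, share_all


# Hypothetical sample: four pages spread across three sites.
pages = [
    ("http://example.com/", True),
    ("http://example.com/about", False),
    ("http://blog.example.com/", True),
    ("http://example.org/", False),
]

share_any, share_all = site_shares(pages)
print(f"any-page rule: {share_any:.0%}, all-pages rule: {share_all:.0%}")
```

With this made-up sample, the "any page" rule counts two of the three sites while the "all pages" rule counts only one, so the same crawl can produce quite different headline percentages depending on which rule is used.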

Those interested can play with the data themselves. It’s summarized in this very large table at Factual.

CommonCrawl also gets a bit of publicity from this at an interesting time. Earlier this week, Google released a long internal memo talking about how important it was to the company to be open — except in the areas of search and ads:

In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users. The search and advertising markets are already highly competitive with very low switching costs, so users and advertisers already have plenty of choice and are not locked in.

I’ll likely do my own follow-up post to that memo in the near future. In the meantime, a post I wrote back in 2007, Google: As Open As It Wants To Be (i.e., When It’s Convenient), looks at how Google’s claims of being open tend to ring false when openness isn’t something it seems to pursue in areas where it is ahead. An excerpt from that post:

That large index gives Google a huge advantage over rivals. It knows more about what’s on the web than anyone else. So why not share? Why not start an Open Index Alliance where there’s a coordinated effort to crawl and index all the documents in the world, allowing anyone to tap into the raw data?

That’s the idea behind CommonCrawl. Maybe as part of being open, Google could get behind the project?
