Google “Reveals Index Secrets”: Charts Indexing of Your Site Over Time

Total Indexed Count

Google says that this count is accurate (unlike the site: search operator) and is post-canonicalization. In other words, if your site includes a lot of duplicate URLs (due to things like tracking parameters) and the pages include the canonical attribute or Google has otherwise identified and clustered those duplicate URLs, this count only includes that canonical version and not the duplicates. You can also get this data by submitting XML Sitemaps but you’ll only see complete indexing numbers if your Sitemaps are comprehensive.
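For reference, the canonical hint mentioned above is declared with a link element in the page’s head. A minimal sketch, with example.com and the URLs standing in as placeholders:

```html
<!-- On a duplicate URL such as http://www.example.com/widgets?utm_source=feed,
     this tells Google which version should represent the clustered duplicates -->
<link rel="canonical" href="http://www.example.com/widgets">
```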

Google also charts this data over time for the past year.

Edited to add: Google has told me that the data may have a lag time of a couple of weeks, which makes it more useful for trends than for real-time action. Also, if you look at domain.com, you’ll see stats for all subdomains, and if you look at www.domain.com, you’ll see stats for only the www subdomain. (Of course, this means that if your site doesn’t use www, as with searchengineland.com, there’s no easy way to see this data with subdomains excluded.)

Advanced Status: How This Data Is Useful and Actionable

The Advanced option provides additional details:

Great, right? More data is always good! Well, maybe. The key is what you take away from the data and how you can use it. To make sense of this data, the best approach is to exclude the Ever Crawled number and look at it separately (more on that in a moment). So, you’re left with:

total indexed

not selected

blocked by robots

The sum of these three numbers tells you the number of URLs Google is currently considering. In the example above, Google is looking at 252,252 URLs. 22,482 of those are blocked by robots.txt, which is fairly straightforward and mostly matches the number of URLs reported under Blocked URLs (22,346). Unfortunately, it’s become difficult to look at the list of what those URLs are: the blocked URLs report is no longer available in the UI, although it is available through the API. That leaves 229,770 URLs, which means 74% of the URLs weren’t selected for the index. Why not? Is this bad? The trouble with looking at these numbers without context is that it’s difficult to tell.
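The bucket arithmetic is simple enough to sanity-check. A quick sketch using the figures from this example:

```python
# The three Advanced-view buckets sum to the URLs Google is currently
# considering: total indexed + not selected + blocked by robots = 252,252.
total_considered = 252_252
blocked_by_robots = 22_482

# Setting the robots.txt blocks aside leaves indexed + not selected URLs.
remaining = total_considered - blocked_by_robots
print(remaining)  # 229770
```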

Let’s say we’re looking at a site with 50,000 indexable pages. Has Google crawled only 31,480 unique pages and indexed all of them? (In this case, all of the not selected would be non-canonical URL variations with tracking codes and the like.) Or has Google crawled all 50,000 (plus non-canonical variations) but has decided only 31,480 of the 50,000 were valuable enough to index? Or maybe only 10,000 of those URLs indexed are unique, and due to problems with canonicalization, a lot of duplicates are indexed as well.

This problem is difficult to solve without a lot of other data points to provide context. Google told me that:

“A URL can be not selected for indexing for many reasons including:

It redirects to another page

It has a rel=”canonical” to another page

Our algorithms have detected that its contents are substantially similar to another URL and picked the other URL to represent the content.”

If the not selected count is solely showing the number of non-canonical URLs, then we can generally extrapolate that for our example, Google has seen 31,480 unique pages from our 50,000-page site and has crawled a lot of non-canonical versions of those pages as well. If the not selected count also includes pages that Google has decided aren’t valuable enough to index (because they are blank, boilerplate only, or spammy), then things are less clear. (Edited to add: Google has further clarified that “not selected” includes any URLs flagged as non-canonical (the third bullet above could include blank, boilerplate-only, or duplicate pages), URLs with meta robots noindex tags, and URLs that redirect; it is not based on page quality.)
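To make the “not selected” idea concrete, here is a toy sketch of duplicate clustering. The parameter names and the clustering rule are assumptions for illustration only; Google’s actual algorithms are far more involved:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical tracking parameters that create duplicate URL variations.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonicalize(url):
    """Drop tracking parameters so duplicate variations collapse together."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

crawled = [
    "http://example.com/widgets?utm_source=feed",
    "http://example.com/widgets",
    "http://example.com/widgets?sessionid=123",
    "http://example.com/gadgets?sort=price",
]

# One representative per cluster gets indexed; the rest are "not selected".
canonical = {canonicalize(u) for u in crawled}
not_selected = len(crawled) - len(canonical)
print(len(canonical), not_selected)  # 2 canonical URLs kept, 2 not selected
```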

If 74% of Google’s crawl is of non-canonical URLs that aren’t indexed and redirects, is that a bad thing? Not necessarily. But it’s worth taking a look at your URL structure. Non-canonical URLs are unavoidable: tracking parameters, sort orders, and the like. But can you make the crawl more efficient so that Google can get to all 50,000 of those unique URLs? Google’s Maile Ohye has some good tips for ecommerce sites on her blog. Make sure you’re making full use of Google’s parameter handling features to indicate which parameters shouldn’t be crawled at all. For very large sites, crawl efficiency can make a substantial difference in long tail traffic. More pages crawled = more pages indexed = more search traffic.
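A blunter instrument for the crawl-efficiency point above, alongside the parameter handling settings, is robots.txt pattern blocking. The parameter names here are hypothetical; only block parameters you’re certain Google shouldn’t crawl at all, since blocked URLs can’t pass signals through a canonical tag:

```
# Block crawl of session and sort-order variations (hypothetical parameters)
User-agent: *
Disallow: /*?sessionid=
Disallow: /*&sessionid=
Disallow: /*?sort=
```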

Ever Crawled

What about the ever crawled number? This data point should be looked at separately from the rest, as it’s an aggregate number from all time. In our example, 1.5 million URLs have been crawled. But Google is currently considering only 252,252 URLs. What’s up with the other 1.2 million? This number includes things like 404s, but for this same site, Google is reporting only 5,000 of those, so that doesn’t account for everything. Since this count is “ever” rather than “current”, things like 404s have surely piled up over time. Edited to add: Google has clarified that all numbers are for HTML files only, and not for filetypes like images, CSS files, or JavaScript files.
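Again using this example’s figures, the gap described above:

```python
# "Ever crawled" is cumulative over all time, while the other counts are
# point-in-time, so the difference accumulates 404s, removed URLs, and so on.
ever_crawled = 1_500_000
currently_considering = 252_252
unaccounted = ever_crawled - currently_considering
print(unaccounted)            # 1247748: roughly the 1.2 million in question

reported_404s = 5_000         # the 404 count Google reports for this site
print(unaccounted - reported_404s)  # 1242748 left unexplained by 404s alone
```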

In any case, I think this number is much more difficult to gain actionable insight from. If the ever crawled number is substantially smaller than the size of your site, then this number is very useful indeed as some problem definitely exists that you should dive into. But for the sites I’ve looked at so far, the ever crawled number is substantially higher than the site size.

Site size can be difficult to pin down, but for those of you who have a good sense of that, are you finding that most of your pages are indexed?


About The Author

Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.


http://www.facebook.com/drew.pokoj Drew Pokoj

I wonder what this data is based off of. I have a BRAND new site, launched about 2 weeks ago, and got it in the index by the next day. Currently it has 2,000-some pages that show up, and I am already starting to get a few hits per day via the SERPs, but this new chart shows 0 indexed pages for my site… odd.

Any idea if this includes all subdomains? You have to set up accounts for each separately in GWMT (though not Bing, which I like much better), but these index numbers look to be too high not to be inclusive of all subs.

Interesting and useful… let me check it for my startup LoginRadius, which offers social infrastructure to businesses! Btw, they should come up with something like this based on social networks. What do you think?

http://top5ives.blogspot.com/ Majid Ali

Index Status is useful and interesting. I will check if it works for me.

Mike Miller

I am seeing the same thing. I did a site:www.sitename.com and saw 164,000 pages indexed, but according to GWT, I’m seeing 1.37mil. I’m assuming subdomains are factored into this number, which almost makes this report useless.

http://profiles.google.com/singh8954 singh 09

Good news, now we can all find out how much Google is indexing. What happens when someone revamps their website? Will redirected URLs be counted?

http://www.way2earning.com/ Suresh

This is a great move by Google. I think Index Status helps webmasters identify redirected and canonical pages.

http://twitter.com/bsdeshmukh Babarao Deshmukh

Gtalk is down… make a post on it… thanks

http://twitter.com/HP2Z23 Corina C.Ramirez

Yes, same here. It’s been down for the last hour.

http://twitter.com/roseberry9 Tom Roseberry

Yeah, I could see that. I’m usually concerned with the site as a whole and was always annoyed that G WMT didn’t allow all subdomains to roll up to one “site”, so I like that they’d show indexing across the entire domain. Though they should be consistent, and an option to configure it whichever way you’d prefer would be nice too.

http://guymanningham.com/ Guy Manningham

Great info. Google, you elusive temptress! You always seem to change stuff just as I get up to speed.


http://www.ninebyblue.com Vanessa Fox

See edit in the article. I asked Google about subdomains and they said those numbers are only included if you’re looking at sitename.com (not if you’re looking at http://www.sitename.com).

Note that site: search numbers are notoriously inaccurate. How many pages does your site actually have?

http://www.ninebyblue.com Vanessa Fox

See update in article: Google has told me that there is a lag in this data.

http://www.ninebyblue.com Vanessa Fox

Redirects are included in the “not selected” number.

http://twitter.com/roseberry9 Tom Roseberry

Thanks, Vanessa. It’s webmd.com, and I don’t really know; several million. However, most of the bulk that’s not as easy to know for sure is on subdomains like forums.webmd.com. Just looking at the www version in WMT shows more URLs than we have, at least as I think of it. But now that I consider it more, this is probably including all paginated URLs (page=2, page=3, etc.), so that would make sense.

Anyway, greatly appreciate you asking the follow up and reposting.

Tom

http://www.brickmarketing.com/ Nick Stamoulis

It’s always interesting to see your site through the eyes of Google. It may not be 100% accurate, but if you’re fairly confident your site has 10,000 pages and Google is only indexing 5,000 of them, you know something is up.

http://www.devonwebdesigners.com/ Elizabeth Jamieson

I wish they’d identify which pages are the ones included in the not selected category.

Mahendra Varma

It was quite interesting and very useful information; now we can see the site status very clearly.

Michael Carlin

So how do we lower the not selected count? I can’t remove redirects or I’ll lose that link juice, I have no dupe content, and you say that even if I noindex tag pages and archives on WordPress, they will still be part of this number…
