The YouMoz Blog

Solving New Content Indexation Issues for Large Websites

This entry was written by one of our members and submitted to our YouMoz section. The author's views below are entirely his or her own and may not reflect the views of Moz.

Heya SEOmoz Community! This is my first post here, trying to share my experience on one of the most widely read and respected SEO blogs, so please go easy on me.

What I want to share today is a little technical, and I am sure most of you may have done something similar. In my own experience working on SEO projects over the last four years, there have been numerous instances where a website undergoes a major revamp, or you take over an ongoing SEO project and discover content indexation issues. The case I am specifically referring to is when you have a large number of old website pages that were never 301 redirected or removed with a 404 (not ideal), or plain old content that lingers on in Google's index because it was merely delinked from the website's internal linking schema (we've all been there, right?). If a large number of such pages accumulate over time and are forgotten about while they continue to reside in the Google index, you will very soon discover indexation issues with new content. Bear in mind that this would likely be the case for very large websites with hundreds of thousands of pages. With that as background, let's dive in!

To identify such pages, I'll break this post into three major parts and go into the details of each.

Part 1: Create a list of all pages on the website (or a sub-section of it) that are discoverable by crawling the entire current internal linking structure/navigation schema. Xenu's Link Sleuth tool and pivot tables come in handy here.

Part 2: Create a list of all pages from that same section of the website that are currently present in the Google index, using the "site:" operator and the SEOmoz toolbar export.

Part 3: Put the two datasets from above together and use the VLOOKUP function in Excel to identify all pages that are present in the Google index but not discoverable in the current internal linking/navigation schema. These are the pages we are after, the ones that might be preventing new content from being indexed.

Ideally, you'll want to 301 redirect all such pages to the most relevant current page on the website; the next best option is to redirect them to the home page. If redirecting a large number of pages creates a technical hurdle, a 404 or a URL removal request in Webmaster Tools might be another option. If you still want to maintain an archive of these pages for your users, you can create an archive section, use the "nofollow" attribute on links pointing to those pages, and add "noindex" to their meta robots tag.
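
If you do end up 301 redirecting a large batch of these pages, the mapping itself can be scripted. Below is a minimal Python sketch that turns a plain text list of orphaned URLs (the kind of list Part 3 of this post produces) into Apache "Redirect permanent" rules. The file names, the hand-curated mapping, and the home-page fallback are all assumptions for illustration; adapt them to your own server and URL structure.

```python
from urllib.parse import urlparse

# Hypothetical inputs: one orphaned URL per line, plus an optional
# hand-curated mapping of old path -> best current page.
ORPHANED_FILE = "orphaned_urls.txt"
CUSTOM_MAP = {
    # "/old-section/old-page.html": "/new-section/new-page.html",
}
FALLBACK_TARGET = "http://www.example.com/"  # next-best option: the home page

with open(ORPHANED_FILE) as f:
    orphaned = [line.strip() for line in f if line.strip()]

rules = []
for url in orphaned:
    old_path = urlparse(url).path or "/"
    target = CUSTOM_MAP.get(old_path, FALLBACK_TARGET)
    # Apache mod_alias syntax; use the equivalent rewrite or map rules
    # if your server is nginx or IIS.
    rules.append(f"Redirect permanent {old_path} {target}")

with open("redirects.conf", "w") as out:
    out.write("\n".join(rules) + "\n")

print(f"Wrote {len(rules)} redirect rules to redirects.conf")
```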

Let's take a random live example for the purpose of this demonstration. I thought of a random keyword, "data warehousing appliances," and picked the first website after Wikipedia. So, our example for this post, for demonstration purposes only, will be Netezza.

For the purposes of this post, we will only use a section of the Netezza website: all pages under the sub-directory "data-warehouse-appliance-products".

The following steps in Part 1 show how to get just the HTML pages from that specific directory into an Excel file.

First, head to the Xenu tool to crawl the website and build a list of its current pages. Here is how you do it.

Enter the website address and uncheck "Check external links". Once the report is ready, export it to a tab-separated file and save it.

Now, import this data into an Excel spreadsheet by going through the following steps.

Open a new Excel spreadsheet and select the file that you just saved above. Then:

Keep everything as shown above and click Next.

Keep everything as shown above and click Next.

You want to keep the Address and Type columns. For all other columns, select "Do not import column (skip)", then click Finish.

Under the Insert tab, click on Pivot Table, select the entire dataset, and choose where you want to place the output.

Now, under the field "Type", check only "text/html".

Under the field "Address", go to Label Filters and select "contains".

Enter the name of the sub-directory you want to check. In this case, we will enter "data-warehouse-appliance-products" as shown below and click OK.

Now check the field "Type" first, followed by "Address".

Now what you have is Part 1 of your required dataset.
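
If you'd rather not click through the import wizard and pivot table every time, the same filtering can be scripted. The sketch below assumes the Xenu export was saved as a tab-separated file named xenu_export.txt with column headers that include "Address" and "Type" (the two columns kept above); the file name, encoding, and sub-directory string are assumptions to adjust for your own crawl.

```python
import csv

XENU_EXPORT = "xenu_export.txt"          # tab-separated Xenu export (assumed name)
SUB_DIRECTORY = "data-warehouse-appliance-products"

part1_urls = set()
# The export is tab-delimited; the header row names the columns.
with open(XENU_EXPORT, newline="", encoding="utf-8", errors="ignore") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        address = (row.get("Address") or "").strip()
        mime_type = (row.get("Type") or "").strip()
        # Same two filters as the pivot table: HTML pages only,
        # and only URLs inside the chosen sub-directory.
        if mime_type == "text/html" and SUB_DIRECTORY in address:
            part1_urls.add(address)

with open("part1_crawled_urls.txt", "w") as out:
    out.write("\n".join(sorted(part1_urls)) + "\n")

print(f"Part 1: {len(part1_urls)} crawlable HTML URLs in /{SUB_DIRECTORY}/")
```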

Part 2:

You want to find all URLs in this sub-directory that are currently present in the Google index. To find this list, use the "site:" operator in Google as shown below.

Either append "&filter=0" to the SERP URL or click on "repeat the search with the omitted results included" at the bottom of the SERPs to make sure Google gives you all the URLs from this sub-directory in its index.

Now use the SEOmoz toolbar to export all pages into a CSV.

Again, import the data into an Excel spreadsheet as done above, but make sure to select "Comma" as the delimiter this time. You only need the "URL" column, so keep it and delete the rest.

Now what you have is Part 2 of your required dataset.
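
As with Part 1, the comma-delimited export can be read with a few lines of script instead of the import wizard. This sketch assumes the toolbar export was saved as serp_export.csv and that the column you kept is headed "URL"; both names are assumptions, so match them to your actual file.

```python
import csv

SERP_EXPORT = "serp_export.csv"   # SEOmoz toolbar CSV export (assumed name)

indexed_urls = set()
with open(SERP_EXPORT, newline="", encoding="utf-8", errors="ignore") as f:
    for row in csv.DictReader(f):          # comma-delimited by default
        url = (row.get("URL") or "").strip()
        if url:
            indexed_urls.add(url)

with open("part2_indexed_urls.txt", "w") as out:
    out.write("\n".join(sorted(indexed_urls)) + "\n")

print(f"Part 2: {len(indexed_urls)} URLs found in Google's index")
```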

Part 3:

Now open a new worksheet and enter the URL list from Part 2 (Google's index) on the left and the list of URLs from Part 1 (current URLs on the website) on the right. Use the VLOOKUP function in Excel (see the formula in the screenshot) to find all the URLs (those showing #N/A under the heading "URL Found?" in the screenshot) that reside in the Google index but aren't currently discoverable in the internal linking structure of the website.

The URLs in the "URLs from Google index" column that correspond with an #N/A in the "URL Found?" column are your list!
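
The VLOOKUP comparison also reduces to a simple set difference if you prefer a script. The sketch below reads the two text files written in the earlier sketches (assumed names) and writes out every URL that Google has indexed but that the Part 1 crawl could not reach, mirroring the #N/A rows in the spreadsheet.

```python
def load_urls(path):
    """Read one URL per line, ignoring blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

indexed = load_urls("part2_indexed_urls.txt")   # Part 2: Google's index
crawled = load_urls("part1_crawled_urls.txt")   # Part 1: internally linked pages

# URLs present in the index but not reachable through the current
# internal linking structure -- the spreadsheet's #N/A rows.
orphaned = sorted(indexed - crawled)

with open("orphaned_urls.txt", "w") as out:
    out.write("\n".join(orphaned) + "\n")

print(f"{len(orphaned)} indexed URLs are no longer internally linked")
```

As with VLOOKUP, this is an exact string match, so normalize obvious variations (trailing slashes, http vs. https, www vs. non-www) in both lists before comparing, or legitimate pages will show up as false positives.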

I'd love to hear your feedback, or about any similar issues you have experienced and the solutions you devised.

17 Comments

Great post! However, I was just wondering: since the site: operator does not provide accurate data, would it make more sense to use the data provided by Google WMT, i.e., the number of pages indexed from our XML sitemap?

Also, for websites with a large number of pages, implementing sectional sitemaps is probably the best way to understand indexation issues at the root level.
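
For what it's worth, splitting a URL list into sectional sitemaps is easy to script. The sketch below is purely illustrative: it groups URLs by their first path segment and writes one sitemap per section using the standard sitemap XML format, so submitted vs. indexed counts can be compared section by section in Webmaster Tools. The master file name and the grouping rule are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse
from xml.sax.saxutils import escape

# Group URLs by their first path segment so indexation can be
# tracked section by section in Webmaster Tools.
sections = defaultdict(list)
with open("all_urls.txt") as f:              # hypothetical master URL list
    for line in f:
        url = line.strip()
        if not url:
            continue
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = segments[0] if segments else "root"
        sections[section].append(url)

for section, urls in sections.items():
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
    with open(f"sitemap-{section}.xml", "w") as out:
        out.write(xml)
    print(f"sitemap-{section}.xml: {len(urls)} URLs")
```

Keep in mind that a single sitemap file is capped at 50,000 URLs, so very large sections would still need to be split further and tied together with a sitemap index.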

I think Google's crawler is the best bot in the world, and whatever you submit in your sitemap will eventually be crawled if you do not have any indexing issues.

Now, my point is: if the site: operator does not give you the exact result, you cannot call it wrong, as it only displays whatever is indexed and outputs only that. So I would request that everyone please STOP saying the site: operator does not give accurate data.

Useful post, but it's all contingent upon getting that list of indexed URLs, which can be difficult for a large site.

There are numerous sites offering to check 20-50 URLs for index status (search for "bulk indexation checker"), but none available to check the thousands of pages on a typical enterprise site.

Why not just trust WMT for the list? A) Large, complex sites have too many pages excluded via robots or noindex tags, and important pages can slip through the cracks. More importantly, B) I do not trust WMT data.

Why not just use Screaming Frog, IIS, SEOtools for Excel, Xenu, or another site scraper to check if a cached copy exists? Seems simple: just plug in a list of URLs like http://webcache.googleusercontent.com/search?q=cac... and report which throw 404 errors. Unfortunately, Google dislikes this and forces CAPTCHAs after 30 or so.

After much research, I have only uncovered two ways to check large #s of URLs:

Thanks, Abdul, for the smart way to find the filtered URLs creating the indexation issues... but I also have the same question asked by SEOmoz user sazeet, i.e., can't we use the XML sitemap data in Google Webmaster Tools to find the indexed URLs, since we all believe the site: operator is not accurate enough to show exact results? Please reply, sir...

Hi Matthew - it depends. If you plan on reusing expired URLs (for example, on ecommerce sites), it makes sense not to redirect. But if the content has expired and won't come back, you might want to transfer value to your new content so it picks up without too much lag time.

We are currently working on getting an e-commerce site with c.700K products and pages fully indexed. Having submitted sitemaps and watched the indexed pages climb to 280K, the technical department implemented a new URL formula, 301 redirecting the links on the sitemaps to the new versions!

Now the strange thing is that while 30K or more of the previously indexed URLs now show as no longer being indexed, others that hadn't yet been included are being indexed for the first time.

I can understand that if the pages are 301ing, Google might choose to index the target page instead, but why then are other old-format pages continuing to be indexed?

Anyhow, this guide has given me more to think about as I work through the issues, thanks.

Sorry for taking so long to respond. With the site in question we have re-written all of the URLs to remove dynamic elements, duplication, etc.

We have created new sitemaps and a sitemap index with the new URLs and submitted these via Webmaster Tools.

Webmaster Tools reported a steadily increasing number of pages being indexed, up to a maximum of 280,000 pages, but has subsequently started de-indexing pages and the number indexed now stands at 145,000.

There is no obvious reason for the pages not being listed; some that remain in the index are deeper and less obviously linked than some that have been removed.

Is this simply a scenario where, so long as Google can see the whole site, we just leave it up to them to index what they want, or is there something we can do to instigate more indexation?

If a large number of such pages accumulate over time and are forgotten about while they continue to reside in the Google index, you will very soon discover indexation issues with new content.

Hi Brett - there is a limited amount of link equity that's distributed among the pages of a website. If older content competes for that limited amount with new content, search engines will take longer to index new pieces of content, unless links can grow proportionately with the addition of new content.

Interesting post! It's always great to see how people are incorporating Excel into their strategies to make SEO work easier. Your post works great for websites (especially blogs) that have been around for a long time and produce content on a daily basis.

301ing the expired pages to the most appropriate pages is a must-do, as besides Google (and search engines in general), this is great from the user's point of view as well…