Does Google Respect Robots.txt NoIndex and Should You Use It?

The availability of the Robots.txt NoIndex directive is little known among webmasters largely because few people talk about it. Matt Cutts discussed Google’s support for this directive back in 2008. More recently, Google’s John Mueller discussed it in this Google Webmaster Hangout. In addition, Deepcrawl wrote about it on their blog.

Given the unique capabilities of this directive, the IMEC team decided to run a test. We recruited 13 websites willing to take pages on their sites and attempt to remove them from the Google index using Robots.txt NoIndex. Eight of them created new pages, and five offered up existing pages. We waited until all 13 pages were verified as being in the Google index, and then we had the webmasters add the NoIndex directive for that page to their Robots.txt file.

This post will tell you whether or not it works, explore how it gets implemented by Google, and help you decide whether or not you should use it.

Difference Between Robots Metatag and Robots.txt NoIndex

This is a point that confuses many, so I am going to take a moment to lay it out for you. When we talk about a Robots Metatag, we are talking about something that exists on a specific webpage. For example, if you don’t want your www.yourdomain.com/contact-us page in the Google index, you can put the following code in the head section of that webpage:

<meta name="robots" content="noindex">

You can use this directive on each page of your website that you don’t want indexed. Once Google recrawls the page and sees the directive, it should remove the page from its index. However, implementing this directive, and having Google (or Bing) remove the page from the index, does not tell them not to recrawl the page. In fact, they will continue to crawl the page on an ongoing basis, though search engines may, over time, choose to crawl it somewhat less often.

A common mistake that many people make is implementing a Robots Metatag on a page while also blocking crawling of that page in Robots.txt. The problem with this approach is that search engines can’t read the Robots Metatag if they are told not to crawl the page.
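As a hypothetical illustration (the /contact-us/ path is invented), the self-defeating combination looks like this:

```text
# In robots.txt – blocks Googlebot from fetching the page at all:
Disallow: /contact-us/

<!-- In the head of /contact-us/ – never seen, because crawling is blocked: -->
<meta name="robots" content="noindex">
```

Because the Disallow rule prevents the page from being fetched, the metatag on it can never be read, and the page can linger in the index.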

In contrast, a NoIndex directive in Robots.txt works a bit differently. If Google in fact supports this type of directive, it would allow you to block crawling of a page and NoIndex it at the same time. You would do that by adding directive lines to your Robots.txt file similar to these two:

Disallow: /page1/
Noindex: /page1/

Since reading the NoIndex instruction does not require loading the page, the search engine would be able to both keep it out of the index AND not crawl it. This is a powerful combination! There is one major downside that would remain though, which is that the page would still be able to accumulate PageRank, which it would not be able to pass on to other pages on your site (since crawling is blocked, Google can’t see the links on the page to pass the PageRank through).

Raw Results

On September 22, 2015 we asked the 13 sites to add the NoIndex directive to their Robots.txt file. All of them complied with this request, though one of them had a problem with it: implementing the directive caused their web server to hang. Since this server crash was unexpected, I tested this on stonetemple.com for a given page, and it also caused a problem for our server, resulting in a server error message.

Later on, I retested this, and the problem dawned on me. I had implemented the NoIndex directive in my .htaccess file instead of robots.txt. Evidently, this will hang your server (or at least some servers). But it was an “Operator Error”! I have since tested implementing it in Robots.txt without any problems. However, I discovered this only after the testing was completed, so that means we had 12 sites in the test.

Our monitoring lasted for 31 days. Over this time, we tested each page every day to see if it remained in the index. Here are the results that we saw:

Now that’s interesting! 11 of the 12 pages tested did drop from the index, with the last two taking 26 days to finally drop out. This is clearly not a process that happens the moment that Google loads Robots.txt. So what’s at work here? To find out, I did a little more digging.
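The daily check we ran can be sketched as a simple loop. This is hypothetical code, not what we actually ran: the `is_indexed` callback stands in for whatever index check you use (for example, a manual info: or site: query per page).

```python
from datetime import date, timedelta

def monitor(pages, is_indexed, start, days):
    """Record, for each page, whether it was indexed on each day of the window.

    `is_indexed(page, day)` is a caller-supplied check; returns
    {page: [(iso_date, indexed_bool), ...]}.
    """
    history = {page: [] for page in pages}
    for offset in range(days):
        day = start + timedelta(days=offset)
        for page in pages:
            history[page].append((day.isoformat(), is_indexed(page, day)))
    return history

def days_until_dropped(page_history):
    """Number of days until the page first disappeared, or None if it never did."""
    for i, (_, indexed) in enumerate(page_history):
        if not indexed:
            return i
    return None
```

With a 31-day window, a page that never returns False from the check comes back as None, which is how the one holdout in our test would show up.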

Speculation on What Google is Doing

My immediate speculation was that Google only executes the Robots.txt NoIndex directive at the time it recrawls the page. To test that, I dug into the log files of some of the sites in the test. The first thing you notice is that Googlebot loads the Robots.txt files for these sites many times per day. What I did next was review the log files for two of the sites, starting with the day the site implemented the Robots.txt NoIndex, and ending on the day that Google finally dropped the page from the index.
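Reviewing the logs amounts to counting Googlebot fetches of Robots.txt and of the target page, by day. Here is a minimal sketch of that tally; the combined log format and the paths are assumptions, so adjust the pattern to your server’s format:

```python
import re
from collections import Counter

# Combined-log-format line: host, identd, user, [date], "request", status,
# bytes, "referer", "user-agent". We capture date, request path, and UA.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(\d{2}/\w{3}/\d{4})[^\]]*\] '
    r'"(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def googlebot_hits(log_lines, target_path):
    """Per-day counts of Googlebot fetches of /robots.txt and the target page."""
    counts = {"/robots.txt": Counter(), target_path: Counter()}
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        day, path, user_agent = m.groups()
        if "Googlebot" in user_agent and path in counts:
            counts[path][day] += 1
    return counts
```

Running something like this over the monitoring window makes the pattern below easy to see at a glance: steady Robots.txt fetches, with or without fetches of the page itself.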

The goal was to test my theory, but what I saw surprised me. First was the site whose page never came out of the index during the time of our test. For that one, I was able to get the log files from September 30 through October 26. They showed Googlebot fetching Robots.txt regularly and crawling the target page five times during that window.

Remember that for this site, the target page was never removed from the index. Next, let’s look at the data for one of the sites where the target page was removed from the index.

Now that’s interesting. Robots.txt is regularly accessed as with the other site, but the target page was never crawled by Google, and yet it was removed from the index. So much for my theory!

So then, what is going on here? Frankly, it’s not clear. Google is not acting on the NoIndex directive every time it reads the Robots.txt file, nor is it under any obligation to do so. That’s what led to my speculation that they might wait until they next crawl the page and consider NoIndex-ing it then, but clearly that’s not the case.

After all, the two sets of log files I looked at both contradicted that theory. On the site that never had the target page removed from the index, the page was crawled five times during the test. For the other site, where the page was removed from the index, the target page was never crawled.

What we do know is that there is some conditional logic involved; we just don’t know what those conditions are.

Summary

Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index. That’s pretty useful in concept. However, our tests didn’t show 100 percent success, so it does not always work.

Further, bear in mind, even if you block a page from crawling AND use Robots.txt to NoIndex it, that page can still accumulate PageRank (or link juice if you prefer that term). And PageRank still matters: the latest Moz Ranking Factors survey still weighs different aspects of links as the two most important factors in ranking.

In addition, don’t forget what John Mueller said, which was that you should not depend on this approach. Google may remove this functionality at some point in the future, and the official status for the feature is “unsupported.”

When to use it, then? Only in cases where 100 percent removal is not a total necessity, and where you don’t mind losing the PageRank that flows to the pages you are removing in this manner.

Comments

Hi Matthew – it does not look like we saw any clear difference there. Of course, we only looked at 12 pages, so I wouldn’t be able to draw a firm conclusion. But, of the 12 pages we looked at, some NoIndexed quickly and some took a long time, from both classes of pages (new and existing).

I would bet the time it takes to deindex a page is similar to the time it takes to index a page – that is, it’s probably related to how popular the site is.

You’ve probably noticed that a new page on reddit (for instance) can often be indexed in Google within minutes of its creation, whereas a page on another random website can take hours or even days. I’m guessing this would also ring true for deindexation, but people aren’t intentionally deindexing their pages that often so it wouldn’t be so well known.

A month ago we tested this issue and asked John Mueller whether it would just be a validation error in Search Console, or whether Google would respect this directive. He answered (https://goo.gl/ge1yBB) that the directive is not officially supported and that it isn’t recommended to rely on it. But our tests were successful too, like yours.

When to use this feature? In those occasional cases where a CMS won’t allow you to NoIndex a page in the traditional way. I’ve found that with some enterprise-level CMSs it can be difficult to NoIndex a single page without development work. That may be the only time that using the Robots.txt NoIndex is a better option than the traditional methods (HTTP header or meta tag).
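For reference, the HTTP-header method mentioned here is the X-Robots-Tag response header, which can be set at the server level without touching page templates. A sketch for Apache, assuming mod_headers is enabled (the PDF pattern is just an example):

```text
# Serve "noindex" via the X-Robots-Tag HTTP header for all PDF files –
# handy when a CMS makes per-page meta tag edits difficult.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Unlike the meta tag, this also works for non-HTML resources such as PDFs and images, which have no head section to put a tag in.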

Good test. I recently ran a similar one but with nosnippet and noarchive – the jury is still out on noarchive, but it looks as though Google also obeys nosnippet in robots.txt. It may be worth running a larger scale test on all robots directives with IMEC?

Very interesting study. Just curious what role, if any, XML Sitemaps played in the test – i.e., were the pages that were successfully dropped from the index listed in a Sitemap referenced in the robots.txt, and vice versa?

Hi Eric, I think John Mueller and other Googlers have said many times on record that when it comes to removing a page, they want to know the webmaster’s absolute intention. Google doesn’t act the first time on what a webmaster has said through such code. For example, even a 301 redirect takes some time to take effect. Google waits to confirm that the webmaster hasn’t done it accidentally.

These are to be used prior to Googlebot spidering, which is why you see a delay in the results. To remove a URL from a website verified in Search Console, you would use the Google URL removal tool, not a noindex meta tag… that is for pages that require canonicals or have limited visibility and an expiration date…

robots.txt would be used to block directories, not pages; of course, this must be done prior to launching the website, or you would use the parameter filter tools in Search Console, no?

I have some pages on my site that display backlink reports for some of my clients’ competitors.

All of these pages were noindexed, nofollowed and disallowed in robots.txt. There were no links to these pages from any other page on my site and none were included in my sitemap. None of the pages appeared in Google’s index.

Last week, I received a warning from Google for malicious links on these pages (now resolved).

It seems it doesn’t matter much how you instruct Googlebot, it will still crawl every page it can possibly find.

I have now put these pages behind a login which is probably what I should have done in the first place. 🙂

I recently found a mistake on a website where a meta noindex tag was used on a product page that was in the main navigation. I checked the index and the page was still there. So now I’m wondering if multiple PPC landing pages with duplicate content on the same topic are hurting the website even though they have noindex tags on them.

Interesting to hear that the page was still in the index. The NoIndex tag is supposed to be a directive, which means that Google is obligated to respect it. How did you verify that the page was still in the index?

Great post. Just want to make sure I understand one of the details here. I was perhaps under the wrong impression that any page blocked from robots.txt with a disallow:/* will NOT pass any equity it accumulates to the rest of the site. Sure, that page itself may have some PageRank, but with no one allowed to crawl or see the links on the page, it stops there and does not benefit the rest of your site. Is that not the case? Something you said above made me think that equity would be passed on to the rest of the site.

“Further, bear in mind, even if you block a page from crawling AND use Robots.txt to NoIndex it, that page can still accumulate PageRank (or link juice if you prefer that term). And PageRank still matters.”

Do you mean “block” via the on-page meta declaration, or block within Robots.txt?

I was referring to within robots.txt. Agreed on the point that they can still accumulate PageRank, but at least they’re out of the index! This helps a lot when we decide we want to remove pages from the index and relieve crawl budget at the same time.

Thanks for the reply! Does that mean I shouldn’t be worried about the passage of equity from pages that are blocked via disallow in robots.txt? Is google still able to see the links on that page and pass equity to them? I get that equity would still pass to let’s say /page

but if /page is blocked in robots and links internally to /page1, /page2, and /page3, wouldn’t 1, 2, and 3 miss out on the equity behind /page if crawlers are disallowed?

Sorry if I’m beating the same point over and over. I just worry that using robots.txt as my main weapon is leaving equity on the table where an on-page robots tag would likely be better.