In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

Note that there is nothing in there about SEO. That was not the original purpose of the standard. The original purpose was to give site owners a way to tell robots not to visit certain pages.

So Google, it can be argued, is technically following the original intent and purpose of this. If you have a robots.txt file blocking robots, then Googlebot will not crawl your site*.
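To make that concrete, here is a minimal sketch of how a well-behaved robot reads such a file, using Python's standard `urllib.robotparser`. The robots.txt content, the `may_fetch` helper and the example.com URL are made up for illustration; this is not Google's actual implementation, just the same rules applied by a stock parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans every robot from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant robot asks before every request and stays away when told to.
print(may_fetch(ROBOTS_TXT, "Googlebot", "https://example.com/private/page.html"))  # False
```

Note that the answer is only about fetching. Nothing in the file says anything about whether the URL may be listed anywhere.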

But – and this is the crucial bit – Google may still list your site using information from other sources, such as links pointing to it.

And also – note the “Important” box on that Google page – you have to make sure the Googlebot is not blocked from seeing that meta tag by something, like, say, for the sake of example, a robots.txt file.

So wait, in order to stop Google listing my site …

… you have to make damn sure Google can crawl your site. Exactly.

That’s Freaking Insane!

Glad I’m not the only one that thinks that.

TL;DR

The robots.txt standard was created to stop bots from accessing parts of your site; it had nothing to do with SEO. Over the years, this point got confused. Now Google, while arguably following the standard to the letter, has created a slightly insane situation with regard to what you must actually do to make sure your site isn’t listed in Google.
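If you want to check your own setup for this trap, something like the following rough sketch could help. It assumes you already have the robots.txt text and the page HTML in hand; `RobotsMetaFinder`, `noindex_status` and the sample strings are hypothetical names and data made up for this example, and it only looks at meta tags in the page itself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries a robots meta tag containing noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Only meta tags addressed to robots in general or to Googlebot count.
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def noindex_status(robots_txt, page_html, url, user_agent="Googlebot"):
    """Say whether a noindex tag exists and whether a crawler could ever read it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(page_html)
    if not finder.noindex:
        return "no noindex tag on the page"
    if not rules.can_fetch(user_agent, url):
        return "noindex present, but the crawler is blocked and never sees it"
    return "noindex present and visible to the crawler"


# Made-up sample data: one page, two alternative robots.txt files.
PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

print(noindex_status(BLOCK_ALL, PAGE, "https://example.com/page.html"))
print(noindex_status(ALLOW_ALL, PAGE, "https://example.com/page.html"))
```

The first call reports the trap described above: the tag is there, but the robot is not allowed to fetch the page that contains it. The second reports the combination that actually works.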

* Really, Google do honour robots.txt. To the point that if you try and import an ics feed from a URL to your Google Calendar, and that URL is covered by a robots.txt file that bans robots … Google Calendar will just point-blank refuse.