A robots.txt File Guide That Won’t Put You to Sleep

Let’s face it. Not all SEO topics are sexy and fun. In fact, some topics are just flat-out boring, and unfortunately, the robots.txt file falls into that category. But despite its boring nature, the robots.txt file is still very important for webmasters, and we need to discuss it. So with that in mind, I have listed a few helpful questions (and answers) that pertain to this file. I can’t promise this will be the most entertaining thing you’ve ever read, but I can promise it’ll be the most entertaining robots.txt file guide you’ve ever read. Now, let’s get started.

What is a robots.txt file?

A robots.txt file is a text file (hence the .txt suffix) that sits on your Web server and tells search engine crawlers which pages you would like them NOT to index. Reread that sentence because it is very important. In fact, I’m going to break it down like a third-grade English teacher. First, the robots.txt file is for search engine crawlers – not users. Second, the information in the file is simply a statement of your preference. Crawlers don’t actually have to abide by it. Fortunately, the well-known search engines (e.g., Google, Bing, and Yahoo!) will respect your robots.txt file, but it’s important to understand that not every crawler plays by the rules. Finally, the file is used to help keep sections of your website out of the search engines’ indexes. In other words, if you want something to appear in the search engines, you don’t want it disallowed in your robots.txt file.

I don’t have a robots.txt file. Is my website going to explode?

Yes. Oh, wait. I read that wrong. No, your website will not explode without a robots.txt file. However, your server logs will show a 404 HTTP status code every time a search engine crawler attempts to access the missing robots.txt file. This isn’t the end of the world, but I like to avoid 404s whenever I can. Consequently, every site I own has a robots.txt file (even if it’s just a bare-bones one).

I feel left out. How do I create my very own robots.txt file?

You have two options. You can create one from scratch manually, or you can have a robots.txt generator create one automatically. (Technically, you could use a generator and then manually edit its output, which would give you a third option. But no one likes a smart ass :-P)

Now, for those of you brave enough to create your very own robots.txt file from scratch (don’t worry: it’s really easy), let’s begin. A robots.txt file consists of one or more records, and each record contains at least two fields. Here’s a very simple example that will allow all crawlers to index every page on your website:

User-agent: *
Disallow:

This example has a single record, which contains two fields: User-agent and Disallow. The User-agent field is used to specify which crawler should process this record, and the Disallow field is used to specify which section of the website the specified crawler should avoid indexing. In this example, “*” is used to specify all crawlers, and the Disallow field is left blank to allow all sections of the website to be indexed. Now, let’s look at an example that will disallow all crawlers from indexing any page on your website:

User-agent: *
Disallow: /

In this example, we simply added a “/” (this represents the root directory on your site) to the Disallow field, and it completely changed the file’s meaning. This new file disallows all sections of the website from being indexed. In many cases, you’ll want something in between these two extremes, so consider a slightly more complicated example that uses multiple records and multiple Disallow fields.

In this example, we have two records. The first record applies to the Google crawler (“Googlebot”), and it disallows indexing the content stored under the /do-not-index-me/ and /me-either/ directories. The second record applies to the Bing crawler (“BingBot”), and it disallows indexing three different directories.
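The two records described above might look something like the file embedded in this Python sketch. The Googlebot paths come straight from the description; the three BingBot directory names are placeholders invented for illustration, since the text doesn’t name them. Python’s standard-library urllib.robotparser is used here only to show how a rule-following crawler would read the file, and other parsers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Googlebot paths are taken from the description above.
# The three BingBot directories are hypothetical placeholders.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /do-not-index-me/
Disallow: /me-either/

User-agent: BingBot
Disallow: /placeholder-one/
Disallow: /placeholder-two/
Disallow: /placeholder-three/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each record only applies to the crawler named in its User-agent field.
print(parser.can_fetch("Googlebot", "/do-not-index-me/page.html"))  # False
print(parser.can_fetch("Googlebot", "/placeholder-one/page.html"))  # True
print(parser.can_fetch("BingBot", "/placeholder-one/page.html"))    # False
# A crawler that matches no record is not restricted at all.
print(parser.can_fetch("SomeOtherBot", "/me-either/page.html"))     # True
```

Notice that a crawler with no matching record (and no “*” record to fall back on) is allowed everywhere.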

I have a robots.txt file, but how do I know if it’s syntactically correct?
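One rough way to sanity-check a file yourself is a tiny script like the Python sketch below. It is not an official validator: the function name lint_robots_txt and the list of known fields are assumptions for illustration, it only knows the four fields mentioned in this guide, and it assumes records are separated by blank lines.

```python
from typing import List

# Only the fields discussed in this guide; real crawlers may accept more.
KNOWN_FIELDS = {"user-agent", "disallow", "crawl-delay", "sitemap"}


def lint_robots_txt(text: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line
        field, separator, value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            if not value.strip():
                problems.append(f"line {number}: empty User-agent value")
            in_record = True
        elif field in ("disallow", "crawl-delay") and not in_record:
            problems.append(f"line {number}: '{field}' appears before any User-agent line")
    return problems


print(lint_robots_txt("User-agent: *\nDisallow:\n"))  # []
print(lint_robots_txt("Disallow: /\n"))
print(lint_robots_txt("User agent *\n"))
```

An empty list only means this sketch found nothing to complain about; it doesn’t prove that every crawler will read the file the way you intended.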

I have a syntactically correct robots.txt file, but where do I put this thing?

You want to put it in your Web server’s root directory. If your site is http://www.yoursite.com/, you should be able to access the robots.txt file at http://www.yoursite.com/robots.txt (obviously, replace www.yoursite.com with your actual site).
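If you ever need to work out that location in code, the rule is simply “same scheme and host, path /robots.txt.” Here’s a minimal Python sketch; the helper name robots_url is made up for this example, and www.yoursite.com is the same placeholder used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yoursite.com/blog/post.html?id=7"))
# http://www.yoursite.com/robots.txt
```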

If I disallow access to something in robots.txt, does that mean it’s completely hidden from everyone?

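No, not at all. Disallowing something in robots.txt only asks well-behaved crawlers to stay away from it, which influences whether or not it shows up in search engines. It doesn’t prevent any of your content from being viewed by users or by crawlers that ignore the file, so don’t treat it as a security measure.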
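To make the “it’s only a request” point concrete, here’s a small Python sketch. The function will_request and the crawler names PoliteBot and RudeBot are hypothetical; the file being parsed is the “disallow everything” example from earlier.

```python
from urllib.robotparser import RobotFileParser

# The "disallow everything" file shown earlier in this guide.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])


def will_request(url: str, user_agent: str, respects_robots: bool) -> bool:
    """Decide whether a hypothetical crawler goes ahead and requests url."""
    if not respects_robots:
        # Nothing in robots.txt can stop a crawler that never consults it.
        return True
    return parser.can_fetch(user_agent, url)


print(will_request("/private/report.html", "PoliteBot", respects_robots=True))  # False
print(will_request("/private/report.html", "RudeBot", respects_robots=False))   # True
```

The check lives entirely in the crawler’s own code. Your server never enforces it, so anything that truly needs to stay private needs real access control.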

Are there any other swanky fields I can throw into my robots.txt file?

Now that you mention it, there are additional fields that you can use, but I can’t comment on their swankiness. You can use Crawl-delay to specify how many seconds a crawler should wait between successive requests to your website, and you can use Sitemap to specify the location of your XML Sitemap.
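Here’s a sketch of a file that uses both fields, along with how Python’s standard-library parser reads them back. The sitemap URL is a made-up placeholder, and the 15-second delay is just an example value. As far as I know, crawl_delay() needs Python 3.6 or later and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The sitemap URL is a placeholder; point it at your real XML Sitemap.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 15
Disallow:

Sitemap: http://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("AnyBot"))                  # 15
print(parser.site_maps())                            # ['http://www.yoursite.com/sitemap.xml']
print(parser.can_fetch("AnyBot", "/any/page.html"))  # True
```

Crawl-delay belongs to a record (it sits under a User-agent line), while Sitemap stands on its own and isn’t tied to any particular crawler.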

About The Author

Steve Webb is an SEO audit specialist at Web Gnomes. He received his Ph.D. from Georgia Tech, where he published dozens of articles on Internet-related topics. Professionally, Steve has worked for Google and various other Internet startups, and he's passionate about sharing his knowledge and experiences with others. You can find him on Twitter, Google+, and LinkedIn.

Comments

That was really helpful. I wanted to know what other things you can throw into your robots.txt file, like the Crawl-delay and Sitemap fields you mentioned. Can you show an example of these? What other kinds of fields can we use?

The two most popular robots.txt fields that are not “standardized” are:

Crawl-delay – specifies the number of seconds a crawler should wait between successive requests to your website. Here’s an example:

User-agent: *
Crawl-delay: 15

This example tells all search engine crawlers that you’d like them to wait 15 seconds in between requests to your site. Specifically, if a crawler requests page A, you want it to wait 15 seconds before it requests page B.
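From the crawler’s side, honoring that request boils down to a little arithmetic. Here’s a minimal Python sketch, assuming the crawler tracks when it last hit your site; the function name seconds_until_next_request is hypothetical, and real crawlers schedule their requests in their own ways.

```python
def seconds_until_next_request(last_request_at: float, now: float, crawl_delay: float) -> float:
    """Return how long a polite crawler should still wait before its next request."""
    elapsed = now - last_request_at
    # Never return a negative wait: once the delay has passed, go ahead.
    return max(0.0, crawl_delay - elapsed)


# Page A was requested at t=100s; the crawler wants page B at t=104s.
print(seconds_until_next_request(100.0, 104.0, 15))  # 11.0
# By t=120s the 15-second delay has already passed.
print(seconds_until_next_request(100.0, 120.0, 15))  # 0.0
```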
