Robot Text files

In a nutshell

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
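The check a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration, not how any particular crawler is implemented; the helper name is_allowed and the user agent "ExampleBot" are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical robot name; "*" applies to every robot.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html"))
# False
```

A real robot would first download the file from the site and then run the same check before each request.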

There are two important considerations when using /robots.txt:

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.

The /robots.txt file is publicly available. Anyone can see which sections of your server you don’t want robots to use.

So don’t try to use /robots.txt to hide information.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash), and puts “/robots.txt” in its place.

For example, for “http://www.example.com/shop/index.html”, it will remove the “/shop/index.html”, replace it with “/robots.txt”, and end up with “http://www.example.com/robots.txt”.
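The path replacement described above can be sketched in a few lines of Python using the standard urllib.parse module. The function name robots_url is an assumption made for this illustration; dropping the query string and fragment along with the path is also an assumption, since the text only mentions the path.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path with /robots.txt and
    # drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/index.html"))
# http://www.example.com/robots.txt
```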

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.

Understand the limits of robots.txt

Before you build your robots.txt, you should know the risks of relying only on this URL blocking method. At times, you might want to consider other mechanisms to ensure your URLs are not findable on the web.

Ensure private information is safe

The commands in robots.txt files are not rules that any crawler must follow; instead, it is better to think of these commands as guidelines. Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, but other crawlers might not. Therefore, it is very important to know the consequences of sharing the information that you block in this way. To keep private information secure, we recommend using other blocking methods, such as password-protecting private files on your server.

Use the right syntax for each crawler

Although respectable web crawlers follow the directives in a robots.txt file, some crawlers might interpret those directives differently. You should know the proper syntax for addressing different web crawlers as some might not understand certain instructions.
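One common way to address crawlers individually is to give each its own User-agent group. The sketch below uses a hypothetical robots.txt (the /private/ path and the "OtherBot" name are made up) and Python's standard urllib.robotparser to show how one parser reads such groups. It reflects only this parser's interpretation; other crawlers may read the same file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for a named crawler, one for all others.
PER_CRAWLER_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed_for(user_agent: str, url: str) -> bool:
    """Return True if the file above lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(PER_CRAWLER_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The named group applies to Googlebot; every other robot falls back to "*".
print(allowed_for("Googlebot", "http://www.example.com/index.html"))      # True
print(allowed_for("Googlebot", "http://www.example.com/private/a.html"))  # False
print(allowed_for("OtherBot", "http://www.example.com/index.html"))       # False
```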

Block crawlers from references to your URLs on other sites

While Google won’t crawl or index the content blocked by robots.txt, we might still find and index information about disallowed URLs from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Search results completely by using your robots.txt in combination with other URL blocking methods, such as password-protecting the files on your server, or inserting meta tags into your HTML.
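The text above does not say which meta tags it means. A widely used one is the robots meta tag, for example <meta name="robots" content="noindex">, and the sketch below assumes that is the tag intended. It uses Python's standard html.parser to check whether a page carries such a tag; the class and function names and the sample page are made up for the illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for item in attr.get("content", "").split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in html."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Hypothetical page used only for this example.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Hi</body></html>'
print("noindex" in robots_directives(PAGE))  # True
```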