How to control crawling with robots.txt

What is crawling, and how do you control it? With robots.txt.

When users surf the web, search engines make content available to them through two main processes: crawling and indexing.

The former takes place when search engine crawlers access publicly available webpages; in essence, it involves looking at those pages and following the links on them.

Indexing, on the other hand, means gathering information about a page, so that it is made available through search engine results.

The problem with crawling is that sometimes you may not want crawlers to access certain areas of your website, such as pages that consume limited server resources. That's where the robots.txt file comes in.

What is the robots.txt file and why is it so important?

It is a text file that lets you specify how you'd like your site to be crawled. Crawlers generally read the robots.txt file on your website before crawling it, and the file's value lies in letting you state which parts of the site may and may not be crawled.

It’s so important because it allows you to control access to the files and directories on your server. It’s like an electronic NO TRESPASSING sign: it tells Googlebot and other crawlers which files and directories on your server should not be crawled. Note that robots.txt controls crawling, not indexing; a blocked page can still appear in search results if other sites link to it, so to keep a page out of results entirely, use a noindex directive instead.
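As a sketch, a minimal robots.txt looks like this. The directory names here are purely illustrative; each Disallow rule blocks any URL whose path starts with the given prefix, and `User-agent: *` applies the rules to all crawlers:

```
User-agent: *
Disallow: /admin/
Disallow: /tmp/
```

To block a single crawler instead, replace `*` with its user-agent name (for example, `Googlebot`); an empty `Disallow:` line means nothing is blocked.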

What is the file’s location?

In order for it to be valid, it must be located at the root of the website host.


For example, in order to control crawling on all URLs below http://www.yoursite.com/, the robots.txt file must be located at http://www.yoursite.com/robots.txt.

A robots.txt file can be placed on a subdomain (http://website.yoursite.com/robots.txt) or on a non-standard port (http://yoursite.com:8181/robots.txt), but it cannot be placed in a subdirectory (http://yoursite.com/pages/robots.txt).
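If you want to check how a given set of rules will be interpreted, Python's standard library ships a robots.txt parser. This is a small sketch using illustrative rules and URLs, not a definitive implementation:

```python
# Check whether URLs may be crawled under a given robots.txt,
# using Python's standard urllib.robotparser module.
from urllib.robotparser import RobotFileParser

# Illustrative rules; in practice you would fetch them from
# http://www.yoursite.com/robots.txt at the host's root.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A path under the disallowed prefix is blocked for all crawlers.
print(parser.can_fetch("*", "http://www.yoursite.com/private/page.html"))  # False
# Any other path remains crawlable.
print(parser.can_fetch("*", "http://www.yoursite.com/public/page.html"))   # True
```

Instead of calling `parse()` on an inline string, you can call `set_url()` with the live robots.txt URL followed by `read()` to fetch and parse the real file from your site.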