Robots.txt (Web Crawling) Explained

This article is a brief introduction to robots.txt. Search engines use web crawlers (also known as web spiders or bots) to crawl the internet and index massive amounts of data. These automated scripts visit websites and index their pages to serve up-to-date search results.

This article assumes a WordPress blog, where the robots.txt file is placed in the root directory, but robots.txt works the same way on all platforms and websites.

What is robots.txt?

Robots.txt is a tiny text file that resides in the root directory of a blog or website. Its purpose is to tell web crawlers which files and folders they may or may not crawl.
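For a sense of what the file looks like, here is a minimal sketch (the /private/ path is just an illustration, not a required entry):

```
User-agent: *
Disallow: /private/
```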

Reasons to create robots.txt

1. It is good SEO practice. The first file Google's bots look for when crawling your blog is robots.txt, and having one makes the indexing process faster.

2. Your blog is live but still in development, and you don't want search engines to crawl it yet.

3. Directing bots not to crawl posts or pages that are not important.

4. Some scripts may need to give special instructions to robots, which can be placed in this file.

5. It can be used to fine-tune your blog's accessibility for all types of crawlers.

According to Google, you should not use robots.txt to hide your web pages from Google Search results; they can still get indexed if another page links to them.


How to create robots.txt

Let's create one for your blog. Log in to your blog's cPanel and check whether this file already exists; if not, create one, name it robots.txt, and save it at the root level of your blog.

In WordPress, the robots.txt file is located in the public_html or www folder, which is also called the root directory.


The three main directives to use:

- User-agent: defines which crawlers the rules below it apply to. Usually this is the * wildcard, which means the rules apply to all crawlers.

- Disallow: tells crawlers not to crawl the path placed after this directive.

- Allow: this is optional, but it can be used to explicitly permit crawlers to access a specific path or file mentioned after this directive.

See the robots.txt example below; paste the code into your robots.txt file and save it.
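The original code block is not preserved here; a typical WordPress robots.txt along these lines (the exact Disallow paths are an assumption, adjust them for your site) looks like this:

```
User-Agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
Allow: /wp-admin/admin-ajax.php
```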

The first line, User-Agent: *, means that the rules that follow apply to all crawlers. The Disallow directives then instruct the crawlers not to crawl the specified directories.

Setting robots.txt to disallow all:

User-Agent: *
Disallow: /

Disallow all restricts every bot from crawling and indexing your blog. It should be used only while your blog is live but still in development. Once development is done, delete the Disallow: / directive.


That's it. You have successfully created a robots.txt file for your blog. Whenever bots crawl your blog, they will crawl everything except the files and folders you specified above.

Some experts also recommend adding the WordPress folder wp-includes to the Disallow list, but this is no longer advisable because some bots, especially Google's, will generate errors if you block the wp-includes directory.


Testing robots.txt

Once the file is created, it's time to test it with Google's robots.txt Tester to ensure everything is working as expected.
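You can also check your rules programmatically. Here is a minimal sketch using Python's standard-library urllib.robotparser; the rules string and the example.com URLs are placeholders standing in for your own file and site:

```python
from urllib import robotparser

# Rules in the spirit of the example above: block the wp-admin
# directory for all bots.
rules = """\
User-agent: *
Disallow: /wp-admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A blocked path is reported as not fetchable; everything else is allowed.
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))  # False
print(rp.can_fetch("*", "https://example.com/hello-world/"))          # True
```

This checks the rules locally; the Google tester remains the authoritative check for how Googlebot itself interprets your file.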