Robots.txt Best Practices

Even though SEO specialists put most of their effort into improving the visibility of pages for their corresponding keywords, in some cases it is necessary to hide certain pages from search engines. Let’s find out a bit more about this topic.

What is a robots.txt file?

Robots.txt is a file that tells search engine robots which areas of a website they are not allowed to crawl. It lists the URLs that the webmaster doesn’t want Google or any other search engine to visit, keeping bots away from the selected pages. Note that robots.txt controls crawling, not indexing: a blocked URL can still appear in search results if other pages link to it. When a bot arrives at a website, the first thing it does is check the robots.txt file in order to learn what it is allowed to explore and what it has to ignore during the crawl.

To give you a robots.txt example, this is its syntax:

User-agent: *
# All bots – Old URLs
Allow: /
Disallow: /admin/*

What is robots.txt in SEO

These directives guide search engine bots when they find a new page. They are important because:

– They help optimize the crawl budget: the spider visits only what’s truly relevant, making better use of its time on the site. An example of a page you wouldn’t want Google to crawl is a “thank you” page.

– The robots.txt file can help search engines discover the pages you do want crawled, as it can reference your XML sitemap with the Sitemap: directive.

– They can keep entire sections of a website out of the crawl, and since each root domain or subdomain needs its own robots.txt file, you can manage them separately. A good example is, you guessed it, the payment details page. Bear in mind, though, that robots.txt is not a security measure: the file itself is public.

– You can also block internal search results pages from appearing in the SERPs.

– Robots.txt can block files that aren’t supposed to be crawled, such as PDFs or certain images.
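Putting the cases above together, a robots.txt file covering them might look like this (the paths are hypothetical examples, and the * and $ wildcard syntax is an extension supported by Google and Bing rather than part of the original standard):

```
User-agent: *
# Keep bots away from the thank-you page
Disallow: /thank-you/
# Block internal search results pages
Disallow: /search/
# Keep PDFs out of the crawl
Disallow: /*.pdf$
# Point crawlers to the sitemap
Sitemap: https://www.example.com/sitemap.xml
```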

Where do you find robots.txt

Robots.txt files are public. You can simply type in a root domain and add /robots.txt to the end of the URL and you’ll see the file…if there is one!

Warning: avoid listing private information in this file.

You can find and edit the file in the root directory of your hosting, using your hosting panel’s file manager or connecting to the website via FTP.
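Since the file always lives at the domain root, you can derive the robots.txt location for any page with Python’s standard urljoin (the page URL below is just an example):

```python
from urllib.parse import urljoin

# Whatever page you start from, robots.txt sits at the domain root
page = "https://www.example.com/blog/some-post"
print(urljoin(page, "/robots.txt"))  # https://www.example.com/robots.txt
```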

How to edit robots.txt

You can do it yourself

– Create or edit the file with a plain text editor

– Name the file “robots.txt”, exactly like that, all in lowercase; the name is case-sensitive, so variations with capital letters won’t work.

It should look like this if you want to have the site crawled:

User-agent: *
Disallow:

– Notice that we left “Disallow” empty, which indicates that nothing is disallowed: the whole site can be crawled.

In case you want to block a page, then add this (using the “Thank you page” example):

User-agent: *
Disallow: /thank-you/

– Use a separate robots.txt file for each subdomain.

– Place the file in the website’s top-level directory.

– You can test the robots.txt file with Google Search Console (formerly Google Webmaster Tools) before uploading it to your root directory.

– Take note that FandangoSEO is the ultimate robots.txt checker. Use it to monitor your files!

See? It isn’t so difficult to configure your robots.txt file, and you can edit it anytime. Just keep in mind that the goal is to make the most of the bots’ visits. By blocking them from seeing irrelevant pages, you’ll ensure their time spent on the website is much more profitable.

Types of Meta Robots:

Apart from robots.txt, we can also use meta robots tags to tell the bots what to do with a page directly:

<meta name="robots" content="all" />

There are many types of Meta Robots that can be assigned to a page on a site:

– index= This tag allows search engines to index the page. It is the default behavior, so if you’re OK with search engines finding and indexing your pages, you don’t need to add it.

– all= As mentioned above, this tag allows search engines to index the page and follow its links. “All” is equivalent to “index, follow”.

– noimageindex= It prohibits search engines from indexing the images on a page. However, if an image is linked from elsewhere, Google can still index it, so in that case it’s better to use an X-Robots-Tag HTTP response header instead.

– none= Its purpose is to ask search engines not to index the page nor follow any link on it; it is equivalent to “noindex, nofollow”. Basically, it tells them to ignore the page entirely.

– follow= This robots tag tells search engines to follow the links on the page, whether or not the page itself is indexed.

– nofollow= It asks search engines not to follow any links from the page.

– noarchive= This one prevents search engines from showing a cached copy of the page in their search results.

– nocache= The same as the previous one, but only for MSN/Live.

– nosnippet= It prevents snippets from appearing in the SERPs, and it also prevents cache generation.

– noodp= Now obsolete, it was used to prevent search engines from using the page description from the DMOZ Open Directory Project.

– noydir= It prevented Yahoo! from using the description from the Yahoo! Directory in its search results (it isn’t used anymore either, but you may come across it).

In these cases, you’ll want to add the tag inside the <head> of the HTML on each page.
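To illustrate how a crawler reads these tags, here is a minimal sketch using Python’s standard html.parser (the sample HTML and class name are made up for the example):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives = [d.strip() for d in content.split(",") if d.strip()]

parser = RobotsMetaParser()
parser.feed('<head><meta name="robots" content="noindex, nofollow"></head>')
print(parser.directives)  # ['noindex', 'nofollow']
```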

Finally, remember that the SEO best practice for robots.txt and meta robots is to ensure that all the relevant content is crawlable and indexable! Using FandangoSEO’s crawl, you can see the percentage of indexable and non-indexable pages among the total pages of a site, as well as the pages blocked by the robots.txt file.