Robots.txt – Everything You Need to Know

When it comes to crawling your site, the robots.txt file is one of your biggest assets. It exists primarily to inform search engine spiders and other bots which parts of your site they can and can’t access.

It’s a file that is supported and, for the most part, obeyed by all of the major search engines. That means that any rules you specify within it will be taken into account when it comes to crawling your site.

What Is a Robots.txt File? What Does It Do?

A list of instructions is always a help when you’ve got a whole world wide web to check over. Search engine bots in particular love robots.txt files, as they allow them to easily distinguish which pages should be available for public consumption.

Why Do I Need It?

Provided everything is set up and implemented correctly, robots.txt can be extremely handy in a number of instances.

Duplicate Content: Especially for eCommerce stores with a number of URLs using query strings. Canonical tags should be the priority but directives in your file can help, too.

Areas Not Meant for Public Consumption: Staging and dev sites, as well as any internal documents that might be hosted on your domain, spring to mind.

Preventing Certain File Types Being Indexed: Some people don’t like things like PDFs and images being indexed.

If you don’t want to hide anything from search results, you don’t need one. You could have one that simply lets every crawler find everything if you like.
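As a rough illustration, an allow-everything file is just a wildcard user agent with an empty Disallow line. The sketch below uses Python’s standard urllib.robotparser module to confirm that such a file blocks nothing. The bot name SomeBot and the paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# An "allow everything" robots.txt: every bot, nothing disallowed.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_rules(ALLOW_ALL)

# Hypothetical bot name and paths, used only to exercise the rules.
print(parser.can_fetch("Googlebot", "/"))               # True
print(parser.can_fetch("SomeBot", "/any/page.html"))    # True
```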

It’s worth noting that robots.txt controls crawling, not indexing: URLs blocked in the file may still end up being indexed if search engines deem them valuable.

Links pointing to a page, both internal and external, are the biggest reason this might happen.

How to Check for One

Before a bot crawls your site, it checks to see if you’ve left it any instructions. It does so by looking for your file at the root of your domain.
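To see where a crawler would look, you can derive the robots.txt address from any page URL on the site. This is a minimal sketch using Python’s urllib.parse; example.com is a placeholder domain and robots_txt_url is a helper name invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the file always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder domain.
print(robots_txt_url("https://www.example.com/blog/post?id=7"))
# https://www.example.com/robots.txt
```

Opening the resulting address in a browser shows whether a file is already in place.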

Hosting Account

You can delete everything in there and replace it, or just add to what’s already in place.

WordPress Virtual File

If your site is on WordPress, you may have found a robots.txt file that looks something like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

If you can’t find where it is in your files, no need to panic. WordPress automatically creates a virtual robots.txt if one isn’t detected.

This essentially means you don’t have one, so WordPress has provided something in its place.

How to Write

Crafting a robots.txt file is easy to explain but can be a bit tricky if you aren’t entirely sure what you are doing.

Use a text editor – Notepad will do the job just fine. You can also use online editors, such as editpad.org. Programs like Word tend to save documents in proprietary formats, such as .doc or .docx, which often don’t conform to character rules that crawlers understand.

The primary example is quote marks. There’s a big difference between “ and " for spiders.
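If a file has already passed through a word processor, a quick scan can reveal whether typographic quotes slipped in. The helper names below (find_smart_chars, straighten) and the sample rule are hypothetical, and the character list covers only the four most common substitutions.

```python
# Typographic characters that word processors commonly swap in for plain ASCII.
SMART_CHARS = {
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / apostrophe
}


def find_smart_chars(text: str) -> list[tuple[int, str]]:
    """Return (line_number, character) for each typographic quote found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch in SMART_CHARS:
                hits.append((lineno, ch))
    return hits


def straighten(text: str) -> str:
    """Replace typographic quotes with their plain ASCII equivalents."""
    return text.translate(str.maketrans(SMART_CHARS))


# Hypothetical rule containing curly quotes on line 2.
sample = "User-agent: *\nDisallow: /\u201cdrafts\u201d/\n"
print(find_smart_chars(sample))
print(straighten(sample))
```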

Note that the paths in your rules are case-sensitive, and so is the name of the file itself: it must be robots.txt, all lowercase. Don’t use any capitals when you’re saving it.
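Path matching in particular depends on case. The short check below, again with urllib.robotparser, uses a hypothetical /Private/ folder to show that a rule written with a capital letter does not cover the lowercase spelling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: the folder name starts with a capital P.
RULES = """\
User-agent: *
Disallow: /Private/
"""

case_parser = RobotFileParser()
case_parser.parse(RULES.splitlines())

# Only the exact-case path is blocked.
blocked_exact = not case_parser.can_fetch("*", "/Private/report.html")
blocked_lower = not case_parser.can_fetch("*", "/private/report.html")
print(blocked_exact, blocked_lower)  # True False
```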

Search engines cache your robots.txt file. The cached copy is usually refreshed every day, but you can resubmit in Search Console to make sure your new version is picked up straight away.

Understanding Syntax

First and foremost, you’ll have to know how to dictate which parts of your site should and shouldn’t be crawled.

You’ll need the following syntax:

User-agent: [This is the name of the robot you’re trying to communicate with]
Disallow: [A URL path that you don’t want crawling]
Allow: [A certain path of a blocked folder or directory that should be crawled]

As an example, if we want to block Googlebot from a blog category directory, apart from the SEO section, our robots.txt file might look something like this:
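A sketch of that setup follows, written as a Python string so the rules can be checked with urllib.robotparser. The paths are assumptions made for illustration: /blog/category/ stands in for the category directory and /blog/category/seo/ for the SEO section.

```python
from urllib.robotparser import RobotFileParser

# Assumed paths: /blog/category/ is the category directory and
# /blog/category/seo/ is the SEO section that should stay crawlable.
BLOG_RULES = """\
User-agent: Googlebot
Allow: /blog/category/seo/
Disallow: /blog/category/
"""

blog_parser = RobotFileParser()
blog_parser.parse(BLOG_RULES.splitlines())


def googlebot_may_crawl(path: str) -> bool:
    """Check a path against BLOG_RULES as Googlebot."""
    return blog_parser.can_fetch("Googlebot", path)


print(googlebot_may_crawl("/blog/category/seo/robots-guide/"))  # True
print(googlebot_may_crawl("/blog/category/news/some-post/"))    # False
print(googlebot_may_crawl("/about/"))                           # True
```

Google applies the most specific (longest) matching rule, so the order of the Allow and Disallow lines does not matter to it. Python’s parser stops at the first rule that matches, which is why Allow is listed first here.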

Where to Put It

The robots.txt file goes on the root of your domain. Of course, if you’ve already found yours, you can simply replace the content in it with your new rules.

If you didn’t find one, it can be uploaded using a few methods.

FTP

Once you’ve written your file, you can simply drag it into your FTP client window.

Remember that if it isn’t on the root, it isn’t going to work.
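If you would rather script the upload than drag and drop, Python’s standard ftplib can do the same job. This is only a sketch: upload_robots is a name invented here, and the host, credentials and /public_html directory in the commented usage are placeholders, since the web root differs between hosts.

```python
import io
from ftplib import FTP


def upload_robots(ftp: FTP, content: str, remote_dir: str = "/") -> str:
    """Upload robots.txt content into remote_dir over an open FTP session."""
    ftp.cwd(remote_dir)
    payload = io.BytesIO(content.encode("utf-8"))
    # The file name must be exactly "robots.txt", all lowercase.
    return ftp.storbinary("STOR robots.txt", payload)


# Usage with placeholder host, credentials and web root:
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     upload_robots(ftp, "User-agent: *\nDisallow:\n", remote_dir="/public_html")
```

Passing the session in as an argument keeps the function easy to try against a test server before pointing it at a live site.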

Hosting Account

Locate the public HTML folder of your site, as shown by the red box. Select this directory and click the upload button to begin implementing your file.

In cPanel, you’ll be presented with an option to drag or select a file. Take your pick and your file will be uploaded.

Yoast

If your site is on WordPress and you have the Yoast SEO plugin installed, you might be able to upload a robots document without having to poke around in source code and server files.

Assuming both your admin account and the Yoast plugin have permission to edit server files, select Tools from the Yoast menu.

You should see a list of options, including File Editor.

Find the section labelled “robots.txt” and paste in your rules.

Click the “Save changes to robots.txt” button.

Validation

As we mentioned earlier, you can validate your uploaded robots.txt file using Search Console.

If you’ve done everything right, including making sure you placed it in the right directory, you can go ahead and insert your blocked path.

Click the TEST button and it should return the bright red bar highlighting which rule is preventing the URL or folder from being crawled.
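For a rough local equivalent, the function below reports which rule decides a given path. It is a simplified sketch rather than Google’s own logic: it assumes a single user-agent group, ignores wildcards, and follows the documented precedence that the longest matching path wins, with Allow beating Disallow on a tie. The rules are the WordPress defaults shown earlier; deciding_rule and is_blocked are names invented for this example.

```python
from typing import Optional

# The WordPress default rules quoted earlier, as (directive, path) pairs.
WP_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]


def deciding_rule(
    rules: list[tuple[str, str]], path: str
) -> Optional[tuple[str, str]]:
    """Return the rule that decides path, or None if no rule matches.

    The longest matching path wins; on a tie, allow beats disallow.
    Wildcards (* and $) are not handled in this sketch.
    """
    matches = [(d, p) for d, p in rules if p and path.startswith(p)]
    if not matches:
        return None
    return max(matches, key=lambda rule: (len(rule[1]), rule[0] == "allow"))


def is_blocked(rules: list[tuple[str, str]], path: str) -> bool:
    """True when the deciding rule for path is a disallow."""
    rule = deciding_rule(rules, path)
    return rule is not None and rule[0] == "disallow"


print(deciding_rule(WP_RULES, "/wp-admin/options.php"))
print(deciding_rule(WP_RULES, "/wp-admin/admin-ajax.php"))
print(is_blocked(WP_RULES, "/blog/"))
```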

Potential Pitfalls

Up to this point, we haven’t really been met with any challenges. We’ve had a few things to look out for but generally not too much to make us scream.

A simple slash may seem like a small detail but it can end up being far from that.

Being as strange as I am, I’ve decided to start a travel business that only goes to places called Nottingham. I originally started out with just two destinations – Nottingham, UK and Nottingham, Maryland, USA.

I slowly began to offer more destinations, all called Nottingham of course, and had a respectable six destinations. As well as my original two, new locations in Indiana, New Hampshire, New Jersey and West Virginia became available.