
A Guide to Robots.txt

Posted on: September 17, 2016 by Dimitar Ivanov

If you've ever used an SEO tool, you've probably seen a positive score for
having a robots.txt file, or a negative one for a missing robots.txt.
Although the presence or absence of robots.txt is not a ranking
signal in the SERPs, it's still an important part of the SEO effort to optimize a website.

What is robots.txt?

It's a plain text file used to instruct search engines
whether to include a web resource in their index or to stay away from it. The
robots.txt file must be placed in the web root
directory of a host, i.e. /robots.txt

What is a robot?

A robot is an automated program or service
that crawls the web and gathers information later used by search engines to
update their indexes and provide relevant search results. When a robot
visits a website, it tries to find and read the robots.txt file. Based on the instructions
found there, the crawler will add a web page or resource to its index or
stay away from it. All search engines (e.g. Google, Baidu, Yandex, DuckDuckGo,
Bing, Yahoo!, etc.) have their own robots. Other common names for a robot are:
bot, spider and
crawler.

Robots.txt syntax

The robots exclusion standard defines the following directives:

User-agent - the name of a web crawler. A wildcard * stands for all robots.

Disallow - specifies paths that must not be accessed by the given robots. If no path is specified, this directive has no effect.

Allow - specifies paths that may be accessed by the given robots, even if they fall under a broader Disallow rule. If no path is specified, this directive has no effect.

Robots.txt examples

Gives all robots access to the whole website:

User-agent: *
Disallow:

Refuses all robots access to the whole website:

User-agent: *
Disallow: /

Prevents all robots from accessing given folders or single files:

User-agent: *
Disallow: /uploads/
Disallow: /img/private.jpg

Disallows specific robots from accessing a web resource:

User-agent: Googlebot
User-agent: Baiduspider
Disallow: /cgi-bin/

Prevents a specific robot from accessing the whole website:

User-agent: YandexBot
Disallow: /

Instructs multiple user-agents with different rules:

User-agent: *
Disallow: /images/
User-agent: Slurp
Disallow: /

Tells all robots to stay away from the whole website except the home page:

User-agent: *
Disallow: /
Allow: /index.html

Robots Meta tag

To instruct web spiders, you can also use an HTML meta tag.
Note that this applies only to HTML documents, i.e. you can't use a meta tag for images,
styles or scripts.
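A typical robots meta tag looks like this; it goes inside the page's head element, with the most common values being noindex and nofollow:

```html
<head>
  <!-- noindex: don't add this page to the index; nofollow: don't follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```

You can also target a specific crawler by using its name instead of "robots", e.g. name="googlebot".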

Robots header

Instead of an HTML meta tag, you could use the HTTP header X-Robots-Tag,
sent from your web server via a .htaccess file or
a dynamic language such as PHP, Python,
Ruby, etc.

X-Robots-Tag: noindex
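For example, the header above could be sent for every PDF file via Apache's mod_headers module in a .htaccess file (a sketch; adjust the file pattern to your needs):

```apache
# Requires Apache's mod_headers; applies to all PDF files
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>
```

In PHP, the same header can be sent with header('X-Robots-Tag: noindex'); before any output is written.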

Conclusion

When you need to hide a portion of a website from Google or other search
engines, the right solution is to use a robots.txt file. But be
careful with robots.txt - a little mistake could cost a lot; it could
even prevent a whole website from being indexed. So it's recommended
to always validate the robots.txt after changes, using
a tool or service.

If you have questions about robots.txt, leave a comment below.
Don't forget to share this article if you think it's worth others knowing about it.
Thanks so much for reading!

Dimitar Ivanov

Dimitar Ivanov is a senior LAMP developer and JavaScript engineer, obsessed with web performance.
He has been programming since 2003 and loves building web applications.
You can find him on Twitter,
LinkedIn and
GitHub.
