Blocking bad bots with robots.txt

Which bad bots can be blocked? And what about bots that disregard robots.txt?

Edited: 2017-04-11 08:29

Bad robots are bots that harvest data from websites or submit spam to blogs and forums. The data they harvest may be anything from pages with form fields to email addresses. This data can then be used in various black-hat SEO and marketing techniques.

Most of these bad robots will simply ignore your robots.txt file entirely, so it's best to deal with them some other way, such as IP blocking. However, this is not always an option, since IP addresses can be dynamic.

IP blocking can still save some server resources, but temporary blocks are often preferred, since there is no way of knowing whether an IP address has changed owner.
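As a rough illustration of a temporary block, here is a minimal Python sketch that forgets an IP address once its block has expired. The class name TemporaryBlocklist, the one-hour default, and the example address are assumptions made for illustration; a real setup would more likely do this in the firewall or web server.

```python
import time


class TemporaryBlocklist:
    """Blocks IP addresses for a limited time, then lets them through again.

    Hypothetical helper for illustration; not tied to any particular server.
    """

    def __init__(self, block_seconds=3600, clock=time.monotonic):
        self.block_seconds = block_seconds  # assumed default: one hour
        self.clock = clock                  # injectable so it can be tested
        self._expires = {}                  # IP address -> time the block ends

    def block(self, ip):
        """Start (or restart) a block for the given IP address."""
        self._expires[ip] = self.clock() + self.block_seconds

    def is_blocked(self, ip):
        """Return True while the block for this IP address is still active."""
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # The block has run out, so forget the address. Whoever holds
            # the IP now may not be the one who caused the block.
            del self._expires[ip]
            return False
        return True


if __name__ == "__main__":
    blocklist = TemporaryBlocklist(block_seconds=600)
    blocklist.block("203.0.113.7")  # address from a documentation-only range
    print(blocklist.is_blocked("203.0.113.7"))
```

Because the block expires on its own, an address that is later handed to a different user is not locked out permanently.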

What is a bad bot?

Some website owners consider otherwise legitimate bots bad because they hammer the site with requests, mess up the statistics, or harvest data for malicious purposes, e.g. spam and telemarketing.

A lot of robots that do respect your robots.txt are mining data from websites, often for purposes that are irrelevant to the site owner. One such bot is ia_archiver, which mines data to show an archive of how the web looked in the past. Not everyone is interested in having old versions of their site stored in a public archive, so you may wish to block it from crawling your site.

If a bot that respects robots.txt is annoying you, you can simply block it there:

User-agent: ia_archiver
Disallow: /
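To check that a rule like this does what you expect, you can run it through Python's standard urllib.robotparser module. This is only a sketch: the is_allowed helper and the example.com URL are made up for illustration, and it only shows how a well-behaved crawler would read the rule. It does nothing to stop a bot that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The same rule as above, held in a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/page.html"  # placeholder URL
    print("ia_archiver:", is_allowed(ROBOTS_TXT, "ia_archiver", url))
    print("SomeOtherBot:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

The rule only names ia_archiver, so other user agents are still allowed to crawl the page.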

SEO spammer sites

Some sites gather information about domain names and IP addresses, and they usually consist of completely useless auto-generated data. These sites are very abusive, and often difficult to block.

Once your site has been indexed on one of those, it is often difficult to get it removed. Sometimes the data also exposes your personal information, and often it is outdated as well.

Many of these sites seem to update their data rarely, so if you have been unfortunate enough to get your site listed on one of them, there is likely nothing you can do. You can either wait for them to go out of business, or try to get their outdated data updated. Some of them provide an update button, but it does not always work.

They are likely created only for SEO purposes, in the hope of generating traffic from website owners. Ignore them. Most of them will disappear sooner or later.

Some of them may provide a useful service, and legit ways to block their crawler via robots.txt. But they are still annoying to website owners, because there are so many of them.