This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines – the ones that bring you nice traffic – while keeping the nasty (or useless, traffic-wise) bots out of the game.

The basic idea is that blocking all bots – with very few exceptions – makes more sense than maintaining a kind of Web robots who’s who in your robots.txt file. You decide whether a bot – or rather the service it crawls for – does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold Disallow: /.

Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.

Also, large robots.txt files handling tons of bots are fault-prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If, on the other hand, you verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze – and you don’t enlighten your competitors.

The anatomy of a smart robots.txt

Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.

Install

First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:
<FilesMatch ^robots\.txt$>
SetHandler application/x-httpd-php
</FilesMatch>
As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.

Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!

For testing purposes you can use the logRequest() function. It’s probably a good idea to CHMOD /smart_robots_log.txt to 0777 then. Don’t leave that in a production system; better to log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.
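For orientation, a flat-file logger along these lines would do for testing. This is a sketch, not the shipped code: the signature and the tab-separated format are my assumptions.

```php
<?php
// Hypothetical sketch of logRequest(); the bundled version may differ.
// Appends one tab-separated line per robots.txt request to a flat file.
// In production, insert into a database table instead.
function logRequest($logFile)
{
    $line = implode("\t", array(
        date('c'),                                                          // timestamp
        isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'unknown',
        isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown',
        isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/robots.txt',
    )) . "\n";
    // LOCK_EX so concurrent crawler hits don't interleave lines.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

A 0777 world-writable log is exactly the kind of thing you grep during testing and rip out before launch.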

Customize

Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file; /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write an isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt – or rather /robots.txt, once you’ve launched your smart robots.txt.

Let’s look at /smart_robots.txt. First of all, it sets the canonical server name; change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!), it includes /smart_robots_inc.php.

Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.
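The headers in question are presumably along these lines. A sketch of the intent – plain text, no caching, no archiving – with values that are my assumptions, not copied from the script:

```php
<?php
// Sketch of the header set meant here; the exact values are assumptions.
// The X-Robots-Tag keeps the crawler-specific robots.txt out of search
// engine indexes and SERP "cached" links.
function smartRobotsHeaders()
{
    return array(
        'Content-Type: text/plain; charset=utf-8',
        'X-Robots-Tag: noindex, noarchive',
        'Cache-Control: no-cache, no-store, must-revalidate',
    );
}

foreach (smartRobotsHeaders() as $h) {
    header($h); // must run before any output, or PHP complains
}
```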

As a side note: if you want to know what your competitor really shoves into their robots.txt, just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can log in to GWC and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.

When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.

Next up are the search engine specific crawler directives. You output them like this:
if (isGooglebot()) {
    $content .= "
User-agent: Googlebot
Disallow:
…
\n\n";
}
If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)

Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output
$content .= "User-agent: *
Disallow: /
\n\n";
Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.

Before $content is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte – for example leading whitespace, a debug message, or a #comment – before the print $content; statement.
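To make the order of operations concrete, here’s a sketch of that gate. isRogueRequest() and its naive user agent pattern are mine; wire in the result of your own crawler verification instead of the hard-coded false:

```php
<?php
// Sketch of the "403 before a single byte of output" gate. The function
// name and UA pattern are illustrative, not the script's actual code.
function isRogueRequest($userAgent, $ipVerified)
{
    // Claims to be a major crawler, but the IP doesn't verify => rogue.
    $claimsCrawler = preg_match('/Googlebot|msnbot|Slurp/i', $userAgent) === 1;
    return $claimsCrawler && !$ipVerified;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (isRogueRequest($ua, false /* plug in your DNS verification result */)) {
    // blockIp($_SERVER['REMOTE_ADDR']);   // persist the ban
    header('HTTP/1.1 403 Forbidden');
    exit; // not a single byte of $content for rogue bots
}
```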

Blocking rogue bots is important. If you discover a rogue bot –for example a scraper that pretends to be Googlebot– during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!
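That detection hinges on proper crawler verification. The standard procedure the major engines document is a reverse DNS lookup, a check of the resulting host name’s domain, and a forward confirmation. A sketch – function names and domain lists are mine:

```php
<?php
// Reverse-then-forward DNS check, as documented by the major engines.
// Pure helper: does $host end with one of the trusted domain suffixes?
function hostMatchesDomains($host, $validDomains)
{
    foreach ($validDomains as $domain) {        // e.g. '.googlebot.com'
        if (substr($host, -strlen($domain)) === $domain) {
            return true;
        }
    }
    return false;
}

function isVerifiedCrawlerIp($ip, $validDomains)
{
    $host = gethostbyaddr($ip);                 // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                           // no usable PTR record
    }
    if (!hostMatchesDomains($host, $validDomains)) {
        return false;                           // wrong domain => spoofed
    }
    // Forward-confirm: the host must resolve back to the requesting IP.
    return gethostbyname($host) === $ip;
}
```

The suffix check alone is not enough – without the forward confirmation, anyone controlling reverse DNS for their IP range could claim a googlebot.com host name.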

Bear in mind that each and every piece of content served from your site should implement rogue bot detection, that’s doable even with non-HTML resources like images or PDFs.

Finally we deliver the user agent specific robots.txt and terminate the connection.

Now let’s look at /smart_robots_inc.php. Don’t fuck up the variable definitions and the routines that populate them or deal with the requestor’s IP addy.

Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).

Most search engines document how you can verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example, to add Yandex, test the host name for a leading “spider” and a trailing “.yandex.ru” string, with an integer in between – like in the isSlurp() function.
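Such a Yandex check, modeled on the isSlurp() pattern just described, might look like this. The function name is mine, and you should verify the host name pattern against Yandex’s own documentation before relying on it:

```php
<?php
// Hypothetical Yandex host name check: "spider", an integer, then
// ".yandex.ru" - anchored on both ends so "spider1.yandex.ru.evil.example"
// can't sneak through. Pair it with the forward DNS confirmation.
function isYandexHostname($host)
{
    return preg_match('/^spider\d+\.yandex\.ru$/', $host) === 1;
}
```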

Test

Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits, you can use the “Fetch as Googlebot” thingy in your webmaster console.

Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “.search.live.com” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “inktomisearch.com” host name.

Don’t rely on my crawler detection routines. They’re dumped from memory in a hurry, I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: Googlebot
Allow: /
Disallow:

Sitemap: http://sebastians-pamphlets.com/sitemap.xml

User-agent: *
Disallow: /

Search engines hide important information from webmasters

Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked, Google didn’t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection, webmasters do need such information.

I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even properly announce that they’d changed their crawler host names a while ago. Very sad. A search engine changing its crawler host names breaks code on many websites.

Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.

Enjoy.

Changelog:

2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.

22 Comments to "Get yourself a smart robots.txt"

Really nice article. I’m handling bot visits in a similar way. I’m using a MySQL table for the function isBot($REMOTE_ADDR, $USER_AGENT). In this table I record user_agent and remote_addr.

The table gets a few default entries: IP addresses of datacenters, or bot user agents. With the following routine you don’t need to add new IP addresses manually.

If a visitor’s user_agent is found in the table but its IP address is not, the IP address gets added. So if Google uses a new IP for its crawler, it’s added automatically. *) If the IP address is already in the table, no new entry is added.

With this logic the table grows automatically, and you only need to add new user agent fragments manually; the IP addresses are stored automatically.

Warning: don’t use this logic for distributed search engines like 80legs or Majestic. Those IP addresses mostly belong to private users, and you don’t want them blocked/blacklisted or whatever.

For these distributed search engines I added a new column to the table, called TYP. I do add IP addresses for distributed SEs as well, but I remove them an hour later with a crontab script.

I’m using my script on a few e-commerce sites and I’m very happy that I created this auto-script. I didn’t find any similar logic on Google. Perhaps I should search with Bing?

So, if you disagree with my comment or have any other ideas, let me know…

[I had to remove your spammy URI and anchor text. *) Identification of bots solely relying on the user agent is crap. Guess how many scrapers steal your stuff as Googlebot … You need to verify all bots before you grant access. Sebastian]

Sebastian,
What you clarified in your last comment – I never thought of that when creating robots.txt. It makes kind of sense that scrapers would copy blocked areas to dodge dupe content issues, but usually there is nothing important (that would attract search engine hits) to scrape in blocked areas. At least, not on my sites.

Great article, but a very over-engineered solution. The assumption is that bad bots will adhere to the robots file and provide an accurate user agent. To build this solution on that assumption is, to me, the weakness of the plan.

That said, I do think the solution itself is pretty slick.

[You didn’t get it. A smart robots.txt is just one weapon in the arsenal. And yes, some rogue bots request it seeking site areas that aren’t fully indexed due to Disallow: statements in robots.txt, usually to avoid duplicate content issues with reprinted content. Once they request robots.txt, they’ll never see any content again. As said above, rogue bot detection has to be done before any piece of content is served to any user agent. A smart robots.txt is just one source that detects and persists sneaky requests, so that the offender can be properly blocked from everything.]

Nice article Sebastian. I never knew there was anything this deep to the robots.txt file. My clients usually nag me about stupid robots stealing their content/images. I’ll try to implement this solution from now on.

Does this refer to your Disallow in the smart robots.txt file? Or does it refer to some other function? Because it’s a little confusing…as the root of the problem is still in the fact that bad bots generally ignore robots.txt.

Nope Mik. A non-behaving bot can’t be blocked by a Disallow: / statement in robots.txt.

What you need is an apparatus that detects nasty bot behavior analyzing each and every HTTP request, sometimes triggering a ban of an IP, host, user agent, or a combination of those values on the nth request.

Unusual (IOW fraudulent) requests of /robots.txt are just one source that feeds such a black list. For example when you request the robots.txt file of a protected site pretending to be Googlebot from your Cox Communications IP addy, all your future requests run into a 403. The same will happen when you request the root index page, or any other resource (page, image, …), regardless whether you’ve fetched robots.txt before or not.

Legit bots sent out from major search engines provide a procedure to validate their requests. Any request with a search engine crawler user agent string and an IP addy that can’t be verified as belonging to the particular search engine’s crawling engine is fraudulent, and should be blocked. Instead of responding with a 403-GFY you can serve ads or anything else except your content.

If you make use of somewhat shady optimization techniques, such as some variants of IP delivery, the procedure outlined above is not safe enough, but for most sites it will serve you well.

Usually you want Google to display both the snippet and the cached link. However, there are some cases where you might want to disable one or both of these. For example, say you were a newspaper publisher, and you have a page whose content changes several times a day. It may take longer than a day for us to reindex a page, so users may have access to a cached copy of the page that is not the same as the one currently on your site. In this case, you probably don’t want the cached link appearing in our results.

Again, the Robots Exclusion Protocol comes to your aid. Add the NOARCHIVE tag to a web page and Google won’t show a cached copy of that page in search results:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">

Similarly, you can tell Google not to display a snippet for a page. The NOSNIPPET tag achieves this:

<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">

Adding NOSNIPPET also has the effect of preventing a cache link from being shown, so if you specify NOSNIPPET you automatically get NOARCHIVE too.

So I’d guess that works fine with the news crawler. Didn’t test it myself, though, since I don’t run CNN or so.

David, cloak the META element for the news crawler, or, even better, serve it via X-Robots-Tag in the header to the news bot only. That’ll be enough for a somewhat clean testing. If the Google News service doesn’t support it, drop me a line and we can create some noise.
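Serving it via X-Robots-Tag could look like the sketch below. The detection function is a hypothetical user agent check only; as discussed above, it must be paired with IP verification before you trust it, and whether Google News honors the header is exactly the open question:

```php
<?php
// Sketch: send the directives as an HTTP header to the news crawler
// only, instead of cloaking a META element. looksLikeGooglebotNews()
// is illustrative - a bare UA match, no IP verification.
function looksLikeGooglebotNews($userAgent)
{
    return stripos($userAgent, 'Googlebot-News') !== false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (looksLikeGooglebotNews($ua)) {
    header('X-Robots-Tag: noarchive, nosnippet');
}
```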