One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS lookup and records the event in the blacklist data file. Once added to the blacklist data file, bad bots are immediately denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.

In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

The Blackhole is built with PHP, and uses a bit of .htaccess to protect the blackhole directory. The blackhole script combines heavily modified versions of the Kloth.net script (for the bot trap) and the Network Query Tool (for the whois lookups). Refined over the years and completely revamped for this tutorial, the Blackhole consists of a single plug-&-play directory named “blackhole” that contains four files: .htaccess, blackhole.dat, blackhole.php, and index.php.

Installation Overview

I set things up to make implementation as easy as possible. Here are the five basic steps:

Upload the /blackhole/ directory to your site

Ensure writable server permissions for the blackhole.dat file

Add a single line to the top of your pages to include the blackhole.php file

Add a hidden link to the /blackhole/ directory in the footer of your pages

Prohibit crawling of the /blackhole/ directory by adding a line to your robots.txt file

It’s that easy to install on your own site, but there are many ways to customize functionality. For complete instructions, jump ahead to Implementation and Configuration. For now, I think a good way to understand how it works is to check out a demo.

One-time Live Demo

I have set up a working demo of the Blackhole for this tutorial. It works exactly like the download version, but it’s configured to block you only from the demo, not from the entire site. Here’s how it works:

First visit to the Blackhole demo loads the trap page, runs the whois lookup, and adds your IP address to the blacklist data file

Once you’re added to the blacklist, all subsequent requests for the Blackhole demo will be denied access

So you get one chance to see how it works. Once you visit, your IP will be blocked from the demo only – you will still have full access to this tutorial (and everything else). That said, here is the demo link: Blackhole Demo. Visit once to see the Blackhole trap, and then again to observe that you’ve been blocked. If I were to include the blackhole.php in the header of my theme files, you would be banned from pretty much the entire site.

Implementation and Configuration

Here are complete instructions for implementing and configuring the Perishable Press Blackhole:

Step 1: Download the Blackhole zip file, unzip and upload to your site’s root directory. This location is not required, but it enables everything to work out of the box. To use a different location, edit the include path in Step 3.

Step 2: Change file permissions for blackhole.dat to make it writable by the server. The permission settings may vary depending on server configuration. If you are unsure about this, ask your host. Note that the blackhole script needs to be able to read, write, and execute the blackhole.dat file.
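If you are unsure whether the server can actually read and write the data file, a quick check along these lines can help. This is only a sketch: the function name blackhole_dat_status() is a hypothetical helper, not part of the Blackhole download, and it only reports what PHP itself can see.

```php
<?php
// Hypothetical helper (not part of the Blackhole download): reports whether
// the PHP process can see, read, and write the blacklist data file.
function blackhole_dat_status(string $datFile): string
{
    if (!file_exists($datFile)) {
        return 'missing';
    }
    if (!is_readable($datFile)) {
        return 'not readable';
    }
    if (!is_writable($datFile)) {
        return 'not writable';
    }
    return 'ok';
}
```

Calling blackhole_dat_status($_SERVER['DOCUMENT_ROOT'] . '/blackhole/blackhole.dat') from a temporary test page should return 'ok' once the permissions are right, assuming the directory lives in the site root as described in Step 1.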

Step 3: Include the bot-check script by adding the following line to the top of your pages:
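A minimal form of that include might look like the following. It assumes the /blackhole/ directory sits in the site root (as in Step 1) and that the server sets DOCUMENT_ROOT correctly; adjust the path if you uploaded the directory somewhere else.

```php
<?php include $_SERVER['DOCUMENT_ROOT'] . '/blackhole/blackhole.php'; ?>
```

The line needs to appear before any page output so that a blocked request never reaches the rest of the page.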

The blackhole.php script checks the request IP against the blacklist data file. If a match is found, the request is blocked with a customizable message. See the source code for more information.
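To make the idea concrete, here is a rough sketch of that kind of check. It is not the actual blackhole.php source: the function names, the assumption that each line of the data file starts with an IP address followed by a space, and the block message are all hypothetical.

```php
<?php
// Hypothetical sketch, NOT the shipped blackhole.php. Assumes each line of
// the data file begins with an IP address followed by a space.

// Return true if $ip is the first field on any line of the blacklist file.
function blackhole_is_listed(string $ip, string $datFile): bool
{
    if (!is_readable($datFile)) {
        return false; // no readable blacklist means nobody is blocked
    }
    $lines = file($datFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }
    foreach ($lines as $line) {
        $fields = explode(' ', trim($line));
        if ($fields[0] === $ip) {
            return true;
        }
    }
    return false;
}

// Append an entry for $ip, holding an exclusive lock while writing.
function blackhole_add(string $ip, string $datFile): bool
{
    $entry = $ip . ' ' . gmdate('c') . PHP_EOL;
    return file_put_contents($datFile, $entry, FILE_APPEND | LOCK_EX) !== false;
}

// Block the current request if its IP address is on the blacklist.
function blackhole_guard(string $datFile): void
{
    $ip = $_SERVER['REMOTE_ADDR'] ?? '';
    if ($ip !== '' && blackhole_is_listed($ip, $datFile)) {
        http_response_code(403);
        exit('You have been banned from this site.');
    }
}
```

In this sketch the trap page would call blackhole_add() for the visiting IP, and every other page would call blackhole_guard() before sending any output. Comparing the whole first field (rather than searching for a substring) keeps 203.0.113.7 from also matching 203.0.113.70.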

Step 4: Include a hidden link to the /blackhole/ directory in the footer of your pages:

<a style="display:none;" href="http://example.com/blackhole/" rel="nofollow">Do NOT follow this link or you will be banned from the site!</a>

This is the hidden link that bad bots will follow. It’s currently hidden with CSS, so 99% of visitors won’t ever see it. To hide the link from users without CSS, replace the anchor text with a transparent 1-pixel GIF image.

Step 5: Prohibit crawling of the /blackhole/ directory by adding a line to your robots.txt file. This step is pretty important. Without the proper robots directives, all bots would fall into the Blackhole because they wouldn’t know any better. If a bot wants to crawl your site, it must obey the rules! The robots rule that we are using basically says, “All bots DO NOT visit the /blackhole/ directory or anything inside of it.” More on this in the next section.
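In standard robots.txt syntax, a rule with that meaning can be written as follows. This is the generic form of such a rule, assuming the directory is located at /blackhole/ in the site root; the exact line in the download may differ slightly.

```text
User-agent: *
Disallow: /blackhole/
```

A Disallow path is matched as a prefix, so this covers the directory and everything inside it.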

Further customization: The previous five steps will get the Blackhole working, but there are some details that you’ll want to customize:

index.php (lines #27/28): Edit to/from email addresses with your own

index.php (lines #149/161): Check/replace path to your contact form

index.php (line #113): Line to whitelist user-agents (optional)

blackhole.php (line #30): Line to whitelist user-agents (optional)

blackhole.php (line #39): Check/replace path to your contact form

These are the recommended changes, but the PHP is clean and generates valid HTML5, so feel free to modify the source code as needed. Note that beyond these changes, no other edits need to be made.

Whitelisting Search Bots

Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now “whitelists” any user agent identifying as any of the following:

googlebot (Google)

msnbot (MSN/Bing)

yandex (Yandex)

teoma (Ask)

slurp (Yahoo)
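A whitelist like this usually comes down to a case-insensitive substring match on the user-agent string. The following sketch shows one way to do it; the constant and function names are hypothetical, and the shipped scripts may match user agents differently.

```php
<?php
// Hypothetical sketch; the shipped scripts may match user agents differently.
const BLACKHOLE_WHITELIST = ['googlebot', 'msnbot', 'yandex', 'teoma', 'slurp'];

// Return true if the user-agent string contains any whitelisted token,
// ignoring case.
function blackhole_is_whitelisted(string $userAgent): bool
{
    foreach (BLACKHOLE_WHITELIST as $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

In a page context you would pass in $_SERVER['HTTP_USER_AGENT'] (or an empty string when the header is missing).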

Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “hey look, I’m teh Googlebot!” and the whitelist would grant access. It is possible to verify the true identity of each bot, but as X3M explains in the comments, doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines.
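For anyone who does want to verify a bot despite the extra cost, the usual approach is a reverse DNS lookup on the IP address followed by a forward lookup on the resulting host name. The sketch below is not part of the Blackhole: the function name is hypothetical, the accepted domains (googlebot.com and google.com) are an assumption based on Google’s published verification guidance, and the lookups are injectable so the logic can be tried without touching the network.

```php
<?php
// Hypothetical sketch, not part of the Blackhole. Verifies a client that
// claims to be Googlebot via reverse DNS, then a confirming forward lookup.
// The accepted domains are an assumption; check Google's current guidance.
function blackhole_verify_googlebot(
    string $ip,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    // Default to PHP's resolver functions; gethostbynamel() is IPv4 only.
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP on failure and false on
    // malformed input.
    if ($host === false || $host === $ip) {
        return false;
    }

    $host = rtrim(strtolower($host), '.');
    if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }

    // The host name must resolve back to the same IP address.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

Each call can trigger two DNS lookups, which is exactly the overhead mentioned above, so if you try this, consider running it only for requests that hit the trap and caching the result.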

License and Disclaimer

The Perishable Press Blackhole is released under the GNU General Public License. Check the Creative Commons for a summary and/or see the Blackhole source code for additional information. Also note that by downloading the Blackhole, you agree to accept full responsibility for its use. In no way shall the author be held accountable for anything that happens after the file has been downloaded.

Check it out

Comments

Several hints:
# a lot of bots put the googlebot string into their user agent – that’s why people think that your script blocks the googlebot. Usually it doesn’t need to be whitelisted.
# checking only the user agent is bad in this case. Google explains how to verify their bot correctly over there: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
# the best way to install this script is to add the rule to robots.txt first, wait a day, and then add the link somewhere in a hidden div.

Doesn’t seem to be working for me. After installation I’m getting no error, but in attempting to get myself banned by repeatedly visiting the banned page, nothing happens, and the dat file never gets written to. I’ve changed permissions to allow full access to everything (just to test), and still nothing gets written.

I receive the “Bad Bot” email, the “bad bot” page displays fine, and there are no errors reporting in my error log.

Now, can you make a black hole for all the spam posters to my blog. The ones that post…really love the way you write and your content is so good, never though along these lines before but you enlightened me….on my art blog with only pictures of my art or on my about page.

I would love to develop a list and block them from ever even trying to post.

It’s pretty amusing to read all the discussion about how to avoid banning GoogleBot. If they’re crawling a page that you’ve explicitly requested they ignore, then you should treat them like every other crawler. I don’t see the point of breaking your own rule in this case.

Eric, that’s the 2,000-pound gorilla in the room. Why bother whitelisting any search-engine? If they break the rules, ban them. Right? Unfortunately, Google owns the Web, so they pretty much decide what it is exactly that they will and won’t do. Sucks, but true.

If you read the url Tom has posted, you will notice that Google simply wants to be called explicitly and not via the wildcard. That way it obeys the robots.txt, according to the author of that site. :)

About the site

Perishable Press is the work of Jeff Starr, professional developer, designer, author, and publisher with over 10 years of experience.

Fun fact: Perishable Press has been online since 2005, and now features over 700 articles and more than 11,000 comments.