Fend off the bad bots…

I have now discovered a few tips and tricks to fend off those bad bots, crawlers, harvesters, you name it – all at the application layer.

First of all you need to create yourself a simple blocklist of IPs, one per line, listing the addresses you don’t want visiting your site (hopefully ours will be published for public use soon).

123.567.678.9
657.387.93.2
1.1.1.14
etc
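A file like the one above can be read and sanity-checked before it is used. This is only a minimal sketch: the function name `parse_blocklist` is made up for illustration and is not part of the script further down. It assumes the blocklist is plain text with one address per line, and it uses PHP's built-in `filter_var()` to drop anything that is not a valid IP.

```php
<?php
// Hypothetical helper (an assumption, not part of the original script):
// turn blocklist text into an array of valid IP addresses,
// skipping blank lines, invalid entries and duplicates.
function parse_blocklist(string $text): array
{
    $ips = [];
    // \R matches any line ending (\n, \r\n or \r).
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // filter_var() returns false for anything that is not a valid IPv4/IPv6 address.
        if ($line !== '' && filter_var($line, FILTER_VALIDATE_IP) !== false) {
            $ips[] = $line;
        }
    }
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($ips));
}

// Example input mixing placeholder entries, a blank line and a duplicate.
$sample = "123.567.678.9\n1.1.1.14\n\netc\n1.1.1.14\r\n";
print_r(parse_blocklist($sample));
```

With that sample only 1.1.1.14 survives: the placeholder 123.567.678.9 has octets above 255, so it is not a valid address, and "etc" is not an address at all.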

The PHP needs to be something simple: a script that checks whether the user’s IP is in the text file and, if it is, displays a “No Entry” error. Originally my script only displayed an HTML page with text along the lines of “Your IP address has been blocked from our network due to suspicious activity, you are being monitored. If you feel you should not be blocked simply email someone@example.com”. This did the job, just not very well: the harvesters would pick up that email address and spam it like hell, and the bots just kept coming back every hour (each time the error message was displayed, it was logged in our admin area).

So, I did some research on how to tackle this issue. To Google!

I came across the HTTP error code 403.6 (http://en.wikipedia.org/wiki/HTTP_403) which, simply put, is a proper error status issued by the server (rather than a normal HTML page served with a 200) and means “Your IP has been blocked from this website”. Perfect. However, I had never come across HTTP error codes with decimals before, so I went foruming to see how to go about this. I had a suspicion that if we hit the bots with an “official” error they would return to the site less often.

I originally set out asking questions on webdesignerforum.co.uk, to no avail, so off I went to my recently discovered StackOverflow (which is an incredible website with an amazing idea which works bloody well!). As I expected, I got the answer, and fast: the best thing is just to send out a plain 403 rather than a 403.6, as the latter is very uncommon (the decimal sub-codes are a Microsoft IIS extension rather than part of standard HTTP).

So I changed the script to send a 403 header and, lo and behold, the log entries for blocked IPs in the admin area slowly started decreasing. Once the bots had got a 403 once or twice, they rarely came back!

// Blocklist script by h0zza (Ben Hoskins), licensed under the beerware license.
// Read the blocklist into an array, one IP per element, dropping line endings and blank lines.
$blocklist = file('/path/to/your/blocklist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($blocklist !== false && in_array($_SERVER['REMOTE_ADDR'], array_map('trim', $blocklist), true)) {
    // Boot the user: send a real 403 status instead of a normal 200 page.
    header("HTTP/1.0 403 Forbidden");
    // The contact address is assembled by JavaScript so it never appears whole in the HTML.
    echo '<h1>Access Forbidden!</h1>You have been banned from seeing this website because of suspicious activity. Our administrators have been notified and are monitoring your activity across our network. If you feel you should not be banned simply contact us on <script type="text/javascript">var a = new Array("mple.com","abuse@exa");document.write("<a href=\'mailto:"+a[1]+a[0]+"\'>"+a[1]+a[0]+"</a>");</script>.';
    // If you're going to add a logging script, add it here.
    exit();
}
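The script leaves a spot for logging but does not show any. As a rough sketch of what could go there, the helper below appends one line per blocked hit to a text file. The function name `log_blocked_hit`, the log path and the line format are all assumptions made for illustration; the admin-area logging mentioned earlier may work quite differently.

```php
<?php
// Hypothetical logging helper: the name, log path and line format are
// assumptions, not part of the original script.
function log_blocked_hit(string $ip, string $logFile, ?int $time = null): string
{
    // Use the current time unless a timestamp is passed in (handy for testing).
    $time = $time ?? time();
    // Strip line breaks so that one hit is always exactly one log line.
    $ip = str_replace(["\r", "\n"], '', $ip);
    $line = gmdate('Y-m-d H:i:s', $time) . ' UTC blocked ' . $ip . PHP_EOL;
    // FILE_APPEND adds to the end of the log; LOCK_EX avoids interleaved
    // writes when several requests are blocked at the same moment.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
    return $line;
}

// Example call, placed where the "add it here" comment sits:
// log_blocked_hit($_SERVER['REMOTE_ADDR'], '/path/to/your/blocked.log');
```

Counting the lines in such a file per day would be one simple way to see whether the bots really do come back less often after getting a 403.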

You’ll notice the email address is now written out by JavaScript (thanks to the Honey Pot Project, more on that soon). This is so that only humans can read the email address (if they were wrongly blocked), as most bots can’t understand JavaScript yet.
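The same trick can be wrapped in a small function so the address does not have to be split by hand. This is just a sketch of the idea described above, not the Honey Pot Project's own code: the function name `obfuscated_mailto` is hypothetical, and cutting the address in the middle is an arbitrary choice.

```php
<?php
// Hypothetical helper: build a <script> snippet that assembles a mailto link
// in the browser, so the full address never appears in the HTML source.
function obfuscated_mailto(string $email): string
{
    // Split the address in the middle; the halves are stored in reverse order.
    $cut  = intdiv(strlen($email), 2);
    $head = substr($email, 0, $cut);
    $tail = substr($email, $cut);
    // json_encode() produces safely quoted JavaScript string literals.
    $parts = json_encode([$tail, $head], JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_AMP);
    return '<script type="text/javascript">var a = ' . $parts . ';'
        . 'var e = a[1] + a[0];'
        . 'document.write("<a href=\'mailto:" + e + "\'>" + e + "</a>");</script>';
}

// Example: prints a script tag containing the two halves "ample.com" and "abuse@ex".
echo obfuscated_mailto('abuse@example.com');
```

A bot that actually runs JavaScript will still see the address, so this only keeps out the simpler harvesters.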