Secure Apache: Out, Damned Bot!

Learn how to defend your Web server against abusive spiders and 'flies.'

The introduction of almost any technology is closely followed by attempts to figure out how to abuse it. As the technology matures, the methods of abuse become more and more sophisticated.

The Web is no different; almost as soon as people started publishing content, others began trying to figure out how to steal it. I'll call these people and their ilk 'perps.' As soon as pages became read/write instead of just read-only, perps began figuring out how to use them to publish their own content on other people's servers. Wikis and blog comments are two examples of this.

Describing ways of dealing with such abuses is the end goal of this series of articles, but I'm going to cover some more basic issues first.

Spiders and Flies

The tools and robots that crawl the Web looking for content (for whatever reason) are frequently called 'spiders,' or sometimes 'bots.' Some spiders are good, such as the Google bot, which loads the Google search engine with what it finds. Others have a much more questionable goodness quotient, such as those that search Web pages for e-mail addresses to add to spam lists, or look for trademark references so that the information can be sold to the trademark holders for possible lawsuits.

While the term spider is in common use, I've never heard anyone give a name to the other type of abuse: that of hijacking writable Web pages such as blog comments and wikis. I'm going to coin the term 'flies' for abusive tools of this type, since they cluster around and crawl all over pages, leaving flyspecks and crap on them.

Abuses can be handled proactively or reactively, or, I suppose, there's the third option of 'not at all.'

Proactive measures include SSL, user memberships, credential-protected pages, and scrutiny of submitted content (called 'moderation') before acceptance. As usual, a common result is that innocent users suffer because of the bad behavior of the perps, having to jump through hoops, click through multiple pages, and solve CAPTCHA challenges. (You know, like the obfuscated images of warped words you must type in to prove you're not a bot.)

Handling abuses reactively usually means you detect when someone misbehaves and enact restrictions that will prevent it from happening again. Doing this correctly can be an art, since making the conditions too narrow will let similar-but-not-identical abuses get through, while making them too broad can lock out legitimate visitors.

The Bouncer

When the toxic spider problem first surfaced, it took the form of simply gathering too much information (and thereby occasionally affecting server performance). It wasn't long before a solution appeared: the Robot Exclusion Standard (RES). It described the format of a file called "robots.txt" that you could put on your site to indicate which areas were available for crawling and which were not.

The RES is intended to stand at the door of your site and control access by the paparazzi, er, spiders. The idea was that legitimate bots would check for the file and obey its restrictions. Of course, it doesn't directly help block those that don't check the file, but it was quickly adopted by the Spiders in the White Hats and has become a fixture of today's Web.

However, a standard like this is a little like a traffic signal; it works only when people agree to abide by the rules. Spiders that don't abide by the rules can often cause crashes.

With a little cleverness, we can use the toxic spiders' RES non-compliance against them. To flog the analogy a little bit more, note that some municipalities have installed cameras to take photographs of malefactors who break the traffic laws.

We're going to do something a little bit like that to deal with these naughty bots. Consider these possibilities, listed in order of increasing nastiness:

1. The spider checks for robots.txt and doesn't crawl prohibited areas. (Good bot! Here, have a cookie.)

2. The spider checks for robots.txt, but doesn't comply with the restrictions.

3. The spider doesn't even bother to check for robots.txt at all.

4. The spider reads robots.txt, scans for 'allow' stanzas [1] that apply to other spiders, and then masquerades as those in order to access the protected areas.

The first case covers the Spiders in the White Hats, so we will not worry about it. Handling the others requires applying some intelligence to the process, which means recording what a particular bot is doing and making decisions based on its activities.

[1] The original RES didn't support 'allow' stanzas, and not all RES-compliant bots recognize them. However, the basic issue is the same even for 'disallow' stanzas: a bot with evil intentions can conceivably change its access by pretending to be one of those for which you have explicit rules.
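To make "recording what a particular bot is doing" a little more concrete, here is a minimal sketch, not a finished tool. It assumes you can get at each client's IP address and requested path, and it keeps the log in a plain array; the function names, the labels, and the '/private/' prefix are all hypothetical. Persisting the log between requests is left out entirely.

```php
<?php
// Hypothetical in-memory log: client IP => list of paths it requested.
// A real site would store this in a file or database (not shown here).
function record_hit(array $log, string $ip, string $path): array
{
    $log[$ip][] = $path;
    return $log;
}

// Sort a client into one of the cases above, given the paths it requested
// and the path prefixes that robots.txt disallows (an assumed list).
function classify(array $paths, array $disallowed): string
{
    $checked = in_array('/robots.txt', $paths, true);

    $trespassed = false;
    foreach ($paths as $path) {
        foreach ($disallowed as $prefix) {
            // True when $path starts with the disallowed prefix.
            if (strncmp($path, $prefix, strlen($prefix)) === 0) {
                $trespassed = true;
            }
        }
    }

    if (!$trespassed) {
        return 'no-violation-seen';   // case 1, or simply a human visitor
    }
    return $checked ? 'read-but-ignored'   // case 2
                    : 'never-checked';     // case 3
}
```

Note that a sketch like this can't spot the fourth case on its own: a bot that masquerades as another looks compliant from the log's point of view.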

Thoreau had a good idea when he advised us to "simplify, simplify!" Let's assume you have different restrictions for different bots, such as for Google versus Yahoo!. If your robots.txt file is static, it will need stanzas for each set of bot-specific rules, which means that all bots can see what their competitors' access rules are. (Can you see that I'm going after case #4?)

If it's a dynamic document, however, we can feed each bot only those rules that apply to it and it alone. The robots.txt that the bot will see is much simpler than our overall set of rules (that's where "simplify, simplify" comes in), even though it means a little more work being done by our server. One of the truisms of security work is that increased security always costs something. I say that a lot, so get used to it.
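As a rough illustration of feeding each bot only its own rules, the sketch below picks a rule set by looking for a token in the User-Agent header. Everything specific in it is an assumption made for the example: the tokens, the paths, the default rules, and the choice to match on User-Agent alone (which a masquerading bot can fake).

```php
<?php
// Hypothetical per-bot rule table; tokens and paths are illustrative only.
const BOT_RULES = [
    'googlebot' => ['Disallow: /private/'],
    'slurp'     => ['Disallow: /private/', 'Disallow: /images/'],
];

// Rules served to any client that has no entry of its own (an assumption).
const DEFAULT_RULES = ['Disallow: /'];

// Build a robots.txt body containing only the requesting bot's rules.
function robots_for(string $userAgent): string
{
    $rules = DEFAULT_RULES;
    foreach (BOT_RULES as $token => $botRules) {
        // Case-insensitive substring match on the User-Agent string.
        if (stripos($userAgent, $token) !== false) {
            $rules = $botRules;
            break;
        }
    }
    // 'User-agent: *' avoids revealing which table entry matched.
    return "User-agent: *\n" . implode("\n", $rules) . "\n";
}

header('Content-Type: text/plain');
echo robots_for($_SERVER['HTTP_USER_AGENT'] ?? '');
```

Because every response is a single 'User-agent: *' stanza, no bot learns which other bots have rules or what those rules are.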

For a first step, let's turn your existing robots.txt file into a PHP script that just returns its current contents. Nothing new and fancy, just making it dynamic in the most basic way. Where I need to mention server configuration directives, I'll use those for the Apache Web server. Make adjustments as appropriate for whatever server you're using.

1. First, make the server aware that robots.txt is a script and not an actual text file. Add the following to your httpd.conf file and then restart Apache.

<Files robots.txt>
    SetHandler application/x-httpd-php
</Files>

2. Edit your robots.txt file and add the following to the top:

<?php
header('Content-type: text/plain');
?>

3. Try to access it from your Web browser. If all is working properly, you'll see only the normal rules, not the PHP code segment you just added.

(If you're not familiar or comfortable with PHP, feel free to use some other scripting language of your choice. All my code examples are going to be in PHP, though.)

You should now have a basic dynamic robots.txt document. Go right ahead and play with it to see what you can do. You may actually want to work with a different file (called new-robots.txt or something like that), so that if you make any mistakes you won't screw up the rules spiders are currently using to crawl your site.

In my next article I'll go into much more detail about fleshing this basic script out to do some actual work. This one is primarily intended to raise your consciousness, get you started, and possibly spur you to do a little research on your own.
