At first, this seems like it should just work. If $second_ip == $first_ip, all is good, right? Well, not really. If you monitor the output of gethostbyaddr(), you will see that it often cannot resolve the address into a host name, and when that happens it simply returns the original string unchanged. Below, I have included a small sample of values I got from using gethostbyaddr() and gethostbyname().

| IP ($first_ip) | Host ($hostname)                      | Reverse IP ($second_ip)               |
|----------------|---------------------------------------|---------------------------------------|
| 66.249.65.232  | crawl-66-249-65-232.googlebot.com     | 66.249.65.232                         |
| 72.30.79.95    | llf531274.crawl.yahoo.net             | 72.30.79.95                           |
| 92.70.112.242  | static.kpn.net                        | static.kpn.net                        |
| 80.27.102.88   | 80.27.102.88                          | 80.27.102.88                          |
| 32.154.39.98   | mobile-032-154-039-098.mycingular.net | mobile-032-154-039-098.mycingular.net |

After looking at the table above, you can see that simply checking whether $first_ip == $second_ip is not a good test. 80.27.102.88 passes it, even though the reverse lookup never produced a hostname: the IP address just came back unchanged. In the case of a spam bot, the hostname will most likely be the IP address, so in effect we just whitelisted the spam bot. By the way, getting back just the IP address as the hostname is very common. It does not mean that the user is a bot. It only means the hosting company didn't register a more human-readable hostname. In that case we shouldn't blacklist them.
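One way to tighten the check is sketched below. This is a minimal sketch, not a definitive implementation: the function name is_verified_crawler() and the list of trusted domain suffixes are assumptions made for illustration. It takes the three values from the table as plain strings, so the lookups themselves (gethostbyaddr() and gethostbyname()) stay outside the function.

```php
<?php
// Hypothetical helper (not from the original post): decides whether the
// three lookup values identify a crawler we are willing to whitelist.
//   $first_ip  - the visitor's IP address
//   $hostname  - result of gethostbyaddr($first_ip)
//   $second_ip - result of gethostbyname($hostname)
function is_verified_crawler(string $first_ip, string $hostname, string $second_ip, array $allowed_suffixes): bool
{
    // gethostbyaddr() returns the IP unchanged when the lookup fails,
    // so a "hostname" that is still an IP address proves nothing.
    if ($hostname === $first_ip || filter_var($hostname, FILTER_VALIDATE_IP) !== false) {
        return false;
    }

    // gethostbyname() returns the hostname unchanged when it fails, so the
    // forward lookup must produce an IP that matches the original one.
    if ($second_ip !== $first_ip) {
        return false;
    }

    // Assumption: only hostnames under a trusted crawler domain qualify.
    $hostname = strtolower($hostname);
    foreach ($allowed_suffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');
        if (substr($hostname, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Assumed suffix list, for illustration only.
$crawler_suffixes = ['googlebot.com', 'crawl.yahoo.net'];

// Values taken from the first row of the table; in real use they would
// come from gethostbyaddr() and gethostbyname().
var_dump(is_verified_crawler(
    '66.249.65.232',
    'crawl-66-249-65-232.googlebot.com',
    '66.249.65.232',
    $crawler_suffixes
)); // bool(true)
```

With this version, the 80.27.102.88 row no longer passes, because its hostname is just the IP address. A failed check only means "not a verified crawler"; it is not a reason to blacklist the visitor.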

What is a blacklist vs. a whitelist? A blacklist is a list of hosts we always want to block, because they have been determined to be hostile. A whitelist is a list of hosts we always want to let in. For example, we would always want to let the Google, Yahoo, or Bing bots scan our site, so they should be on the whitelist. Everyone else is allowed in, unless they try something that is not allowed.
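The three outcomes described above can be sketched as a small lookup. The function name access_decision(), the return labels, and the sample list entries are all hypothetical and only serve to illustrate the idea.

```php
<?php
// Hypothetical three-way decision; names and labels are assumptions.
function access_decision(string $ip, array $whitelist, array $blacklist): string
{
    if (in_array($ip, $whitelist, true)) {
        return 'whitelisted'; // always let in, e.g. verified search engine bots
    }
    if (in_array($ip, $blacklist, true)) {
        return 'blacklisted'; // always blocked, determined to be hostile
    }
    return 'default';         // let in, unless they try something not allowed
}

// Made-up sample lists for illustration.
$whitelist = ['66.249.65.232'];
$blacklist = ['203.0.113.7'];

echo access_decision('80.27.102.88', $whitelist, $blacklist), "\n"; // default
```

Checking the whitelist first means a trusted host is never blocked by accident, which is one possible design choice rather than the only one.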

1 comment:

Er, sounds good, but how do you know that 80.27.102.88 is a spam bot if the test you're performing is intended to weed out the spam bots? What you're then saying is that you first need to know which are the spam bots in order to work out how to detect spam bots. That's a bit circular, isn't it?