I got 290ish trackback spams last night, and that’s after quite a few anti-spam filters. For some reason spammers think I’ll approve their spam through sheer volume. Well, they couldn’t be more wrong. In fact, I’ve been thinking of interesting ways to detect them. For those of you who don’t run blogs, trackback spam is when robots pretend to be other blogs linking to my site. My site picks up the POST request from the robot, which tells it a few things, like the link to the site, a title and some sample text. Trackback spam is difficult to stop because a trackback doesn’t act like normal traffic (even when it’s working normally). So today I came up with a few semi-clever tactics to end the madness.

The first is the IP address. This is one thing the robot cannot fake. The robot normally must run from the webserver that the trackback is coming from. If it isn’t, that’s a huge signal that it’s a robot. So what if I connect to the same IP address on port 80 and look for a webserver? If I don’t see one, I can be 99% sure it’s fake traffic. The only way that wouldn’t be true is if the site just temporarily went down or the server is on another port. Either way, do I really care?
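Here is a minimal sketch of that port 80 probe, in Python rather than whatever the blog itself runs on. The function name `has_web_server` and the 3 second timeout are my own assumptions, not anything from a real plugin. It only checks that something accepts a TCP connection, which is the test described above; it does not confirm that the listener actually speaks HTTP.

```python
import socket


def has_web_server(ip, port=80, timeout=3.0):
    """Return True if something accepts a TCP connection on ip:port.

    Hypothetical helper: the name and the default timeout are
    illustrative assumptions, not part of any existing tool.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when nothing is listening.
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A trackback handler could call `has_web_server(sender_ip)` with the address the request came from and drop the trackback when it returns False. Keeping the timeout short matters, since a slow probe on every trackback would give the spammers a cheap way to tie up the server.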

Next is the IP address of the link. The host in the link should resolve to the same IP address the trackback came from. Why would a site be doing a trackback link for some other website? That makes no sense, and therefore again is 99% spam. The only way the spammers could get around this is to temporarily spoof the DNS entry to my server, but even then they’d have to be running a webserver on that IP address. In this way, you can quickly exhaust the number of sites they can spam from, because they must run a webserver on each one to get it to work (which they do in less than 1% of the cases I’ve looked at thus far). And even then they must also link to that same server. That greatly increases the work of a spammer to even get a link to show up in my moderation queue, and I can simply ban that IP address going forward, since I know it is truly the same IP as the spam site that I don’t care to see anyway.
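The link check could be sketched roughly like this in Python. The names `resolve_all` and `link_matches_sender` are hypothetical, and the sketch only handles IPv4, since `socket.gethostbyname_ex` returns IPv4 addresses. It compares the sender against every address DNS returns for the linked host rather than just the first one, so a site with several A records still passes.

```python
import socket
from urllib.parse import urlsplit


def resolve_all(hostname):
    """Return the set of IPv4 addresses DNS gives for hostname.

    Returns an empty set if the lookup fails.
    """
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except (OSError, UnicodeError):
        return set()
    return set(addresses)


def link_matches_sender(link_url, sender_ip, resolver=resolve_all):
    """True if the host in link_url resolves to the sender's IP.

    `resolver` is injectable (an assumption made for testing) so the
    check can be exercised without real DNS lookups.
    """
    try:
        host = urlsplit(link_url).hostname
    except ValueError:
        return False
    if not host:
        return False
    return sender_ip in resolver(host)
```

For example, `link_matches_sender(trackback_url, sender_ip)` would be False for a robot on one machine advertising a link to a site hosted somewhere else, which is the case this test is meant to catch.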

It’ll be fun writing the software. They spammed the wrong guy 290 times!

This entry was posted on Wednesday, March 21st, 2007 at 8:58 am and is filed under spam.

19 Responses to “Tracking Back The Trackback Spam”

And you could start storing/blocking the source address for x number of days, since you can be sure they would probably try to spam again from that IP. Of course, keeping it indefinitely may not be the best thing, but keeping it for, say, 60 days might do the trick.
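A time-limited block like that could be sketched as follows. This is only an in-memory illustration in Python; the class name `TemporaryBanList` and the injectable `clock` are assumptions of mine, and a real blog would presumably keep the entries in its database so they survive a restart.

```python
import time

# 60 days, the figure suggested above, expressed in seconds.
BAN_SECONDS = 60 * 24 * 60 * 60


class TemporaryBanList:
    """Remember banned IPs and forget them once the ban expires."""

    def __init__(self, ttl=BAN_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be tested
        self._expires = {}       # ip -> timestamp when the ban ends

    def ban(self, ip):
        """Ban ip for ttl seconds, starting now."""
        self._expires[ip] = self._clock() + self._ttl

    def is_banned(self, ip):
        """True while the ban is active; expired entries are dropped."""
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[ip]
            return False
        return True
```

Banning the same address again simply restarts its 60 days, so a spammer who keeps coming back from one IP stays blocked.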

What we really need to do is start classifying the people we block and putting that list online. Maybe we don’t need to block their IP address entirely but we could block them from submitting any comments/trackbacks indefinitely. They still have email as a remediation.

Also, when you compare IP addresses of the trackback sender and the link target - I hope you mean comparing subnet parts of the address? There is such a thing as load balancing, IP addresses don’t have to match exactly…

@Gaz - that would slow down the server, leaving ports open like that, although I like the concept.

@drew - I’ve stopped downloading new versions of WordPress, so although I use the base framework my code is getting more and more divergent.

@Wladimir - I really couldn’t care less about proxies. They can turn it off if they really absolutely must have their link on my page. Trackback links are a feature, not a right. And actually no, I wasn’t talking about subnets, I was actually talking about looping through the list of all possible IPs used by the DNS (including failover). It’s better than subnets since some companies load balance across subnets. Look at gethostbynamel() in PHP to see what I mean.

Trackbacks are just a way for websites to tell other websites that they are talking about them, specifically so you can know who is linking to your blog and give them some reciprocal traffic if it’s interesting enough to your users to follow the link and the snippet of text associated with the trackback.