Snook.ca

How I built an effective blog comment spam blocker

Mention comment spam and most people, in particular those crazy WordPress users, mention Akismet. It's a great tool and I have nothing against it, but I wanted to build my own and avoid the external call to the Akismet service. What has been interesting to see is just how effective it is. Turns out, my spammers are quite obvious.

As you might see, I don't use CAPTCHAs and I don't use JavaScript detection. I just use a number of rules that validate each comment on the server. Oh, and I don't use nofollow.

Points System

I use a points system, an idea I got from Movable Type, whose spam protection is also based on points. For everything in a comment that I like, you get a point. For everything I don't like, you lose a point (or two, or three). If you get a 1 or higher, you've made it on the site as a valid comment. If you get a 0, it's set for moderation and I'll take a look at it. If it's below 0, it's marked as spam and I'll never see it (although I check every couple of weeks just in case a legitimate comment needs to be unflagged). If it falls below -10, I don't even bother saving it to the database since it is so obviously spam.
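The thresholds described above can be sketched in a few lines. This is only an illustration in Python, not the site's actual code; the function name and the status labels are made up for the example.

```python
def classify(score: int) -> str:
    """Map a comment's total score to what happens to it.

    The thresholds follow the description in the text; the status
    labels themselves are hypothetical.
    """
    if score >= 1:
        return "approved"   # shows up on the site right away
    if score == 0:
        return "moderate"   # held for a manual look
    if score >= -10:
        return "spam"       # saved, but flagged as spam
    return "discard"        # below -10: not even saved to the database
```

A comment that scores exactly -10 is still saved as spam here, since only scores that fall below -10 are dropped.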

Types of Spam

There are two main types of spam: automated and manual.

Automated spam is the most obvious. Bots try to pull a number of tricks, and they stand out when you see the same message posted a dozen times within seconds of each other. Automated spam is also the easiest to catch. So insanely simple that just a few rules would catch about 95% of all comment spam hitting a server. (That percentage may even be higher... I'm just guessing.)

Manual spam, on the other hand, is more devious. People actually try and respond to the article at hand, which makes it slightly harder to catch. I say slightly because the vast majority of manual spammers do such a poor job at leaving a comment that they stand out like a sore thumb. The remaining few are usually the ones you end up filtering by hand.

Quick Solution

The quickest way to reduce the amount of comment spam you get, one that doesn't require any server-side programming and is built into almost all blogging tools, is to simply turn off comments on a post after a certain amount of time. It works quite well, and here are the two major reasons why:

Automated spam works from a database of pages to submit to. If the form is no longer there, you don't get spam. Spammers are forced to discover new pages to spam.

Manual spam often tries to hit pages that have higher page ranks. There are plenty of search engine tools to help people look this information up. (I'd actually see referrers from these search tools, followed shortly by a new blog comment.) Higher page ranks tend to go to older, popular posts. With the comment form shut down, manual spammers are left to target newer pages in the hopes of getting missed until the page gets a higher ranking.

I've had old posts where I left comments open for years and would still see users come across them and add to the discussion in meaningful ways. I loved that. However, that almost never happens now. So, I finally gave in and just close comments.
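For anyone rolling their own blog software, the time limit is easy to sketch. The 14-day window below is an assumption (the post only says comments stay open "a couple weeks"), and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed window; pick whatever suits your own posting rhythm.
COMMENT_WINDOW = timedelta(days=14)


def comments_open(published_at: datetime, now: datetime) -> bool:
    """Return True while a post is still young enough to accept comments."""
    return now - published_at <= COMMENT_WINDOW
```

The same check can decide both whether to render the form and whether to accept a submission, so a bot posting straight to an old URL gets turned away too.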

The Rules

A blog comment has five fields, and I test each one separately and in various combinations against a set of rules. The fields are: body, email, author name, url, and ip.

Once you have a database of spam messages, you can observe certain patterns. In checking some information from time to time, I discovered some interesting stats:

Body length

Write something of consequence. If it's less than 20 characters, you obviously don't have much to say.
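As a rough sketch, a rule like this is just a function that returns a point adjustment. The 20-character limit is from the text; the size of the penalty is an assumption, and the names are hypothetical.

```python
MIN_BODY_LENGTH = 20     # from the rule above
SHORT_BODY_PENALTY = -1  # assumed value; tune it against your own spam


def score_body_length(body: str) -> int:
    """Penalize comments whose body, ignoring surrounding whitespace, is too short."""
    if len(body.strip()) < MIN_BODY_LENGTH:
        return SHORT_BODY_PENALTY
    return 0
```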

URL matches

Most people who include a URL usually have a top-level domain or a subdomain that they use. They're not using querystring parameters or any other crazy URL structures. And I'm sorry to all the German, Polish and Chinese readers, but a few of your fellow countrymen aren't being very nice.

URL Length

URLs that are longer than 30 characters are almost always spam. This ties in with the last filter. If you've got a URL, it's short, sweet and sexy. It's not crazy long — although I have seen some crazy long, perfectly legitimate URLs.
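The two URL rules could be sketched together like this. The 30-character limit comes from the text, and the `.html`, `?` and `&` tokens and the `.de` rule are mentioned in the comments below. The `.pl` and `.cn` entries and every point value are guesses for illustration, and all names are hypothetical.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 30
SUSPICIOUS_TOKENS = (".html", "?", "&")  # deep links and querystrings
PENALIZED_TLDS = (".de", ".pl", ".cn")   # assumed list, adjust to your own spam


def score_url(url: str) -> int:
    """Return a point adjustment (zero or negative) for the comment's URL field."""
    url = url.strip()
    if not url:
        return 0  # leaving the URL blank is not penalized in this sketch
    points = 0
    if len(url) > MAX_URL_LENGTH:
        points -= 1
    if any(token in url for token in SUSPICIOUS_TOKENS):
        points -= 1
    # hostname is None when the URL has no scheme; treat that as no match.
    host = urlparse(url).hostname or ""
    if host.endswith(PENALIZED_TLDS):
        points -= 1
    return points
```

Keeping each check to a point or so means a long but legitimate URL only nudges a comment toward moderation instead of banning it outright.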

Body matches

It may seem like I'm being overly severe on people who start their comments like this but it's a very specific pattern that I'm matching. I was getting 10 to 20 hits of the same message coming in. It was just easier to match the messages and essentially ban them.
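A pattern match on the start of the body might look something like the sketch below. The word list and the -10 penalty are taken from what commenters mention further down, not from the actual filter, and the real pattern is described as much more specific than this. Names are hypothetical.

```python
import re

# Illustrative openers only; the author's real pattern is narrower.
BAD_OPENERS = re.compile(r"^\s*(interesting|sorry|cool)\b", re.IGNORECASE)
BAD_OPENER_PENALTY = -10  # figure quoted by commenters


def score_opening(body: str) -> int:
    """Heavily penalize bodies that begin with a known spam opener."""
    return BAD_OPENER_PENALTY if BAD_OPENERS.match(body) else 0
```

Because the pattern is anchored at the start and requires a word boundary, "I found this interesting" and "Coolness abounds" both pass untouched.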

Random character matches

The other thing I noticed was email addresses or author names that were just a random string of characters. If there are no vowels, sure, you might be Polish, but it's more likely that you're spam. Rarely do even the Polish have 5 consonants in a row!
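The five-consonants check fits in a single regular expression. This is a sketch, not the site's code: Y is left out of the consonant set (as a reply further down says the real check does), and the function name is hypothetical.

```python
import re

# Consonants only: a, e, i, o, u and y are all excluded from the class.
FIVE_CONSONANTS = re.compile(r"[b-df-hj-np-tv-xz]{5}", re.IGNORECASE)


def looks_random(text: str) -> bool:
    """True if the text contains five consonants in a row."""
    return bool(FIVE_CONSONANTS.search(text))
```

Run it against the author name and the email address; digits, dots and the @ sign all break a run, so ordinary addresses rarely trip it.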

Effective?

How effective has it been? These days, I only see a new spam message get through maybe once every week or two. It's usually a message that somebody has hand-typed to be relevant to the page, but the comment is near useless and the author name is most evidently spam.

I've also reworded the disclaimer text under the submit box to let people know that I'm actively on the lookout for spam and that even legitimate comments will get edited or marked as spam if they abuse the system. This lets people know (those who like to leave signatures on a blog comment, say, or who use their company name as their author name) that being underhanded will not be rewarded.

Despite my past frustration with spam, things are at a point now where I'm happy to leave comments open on recent posts for a couple weeks and then just close them up and never have to worry about them again. It certainly isn't the death of comments I thought it might need to come to.

Conversation

Very cool. I wonder how hard it would be to hook into SpamAssassin's Bayesian filter to add some additional checks. Still, I love the idea.

I wrote a text-based challenge/response. It asks questions like what's 2+2, etc. I wrote it for nucleuscms and added it to php form mail. I'm currently adapting it to WP as well. It uses a rotating key for the question id, so that every time a question is answered a new key is generated, and no key is ever used twice. I only have 4 questions, and in a year of running it has stopped 100% of the automated spam. I'd love to add some rules to catch the hand-written stuff. I'll certainly be looking at your idea here for that.

so, if i were to start a comment with "interesting post, but i disagree that..." i would be permabanned from commenting again and my comments wouldn't even go to the database in the future? that hardly seems fair.

also, by posting this aren't you helping the spammers learn what type of comments they need to post in order to become trusted users?

@kyle: I tried to qualify the word filter by saying "there's a very specific pattern I'm matching." Suffice it to say, 99.9% of average users will not set it off. As to whether I make it easier for spammers, it's possible but very few of the rules would really make much of a difference. There's only a couple and they were added to handle edge cases (like URLs longer than 30 characters).

The keyword matching is probably the easiest to catch and the hardest for spammers to get around since they need to use keywords to gain google juice.

I find this very interesting. I have been looking for a method to block comment spam for a while, and though my developer partner and I have had some success, it is relatively limited. Our new CMS, Ministry(Starter), will have an Akismet plug-in, but it would be nice to have an in house system too. Thank you for your relevant articles, Snook.

I've got a pretty effective system I built (which includes Akismet, but also some local stuff), but this is freaking great. I think I'll definitely build a very similar scoring system into my Django-powered CMS. Thanks so much for the inspiration, Snookums! :)

@Olivier: I don't like to do that for accessibility and aesthetic reasons. This page is still just as perfectly usable without CSS or JavaScript. Basically, I never put the onus on the user to solve the spam problem, I put it on myself as the operator of this site.

I like this post. The algorithm you've put together here is quite intriguing, though I'm a bit worried about a couple of the filter rules.

For instance, my URL is more than 30 characters, simply because I'm using a subdomain at a free blog host. Does that necessarily mean I'm a spammer? No. I've been thinking about getting my own domain, but -- and this is where it gets interesting -- my ideal registered domain would be 7+3+1+14+1+3+1=30 characters exactly, including http:// and a trailing slash.

I suppose I should be thankful that you don't seem to be filtering out *.blogspot.com completely, as I've seen suggested elsewhere on the 'Net.

Regarding your filtering URLs containing .html, ?, or &, does that apply to the comment body or just the URL field? Some blog platforms and content sites (PC World for example) end their pages in .html, and/or use query strings to retrieve the correct page, so I'm just curious about that.

But I disagree with your blocking of .de TLDs (maybe because I am German?). I never saw .de URLs in my blog's comments... But it is what it is: if you saw spammers like this, it's perfectly OK, and the rest of your rules seem like a great starting point.

I used to have terrible problems with comment spam (upwards of 400 messages a day on my homebrew CMS). Akismet didn't help, flagging as many genuine messages as spam as it let actual spam through. I've been slowly tweaking my own system that uses local page elements, which I know you didn't want to use. Having said that, in the past 6 months I've had over 25,000 spam attempts and 12 have gotten through (http://hybridlogic.co.uk/journal/63/comment-spam-follow-up). The only spam that gets through now is the really short messages; however, I have a lot of visitors leaving short, often one-word, replies, so I didn't want to add a character count limit.

I like your filter combinations though, I'm tempted to update mine to assign points to each check now. Thanks for the article.

You know, it's a really great idea to use a points system. But I think it could be even better if the rules weren't so simple. For example: if a comment contains 2 spam keywords, it would lose not 2 points but 3. The more rules are broken, the bigger the coefficient.
You know, it's like 2+2=5 (synergy), but in the opposite way.
Maybe fuzzy logic could be good for that purpose too.
Well... thanks for the great article. Now I know what my first thing to do after passing exams will be :)
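One way to read the escalating-penalty suggestion above is a multiplier that grows with the number of rules broken. The sketch below is purely illustrative: the 0.5 step is picked only so that two one-point hits add up to -3, as in the example, and all names are hypothetical.

```python
import math

ESCALATION_STEP = 0.5  # assumed; chosen so two 1-point hits cost 3 points


def escalated_penalty(penalties: list[int]) -> int:
    """Combine per-rule penalties (negative ints) so each extra hit costs more."""
    if not penalties:
        return 0
    multiplier = 1 + ESCALATION_STEP * (len(penalties) - 1)
    # floor() rounds toward the harsher (more negative) score.
    return math.floor(sum(penalties) * multiplier)
```

A single broken rule is unaffected, so the escalation only kicks in for comments that already look suspicious on more than one count.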

Cool. I'm Polish but I will try to pass thru. We use Akismet and so far it works as advertised. I know it doesn't work for some people. I'm curious how it is possible for a company which specializes in fighting spam to develop a product which is worse than one man's work.

On the other hand, there is no way to build a 100% effective automated spam filter. Building an automated bot that bypasses any filter (even one unknown to the bot) with, say, 10% efficiency is not a hard task. And 10% is OK for spammers these days (take a look at email spam, which has circa 1% or less). The key is (and always was) to mimic a real person. There are generally two things you have to consider:

1. Using a browser engine (you have to be able to execute js in order to bypass filters based on execution of js), all the bad guys should learn WatiR / Selenium and similar testing tools.

2. Writing a message that is statistically legit.

There are not a lot of these advanced bots in the wild now and the only reason is that old-school one-file .php bot-scripts are still effective.

My conclusion is that, no matter what, we will spend some time in our lives marking those viagra and levitra messages as spam by hand and marking messages like this humble one as non-spam. How many points did I get? -9? :)

It got me thinking. If a certain e-mail address has made lots of comments (like yourself; maybe you don't even run your own comments through the filter), wouldn't the hit on the db be fairly noticeable? (At least if you're using active record; if you manually search for it and use mysql_num_rows(), I would assume the effect gets smaller.)

If you were to expand this idea you could set up a database of emails with a karma score, where you add up the total karma for a certain e-mail address. If you aren't in the database and make a spam comment, you instantly get ip-banned.

Nice work Jon! I'm one of those "Crazy wordpress users" you talk about and at the moment I don't use any spam blocking but am looking around for the best option. I only get a couple of pieces a week but it is on the rise.

What's weird is, on my site, it seems like only one or two posts attract 99% of the spam. There's nothing special I can see about those posts... I've looked and they're not fundamentally different from all my others. It must be the subject matter.

@Voyagerfan5761: Yeah, the lengthy blogspot domain did get you moderated. :) A lot of spam comes from blogspot, which is unfortunate but that's why I try not to outright ban. On the plus side, now you won't get moderated.

@Michael Siebert: most spammers coming from a .de domain were manual spammers who were targeting the site. I know 456bereastreet.com has a similar .de rule. Admittedly, I haven't noticed it to be much of a problem as of late so it might be something I reconsider.

Looks like a good approach to me, although being German I find it sad that a .de domain lowers the score. Plus, I think -10 points for a body starting with the wrong words seems a little extreme. Anyway, I was looking for a good scoring system, might give this a try with some minor modifications.

Jonathan, these seem to be good heuristics -- and they must be if they're working so well on your site. Phil Haack has written several articles about invisible captcha techniques including a honeypot captcha technique that hides an input field to visitors via CSS then catches bots when they fill it out.

interesting, actually the whole first thing about time constrained comment periods does in fact eliminate 95% of it I have found. I would also like to add that restricting the number of comments that an IP can make to 1 per 5 minutes helps a lot too.
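The per-IP limit suggested above could be sketched as follows. This is an in-memory illustration only; a real site would keep the timestamps in its database or cache. The class and method names are hypothetical.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(minutes=5)  # one comment per IP per 5 minutes


class IpRateLimiter:
    """Remember when each IP last posted and refuse anything sooner than MIN_INTERVAL."""

    def __init__(self) -> None:
        self._last_seen: dict[str, datetime] = {}

    def allow(self, ip: str, now: datetime) -> bool:
        last = self._last_seen.get(ip)
        if last is not None and now - last < MIN_INTERVAL:
            return False  # too soon; the stored timestamp is left unchanged
        self._last_seen[ip] = now
        return True
```

Instead of rejecting outright, the same check could simply subtract points, which would be gentler on people who post a quick correction to their own comment.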

Great writeup, Jonathan. I really like the +1 per previously approved comment rule. However, don't you fear that being a potential backdoor for manual spammers? They could use the frequent commenter's email address.

Carl, Snook has already expressed dislike for that technique in his 2nd comment.

rb, I strongly oppose that rule. Often visitors make 2 comments in a row to add something they've forgotten or to correct themselves if they notice something after they posted.

Ok, now for real: my previous two comments went into moderation. This method is pretty effective, although I think you still have quite a lot of comments to moderate, especially when you post something that attracts a lot of new users.

@Mislav Oops. I didn't catch that. Honeypot captcha works without CSS as long as you place a "leave this blank" message next to the field. I don't see that as placing an onus on the user, but since brain cells are actually invoked to bypass the field I suppose that counts as an onus to some.

@Neil: I think the Y technically counts as a vowel. (you know, "sometimes Y") I don't include Y as a consonant in my checks.

@Mislav: An email address of a popular poster could be used as a backdoor attack but a spammer would need to know a popular email address being used specifically on this site. A spammer could post a couple valid messages and then use that but they're still limited based on a number of the other rules. And once I see a spam message and flag it as spam, they're done. Basically, it's a lot of work for little gain.

@Robin: Generally I don't have to moderate many messages. With this post, of course, some people are trying to check the filter and are getting moderated accordingly but for the most part, it's pretty reliable.

Your method seems so simple, but it makes so much sense. As a TextPattern user, I've always relied on spam blacklists (such as Akismet or Spamhaus). For the most part they seem to work alright, but every once in a while I find a domain blocked (such as my workplace, off and on) that seems like it shouldn't be. I don't receive nearly as much traffic as you do so it's rarely a problem, but with a method like this, I'd no longer have to rely on other blacklists to take care of the situation. Can someone say TXP Plugin...

I don't know if it's really possible to eliminate human generated spam (since they could just tweak a comment until it passes), but I've had similar success against bot generated spam, with a WordPress plugin I wrote that uses a similar set of rules.

At the moment, I'm trying out a slightly different methodology to blocking the spam, and that is to be slightly more aggressive in blocking the comment spam. Then if a comment gets flagged as spam, a "second chance" screen is presented with a captcha that allows the comment to be verified. So far, it seems to be working quite well, and it basically eliminates false positives (tho not necessarily human generated spam).

I'm a crazy Wordpress user/developer myself and although Akismet works great for my own blog, I can imagine with a higher volume of SPAM an annoying number of SPAM comments would still get through. However, you could always write a Wordpress plugin that hooks after Akismet and applies further filters to comments that pass Akismet. Or if you'd rather not reinvent the wheel, you could use an existing strike-counting plugin like Spaminator.

If you wanted to get really crazy, you could feed comments to Spamassassin and you could just tweak Spamassassin's settings to work against your SPAM. In fact, I wouldn't be surprised if Automattic leverages Spamassassin for Akismet.

Without knowing the rules I would have triggered the -10 points "starts with" rule. Heh.

The honeypot thing Oliver D. mentioned works pretty well. In order to keep it accessible you can label it accordingly. You can also do it the other way around (i.e. one field which contains some random garbage to begin with and shouldn't be changed).

The ".de" rule surprised me a bit. With a German domain everyone can look up who you are and where you're living. I would just give em a call (it's almost 2 am right now haha). Additionally, I would file a complaint.

Another thing you can check is the existence of invalid markup. They will often try to use bb code and html at the same time.

@Ben Hirsch: my class is tied into my own blog structure and CakePHP since it makes a few DB calls to determine things like previous email usage counts so I have no plans to open this up beyond what I have detailed here. Sorry.

Pure, unadulterated, Snook Genius. :) I see you've put a lot more thought into blocking spam than I have. When the classic MT-Blacklist started to fail for me (and I got tired of adding variations of the word v1agra) I set up my silly "Type the following letter" captcha - if you can call it that. Somehow that simple trick has almost completely eliminated spam for me.

I know we've briefly traded emails about this before, but it's really nice to get a breakdown on the specifics of the scoring system you're using now - thanks for the insight!

I've personally been using a combination of the Akismet library, but I think I may have to use your system as some inspiration to create my own filter... thus negating the need to connect to the Akismet server.

I kinda like this idea, but it seems like it would take quite a bit of work to keep up to date on all the different ways people spam you - different domains, keywords, etc. I enable Akismet on all client blogs and websites because I know it works, and it requires no work on my part or the clients'. It is just easier for the clients and that means I have less to deal with.

However, your idea is still rather interesting to me and I wonder if such an algorithm could ever become public.

This is probably the most efficient looking system of blocking spam I have ever seen. On my site right now, I have gotten over 3000 spam messages left in my comments on one day.

In my current redesign of my site, I am going to incorporate your method for blocking spam bots. That is a brilliant idea to look at the patterns too, I never would have thought of something like this!!

I thought I'd seen all the ways of dealing with spam but this is definitely a new one to add to the list. On my blog I've been using reCAPTCHA, yeah it makes the user type the captchas but it also contributes to digitizing books so I don't feel as bad about using it.

I want to commend you on all the custom solutions you create, it's very easy to get caught up in all the plugins and frameworks out there.

Kind of like a home-brewed SpamAssassin, eh? Love it. I might've done something like this in my WordPress install (yes, I'm one of the crazy ones) as a followup to WP-Gatekeeper but I've had phenomenal results with a combination of Akismet (which has blocked, as of the moment I post this, 678,928 bits of spam), a plugin that warns users when their comment is Akismetted, and the "hold posts from new e-mail addresses" setting built into WordPress. The convenience is worth the external call to the Akismet service for me. It's worked so well that I don't even use Gatekeeper on my own blog any more.

This is definitely some cool stuff that I might look at integrating into my own blog platform (it would be interesting to see a low level design since you have decided not to release it though ;-).

I've been looking at a custom solution for a while now, mainly because I think that my own email address might be blacklisted in Akismet. After posting on 37signals on a regular basis, my comments stopped showing up there (I'm guessing because I'm pro-Microsoft, I can't think of any other reason that they might have blocked me since my comments usually contribute to the discussion), and after that happened my comments stopped showing up on other blogs too (I think one of them was Jeff Croft's) leading me to believe that I got blacklisted for no apparent reason.

I've just built my own custom CMS in Rails for my own site and have been thinking about preventing comment spam. I had seen an article on integrating Akismet with Rails but this has inspired me to develop my own filter.

Is there any rationale behind the points you've assigned to each rule or is it nothing beyond analysis of spam you've received previously? (Not to demean the time and effort that went into such analysis of course!)

@Robin: the points were tweaked based on spam analysis. It's nice when you get a large collection of data and can perform various queries on it to see what comes of it. The less than 20 characters and the URLs longer than 30 came out of that analysis. I could count how many comments would get flagged based on those criteria. If a particular pattern was almost certain to be spam, then I can give it more weight by increasing the points. If it's more of a grey area, it uses fewer points.

A very interesting read. I love the simplicity of the rule set you've laid out, although I'm a bit wary of the body match rule: I feel that might falsely trash good comments. I don't know if someone has mentioned this or not (don't have the time to read all 60 previous comments - you're becoming so popular snook!) but what would be cool is if your blog you've built from CakePHP could keep a record of a user's accumulated points, so a less strict rule set could be applied against them for previous good behavior. In any case you've got me thinking! Thanks.

Thanks for your ideas.
Other points to consider:
- First, make sure the form was posted from a browser.
- Make sure the form was indeed submitted via POST.
- Check the host names from which the form is authorized to be posted.
- Attempt to defend against header injections: "Content-Type:", "MIME-Version:", "Content-Transfer-Encoding:", "bcc:", "cc:"

And here I was thinking of starting to use a new email (since the old one is quite outdated) ;) Well, the algorithm for catching spam is really interesting. It's quite different from getting CAPTCHAs to work for you (like the 2+2 or the enter-the-characters-in-the-image kind), but I am a bit doubtful about the length of the links as well as the intro text. I find it quite unfair to entirely ban "Interesting, sorry, cool" because I feel that soon spammers would skip the word and just go to "I think...". Maybe -5 instead?

I like the idea of +1 per approved comment, because that's what the internet has become more about: earning trust, fame, etc., instead of getting everything instantly for a couple of days, after which either your well-earned fame continues or it sinks. Great system overall though!

Okay, I just need to make this longer than 20 words and I'll deflect Snook's spam patrol. Let me just put on this deflection suit (with +5 agility and defense).

The point-based system of moderating comments is great, especially the way in which you've broken it down; clear as salt water. As for spammers becoming smarter, sure they will, they do every day and I wonder if they'll ever create a script that'll add up the mathematical captchas to push the comment through? Those are easier to break (maybe) than your run-of-the-mill captcha.

Whatever the case you've built some great tools to circumvent the issue and every day someone keeps at it means sending these goldminers to the hills, not permanently but better than what we were dealing with just 2 years ago.

Okay, so let me give this a try. I have opted not to use my blogger's address as that's a possible filter. This is longer than it should be.

But I really honestly like the idea and am considering stuff like this to put into my own site - which right now is nothing, really, it's nothing, I just got the domain and sat on it for a few years - I'm still sitting on it.

I'm a programmer at heart but I've been trying to learn php and cms since, maybe, 5 to 10 years ago. And I like "english-code" like this, or pseudo-code so a human being can understand it, and use whatever server side thing to actually implement it.

Math captchas are funny though. You should have Jeopardy style captchas or "Are you smarter than a 5th grader" questions. Sorry, you need a higher IQ to make a comment.

The intelligent spammers will figure out your database of questions, but it will take them awhile.

Just a question Jonathan: Is it really worth it? Maybe I'm missing something, but why in the world would you need to go through all of this to keep some "spam" comments out? If it is manual comments, on recent blogs, why would you feel the need to make sure that "nobody slips through"? I'm not asking this to be a jerk. Obviously I don't understand something. There must be a good reason for people spending hours every day to keep spam out. What is that reason??

You forgot some very typical words. If the words cheap, cheapest or buy are in the URL, the score must be -20. I delete them all without reading. But you are right about the German guys. It is very sad what's going on here.

While I must admit closed comments are annoying, I totally understand why you do it now. I think the way you rate spam is quite logical although, dependent on the blog and its topics, a few of the rules would have to be altered (such as the number of links in a post).

I currently use Akismet for my wordpress blogs - it's okay but sometimes a couple of legitimate posts get in there somehow.

Excuse me. What more felicity can fall to creature, than to enjoy delight with liberty.
I am from Canada and now study English, tell me whether I wrote the following sentence: "The independent service for resolving disputes between consumers and financial firms."

Hey. If you really do put a small value upon yourself, rest assured that the world will not raise your price.
I am from Ukraine and learning to write in English, give true I wrote the following sentence: "Resume? Learn about effective design and layouts, what to include and what to leave off your resume."

Y'know, I actually had to build a sort of spam filter for my Machine Learning class at school. We had a big sample of spam messages and non-spam messages, with word frequencies (like, how many times the v-word appears -- I'm afraid to write it here). We then used an SVM (http://en.wikipedia.org/wiki/Support_vector_machine -- oh God, am I over the 30 char limit? :p) to split the messages into spam and non-spam. Using your idea for moderation, you can flag things close to the line as awaiting moderation, and things further from the line as clearly spam or non-spam. I'd also use other dimensions like the ones you mentioned -- author's name, and URLs and the characters within... you just need to collect a bunch of spam, and then you can train it off-line. You might have a little fun writing a classifier in PHP though. Would be a fun little project for me to try some day....

I wanted to make public in to repulse you in quittance looking tailored this capacious look during the course of!! I some conditions ago enjoying every dialect trig soupçon of it I gomerel you bookmarked to juxtapose turn up broken of the closet far-off pronounced screw up you recording

To address the SpamAssassin question: SA is made for e-mail. A lot of checks are only valid for e-mail, so even if you'd use it, you'd have to handpick your ruleset. You'll end up with a handful of rules the most powerful being the Bayesian classifier, which can be very easily implemented in any web programming language as well. Drupal even has a module for it.

Anyway, the ruleset is very nice, creating an update service for such rules (like sa-update) would be nice.
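To give a feel for how small a Bayesian classifier like the one mentioned above can be, here is a minimal naive Bayes sketch in Python. It is not SpamAssassin's implementation or the Drupal module. The class name, the tokenizer and the smoothing choices are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny word-count naive Bayes with two labels: "spam" and "ham"."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        words = self.tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so an untrained filter still returns 0.5.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothing; the extra +1 is a slot for unseen words.
            denominator = total_words + len(vocab) + 1
            for word in words:
                log_p += math.log((self.word_counts[label][word] + 1) / denominator)
            log_scores[label] = log_p
        # Normalize in log space to avoid underflow on long comments.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        ham = math.exp(log_scores["ham"] - top)
        return spam / (spam + ham)
```

Trained on comments that have already been flagged by hand, the resulting probability could feed the points system as one more rule, for example by subtracting points when it is high, instead of making the decision on its own.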