Version 3.0

29 May 2007, 10:27 AM

3 Comments

Email spam is bad, but blog spam is even worse. When a junk message arrives in your inbox, only you see it; when a junk message gets posted to your blog, anyone who visits can see it. Spammers use automated programs to post comments on blogs, partly to attract people to their sites, but mostly to improve their rankings in search engines like Google. Comment spam was so frustrating, and stopping it so elusive, that I turned off comments for a year. When a new version of MovableType came out that offered better spam protection, I turned comments back on. The filtering worked, and it was good.

Nothing good lasts forever. A few months ago, bunches of spam comments started getting through the filters. Spammers are like viruses: they adapt. Before long, I had comments all over my site that I was embarrassed to read, much less have other people read. To give you a sense of the volume that we’re talking about, in the last three years I have received 191 legitimate comments. In the last seven days I have received 1,141 spam comments. Here’s what this last week looks like:

And if anything, the spammers are speeding up. In three years, I’ve received 17,431 spam comments, which is about 110 a week on average. In the last seven days there were over a thousand.
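As a rough check on those figures, here is a small Python sketch that works out the long-run weekly average and how far the last week exceeds it. It uses only the numbers quoted above; treating three years as 156 weeks is an assumption made for the arithmetic.

```python
# Figures quoted in the post.
total_spam = 17_431      # spam comments received over three years
last_week_spam = 1_141   # spam comments received in the last seven days

# Assumption: three years is treated as 3 * 52 = 156 weeks.
weeks = 3 * 52

avg_per_week = total_spam / weeks
speedup = last_week_spam / avg_per_week

print(f"long-run average: {avg_per_week:.0f} spam comments per week")
print(f"last week vs. average: {speedup:.1f}x")
```

The average comes out a little above 100 a week, and the last week runs at roughly ten times that rate.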

What to do? First, I resolved not to turn off comments again. I have slowly but surely begun attracting feedback; I don’t want to squash what little progress I have made by eliminating feedback completely. There are services that make readers sign up for an account to comment, but the associated overhead for my readers is enough to discourage activity. My default spam filters were catching 97% of junk comments, which is pretty good, but the remaining 3% meant that five or six comments showed up on my site each day, most of which were X-rated. I don’t have daily access to the internet, so comments often accumulated and remained on the site for days. While 97% may be an A in school, it’s not good enough in this case. I thought I could add some rules based on my own observations to get my spam filters back on track for near 100% accuracy. I went wading in the spam pool to see what I could find in the comments that were slipping through. Here are some samples.

I noticed one common thread was some compliment, e.g. “awesome site!!”. Spammers, you flatter me. I take pride in my site, but it’s a little minnow in the giant ocean of the Internet. Awesome? You have me confused with FlashEarth. Cool? Maybe you meant Google’s GapMinder. Perfect? I can’t even comprehend the metaphysical implications of a perfect website. The truth is that I am normal, and I added a spam filter to take advantage of being average. Now every comment that contains “perfect,” “nice,” “good,” “great,” “cool,” or “awesome” followed by “site” goes right into the trash. Amazingly enough, that rule covered 252 of the 1,141 in the last week. These spammers praise early and praise often. No more; now the praise falls on deaf ears. I also blacklisted a handful of recurring words, some of which I don’t care to write here, and others like “slot machine,” “credit card,” “ambien,” “cialis,” and “hardcore.”
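The two rules described above can be sketched as a pair of regular expressions. This is an illustration in Python, not MovableType's actual filter syntax; the names `PRAISE_RE`, `BLACKLIST_RE`, and `is_junk` are hypothetical, and the word lists contain only the terms quoted in the post.

```python
import re

# Words quoted in the post; the real lists were longer.
PRAISE_WORDS = ["perfect", "nice", "good", "great", "cool", "awesome"]
BLACKLIST = ["slot machine", "credit card", "ambien", "cialis", "hardcore"]

# A praise word followed by "site", e.g. "awesome site!!" or "Great  Site".
PRAISE_RE = re.compile(
    r"\b(?:%s)\s+site\b" % "|".join(PRAISE_WORDS), re.IGNORECASE
)

# Whole-word matches only, so "specialist" does not trip the "cialis" rule.
BLACKLIST_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(term) for term in BLACKLIST),
    re.IGNORECASE,
)


def is_junk(comment: str) -> bool:
    """Return True if the comment matches either hand-written rule."""
    return bool(PRAISE_RE.search(comment) or BLACKLIST_RE.search(comment))


print(is_junk("awesome site!!"))             # praise rule
print(is_junk("Cheap Cialis, no prescription"))  # blacklist rule
print(is_junk("The site is great, thanks"))  # praise word not followed by "site"
```

Rules like these are cheap to write, but each one encodes a single observed pattern, so a small change in wording on the spammer's side slips past them.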

This was a step in the right direction, but junk comments kept trickling in even with my additions. Spam protection is an ongoing battle: once you’ve made some development to counter spammers’ tactics, spammers change them, and the race starts anew. I doubt that any spammer will read my blog and stop leaving “Great site!” comments, but certainly someday the content will change. Plus it’s a waste of my time to have to read through junk comments, trying to identify the newest common characteristics. This is ultimately why spam filters and blacklists aren’t the ideal approach. Paul Graham, a well-known programmer, explains the reasons in his seminal essay on spam protection and points to the new direction, which is spam protection based on Bayesian filtering. This means that automated systems have adaptive algorithms that continually re-evaluate what message characteristics correlate with being spam. Once a lot of people flag messages with the phrase “buy cialis” and many outgoing links, the system will learn that messages like these are likely to be spam. I signed up with Akismet, a free service that evaluates all my comments using this system and marks them as legit or spam. Since then, I haven’t received a single junk comment. As far as I can tell, no real comments have been accidentally marked as junk either.
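To make the idea of learning from flagged messages concrete, here is a minimal sketch of a textbook naive Bayes classifier with add-one smoothing. It is not Akismet's implementation, and it does not reproduce the exact scoring formula from Paul Graham's essay; the class name, the tokenizer, and the four training comments are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical toy classifier: learns word counts per label."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one flagged comment; label is 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) from the counts seen so far."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed log prior, so an untrained label never yields log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700), -700)
        return 1 / (1 + math.exp(diff))


# Made-up training data standing in for comments that readers have flagged.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cialis cheap", "spam")
spam_filter.train("cheap slot machine credit card", "spam")
spam_filter.train("I enjoyed your photos from Chile", "ham")
spam_filter.train("thanks for the post about design", "ham")

p_spam = spam_filter.spam_probability("buy cheap cialis")
p_ham = spam_filter.spam_probability("enjoyed the photos")
print(p_spam > 0.5, p_ham < 0.5)
```

Unlike a fixed blacklist, nothing here names a specific word in advance: every newly flagged comment shifts the counts, so the scores follow whatever vocabulary the spammers move to.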

Do your worst, spammers. I’m ready. And to those real human beings who want to leave a comment like, “Totally awesome site!!! By the way, I have cheap phentermine that I’m unloading at bargain-basement prices….,” my apologies. Chances are your message will end up in my circular file cabinet.

Totally radical sight, man! Want to buy some vie-agra? Seriously, I know the problem. I’ve had an unthinkable number of spam comments at LancesWorld, as well, to the point where I’ve turned off auto-posting of comments and approve them manually. I’m not quite at the thousands of spam comments per week mark, so the manual method is still manageable.

Keep up the good work!

By the way, a coworker of mine fell victim to a scam similar to the one Michael mentioned. He received an e-mail from “Bank of America” saying that his account had to be verified. Interestingly, my coworker said to himself, “I didn’t know they had this e-mail address,” and then proceeded to fill in all of the info anyway. Within 5 minutes nearly $10,000 had been charged to his account. Thankfully he caught it after I pointed out the complete idiocy of his ways. Some people are so gullible.

Note from Ryan: This comment was flagged as spam by Akismet. I dug it out of the trash after Lance wrote me:

Looks like your spam filter is pretty sophisticated… I tried to trick it with some misspellings and it tells me that my comment is “being held for review.”

About

Ryan does stuff and writes about it here. He studied philosophy at Notre Dame and worked afterwards with Holy Cross Associates in Chile. Some of his interests are technology, design, and photography. He was a student at UC Berkeley’s School of Information and currently works as an engineer at Slack.

Colophon

This site's valid markup was created using TextMate and the invaluable CSSEdit. Pages get served up after MovableType fills in the blanks. Various PHP scripts hold the ship together.