This is a brief snippet of my Mail/Junk folder. All of this was carefully hidden from me using the power of spamassassin combined with bogofilter.

Spamassassin is the king of pattern-matching spam filtering software. Bogofilter is one of many implementations of Paul Graham's Bayesian spam filtering, detailed in "A Plan for Spam". I chose bogofilter because someone has already conveniently packaged it and put it in Debian testing.
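To give a rough idea of what Graham-style Bayesian filtering does, here is a minimal Python sketch. It is not bogofilter's actual code: the function names, the 0.4 score for unseen tokens, the clamping to 0.01 and 0.99, and the "15 most interesting tokens" cutoff are assumptions loosely following "A Plan for Spam", and the sketch assumes both training corpora are non-empty.

```python
from math import prod


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham, unknown=0.4):
    """Estimate P(spam | token), clamped to [0.01, 0.99].

    spam_counts / ham_counts map tokens to how often they were seen in each
    corpus; n_spam / n_ham are the number of messages in each corpus (> 0).
    """
    b = spam_counts.get(token, 0)
    g = 2 * ham_counts.get(token, 0)  # ham counted double to bias against false positives
    if b + g == 0:
        return unknown  # never-seen token: mildly innocent
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    return max(0.01, min(0.99, spam_freq / (spam_freq + ham_freq)))


def spamicity(tokens, spam_counts, ham_counts, n_spam, n_ham, top=15):
    """Combine per-token probabilities into one score between 0 and 1."""
    probs = [
        token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokens)
    ]
    # Keep only the most "interesting" tokens, i.e. those farthest from 0.5.
    probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top]
    spam_like = prod(probs)
    ham_like = prod(1 - p for p in probs)
    return spam_like / (spam_like + ham_like)
```

The important property is that the score depends entirely on the counts you have trained it with, which is why the size of the training corpus matters so much.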

The reason I installed both filters is simple. Spamassassin alone was not blocking enough of my incoming spam, but I did not have a large corpus of spam (more than a thousand messages) to train bogofilter on. I believe the two together will give better results than either by itself.

In the snippet of my Mail/Junk folder above, messages beginning *****SPAM***** were tagged by spamassassin. The remaining messages were caught by bogofilter. To understand the difference between the two, let's examine the first message, 43, and the last message, 51, in more detail.

The first message is generic unsolicited bulk email. It's from someone I don't know about a service I don't care about. This is classic spam. Spamassassin is really very good about catching this sort of thing. It's from a known abusive relay and has red HTML. I'll probably never hear from this spammer again.

The last message is from a spammer who has been sending me email for months. I don't want the damn "Game industry news". Somewhere along the line they got my email address. I can't even remember if I gave it to them or not. All I know is I don't want to look at these messages anymore. To spamassassin, though, these messages look like solicited newsletters and dlists, so it lets them through. Fortunately, I have accumulated enough of these particular messages to train bogofilter to spot them. As a result, bogofilter catches all of them, even though spamassassin thinks they're legit.

Conversely, when I receive email from a new spammer about a new service or product, bogofilter simply won't know what to do with them. If I had a really large training set, maybe I could get bogofilter to spot them based on the characteristic language. I don't, at the moment. Hence, the two-pronged approach.

Currently, I let spamassassin process all my email, then bogofilter. I then decide whether to filter it based on the bogofilter score. I am experimenting with the spamassassin score. I have been using it as well to filter email, but I am going to start relying completely on the bogofilter score. Theoretically, bogofilter should start to recognize spamassassin's output as a clear indication of spam and send it on its merry way to Mail/Junk.
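The decision step described above can be sketched in Python. This is an illustration, not my actual procmail setup. It assumes both tools have already annotated the message, that bogofilter's verdict is in an X-Bogosity header starting with "Spam" (or "Yes" in older versions), and that spamassassin sets "X-Spam-Flag: YES" on messages it tags. The function name, the "inbox" folder name, and the trust_spamassassin switch are made up for the example.

```python
from email import message_from_string


def choose_folder(raw_message, trust_spamassassin=False,
                  junk="Mail/Junk", inbox="inbox"):
    """Pick a destination folder from headers the two filters added.

    Header names and values are assumptions about how the filters were
    configured; adjust them to match what your installation writes.
    """
    msg = message_from_string(raw_message)

    # bogofilter's verdict is the primary signal.
    bogosity = (msg.get("X-Bogosity") or "").strip().lower()
    if bogosity.startswith(("spam", "yes")):
        return junk

    # Optionally also honor spamassassin's own flag.
    if trust_spamassassin:
        flag = (msg.get("X-Spam-Flag") or "").strip().upper()
        if flag == "YES":
            return junk

    return inbox
```

With trust_spamassassin left off, a message that spamassassin tagged but bogofilter considers legitimate ends up in the inbox, which is the "overrule" behavior I am hoping for.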

I haven't had a test case yet, but I am hoping this approach will result in fewer false positives. That is, bogofilter may one day overrule spamassassin if it determines that it really is a real email. If I do get a false positive or a false positive save, I'll make that known.

I am using procmail to do my mail filtering. Eventually, I hope to eliminate the middleman and use bogofilter and spamassassin in my MTA (exim) at SMTP time, so I never even accept spam and my email address gets taken off lists. I'm waiting on a Debian exim4 package for this. In the interim, I use the following procmail recipe, gleaned from spamassassin and bogofilter documentation:

Currently I still receive some spam, perhaps a couple per month. For instance, I received a perfectly legit and well-formed email from a website claiming to be a monthly update and containing information supposedly specific to me. That was a lie. It also contained language about making money by becoming an associate. This is the type of spam that a well-trained bogofilter ought to catch. Unfortunately, I'm still building up my database, and it slipped through.

Spamassassin is a Perl module and can be run on any system that supports Perl. Even better, the spamassassin website lists products that have integrated spamassassin support. The list includes a POP3 filter, IMAP filter, Outlook filter, and Eudora filter.

***

Update: I use Comcast's built-in spam filter as my main source of filtering. Bayesian filtering obviously caught on in a big way since this article was first posted. I use Evolution's built-in Junk Filter, but I get most of the junk mail on my Hiptop anyway so it's not terribly useful.

I have turned over my spam-fighting duties to GMail, and I am horribly unimpressed. I am marking the same damn "small cap stock" spams over and over, every day since the beginning of the year. It's pretty good at filtering the pr0n spam, though.

I've found the latest version of SpamAssassin to be quite robust in catching all sorts of crap. In conjunction with Greylisting, I'm only receiving a spam every other day. Invariably this spam is sent to my alumni email address at CMU and forwarded to me. I regret registering for that stupid address since, to this very day, I have yet to receive one legitimate piece of email from that forwarding address.

Back to greylisting: the delay can be annoying. For instance, I was writing drafts of my "statement of purpose" and people would send drafts back to me. Under pressure, the greylisting delay could have been problematic (since whitelists need to be renewed every week in my lameass greylisting implementation), except I just had people send the mail to my gmail account instead. So the only time-critical emails I actually miss are unanticipated time-critical emails, and there aren't any.
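For anyone wondering where that delay comes from, here is a minimal Python sketch of the general greylisting idea. It is not the implementation I run. It assumes mail is keyed on the (client IP, sender, recipient) triplet, that a first attempt is deferred and a retry is accepted after a minimum delay (300 seconds here, an arbitrary choice), and that an accepted triplet stays whitelisted for one week before it has to earn its way back in. The class and method names are hypothetical.

```python
class Greylist:
    """Toy greylist: defer unknown triplets, accept them on a later retry."""

    def __init__(self, min_delay=300, whitelist_ttl=7 * 24 * 3600):
        self.min_delay = min_delay          # seconds a sender must wait before retrying
        self.whitelist_ttl = whitelist_ttl  # seconds an accepted triplet stays trusted
        self.first_seen = {}                # triplet -> time of first deferred attempt
        self.whitelisted_until = {}         # triplet -> expiry time of whitelist entry

    def check(self, client_ip, sender, recipient, now):
        """Return "accept" or "defer" for a delivery attempt at time `now`."""
        key = (client_ip, sender.lower(), recipient.lower())

        expiry = self.whitelisted_until.get(key)
        if expiry is not None:
            if now < expiry:
                return "accept"
            del self.whitelisted_until[key]  # expired: the triplet starts over

        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            # A real mail server retried after the delay: trust it for a while.
            del self.first_seen[key]
            self.whitelisted_until[key] = now + self.whitelist_ttl
            return "accept"
        return "defer"  # the MTA would answer with a temporary failure
```

A sender whose whitelist entry has just expired gets deferred again on the next message, which is exactly the weekly annoyance described above.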

I wish I had numbers to indicate the marginal rate of spam interception from implementing Greylisting on a machine that is already running SpamAssassin, but since I implemented Greylisting first, and then SpamAssassin, I couldn't tell you. Looking at it the other way around, I can say that even with Greylisting, SpamAssassin is still a must because too much crap slips through.

With the two software packages working in conjunction, the spam problem has been reduced to near 1998 levels, which I find entirely acceptable.

I am still receiving small cap stock spam in my inbox at a ratio of approximately 4 spams per legitimate email, at which point the ol' JonathanFilter starts to get a bit inaccurate. I may have to ditch GMail for a better spam blocker.