SpamAssassin Rarely Misses

Posted by J.D. Falk, January 28, 2010

by J.D. Falk, Director of Product Strategy, Receiver Services

SpamAssassin is, by any measure, the most popular open source spam filtering software. It has won numerous awards, and has been incorporated into many commercial filtering appliances. On Tuesday, the SpamAssassin developers announced version 3.3.0, their first major update since 2007.

SpamAssassin was born in 2001, when Justin Mason (who is still involved in the project) rewrote & updated an earlier open-source filtering script. At present it primarily consists of a set of message tests of varying complexity, each analyzing portions of the headers or body and adding to or subtracting from the resulting spam score.

In general, any message with a SpamAssassin score of 5 or more is considered to be spam. It’s very rare for a single test to contribute enough to the score to mark a message as definitively spam or definitively not spam; instead, the effects of multiple tests are cumulative.
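To make the cumulative scoring concrete, here is a minimal Python sketch. It is not SpamAssassin's code: the rule names, the scores, and the matching logic are all made up for illustration, and only the threshold of 5 comes from the description above.

```python
from typing import Callable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str                       # hypothetical rule name
    score: float                    # positive = more spam-like, negative = less
    matches: Callable[[str], bool]  # returns True when the rule fires


# Hypothetical rules and scores; real SpamAssassin tests are far more involved.
RULES: List[Rule] = [
    Rule("EXAMPLE_FREE_OFFER", 2.5, lambda m: "free offer" in m.lower()),
    Rule("EXAMPLE_ACT_NOW", 3.0, lambda m: "act now" in m.lower()),
    Rule("EXAMPLE_TRUSTED_SENDER", -2.0,
         lambda m: "from: alice@example.com" in m.lower()),
]

SPAM_THRESHOLD = 5.0  # the usual cutoff mentioned above


def score_message(message: str) -> Tuple[float, List[str]]:
    """Run every rule and add up the scores of the ones that fire."""
    hits = [r for r in RULES if r.matches(message)]
    return sum(r.score for r in hits), [r.name for r in hits]


def is_spam(message: str) -> bool:
    """A message is spam only when the combined score reaches the threshold."""
    total, _ = score_message(message)
    return total >= SPAM_THRESHOLD
```

Neither of the two positive rules reaches 5 on its own, but together they do; the negative rule can pull the same message back under the threshold.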

A few examples of tests in this version:

MPART_ALT_DIFF detects when the HTML and plain text parts of a message are substantially different.

DRUG_ED_CAPS catches messages which shout out the names of popular erectile dysfunction drugs.

FSL_GEO_ABUSE looks for any geocities.com URL in the message; the site was finally closed after many years of being a favorite with spammers, so now any link to it is invalid.

FH_DATE_PAST_20XX is intended to detect when a message’s Date: header is too far into the future; spammers do that to make sure their messages show up at the top of your inbox, assuming they’ll be reverse-sorted by date. The test wasn’t updated in time for 2010, so for a few days it also matched legitimate mail dated in the current year; this caused some concern but was fixed quickly.
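The 2010 hiccup shows why hard-coding "the future" is fragile. The Python sketch below is not the actual rule; it contrasts a made-up hard-coded year pattern with a check made relative to the current time. The pattern, the function names, and the two-day tolerance are all assumptions.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Illustrative pattern (an assumption, not the shipped rule): it treats any
# year from 2010 to 2099 as "the future", which stops being true in 2010.
HARDCODED_FUTURE_YEAR = re.compile(r"\b20[1-9][0-9]\b")


def future_by_pattern(date_header: str) -> bool:
    """True when the header contains a year the pattern considers future."""
    return HARDCODED_FUTURE_YEAR.search(date_header) is not None


def future_by_clock(date_header: str, now: datetime,
                    tolerance: timedelta = timedelta(days=2)) -> bool:
    """True when the Date: header is more than `tolerance` ahead of `now`.

    `now` must be timezone-aware.
    """
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable dates are a job for some other test
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # assume UTC if no zone given
    return sent - now > tolerance
```

A pattern-based check needs someone to remember to edit it as the years pass; the clock-based check does not, at the cost of depending on the filtering machine's own clock being right.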

The software is most commonly invoked by a process sitting between the Mail Transfer Agent (MTA), which receives messages from other servers on the Internet, and the Mail Delivery Agent (MDA), which places those messages into the appropriate mailbox file. Depending on how the system is configured, the message may be tagged by adding ***** SPAM ***** to the Subject: line, or by adding X-Spam-* headers (such as X-Spam-Status) with details about which tests contributed to the score. Downstream processes in the MDA or the email client can use this information to place the message in an appropriate folder, or delete it outright.
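As a rough illustration of that downstream step, here is a Python sketch that reads the tags and picks a folder. It assumes the common X-Spam-Status layout ("Yes, score=... required=... tests=..."); the exact headers and their contents depend on local configuration, and the folder names are invented.

```python
import re
from email import message_from_string
from typing import List, NamedTuple, Optional


class SpamStatus(NamedTuple):
    is_spam: bool
    score: float
    tests: List[str]


# Assumed layout: "Yes, score=7.2 required=5.0 tests=A,B autolearn=no ..."
_VERDICT = re.compile(r"^\s*(yes|no)\s*,\s*score=(-?\d+(?:\.\d+)?)", re.I)


def parse_spam_status(value: str) -> Optional[SpamStatus]:
    """Parse an X-Spam-Status value; return None if it doesn't fit the layout."""
    verdict = _VERDICT.match(value)
    if verdict is None:
        return None
    # Long test lists may be folded across lines; rejoin them after commas.
    unfolded = re.sub(r",\s+", ",", value)
    tests = re.search(r"tests=(\S*)", unfolded)
    names = [t for t in tests.group(1).split(",") if t] if tests else []
    return SpamStatus(verdict.group(1).lower() == "yes",
                      float(verdict.group(2)), names)


def choose_folder(raw_message: str) -> str:
    """Return a (hypothetical) folder name based on the spam tags."""
    msg = message_from_string(raw_message)
    status = parse_spam_status(msg.get("X-Spam-Status", ""))
    if status is not None and status.is_spam:
        return "Junk"
    return "INBOX"
```

Messages with no recognizable tag fall through to the inbox, which is the safer default when the filter did not run or the header was altered.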

Alternatively, some systems feed the message to SpamAssassin directly from the MTA during the initial SMTP transaction, which allows them to reject it with a 550 SMTP reply when the spam score is sufficiently high — usually 10 or more.
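That decision can be sketched as a small Python function. This is an illustration under stated assumptions, not any particular MTA's integration: the function name and reply text are invented, and the thresholds of 5 and 10 are simply the typical values mentioned in this article.

```python
from typing import Tuple

TAG_THRESHOLD = 5.0      # typical "this is spam" score
REJECT_THRESHOLD = 10.0  # typical "refuse during SMTP" score


def smtp_decision(score: float) -> Tuple[str, str]:
    """Map a spam score to (action, SMTP reply line)."""
    if score >= REJECT_THRESHOLD:
        # 550 is a permanent failure, so the sending server should not retry.
        return ("reject", "550 Message rejected as spam")
    if score >= TAG_THRESHOLD:
        # Accepted, but tagged so later stages can file it as spam.
        return ("tag", "250 OK")
    return ("deliver", "250 OK")
```

Keeping the rejection threshold well above the tagging threshold means a borderline message is still accepted and filed as spam, where its recipient can find it, rather than being refused outright.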

Mail system administrators can automatically download updated tests from the project, and have the ability to override any of those default settings. This allows the SpamAssassin developers to stay current in the face of ever-changing spamming techniques, and to remove or reduce the score of any tests which are inappropriately catching non-spam email. Administrators may choose to disable these automatic updates, but it’s unclear why they’d want to.

There are also a few new network tests which we’re particularly pleased with:

RCVD_IN_RP_SAFE detects messages sent from IP addresses on our Safe whitelist, and reduces the spam score by 2.

RCVD_IN_RP_CERTIFIED detects messages sent from IP addresses on our Certified whitelist, and reduces the spam score by 3. Every IP on Certified is also on Safe, so both tests fire and the score is actually reduced by 5 in total.

RCVD_IN_RP_RNBL detects messages sent from IP addresses on our Reputation Network Blacklist. It only affects the score by 1.2-1.3 points at present, because messages sent by those IPs tend to also trigger lots of other tests.
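The arithmetic is easy to check. The Python sketch below uses the scores quoted above (taking 1.2 for the blacklist test, the low end of the stated range); the function and its boolean flags are hypothetical stand-ins for the real network lookups.

```python
RP_SCORES = {
    "RCVD_IN_RP_SAFE": -2.0,
    "RCVD_IN_RP_CERTIFIED": -3.0,
    "RCVD_IN_RP_RNBL": 1.2,  # stated as 1.2-1.3; low end used here
}


def network_adjustment(on_certified=False, on_safe=False, on_rnbl=False):
    """Return (score change, names of the tests that fired)."""
    # Every Certified IP is also on Safe, so a Certified hit implies both.
    on_safe = on_safe or on_certified
    hits = []
    if on_safe:
        hits.append("RCVD_IN_RP_SAFE")
    if on_certified:
        hits.append("RCVD_IN_RP_CERTIFIED")
    if on_rnbl:
        hits.append("RCVD_IN_RP_RNBL")
    return sum(RP_SCORES[name] for name in hits), hits
```

With these numbers, a message from a Certified IP that would otherwise score 6 drops to 1, well under the usual threshold of 5.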

RCVD_IN_RP_SAFE and RCVD_IN_RP_CERTIFIED replace old tests left over from the Bonded Sender and Habeas days. That was important because some members of the SpamAssassin community still believed that senders had to pay a bond to be on the Bonded Sender list, or that an X-Habeas: haiku header denoted approval by Habeas, neither of which has been true for many years.

We didn’t pay the Apache Software Foundation (which hosts & sponsors the SpamAssassin project) for these scores, or try to “sell” the developers on using them. We did talk about the products with them for quite a while: what the listing criteria are, our plans for the future, et cetera. Some of the developers & community members were friendly, others…not so much. In the end, it was SpamAssassin’s own testing process which convinced them to include these tests with these scores. The data spoke for itself, and they saw the value in it.

This is standard procedure for the SpamAssassin development team, with its deep roots in the open source community. Because the project is open, anyone can participate in the discussions, which is both a blessing and a curse. As in any debate about spam, conversations within the community occasionally get heated, and a few members are nearly ridiculous in their intractability. Yet when it comes to the product itself, the developers trust the data produced by their nightly testing framework. If the data shows a test is accurate and effective, they’ll include it. If not, they won’t, or it’ll be given a low score.

I could conclude this article by saying that we look forward to continuing our relationship with the SpamAssassin community, and that’s certainly true — but it’s not the whole story. I use SpamAssassin to protect my personal email, as do many others among the technical staff here at Return Path. We also use SpamAssassin to protect some of our corporate email systems; it’s that good. You’ll hear similar stories across the industry, and beyond. It is one of the few software packages to truly deserve to be called “ubiquitous.”