Training Spam Filters

10 Apr 2004

One of the problems with switching to a new spam filter is, quite simply, training it. Most spam filters these days are Bayesian, which means they run a statistical analysis of the words in your incoming email to decide whether a message is spam (junk email) or ham (real email). The downside to these filters is that you have to teach them what is spam and what is not.
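To make the "statistical analysis on the words" idea concrete, here is a minimal naive Bayes sketch in Python. It is only an illustration of the general technique, not how bmf or dspam actually score mail; the class name `TinyBayes`, the tokenizer, and the add-one smoothing are all my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TinyBayes:
    """Toy naive Bayes classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one known message under 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return an estimate between 0 and 1 that the text is spam."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_messages = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: how often this class was seen, with add-one smoothing.
            score = math.log((self.messages[label] + 1) / (total_messages + 2))
            total_words = sum(self.counts[label].values())
            for word in tokenize(text):
                # Add-one smoothing so an unseen word never gives log(0).
                seen = self.counts[label][word] + 1
                score += math.log(seen / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into one probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


classifier = TinyBayes()
classifier.train("cheap pills buy now", "spam")
classifier.train("meeting notes for the firewall list", "ham")
print(classifier.spam_probability("buy cheap pills"))
```

With only two training messages the numbers mean very little, which is exactly the point: the filter is only as good as the mail you have fed it.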

Earlier this week, after receiving an avalanche of spam and seeing that bmf hadn’t been updated in quite some time, I decided to switch from bmf to dspam. After getting dspam installed, I had to feed it a “corpus,” i.e. messages that were essentially known quantities. Being the email packrat I am, I had plenty of emails, both good and bad, to feed it. The problem is that they are all in maildir format, and the dspam_corpus program expects mbox format. So it meant making copies of several maildir folders, putting them through maildir2mbox, putting the result through dspam_corpus with the right options so it knows whether I’m feeding it ham or spam, and sitting back and waiting.
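For anyone without maildir2mbox handy, the conversion step can be sketched with Python's standard `mailbox` module. This is a stand-in for maildir2mbox, not a description of how that tool works; the function name and the paths in the usage note are hypothetical, and it assumes nothing else is writing to the mbox file while it runs.

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message in a maildir folder into an mbox file.

    Returns the number of messages copied. The source folder is left as-is.
    """
    # create=False: fail loudly instead of creating an empty maildir by mistake.
    source = mailbox.Maildir(maildir_path, create=False)
    dest = mailbox.mbox(mbox_path)  # created if it does not exist yet
    copied = 0
    try:
        for message in source:
            dest.add(message)  # appended in mbox format
            copied += 1
    finally:
        dest.close()  # flushes pending writes to disk
        source.close()
    return copied
```

Calling something like `maildir_to_mbox("Maildir/.Spam", "spam.mbox")` (paths made up for the example) would produce a file to hand to dspam_corpus, with whatever options your version needs to tell it the messages are spam or ham.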

Another “tool” I have for identifying spam is automatically redirecting certain email accounts into a “spam honeypot,” so to speak, i.e. marking any incoming email to those addresses as spam because I either have never legitimately used the address for any reason or no longer have a legitimate use for it. I’ve actually got several of these addresses, given that my FireWall-1 Gurus Mailing List used to operate under a different name and that I have quite a number of accounts in various locations. Most of them are of the ‘[email protected]’ variety. Guess what, even if I’ve never used the address anywhere, it’s a common enough “username” that many spamhouses will simply “randomly generate” possible usernames at various domains. Sometimes they’re right, sometimes they’re not. It doesn’t cost them anything to try, so why not?
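The honeypot check itself is simple enough to sketch. The version below is only an illustration of the idea, not my actual mail setup: the addresses in `HONEYPOT_ADDRESSES` are made-up placeholders at example.com, and the choice of headers to inspect is an assumption that depends on how your mail server delivers messages.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical honeypot addresses; substitute ones you never legitimately use.
HONEYPOT_ADDRESSES = {"webmaster@example.com", "old-list@example.com"}

# Headers that may carry a recipient address (an assumption; varies by server).
RECIPIENT_HEADERS = ("To", "Cc", "Delivered-To", "X-Original-To")


def hits_honeypot(raw_message, honeypots=HONEYPOT_ADDRESSES):
    """Return True if any recipient of the raw message is a honeypot address."""
    msg = message_from_string(raw_message)
    header_values = []
    for name in RECIPIENT_HEADERS:
        header_values.extend(msg.get_all(name, []))
    # getaddresses yields (display name, address) pairs; compare case-insensitively.
    recipients = {addr.lower() for _, addr in getaddresses(header_values) if addr}
    return bool(recipients & honeypots)
```

Anything for which this returns True can go straight into the spam pile and be used as training material, since no legitimate sender should know those addresses.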

With proper feeding of the spam filter, I hope to get my spam problem under control. Today alone, I probably had 300 messages that were not correctly classified as spam. That’s way more than I usually get. Hopefully this is just a side effect of having a new, not yet fully trained spam filter, and not yet another escalation in the spam war. Quite frankly, I don’t know how much longer I can handle the escalation of the spam war…