Florian octo Forster's Homepage

octo's bayesian mail filter - obmf

I know there are a lot of spam filters out
there, and most of the Bayesian ones will work
better than this one. Also, there are lots of
people out there who get a lot more spam
than I do. But my interest in pattern matching
and recognition was reason enough for me to
try to code one myself.

Features

Pure Perl
obmf is written in Perl for rapid development, the
ability to run (almost?) everywhere Perl does, great
string handling and personal preference. If you don't
like Perl because you have seen a lot of bad examples,
let me assure you that I have taken care to keep the code
readable, well documented and easy to understand.
The only Perl module obmf depends on is
"DBI" for database connectivity, which is part
of almost every (desktop) OS. So the real feature is
probably how easily obmf can be customized. Anyone with some
basic knowledge of Perl should be able to make it do
whatever (s)he wants it to do.
Though I have not tested this myself, I'm pretty sure obmf
needs Perl 5.6 or later.

Text-only
obmf ignores non-text parts of the mail, understands
multipart messages and saves each mail's message-id so a
mail is not examined twice.
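
To illustrate the idea (this is not obmf's code, which is Perl), here is a
rough sketch in Python using only the standard `email` module. The names
`text_parts`, `examine_once` and `seen_ids` are made up for this example,
and the in-memory set stands in for whatever storage obmf really uses for
message-ids.

```python
from email import message_from_string


def text_parts(raw_mail):
    """Return the decoded text/* bodies of a mail, skipping all other parts."""
    msg = message_from_string(raw_mail)
    bodies = []
    for part in msg.walk():
        # Multipart containers only hold other parts; walk() visits those too.
        if part.is_multipart():
            continue
        # Ignore images, attachments and anything else that is not text.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"
        try:
            bodies.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name: fall back to a lossy UTF-8 decode.
            bodies.append(payload.decode("utf-8", errors="replace"))
    return bodies


def examine_once(raw_mail, seen_ids):
    """Return the text parts of a mail, or None if its message-id was seen before.

    seen_ids is a plain set here (an assumption made for this sketch).
    """
    message_id = message_from_string(raw_mail).get("Message-ID")
    if message_id is not None:
        if message_id in seen_ids:
            return None
        seen_ids.add(message_id)
    return text_parts(raw_mail)
```

Calling `examine_once` twice with the same mail and the same set returns
the text parts the first time and `None` the second time.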

Easy interface
Sample configurations for mutt and procmail are
included. Anyone using another system is welcome to send
the config for his/her favorite mail program.

Download

Just download one of the following files,
extract it and read the
readme.

Useful links

Interesting papers

"A
Plan For Spam" by Paul Graham. This paper
is, AFAIK, the start of it all. Paul describes how
he is trying to fight spam using a statistical
approach.

This
nameless paper on spam detection by Gary
Robinson describes how the algorithms can be
modified to give better results. It contains lots of
links to third-party pages which describe several
detailed aspects of the mathematical formulae
used.

Better
Bayesian Filtering is the second essay about
this topic by Paul Graham. This approach is far
more complex than the first method (see link
above) but might just work.
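
For readers who want a feel for the statistics before following the links,
here is a rough Python sketch of the core formulas as I understand them
from the papers above. It is not obmf's code. The function names are made
up for this example; the constants (doubling the good counts, clamping to
0.01 and 0.99, 0.4 for unknown tokens, the 15 most interesting tokens) are
the ones from "A Plan For Spam", and `s=1.0`, `x=0.5` in the Robinson-style
smoothing are just commonly used starting values.

```python
from math import prod


def token_probability(bad, good, nbad, ngood):
    """Graham-style probability that a mail containing this token is spam.

    bad, good:   occurrences of the token in the spam and non-spam corpora
    nbad, ngood: number of mails in the spam and non-spam corpora
    """
    g = min(1.0, 2 * good / ngood)  # good counts are doubled to avoid false positives
    b = min(1.0, bad / nbad)
    if g + b == 0:
        return 0.4                   # token never seen before
    return max(0.01, min(0.99, b / (g + b)))


def robinson_smooth(p, n, s=1.0, x=0.5):
    """Robinson-style smoothing: pull p towards the prior x when n (the
    number of mails the token was seen in) is small."""
    return (s * x + n * p) / (s + n)


def most_interesting(probs, limit=15):
    """Keep the probabilities that lie farthest from the neutral 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def combine(probs):
    """Combine per-token probabilities into one spam probability for the mail."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)
```

A mail would then be scored with something like
`combine(most_interesting(probs))` and treated as spam above a threshold
of your choice (Graham uses 0.9).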

Other, similar programs

The
Controllable Regex Mutilator
"crm114" is a very interesting approach
using some sort of mutating regular expressions.
It can be used for other things as well, for
example filtering firewall logs.

The digramic
Bayesian classifier "dbacl" can
classify mail (and other texts) into more than one
category. Therefore it can be used as a
semi-intelligent procmail alternative.

SpamProbe
is a very complete implementation of Paul Graham's
idea (see above) with a rather long feature list.