Bayesian Filtering with bogofilter and Sylpheed Claws

In August 2002, Paul Graham published a paper suggesting that Bayes'
probability theorem (see Resources) could be applied to filtering
the spam emails we receive. The gist of Graham's paper is that each word
you receive in your emails -- including those that make up the email
header -- carries a spam probability between 0 and 1. This number is
calculated by studying a large number of emails that are known to be spam
alongside another set of emails that are known to be legitimate. If a
particular word appears only in spam emails, there is a high probability
that the next email message containing that word will be spam.
Similarly, if a word, such as your secret nickname that only a few
people know or the From: address of a coworker, tends to appear only in
good emails, a message containing that word is much more likely to be
legitimate. Of course, we should score the words in a message and combine
the scores into a single "spam probability value" for the whole message,
so that an email from a friend trying to let you know about "a great
business opportunity !!" does not go into your trash bin and a spam email
about "how to copy your DVDs" doesn't go into your good email folder just
because it addressed you by your first name.
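
To make the scoring step concrete: rather than taking a plain average,
Graham's paper combines the per-word probabilities p1 ... pn as
(p1 * ... * pn) / (p1 * ... * pn + (1 - p1) * ... * (1 - pn)). The shell
sketch below implements only that combining step. The function name
combine_spam_probs and the input values are made up for illustration, and
bogofilter's own calculation may differ in detail. Graham also keeps each
probability between 0.01 and 0.99 and uses only the most telling words of
a message, both of which this sketch leaves out.

```shell
# combine_spam_probs P1 P2 ... : print the combined spam probability
# of a message from the per-word probabilities given as arguments.
# (Hypothetical helper; not part of bogofilter.)
combine_spam_probs() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | LC_ALL=C awk '
        BEGIN { spam = 1; good = 1 }
        { spam *= $1; good *= 1 - $1 }   # running products over all words
        END { printf "%.4f\n", spam / (spam + good) }'
}

# Two strongly spammy words outweigh one friendly word:
combine_spam_probs 0.99 0.99 0.20
```

With these invented inputs the call prints 0.9996, so a single friendly
word (a first name, say) does not rescue a message whose other words are
strongly associated with spam.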

What makes Bayesian filtering special is that false positives --
legitimate emails marked as spam -- are very rare. As Graham points out,
spammers can fool every system we put in place, but they still have to
deliver their commercial message. This message is exactly what causes them
to shoot themselves in the foot. It is trivial to recognize spam email if
you take a quick look at the subject and the message body. This action can
be emulated very successfully by a Bayesian filter that learns on your
behalf and applies what it has learned to your future emails. If you
notice the filter making a mistake, you can teach it not to repeat
it. After a very short while, the filter will be almost bulletproof.

Shortly after Graham's article, a number of people implemented spam
filters that use Bayesian algorithms. For this article we will look at
bogofilter, written by Eric S. Raymond. We have chosen bogofilter
for its speed, which comes from its being written in C and
using BerkeleyDB, rather than a plain text file, as its storage
facility. As long as we're picking software based on speed, it only makes
sense to pick Sylpheed (of the "claws" variety) as the email client
with which to demonstrate bogofilter. (See my previous article about
Sylpheed and Sylpheed claws.)

Installing bogofilter

It's fairly simple to configure and install bogofilter. You can either
download the latest source package or find a package for your operating
system. The latest version at the time of writing, 0.9.0.5, is available
as an RPM or FreeBSD package. The Gentoo
distribution also has an ebuild for it in its Portage package collection.

If you are installing from the source package, download it into a
temporary directory, decompress it and, in the uncompressed source
directory, run ./configure && make followed by make install as root.
Incidentally, these are the generic
instructions for configuring, compiling and installing a source package on
Unix and Linux systems. If something goes wrong, I suggest asking for
assistance from somebody with adequate experience. Usually everything
goes as planned, and the installation procedure creates the program
binaries and puts them in /usr/bin/. It also creates a
sample configuration file (with which you need not concern yourself) and
places it in the /etc directory.

By default, bogofilter keeps its data in two database files called
goodlist.db and spamlist.db. These files are
stored in a .bogofilter directory in the user's home
directory. You need not create the directory or the files explicitly,
since bogofilter creates them the first time you train it.

Training bogofilter

As mentioned above, bogofilter, like all other Bayesian filters, does
its magic based on the principles of probability. For this reason you need
an archive of spam and non-spam emails. The more emails you have gathered,
the more finely tuned your filter will be. I normally just ignore spam
emails instead of deleting them, so for me it wasn't very difficult to find
hundreds of spam emails in my incoming email directory in Sylpheed. We
will create two mail directories in Sylpheed and call one of them
SPAM and the other NONSPAM. If you clean out
your regular incoming email directory by removing each and every spam
message, you can do without a dedicated NONSPAM directory. If
you choose to do so, make sure you keep this incoming directory free of
spam in the future too.

Before starting to train bogofilter, make sure there are at least
100 emails in each folder. This should give it a reasonable quantity and
variety. If you don't have enough spam messages (if you delete them as you
receive them or if you don't receive any -- those were the days!), you can
download a batch of spam messages from a Bayesian spam filtering web
site. I recommend against doing this since every individual receives a
different variety of spam messages and what looks like spam to somebody
else might actually be something you receive as good mail regularly.
(Many people confuse spam with emails they once asked to receive but don't
want anymore.) You will find that the spam accumulated over a few days
will be enough to tune your filter. Better yet, keep training it as you
go along. The result is a highly customized personal filter that will
allow bogofilter to think and act just like you would.
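
A quick way to check the 100-message guideline is to count the files in
each folder. The sketch below assumes the folders are ~/Mail/SPAM and
~/Mail/NONSPAM and that each message is a separate file; count_messages is
a made-up helper name, not a Sylpheed or bogofilter command.

```shell
# count_messages DIR : print how many regular files DIR contains.
# (Hypothetical helper. Hidden files are skipped by the * pattern.)
count_messages() {
    n=0
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Folder names follow the article; adjust the paths to your setup.
for folder in "$HOME/Mail/SPAM" "$HOME/Mail/NONSPAM"; do
    if [ "$(count_messages "$folder")" -lt 100 ]; then
        echo "$folder has fewer than 100 messages"
    fi
done
```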

We will start by training bogofilter to recognize spam words. To
do this, we start a shell and go into the SPAM
directory. By default Sylpheed keeps its mail in a Mail
directory in the user's home directory, so the folder is ~/Mail/SPAM. It
holds all of the spam messages, each in its own file, and Sylpheed uses an
identifying number as each file's name.

We need to feed the whole text of each message, header and body, into the
bogofilter command and mark it as spam by using the
-s option. Since it makes no difference to the Bayesian algorithm whether
the messages arrive in one batch or one at a time, we can run the command
in one of two ways.

The following command feeds all spam messages into bogofilter at
once. The -v option increases the verbosity of the command
and prints out some useful information.

grog SPAM # cat * | bogofilter -s -v
# 93861 words, 3 messages

We can also invoke the bogofilter command once per message, so that
bogofilter processes each message individually.
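
The one-message-at-a-time approach can be written as a loop. What follows
is a sketch rather than a tested recipe: train_each and the BOGOFILTER
variable are names invented here, and only the -s, -n and -v options
mentioned in this article are used.

```shell
# Command to run; override BOGOFILTER if the binary lives elsewhere.
BOGOFILTER=${BOGOFILTER:-bogofilter}

# train_each FLAG DIR : run bogofilter once per message file in DIR.
# FLAG is -s for spam or -n for non-spam. (Hypothetical helper.)
train_each() {
    for msg in "$2"/*; do
        [ -f "$msg" ] || continue
        "$BOGOFILTER" "$1" -v < "$msg"
    done
}

# Usage, from any directory:
#   train_each -s "$HOME/Mail/SPAM"
```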

Whichever method you use, bogofilter will create the
.bogofilter directory as well as a spamlist.db
database file. Please do not edit this or the goodlist.db
file directly, as they are both in a binary format. Repeat the above steps
in the ~/Mail/NONSPAM directory to create the non-spam list
database. Since these are non-spam files, you will need to
replace the -s option with the -n
option, such that the command is now bogofilter -n
-v. If everything goes as planned, you will now have both the good
words list goodlist.db and the spam words list
spamlist.db. We're ready to filter out spam.
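
Before moving on, it may be worth confirming that both database files
exist. The check below is a small sketch; check_wordlists is a made-up
name, and the file names and default location are the ones given above.

```shell
# check_wordlists [DIR] : report whether both word-list databases exist.
# DIR defaults to ~/.bogofilter, the location named in the article.
# Returns 0 if both files are present, 1 otherwise.
check_wordlists() {
    dir=${1:-$HOME/.bogofilter}
    missing=0
    for db in goodlist.db spamlist.db; do
        if [ -f "$dir/$db" ]; then
            echo "found   $dir/$db"
        else
            echo "missing $dir/$db"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_wordlists || echo "train bogofilter with -s and -n first"
```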

Marrying bogofilter to Sylpheed

If you run bogofilter manually on a piece of text (i.e., an email
message), it exits with a status of 0 if the email is judged to be spam
and 1 if it is judged to be good. However, it would be inconvenient to run
this command manually for every email that we receive. Instead we will
configure Sylpheed to run the command on our behalf each time it receives
an email, before delivering the message to the appropriate
directory. Using Sylpheed-claws, this is done by selecting
Configuration from the menu and clicking on
Filtering. There are three fields to fill in. The first field is
the Condition. Here we execute bogofilter with the current
incoming email. Enter the following line:

execute "/usr/bin/bogofilter < %F"

The second field determines which action to take if the email is found to
be spam. I recommend leaving this at Move to move the spam email
to the SPAM folder. You could also Delete the email
or just mark it as spam and deliver it as usual, but I don't recommend
either. If
you choose Move as the action, then you should also specify the
mail directory to which to move the messages. Using the Select...
button, choose the SPAM folder we created earlier. Finally,
activate the new filtering rule by clicking Register. Figure 1
shows what the filtering rule should look like.

Figure 1 -- the filtering configuration window
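
To see what the filtering condition acts on, you can run bogofilter on a
single message by hand and look at its exit status. The sketch below
assumes the convention described in the bogofilter manual (0 for spam, 1
for non-spam) and the /usr/bin/bogofilter path used in the rule above;
classify is a name invented for this example.

```shell
# Command to run; the path matches the one used in the filtering rule.
BOGOFILTER=${BOGOFILTER:-/usr/bin/bogofilter}

# classify FILE : print "spam" or "good" for one message file.
# (Hypothetical helper; any other exit status is reported as unknown.)
classify() {
    status=0
    "$BOGOFILTER" < "$1" || status=$?
    case $status in
        0) echo spam ;;
        1) echo good ;;
        *) echo "unknown (exit status $status)" ;;
    esac
}

# Usage:
#   classify "$HOME/Mail/inbox/42"    # hypothetical message file
```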

Keeping bogofilter Sharp

The configuration we have implemented so far will probably catch more
spam than you expect. However, the key to success is keeping
bogofilter on its toes at all times. Keep training the filter so that it
can deal with new types of spam messages and still identify non-spam
messages for years to come. It would be really convenient to have a
"register as spam" button in every email client. In the future they will
probably have one. For now, we have to emulate this functionality
ourselves. It's really pretty simple.

We will manually move spam messages that bogofilter misses to the
SPAM directory. Each such message then needs to be registered by running
bogofilter on it with the -s option again.
Doing this by hand would be too much work, so we will create a cron job
that automates the process. This way we can keep moving spam messages to
the SPAM folder as we receive them (effectively scheduling
them to be marked as spam) and rely on the cron job to take care of the
rest for us. You might also want to copy a bunch of good emails into the
NONSPAM directory every once in a while, since the non-spam words
need to be kept up to date as well. A small script run from cron every day
can handle the training.
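
One way such a script might be written is sketched below, under several
assumptions. To avoid registering the same message twice, it records the
path of every file it has processed in a plain-text list and skips those
on later runs; this breaks if the mail client reuses message numbers. The
names train_new and BOGOFILTER, the list files and the script path in the
crontab line are all invented for this example.

```shell
# Command to run; override BOGOFILTER if the binary lives elsewhere.
BOGOFILTER=${BOGOFILTER:-bogofilter}

# train_new FLAG DIR SEEN : register messages in DIR that are not yet
# listed in the file SEEN, then add them to SEEN. (Hypothetical helper.)
train_new() {
    touch "$3"
    for msg in "$2"/*; do
        [ -f "$msg" ] || continue
        # Skip anything registered on an earlier run.
        if grep -qxF -e "$msg" "$3"; then
            continue
        fi
        # Exit status is ignored here; a real script may want to check it.
        "$BOGOFILTER" "$1" < "$msg" || true
        printf '%s\n' "$msg" >> "$3"
    done
}

# A script run from cron would end with calls such as:
#   train_new -s "$HOME/Mail/SPAM"    "$HOME/.bogofilter-trained-spam"
#   train_new -n "$HOME/Mail/NONSPAM" "$HOME/.bogofilter-trained-good"
# Example crontab entry (hypothetical script path), daily at 03:00:
#   0 3 * * * $HOME/bin/train-bogofilter
```

Keeping the list files outside the mail folders means they are never fed
to bogofilter as if they were messages.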