An Introduction to the Spambayes Project

A trainable system that works with your current e-mail system to catch and filter junk mail.

The Spambayes Project is one of many
projects inspired by Paul Graham's “A Plan for Spam”
(www.paulgraham.com/spam.html).
This famous article describes how to use a statistical technique
called Bayesian analysis to decide whether an e-mail message is
spam. For the full story of how the mathematics behind Spambayes
works and how it has evolved, see Gary Robinson's accompanying
article on page 58.

In a nutshell, the system is trained on a set of known spam
messages and a set of known non-spam, or “ham”, messages. It breaks
the messages into tokens (words, loosely speaking) and gives each
token a score according to how frequently it appears in each type
of message. These scores are stored in a database. A new message is
tokenized and the tokens are compared with those in the score
database in order to classify the message. The tokens together give
an overall score—a probability that the message is spam.
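The train-then-score cycle described above can be sketched in a few lines of Python. This is only an illustration and not the Spambayes code: the function names (tokenize, train, token_score, message_score) are made up for this example, the tokenizer is deliberately crude, and the per-token scores are combined with a plain naive-Bayes product instead of the more refined method the real classifier uses.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word-like tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9$'!-]+", text.lower())

def train(spam_messages, ham_messages):
    """Count, for each token, how many spam and how many ham messages contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    for msg in spam_messages:
        spam_counts.update(set(tokenize(msg)))
    for msg in ham_messages:
        ham_counts.update(set(tokenize(msg)))
    # This tuple plays the role of the "score database" in the text.
    return spam_counts, ham_counts, len(spam_messages), len(ham_messages)

def token_score(token, db):
    """Return a score in (0, 1); higher means the token is more typical of spam."""
    spam_counts, ham_counts, nspam, nham = db
    # Add-one smoothing keeps every score strictly between 0 and 1.
    p_spam = (spam_counts[token] + 1) / (nspam + 2)
    p_ham = (ham_counts[token] + 1) / (nham + 2)
    return p_spam / (p_spam + p_ham)

def message_score(text, db):
    """Combine the token scores into one overall spamminess between 0 and 1."""
    log_spam = log_ham = 0.0
    for token in set(tokenize(text)):
        p = token_score(token, db)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Clamp the exponent so very long messages cannot overflow math.exp.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))
```

Calling train() once on your saved spam and ham, then message_score() on each new message, mirrors the flow in the paragraph above. A message with no recognizable tokens scores exactly 0.5 in this sketch.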

The fact that you train Spambayes by using your own messages
is one of its strengths. It learns about the kinds of messages,
both ham and spam, that you receive. Other spam-filtering tools
that use blacklists, generic spam-identification rules or databases
of known spams don't have this advantage.

The Spambayes software classifies e-mail by adding an
X-Spambayes-Classification header to each message. This header has
a value of spam, ham or unsure. You then use your existing e-mail
software to filter based on the value of that header. The value is
derived from a spamminess score that runs from 0 (ham) to 1 (spam).
By default, a score below 0.2 means ham and a score above 0.9 means
spam. Any message scoring between those figures is marked as unsure.
You can tune these thresholds
yourself; see below for information on how to configure the
software.
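To make the thresholds concrete, here is a minimal sketch of how a score could be turned into the header value. The function names classify and add_classification_header are hypothetical, and the real software may format the header differently; only the header name, the three values and the default cutoffs of 0.2 and 0.9 come from the description above.

```python
from email import message_from_string

HAM_CUTOFF = 0.2   # default from the article: below this is ham
SPAM_CUTOFF = 0.9  # default from the article: above this is spam

def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamminess score in [0, 1] to 'ham', 'spam' or 'unsure'."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

def add_classification_header(raw_message, score):
    """Return the message text with an X-Spambayes-Classification header set."""
    msg = message_from_string(raw_message)
    # Drop any header already present so a sender cannot pre-label a message.
    del msg["X-Spambayes-Classification"]
    msg["X-Spambayes-Classification"] = classify(score)
    return msg.as_string()
```

Your mail client's filter rules then only need to match on the header text, for example moving anything marked spam into a junk folder and anything marked unsure into a folder you review by hand.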

Why Spambayes Is Different

Spambayes is different from other spam classifiers in three
ways: its test-based design philosophy, its tokenizer and its
classifier.

We can all think of obvious ways to identify spam: it has
SHOUTING subject lines; it tells you how to Make Money Fast!!!; it
purports to be from the vice president of Nigeria or his wife. It's
tempting to tune any spam-classification software according to
obvious rules. For instance, it should obviously be case-sensitive,
because FREE is a much better spam clue than free. But the
Spambayes team refused from the outset to take anything at face
value. One of the earliest components of the software was a solid
testing framework, which would compare new ideas against the
previous version. Any idea that didn't improve the results was
ditched. The results were often surprising; for instance, case
sensitivity made no significant difference. This
prove-it-or-lose-it approach has helped develop an incredibly
accurate system, with little wasted effort.
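The prove-it-or-lose-it rule can be pictured as a small comparison harness. The sketch below is an assumption about the general shape of such a test, not the project's actual framework: error_rates and keep_change are invented names, and a classifier here is simply any function that takes a message and returns "spam" or "ham".

```python
def error_rates(classify, spam_test, ham_test):
    """Return (false positive rate, false negative rate) on held-out messages."""
    false_negatives = sum(1 for m in spam_test if classify(m) != "spam")
    false_positives = sum(1 for m in ham_test if classify(m) != "ham")
    return false_positives / len(ham_test), false_negatives / len(spam_test)

def keep_change(baseline, candidate, spam_test, ham_test):
    """Keep a candidate only if it is no worse on either rate and better on one."""
    base_fp, base_fn = error_rates(baseline, spam_test, ham_test)
    cand_fp, cand_fn = error_rates(candidate, spam_test, ham_test)
    no_worse = cand_fp <= base_fp and cand_fn <= base_fn
    better = cand_fp < base_fp or cand_fn < base_fn
    return no_worse and better

# Two toy classifiers: one case-sensitive, one not.
case_sensitive = lambda m: "spam" if "FREE" in m else "ham"
case_insensitive = lambda m: "spam" if "free" in m.lower() else "ham"

spam_test = ["FREE money", "free offer"]
ham_test = ["lunch?", "feel free to call"]
print(keep_change(case_sensitive, case_insensitive, spam_test, ham_test))
```

On this toy data the case-insensitive variant catches more spam but also misfiles a legitimate message, so the harness rejects it. Real comparisons need far larger test sets, but the principle is the same: the numbers decide, not intuition.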

The tokenizer does the job of splitting messages into tokens.
It has evolved from simple split-on-whitespace into something that
knows about the structure of messages, for instance, tagging words
in the Subject line so that they are separately identified from
words in the body. It also knows about their content, for instance,
tokenizing embedded URLs differently from plain text. All the
special rules in the tokenizer have been rigorously tested and
proven to improve accuracy. This includes deliberately hiding
certain tokens—for example, we strip HTML decorations and ignore
most headers by default. Surprising decisions, but they're backed
up by testing.
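A structure-aware tokenizer along these lines might look like the following sketch. The "subject:" and "url:" prefixes, the regular expressions and the function name are assumptions chosen for illustration; the real tokenizer has many more rules and its own token formats. The sketch also handles only single-part messages.

```python
import re
from email import message_from_string

# Hostname is captured in group 1; quotes and angle brackets end the URL.
URL_RE = re.compile(r"https?://([^/\s\"'<>]+)(/[^\s\"'<>]*)?", re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<[^>]+>")

def tokenize(raw_message):
    """Yield tokens that record where in the message each clue was found."""
    msg = message_from_string(raw_message)

    # Subject words are tagged so they are counted apart from body words.
    for word in (msg["Subject"] or "").split():
        yield "subject:" + word

    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # skip multipart here

    # Embedded URLs become a single host token instead of plain words.
    for match in URL_RE.finditer(body):
        yield "url:" + match.group(1).lower()
    body = URL_RE.sub(" ", body)

    # HTML decorations are stripped, as the article says is done by default.
    body = HTML_TAG_RE.sub(" ", body)

    for word in body.split():
        yield word
```

For a message with the subject "FREE offer" and a body containing a link to example.com, this yields subject:FREE and subject:offer, then url:example.com, then the remaining body words. A classifier can then learn that a word in the Subject line carries different weight from the same word in the body.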

The classifier is the statistical core of Spambayes, the
number cruncher. This has evolved a great deal since its beginnings
in Paul Graham's article, again through test-based development.
Gary's article, “A Statistical Approach to the Spam Problem”
(page 58), covers the classifier in detail.

Requirements and Installation

The Spambayes software is available for download from
sf.net/projects/spambayes.
It requires Python 2.2 or above and version 2.4.3 or above of the
Python email package. If you're running Python 2.2.2 or above, you
should already have this. If not, you can download it from
mimelib.sf.net and
install it: unpack the archive, cd to the
email-2.4.3 directory and type python setup.py install.
This will install it in your Python site-packages directory. You'll
also need to move aside the standard email library; go to your
Python Lib directory, and rename the email directory to
email_old.

Keeping up to Date

Because the project is in constant development, things are
sure to change between my writing this article and the magazine
hitting the newsstand. I'll publish a summary of any major changes
on an Update page at
www.entrian.com/spambayes.

Some of the things we're working on as I write this article
include more flexible command-line training; enabling integration
with more e-mail clients, such as Mutt; web-based configuration;
security features for the web interface; and easier installation.
I'll provide full details of these items on the Update page.