qsf reads a single email on standard input, and by default outputs it on
standard output. If the email is determined to be spam, an additional header
("X-Spam: YES") will be added, and optionally the subject line can have
"[SPAM]" prepended to it.

qsf is intended to be used in a
procmail(1)
recipe, in a ruleset such as this:

Before
qsf can be used properly, it needs to be trained. A good way to train
qsf is to collect a copy of all your email into two folders - one for spam, and
one for non-spam. Once you have done this, you can use the training
function, like this:

qsf -aT spam-folder non-spam-folder

This will generate a database that can be used by
qsf to guess whether email received in the future is spam or not.
Note that this initial training run may take a long time, but you should
only need to do it once.

To mark a single message as spam, pipe it to qsf with the --mark-spam or -m
("mark as spam") option. This will update the database accordingly and
discard the email.

To mark a single message as non-spam, pipe it to qsf with the --mark-nonspam
or -M ("mark as non-spam") option. Again, this will discard the email.

If a message has been mis-tagged, simply send it to
qsf as the opposite type, i.e. if it has been mistakenly tagged as spam, pipe it
into
qsf --mark-nonspam --weight=2 to add it to the non-spam side of the database with double the usual
weighting.

-d, --database [TYPE:]FILE

Use
FILE as the spam/non-spam database. The default is to use
/var/lib/qsfdb and, if that is not available or is read-only,
$HOME/.qsfdb. This option can also be useful if there is a system-wide database but you do
not want to use it - specifying your own here will override the default.

If you prefix the filename with a
TYPE, of the form
btree:$HOME/.qsfdb, then this will specify what kind of database
FILE is, such as
list, btree, gdbm, sqlite and so on. Check the output of
qsf -V to see which database backends are available. The default is to auto-detect
the type, or, if the file does not already exist, use
list. Note that
TYPE is not case-sensitive.

-g, --global [TYPE:]FILE

Use
FILE as the default global database, instead of
/var/lib/qsfdb. If you also specify a database with
-d, then this "global" database will be used in read-only mode in conjunction
with the read-write database specified with
-d. The
-g option can be used a second time to specify a third database, which will
also be used in read-only mode.
Again, the filename can optionally be prefixed with a
TYPE which specifies the database type.

-P, --plain-map FILE

Maintain a mapping of all database tokens to their non-hashed counterparts in
FILE, one token per line. This can be useful if you want to be able to list the
contents of your database at a later date, for instance to get a list of
email addresses in your allow-list. Note that using this option may slow
qsf down, and only entries written to the database while this option is active
will be stored in
FILE.

-s, --subject

Rewrite the Subject line of any email that turns out to be spam, adding
"[SPAM]" to the start of the line.

-S, --subject-marker SUBJECT

Instead of adding "[SPAM]", add
SUBJECT to the Subject line of any email that turns out to be spam. Implies
-s.

-H, --header-marker MARK

Instead of setting the X-Spam header to "YES", set it to
MARK if email turns out to be spam. This can be useful if your email client
can only search all headers for a string, rather than one particular header
(so searching for "YES" might match more than just the output of
qsf).

-n, --no-header

Do not add an X-Spam header to messages.

-r, --add-rating

Insert an additional header X-Spam-Rating which is a rating of the
"spamminess" of a message from 0 to 100; with the default threshold, 90 and
above are counted as spam, and anything under 90 is not considered spam.
If combined with
-t, then the rating (0-100) will be output, on its own, on standard output.

-A, --asterisk

Insert an additional header X-Spam-Level which will contain between 0 and 20
asterisks (*), depending on the spam rating.
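
As an illustration of how the rating relates to the headers described above, here is a minimal Python sketch. It is not taken from the qsf source. The function name spam_headers is hypothetical, and the mapping from rating to asterisks (one asterisk per 5 rating points) is an assumption: this manual only states that the header holds between 0 and 20 asterisks.

```python
def spam_headers(rating, threshold=90, header_marker="YES"):
    """Build the headers a filter might add for a 0-100 spam rating.

    threshold mirrors the documented default of 90, and header_marker
    mirrors the default "YES" value of the X-Spam header.
    """
    if not 0 <= rating <= 100:
        raise ValueError("rating must be between 0 and 100")

    headers = [("X-Spam-Rating", str(rating))]

    # ASSUMPTION: linear scaling, 5 rating points per asterisk (0..20).
    headers.append(("X-Spam-Level", "*" * (rating // 5)))

    # A message is spam once its rating reaches the threshold.
    if rating >= threshold:
        headers.append(("X-Spam", header_marker))
    return headers
```

With the defaults, a rating of 89 yields no X-Spam header, while a rating of 90 does.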

-t, --test

Instead of passing the message out on standard output, output nothing,
and exit 0 if the message is not spam, or exit 1 if the message is spam.
If combined with
-r, then the spam rating will be output on standard output.
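
The exit codes make -t easy to call from a script. The following Python sketch shows one way to wrap it. The helper name is_spam is hypothetical, and treating any exit code other than 0 or 1 as an error is an assumption, since this manual only documents those two values.

```python
import subprocess


def is_spam(message, command=("qsf", "-t")):
    """Pipe one raw email (bytes) to `qsf -t` and interpret the exit code.

    Exit 0 means "not spam" and exit 1 means "spam", as documented for -t.
    """
    result = subprocess.run(
        list(command),
        input=message,
        stdout=subprocess.DEVNULL,  # -t prints nothing unless combined with -r
    )
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    # ASSUMPTION: anything else indicates a failure rather than a verdict.
    raise RuntimeError(f"unexpected exit code {result.returncode}")
```

The command parameter exists so that extra options such as -a can be appended, or a stand-in program substituted when testing.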

-a, --allowlist

Enable the allow-list. This causes the email addresses given in the
messages "From:" and "Return-Path:" headers to be checked against a list;
if either one matches, then the message is always treated as non-spam,
regardless of what the token database says. When specified with a retraining
flag,
-a -m (mark as spam) will remove that address from the allow-list as well as
marking the message as spam, and
-a -M (mark as non-spam) will add that address to the allow-list as well as
marking the message as non-spam. The idea is that you add all of your
friends to the allow-list, and then none of their messages ever get marked
as spam.

-y, --denylist

Enable the deny-list. This causes the email addresses given in the
messages "From:" and "Return-Path:" headers to be checked against a second
list; if either one matches, then theh message is always treated as spam.
Training works in the same way as with
-a, except that you must specify
-m or
-M twice to modify the deny-list instead of the allow-list, and with the
reverse syntax:
-y -m -m (mark as spam) will add that address to the deny-list, whereas
-y -M -M (mark as non-spam) will remove that address from the deny-list.
This double specification is so that the usual retraining process never
touches the deny-list; the deny-list should be carefully maintained
rather than automatically generated.

Normally you would not need to use the deny-list.

-L, --level, --threshold LEVEL

Change the spam scoring threshold level which must be reached before an
email is classified as spam. The default is 90.

-Q, --min-tokens NUM

Only give a score if more than
NUM tokens are found in the message - otherwise the message is assumed to be
non-spam, and it is not modified in any way. The default is 0. This option
might be useful if you find that very short messages are being frequently
miscategorised.

-e, --email, --email-only EMAIL

Query or update the allow-list entry for the email address
EMAIL. With no other options, this will simply output "YES" if
EMAIL is in the allow-list, or "NO" if it is not. With
-t, it will not output anything, but will exit 0 (success) if
EMAIL is in the allow-list, or 1 (failure) if it is not. With the
-m (mark-spam) option, any previous allow-list entry for
EMAIL will be removed. Finally, with the
-M (mark-nonspam) option,
EMAIL will be added to the allow-list if it is not already on it.

If
EMAIL is just the word
MSG on its own, then an email will be read from standard input, and the email
addresses given in the "From:" and "Return-Path:" headers will be used.

Using
-e automatically switches on
-a.

If you also specify
-y, then the deny-list will be operated on. Remember that
-m and
-M are reversed with the deny-list.

If you specify an email address of the form
@domain (nothing before the @), then the whole
domain will be allow-listed or deny-listed.

-v, --verbose

Add extra
X-QSF-Info headers to any filtered email, containing error messages and so on if
applicable. Specify
-v more than once to increase verbosity.

-T, --train SPAM NONSPAM [MAXROUNDS]

Train the database using the two mbox folders
SPAM and
NONSPAM, by testing each message in each folder and updating the database each time
a message is miscategorised. This is done several times, and may take a
while to run. Specify the
-a (allow-list) flag to add every sender in the
NONSPAM folder to your allow-list as a side-effect of the training process.
If
MAXROUNDS is specified, training will end after this number of rounds if the results
are still not good enough. The default is a maximum of 200 rounds.
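
The train-on-error loop described above can be sketched in Python as follows. This is only an outline under stated assumptions: the classify and mark callbacks are hypothetical stand-ins for the filter and the database update, and stopping when a round produces no mistakes is an assumed criterion, because this manual does not define what counts as "good enough".

```python
def train(spam, nonspam, classify, mark, max_rounds=200):
    """Repeatedly test every message, correcting the database on each error.

    classify(message) returns True if the message is judged to be spam.
    mark(message, is_spam) updates the database for that message.
    Returns the number of rounds that were run.
    """
    for round_number in range(1, max_rounds + 1):
        mistakes = 0
        for message in spam:
            if not classify(message):      # spam that slipped through
                mark(message, True)
                mistakes += 1
        for message in nonspam:
            if classify(message):          # non-spam wrongly flagged
                mark(message, False)
                mistakes += 1
        # ASSUMPTION: a round with no mistakes ends the training.
        if mistakes == 0:
            return round_number
    return max_rounds
```

Because only miscategorised messages update the database, messages the filter already handles correctly add no further weight.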

-m, --mark-spam

Instead of passing the message out on standard output, mark its contents as
spam and update the database accordingly. If the allow-list
(-a) is enabled, the message's "From:" and "Return-Path:" addresses are removed
from the allow-list.
If the deny-list
(-y) is enabled and you specify
-m twice, the message's addresses are added to the deny-list instead.

-M, --mark-nonspam

Instead of passing the message out on standard output, mark its contents as
non-spam and update the database accordingly. If the allow-list
(-a) is enabled, the message's "From:" and "Return-Path:" addresses are added to
the allow-list (see the
-a option above).
If the deny-list
(-y) is enabled and you specify
-M twice, the message's addresses are removed from the deny-list instead.

-w, --weight WEIGHT

When marking as spam or non-spam, update the database with a weighting of
WEIGHT per token instead of the default of 1. Useful when correcting mistakes,
e.g. a message that has been mistakenly detected as spam should be marked as
non-spam using a weighting of 2, i.e. double the usual weighting, to
counteract the error.

-D, --dump [FILE]

Dump the contents of the database as a platform-independent text file,
suitable for archival, transfer to another machine, and so on. The data
is output on stdout or into the given
FILE.

-R, --restore [FILE]

Rebuild the database from scratch from the text file on stdin. If a
FILE is given, data is read from there instead of from stdin.

-O, --tokens

Instead of filtering, output a list of the tokens found in the message read
from standard input, along with the number of times each token was found.
This is only useful if you want to use
qsf as a general tokeniser for use with another filtering package.

-E, --merge OTHERDB

Merge the
OTHERDB database into the current database. This can be useful if you want to take
one users mailbox and merge it into the system-wide one, for instance (this
would be done by, as root, doing
qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and then removing
/home/user/.qsfdb).

-B, --benchmark SPAM NONSPAM [MAXROUNDS]

Benchmark the training process using the two mbox folders
SPAM and
NONSPAM. A temporary database is created and trained using the first 75% of the
messages in each folder, and then the entire contents of each folder is
tested to see how many false positives and false negatives occur. Some
timing information is also displayed.

This can be used to decide which backend is best on your system. Use
-d to select a backend, e.g.
qsf -B spam nonspam -d GDBM - this will create a temporary database which is removed afterwards.

The exception to this is the MySQL backend, where a full database
specification must be given
(-d MySQL:database=db;host=localhost;...) and the database table given will not be wiped beforehand or dropped
afterwards.

As with
-T, if
MAXROUNDS is specified, training will never be done for more than this number of
rounds; the default is 200.

-h, --help

Print a usage message on standard output and exit successfully.

-V, --version

Print version information, including a list of available database backends,
on standard output and exit successfully.

The following options are only for use with the old binary tree database
backend or old databases that haven't been upgraded to the new format that
came in with version 1.1.0.

-N, --no-autoprune

When marking as spam or nonspam, never automatically prune the database.
Usually the database is pruned after every 500 marks; if you would rather
--prune manually, use
-N to disable automatic pruning.

-p, --prune

Remove redundant entries from the database and clean it up a little. This
is automatically done after several calls to
--mark-spam or
--mark-nonspam, and during training with
--train if the training takes a large number of rounds, so it should rarely be
necessary to use
--prune manually unless you are using
-N / --no-autoprune.

-X, --prune-max NUM

When the database is being pruned, no more than
NUM entries will be considered for removal. This is to prevent CPU and memory
resources being taken over. The default is 100,000 but in some
circumstances (if you find that pruning takes too long) this option may be
used to reduce it to a more manageable number.

/var/lib/qsfdb

The default (system-wide) spam database. If you wish to install
qsf system-wide, this should be read-only to everyone; there should be one user
with write access who can update the spam database with
qsf --mark-spam and
qsf --mark-nonspam when necessary.

/var/lib/qsfdb2

A second, read-only, system-wide database. This can be useful when
installing
qsf system-wide and using third-party spam databases; the first global database
can be updated with system-specific changes, and this second database can be
periodically updated when the third-party spam database is updated.

$HOME/.qsfdb

The default spam database for per-user data. Users without write access to
the system-wide database will have their data written here, and the two
databases will be read together. The per-user database will be given a
weighting equivalent to 10 times the weighting of the global database.

Currently, you cannot use
qsf to check for spam while the database is being updated. This means that
while an update is in progress, all email is passed through as non-spam.

There is an upper size limit of 512KB on incoming email; anything larger
than this is just passed through as non-spam, to avoid tying up machine
resources.

The plaintext token mapping maintained by
--plain-map will never shrink, only grow. It is intended for use by housekeeping and
user interface scripts that, for instance, the user can use to list all
email addresses on their allow-list. These scripts should take care of
weeding out entries for tokens that are no longer in the database. If you
have no such scripts, there is probably no point in using
--plain-map anyway.

Avoid using the deny-list
(-y) in any automated retraining, as it can cause the filter to reject
mail unnecessarily. In general the deny-list is probably best left unused
unless explicitly required by your particular setup.

If both the allow-list and the deny-list are enabled, then email addresses
will first be checked against the deny-list, then the allow-list, then the
domain of the email address will be checked for matching "@domain" entries
in the deny-list and then in the allow-list.
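
The lookup order described above can be written out as a short Python sketch. The function name check_lists is hypothetical, and plain sets of strings stand in for the hashed entries that qsf actually stores.

```python
def check_lists(address, allow, deny):
    """Apply the documented precedence when both lists are enabled.

    Returns "spam", "nonspam", or None if no list entry matched.
    allow and deny are sets holding full addresses and "@domain" entries.
    """
    # "@domain" form: everything from the last "@" onwards.
    domain = "@" + address.rsplit("@", 1)[-1]

    if address in deny:       # 1. full address, deny-list
        return "spam"
    if address in allow:      # 2. full address, allow-list
        return "nonspam"
    if domain in deny:        # 3. whole domain, deny-list
        return "spam"
    if domain in allow:       # 4. whole domain, allow-list
        return "nonspam"
    return None
```

One consequence of this order is that an allow-listed address still gets through even when its whole domain is deny-listed.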

To do the same, but cleverly, so that only email to
spambox@yourdomain.com which
qsf does NOT already classify as spam gets marked as spam in the database (this
stops the database getting too heavily weighted):

Remove the
-a option in the above examples if you don't want to use the allow-list.

A more complicated filtering example - this will only run
qsf on messages which don't have a subject line saying "your <something> is on
fire" and which don't have a sender address ending in "@foobar.com", meaning
that messages with that subject line OR that sender address will NEVER be
marked as spam, no matter what:

If you use
qmail(7),
then to get
procmail working with it you will need to put a line containing just
DEFAULT=./Maildir/ at the top of your
~/.procmailrc file, so that
procmail delivers to your Maildir folder instead of trying to deliver to
/var/spool/mail/$USER, and you will need to put this in your
~/.qmail file:

| preline procmail

This will cause all your mail to be delivered via
procmail instead of being delivered directly into your mail directory.

See the
qmail(7)
documentation for more about mail delivery with qmail.

If you use
postfix(1),
you can set up a system-wide mail filter by creating a user account for the
purpose of filtering mail, populating that account's
.qsfdb, and then creating a shell script, to run as that user, which runs
qsf on stdin and passes stdout to
sendmail(8).

Doing this requires some knowledge of
postfix configuration and care needs to be taken to avoid mail loops.
One
qsf users full HOWTO is included in the
doc/ directory with this package.

A feature called the "allow-list" can be switched on by specifying the
--allowlist or -a option. This causes each message's "From:" and "Return-Path:" addresses to be
checked against a list of people you have said to allow all messages from,
and if a message's "From:" or "Return-Path:" address is in the list, it is
never marked as spam. This means you can add all your friends to an
"allow-list" and
qsf will then never mis-file their messages - a quick way to do this is to use
-a with
-T (train); everyone in your non-spam folder who has sent you an email will be
added to the allow-list automatically during training.

You can manually add and remove addresses to and from the allow-list using
the
-e (email) option. For instance, to add
foo@bar.com to the allow-list, do this:

qsf -e foo@bar.com -M

In general, you probably always want to enable the allow-list, so always
specify the
-a option when using
qsf. This will automatically maintain the allow-list based on what you classify
as spam or non-spam.

The only times you might want to turn it off are when people on your
allow-list are prone to getting viruses or if a virus is causing email to be
sent to you that is pretending to be from someone on your allow-list.

Because the database format is platform-specific, it is a good idea to
periodically dump the database to a text file using
qsf -D so that, if necessary, it can be transferred to another machine and restored
with
qsf -R later on.

Also note that since the actual contents of email messages are never stored
in the database (see
TECHNICAL DETAILS), you can safely share your
qsf database with friends - simply dump your database to a file, like this:

qsf -D > your-database-dump.txt

Once you have sent
your-database-dump.txt to another person, they can do this:

When a message is passed to
qsf, any attachments are decoded, all HTML elements are removed, and the message
text is then broken up into "tokens", where a "token" is a single word or
URL. Each token is hashed using the MD5 algorithm (see below for why), and
that hash is then used to look up each token in the
qsf database.

For full details of which parts of an email (headers, body, attachments,
etc) are used to calculate the spam rating, see the
TOKENISATION section below.

Within the database, each token has two numbers associated with it: the
number of times that token has been seen in spam, and the number of times it
has been seen in non-spam. These two numbers, along with the total number
of spam and non-spam messages seen, are then used to give a "spamminess"
value for that particular token. This "spamminess" value ranges from
"definitely not spammy" at one end of the scale, through "neutral" in the
middle, up to "definitely spammy" at the other end.

Once a "spamminess" value has been calculated for all of the tokens in the
message, a summary calculation is made to give an overall "is this spam?"
probability rating for the message. If the overall probability is 0.9 or
above, the message is flagged as spam.
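
This manual does not give the formulas used for the per-token value or the summary calculation. As a rough illustration only, the Python sketch below uses one commonly described approach: a per-token frequency ratio combined with a naive Bayes product. The function names, the 0.01 to 0.99 clamp, and the formulas themselves are all assumptions, not qsf's actual method.

```python
import math


def token_spamminess(spam_count, nonspam_count, total_spam, total_nonspam):
    """Return a value from 0 (not spammy) to 1 (spammy); 0.5 is neutral."""
    if total_spam == 0 or total_nonspam == 0:
        return 0.5                       # nothing to compare against yet
    spam_freq = spam_count / total_spam
    nonspam_freq = nonspam_count / total_nonspam
    if spam_freq + nonspam_freq == 0:
        return 0.5                       # token never seen before
    value = spam_freq / (spam_freq + nonspam_freq)
    # ASSUMPTION: clamp so that no single token is treated as certain.
    return min(0.99, max(0.01, value))


def message_probability(values):
    """Combine per-token values into one overall spam probability."""
    spam_product = math.prod(values)
    nonspam_product = math.prod(1 - v for v in values)
    if spam_product + nonspam_product == 0:
        return 0.5
    return spam_product / (spam_product + nonspam_product)
```

Under this sketch, a message would be flagged when message_probability returns 0.9 or more, matching the threshold stated above.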

In addition to the probability test, there is the "allow-list". If enabled (with
the
-a option), the whole probability check is skipped if the sender of the message
is listed in the allow-list, and the message is not marked as spam.

When training the database, a message is split up into tokens as described
above, and then the numbers in the database for each token are simply added
to: if you tell
qsf that a message is spam, it adds one to the "number of times seen in spam"
counter for each token, and if you tell it a message is not spam, it adds
one to the "number of times seen in non-spam" counter for each token. If
you specify a weight, with
-w, then the number you specify is added instead of one.
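
The counter update can be sketched in a few lines of Python. The function name mark and the dictionary layout (token key mapped to a [spam, non-spam] pair) are illustrative choices, not the real database format. Whether qsf counts each distinct token once or every occurrence is not stated here, so this sketch increments once per item it is given.

```python
def mark(db, tokens, is_spam, weight=1):
    """Add `weight` to the spam or non-spam counter of each token.

    db maps a token key to a two-item list: [seen_in_spam, seen_in_nonspam].
    weight corresponds to the -w option and defaults to 1.
    """
    index = 0 if is_spam else 1
    for token in tokens:
        counters = db.setdefault(token, [0, 0])  # new tokens start at zero
        counters[index] += weight
```

For example, marking a mistakenly flagged message as non-spam with weight=2 adds 2 to the non-spam counter of each of its tokens.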

To stop the database growing uncontrollably, the database keeps track of
when a token was last used. Underused tokens are automatically removed from
the database. (The old method was to "prune" every 500 updates).

Finally, the reason MD5 hashes were used is privacy. If the actual tokens
from the messages, and the actual email addresses in the allow-list, were
stored, you could not share a single
qsf database between multiple users because bits of everyone's messages would be
in the database - things like emailed passwords, keywords relating to
personal gossip, and so on. So a hash is stored instead. A hash is a
"one-way" function; it is easy to turn a token into a hash but very hard
(some might say impossible) to turn a hash back into the token that created
it. This means that you end up with a database with no personal information
in it.
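
To make the hashing step concrete, here is a minimal Python sketch using the standard hashlib module. The function name token_key is hypothetical. Encoding the token as UTF-8 and storing the hexadecimal digest are assumptions, since this manual does not say how qsf normalises tokens or represents the hash.

```python
import hashlib


def token_key(token):
    """Return the MD5 digest of a token, as a 32-character hex string.

    The same token always yields the same key, so lookups still work,
    but the original text cannot simply be read back from the database.
    """
    return hashlib.md5(token.encode("utf-8")).hexdigest()
```

The same idea applies to allow-list entries: only the digest of an address would be stored, not the address itself.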

When a message is broken up into tokens, various parts of the message are
treated in different ways.

First, all header fields are discarded, except for the important ones:
From,
Return-Path,
Sender,
To,
Reply-To, and
Subject.

Next, any MIME-encoded attachments are decoded. Any attachments whose MIME
type starts with "text/" (i.e. HTML and text) are tokenised, after having
any HTML tags stripped. Any non-textual attachments are replaced with their
MD5 hash (such that two identical attachments will have the same hash), and
that hash is then used as a token.

In addition to single-word tokens from textual message parts,
qsf adds doubled-up tokens so that word pairs get added to the database. This
makes the database a bit bigger (although the automatic pruning tends to
take care of that) but makes matching more exact.
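
The word-pair idea can be illustrated with a short Python sketch. The function name with_word_pairs is hypothetical, and joining the two words with a space is an assumption. This manual does not describe how the doubled-up tokens are formed internally.

```python
def with_word_pairs(words):
    """Return the single-word tokens followed by adjacent word pairs."""
    words = list(words)
    # Pair each word with the one that follows it.
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs
```

For the words "buy", "cheap", "pills", this adds the pairs "buy cheap" and "cheap pills" to the three single-word tokens.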

As well as using the textual content of email to detect spam,
qsf also uses special filters which create "pseudo-tokens" based on various
rules. This means that specific patterns, not just individual words, can be
used to determine whether a message is spam or not.

For example, if a message contains lots of words with multiple consonants,
like "ashjkbnxcsdjh", then each time a word like that is seen the special
token ".GIBBERISH-CONSONANTS." is added to the list of tokens found in the
message. If it turns out that most messages with words that trigger this
filter rule are spam, then other messages with gibberish consonant strings
will be more likely to be flagged as spam.

Currently the special filters are:

GTUBE

Flags any message containing the string
XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X as spam - useful for testing that your
qsf installation is working.

ATTACH-GIF, ATTACH-JPG, ATTACH-PNG

Adds a token for every attachment whose filename ends in ".gif", ".jpg" or
".jpeg", and ".png" respectively.

ATTACH-DOC, ATTACH-XLS, ATTACH-PDF

Adds a token for every attachment whose filename ends in ".doc", ".xls", or
".pdf" respectively (these tend to indicate a non-spam email).

SINGLE-IMAGE

Adds a token if the message contains exactly one attached image.

MULTIPLE-IMAGES

Adds a token if the message contains more than one attached image.

GIBBERISH-CONSONANTS

Adds a token for every word found that has multiple consonants in a row, as
described above. Spam often contains strings of gibberish.

GIBBERISH-VOWELS

Adds a token for every word found that has multiple vowels in a row, e.g.
"aeaiaiaeeio".

GIBBERISH-FROMCONS

Like
GIBBERISH-CONSONANTS, but only for the "From:" and "Return-Path:" addresses on their own.

GIBBERISH-FROMVOWL

Like
GIBBERISH-VOWELS, but only for the "From:" and "Return-Path:" addresses on their own.

GIBBERISH-BADSTART

Adds a token for every word that starts with a bad character such as %.

GIBBERISH-HYPHENS

Adds a token for every word with more than three hyphens or underscores in
it.

GIBBERISH-LONGWORDS

Adds a token for every word with over 30 characters in it (but fewer than 60).

HTML-COMMENTS-IN-WORDS

Adds a token for every HTML comment found in the middle of a word. Spam
often contains HTML inside words, like this: w<!--dsgfhsdgjgh-->ord

HTML-EXTERNAL-IMG

Adds a token for every HTML <img> (image) tag found that contains :// (i.e.
it refers to an external image).

HTML-FONT

Adds a token for every HTML <font> tag found.

HTML-IP-IN-URLS

Adds a token for every URL found containing an IP address.

HTML-INT-IN-URL

Adds a token for every URL found containing an integer in its hostname.

HTML-URLENCODED-URL

Adds a token for every URL found containing a % sign in its hostname.

Normally, filters will just cause a token to be added, and these tokens are
processed by the normal weighting algorithm. However the
GTUBE filter will immediately flag any matching message as spam, bypassing the
token matching.
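
A few of the word-level filters above can be sketched in Python to show how pseudo-tokens arise. This is not qsf's implementation. The run lengths for consonants and vowels (5 and 4) are assumptions, because this manual only says "multiple". The ".NAME." token format is documented only for GIBBERISH-CONSONANTS and is assumed for the other filters.

```python
import re

# ASSUMPTION: run lengths of 5 consonants / 4 vowels count as gibberish.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{5,}", re.IGNORECASE)
VOWEL_RUN = re.compile(r"[aeiou]{4,}", re.IGNORECASE)


def pseudo_tokens(words):
    """Return one pseudo-token per filter rule each word triggers."""
    tokens = []
    for word in words:
        if CONSONANT_RUN.search(word):
            tokens.append(".GIBBERISH-CONSONANTS.")
        if VOWEL_RUN.search(word):
            tokens.append(".GIBBERISH-VOWELS.")
        # More than three hyphens or underscores in one word.
        if word.count("-") + word.count("_") > 3:
            tokens.append(".GIBBERISH-HYPHENS.")
        # Over 30 characters, but fewer than 60.
        if 30 < len(word) < 60:
            tokens.append(".GIBBERISH-LONGWORDS.")
    return tokens
```

These pseudo-tokens would then be looked up and weighted like any other token, so a rule only counts against a message if training has shown it to be associated with spam.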

The inbuilt "list" database backend will not necessarily provide the best
performance, but is provided because using it requires no external
libraries.

If, when
qsf was compiled, the correct libraries were available, then it will be possible
to use
qsf with alternative database backends. To find out which backends you have
available, run
qsf -V (capital V) and read the second line of output. To see how well a backend
performs, collect some spam and non-spam and use
qsf -d BACKEND -B SPAM NONSPAM (see the entry for
-B above).

Some people find that they get the best performance out of the
gdbm backend; this is a library that is widely available on many systems.

To efficiently share a
qsf database across multiple machines, you may find the MySQL backend useful.
However, using it is a little more complicated.

To use the MySQL backend you will need to create a table with the fields
key1, key2, token, value1, value2 and value3. The
token, value1, value2, and value3 fields must be
VARCHAR(64),
BIGINT or
INT, and
BIGINT or
INT
respectively, and indexing on the
token field is a good idea. The
key1 and key2 fields can be anything, but they must be present.

The
key1 and key2 fields allow you to have multiple
qsf databases in one table, by specifying different
key1 and key2 values on invocation.

Instead of specifying a database file with the
--database / -d option, you must specify either a specification string as described below,
or the name of a file containing such a string on its first line.

If you have problems with
qsf, please check the list below; if this does not help, go to the
qsf home page and investigate the mailing lists, or email the author.

Nothing is being marked as spam.

First, use the
-r option to switch on the
X-Spam-Rating header, and check that this header appears in email passed through
qsf. If it does not, then it is likely that
qsf is not being run at all - check your configuration of
procmail(1)
or its equivalent.

If you are seeing
X-Spam-Rating headers, and different emails have different scores, then you may simply
need to retrain your database a little more. Take more spam email and pass
it to
qsf -m.

If you are seeing
X-Spam-Rating headers but they all give the same spam rating, then the most likely reason
is that
qsf is not reading any database. Make sure that whatever is processing the email
has read permissions on
/var/lib/qsfdb and/or
~/.qsfdb - and make sure that, if you are using
~/.qsfdb, what your database creator thought was
~ ($HOME) is the same as it is for whatever is processing the email.

Retraining sometimes takes a very long time.

With the
obtree backend or 2-column MySQL or SQLite tables, every 500th retrain
(-m or -M), the database is pruned. On some systems this may take some time, and during
this time the database is locked (except when using the MySQL or SQLite backends).
If you constantly do a lot of retraining and want to avoid this, then use
the
-N option to suppress auto-pruning, and then have a
cron(8)
job or something run a manual prune
(qsf -p) every now and again.

Running qsf from procmail fails with an error.

If you can run
qsf from the command line, but in your
procmail log file you get errors about "qsf: cannot execute binary file", then
contact your system administrator for help. It may be that incoming email is
handled by a different server to the one you normally shell into, and either
they are of a different architecture or operating system, or the mail server
is not permitted to execute user-owned binaries.