On Thu, Oct 17, 2002 at 06:30:18AM +0200, Tollef Fog Heen wrote:
> * Duncan Findlay
>
> | Joey, if you (or someone else) could separate all mail coming into the
> | BTS into spam and non-spam (manually verified), we could use that to
> | create optimised scores for SA.
>
> If you want spam to feed to some automated scanner in SA, I can
> provide you with about 32k spam messages, approaching 500MB. Contact
> me off-list if you are interested.

The scores produced for SpamAssassin are determined based on a corpus
of spam and non-spam provided by volunteers. Although the corpus may
have a slight technical bias, we try to include as much commercial
non-spam and as many legitimate mailing lists as possible.

If we were able to base scores solely on the kind of mail we receive
for the BTS, SpamAssassin would be able to filter it more effectively.
Think of it as optimising SpamAssassin for a specific type of mail.

I would estimate that customised (evolved) scores would cut the
false negatives by at least half, and the false positives by even more.

The difficulty is in creating the corpora on which the scores must be
based. Any spam in a non-spam corpus (or vice versa) would have a huge
impact. The corpora don't have to be _too_ large. The corpora used for
the default scores are 33k spam and 170k non-spam, but we'd probably
get decent results with a total of about 20k (with the split about
equal to the split of mail received by the BTS).
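
To make the sizing concrete, here is a rough sketch of the arithmetic in
Python. The 40% spam fraction is a made-up placeholder, since I don't have
the real BTS figure, and corpus_split is a hypothetical helper, not
anything that ships with SpamAssassin.

```python
def corpus_split(total, spam_fraction):
    """Return (spam, nonspam) message counts for a corpus of `total` messages.

    spam_fraction is the share of incoming mail that is spam (0.0 to 1.0).
    """
    if not 0.0 <= spam_fraction <= 1.0:
        raise ValueError("spam_fraction must be between 0 and 1")
    spam = round(total * spam_fraction)
    return spam, total - spam


# Split used for the default scores: 33k spam, 170k non-spam.
DEFAULT_SPAM = 33_000
DEFAULT_NONSPAM = 170_000
default_fraction = DEFAULT_SPAM / (DEFAULT_SPAM + DEFAULT_NONSPAM)

# ASSUMPTION: placeholder spam share for BTS mail; measure the real one.
assumed_bts_fraction = 0.40

if __name__ == "__main__":
    print("20k at the default split:", corpus_split(20_000, default_fraction))
    print("20k at the assumed BTS split:", corpus_split(20_000, assumed_bts_fraction))
```

Once the real spam share of BTS mail is known, substituting it for the
placeholder gives the number of hand-verified messages needed in each pile.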
--
Duncan Findlay