We've been examining the debug output in one of the PMs, and while it does not look like the Bayesian filter is corrupted, it does appear that it has lost accuracy. The filter "learns as you go", but if users do not occasionally force the delivery of good emails that were blocked by mistake, the Bayesian filter could mis-learn and become less accurate.

I would recommend resetting the Bayesian database so that you start with a fresh statistical database. To do so, please stop SpamFilter, delete or rename the \SpamFilter\corpus directory, then restart SpamFilter.

Thought you would say that... I did rename the corpus folder and kept the original database just in case it was corrupt. The new DB is working fine now. How long before it kicks into action? I.e., how many emails does it take before it starts working again?

Also, is there any way that I can force-train the database? I have several hundred thousand spam and ham emails which I would like to use to train the Bayesian filter, in the same way that you can manually train SpamAssassin.

By default, the Bayesian filter starts to block emails after receiving 5,000 spam and 5,000 good emails. The only way to "train" the database is by forcing the delivery of emails that were blocked by mistake. This teaches the filter which emails were valid. The opposite (teaching it which emails were spam) is not currently possible. This is because after an email is delivered, depending on the email client used, it may be impossible for the end-user to have access to the original, unmodified source of the email (Microsoft Outlook for example *completely* alters the email's source). The only solution would be for SpamFilter to retain a copy of the original emails that were delivered, and that is something we've been looking at for a while, but we still have not decided if it is a good idea or not.
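To illustrate the default threshold described above, here is a minimal Python sketch. It is not SpamFilter's actual code: the class name, method names, and the way counts are stored are all made up for illustration. The only detail taken from this thread is the default of 5,000 spam and 5,000 good emails.

```python
from dataclasses import dataclass


@dataclass
class LearningCounters:
    """Hypothetical counters for emails seen by a learning filter."""

    spam_seen: int = 0
    ham_seen: int = 0

    def record(self, is_spam: bool) -> None:
        # Count one more learned email of the given kind.
        if is_spam:
            self.spam_seen += 1
        else:
            self.ham_seen += 1

    def ready_to_block(self, min_spam: int = 5000, min_ham: int = 5000) -> bool:
        # Blocking only starts once BOTH thresholds have been reached.
        return self.spam_seen >= min_spam and self.ham_seen >= min_ham
```

The point of the sketch is that both counts have to reach the threshold: a fresh database that has seen plenty of spam but only a few good emails would still not be blocking anything.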

We actually use SF in tag-and-deliver mode, forwarding all emails to an internal server for further processing. We therefore keep ALL emails, both ham and spam, and so do indeed have access to the original emails. Knowing this, is there any way that I can retrain the corpus? We have several hundred thousand emails of each type, which have proved to be a good corpus for us. Wiping that out because the corpus got corrupt is going to be a HUGE pain for us.

If you do have the *original* emails, and know which are spam and which are not, you could in theory re-send all the emails back to SpamFilter (we'd suggest installing a separate copy of SpamFilter somewhere else, with a "fake" destination SMTP server so emails won't be delivered again, and starting with a blank Bayesian database). You could then first send all the "good" emails, possibly whitelisting the sender's IP address so as to tell SpamFilter they are "clean". You could then send all the spam ones, this time blacklisting the IP to tell SpamFilter they are spam. We'll need to double-check to ensure that manually blacklisting and manually whitelisting IPs will still cause SpamFilter to "teach" the Bayesian filter about these emails, as there's a chance that this manual intervention may cause emails to be skipped by the learning process.
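A rough sketch of how the re-sending could be scripted in Python follows. It assumes the original emails are stored as one unmodified `.eml` file per message, which may not match your archive; the function names, folder layout, host, and addresses are placeholders, not anything SpamFilter provides. The script only speaks plain SMTP through the standard `smtplib` module and sends the raw bytes untouched, so the filter sees the original source.

```python
import smtplib
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def load_raw_messages(folder: Path) -> Iterator[bytes]:
    """Yield each stored email as raw bytes, without parsing or altering it."""
    for path in sorted(folder.glob("*.eml")):
        yield path.read_bytes()


def replay(
    messages: Iterable[bytes],
    send: Callable[[str, List[str], bytes], object],
    mail_from: str,
    rcpt_to: str,
) -> int:
    """Hand every message to `send` and return how many were submitted."""
    count = 0
    for raw in messages:
        send(mail_from, [rcpt_to], raw)
        count += 1
    return count


def replay_folder_via_smtp(
    folder: Path, host: str, port: int, mail_from: str, rcpt_to: str
) -> int:
    """Replay a folder of .eml files to the (separate) SpamFilter instance."""
    with smtplib.SMTP(host, port) as smtp:
        return replay(load_raw_messages(folder), smtp.sendmail, mail_from, rcpt_to)
```

You would run it once against the folder of good emails while the sending machine's IP is whitelisted, then change that entry to a blacklist entry and run it again against the spam folder, for example `replay_folder_via_smtp(Path("ham"), "sf-test.example.com", 25, "replay@example.com", "sink@example.com")`, where every value shown is a placeholder.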

We confirm that in both cases the emails will be passed thru the bayesian learning process. For the one where the sender's IP is blacklisted however, you will need to ensure that the emails will be quarantined. Only if they are quarantined will the emails be forwarded to the "learning" process. The quarantine database must then be enabled, and the option to "do not quarantine" for the blacklisted IP filters must not be checked. Please note that the quarantine database can be a different database than your "real" production data, if you are performing this on a separate instance of SpamFilter.

Can you please clarify the following statement for me: "Only if they are quarantined will the emails be forwarded to the "learning" process."

Our normal setup is that we do NOT use the SF quarantine database; instead we tag the emails and pass them on to another server for further processing, and use our own quarantine. In doing this, if I read your last statement correctly, have our emails never been learnt as spam?

My fault. I should have said "quarantined or delivered". Only in those cases does the bayesian filter actually see the contents of the email and will thus analyze them. "Tagging" does cause them to be delivered so yes, they will be going thru the learning process.

.... re-send all the emails back to SpamFilter (we'd suggest installing a separate copy of SpamFilter somewhere else...

The bayes is server-specific. If you set up a second copy, then that second server will have the correct bayes info but the primary one will not. Too bad there is no way to share bayes info amongst multiple SFE's.

Still think greylisting is the best primary defense even if it is slower amongst multiple SFE's. We'll just get a faster box ;-)

If you have two "live" servers running SpamFilter and receiving emails in real-time, for example a primary MX and a secondary MX server, WebGuyz is correct. It's not that they "won't work", but rather it has to do with statistics. Most legitimate mail servers will send emails to your primary MX server only. Spammers will send spam to both servers. This means that, statistically, the emails received by your primary MX server will be *very* different from the emails received by the secondary MX server. For this reason, copying the Bayesian statistical database (which is built by examining the types of emails received, and marking incoming emails by comparing them to the "average" emails being received) between those two servers will often produce completely incorrect results.

In your case, you are re-submitting emails that have already been received by your single, live server, and are allowing the Bayesian filter to re-process them. There are bound to be some inaccuracies. For example, the Bayesian filter keeps track of the time the various words were received, so if you submit them all at once this timestamp will be inaccurate. However, the timestamp is only used to "age" old words that are no longer being received, and to eventually remove them from the database. We don't think this will cause huge inaccuracies if you submit them all at once rather than over the span of several days... But again, please note that the process you are performing has never been done before, and that we did recommend starting from scratch...!
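To show why the timestamps matter less than it may seem, here is a small Python sketch of the "aging" idea. It is only an illustration: the function name, the dictionary layout, and the `max_age` cutoff are assumptions, and SpamFilter's real aging rules are not described in this thread.

```python
from datetime import datetime, timedelta
from typing import Dict


def expire_old_tokens(
    last_seen: Dict[str, datetime], now: datetime, max_age: timedelta
) -> Dict[str, datetime]:
    """Keep only the words that were last seen within `max_age` of `now`."""
    return {word: ts for word, ts in last_seen.items() if now - ts <= max_age}


def bulk_retrain_timestamps(words, submitted_at: datetime) -> Dict[str, datetime]:
    """After a bulk re-submit, every word carries the same 'last seen' time."""
    return {word: submitted_at for word in words}
```

With a bulk re-submit, every word gets the same "last seen" time, so no word ages out earlier than the others. Words that stop appearing in new mail would all expire together once the cutoff has passed since the retrain, instead of gradually in the order they originally stopped arriving.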

I will perform the retrain possibly tomorrow and will let you know how it goes. If it does indeed not work, then as you say, I can always delete the corpus and start from scratch, but I'd rather have a go at this first... just in case it does actually work.
