Just looking to see the balance between the good and bad corpora. Given your corpus sizes, you would think, if anything, that the error would be the other way (i.e., more false positives). My wife's installation of PM at home is 3:1 in favor of junk words (3,000 good words; 9,000 bad words). Her results are even better than mine: 99.82%.

Another thought: since your results are so mediocre anyway, if you have a large body of spam in your junk or trash folder, why not run all of it through the junk training button, even if some of it has already been through? Then reset your stats and see if that makes any difference.

You could also back up your two corpus files first, just in case your results get even worse. Then you could always go back to your 90% corpus files.

Until 10 days ago I was a bit disappointed with the Poco BF: I couldn't get more than 91-92% accuracy. Before that I was using K9 and getting 99%. As I receive nearly 500 spam messages a day, the 8% difference still meant a lot of messages coming through.

So I decided to run some experiments, and now I am at 98.7% (I can live with that), and it keeps improving.

My conclusion is that accuracy goes down when the BF is "overtrained".

This is what I did to get to 98.7% with the Poco BF. It may be a little tedious during the first few days, but it really worked for me.

Good mail bias set to 1
Junk mail score set to 20
Custom sensitivity set to the lowest, so that the filter does not automatically move messages to the junk folder.

I trained the BF with very few messages (spam and not spam) just to get the 1000 words needed for the BF to work.

I went back to my K9-Poco configuration that gave me 99%

I created a folder named Junk K9
I created a folder named Known

I made my download filters work this way:

1) Run the junk mail filters.
2) If the junk score is 20, mark the message with a colour.
3) If the From header is in the address book or in my exceptsenders file, move the message to the Known folder.
4) If the junk score is 20, move the message to the junk folder.
5) If the message is marked as spam by K9, move it to the Junk K9 folder.
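
For anyone who finds it easier to read as code, here is a rough Python sketch of that rule order. It is only an illustration, not how PocoMail or K9 work internally: the `Message` fields, the `route` function and the "In" folder name are hypothetical, and treating "junk score is 20" as "20 or more" is my assumption.

```python
from dataclasses import dataclass

JUNK_SCORE = 20  # the "Junk mail score" setting from the post


@dataclass
class Message:
    sender: str
    junk_score: int         # hypothetical: score given by the Poco BF (rule 1)
    k9_spam: bool           # hypothetical: True if K9 marked it as spam
    coloured: bool = False  # set by rule 2


def route(msg, known_senders):
    """Return the folder a message ends up in, checking rules in order."""
    # Rule 2: colour anything the BF scores as junk (assumed: score >= 20).
    if msg.junk_score >= JUNK_SCORE:
        msg.coloured = True
    # Rule 3: known senders go to Known, even if coloured (false positives).
    if msg.sender.lower() in known_senders:
        return "Known"
    # Rule 4: junk according to the Poco BF.
    if msg.junk_score >= JUNK_SCORE:
        return "Junk"
    # Rule 5: spam that only K9 recognised.
    if msg.k9_spam:
        return "Junk K9"
    # Nothing matched: the message stays in the inbox.
    return "In"
```

The order matters: because rule 3 runs before rule 4, a coloured message from a known sender lands in Known instead of the junk folder, which is what makes the false positives easy to spot.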

So the messages end this way:

In the Known folder, the false positives are coloured and easy to identify.

In the Junk K9 folder is the spam not recognized by the Poco BF.

In the In folder is the spam not recognized by either K9 or Poco.

Now the hard job:

Once a day I go to the Junk K9 folder (I had nearly 300 messages the first time).

1) I classify only the first message as spam by moving it to the Poco junk folder.
2) Then I select all the remaining messages and use the option "Tag junk messages".
3) I then delete all the tagged messages.
4) Back to 1) until there is no spam message left.
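
The loop above can be sketched in Python roughly like this. To keep it runnable I use a toy word-matching filter as a stand-in; `ToyWordFilter` and `train_until_all_tagged` are made-up names, and the toy filter is not Bayesian and is not what Poco does. The point is only the shape of the loop: train on one message, re-tag the rest, drop whatever is now caught, repeat.

```python
class ToyWordFilter:
    """Stand-in filter: a message is junk if it shares a word with trained junk."""

    def __init__(self):
        self.junk_words = set()

    def train_junk(self, text):
        self.junk_words.update(text.lower().split())

    def is_junk(self, text):
        return any(word in self.junk_words for word in text.lower().split())


def train_until_all_tagged(messages, spam_filter):
    """Train on as few messages as possible; return the ones used for training."""
    remaining = list(messages)
    trained = []
    while remaining:
        # Step 1: classify only the first message as spam.
        first = remaining.pop(0)
        spam_filter.train_junk(first)
        trained.append(first)
        # Steps 2-3: tag the rest and delete everything the filter now catches.
        remaining = [m for m in remaining if not spam_filter.is_junk(m)]
        # Step 4: loop until nothing is left.
    return trained


missed = ["cheap pills now", "cheap watches", "pills online",
          "win a prize", "prize inside"]
print(train_until_all_tagged(missed, ToyWordFilter()))
# prints ['cheap pills now', 'win a prize']
```

In this toy run only two of the five messages are ever used for training; the other three are deleted without adding any words to the corpus.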

I do the same with the false positives after that.

Doing this, you train the BF with exactly the number of words it needs and no more.

It requires some time at the beginning, but I have really got great results.

Now I'm back to using the Poco BF only (without K9).

I hope this helps, and sorry for any grammar mistakes, as English is not my native language.

Neo

Last edited by neo on Fri Sep 24, 2004 7:22 am, edited 1 time in total.

I'm beginning to suspect more and more that the problem is that PM needs A LOT of training before it works. Neo receives 500 per day (!) and SFCurley wrote (I think) that he trained the BF with batches of spam messages.

On the other hand, I (and perhaps the average user?) only train the BF one message at a time.

Well guys, I'm still training the BF and I'm sitting at 83.77% with 14,105 junk words and 23,691 good words in the corpora. Three times as many junk messages have been missed as caught!
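
A quick back-of-the-envelope check of what "three missed for every one caught" means, as a small Python sketch. The function names are made up, and I am assuming the percentage PM reports is the share of all mail classified correctly, which nobody in this thread has confirmed. The 20% junk share in the usage note is purely illustrative, not anyone's real figure.

```python
def junk_catch_rate(missed_per_caught):
    """Fraction of junk caught when this many slip through per message caught."""
    return 1.0 / (1.0 + missed_per_caught)


def overall_accuracy(junk_share, catch_rate, false_positive_rate=0.0):
    """Fraction of ALL mail classified correctly (assumed meaning of the stat)."""
    good_share = 1.0 - junk_share
    return junk_share * catch_rate + good_share * (1.0 - false_positive_rate)


rate = junk_catch_rate(3)
print(rate)  # prints 0.25
```

So three missed per one caught is a 25% junk catch rate. Under the assumption above, a filter can still report a figure in the 80s while catching only a quarter of the junk, as long as most of the mail is good: for example, `overall_accuracy(0.2, rate)` comes out at about 0.85.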

And I've been scrupulous with training it properly.

I'm getting very disappointed again. I could almost get these results randomly.

Sounds like everyone invests an awful lot of time trying to train the BFs, and the results are so varied as to be ridiculous.

I really don't think much time or effort is put into the filters by PSI. You all are getting reasonable results (not great, but reasonable), yet most posts I read don't show nearly as high a success rate.

As I pointed out on another thread, we will look into this. Although we think our filters are very effective, the level of activity on the Junk Mail Filtering forum implies the need for additional work on our part. So J-Mac, hold tight!