Abstract
The filter described here blocks spam, also called unsolicited email, using a statistical approach known as Bayesian filtering. The program must first be trained on a set of spam and non-spam mails, whose tokens are stored in a database. Performance improves with the amount of training it receives. When a new mail arrives, it is tokenised and the probability of each word is found by looking it up in the database. The total probability is then computed, and if it is greater than 0.9 the mail is marked as spam. With good training it can block 99% of spam mails with 0 false positives.
Presented By:
Binu Ashiq Y1066
National Institute of Technology, Calicut Department of Computer Engineering
1 Introduction
Spam is a growing problem for email users, and many solutions have been proposed, from a postage fee for email to Turing tests to simply not accepting email from people you don't know. Spam filtering is one way to reduce the impact of the problem on the individual user (though it does nothing to reduce the effect of the network traffic generated by spam). In its simplest form, a spam filter is a mechanism for classifying a message as either spam or not spam.
There are many techniques for classifying a message. It can be examined for "spam-markers" such as common spam subjects, known spammer addresses, known mail forwarding machines, or simply common spam phrases. The header and/or the body can be examined for these markers. Another method is to classify all messages not from known addresses as spam. Another is to compare with messages that others have received, and find common spam messages. And another technique, probably the most popular at the moment, is to apply machine learning techniques in an email classifier.
2 Design
The filter uses a method called Bayesian filtering. It is implemented in C and runs on Linux. An SQLite database is used to store the training data.
2.1 Bayesian Filtering
In a nutshell, the approach is to tokenize a large corpus of spam and a large corpus of non-spam. Certain tokens will be common in spam messages and uncommon in non-spam messages, and certain other tokens will be common in non-spam messages and uncommon in spam messages. When a message is to be classified, we tokenize it and see whether the tokens are more like those of a spam message or those of a non-spam message. How we determine this similarity is what the math is all about. It isn't complicated, but it has a number of variations.
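The tokenizing step can be sketched as follows. This is a minimal illustration rather than the project's actual tokenizer: the function name next_token and the rule for what counts as a token character (letters, digits, apostrophes, dollar signs and hyphens) are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed rule: letters, digits, apostrophes, dollar signs and hyphens
 * belong to a token; everything else separates tokens. */
static int is_token_char(int c)
{
    return isalnum(c) || c == '\'' || c == '$' || c == '-';
}

/*
 * Copies the next token starting at *cursor into out (lower-cased and
 * NUL-terminated) and advances *cursor past it.
 * Returns 1 if a token was found, 0 when the end of the text is reached.
 * Tokens longer than out_size - 1 characters are truncated.
 */
int next_token(const char **cursor, char *out, size_t out_size)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && out_size > 0);

    const unsigned char *p = (const unsigned char *)*cursor;
    size_t len = 0;

    /* Skip any separators in front of the token. */
    while (*p != '\0' && !is_token_char(*p))
        p++;

    if (*p == '\0') {
        out[0] = '\0';
        *cursor = (const char *)p;
        return 0;
    }

    /* Copy token characters, folding case so "FREE" and "free" match. */
    while (*p != '\0' && is_token_char(*p)) {
        if (len + 1 < out_size)
            out[len++] = (char)tolower(*p);
        p++;
    }
    out[len] = '\0';
    *cursor = (const char *)p;
    return 1;
}
```

Calling next_token in a loop until it returns 0 yields every token of a message in order; each token can then be counted during training or looked up during classification.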
2.2 Theory of Operation
Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.
The first assumption is widespread in text classification. Algorithms that use it are called "naive Bayesian."
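Under these two assumptions the calculation reduces to a few lines. The sketch below follows the formulas in Graham's "A Plan for Spam" [1] and is not taken from the project's source: the function names, the doubling of non-spam counts, the clamp to [0.01, 0.99], the default of 0.4 for rarely seen words and the minimum of five occurrences are all assumptions borrowed from that essay. Only the 0.9 threshold comes from this report.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Spam probability of one token, from how often it appeared in training.
 * spam_count / ham_count: occurrences of the token in each corpus.
 * n_spam / n_ham: number of messages in each corpus.
 * Assumed (after Graham): ham counts are doubled to bias against false
 * positives, and the result is clamped to [0.01, 0.99].
 */
double token_spam_probability(long spam_count, long ham_count,
                              long n_spam, long n_ham)
{
    assert(n_spam > 0 && n_ham > 0);

    double g = 2.0 * (double)ham_count;
    double b = (double)spam_count;

    /* Assumed default for words seen too rarely to judge. */
    if (g + b < 5.0)
        return 0.4;

    double good = g / (double)n_ham;
    double bad = b / (double)n_spam;
    if (good > 1.0) good = 1.0;
    if (bad > 1.0) bad = 1.0;

    double p = bad / (good + bad);
    if (p < 0.01) p = 0.01;
    if (p > 0.99) p = 0.99;
    return p;
}

/*
 * Combines per-token probabilities, treating tokens as independent and
 * the prior as unknown: prod(p) / (prod(p) + prod(1 - p)).
 * Intended for a small number of tokens with values strictly between
 * 0 and 1, so that neither product underflows to zero.
 */
double combined_spam_probability(const double *p, size_t n)
{
    double prod = 1.0;
    double inv_prod = 1.0;

    for (size_t i = 0; i < n; i++) {
        prod *= p[i];
        inv_prod *= 1.0 - p[i];
    }
    return prod / (prod + inv_prod);
}

/* The report marks a mail as spam when the total exceeds 0.9. */
int is_spam(double combined)
{
    return combined > 0.9;
}
```

For example, token probabilities of 0.99, 0.9 and 0.2 combine to roughly 0.9955, which is above the threshold, so a single mildly innocent word does not outweigh two strongly incriminating ones.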
If spammers get good enough at obscuring tokens, for example by breaking up words with spaces or punctuation, that the filter no longer recognises them, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.
Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add ("xHot nPorn cSite") and omit ("Prn") letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
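The first half of that response, removing whitespace and punctuation, is simple to sketch. The function name strip_separators is hypothetical, and the example deliberately leaves out the harder part: the dictionary lookup and the handling of added or omitted letters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Copies src into dst, dropping whitespace and punctuation and
 * lower-casing what remains, so that "F.r.e.e" becomes "free".
 * At most dst_size - 1 characters are kept; dst is always NUL-terminated.
 * Returns the number of characters written, not counting the NUL.
 */
size_t strip_separators(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);

    size_t len = 0;
    for (const unsigned char *p = (const unsigned char *)src; *p != '\0'; p++) {
        if (isspace(*p) || ispunct(*p))
            continue;               /* separator inserted to hide the word */
        if (len + 1 < dst_size)
            dst[len++] = (char)tolower(*p);
    }
    dst[len] = '\0';
    return len;
}
```

The output is one unbroken run of characters, which is why a dictionary is then needed to recover the word boundaries.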
3 Implementation
The user first trains the filter. The training data is stored in the database. Initially, the database is empty.
During training, non-spam (good) messages are added to a Ham table in the database with the -g option, and spam (bad) messages are added to a Spam table with the -b option. The result is one corpus of spam and one of non-spam mail.
To train the database, run:
    ./a.out dbase.db -g *good.msg -b *bad.msg
To classify a message, run:
    ./a.out dbase.db message.msg
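What training records can be pictured with a small in-memory table. This is only a stand-in: the real program keeps its data in SQLite tables whose schema is not given in this report, and every name below (struct corpus, train_token, count_message, lookup_count) and both size limits are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 1024     /* assumed table capacity */
#define MAX_TOKEN_LEN 63    /* assumed longest token; longer ones are cut */

struct token_count {
    char token[MAX_TOKEN_LEN + 1];
    long spam;              /* occurrences in spam training mail */
    long ham;               /* occurrences in non-spam training mail */
};

/* Must be zero-initialised (for example with memset) before first use. */
struct corpus {
    struct token_count rows[MAX_TOKENS];
    size_t n_rows;
    long n_spam_msgs;       /* messages trained as spam */
    long n_ham_msgs;        /* messages trained as non-spam */
};

/* Returns the row for token, or NULL if it has not been seen. */
static struct token_count *find_row(struct corpus *c, const char *token)
{
    for (size_t i = 0; i < c->n_rows; i++)
        if (strcmp(c->rows[i].token, token) == 0)
            return &c->rows[i];
    return NULL;
}

/* Records one occurrence of token. Returns 0, or -1 if the table is full. */
int train_token(struct corpus *c, const char *token, int is_spam)
{
    assert(c != NULL && token != NULL);

    struct token_count *row = find_row(c, token);
    if (row == NULL) {
        if (c->n_rows == MAX_TOKENS)
            return -1;
        row = &c->rows[c->n_rows++];
        strncpy(row->token, token, MAX_TOKEN_LEN);
        row->token[MAX_TOKEN_LEN] = '\0';
        row->spam = 0;
        row->ham = 0;
    }
    if (is_spam)
        row->spam++;
    else
        row->ham++;
    return 0;
}

/* Call once per training message, after its tokens have been recorded. */
void count_message(struct corpus *c, int is_spam)
{
    assert(c != NULL);
    if (is_spam)
        c->n_spam_msgs++;
    else
        c->n_ham_msgs++;
}

/* Number of times token was seen in the chosen corpus (0 if never). */
long lookup_count(struct corpus *c, const char *token, int is_spam)
{
    assert(c != NULL && token != NULL);
    struct token_count *row = find_row(c, token);
    if (row == NULL)
        return 0;
    return is_spam ? row->spam : row->ham;
}
```

The per-token counts and the two message totals are the inputs a classifier needs when it later looks up each word of a new message.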
4 Conclusion
Once you have enough spam messages and non-spam messages correctly classified, you can think about using a Bayesian filter. You really want a few hundred of each type, preferably more. You also want to make sure there isn't an unintended identifying feature of the spam messages or non-spam messages. For example, don't use non-spam messages from the past 6 months and only the last month of spam messages; the learning algorithm might decide that messages with old dates are non-spam messages and messages with new dates are spam messages. Don't try to pad the numbers with duplicates; it will overtrain the filter on the features in those messages.
5 References
[1] Paul Graham. "A Plan for Spam." August 2002. paulgrahamspam.html.
[2] Steven Hauser. "Statistical Spam Filter Works for Me." sofbot.com.
[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. "A Bayesian Approach to Filtering Junk E-Mail." Proceedings of AAAI-98 Workshop on Learning for Text Categorization.