This page discusses the application of Bayes' Theorem as a simple classifier for text, and outlines the mathematical basis and the algorithmic approach.

The information in this page is heavily cribbed from the Wikipedia articles on [[wikipedia>Bayesian spam filtering]], [[wikipedia>naive Bayes classifier]] and [[wikipedia>Bayes' Theorem]]. There's also a [[http://cs.wellesley.edu/~anderson/writing/naive-bayes.pdf|useful paper on combining word probabilities]] which is worth a read, especially the final section, which discusses an erroneous assumption that some implementations make.

===== Bayes' Theorem =====

Please forgive the slightly loose use of notation: there are too many dimensions to iterate over for a fully rigorous notation to stay readable.

One slight simplification follows from the fact that $P(C_i)$ is presumably estimated by dividing the number of messages trained in category $C_i$ by the total number of messages trained. Let $N_{C_i}$ be the number of messages trained in category $C_i$, $N$ the number of messages trained overall, and $N_{C_i}(W_a)$ the number of messages containing token $W_a$ that were trained in category $C_i$. Thus the equation above becomes:

Where $x$ is the total number of words. This version may help avoid underflow, but may instead be susceptible to overflow due to the exponentiation involved. As a result, it may be preferable to move the divisions back inside the iterations.
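
As a rough sketch of the count-based form, assume the equation being simplified is the usual naive Bayes posterior (the prior times the product of per-token likelihoods, normalised over all categories). Substituting $P(C_i) = N_{C_i}/N$ and $P(W_a|C_i) = N_{C_i}(W_a)/N_{C_i}$ makes the common factor $1/N$ cancel, which would give something like:

$$P(C_i|W_1,\ldots,W_x) = \frac{\prod_{a=1}^{x} N_{C_i}(W_a) \,/\, N_{C_i}^{x-1}}{\sum_j \left( \prod_{a=1}^{x} N_{C_j}(W_a) \,/\, N_{C_j}^{x-1} \right)}$$

The Python below computes the same quantity but keeps the divisions inside the iteration, as suggested. The function name, the layout of the count tables and the counts themselves are hypothetical, chosen only for illustration.

```python
def posterior_from_counts(tokens, token_counts, category_counts):
    """Return P(category | tokens) for every category from raw training counts.

    token_counts[c][w] -- messages in category c that contained token w
    category_counts[c] -- messages trained in category c (assumed non-zero)
    """
    scores = {}
    for category, n_c in category_counts.items():
        # Start from N_c; the common 1/N factor cancels on normalisation.
        score = float(n_c)
        for token in tokens:
            # Divide inside the loop so the running value is a product of
            # ratios rather than a large product of raw counts.
            score *= token_counts[category].get(token, 0) / n_c
        scores[category] = score
    total = sum(scores.values())
    if total == 0:
        # No category has seen every token; report zero for each category
        # rather than dividing by zero.
        return {category: 0.0 for category in scores}
    return {category: score / total for category, score in scores.items()}


# Hypothetical training counts, for illustration only.
category_counts = {"spam": 40, "ham": 60}
token_counts = {
    "spam": {"drugs": 8, "meeting": 2},
    "ham": {"drugs": 3, "meeting": 30},
}
result = posterior_from_counts(["drugs", "meeting"], token_counts, category_counts)
print(result)
```

Note that a token never seen in a category drives that category's score to zero in this sketch; handling that case is a separate design decision.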

==== Two-category case ====

A common case is that there are only two categories, as in email spam detection. Here it can be tempting to simplify the above equation using the fact that $P(C_1) = 1 - P(C_2)$. This helps less than it seems, because any significant simplification would also require $P(W_a|C_1) = 1 - P(W_a|C_2)$, which is clearly not true in general: just because the word "drugs" occurs in 20% of spam email, for example, it doesn't follow that it occurs in 80% of non-spam.
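
To make that concrete, here is a small sketch with made-up counts (the numbers and variable names are illustrative assumptions, not data from any real corpus). The two category priors sum to one, but the two per-word likelihoods are estimated from different sets of messages and have no reason to.

```python
# Hypothetical training counts, for illustration only.
n_spam, n_ham = 1000, 4000              # messages trained in each category
drugs_in_spam, drugs_in_ham = 200, 40   # messages containing "drugs"

# The priors are complementary: every message is in exactly one category.
p_spam = n_spam / (n_spam + n_ham)
p_ham = n_ham / (n_spam + n_ham)

# Each likelihood is conditioned on a different category, so the pair
# is not complementary.
p_drugs_given_spam = drugs_in_spam / n_spam
p_drugs_given_ham = drugs_in_ham / n_ham

print(p_spam + p_ham)                          # sums to one
print(p_drugs_given_spam + p_drugs_given_ham)  # nowhere near one
```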

==== Precision issues ====

Since many of the probabilities for particular words may be quite low once a large corpus of messages has been analysed, the product of a large number of them can underflow if floating point representations are used. One solution is to limit the analysis to a small number of "most interesting" words, which brings other performance benefits as well.
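
The underflow is easy to reproduce. The sketch below multiplies a few hundred small, made-up probabilities, then shows one way of picking a handful of tokens to analyse instead. The text does not define "most interesting"; the rule used here (the largest gap between the two categories' probabilities for a token) is only one plausible assumption, and all names and numbers are hypothetical.

```python
import math

# Hypothetical per-token probabilities P(W_a | C) for one category.
probabilities = [1e-3] * 200

# 200 factors of 1e-3 is 1e-600, far below the smallest positive double
# (about 5e-324), so the product collapses to exactly zero.
product = math.prod(probabilities)
print(product)  # 0.0


def most_interesting(p_spam, p_ham, limit):
    """Pick the `limit` tokens whose probabilities differ most between categories.

    p_spam and p_ham map token -> P(token | category). The scoring rule
    is an assumption made for this sketch.
    """
    shared = set(p_spam) & set(p_ham)
    ranked = sorted(shared, key=lambda t: abs(p_spam[t] - p_ham[t]), reverse=True)
    return ranked[:limit]


p_spam = {"drugs": 0.20, "the": 0.90, "meeting": 0.02, "free": 0.35}
p_ham = {"drugs": 0.01, "the": 0.91, "meeting": 0.30, "free": 0.05}
print(most_interesting(p_spam, p_ham, 2))  # ['free', 'meeting']
```

Common words such as "the" score almost identically in both categories and drop out, so far fewer factors enter the product.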

Another technique is to work in log space, replacing the multiplications with additions. This uses the identity: