This assumes that $P(A) \neq 0$ and $P(B) \neq 0$. This is a simple statement of Bayes' Theorem. If the event $B$ can be partitioned into a series of mutually exclusive possibilities whose probabilities sum to $P(B)$, then we can generate the extended form:
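
Assuming the events $A_1, \ldots, A_n$ are mutually exclusive and together cover every case, so that $P(B) = \sum_j P(B|A_j)\,P(A_j)$, one standard way to write the extended form is the following (the index $n$ is notation introduced here for illustration):

$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B|A_j)\,P(A_j)}$$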

This may be easier to interpret using the concrete example of a Bayesian classifier, where $P(B)$ represents the probability that a specific word will occur in a message, and $P(B|A_i)$ represents the probability that the word will occur in a message of a specific category $A_i$, on the assumption that the categories are complete (i.e. each message is always of exactly one of the predefined categories). Each product $P(B|A_i)P(A_i)$ is then the probability that a message is of category $A_i$ and also contains the word. Summing these products over all of the categories yields $P(B)$, since the categories are mutually exclusive and cover all the possible ways that $B$ can occur.
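
The classifier example can be checked numerically. The following is a minimal sketch, not a definitive implementation: the category names and every probability in it are hypothetical figures chosen only for illustration, and the helper names `total_probability` and `posterior` are likewise assumptions.

```python
# Hypothetical figures, chosen only for illustration.
# priors[c] plays the role of P(A_i); word_given_category[c] plays the role of P(B|A_i).
priors = {"work": 0.5, "personal": 0.3, "spam": 0.2}
word_given_category = {"work": 0.10, "personal": 0.02, "spam": 0.40}


def total_probability(priors, likelihoods):
    """Return P(B) as the sum of P(B|A_i) * P(A_i) over a complete set of categories."""
    return sum(likelihoods[c] * priors[c] for c in priors)


def posterior(category, priors, likelihoods):
    """Return P(A_i|B) using the extended form of Bayes' Theorem."""
    evidence = total_probability(priors, likelihoods)
    return likelihoods[category] * priors[category] / evidence


p_word = total_probability(priors, word_given_category)
posteriors = {c: posterior(c, priors, word_given_category) for c in priors}
```

Because the categories are assumed to be complete and mutually exclusive, the values in `posteriors` sum to 1.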

Naive Bayesian Classifier

The process of determining the class of a piece of text involves splitting it up into tokens (words) and calculating the probability of each word occurring in each class of message. We assume $n$ classifications of messages, $C_1, C_2, \ldots, C_n$, in the examples below and consider the effect of a word $W$.

The classifier is naive because it assumes the contribution of each token to the classification of the message is independent. Cases where tokens occurring together provide a much stronger indication than either token appearing individually may not be suitable for the naive approach.

Classification based on a word

Using the extended form of Bayes' Theorem, we can specify the probability that a message containing a particular word $W$ will be given a particular classification $C_i$:
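
Taking $W$ as the event that the word occurs in the message, and assuming the classifications $C_1, \ldots, C_n$ are mutually exclusive and complete, the extended form of Bayes' Theorem can be written as:

$$P(C_i|W) = \frac{P(W|C_i)\,P(C_i)}{\sum_{j=1}^{n} P(W|C_j)\,P(C_j)}$$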

This depends partly on the prior probabilities $P(C_i)$, that is, the proportion of messages that fall into each classification. However, some classifiers make the simplifying assumption that all classifications are initially equally likely, which yields:
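
Under that assumption every $P(C_j)$ equals $1/n$, so the prior cancels from the numerator and the denominator, leaving:

$$P(C_i|W) = \frac{P(W|C_i)}{\sum_{j=1}^{n} P(W|C_j)}$$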

This allows the probability that a message containing a given word belongs to a particular classification to be expressed in terms of the relative frequencies of that word in the different categories, which are easily acquired through suitable training.
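
As a rough sketch of how such training might look, the code below estimates $P(W|C_i)$ as the fraction of training messages in each class that contain the word, then applies the equal-priors simplification. The function names, the training data and the handling of unseen words are all assumptions made for illustration, not part of the original method.

```python
def word_likelihoods(messages_by_class, word):
    """Estimate P(W|C_i) as the fraction of training messages in class C_i containing the word."""
    likelihoods = {}
    for label, messages in messages_by_class.items():
        containing = sum(1 for tokens in messages if word in tokens)
        likelihoods[label] = containing / len(messages)
    return likelihoods


def classify_word_equal_priors(likelihoods):
    """Return P(C_i|W) for every class, assuming all classes are equally likely a priori."""
    total = sum(likelihoods.values())
    if total == 0:
        # Assumption: a word never seen in training carries no information,
        # so every class is reported as equally likely.
        return {label: 1 / len(likelihoods) for label in likelihoods}
    return {label: p / total for label, p in likelihoods.items()}


# Hypothetical training set: each message is a set of tokens.
training = {
    "C1": [{"cheap", "pills"}, {"cheap", "offer"}, {"meeting", "today"}, {"cheap", "watch"}],
    "C2": [{"meeting", "agenda"}, {"cheap", "lunch"}, {"project", "update"}, {"hello"}],
}

likelihoods = word_likelihoods(training, "cheap")
result = classify_word_equal_priors(likelihoods)
```

With this toy data the word "cheap" appears in three of the four `C1` messages and one of the four `C2` messages, so `result` favours `C1`.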

Combining words

Of course, messages may be made up of many tokens and each has its own contribution to make to the overall classification. Combining the probabilities from the previous section requires more applications of Bayes' Theorem and some other techniques. For the sake of this illustration we assume there are only two classifications $C_1$ and $C_2$, and only two tokens $W_a$ and $W_b$ that have been found in the message. We shall generalise the approach after this simple example.

So, we are interested in determining the probability that the message belongs to a particular classification; we shall take $C_1$ for this example. We already have $P(C_1|W_a)$ and $P(C_1|W_b)$ as derived in the previous section. As a starting point for the derivation, consider a simple application of Bayes' Theorem to calculate the probability of the message being classified as $C_1$ given that both tokens have been found:
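
Treating "both tokens have been found" as the single event $W_a \cap W_b$, Bayes' Theorem gives:

$$P(C_1|W_a \cap W_b) = \frac{P(W_a \cap W_b|C_1)\,P(C_1)}{P(W_a \cap W_b)}$$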

Clearly this isn't suitable as each token's occurrence in messages is only recorded independently (because this is a naive classifier), so neither $P(W_a \cap W_b)$ nor $P(W_a \cap W_b|C_1)$ is available. We can simplify this by assuming that the tokens are conditionally independent given the classification of the message. This allows us to use the equality:
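
Conditional independence of $W_a$ and $W_b$ given $C_1$ means that the joint conditional probability factorises:

$$P(W_a \cap W_b|C_1) = P(W_a|C_1)\,P(W_b|C_1)$$

Substituting this into Bayes' Theorem for $P(C_1|W_a \cap W_b)$ replaces the numerator with quantities that are recorded per token, while the denominator is unchanged:

$$P(C_1|W_a \cap W_b) = \frac{P(W_a|C_1)\,P(W_b|C_1)\,P(C_1)}{P(W_a \cap W_b)}$$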

This is an improvement, but now we need to remove the denominator term as well. We can do this via normalisation. This uses the fact that the message containing both tokens must be classified as either $C_1$ or $C_2$, that is to say their probabilities must sum to 1. We can then restate these probabilities using Bayes' Theorem and then multiply through by the shared denominator to obtain an equivalent form for it:
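
Assuming conditional independence holds for $C_2$ as well as $C_1$, the normalisation step can be written out as follows. Starting from

$$P(C_1|W_a \cap W_b) + P(C_2|W_a \cap W_b) = 1$$

and restating each term with Bayes' Theorem:

$$\frac{P(W_a|C_1)\,P(W_b|C_1)\,P(C_1)}{P(W_a \cap W_b)} + \frac{P(W_a|C_2)\,P(W_b|C_2)\,P(C_2)}{P(W_a \cap W_b)} = 1$$

Multiplying through by the shared denominator then gives:

$$P(W_a \cap W_b) = P(W_a|C_1)\,P(W_b|C_1)\,P(C_1) + P(W_a|C_2)\,P(W_b|C_2)\,P(C_2)$$

The same step can be sketched in code. The function name `combine_two_words` and all of the probabilities below are hypothetical values used only to illustrate the normalisation; they are not taken from any real training data.

```python
def combine_two_words(prior, p_a, p_b):
    """Combine two tokens under the naive conditional-independence assumption.

    prior maps each class to P(C); p_a and p_b map each class to
    P(W_a|C) and P(W_b|C). Returns the per-class posteriors and the
    normalising term P(W_a and W_b).
    """
    # Numerator of Bayes' Theorem for each class: P(W_a|C) * P(W_b|C) * P(C).
    joint = {c: p_a[c] * p_b[c] * prior[c] for c in prior}
    # Normalisation: the posteriors must sum to 1, so the shared
    # denominator equals the sum of the numerators.
    evidence = sum(joint.values())
    posteriors = {c: joint[c] / evidence for c in joint}
    return posteriors, evidence


# Hypothetical inputs for two classes and two tokens.
prior = {"C1": 0.5, "C2": 0.5}
p_a = {"C1": 0.75, "C2": 0.25}
p_b = {"C1": 0.5, "C2": 0.25}

posteriors, evidence = combine_two_words(prior, p_a, p_b)
```

Nothing in the function depends on there being exactly two classes, since the denominator is simply the sum of the per-class numerators.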