Building a Spam Filter Using Machine Learning

Machine learning is everywhere. From self-driving cars to face recognition on Facebook, machine learning drives it all behind the scenes. If you’ve ever used Gmail or Yahoo Mail, you must have seen a folder named “Spam” where all unwanted mail goes. Have you ever wondered how that works? That’s machine learning at work, too!

In this article, we’re going to develop a simple spam filter in Node.js using a machine learning technique named “Naive Bayes”. The filter will be able to determine whether an email is spam by looking at its content.

The basics of machine learning

The term “machine learning” has a certain aura around it. Journalists and entrepreneurs talk about it as if it were something out of this world. In reality, though, the idea is much simpler.

Machine learning is a field of computer science in which computers learn to do something without being explicitly programmed for the task. First, the algorithm is shown a set of data in order to train it for the task. Then, we give the algorithm data it has never seen before, and it performs the task on this data.

Thus, a machine learning algorithm can be thought of as having two phases: “training” and “prediction”. For each of these phases, we use various mathematical methods.

There is a wide variety of machine learning algorithms. Depending upon how these algorithms “learn”, they can be categorized as supervised or unsupervised:

In supervised learning, the algorithm is provided with data, along with the correct answer for it. So, if we were to develop an algorithm to predict house prices, and we gave it both the size of the land and the price during training, it would fall into this category.

In unsupervised learning, the algorithm is provided with data, but the answers are not provided to it. It is up to the algorithm to find structure in the data and figure things out from there. Such algorithms are commonly used in places such as market segment analysis: we don’t know what kinds of market segments exist for a product, and the algorithm must figure it out.

Again, based on the type of output that a machine learning algorithm produces, we can categorize them into two types:

Classification: These algorithms produce outputs that categorize the data. For example, an algorithm which takes in medical information about a patient and produces a diagnosis that may be only one of “no cancer”, “lung cancer” or “colon cancer” would be of this type.

Regression: In regression, the outputs are continuous values. For example, consider the previous example of predicting house prices. The predicted price would depend on the size of the land. Unlike classification, we don’t have outputs that nicely categorize the data.

As you’ll see later in this article, we’ll train our filter using a collection of spam and non-spam (aka “ham”) emails. So, we’ll provide the right answers to train the filter, and later, in the prediction phase, its output for a given message will be either “spam” or “ham”. This filter is therefore an example of a supervised classification algorithm.

Now that you know the types of machine learning algorithms, let us see what we would need to build the filter.

The training phase

As we’ve already said, our filter would analyze the content of an email to tell if it’s spam. Let us now take a deeper look at the problem.

If you take a look at a typical spam email, you’d find words such as “replica”, “loans” and “singles”. These words are hard to come across in ham, though. Similarly, words such as “presentation” and “manager” are typical in ham, but hard to find in spam.

During the training phase, we’d tell our program to look at a set of spam or ham emails. For every distinct word in the email, it would note down the probability with which the word occurs in either spam or ham. In this article, we’ll refer to these values as “spammicity” and “hammicity”. If we were to define spammicity in a more verbose way, we could say it was the probability of a word occurring, given that the mail is spam. We’ll represent this as P(W|S). Similarly, we could represent hammicity as P(W|H).

Suppose, for example, that you have 10000 emails in your training dataset, out of which 6000 are spam and the rest are ham. The word “replica” occurs in 2000 spam emails and 10 ham emails. Then, P("replica"|S) = \frac{2000}{6000} \approx 0.333 and P("replica"|H) = \frac{10}{4000} = 0.0025.
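The arithmetic above can be checked with a few lines of JavaScript. This is only a sketch: the helper name wordProbability is an illustrative assumption and is not part of the classifier built later in the article.

```javascript
// Hypothetical helper: fraction of the emails in a class that contain a word.
function wordProbability(emailsWithWord, totalEmails) {
  return emailsWithWord / totalEmails;
}

const totalSpam = 6000;
const totalHam = 10000 - totalSpam; // 4000

const spammicity = wordProbability(2000, totalSpam); // P("replica"|S)
const hammicity = wordProbability(10, totalHam);     // P("replica"|H)

console.log(spammicity.toFixed(3), hammicity.toFixed(4)); // prints 0.333 0.0025
```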

You might wonder why we need the hammicity value at all. If both your spam and ham emails contain the word “car” frequently, we’d have high spammicity and hammicity values for it. If we didn’t take the hammicity into consideration, we’d classify every email containing “car” as spam.

Next, we’ll discuss how the prediction phase would work.

The prediction phase

We now know the spammicity and hammicity values for every word. We can also easily find the probability of a message being spam or ham. Considering our previous example, the probability that a message is spam is \frac{6000}{10000} = 0.6 and the probability that a message is ham is \frac{4000}{10000} = 0.4. We’ll refer to these probabilities as P(S) and P(H), respectively.

Now, let’s say we’ve got a new email containing an arbitrary word W. Our job is to find the probability that the message is spam, given that it contains W. This is the reverse of the spammicity, and can be represented as P(S|W).

Here’s where we can use Bayes’ theorem to calculate the probability:

P(S|W) = \frac{P(W|S)P(S)}{P(W|S)P(S) + P(W|H)P(H)}

Continuing with the “replica” example, the probability of the message being spam would be:
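Plugging in the values computed earlier, P(W|S) \approx 0.333, P(S) = 0.6, P(W|H) = 0.0025 and P(H) = 0.4, gives:

P(S|"replica") = \frac{0.333 \times 0.6}{0.333 \times 0.6 + 0.0025 \times 0.4} \approx \frac{0.2}{0.2 + 0.001} \approx 0.995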

So far, we’ve considered a single word. An email consists of multiple words, and we have to use Bayes’ theorem once again to find the overall probability of it being spam. This model assumes that the words in a message occur independently of one another. However, due to the grammar rules of a language, this is never quite true, so the model is less than perfect. Still, it is good enough for our purposes, and many spam filters use this model due to its simplicity.

The filter may encounter a word it has never seen when classifying a new email. We first discard such words, because there’s no way we can make predictions about them. After this, assume there are n distinct words in the email, W_{1}, W_{2}, ..., W_{n}. Then, we find p_{1} = P(S|W_{1}), p_{2} = P(S|W_{2}), ..., p_{n} = P(S|W_{n}). We can then find the overall probability of the email being spam like so:

If p is sufficiently large (say, > 0.5), we’ll assume the email to be spam.
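Assuming the usual naive Bayes combination, in which the words are treated as independent and spam and ham are taken to be equally likely beforehand, the combined probability is:

p = \frac{p_{1} p_{2} \cdots p_{n}}{p_{1} p_{2} \cdots p_{n} + (1 - p_{1})(1 - p_{2}) \cdots (1 - p_{n})}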

A few practical considerations

Before we write the code, there are a few practical considerations to make so that our filter can work better.

In the formula for P(S|W), we’ve used the values of P(S) and P(H). However, in real situations, P(S) can be as high as 0.8, which would lead to ham emails being misclassified as spam. Given a new email, we have no prior reason to suspect it may be spam, so we’ll use P(S) = P(H) = 0.5. In other words, we assume it is equally probable for the message to be either spam or ham. Thus, the formula for P(S|W) reduces to:

P(S|W) = \frac{P(W|S)}{P(W|S) + P(W|H)}

There’s another optimization we can make. In every email, you’d find some common words like “if”, “be”, “then” and so on. There’s no need to consider these words because they aren’t helpful to find out if the message is spam.

Now, take a look at the formula for calculating the overall probability of the email being spam. Probability values always lie between 0 and 1, and the product of many such values can become too small for floating-point arithmetic to represent accurately. As a result, to avoid errors creeping into the value of p, we rewrite the formula like so:
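Assuming the combined probability has the usual product form p = \frac{p_{1} \cdots p_{n}}{p_{1} \cdots p_{n} + (1 - p_{1}) \cdots (1 - p_{n})}, one common way to rewrite it works with logarithms, so that the products become a sum:

\eta = \sum_{i=1}^{n} \left[ \ln(1 - p_{i}) - \ln p_{i} \right]

p = \frac{1}{1 + e^{\eta}}

The two forms are equivalent, because dividing the numerator and denominator of the product form by p_{1} \cdots p_{n} leaves \frac{1}{1 + \prod_{i} (1 - p_{i}) / p_{i}}, and the remaining product equals e^{\eta}.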

Writing the classifier

Our code uses the built-in fs module, which lets us read from files and write to them. The rest of the code discussed in this section goes into the classifier object. This object is responsible for learning from the training examples and for classifying new emails.

Now, we need to define a data structure to hold information about previously seen emails. For this purpose, we define a dataset object like so:
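A minimal sketch of such an object, using the field names mentioned in the text. The exact layout, including nesting dataset inside classifier, is an assumption.

```javascript
// Sketch of the classifier and its dataset; the layout is assumed from the text.
const classifier = {
  dataset: {
    total_spam: 0,  // number of spam emails seen so far
    total_ham: 0,   // number of ham emails seen so far
    spammicity: {}, // word -> P(word | spam)
    hammicity: {}   // word -> P(word | ham)
  }
};
```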

Initially, we have seen exactly zero spam and ham emails, and we don’t know anything about the words in an email. This is why total_spam and total_ham are initialized to zero, and the spammicity and hammicity objects are empty. After training, the spammicity and hammicity objects will contain words and their corresponding values like the example below:

spammicity: {
"replica": 0.8831255,
"watches": 0.910244
}

We also need to load the training data from the disk and save it when we’re done. In order to do this, we define two functions, load() and save(). These functions read and write this data as JSON in training.json.
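One way load() and save() could look is sketched below. The optional file parameter and the existence check are assumptions added for illustration; only the function names and the training.json file name come from the text.

```javascript
const fs = require('fs');

// File name taken from the text; everything else here is a sketch.
const TRAINING_FILE = 'training.json';

const classifier = {
  dataset: { total_spam: 0, total_ham: 0, spammicity: {}, hammicity: {} },

  // Load previously saved training data, if the file exists.
  load(file = TRAINING_FILE) {
    if (fs.existsSync(file)) {
      this.dataset = JSON.parse(fs.readFileSync(file, 'utf8'));
    }
  },

  // Persist the current training data as JSON.
  save(file = TRAINING_FILE) {
    fs.writeFileSync(file, JSON.stringify(this.dataset));
  }
};
```

Calling classifier.load() before training and classifier.save() afterwards would keep what the filter has learned between runs.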

Now, say, we’ve loaded the data of a spam email into a string. We have to ignore the common words in this string, and create a data structure consisting of all the distinct words inside the string. For this purpose, we’ve defined a function, createTable():

This function returns an object, which contains the list of words like so:

{
"hello": true,
"world": true
}

At this point, you might be thinking, why not use an array? Using an array would require checking to ensure there are no duplicate elements in it. Using the key-value association of JavaScript objects is a much simpler and faster way of doing this, because duplicate keys cannot exist in an object.

The regular expression /\b([a-z]{2,}-)*[a-z]{3,}/gi fetches words with more than two letters and hyphenated phrases (like “out-of-the-box”) from the string. Words with fewer than three letters are ignored, because otherwise we would match words like “an”, “by” and “of”, which are very common and aren’t helpful for our purpose.

Despite this initial level of filtering, words such as “the” and “for” would still get through. Thus, we’ve defined another function, isCommonWord(), which filters out common names and some common words.
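The two helpers described above might be sketched as follows. The stop-word list is a short illustrative stand-in, since the article's actual list is not shown, and lowercasing the text first is an assumption of this sketch.

```javascript
// Illustrative stop-word list; the article's actual list is not shown.
const COMMON_WORDS = new Set(['the', 'for', 'and', 'you', 'that', 'with', 'this']);

function isCommonWord(word) {
  return COMMON_WORDS.has(word);
}

// Returns an object whose keys are the distinct, non-common words in text.
function createTable(text) {
  const table = {};
  // match() returns null when nothing matches, hence the fallback to [].
  const words = text.toLowerCase().match(/\b([a-z]{2,}-)*[a-z]{3,}/gi) || [];
  for (const word of words) {
    if (!isCommonWord(word)) {
      table[word] = true; // assigning the same key twice keeps one entry
    }
  }
  return table;
}

console.log(createTable('Hello, hello world! An out-of-the-box idea for the team.'));
// keys: hello, world, out-of-the-box, idea, team
```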

The function that learns from a spam email is a bit complex, so let’s take an example. Suppose our classifier has seen five spam emails. Three of them contain “replica” and two of them contain “loans”. Thus, we would have:

P("replica"|S) = \frac{3}{5} = 0.6

P("loans"|S) = \frac{2}{5} = 0.4

Now, suppose we train the filter on a new spam email which reads: “replica watches”. The value of total_spam and the denominators of all the fractions would be incremented by 1. Now:

The filter has never seen “watches” but it occurs in the new message. So, the numerator will be equal to one, because after the training, the filter would have seen it exactly once.

The filter has seen “loans”, but it does not occur in this message. So, the new numerator won’t increase; it will be equal to the previous numerator. However, since we didn’t store the old numerator in the dataset, we have to calculate it indirectly with total * old_spammicity.

The word “replica” occurs both in the new message and in the previously seen messages. Thus, we must add one to the old numerator.

We’ll get these new values, and you can manually verify that these are indeed correct:
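With total_spam now equal to 6, the three cases described above work out to:

P("watches"|S) = \frac{1}{6} \approx 0.167

P("loans"|S) = \frac{5 \times 0.4}{6} = \frac{2}{6} \approx 0.333

P("replica"|S) = \frac{5 \times 0.6 + 1}{6} = \frac{4}{6} \approx 0.667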

The learnHam() function is similar, except for the fact that it works on hammicity values.
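The update rule for spam emails can be sketched as below. The name learnSpam is assumed by analogy with learnHam(), and the standalone dataset object is a simplification for this example rather than the article's exact code.

```javascript
// Sketch of the spam update rule; learnSpam is an assumed name.
const dataset = { total_spam: 0, total_ham: 0, spammicity: {}, hammicity: {} };

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// words: an object whose keys are the distinct words of one spam email.
function learnSpam(words) {
  const oldTotal = dataset.total_spam;
  const newTotal = oldTotal + 1;

  // Words seen before: recover the old numerator as oldTotal * spammicity,
  // then add one if the word also occurs in this email.
  for (const word of Object.keys(dataset.spammicity)) {
    const oldCount = oldTotal * dataset.spammicity[word];
    const seenNow = has(words, word) ? 1 : 0;
    dataset.spammicity[word] = (oldCount + seenNow) / newTotal;
  }

  // Words never seen before have now occurred in exactly one spam email.
  for (const word of Object.keys(words)) {
    if (!has(dataset.spammicity, word)) {
      dataset.spammicity[word] = 1 / newTotal;
    }
  }

  dataset.total_spam = newTotal;
}

// Reproduce the example: five spam emails seen, then "replica watches".
dataset.total_spam = 5;
dataset.spammicity = { replica: 0.6, loans: 0.4 };
learnSpam({ replica: true, watches: true });
console.log(dataset.spammicity); // replica ~0.667, loans ~0.333, watches ~0.167
```

Because the old numerator is reconstructed from a floating-point probability, small rounding errors can accumulate over many updates. Storing raw counts instead would avoid that, at the cost of departing from the dataset layout described in the article.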

Making predictions

The predict() function creates a list of words with createTable(), and calculates P(S|W) for each word. It then combines them with the alternative formula for p we discussed above. We indicate that the message is spam by returning true when p > 0.5.

When a spammicity or hammicity entry cannot be found for a given word, we assume it to be zero. This might seem strange, but it is the right thing to do. For example, if the word “replica” never occurs in ham messages, there would be no entry for it in the hammicity table. By assuming a hammicity of zero, we get P(S|"replica") = \frac{P("replica"|S)}{P("replica"|S) + 0} = 1, which is the right answer.

If both the spammicity and hammicity values are zero, we’ll ignore the word, as it has never been seen by the filter.
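A sketch of how predict() could combine these rules is shown below. The dataset values are made up for illustration, the function takes an already-built word table rather than raw text, and the clamp on each per-word probability is this sketch's own guard rather than something the article describes. It keeps the logarithms finite when a probability would be exactly 0 or 1.

```javascript
// Sketch of predict(); the dataset values below are made up for illustration.
const dataset = {
  spammicity: { replica: 0.6, watches: 0.3, meeting: 0.01 },
  hammicity: { replica: 0.01, meeting: 0.4, presentation: 0.3 }
};

// Look up a word's value, treating a missing entry as zero.
function lookup(table, word) {
  return Object.prototype.hasOwnProperty.call(table, word) ? table[word] : 0;
}

// words: object whose keys are the distinct words of the email
// (for example, the result of createTable()).
function predict(words, threshold = 0.5) {
  let eta = 0;
  for (const word of Object.keys(words)) {
    const s = lookup(dataset.spammicity, word);
    const h = lookup(dataset.hammicity, word);
    if (s === 0 && h === 0) continue; // never seen: ignore the word

    // P(S|W) with P(S) = P(H) = 0.5, clamped away from exactly 0 and 1
    // (the clamp is an assumption of this sketch).
    const pWord = Math.min(Math.max(s / (s + h), 1e-6), 1 - 1e-6);
    eta += Math.log(1 - pWord) - Math.log(pWord);
  }
  const pSpam = 1 / (1 + Math.exp(eta));
  return pSpam > threshold; // true means spam
}

console.log(predict({ replica: true, watches: true }));      // true
console.log(predict({ meeting: true, presentation: true })); // false
```

With no known words at all, eta stays at zero and the combined probability is exactly 0.5, so the message is treated as ham.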

The driver program

This completes our classifier. Now, we need some code to read files and to call the functions defined in classifier. This code should be defined outside the classifier object and below it. The code is fairly self-explanatory, and we won’t describe it here.

The dataset contains a set of training files and labels (spam/ham) and a set of test files containing emails. These emails are contained in EML files, and we need to extract the bodies of these emails. In order to do this, you should pull in the mailparser module with:

$ npm install mailparser

Next, save the script below as eml2txt.js. This script takes in the name of a directory containing EML files, decodes them and saves them in another directory.

You’ll find two new directories, training and testing. These folders contain the decoded files. Now, we can train our filter by using the labels from the SPAMTrain.label file provided. Files with a label of 0 are spam, while those with a label of 1 are ham.
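Reading the labels could be sketched as follows. The line format assumed here, a label followed by whitespace and a file name, is a guess, since the article does not show the contents of SPAMTrain.label. The function name parseLabels and the file names in the usage line are hypothetical too.

```javascript
// Assumed line format: "<label> <file name>", for example "0 message1.eml".
// The real layout of SPAMTrain.label may differ; adjust the split if so.
function parseLabels(text) {
  const labels = {};
  for (const line of text.split(/\r?\n/)) {
    const parts = line.trim().split(/\s+/);
    if (parts.length < 2) continue; // skip blank or malformed lines
    const [label, file] = parts;
    if (label !== '0' && label !== '1') continue; // unknown label
    labels[file] = label === '0' ? 'spam' : 'ham'; // 0 = spam, 1 = ham
  }
  return labels;
}

console.log(parseLabels('0 message1.eml\n1 message2.eml\n'));
// { 'message1.eml': 'spam', 'message2.eml': 'ham' }
```

The driver could then loop over the returned object and pass each decoded file to the spam or ham learning function according to its label.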

Testing the filter

Feel free to test this on a few messages. You’ll find that the filter gets it right most of the time. If we train it on a more comprehensive dataset, the results would improve. In addition, you can also play around with the threshold value of 0.5 in predict() and see if it improves things.

Improving the design

To keep things easy to understand, we’ve kept our spam filter simple. However, we could add a variety of bells and whistles to make it more accurate. For example, in an actual implementation, we could assume a higher value of P(S) if the email originates from an IP address known for sending out spam prolifically.

Again, our filter is not resilient to letter substitutions that a spammer could make. For example, the spammer could write “replica” as “rεplica”. Humans can still read it, but our implementation would have trouble detecting this as a word. However, even with a filter that can detect it as a word, “replica” might still be considered different than “rεplica”. An actual implementation could detect words with such mixed character sets and automatically assign them high spammicity values, even though the modified word has never been seen before.

There are many more ways a determined spammer could work their way around our filter. For example, they could write their spam messages like so:

The <div> containing the article about horses is hidden from the user, but it can still be seen by the filter. Unfortunately, due to the large amount of legitimate text, the value of p would be low, and the message would be marked as ham. A robust spam filter would probably have its own HTML and CSS parser, remove invisible regions from the text, and find p for the remaining text.

Conclusion

This was a really long article. If you’ve made it this far: congratulations on building your first machine-learning-based spam filter! Let that sink in: you designed an algorithm and showed it examples of spam and ham messages. After a bit of training, the algorithm has learned how to distinguish between them!

It is no wonder, then, that machine learning is making inroads everywhere. Performing tasks without the need to program them explicitly is what makes machine learning so powerful.