The basics of Spam detection

During my numerous years as a software engineer I have spent many an occasion developing solutions to combat Spam. This article introduces the origins of spam and then looks at a number of ways it can be detected.

It’s important to note that nothing I write here is new. A lot of this information can be found if you are prepared to do enough Google searching. The point of this article is that it coalesces these ideas into one article.

It should also be noted that whilst I have made every effort to ensure these details are correct, it is inevitable that the article will contain errors and/or omissions. If you do find something that doesn’t appear to be correct, please let me know and I will update the article accordingly.

So, with that out of the way let us begin our journey into the wonderful world of Spam detection…

Spam, spam spam spam…

What is Spam?

It would seem fitting to start with a basic definition of Spam.

The classical definition of Spam is Unsolicited Bulk email (UBE). This definition, by today’s standards, is no longer adequate and it can be more appropriately defined as “flooding the Internet with many copies of the same message, in an attempt to force the message on people who would not otherwise choose to receive it”. Pretty much anywhere on The Internet that allows for the propagation or creation of user generated content can and will be targeted by spammers.

Spam. It’s a problem. In fact, it’s a huge problem. There is probably not a single person on the planet who has access to The Internet who has not been affected by Spam in some way. Most (over 80%) of today’s Spam either originates from or is at least facilitated by organised Spam gangs and most of these either fund or are funded by organised crime.

Why Spam?

The point of Spam (normally) is to convince a victim to part with their hard earned cash. This can be directly, such as trying to get them to purchase something (normally worthless) or indirectly by tricking them into parting with information or signing up to something that has hidden costs they are not aware of at the time. More often than not spam is the virtual version of a con and spammers are the confidence tricksters.

Often, a spammer will try and convince you to purchase something. The kind of vile things spammers will happily peddle are: pharmaceuticals that are at best placebos and at worst poisonous, pornography (including kiddie porn) and various scams (such as 419 scams or stock scams).

Other times the point of the spam is not to sell but to trick you into giving away personal information (phishing) and/or installing a Trojan program. The malware can be anything from spyware or a keyboard logger to a stealth program that turns the victim’s computer into part of a Spam botnet so as to propagate even more spam.

It’s hard to understand the psyche of a spammer but one thing is very clear: to them Spam equals £££ (or $$$). Given that it costs next to nothing to propagate Spam, and given The Internet is now so pervasive, the ratio of cost to number of messages sent means that even if only 0.1% of a spammer’s victims are taken in it is worth their while. Put simply, Spam is big business and if there is a way a spammer can exploit a system to get their message across they will.

The Spam War

In the “good ol’ days” email was a very insecure protocol. Mail Transfer Agents would happily forward on anything they were sent. The Internet was a young and trusting place. Then the spammers arrived and they soon realised they could abuse this trust by churning out vast amounts of unsolicited email through the Open Relays. So began the downfall of email and the start of the war on spam!

For years the spammers abused the email system, literally to the point where it practically became useless for anything because it was so saturated with Spam. Over time the good guys learned to fight back. They closed the open relays and invented techniques such as Sender Policy Framework (SPF) to make life for the spammers hard.

The spammers fought back and started setting up their own relays that were dedicated to forwarding spam. Once again the good guys retaliated, this time by setting up Real-time Blackhole Lists (RBLs). Of course, the spammers then fought back by spreading malware that turned victims’ computers into botnets that allowed them to propagate their vile wares indirectly. The Anti-Malware community countered by adding detection for these Trojans into their products.

The war on email Spam is being won. The combination of improvements in anti-spam filtering, techniques such as reputation services for blocking spam at the source, and the pervasive introduction of Webmail means we’re in a position where the amount of email Spam is finally declining.

Spammers have tried to get past these filters by using techniques such as image Spam and Tag Soup spam using malformed HTML. Detection techniques, however, have reached a point where spammers can rarely penetrate competent filters.

A New Frontier

And so here we are, 2013. The arms race has been won, right? The end of Spam is nigh, yes? No. Sadly, not even close! The recent boom of social networking services on The Internet has given the spammers a whole new vehicle to peddle their vile wares.

One of the reasons for the decline in email Spam is that email just isn’t as pervasive as it once was. Although most people do have an email account the amount of time they spend interacting with their account is generally far less than the amount of time they will spend on a social network.

For example, on an average day how long do you spend logged into your Facebook account compared to monitoring your (personal) email? Combine the decline in the usage of email with the boom of social networking and it doesn’t take much to realise the spammers now have a new agenda.

Nowadays, spammers are more often than not targeting social networking sites. The reasons should be obvious. They have a captive audience. Most sites allow users to interact with user developed content and apps.

Anyone can get an account. The whole point of a social networking site is to make as many connections as you can, which is something the spammers rely on. There is not a day that goes by where most Facebook users are not blighted with spam and often they don’t even realise it!

For example, consider the various applications that people sign up to on Facebook, often indiscriminately. A lot of these require you to enter personal information or click on links that take you to websites with advertising where you must sign up.

Although a lot of these are reputable, a lot are rogue-ware taking advantage of social engineering techniques. They will fool unsuspecting victims into granting permission to access personal information on their Facebook account so they can then target them and their friends with more direct spam, either by sending private messages or by posting on their or their friends’ walls without consent.

It’s not just social networking sites that are plagued by spam. Spammers have also realised that any site that accepts user generated content is fair game. For example, do you have a blog? Chances are that unless it is protected by an anti-spam service (such as Akismet) the user comments section will be riddled with spam. Often, these are disguised as legitimate feedback, praising

In short, sites that allow user generated content are a veritable breeding ground for Spam. Spammers can easily create an account (depending on where they sign up from it is almost trivial to create a new account) and once in they can post content pretty much anywhere on the site and send private messages to whoever they like.

Fighting Back

For the purposes of simplifying the rest of this document the term spam will be used to generically refer to unwanted user generated content, whether it is posted for commercial gain, is offensive material or is just vandalism. Each of these has one common attribute: we can automatically detect and prevent it to a high degree of statistical accuracy.

Welcome to the era of anti-spam. A dream world where spam is no more. But is this really just a dream? Probably but what about if we could eradicate 98% of spam? Would you at least want to consider the possibility? If you’ve just answered, “yes”, well done; that was the correct answer. Read on!

It’s time we took control. It’s time we fought back against these spammers (and vandals); the minority that ruin The Internet for the majority. It’s time we started implementing techniques to automatically control and manage a site’s user generated content.

The rest of this document will present different spam filtering techniques and discuss how they could be utilised to tackle the ever-growing problem of spam on The Internet. Each technique, individually, could help automatically reduce spam but if used together, in a blended approach, the detection rates should be incredibly accurate.

Statistical Filtering

Statistical filtering is based on the idea that certain words will appear more frequently in spam than in ham (non-spam) and vice versa. A statistical filter will analyse the word content and then using a database of previously generated metrics it will calculate the probability that the email can be classified as either ham or spam.

A Spammer’s Weakness

Question: what is a spammer’s weakness? Answer: his message.

That’s right. A spammer can do many things to evade detection. They can obfuscate their message, they can try by-passing reputation systems with botnets and they can write bot programs that will spam a site, but the one thing they cannot do is remove the content that is ultimately destined to be read by the end-user.

This gives us, the anti-spammers, a significant advantage. We can turn this weakness against them. We can use the very content of the spam itself to our advantage.

Words Are Unique

The written word has a unique fingerprint. No two people write in the same way. Cyber-crime scientists have recently come up with a way of detecting who the author of an email is with an 80% to 90% certainty by processing just 10 examples of all candidates and using statistical analysis to figure out who the original author was.

Of course, to defeat spam we don’t need to go into that level of analysis. All we need is a way to be sure to a reasonable degree that a message is actually spam. The higher the degree of certainty the more emphatic we can be in the action we take; action that would be automatic and require no human intervention!

For example, let’s assume we have a black box application that can parse a message and give it a probability grade from 0 to 10, where 0 is absolutely not spam and 10 is absolutely spam. Anything that scores 5 or less we could assume is not spam. Anything scoring above 5 but below 10 we could assume might be spam and take action (such as slow-tracking, or blocking if the user is sending more than a couple of messages in any period of time). If we get a score of 10 we know it must be spam (or, at least, we can be sure to a high degree of probability) and so we can take decisive action like blocking the message, and if the user tries again we can (temporarily at least) mute or even block their account automatically.
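
To make those thresholds concrete, here is a minimal sketch in Python. The names Action and choose_action are my own inventions for illustration; the cut-off points are simply the ones described above and would need tuning against real data.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"            # treat as ham
    SLOW_TRACK = "slow-track"  # might be spam: delay, rate limit or review
    BLOCK = "block"            # near-certain spam: take decisive action


def choose_action(score: float) -> Action:
    """Map a 0-10 spam grade onto an action (thresholds are assumptions)."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score <= 5:
        return Action.ALLOW
    if score < 10:
        return Action.SLOW_TRACK
    return Action.BLOCK
```

In practice the black box is unlikely to return exactly 10, so a real system would probably block at some slightly lower figure.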

Fine. “Sounds wonderful”, I can hear you scream, “but where do we get such a black box?”. That’s simple. They already exist. They are called Bayesian Spam Filters and they use statistical filtering techniques that are well known in the anti-spam industry.

A Plan For Spam

Bayes

The idea of using statistics for filtering spam was popularised by Paul Graham in his 2002 essay entitled A Plan For Spam. The techniques discussed were later improved upon and published in his 2003 essay called Better Bayesian Filtering. Since the publication of these articles many well-known anti-spam products, such as SpamAssassin and SpamBayes, have implemented variations of the techniques Paul discusses.

Since then many improvements have been made to the techniques but they all follow the basic principle that given a collection of examples of both good and bad email a classifier will be able to analyse a candidate email and estimate the probability that it is either good or bad. Since an email is nothing more than text (normally) and since statistical filtering will work with any text it follows that the same techniques could be used to detect (nearly) any spam.
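
To give a feel for the principle, here is a minimal sketch of a textbook naive Bayes text classifier with add-one smoothing. It is not the exact formula from Paul Graham’s essays, nor what any particular product does; the class name, the tokeniser and the smoothing choices are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenise(text: str) -> list[str]:
    """Very naive tokeniser: lower-cased runs of letters, digits, ' and $."""
    return re.findall(r"[a-z0-9'$]+", text.lower())


class NaiveBayesFilter:
    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Add one example to the 'spam' or 'ham' database."""
        self.counts[label].update(tokenise(text))
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        """Estimate P(spam | text) as a value between 0 and 1."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_messages = sum(self.messages.values())
        tokens = tokenise(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common is this class overall?
            log_p = math.log((self.messages[label] + 1) / (total_messages + 2))
            total_tokens = sum(self.counts[label].values())
            for token in tokens:
                # Add-one smoothing so unseen tokens never give log(0).
                seen = self.counts[label][token]
                log_p += math.log((seen + 1) / (total_tokens + vocab + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

Working in logarithms avoids multiplying lots of tiny probabilities together, which would otherwise underflow on long messages.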

Training

With a well trained database of both good and bad text a statistical filter is capable of incredible accuracy (in the order of 98% or greater). But herein lies the problem. A Bayesian classifier needs to be trained. That’s right, it needs to have a database that is populated with enough representative examples of both good and bad to be able to make a classification.

This has the potential to take a lot of time and effort. On top of that a classifier isn’t a static entity that, once trained, will work forever detecting spam. On the contrary, spam changes over time. Spammers learn new detection evasion techniques so a good classifier will need to be re-trained on a regular or on-going basis.

Self-learning

On-going? That’s right… on-going. A classifier can learn from itself. Providing it has a reasonable database to start with it can then learn from its mistakes and get better. The basic premise is that if a message scores very high (say 90% probability or more) the classifier will automatically add it to its database of bad messages. Likewise, if a message scores very low (say 10% probability or less) the classifier will automatically add it to its database of good messages. Over time the databases get more and more accurate, and if the format of good or bad messages changes the classifier will learn and keep up with these changes.
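
That feedback loop might be sketched as follows. The Trainable interface and the auto_train function are hypothetical names; any classifier exposing a probability and a training method would do, and the 90% and 10% cut-offs are just the example figures from above.

```python
from typing import Optional, Protocol


class Trainable(Protocol):
    """Anything that can score a message and learn from a labelled one."""

    def spam_probability(self, text: str) -> float: ...

    def train(self, text: str, label: str) -> None: ...


def auto_train(classifier: Trainable, text: str,
               high: float = 0.9, low: float = 0.1) -> Optional[str]:
    """Feed confident classifications back into the classifier.

    Returns the label that was learned, or None if the classifier was
    not confident enough for the message to be used as training data.
    """
    probability = classifier.spam_probability(text)
    if probability >= high:
        label = "spam"
    elif probability <= low:
        label = "ham"
    else:
        return None  # uncertain: leave it for a human to review
    classifier.train(text, label)
    return label
```

Messages that fall between the two cut-offs are exactly the ones worth looking at during the regular manual check.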

On top of this a regular check should be made on how the classifier is doing. Any false positives or false negatives should be used to train the classifier so that it learns from its mistakes. Unfortunately, this does require some manual intervention but providing it is something that is done on a regular basis it shouldn’t end up being too big of a chore and should be considered general good house-keeping.

Poisoning

Great. A solution to our training problems! Well, not quite. You see the very fact a classifier can learn can be used against it by an unscrupulous spammer using a technique called Bayesian Poisoning.

Put simply, the spammer will include a load of random words (known as word salad) or paragraphs from a novel (Shakespeare is a favourite) in their email. The hope is that there are enough non-spammy words to trick the classifier into making an incorrect classification.

In very simple terms, if a spammer sends spam that contains lots of (often random) non-spammy words it might fool a statistical filter into classifying it as good. If the filter is confident enough, that email may get added to the good database, thus increasing the likelihood of false negatives (marked as not spam when it is) for future spam containing the same spammy words.

The other (possibly worse) scenario is that the email is classified as bad and gets added to the bad database. The innocent words that were included in the spam would then start contributing to the spam score of genuine emails that contain them, thus creating false positives (marked as spam when it isn’t). In the world of anti-spam this is the worst possible outcome as it means real (and possibly important) messages get blocked.

In reality, this isn’t actually as bad as it sounds and to some extent should be considered a strawman argument against using statistical classification, as it fails to take into consideration the fact that an email that contains poison is, in itself, displaying a spammy trait. Consider: how many real emails contain word salad? Not many!

A good classifier will not only take on board all of the words in an email but it will also consider tokenising phrases (one of the reasons a good tokenising strategy is important) and preserving other traits such as word count, word order and even grammar and semantics. The very action of attempting to poison a classifier can work against the spammer since they have actually created a distinctive statistical fingerprint for their spam that is easier to detect.

Heuristics

Sometimes, a message may score low on a statistical filter but may still be spammy. Whilst statistical filtering can be very accurate, its Achilles’ heel is that it is a token based classifier and if there are not enough tokens in the candidate message the classifier may be unable to make a determination. Since false positives are the worst-case scenario for any spam detector (falsely identifying a message as spam when it is not) the normal thing to do is treat an unknown result as not being spam.

Spammers are not stupid (well, at least technically) and so they have a number of tricks up their sleeves to try and by-pass statistical classifiers. Here are some examples:

Content only contains an embedded image

Content has been purposefully obfuscated or malformed

Content contains only a (normally obfuscated) URL

Rules of engagement

So, what is heuristic filtering? Simply put, it is a way of detecting spam using a set of “rule of thumb” detection techniques. Put another way we are roughly saying, “if a message contains trait X, Y or Z or any combination of them there is a good probability it is spammy”. The more traits that match the higher the probability.

When other filtering/classification techniques fail we have the option to fall back on heuristic filtering. Unlike a statistical filter, a heuristic filter doesn’t rely on any one specific approach to classify content; it will use a collection of “fuzzy rules” to make a distinction.

When content is scanned by a heuristic filter each of the rules is applied and those that “trigger” will contribute towards a final spamminess score. If that score crosses a certain threshold the content can be classified as spam. Clearly, the higher the score is the more confident we can be about this classification.
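
As a rough sketch of that idea, the following applies a handful of rules and adds up the weights of those that trigger. The rules, weights and threshold are hypothetical, chosen only to mirror the evasion tricks listed earlier (a lone URL, a lone embedded image); a real rule set would be far larger and tuned against real data.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    weight: float
    triggers: Callable[[str], bool]


# Hypothetical rules and weights, for illustration only.
RULES = [
    Rule("only_a_url", 3.0,
         lambda t: re.fullmatch(r"\s*https?://\S+\s*", t) is not None),
    Rule("only_an_image", 3.0,
         lambda t: re.fullmatch(r"\s*\[img\][^\[\]]*\[/img\]\s*", t, re.I) is not None),
    Rule("many_exclamation_marks", 1.0, lambda t: t.count("!") >= 3),
]


def heuristic_score(text: str, rules=RULES) -> tuple[float, list[str]]:
    """Return the total weight of the rules that fired, and their names."""
    fired = [rule for rule in rules if rule.triggers(text)]
    return sum((rule.weight for rule in fired), 0.0), [rule.name for rule in fired]


def looks_like_spam(text: str, threshold: float = 3.0) -> bool:
    score, _ = heuristic_score(text)
    return score >= threshold
```

Returning the names of the rules that fired, as well as the score, makes it much easier to work out why a message was flagged when investigating a false positive.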

Of course, statistical classification and heuristic (or any other type of) filtering are not mutually exclusive. We can aggregate the results from different classifiers and use that to draw a conclusion on how likely a message is to be spammy. The more techniques we use the more confidence we can have in the final classification. It’s really just like being a CSI, where we are analysing all the available evidence (damning or not) to try and establish if content is likely to be spam.

Heuristic in action

Let’s have a look at some of the more obvious techniques spammers might use to evade spam filters and how heuristics might be used to combat them.

Mark-up Obfuscation

This technique requires the message format to support some kind of mark-up language. Generally, that will be HTML in the case of email spam but for user generated content (such as on blogs or bulletin boards) it may be BBCode.

This kind of technique relies on a spam filter working on the raw mark-up code. The spammer will include loads of mark-up tags to separate letters that make up words. With HTML this is pretty simple; spammers will use “Faux-HTML”. These are artificial HTML tags that won’t be rendered by the message client but are invisible and break up words to obscure their meaning.

In the case of BBCode it’s not so straightforward for a spammer since unknown tags are generally rendered as part of the parsed output. It is, however, still perfectly viable for the spammer to include real BBCode tags, providing they leave the rendered text human readable once it’s been parsed.

When rendered, the BBCode above says, “this is spam”. Further the text will be rendered as a link to [1]. As you can see, we’ve used BBCode to obfuscate the text that will, ultimately, be presented to the end user. Of course, this is a trivial example and it’s pretty obvious to the human eye what is going on here but it’s not so obvious to a filter.

In reality, this is a pretty simple problem for a non-heuristic filter to get round as long as it knows how to decode the mark-up to get at the human readable version, which can then be processed by your filter of choice. Also, the mark-up itself is a very telling sign that this is probably spam. So much so that if our self-learning statistical filter’s tokeniser included mark-up tags there is a very good chance that eventually those tags would start to score high, indicating probable spam.

From a heuristic point of view, mark-up obfuscation is a relatively simple thing to detect. Of course, the very fact that the message contains such a high ratio of tags to message text is also a very telling sign that this message is probably spam. So is the fact that the number of letters between each tag is so small, suggesting an attempt to obfuscate words.

Our heuristic scanner can simply generate a score based upon the number of tags vs. the number of text characters. The higher the ratio the higher the probability of spam, especially if one or more of those tags is a “url” tag. To allow this filter to “self learn” it could keep track of the average ratio of tags to letters for good vs. bad content and use these as a benchmark when generating a probability score.
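
A minimal sketch of the tag ratio calculation for BBCode might look like this. The regular expression is a deliberately loose assumption about what counts as a tag and the function names are hypothetical; a production filter would use the same parser as the site that renders the content.

```python
import re

# Loose assumption: a tag is [name], [/name] or [name=value].
BBCODE_TAG = re.compile(r"\[/?[a-zA-Z*][^\]\[]*\]")


def strip_tags(markup: str) -> str:
    """Return the human readable text with the BBCode tags removed."""
    return BBCODE_TAG.sub("", markup)


def tag_ratio(markup: str) -> float:
    """Number of tags per visible (non-whitespace) character."""
    tags = BBCODE_TAG.findall(markup)
    visible = sum(1 for ch in strip_tags(markup) if not ch.isspace())
    # max(..., 1) avoids dividing by zero for content that is all mark-up.
    return len(tags) / max(visible, 1)
```

Normal content such as "[b]Hello[/b] world" gives a ratio of 0.2, whereas a word with every letter wrapped in its own pair of tags gives 2.0, which is the sort of gap a threshold can exploit.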

CAPITAL LETTERS

Spammers love to SHOUT about their wares. For this reason you will often find spam has a high ratio of upper case to lower case letters. A statistical filter that considers case will probably end up scoring quite high for upper case letters. Unfortunately, by considering case in a statistical filter we dilute the value of the word semantics.

For example, SpAm and sPaM will not be treated as the same token and so the spammer could use random combinations of upper and lower case letters to by-pass a statistical filter. On the other hand, if they do this enough a self-learning statistical filter will come to realise that these different variations of words are likely to be spammy, and more so than the same words in a single (lower) case. So, it’s swings and roundabouts; in the short term the statistical filter will probably be poor at detecting such spam but after a while its accuracy will improve and it will actually be more accurate than a filter that ignores case.

Heuristically, we can do a similar thing to what we did when detecting Mark-up Obfuscation. If we examine the ratio of upper to lower case text in non-spam it’ll generally be much lower than that of spammy content. For this reason, we can assume that the higher the ratio the greater the likelihood the message is spam. Further, we can track the average ratios of good vs. bad to allow the heuristic scanner to be self learning.
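
Here is one way that might be sketched. For simplicity it measures upper case letters as a fraction of all letters rather than a strict upper-to-lower ratio, and scores a message by where it sits between the running averages for good and bad content. The class and method names are hypothetical.

```python
def upper_ratio(text: str) -> float:
    """Fraction of the letters in text that are upper case (0.0 to 1.0)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


class CaseHeuristic:
    """Tracks the average upper case ratio of known good and bad content."""

    def __init__(self) -> None:
        self._total = {"ham": 0.0, "spam": 0.0}
        self._count = {"ham": 0, "spam": 0}

    def learn(self, text: str, label: str) -> None:
        self._total[label] += upper_ratio(text)
        self._count[label] += 1

    def _mean(self, label: str) -> float:
        count = self._count[label]
        return self._total[label] / count if count else 0.0

    def score(self, text: str) -> float:
        """0.0 at (or below) the ham average, 1.0 at (or above) the spam one."""
        ham, spam = self._mean("ham"), self._mean("spam")
        if spam <= ham:
            return 0.0  # not enough evidence that case separates the two
        position = (upper_ratio(text) - ham) / (spam - ham)
        return min(1.0, max(0.0, position))
```

Calling learn() with each confirmed good or bad message is what gives the heuristic its self-learning behaviour.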

Obfuscation

To try and avoid simple filtering systems, spammers will often transpose letters in words or even miss out the vowels. The human brain is amazing. It can read text even when it’s seriously obfuscated. Let’s look at an example:

Take a look at this paragraph. Can you read what it says? All the letters have been jumbled (mixed). Only the first and last letter of each word is in the right place:

“Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.”

Another technique is one that uses symbols to represent letters. For example, l33t is very popular on The Internet and most if not all people can read and understand it. Of course, it’s just one more way to obfuscate text in an attempt to make it hard to classify the content.

Ok, so no one is claiming spammers will send out text quite that obfuscated but they only have to change a few letters around or miss out a few vowels and a rules based classifier is likely to fail to detect certain key words.

As well as purposeful obfuscation, spam is often riddled with poor grammar and spelling. This is because most spam originates from countries where English may not be the first language. Spammers want to target the widest audience possible and whilst it would be wrong to assume all spam is in English it is probably not wrong to assume a majority of it will be.

How can we detect this then? Simple, we don’t. Eh? That’s right we don’t. At least we don’t do anything special. There is no need. We let the statistical filter learn from the grammar and spelling mistakes made by spammers so that it can use them to help classify the content.

URL redirection

A URL redirect is where a URL doesn’t point to the final content but, instead, directs you to a service (this includes URL shortening services) that will then redirect you to the content (or maybe even another service). Spammers love these because it means they can frequently change their URLs by changing between redirect services. In effect it is the URL equivalent of Money Laundering. The redirection “cleanses” the URL; at least, that’s what the spammers hope!

One way to handle this is to perform an HTTP header lookup and try to resolve the redirect chain. Unfortunately, this is quite an expensive thing to do and for a real-time detection mechanism it’s certainly not feasible (at least, not in real time). Another way to handle this is to maintain a table of known redirect services, flag those that are reputable (for example, TinyURL is very active in blocking spammers) and then immediately block any that are not in that list.

The alternative is to “slow-track” known redirects. In essence, add them to a queue for a separate service to investigate and then make a classification. Meanwhile, if we are seeing a lot of the same redirect URL it will get bumped up the queue. This won’t catch the initial postings but it will eventually contribute towards classification and is likely to stop a spam campaign pretty quickly.
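
A slow-track queue of this kind could be sketched as below. The table of redirect services is a tiny illustrative sample and the class is hypothetical; a real system would persist the queue and hand each URL to a background worker that resolves the redirect chain.

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlsplit

# Illustrative sample only; a real table would be much longer.
KNOWN_REDIRECTORS = {"tinyurl.com", "bit.ly"}


class SlowTrackQueue:
    """Queue of redirect URLs awaiting investigation, most sighted first."""

    def __init__(self) -> None:
        self._sightings: Counter = Counter()

    def report(self, url: str) -> bool:
        """Record a sighting; returns True if the URL was slow-tracked."""
        host = (urlsplit(url).hostname or "").lower()
        if host not in KNOWN_REDIRECTORS:
            return False
        self._sightings[url] += 1  # repeat sightings bump it up the queue
        return True

    def next_to_investigate(self) -> Optional[str]:
        """Pop the most frequently sighted URL, or None if the queue is empty."""
        if not self._sightings:
            return None
        url, _ = self._sightings.most_common(1)[0]
        del self._sightings[url]
        return url
```

Because the queue is ordered by the number of sightings, a URL at the centre of an active spam campaign gets investigated first.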

Meanwhile, the URBL services will also be looking to blacklist any redirect URLs so there is a good chance that redirects will quickly end up on real-time blacklists.

URL Obfuscation

Spammers will often try and obfuscate URLs to prevent rules based detection. For example, they may encode the URL or add unnecessary parameters. None of these are really that hard to deal with. There are simple rules that can be used to get a canonical URL (see http://en.wikipedia.org/wiki/URL_normalization). Once these rules are applied all the various forms of a URL will be normalised to one canonical form.
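
As a sketch, a few of the common normalisation rules can be applied using nothing more than Python’s standard urllib.parse module. The function name and the list of parameters to strip are assumptions, and the rules shown (lower-casing the scheme and host, dropping default ports and fragments, decoding needless percent-encoding, sorting the query) are only a subset of what a thorough normaliser would do.

```python
from urllib.parse import (parse_qsl, quote, unquote, urlencode, urlsplit,
                          urlunsplit)

# Hypothetical examples of parameters that don't change the content served.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url: str) -> str:
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it isn't the default for the scheme.
    if parts.port is None or parts.port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    # Decode then re-encode so needless escapes such as %41 collapse to 'A'.
    # (Simplification: an encoded slash, %2F, is also decoded here.)
    path = quote(unquote(parts.path), safe="/") or "/"
    # Drop ignorable parameters and put the rest in a stable order.
    params = sorted((key, value)
                    for key, value in parse_qsl(parts.query, keep_blank_values=True)
                    if key not in IGNORED_PARAMS)
    # The fragment never reaches the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))
```

Once every URL has been reduced to its canonical form, blacklist lookups and frequency counts can be done against a single representation.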

Challenge-Response

Spammers rely on the fact they can pump out hundreds, if not thousands, of messages as quickly as possible. A spammer’s most precious resource is time. If they can’t bang out messages unhindered they are likely to give up and move on.

A Challenge-Response (C-R) system is designed to inconvenience the spammer whilst minimising the impact on a legitimate user. A common C-R mechanism is to send an email to a user the first time they post, with a link to a page where they have to enter a unique code. This relies on the fact that spammers generally don’t have valid email addresses so will never get the C-R request. Generally, this only needs to be done once; however, if a user is showing an unusual pattern of sending messages it could be repeated (for example, if they send more than 10 messages in 24 hours).
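
The bookkeeping behind such a scheme might be sketched as follows. Sending the email is left out; the class and its methods are hypothetical, and the limit of 10 messages in 24 hours is just the example figure from above.

```python
import hmac
import secrets
from collections import deque


class ChallengeResponse:
    def __init__(self, max_messages: int = 10,
                 window_seconds: float = 24 * 60 * 60) -> None:
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._pending: dict[str, str] = {}     # user -> outstanding code
        self._verified: set[str] = set()
        self._history: dict[str, deque] = {}   # user -> recent post times

    def record_message(self, user: str, now: float) -> None:
        self._history.setdefault(user, deque()).append(now)

    def needs_challenge(self, user: str, now: float) -> bool:
        """True for first-time posters and for unusually busy ones."""
        history = self._history.setdefault(user, deque())
        while history and now - history[0] >= self.window_seconds:
            history.popleft()  # forget posts that fell outside the window
        if user not in self._verified:
            return True
        return len(history) >= self.max_messages

    def issue(self, user: str) -> str:
        """Create the unique code that would be emailed to the user."""
        code = secrets.token_urlsafe(16)
        self._pending[user] = code
        self._verified.discard(user)
        return code

    def answer(self, user: str, code: str) -> bool:
        expected = self._pending.get(user)
        # compare_digest avoids leaking information through timing.
        if expected is None or not hmac.compare_digest(expected.encode(),
                                                       code.encode()):
            return False
        del self._pending[user]
        self._verified.add(user)
        self._history[user] = deque()  # start counting afresh
        return True
```

Timestamps are passed in rather than read from the clock, which keeps the logic easy to test.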

Of course, another C-R that is popular these days is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). These can be very effective but they are also generally disliked by legitimate users as they can be difficult to complete and are not a great user experience for partially sighted users. The CAPTCHA definitely has its place in fighting spam but it is a blunt instrument and should be used only when other, more user friendly options have failed.

Reputation

A reputation service is one that grades how much a certain entity can be trusted. The greater their “reputation” the more we can trust them. Using simple rules it’s possible to award or remove kudos points from an entity and, thus, build a profile of just how trust-worthy that entity is over a period of time.

User

User reputation can be measured using a range of metrics:

How long have they been a member?

When is the last time the account was active?

Does the account’s registered email address map to another account?

If it does is the other account trustworthy?

Is the user’s IP address known to us (for the wrong reasons)?

Has the user been issued with any warnings?

How frequently do they post comments?

What is their aggregate spam score for their previous postings?

Do they post lots of URLs?

Have they previously been banned?

…and so on.

None of these attributes are specifically spammy but over time, and in combination with spam detection, we can balance a user’s reputation against the spam score of a posting to add more weight to the final classification.
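
To illustrate, the sketch below turns a few of the metrics listed above into a reputation between 0 and 1 and then uses it to nudge a posting’s spam score. Every field name, weight and formula here is an assumption made up for the example; sensible values could only come from analysing real user data.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    days_as_member: int
    warnings: int
    previously_banned: bool
    ip_blacklisted: bool
    average_spam_score: float  # 0.0 (clean) to 1.0 (spammy) over past posts


def reputation(user: UserProfile) -> float:
    """Return a reputation from 0.0 (untrusted) to 1.0 (trusted)."""
    points = 0.5                                         # neutral start
    points += min(user.days_as_member, 365) / 365 * 0.3  # reward longevity
    points -= 0.1 * user.warnings
    if user.previously_banned:
        points -= 0.3
    if user.ip_blacklisted:
        points -= 0.3
    points -= 0.3 * user.average_spam_score
    return min(1.0, max(0.0, points))


def adjusted_spam_score(spam_score: float, user: UserProfile,
                        influence: float = 0.3) -> float:
    """Shift a 0.0-1.0 spam score down for trusted users, up for untrusted."""
    shift = (0.5 - reputation(user)) * influence
    return min(1.0, max(0.0, spam_score + shift))
```

Keeping the influence small means reputation can tip a borderline decision one way or the other without overriding strong evidence from the content itself.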

RBL (Real-time Black List)

Using Real-time Black Lists (RBLs) we can see if a message contains any content that has a bad reputation. For example, SURBL provides a URBL (URL Real-time Black List) that can be used to check the reputation of a URL.

Unfortunately, most RBLs are geared towards email content rather than website content (so-called Comment Spam). There are, however, a number of services (some free, some subscription) that specialise in website content.

It’s not clear yet just how useful these will be so some analysis of example data will be necessary to decide if it is worth putting effort into developing an interface for such services.

Rule Based Filtering

There are going to be key words or phrases that are an immediate indicator of unwanted content (which may or may not also be classified as spam).

For example, content that contains (excess?) profanity or racially extremist material. For detecting content we consider to be more or less black and white, a simple rules based pattern detection mechanism (using regular expressions) is a very simple way to filter out what is unwanted.

Of course, some rules will be more emphatic than others, so for that reason each rule should be given a score and only if a score threshold is reached should the message be considered undesirable content. For example, very offensive swear words may have a very high score whereas words like Viagra may have a medium score and words like crap or idiot may have a very low score. The combination of the score of all rules that fire will be the overall score.
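
A minimal sketch of such a scored rule set, using Python’s re module, is shown below. The words come from the example above but the scores and the threshold are arbitrary assumptions, as are the function names.

```python
import re

# (pattern, score) pairs; the scores are arbitrary illustrative values.
RULES = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 5),  # medium
    (re.compile(r"\bcrap\b", re.IGNORECASE), 1),    # very low
    (re.compile(r"\bidiot\b", re.IGNORECASE), 1),   # very low
]


def rule_score(text: str) -> int:
    """Sum the scores of every rule that fires at least once."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def is_undesirable(text: str, threshold: int = 5) -> bool:
    return rule_score(text) >= threshold
```

The \b word boundaries stop a rule from firing on an innocent word that merely contains a flagged one (the so-called Scunthorpe problem).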

Conclusion

Spam is a real problem. Probably the best single way to handle it is using a statistical classifier. Unfortunately, statistical classification is not a panacea. There is a lot of up-front investment in both implementing the system and then training it. There is also some on-going effort required to monitor the classifier’s activities and to aid in the classifier’s self-learning. A better way is to use a blended approach to spam detection; a combination of a number of well known and proven techniques to detect and eradicate spam.

You can also read this article on the Experts Exchange technical blog.

This is an adaptation of an internal article I wrote whilst working for Last FM.

Published by evilrix

An expert in cross-platform ANSI C/C++ development; evilrix specialises in high performance/low latency solutions and complex meta-template programming techniques, using Boost and the C++11 ANSI standard.