Could data science turn the tide in the fight against cybercrime?

Cybercrime is booming, with malicious hacking and online fraud growing at an alarming rate. Reversing this trend appears an almost Sisyphean task, but machine learning and Bayesian statistics are proving invaluable to cyber security organisations.

RSA's Anti-Fraud Command Centre, where thousands of cyber attacks are detected and tackled every day. Photograph: RSA/guardian.co.uk

Hacking, fraud and other clandestine online activities have been making headlines in recent weeks, giving rise to concerns that law enforcement agencies are losing the war against cybercriminals. But just how serious a threat to the public is cybercrime, and could data science hold the key to reversing the trend?

RSA, the cyber security arm of US big data firm EMC, specialises in the use of advanced analytics and machine learning to predict and prevent online fraud. Its Anti-Fraud Command Centre (AFCC) has identified and terminated 500,000 such attacks in its eight-year existence, half of which came in 2012 alone.

This increasing detection rate is due in no small part to its rapid adoption of machine learning techniques. Five years ago RSA's Israeli operation undertook a step change, moving away from inflexible rule-based fraud detection systems, in favour of a self-improving approach using data science underpinned by Bayesian inferencing.

In the UK, Detica - the data intelligence arm of BAE Systems - incorporates similar technologies into its cyber security efforts, in one case using data science-led methods to identify advanced persistent threats (APTs) that had gone unnoticed for 18 months by firms across the world.

Every time a customer of one of RSA's clients makes a transaction using their online banking facility, 20 factors are recorded and fed automatically into the AFCC's database. These details are then combined into 150 fraud-risk features, each one consisting of a different mix of two or more of the 20 factors.

For example, combining an IP and MAC address will give a better indication of whether a transaction is likely to be fraudulent or not than either detail would give on its own. The features are then grouped into either prior indicators or Bayesian predictors, based on the way in which they indicate fraudulent activity.

Prior indicators are information that is indicative of fraud regardless of context, such as an initial failure to authenticate a transaction, while the predictors are data whose impact on fraud risk varies from one case to the next.
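To make the idea of combining factors into features concrete, here is a minimal sketch in Python. The article does not list RSA's 20 factors or say how they are mixed, so the factor names, the `pair_features` helper and the choice to build only two-factor combinations are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical factors recorded for one transaction (not RSA's real list).
transaction = {
    "ip_address": "203.0.113.7",
    "mac_address": "00:1A:2B:3C:4D:5E",
    "auth_failed_first_try": False,
}


def pair_features(factors):
    """Build one composite feature for every pair of recorded factors."""
    features = {}
    for a, b in combinations(sorted(factors), 2):
        # The composite value is simply the two factor values taken together.
        features[f"{a}+{b}"] = (factors[a], factors[b])
    return features


features = pair_features(transaction)
```

Three factors yield three pairwise features here; a real system could also mix three or more factors, which is how 20 factors can comfortably produce 150 features.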

In a given transaction each feature is assigned a risk score out of 100, with higher scores indicating a higher likelihood of fraud. All 150 scores are then combined using a set of algorithms, with the influence of any one score on the final total dependent on a unique weighting.
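RSA does not disclose the algorithms it uses to combine the scores. One simple way to picture the effect of per-feature weightings is a weighted average, sketched below; the `combined_risk` function, the feature names and the numbers are assumptions, not RSA's method.

```python
def combined_risk(scores, weights):
    """Weighted average of per-feature risk scores, each on a 0-100 scale."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0  # no feature carries any weight, so report no risk
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical scores and weightings for three of the features.
scores = {"ip+mac": 80, "auth_failed": 100, "typing_speed": 20}
weights = {"ip+mac": 2.0, "auth_failed": 1.0, "typing_speed": 1.0}

risk = combined_risk(scores, weights)
```

Because the result stays on the same 0-100 scale as the inputs, doubling one feature's weight pulls the total towards that feature's score without changing the range.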

The model's Bayesian nature arises from the fact that a predictor's weighting is calculated based on a constantly updating probability that, for a given customer, it is likely to indicate fraud. In this way, RSA's model becomes more accurate every time it spots a verified fraud event.
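A common textbook way to keep a "constantly updating probability" of this kind is a Beta-Bernoulli update, in which each verified outcome nudges the estimate. The sketch below assumes that approach; the article does not say which Bayesian model RSA uses, and the `PredictorBelief` class and the uniform starting prior are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class PredictorBelief:
    """Beta(alpha, beta) belief that a predictor firing signals fraud."""

    alpha: float = 1.0  # assumed prior pseudo-count of fraudulent cases
    beta: float = 1.0   # assumed prior pseudo-count of legitimate cases

    def update(self, was_fraud: bool) -> None:
        """Fold one verified outcome into the belief."""
        if was_fraud:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def weight(self) -> float:
        """Posterior mean probability of fraud, usable as a weighting."""
        return self.alpha / (self.alpha + self.beta)


belief = PredictorBelief()
for outcome in [True, False, False, True, True]:  # hypothetical verified cases
    belief.update(outcome)
```

Keeping one such belief per customer and per predictor would let the same signal count heavily for one account and barely at all for another, which matches the "for a given customer" wording above.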

It also means that if cybercriminals suddenly change tactic and an entirely new method of online theft emerges, the system will automatically detect the new risk pattern and incorporate it into the risk calculation for any event from that point on.

One area where RSA is looking to improve its model is in the addition of new indicators to build on the 20 existing factors.

"We constantly try to add more in this domain, done by explorative research and talking to customers", said Alon Kaufman, director of security analytics at the AFCC. "We're especially interested in mobile, where there are strong identifiers of device such as SIM card."

Mobile phones are increasingly becoming the target of choice for cybercriminals, with a study by online security specialists Trend Micro identifying 350,000 malware threats aimed at mobiles in 2012 - up from 1,000 in 2011 - and forecasting one million in 2013.

Android handsets are disproportionately likely to be affected, according to RSA's head of knowledge delivery, Daniel Cohen. "In 2012 around 70% of new smartphones were on an Android operating system, but 98% of mobile malware was targeted at Android", he said.

One problem facing RSA, Detica and others is that sooner or later criminals will work out how to fool a system into thinking they are the person they are impersonating, by replicating the indicators their target would present. The challenge is to find data that is harder for a fraudster to copy.

"We're looking increasingly at behavioural metrics - how you interact with your machine", said Kaufman. "Information such as typing speed, mouse movement and the order you access your bank's web pages are all strong indicators that you are who you say you are, but we can only incorporate these details into our model where our client provides them, and where they meet privacy regulations."
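As a rough illustration of how a behavioural metric such as typing speed might be turned into a signal, the sketch below measures how far an observed value sits from a customer's own history, in standard deviations. The `deviation_score` function and the sample figures are assumptions; the article does not describe how RSA scores behavioural data.

```python
from statistics import mean, stdev


def deviation_score(history, observed):
    """Distance of an observation from the customer's norm, in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs two or more values
    if sigma == 0:
        return 0.0  # no variation in the history, so no meaningful scale
    return abs(observed - mu) / sigma


# Hypothetical typing speeds (words per minute) from earlier sessions.
typing_history = [58, 62, 60, 61, 59]

usual = deviation_score(typing_history, 60)
unusual = deviation_score(typing_history, 90)
```

A session close to the customer's usual speed scores near zero, while a sharply different one scores high and could feed into the overall risk calculation as one more feature.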

The use of behavioural measures carries with it inherent risks concerning data protection law and privacy rights, and the issue of consent is increasingly cited. Almost four in five respondents to a 2012 Demos poll said their primary concern around personal data was that companies would use it without their permission - coming in just above worries about companies losing user data.

Bridget Treacy, head of the UK privacy and information management practice at law firm Hunton & Williams, said: "It might be possible to give individuals notice at the point at which their data are collected. In some circumstances notice may not be appropriate, but those circumstances should be limited, and clearly defined, with proper safeguards. Transparency and proportionality are key."

"Inevitably there are tensions between organisations' use of data and individuals' privacy rights, and finding the right balance is not easy. However, the use of data for purposes such as the prevention and detection of fraud is not necessarily a zero sum game. It may be possible for some processing to take place utilising anonymised or pseudonymised data", she said.

Detica operates a similar model to RSA, using intelligence gathering to guide its data-driven operations, and vice versa. "We will measure things such as a computer talking to a website that was registered two days ago, a computer talking to a website that no one else in that environment is talking to, or a laptop sending more data than it's receiving - which is more like server behaviour", said Richard Wilding, Detica's director of cyber security.

"If you look for any of these activities individually, you'll be flooded with false positives, but we use big data methods to move from that initial trait identification to the bigger picture, and then our analysts will use the insights from that data to form their understanding of the attack", said Wilding.
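Wilding's point about false positives can be sketched in a few lines: each trait is weak on its own, so a host is only escalated when several coincide. The field names, the `needs_analyst` function and the threshold of two traits are assumptions for illustration, not a description of Detica's system.

```python
from datetime import date


def host_indicators(host, today):
    """Evaluate the three traits Wilding mentions as simple true/false flags."""
    return {
        # Website registered within the last two days.
        "new_domain": (today - host["domain_registered"]).days <= 2,
        # No other machine in the environment talks to that website.
        "rare_domain": host["other_hosts_contacting_domain"] == 0,
        # Sending more than receiving, which looks like server behaviour.
        "server_like_traffic": host["bytes_sent"] > host["bytes_received"],
    }


def needs_analyst(host, today, min_hits=2):
    """Escalate only when at least min_hits traits occur together."""
    return sum(host_indicators(host, today).values()) >= min_hits


today = date(2013, 1, 10)  # hypothetical observation date
suspicious = {
    "domain_registered": date(2013, 1, 8),
    "other_hosts_contacting_domain": 0,
    "bytes_sent": 5000,
    "bytes_received": 1000,
}
benign = {
    "domain_registered": date(2009, 6, 1),
    "other_hosts_contacting_domain": 40,
    "bytes_sent": 5000,
    "bytes_received": 1000,
}
```

Here the second laptop uploads more than it downloads but shows no other trait, so it is not escalated, whereas the first shows all three and would be passed to an analyst.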

Following a data breach of its own in 2012, RSA now takes the same approach for its internal security, monitoring every time someone logs on to its internal network, their activity once inside and any extraction of data.

"After the breach, I told them just bring me the data, and I'll tell you how we can prevent it next time. The data are different, the personnel are different, but the process is the same in terms of the underlying data science", said Yael Villa, site leader at RSA Israel.

The model created by Villa's team retrospectively flagged up 30 potential sources of the breach, of which six were verified by EMC's global security officer as being events where an impersonator accessed the system in the digital guise of an employee.

Full disclosure: earlier this month I spent a day at RSA's AFCC in Israel at the expense of EMC.