Security organizations know that most of today’s third-party SIEM, forensic, and malware detection applications have limits, mostly because they match byte patterns instead of analyzing network traffic patterns intelligently. This is why companies are creating their own agile security labs based on data science: to outpace the growing size and sophistication of malicious software.

Recently, a six-week Pivotal Data Science Labs project helped one of the world’s largest insurance providers understand where their existing, best-in-class malware detection tool was still falling short. In that time, Pivotal helped them detect malware they couldn’t previously detect, provided a new level of analytical power against malware threats, and advanced our customer’s growing, enterprise-wide data science group.

The Challenge—Data Science Applied to Malware Detection

According to a recent PwC survey of more than 9,700 people, security incidents increased 48% in 2014 to 42.8 million attacks, which is over 100,000 attacks a day. The 2014 DBIR summarizes the problem well: “Attackers are getting better/faster at what they do at a higher rate than defenders are improving their trade. This doesn’t scale well, people.” The report also highlights that more than 75% of compromises happen in days, while less than 25% are discovered in days.

Ultimately, the malware detection challenge is about quickly and comprehensively finding suspicious communication patterns, because even one malware-infected user or server can cause financial loss or damage a brand. Within banks, insurance companies, retailers, healthcare providers, or any company storing personally identifiable information, malware behavior can be captured within proxy log files as the malicious apps try to communicate with and pass along information to their command and control servers outside the firewall.

A large part of the analytical problem has to do with the sheer volume of data and how to do a better job finding the needles in the haystack. In this project, one month of network data amounted to almost 1 terabyte, 1.5 billion rows of connections, close to 75,000 user (employee) accounts, almost 100,000 internal server IP addresses, and 500,000 external web domains. In addition, the target analytical system needed to scale as much as 10-fold, and the team needed to prove the concept within 6 weeks of elapsed time.

Given the enormous data volumes involved with this problem, it also becomes clear why current security solutions fall short of analyzing this traffic intelligently. Until recently, data of this volume simply couldn’t be processed in a reasonable amount of time. As mentioned, this is why today’s security landscape mostly focuses on matching known byte patterns in the network traffic, an operation which is computationally inexpensive. To intelligently analyze this kind of data and correlate its features, you need a sophisticated big data infrastructure like Pivotal Greenplum DB and new methods from data science, which are also mostly unheard of in the security space.

The Approach—Applying Data Science to Malware’s Network Traffic

The data science team turned to Pivotal Greenplum Database (GPDB), whose petabyte-scale, massively parallel, shared-nothing architecture moves the compute power to the data instead of moving the data to the compute. Choosing this type of platform was an important part of the approach and allowed data scientists to iterate through algorithms very quickly during development. The single most complex model ran in under an hour on the whole dataset of 1.5 billion rows, versus a day or more to run and test on Apache Hive. This saved data scientists a lot of time and allowed for a truly agile, rapid development approach.

The project’s security lab infrastructure included an on-premise quarter rack of Pivotal Data Computing Appliances (DCA), and the four main GPDB compute nodes included 64 cores, 256 GB of memory, and 9 terabytes of disk. An overarching premise of data science engagements of this kind is to avoid working from a sampled data set, because what you are looking for is a couple hundred malware connections hidden somewhere inside the company’s network traffic, which can often include more than 1 billion connections per month. In other words, if you want to find the needles in the haystack, you need the whole haystack.
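To make the "whole haystack" argument concrete, here is a small back-of-the-envelope sketch. The numbers are illustrative assumptions, not figures from the project: 200 malicious connections stands in for "a couple hundred", and the 1% uniform random sample and the 15 connections per infected host are made up for the example.

```python
def expected_hits(malicious: int, sample_fraction: float) -> float:
    """Expected number of malicious connections kept by a uniform random sample."""
    return malicious * sample_fraction


def prob_host_missed(connections_from_host: int, sample_fraction: float) -> float:
    """Probability that the sample keeps none of one infected host's malicious connections."""
    return (1.0 - sample_fraction) ** connections_from_host


# Illustrative assumptions only (not project data).
MALICIOUS = 200   # "a couple hundred" malicious connections
FRACTION = 0.01   # a 1% uniform random sample
PER_HOST = 15     # hypothetical malicious connections from one infected host

print(expected_hits(MALICIOUS, FRACTION))              # expected survivors in the sample
print(round(prob_host_missed(PER_HOST, FRACTION), 2))  # chance that host vanishes entirely
```

Under these assumptions, a 1% sample is expected to keep only about two of the malicious connections, and a given infected host would most likely drop out of the sample altogether.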

The data was prepared and models were processed completely within GPDB, and multiple models built on MADlib, R, and Python libraries were used to identify different behavior patterns. In fact, the IT team expected various algorithms to be used and evolved, helping them stay ahead of evolving threats. Architecturally, the data processing framework included four main stages. First, raw proxy logs were landed, and unsuspicious domains and unsuccessful communications were filtered out. Second, domains and distinct users were extracted to whitelist additional domains. Third, external data was brought in to add intelligence, like threat likelihood or popularity, to existing domain information. In the last and most important step, the developed data science models were run to produce the results, which were usually lists of internal clients or external web domains, ranked by infection probability. Throughout these steps, the Pivotal Data Science team collaborated with the customer’s subject matter experts to improve the results.
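The four stages can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's code, which ran inside GPDB: the record layout, the function names, the "HTTP status below 400 means success" rule, the "visited by a large share of users" whitelisting rule, and the threat feed are all hypothetical, and the final ranking is a stand-in for the actual models.

```python
from collections import defaultdict

# Hypothetical record layout: (user, domain, http_status)


def filter_logs(rows, known_good):
    """Stage 1: drop unsuccessful requests and domains already considered benign."""
    return [r for r in rows if r[2] < 400 and r[1] not in known_good]


def popular_domains(rows, min_user_share, total_users):
    """Stage 2: whitelist domains visited by a large share of distinct users (assumed rule)."""
    users = defaultdict(set)
    for user, domain, _ in rows:
        users[domain].add(user)
    return {d for d, u in users.items() if len(u) / total_users >= min_user_share}


def enrich(rows, threat_feed):
    """Stage 3: attach an external threat score to each connection (default 0.0)."""
    return [(user, domain, threat_feed.get(domain, 0.0)) for user, domain, _ in rows]


def rank_domains(enriched):
    """Stage 4: stand-in for the models; rank domains by their threat score."""
    best = {}
    for _, domain, score in enriched:
        best[domain] = max(best.get(domain, 0.0), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


def run_pipeline(rows, known_good, threat_feed, min_user_share, total_users):
    stage1 = filter_logs(rows, known_good)
    whitelist = popular_domains(stage1, min_user_share, total_users)
    stage2 = [r for r in stage1 if r[1] not in whitelist]
    return rank_domains(enrich(stage2, threat_feed))


rows = [
    ("alice", "news.example", 200),
    ("bob", "news.example", 200),
    ("carol", "news.example", 200),
    ("alice", "cdn.example", 200),
    ("bob", "xkqzv.example", 200),
    ("bob", "broken.example", 404),
    ("carol", "rare.example", 200),
]
ranked = run_pipeline(
    rows,
    known_good={"cdn.example"},
    threat_feed={"xkqzv.example": 0.9},  # made-up external intelligence
    min_user_share=0.5,
    total_users=3,
)
print(ranked)
```

Each stage shrinks or enriches the data before the expensive step runs, which is the main design point: the models only see connections that survived filtering and whitelisting.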

The modeling methods included algorithms based on graph theory, natural language processing (NLP), anomaly detection, and clustering. Graph theory allowed the team to see which of the customer’s internal clients made a lot of connections to obscure domains and how they interacted within the customer’s network afterwards. NLP methods were used to analyze suspicious domain names, since malware often tries to hide its traffic to command and control servers behind certain types of domain names. Anomaly detection and clustering helped build a baseline of standard profiles to identify any clients that deviated in a significant way. When it came to visualization, various R and Python libraries were used during development to help ensure algorithms were working properly, and these were also used to help the customer envision future production dashboards. However, the most valuable insight for the customer was a simple list of suspicious nodes ranked by probability, handed over to the customer’s IT security team.
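The article does not say which NLP methods were applied to domain names. As one possible illustration, the sketch below uses a character bigram model with add-one smoothing, a swapped-in technique chosen for brevity: names whose letter sequences look unlike the benign training names get a higher score. The tiny training list and all function names are assumptions for the example.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def train_bigrams(names):
    """Count character bigrams over benign names, with ^ and $ as start/end markers."""
    pair_counts = Counter()
    first_counts = Counter()
    for name in names:
        padded = "^" + name.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def avg_neg_log_likelihood(name, model):
    """Average negative log-probability per bigram; higher means more unusual."""
    pair_counts, first_counts = model
    vocab = len(ALPHABET) + 1  # possible next symbols: the alphabet plus the end marker
    padded = "^" + name.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Add-one smoothing so unseen bigrams get a small, nonzero probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + vocab)
        total -= math.log(p)
    return total / (len(padded) - 1)


# Tiny, made-up training set; a real model would need far more benign names.
benign = ["google", "facebook", "amazon", "wikipedia",
          "youtube", "microsoft", "linkedin", "netflix"]
model = train_bigrams(benign)

for candidate in ["amazon", "xqzvkjwq"]:
    print(candidate, round(avg_neg_log_likelihood(candidate, model), 2))
```

With so little training data the separation is weak, but the random-looking name still scores higher than the familiar one, which is the property a ranking by suspiciousness relies on.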

The Results of Data Science and Malware Detection

Many positive outcomes were achieved in this project. Most importantly, the approach found over a dozen infected nodes, all of which had gone undetected by the current, sophisticated, production security solution. The IT team was able to quarantine these users, confirm infection, and remove the threats.

The team also saw that GPDB was much faster than current approaches with Apache Hive™. For example, GPDB could group 1.5 billion connections by domain name in under 3 minutes and score 500,000 domain names using the previously mentioned NLP methods in under 10 seconds. They saw how this speed was essential for iterating over the malware detection models, allowing data scientists to quickly try variants or tweak parameters. It was also clear how new, innovative approaches with data science algorithms could help the IT team achieve security goals.
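For readers unfamiliar with the operation being timed, grouping connections by domain name amounts to the aggregation sketched below. This is an in-memory Python illustration with made-up sample data and a hypothetical function name; in the project, this kind of aggregation ran inside GPDB across 1.5 billion rows.

```python
from collections import defaultdict


def group_by_domain(connections):
    """Return {domain: (connection_count, distinct_client_count)}."""
    counts = defaultdict(int)
    clients = defaultdict(set)
    for client, domain in connections:
        counts[domain] += 1
        clients[domain].add(client)
    return {d: (counts[d], len(clients[d])) for d in counts}


# Made-up sample: (internal client IP, external domain)
sample = [
    ("10.0.0.1", "news.example"),
    ("10.0.0.2", "news.example"),
    ("10.0.0.1", "news.example"),
    ("10.0.0.3", "xkqzv.example"),
]
summary = group_by_domain(sample)
print(summary)
```

Per-domain counts like these are a natural input for later steps, for example spotting domains that only a single client ever contacts.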

Lastly, Pivotal Data Labs provided the code and training so that the customer’s IT security team and internal data science team could work together to maintain and refine the developed models on their own.