A Data Science Approach to Detecting Insider Security Threats

In my conversations with CISOs, one of their biggest fears is the insider threat attack. Employees must access internal information freely to be productive, yet ill-intentioned information access must be guarded against. According to the Verizon 2014 Data Breach Investigations Report, the percentage of attacks from internal actors doubled in 2013, showing a large increase for the second year running.

Most security tools today focus on identifying malware-initiated attacks. Malware often leaves informational trails that enable many detection techniques to identify a malware signature for blacklisting, such as matching on packet or payload signatures. Insider threat attacks, however, are committed by internal employees with valid data access. Unlike with malware, there is no ready user signature to rely on.

Most current commercial products address insider threat attacks by relying on role-based access control policies to assign the right levels of data access privilege to the right users. Certainly, policies prevent outright disallowed data access, but they are useless against policy abuse, where an ill-intentioned user is allowed to access data but does so in an inappropriate way. A new approach to insider attack detection is needed.

The key is to proactively monitor user activities and flag alerts for anomalous behavior before potentially serious damage occurs. We see many opportunities for Big Data Analytics to address the problem of identifying anomalous user-to-resource access activities. In this post, I’ll share with you one such possibility using a patent-pending approach that we have found successful in client engagements.

The Active Directory log is a data set that records users' authentication activity on various network devices. Enterprises typically retain such data for long periods, giving us a rich repository from which to mine user behaviors. For each and every user, we examine which devices the user attempted to access over a particular historical period, and establish a behavioral profiling model to capture the historical norm. Then, given the user's device access records for the current period, we can measure the deviation against the historical norm of the user and his or her peers. If the deviation is large, we flag the user for further investigation.
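To make the profile-and-deviate idea concrete, here is a minimal sketch of the workflow, not the patent-pending model itself. The function names, the toy events, and the choice of deviation metric (share of current accesses going to historically unseen devices) are illustrative assumptions.

```python
from collections import Counter

def build_profiles(events):
    """Build per-user historical profiles from (user, device) auth events:
    a Counter of how often each device was accessed in the history window."""
    profiles = {}
    for user, device in events:
        profiles.setdefault(user, Counter())[device] += 1
    return profiles

def deviation(profile, current_devices):
    """Illustrative deviation metric: the fraction of current-period
    accesses that go to devices never seen in the historical profile."""
    if not current_devices:
        return 0.0
    unseen = sum(1 for d in current_devices if d not in profile)
    return unseen / len(current_devices)

# Historical period: a user mostly touches the same few devices.
history = [("alice", "db01"), ("alice", "db01"), ("alice", "web02")]
profiles = build_profiles(history)

# Current period: the same user suddenly touches several new devices.
current = ["db01", "fs07", "fs08", "fs09"]
print(deviation(profiles["alice"], current))  # 0.75: 3 of 4 accesses are to unseen devices
```

A user whose score stays near zero matches the historical norm; a score approaching one signals a sharp break worth investigating.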

Here’s a specific example. In this particular large enterprise environment, there are two billion rows of Active Directory log data spanning six months of user-to-resource authentication records, with over 200K users across 300K devices. For every user, we built a behavioral profiling model. A less sophisticated model would simply count the average number of devices accessed in the past and flag an alert if the current number of devices is much larger than that average. However, such a simple behavioral metric is bound to fail because of the difficulty of establishing a threshold and the high number of false positives it produces. To achieve high-precision results with few false positives, the model should consider changes in access frequency for both new devices and previously seen devices, and compare them to the behavior of the user’s peers over their devices in the same period.
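One way to fold in the peer comparison is to standardize each user's metric, such as the count of newly accessed devices, against the same metric computed for peers in the same period. The sketch below assumes a simple z-score for this; the actual model's metrics and normalization are not disclosed here.

```python
import statistics

def new_device_count(profile_devices, current_devices):
    """Number of devices in the current period not seen historically."""
    return len(set(current_devices) - set(profile_devices))

def peer_zscore(user_value, peer_values):
    """Standardize a user's metric against peers in the same period;
    a large positive z-score marks the user as an outlier among peers."""
    mean = statistics.mean(peer_values)
    stdev = statistics.pstdev(peer_values)
    if stdev == 0:
        return 0.0
    return (user_value - mean) / stdev

# Peers in the same period typically touch 0-2 new devices each...
peer_new_counts = [0, 1, 1, 2, 0, 1]
# ...while the user under review touches 9. A fixed global threshold
# would be hard to pick, but relative to peers this stands out clearly.
z = peer_zscore(9, peer_new_counts)
print(z > 3)  # True: many standard deviations above the peer norm
```

Because the score is relative to peers rather than an absolute count, a sysadmin who routinely touches many machines is not flagged for doing so, which is exactly the false-positive problem the simple average-count model runs into.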

Figure 1 shows a visualization of a typical user's behavior pattern in device access over time. Cell color indicates the frequency of access to a device in a given week. This typifies the device access patterns of most enterprise users.

Figure 1: visualization of a normal user’s behavior
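The matrix behind such a heatmap is straightforward to assemble: pivot the authentication events into device-by-week access counts. This is a hypothetical sketch of that pivot, not the production pipeline.

```python
from collections import defaultdict

def access_matrix(events):
    """Pivot (device, week) auth events into a device-by-week frequency
    matrix: the data behind a heatmap where cell color encodes how often
    a device was accessed in a given week."""
    matrix = defaultdict(lambda: defaultdict(int))
    for device, week in events:
        matrix[device][week] += 1
    return matrix

# Toy events: device db01 accessed twice in week 1, once in week 2, etc.
events = [("db01", 1), ("db01", 1), ("db01", 2), ("web02", 2)]
m = access_matrix(events)
print(m["db01"][1])  # 2
```

A stable user produces a matrix whose rows look alike from week to week; an anomalous user, as in Figure 2, shows rows and columns lighting up that were previously empty.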

In contrast to Figure 1, Figure 2 shows a user exhibiting an obviously anomalous device access pattern. A change in behavior that deviates from the norm is what the model would catch and flag accordingly.

Figure 2: visualization of an anomalous user’s behavior

It’s important to note that due to the data volume, such work is not possible without leveraging Pivotal’s Massively Parallel Processing (MPP) technology, using either the Pivotal Greenplum Database or the HAWQ SQL processing engine available through Pivotal HD, Pivotal’s Apache Hadoop® distribution. In such an environment, both the model training and scoring code run within the database, taking only a fraction of a second to execute.

A data science model that uses Active Directory log data to detect insider threat attacks can provide a baseline alerting system. Other information sources can further enrich the context of an alert for forensic investigation. For example, user metadata from Human Resources or project staffing databases can provide additional insight into the flagged user’s activities, and asset information provides additional context about the devices the user accessed. We can further correlate an alert from this model with alerts from other security products, bolstering the signal strength and increasing confidence in the alert.

A security data lake powered by Pivotal HD has the computing power to carry out sophisticated data science work, while offering the ability to freely ingest new data sources for alert correlation. Coupled with modern security tools such as the RSA Security Analytics monitoring platform, we have a powerful set of capabilities to detect a broader set of threats, and to operationalize response and remediation more effectively and efficiently.

Security work remains a wide green field of opportunity for data science applications. In future posts I’ll share more applications and examples of our work.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.