In a previous post, we shared three primary reasons why the traditional, static approach to file security no longer works for today’s enterprises. Working groups form organically and are cross-functional by nature, making a black-and-white approach to file access control outdated: it can’t keep pace with a constantly changing environment and creates security gaps. Files can be lost, stolen, or misused by malicious, careless, or compromised users.

We also introduced a new file security approach—one that leverages machine learning to build dynamic peer groups within an organization based on how users actually access files. By automatically identifying groups based on behavior, file access permissions can be accurately defined for each user and dynamically revoked as each user’s interaction with enterprise files changes over time.

In this post, we’ll review the algorithms used to create dynamic peer groups that identify suspicious file access activity and help solve the traditional access control problem.

Building Dynamic Peer Groups to Detect Suspicious File Access

Several steps are required to dynamically place users in virtual peer groups according to how they access data (see Figure 1).

First, granular file access data is collected and processed. Next, a behavioral baseline is established that accounts for every file and folder accessed by each user. Based on how they access enterprise files, the dynamic peer group algorithm assigns users who may belong to different Active Directory (AD) groups into virtual peer groups. If the algorithm does not have enough information to associate a user with a specific peer group, the user is placed in a new peer group in which they are the sole member. Once virtual peer groups are established, access to resources by unrelated users can be flagged; this enables IT personnel to immediately follow up on such incidents.

Figure 1 – Overview of suspicious file access detection process

Granular data inputs

Algorithm input comes from Imperva SecureSphere audit logs. These contain access activity that provides full visibility regarding which files users access over time. Each event contains the following fields:

NAME – DESCRIPTION
Date and Time – Date and time of the file request
User Name – Username used to identify the requesting user
User Department – Department to which the user belongs (as registered in Active Directory)
User Domain – Domain in which the user is a member
Source IP – IP that initiated the file request
Destination IP – IP to which the file request was sent
File Path – Path of the requested file
File Name – Requested file name
File Extension – Requested file extension
Operation – Requested file operation (e.g., create, delete)
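Represented as a record, such an audit event might look like the following sketch (the class and field names are illustrative, not the actual SecureSphere schema):

```python
from dataclasses import dataclass

@dataclass
class FileAccessEvent:
    # Hypothetical record mirroring the audit-log fields listed above.
    timestamp: str
    user_name: str
    user_department: str
    user_domain: str
    source_ip: str
    destination_ip: str
    file_path: str
    file_name: str
    file_extension: str
    operation: str  # e.g., "create", "delete"

event = FileAccessEvent(
    "2017-03-01T09:14:22", "jdoe", "Finance", "CORP",
    "10.0.0.12", "10.0.1.5", "/finance/q1", "budget.xlsx", "xlsx", "create",
)
print(event.user_name, event.operation)
```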

Architecture

The behavioral models are rebuilt daily over a sliding window of the audit data. This lets the profile dynamically learn new behavioral patterns and discard old, irrelevant ones. Additionally, the audit files are periodically transferred to a behavior analytics engine, which refines the existing behavioral models and reports suspicious incidents.

The behavior analytics engine is divided into two components:

Learning process (profilers) – Profilers are algorithms that model the objects and activity in the file ecosystem (users, peer groups, and folders, as well as the correlations between them) and establish a baseline of normal user behavior. Initially run over a baseline period, profilers are then activated daily, both to enhance the profile as more data becomes available and to keep pace with environmental changes (e.g., when new users are introduced).

Detection (detectors) – Audit data is usually aggregated over a short period (less than one day) before being processed by the detector. Activated when new data is received, detectors pass file access data from the profiler through predefined rules to identify anomalies. They then classify suspicious requests, reporting each as an incident.

Create peer groups using machine learning algorithms

To build peer groups, the data must first be cleansed of irrelevant information: files accessed by automatic processes, files accessed by only a single user, and popular files frequently opened by many users across the organization.

With the data cleansed, Imperva builds a matrix of users (rows) and folders accessed over time (columns). Each entry contains the number of times a user accessed a given folder within the input time frame.
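The matrix construction can be sketched in a few lines (user names and folder paths are fabricated for illustration):

```python
import numpy as np

# Cleansed (user, folder) access events; per-pair counts form the matrix.
accesses = [
    ("alice", "/finance"), ("alice", "/finance"), ("alice", "/hr"),
    ("bob", "/finance"), ("carol", "/eng"),
]

users = sorted({u for u, _ in accesses})
folders = sorted({f for _, f in accesses})

# Rows = users, columns = folders, entries = access counts.
matrix = np.zeros((len(users), len(folders)), dtype=int)
for user, folder in accesses:
    matrix[users.index(user), folders.index(folder)] += 1

print(matrix)
```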

The matrix is very sparse because the majority of users do not access most folders; therefore, dimensionality reduction is performed on that matrix to reduce both the sparsity and the noise in the data. What remains are meaningful data access patterns, which become the input to the clustering algorithm.
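One common technique for this step is truncated SVD; the sketch below assumes that method and an illustrative k, since the post does not name the specific reduction used:

```python
import numpy as np

def reduce_dims(matrix, k):
    """Project the sparse user x folder count matrix onto its top-k
    singular vectors (truncated SVD), suppressing noise while keeping
    the dominant access patterns. k is an illustrative choice."""
    u, s, _ = np.linalg.svd(matrix.astype(float), full_matrices=False)
    return u[:, :k] * s[:k]

# Two users with near-identical access patterns, one with a different one.
counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 6, 5],
])
embedding = reduce_dims(counts, k=2)
print(embedding.shape)  # one dense 2-D point per user
```

In the reduced space, the first two users land close together while the third lands far away, which is exactly the structure the clustering step exploits.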

A density-based clustering algorithm is then used to divide the organization’s users into homogeneous groups called clusters, which form the peer groups. Members of a given cluster have all accessed similar folders, with a typical cluster containing about four to nine users. The process also ensures that clusters are disjoint: no user belongs to more than one cluster.
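DBSCAN is a well-known density-based algorithm; the sketch below assumes it for illustration (the post does not name the specific algorithm), using toy 2-D user embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D embeddings: two tight groups of users plus one isolated user.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # group B
    [20.0, 20.0],                          # lone user
])
# eps and min_samples are illustrative parameters.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # first three share a label; -1 marks the lone user
```

DBSCAN marks users it cannot associate with any group as noise (label -1); such users correspond to the single-member peer groups described earlier, where the algorithm lacks enough information to place a user with peers.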

Define virtual permissions to enterprise files

The notion of “close” and “far” clusters is used to define the virtual permissions model of each user. For every cluster, the algorithm determines which peer groups are close and which are far based on the similarity between it and the other clusters. Distances are partitioned into two groups using a k-means algorithm; smaller distances designate closer clusters.

Each user is permitted access to folders accessed by others within their own cluster, or by users belonging to close clusters.
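The close/far split can be sketched as k-means (k=2) over one cluster’s distances to all the others; the distance values below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def close_clusters(distances):
    """Partition one cluster's distances to all other clusters into
    'close' and 'far' with k-means (k=2); the group whose center is
    smaller is the close group. Returns indices of close clusters."""
    d = np.asarray(distances, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(d)
    close_label = int(np.argmin(km.cluster_centers_.ravel()))
    return [i for i, lab in enumerate(km.labels_) if lab == close_label]

# Distances from cluster 0 to clusters 1..4 (fabricated numbers).
others = [0.3, 0.4, 5.0, 6.2]
print(close_clusters(others))  # indices of the close clusters
```

A user would then be granted virtual access to any folder touched by their own cluster or by one of these close clusters.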

Detect suspicious file access

The detector aspect of the algorithm identifies suspicious folder access. For example, user John’s access to a given folder is considered suspicious if, during the profiling period, the folder was accessed only by users belonging to clusters far from his.

Imperva CounterBreach automatically determines the “true” peer groups in the organization and then detects unauthorized access from unauthorized users.

Incident severity (e.g., high, medium or low) is a function of the number of users and the number of clusters having accessed the folder during the learning period. Severity is driven by the ratio between these two quantities: higher values indicate higher severity (many users concentrated in a small number of clusters), while values close to 1 indicate reduced confidence, since the number of users equals or approaches the number of clusters. Personal folders and files are given careful consideration when ranking severity.
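A minimal sketch of such a severity ranking, with illustrative thresholds (the actual CounterBreach scoring is not published):

```python
def severity(num_users, num_clusters):
    """Rank an incident by how concentrated the folder's audience is:
    many users spread over few clusters means the normal access pattern
    is well defined, so a deviation from it is more significant.
    The threshold values are illustrative assumptions."""
    ratio = num_users / num_clusters
    if ratio >= 4:
        return "high"
    if ratio >= 2:
        return "medium"
    return "low"

print(severity(12, 2))  # 12 users in 2 clusters: strong signal
print(severity(5, 4))   # ratio near 1: low confidence
```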

Adding context to accessed files with dynamic labels

With the goal of providing sufficient context to security teams so they can understand and validate each incident, Imperva presents typical behavior of the user who performed the suspicious file access activity. In addition, a label is applied to each folder accessed during the incident; this helps SOC teams evaluate the content or relevance of the files in question.

In assigning a label to a folder, the algorithm assesses the users who accessed it during the profiling period, as well as those from their peer groups. It then looks for the group (or groups) in Active Directory (AD) that best fits this set of users. Fit is measured in two ways: precision, the fraction of users in the set who also belong to the AD group, and recall, the fraction of the AD group’s members contained in the user set. The best-fitting AD group (or groups) becomes the folder label, for example Finance-Users, EnterpriseManagementTeam, or G&A-Administration. The label provides security teams with more context about the nature of the files pertaining to an incident.
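The precision/recall fit can be sketched as follows; the group names, the F1 combination, and the helper functions are illustrative assumptions, not the product’s actual scoring:

```python
def label_score(accessors, ad_group):
    """Score an AD group against the set of users who accessed a folder.
    precision: fraction of accessors who are inside the AD group;
    recall: fraction of the AD group that accessed the folder."""
    accessors, ad_group = set(accessors), set(ad_group)
    overlap = len(accessors & ad_group)
    return overlap / len(accessors), overlap / len(ad_group)

def best_label(accessors, ad_groups):
    """Pick the AD group with the best combined fit; F1 is one
    illustrative way to combine precision and recall."""
    def f1(name):
        p, r = label_score(accessors, ad_groups[name])
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(ad_groups, key=f1)

groups = {
    "Finance-Users": {"alice", "bob", "carol"},
    "EnterpriseManagementTeam": {"dave", "erin"},
}
print(best_label({"alice", "bob"}, groups))
```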

Up Next: Examples from Customer Data

To validate the algorithms explained above, several Imperva customers allowed us to leverage production data from their SecureSphere audit logs. Containing highly granular data access activity, the log data provided full visibility into which files users accessed over a given duration—we saw the algorithms identify some very interesting real-life file access examples.

In our next post in this series we’ll review those examples and demonstrate the effectiveness of this automated approach to file access security.