Where machine learning meets security

The last few decades have seen tremendous progress in machine learning (ML) algorithms and techniques. This progress, combined with various open-source efforts to curate implementations of a large number of ML algorithms, has led to the true democratization of ML. Practitioners with or without a background in statistical inference or optimization (the theoretical underpinnings of ML) can now apply ML to problems in their domain.

One would be hard-pressed not to come across any mention of ML in today’s tech industry headlines, or on the product webpages of cybersecurity companies. Almost every new security solution on the market claims to be ML-driven in some aspect. However, there is a great deal of variety in the way that ML is actually applied.

In this blog post, we discuss the general role of ML in cybersecurity and make the case that traditional methods, based on static indicators of compromise, are strengthened by ML, and vice versa, in a symbiotic relationship.

First, let’s start with a clear definition of what ML actually is: a collection of algorithms and techniques used to learn from the past in order to make predictions about the future.

What does machine learning bring to security?

ML algorithms build models of past observations that become a reference against which to compare future observations and on which to make predictions. The richer the corpus of observations in representing past behaviors, the more likely the models are to predict future behaviors.

For example, given a collection of plain text files, shell scripts, JPEGs, and executable binaries, can we determine if a file that was written to disk is a shell script? Or, given network flows at the application layer, can we determine if the flow we are observing is similar to what we have seen before or different? How about the sequence of API calls made by one service to another? Is that benign or anomalous behavior? Questions like these lead broadly to two classes of ML models, namely generative models and discriminative models (we’ll save that topic for a more in-depth discussion in the next blog post).
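To make the API-call question concrete, here is a minimal sketch of a model built from past observations and used to score new ones. It assumes call sequences are available as lists of call names and uses an add-one smoothed bigram model as a stand-in; the class name, the call names, and the sample history are illustrative assumptions, not a description of any particular product.

```python
import math
from collections import Counter, defaultdict


class BigramModel:
    """Add-one smoothed bigram model over sequences of API call names (illustrative)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # previous call -> counts of the call that followed
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            self.vocab.update(seq)
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1
        return self

    def surprisal(self, seq):
        """Average negative log-probability per transition; higher means less familiar."""
        if len(seq) < 2:
            return 0.0
        slots = len(self.vocab) + 1  # one extra slot for calls never seen in training
        total = 0.0
        for prev, nxt in zip(seq, seq[1:]):
            row = self.counts.get(prev, Counter())
            prob = (row[nxt] + 1) / (sum(row.values()) + slots)
            total += -math.log(prob)
        return total / (len(seq) - 1)


# Hypothetical past observations of benign behavior.
history = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = BigramModel().fit(history)

familiar = model.surprisal(["open", "read", "close"])
unusual = model.surprisal(["open", "exec", "connect"])
print(f"familiar={familiar:.2f} unusual={unusual:.2f}")
```

A sequence that resembles the history gets a lower score than one containing transitions never seen before. Where to draw the line between "similar" and "different" is a separate decision that depends on the data.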

ML delivers a wealth of tools and techniques to security:

Malware similarity analysis

Deep packet inspection

Application behavior profiling for anomaly detection

Insider threat assessment via individual behavior profiling

Learning which events matter to customers (vs. which do not)

A/B testing of usability flows

Determining audit-worthiness for compliance

This is only a partial list of the areas in security that are driven by ML. But to understand what ML brings above and beyond the traditional approaches to security, we need to first understand how ML can be symbiotic with many of these approaches.

Closing the detection efficacy gap created by signature-based methods

Security has traditionally relied on static indicators of compromise (also referred to as signatures) to detect malicious behaviors. These indicators are encoded in rules that codify signatures known to be malicious. For instance: Was a file written to /bin? Did an application contact a known malicious external IP? Is the external IP a Tor exit node? Did a process overwrite an existing system binary? Does the file hash of a binary match a known malware hash?
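A few of the rules above can be sketched as simple lookups. This is a minimal illustration, assuming events arrive as dictionaries; the field names (`type`, `file_path`, `dest_ip`, `sha256`), the rule names, and the indicator sets are all hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
import hashlib

# Hypothetical indicator sets; a real deployment would load these from threat-intelligence feeds.
KNOWN_MALWARE_SHA256 = {hashlib.sha256(b"stand-in for a malware sample").hexdigest()}
KNOWN_BAD_IPS = {"203.0.113.7"}
PROTECTED_DIRS = ("/bin/", "/sbin/", "/usr/bin/")


def match_signatures(event):
    """Return the names of the static rules that the event triggers."""
    hits = []
    path = event.get("file_path", "")
    if event.get("type") == "file_write" and path.startswith(PROTECTED_DIRS):
        hits.append("write_to_system_dir")
    if event.get("type") == "network_connect" and event.get("dest_ip") in KNOWN_BAD_IPS:
        hits.append("known_malicious_ip")
    if event.get("sha256") in KNOWN_MALWARE_SHA256:
        hits.append("known_malware_hash")
    return hits


print(match_signatures({"type": "file_write", "file_path": "/bin/ls"}))
print(match_signatures({"type": "network_connect", "dest_ip": "198.51.100.20"}))
```

The second event matches nothing because its destination is not on the list yet, which is exactly the situation the rest of this post is about.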

While there is tremendous value in using signatures for security, we make the case that ML can enhance these approaches by maintaining continuously high detection efficacy while simultaneously enabling advanced threat hunting and forensics. Threats identified by ML and confirmed by security investigators can, in many cases, be used to generate new signatures that catch these threats both inside an organization and outside it via crowdsourcing.

In rules-based approaches, threat detection is predicated on continuous and frequent updates to signatures, because malicious actors keep developing new tactics that in turn change the indicators of compromise. These changes in tactics are motivated by the need to evade detection and to exploit vulnerabilities before they are patched. The result is a detection efficacy gap, as shown in Figure 1. We posit that a combination of strategies that use both rules and ML makes for a sophisticated detection platform that narrows the efficacy gap by surfacing anomalies and malicious behaviors. These findings can be used not only for downstream services such as enforcement and mitigation, but also to trace the various stages of an attack through the network.

Figure 1. Detection efficacy as a function of time
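The interplay between rules and ML described in this section can be sketched as a small loop: signatures catch what is already known, an anomaly score surfaces what is not, and a confirmed anomaly becomes a new signature. This is only an illustration under stated assumptions; the class, the threshold of 0.8, the score range of 0 to 1, and the verdict strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HybridDetector:
    """Toy detector combining a static blacklist with an ML anomaly score (illustrative)."""

    bad_ips: set = field(default_factory=set)
    anomaly_threshold: float = 0.8  # assumed cut-off on a score in [0, 1]

    def evaluate(self, dest_ip, anomaly_score):
        if dest_ip in self.bad_ips:
            return "alert:signature"   # known indicator of compromise
        if anomaly_score >= self.anomaly_threshold:
            return "review:anomaly"    # unknown but unusual; send to an investigator
        return "allow"

    def confirm_malicious(self, dest_ip):
        """An investigator confirmed the anomaly, so promote it to a signature."""
        self.bad_ips.add(dest_ip)


detector = HybridDetector(bad_ips={"203.0.113.7"})

first = detector.evaluate("198.51.100.20", anomaly_score=0.93)
detector.confirm_malicious("198.51.100.20")
second = detector.evaluate("198.51.100.20", anomaly_score=0.10)
print(first, second)
```

After confirmation, the same address is caught by the signature path even when its behavior no longer looks unusual.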

ML at StackRox

At StackRox, we use a combination of generative and discriminative models to detect and surface anomalous and malicious behaviors (to learn more about StackRox features that leverage ML, visit our product page). The generative models are Bayesian Hierarchical Models that summarize both benign application behaviors (how an application behaves when it is not under attack) and malicious application behaviors (how it behaves when it is under attack). These behaviors include activity generated within applications, communication with other applications, and interactions with the underlying host. The discriminative models learn to distinguish between benign and known malicious behaviors. These models are used together with a rules-based model to surface alerts. A feedback mechanism captures user feedback on alerts, and an adaptive learning algorithm uses that feedback to selectively update the models.

A case for ML in security

Consider the case of an application communicating with an external IP. A traditional rule-based approach would consult known blacklists to categorize the IP as benign or malicious. As early as 2012, Google search found 30 trillion unique URLs on the Web, with a proportional, albeit smaller, number of IP addresses. Given this scale, and given that new URLs are generated every day and IP addresses are re-assigned periodically, it is virtually impossible to keep blacklists of URLs and IP addresses up to date. By the same token, no ML model can summarize every benign and malicious URL or IP address that is out there.

Having said that, ML can be used to find what is common to known benign IPs and what is common to known malicious IPs, and to build models that distinguish one from the other. Attackers continually adapt their tactics by using various obfuscation techniques, including obfuscating hosts with IPs, obfuscating hosts with different domain names, URL shortening, algorithmic generation of URLs, and fast-fluxing, where a single domain name is backed by an ever-changing set of proxies. In response, ML has been used with increasingly sophisticated sets of features to detect malicious URLs and malicious IPs.

Predicated on empirical evidence that URLs and DNS records change far more rapidly than IP addresses, ML has been used effectively with features based on the spatial structure of IP addresses to differentiate benign from malicious IPs. Once an IP that is surfaced as anomalous is confirmed to be malicious, ML provides a path towards expanding existing blacklists both inside an organization and outside via sharing of threat intelligence.
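As a rough sketch of what a feature based on the spatial structure of IP addresses might look like, the example below measures how many previously labeled neighbors in the same /16 and /24 prefix were malicious. The choice of prefix lengths, the add-one smoothing, the class name, and the sample addresses (all from documentation ranges) are assumptions made for illustration; published approaches use richer features, and the output here would feed a classifier rather than serve as a verdict.

```python
import ipaddress
from collections import Counter


def prefix(ip, length):
    """Network that contains `ip` at the given prefix length (IPv4 assumed)."""
    return ipaddress.ip_network(f"{ip}/{length}", strict=False)


class PrefixDensityModel:
    """Per-prefix fraction of labeled addresses that were malicious (illustrative)."""

    def __init__(self, lengths=(16, 24)):
        self.lengths = lengths
        self.bad = Counter()
        self.total = Counter()

    def fit(self, labeled_ips):
        for ip, is_malicious in labeled_ips:
            for length in self.lengths:
                key = prefix(ip, length)
                self.total[key] += 1
                self.bad[key] += int(is_malicious)
        return self

    def features(self, ip):
        """Smoothed malicious fraction among labeled neighbors, one value per prefix length."""
        feats = []
        for length in self.lengths:
            key = prefix(ip, length)
            feats.append((self.bad[key] + 1) / (self.total[key] + 2))
        return feats


# Hypothetical labeled history: (address, was it malicious?)
labeled = [
    ("203.0.113.5", True),
    ("203.0.113.9", True),
    ("203.0.113.77", True),
    ("198.51.100.10", False),
    ("198.51.100.11", False),
]
model = PrefixDensityModel().fit(labeled)

suspicious = model.features("203.0.113.200")  # never seen, but in a bad neighborhood
clean = model.features("198.51.100.99")       # never seen, in a benign neighborhood
unseen = model.features("192.0.2.1")          # no labeled neighbors at all
print(suspicious, clean, unseen)
```

None of the three query addresses appears in the labeled history, yet the first two still receive informative features from their neighbors. That is the property that lets this kind of model generalize where a blacklist cannot.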

While URLs and IP addresses are part of one area to which ML is applied, behavior profiling of applications is another. In many approaches, anomalous or malicious activity is detected against a full context of an application’s normal behavior, which is typically built using ML in a controlled environment.

For example, using generative models and a rich feature set derived from application protocol level observations along with others down the network stack, ML has been used to build protocol anomaly detectors. These detectors then serve as a valuable additional source in detecting exfiltration and any lateral movement that may precede exfiltration, as part of an arsenal of detection strategies against increasingly sophisticated threat actors.
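The simplest possible version of this idea is sketched below: fit an independent Gaussian to each numeric flow feature observed during normal operation, then score new flows by how far they fall from that baseline. This is a deliberately small stand-in for the richer generative models mentioned above; the feature names (`req_bytes`, `resp_bytes`), the sample values, and the use of a maximum z-score are assumptions for illustration.

```python
import statistics


class GaussianBaseline:
    """Independent per-feature Gaussian fitted to benign flows (illustrative)."""

    def fit(self, flows):
        keys = list(flows[0].keys())
        self.mean = {k: statistics.fmean([f[k] for f in flows]) for k in keys}
        # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
        self.std = {k: statistics.pstdev([f[k] for f in flows]) or 1.0 for k in keys}
        return self

    def score(self, flow):
        """Largest absolute z-score across features; higher means more unusual."""
        return max(abs(flow[k] - self.mean[k]) / self.std[k] for k in self.mean)


# Hypothetical flows observed while the application was behaving normally.
normal = [
    {"req_bytes": 500, "resp_bytes": 2000},
    {"req_bytes": 520, "resp_bytes": 2100},
    {"req_bytes": 480, "resp_bytes": 1900},
    {"req_bytes": 510, "resp_bytes": 2050},
]
baseline = GaussianBaseline().fit(normal)

typical = baseline.score({"req_bytes": 505, "resp_bytes": 2000})
exfil_like = baseline.score({"req_bytes": 50000, "resp_bytes": 2000})
print(f"typical={typical:.2f} exfil_like={exfil_like:.2f}")
```

A flow that sends far more data than the baseline has ever seen scores orders of magnitude higher than a typical one, which is the kind of signal that can point to exfiltration.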

As mentioned above, ML is an integral part of StackRox’s ability to deliver precision threat detection. The models that are built also dynamically adapt to application and environment changes. This enables a high-fidelity understanding of application behaviors.

Conclusion

ML meets security in a good place. ML algorithms are being used with increasingly sophisticated features to tackle various problems in security. A machine learning-based approach is symbiotic with static and traditional approaches to security. When threat intelligence tells us that something is malicious, ML helps us trace the breadcrumbs left by the attacker leading up to the malicious event, enabling the identification of signatures that may act as early-warning indicators for similar attacks in the future. When threat intelligence cannot tell us that something is malicious, ML can help surface events that are sufficiently anomalous to warrant further analysis. StackRox leverages ML to do just that, and also to enable threat hunting and investigative efforts. As attackers get more sophisticated, it behooves us to step up to the challenge and use everything at our disposal, inventing techniques that complement each other to thwart even the most advanced attacks.

This has been the first of many dives into the deep and fascinating topic of machine learning and security, an area that our team here at StackRox is incredibly passionate about! Stay tuned for more.