Machine Learning in Infosecurity – Current Challenges and Future Applications

Daniel Faggella is the founder and CEO at Emerj. Called upon by the United Nations, World Bank, INTERPOL, and many global enterprises, Daniel is a sought-after expert on the competitive strategy implications of AI for business and government leaders.

Episode Summary: Uday Veeramachaneni is taking a new approach to machine learning in information security, also known as infosec. Traditionally, infosec has approached predicting attacks in two ways: (1) through a system of hand-designed rules, and (2) through anomaly detection, a technique that detects statistical outliers in the data. The problem with these approaches, Veeramachaneni says, is that the signal-to-noise ratio is too low. In this episode, Veeramachaneni discusses how his company, PatternEx, is using machine learning to provide more accurate attack prediction. He also discusses the cooperative role of man and machine in building robust automated cyberdefense systems and walks us through a common security attack scenario.

Brief Recognition: Prior to co-founding PatternEx in 2013, Uday Veeramachaneni was head of product management at Riverbed Technology, a Principal Product Manager in the Cloud Networking Group at Citrix, a staff software engineer at Juniper Networks, and a Senior Engineer at Motorola. Veeramachaneni holds an MS in Computer Science from the University of Texas at Arlington and an MS in Economics from the Birla Institute of Technology and Science.

Current Affiliations: Co-founder and CEO of PatternEx

Interview Highlights:

The following is a condensed version of the full audio interview, which is available in the above links on Emerj’s SoundCloud and iTunes stations.

(2:06) Give us a brief rundown on where you see gaps in anomaly detection when it’s used alone

Uday Veeramachaneni: Generally, the way infosec has worked is through rules, and these rules generate alerts. Human analysts then chase those alerts down and figure out if it’s an attack or a false positive. The industry noticed that those alerts produce a lot of noise, so they looked at anomaly detection as a way of improving the ratio of alerts generated for real attacks versus those generated for false positives. Anomaly detection catches statistical outliers.

The problem is that while attacks could be statistical outliers, not all statistical outliers are attacks. That’s the challenge with anomaly detection: it’s flagging outliers which may end up being false positives. The next evolution in fighting attacks is using a team of human analysts, who identify which events are actual attacks. What a machine should do is go back through the data, and for each human-identified attack it should look for patterns to see how it can identify that attack if it happens again. Once the machine has figured out the pattern it can use that knowledge to predict what a human would identify as an attack. That’s what needs to be done in infosec to address false positive issues.
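The distinction above can be sketched in a few lines. The simple z-score detector below is a hypothetical stand-in for a production anomaly detector, with invented data: it happily flags any statistical outlier, with no way of knowing whether the outlier is an attack or something benign.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` population standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hourly outbound connection counts for one host. The final spike is a
# statistical outlier, but nothing here says whether it is an attack or,
# say, a legitimate backup job: that judgment still needs an analyst.
counts = [102, 98, 110, 95, 105, 99, 101, 400]
flagged = zscore_outliers(counts)  # flags 400
```

This is exactly the gap Veeramachaneni describes: the math finds unusual events, and only human-supplied labels can turn "unusual" into "attack."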

(4:21) There’s this whole issue of context that the machine may not pick up on. For example, there may be a particular type of attack that happens primarily during the holidays, or in recent months has happened only in a particular industry. A human would immediately pick up on that, while a machine may have trouble with that broader conceptual understanding.

UV: For a machine to work, it needs some examples. Once a human gives it examples, it can go back and identify the patterns to predict that attack.

(5:13) You mentioned that anomaly detection is an unsupervised learning task, meaning the machine looks for patterns in data that does not have labels, as is common in infosec. When a human does label the data, indicating an attack, the problem becomes a supervised learning task. You had used the term “active learning”; how does that differ conceptually from reinforcement learning?

UV: The challenge in infosec is that there aren’t many examples to train the machine. That’s why the industry has used anomaly detection: you don’t need examples for that, you just need to detect statistical outliers. So we’ve started using the term “active learning”, where the machine asks a human analyst what he thinks of a certain event, and as the human gives feedback, the machine goes back and figures out how to construct a predictive model based on what he’s saying. You’re constructing a model on the fly from the feedback the analyst is giving.
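The active-learning loop can be sketched as below. The event features, the anomaly score, and the simulated analyst verdicts are all invented for illustration; a real system like the one described would use far richer models.

```python
# Toy active-learning loop: score events, ask the analyst about the most
# anomalous one, and refit a simple predictive model after each answer.

def anomaly_score(event):
    # Unsupervised stand-in score: distance from "typical" outbound traffic.
    return abs(event["bytes_out"] - 1000)

def fit_threshold(labeled):
    # Supervised step: learn the smallest score among confirmed attacks.
    attack_scores = [anomaly_score(e) for e, is_attack in labeled if is_attack]
    return min(attack_scores) if attack_scores else float("inf")

events = [{"bytes_out": b} for b in (990, 1010, 5000, 980, 7000, 1005)]
labeled = []
for _ in range(2):  # two rounds of analyst feedback
    unlabeled = [e for e in events if not any(e is le for le, _ in labeled)]
    query = max(unlabeled, key=anomaly_score)   # ask about the most anomalous event
    verdict = query["bytes_out"] > 4000         # the analyst's judgment (simulated)
    labeled.append((query, verdict))
    threshold = fit_threshold(labeled)          # rebuild the model on the fly

predicted_attacks = [e for e in events if anomaly_score(e) >= threshold]
```

The key property, as Veeramachaneni describes it, is that the model is rebuilt continuously from analyst feedback rather than trained once on a fixed labeled set.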

(7:03) What is the role of man and what is the role of machine in this case? What is the initial human job to “plug in” this kind of technology? What human effort is needed to get this system up and running?

UV: Any company could have 100+ sources of data. We need to adapt to those sources of data and ensure we consume that data in a real-time streaming mode. The real AI piece starts after that. Day 1, there are no training examples. We start with the output from an anomaly detection system, or if you have a rules-based system then perhaps 50 alerts from that rules-based system. The human reviews them, and says, “These 48 are normal, and these two are attacks.”

The machine crunches the data for patterns, and the next day presents events similar to the human-identified attacks from the day before. The human gives feedback on whether the machine is correct, the analyst thus reinforces the machine, and the machine learns. This happens continuously, because human attackers will evolve, so the machine needs to evolve as well.
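A minimal sketch of that day-two step: from yesterday's reviewed alerts (48 normal, 2 attacks), compute an "attack centroid" and surface today's most similar events for analyst review. The features and values here are invented, not PatternEx's actual representation.

```python
import math

# Yesterday's reviewed alerts: (connections_per_hour, bytes_out_kb) with a label.
yesterday = [((5, 12), "normal")] * 48 + [((60, 900), "attack"), ((55, 850), "attack")]

attacks = [feat for feat, label in yesterday if label == "attack"]
centroid = tuple(sum(f[i] for f in attacks) / len(attacks) for i in range(2))

today = [(4, 10), (58, 880), (6, 15), (52, 820)]
# Present today's events in order of similarity to yesterday's confirmed attacks,
# so the analyst's next round of feedback goes where it matters most.
ranked = sorted(today, key=lambda f: math.dist(f, centroid))
```

In practice the pattern search runs over hundreds or thousands of features rather than two, but the loop shape is the same: label, learn, present, repeat.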

(10:12) What are some examples of signals in a cyberattack? What are the signals that would identify a run-of-the-mill attack?

UV: In infosec, human analysts have been trying to create these rules to figure out what an attack looks like. We’re flipping this around: humans tell the machine what an attack is, and the machine tells the humans what these attacks look like.

UV: As an example, a very standard sort of attack is called command and control (C2) communication. At a very simple level, it’s when your computer is infected with a virus and controlled by the hacker’s remote server. The communication is very systematic; you may see the machine sending information every two hours, or every 30 minutes. You could create a rule saying, “If a computer communicates every 10 minutes, flag that as C2.” But you’ll get clobbered by false positives.

But if you flag many C2s and feed that to the machine, it could look for other parameters. It could be the standard deviations of the duration of the connections; it could be the number of bytes, or the number of packets. It’s going to identify those patterns for hundreds or thousands of parameters.
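The parameters he names can be made concrete. The sketch below (field names and sample values are invented, not PatternEx's actual features) derives beacon-style statistics from one host's connection records, of the kind a learner could weigh after analysts confirm some C2 examples:

```python
import statistics

def c2_features(connections):
    """connections: list of dicts with 'start', 'duration', 'bytes', 'packets'."""
    starts = sorted(c["start"] for c in connections)
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return {
        "mean_gap": statistics.mean(gaps),
        "gap_stdev": statistics.pstdev(gaps),  # near zero => beacon-like regularity
        "duration_stdev": statistics.pstdev(c["duration"] for c in connections),
        "mean_bytes": statistics.mean(c["bytes"] for c in connections),
        "mean_packets": statistics.mean(c["packets"] for c in connections),
    }

# A beacon every ~600 seconds with near-identical payloads looks highly regular.
beacon = [{"start": 600 * i, "duration": 2.0, "bytes": 512, "packets": 6}
          for i in range(10)]
feats = c2_features(beacon)
```

A rule hand-writes a threshold on one such parameter; the machine instead searches across hundreds or thousands of them for combinations that match the analyst-confirmed attacks.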

(13:27) What kind of information is sent back over a C2 attack? What’s often being taken?

UV: C2 is a stage in the attack cycle. Your machine has been infected with malware, and C2 is when the machine is communicating with the hacker. The hacker tells the machine to look for other things, which could be intellectual property, customer data, credit card information, and so on. The hacker then exports this information to their own server.

(15:07) What are some other signals a machine might use to distinguish an attack?

UV: Humans generally look at sample statistics for bytes in, bytes out, packets in, packets out. Machines can look at more complex factors. For instance, this machine initiated 40 connections. What was the duration of each of those connections? What was the average of that duration? What was the standard deviation of that duration? And how does that data from previous attacks compare to the current data? Machines can look through factors that humans can’t find.
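As a hedged illustration of that comparison (all numbers are invented), a machine can summarize the durations of a host's 40 connections and compare the profile against one recorded from a previous confirmed attack:

```python
import statistics

def duration_profile(durations):
    """Summary statistics over a host's connection durations (seconds)."""
    return {"count": len(durations),
            "mean": statistics.mean(durations),
            "stdev": statistics.pstdev(durations)}

previous_attack = duration_profile([1.9, 2.1, 2.0, 2.2, 1.8] * 8)  # 40 connections
current = duration_profile([2.0, 2.1, 1.9, 2.0] * 10)              # 40 connections

# A machine can run this comparison across thousands of such features;
# the 0.5 tolerance here is an arbitrary illustrative cutoff.
similar = (abs(current["mean"] - previous_attack["mean"]) < 0.5
           and abs(current["stdev"] - previous_attack["stdev"]) < 0.5)
```

A human analyst might eyeball one or two of these statistics; the machine's advantage is doing it exhaustively, for every candidate feature, on every host.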

(17:08) So there are myriad mathematical permutations of factors that a human wouldn’t be able to find, but the machines can find some of those underlying patterns that a human may not even be able to think of.

UV: That’s exactly what it is. Once the human says it’s an attack, the machine goes through millions of combinations to figure out the correct combination to predict that attack.

(18:19) You talked about the “holy grail” of machine learning and AI in info security, where the “good guys” could share and conglomerate knowledge around what attacks look like and be able to protect themselves from attackers. What is that holy grail as you see it?

UV: The holy grail of AI in infosec is having machines that can find complex patterns that are good predictors. To do that, humans have to train the machine, and they’re doing that at every company. Can we share a complex pattern detected at one company with another company? That’s the holy grail. If you’re able to share a complex pattern, it’s very difficult for an attacker to adapt to it. They’d have to change the underlying tools they use, whereas exchanging an IP address or email address blacklist would be much easier for an attacker to adapt to.

(20:19) There’d have to be some sort of underlying body that would provide a way to source those common attack patterns across companies.

UV: There’s a lot of automation there; it’s not a manual thing. As AI matures, it should be able to automatically take data in from the outside and learn from it. There’s no exchange of data, per se. It’s more that if someone participates in an AI network, he should be able to train his AI much faster and across a much broader set of attack vectors than if he were having his analysts train it by themselves.

(21:15) Hopefully, there would be some facilitation of that process, where we would see more and more businesses being able to block and detect an attack as it occurs. Although with malicious actors on the other end, they’ll be working hard too.

UV: The key is the machine-analyst combination needs to evolve faster than the attackers are evolving. That’s the crux of solving this problem.

Big Ideas:

1 – In a domain of largely unlabeled data, infosec analysts can empower machine learning systems through active learning, where machines are continually given human feedback on unlabeled data to improve pattern recognition.

2 – The “holy grail” of infosec is an AI network that automatically shares attack patterns with all other companies in the network, so that an attack on one can quickly be defended against by all. This is just one of many possible AI applications in data security.

