Machine Learning and Data Mining methods used in IDS

Approaches to security take one of two forms: proactive or reactive. Conventionally, proactive security solutions are designed to maintain the overall security of a system even when individual components are compromised by an attack. Intrusion Prevention Systems (IPS) fall into the proactive class. Reactive security solutions are used to recover from losses, contain infections and so on. Intrusion Detection Systems (IDS) fall into the reactive class, and they are what we’ll look at today. An IDS is a software application or device that monitors network or system activity for malicious behavior or policy violations. A cyber intrusion is any unlawful attempt to access, manipulate, modify or destroy information, or to use a system to perform further actions without the consent of the people who have authority over it.

An IDS monitors network and system activity on the assumption that every intrusion leaves a trace, and that this trace differs from the one left behind by routine processes. The activities an IDS uses to define normal or deviant behavior can come from a variety of sources. Depending on how this behavior is defined and detected, we have the following classifications:

Signature Detection

Anomaly Detection

Hybrid Detection

Scan Detection

Profiling Modules

I. Signature Detection

This method raises alarms when the type of intrusion is already known (i.e. has been encountered before). It measures the resemblance between currently occurring events and the traces of known cyber misuse, and flags any behavior that is similar to those traces. It therefore works well against attacks that are already known, with the added advantage of a low false positive rate. The obvious disadvantage is that signature based detection is ineffective against new attacks.
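
To make the idea concrete, here is a minimal sketch of signature matching. It simplifies "resemblance" down to regular expression matching on a log line, and the signature names, patterns and sample events are all made up for illustration; a real IDS would have far richer rules.

```python
import re
from typing import List

# Hypothetical signature database: signature name -> pattern describing a known attack trace.
SIGNATURES = {
    "sql_injection": re.compile(r"union\s+select|or\s+1\s*=\s*1", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./\.\./"),
}

def match_signatures(event: str) -> List[str]:
    """Return the names of all known signatures that the event matches."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(event)]

# Made-up events: one routine request and two that resemble known attacks.
events = [
    "GET /index.html",
    "GET /item?id=1 UNION SELECT password FROM users",
    "GET /../../etc/passwd",
]
for event in events:
    hits = match_signatures(event)
    print(event, "->", hits if hits else "no alarm")
```

Note that an attack with no matching entry in the signature table produces no alarm at all, which is exactly the weakness described above.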

II. Anomaly Detection

Anomaly detection takes a variation of the white-listing approach to behavior tracing. It triggers alarms when the object under surveillance behaves notably differently from predefined normal patterns. Anomaly detection consists of two steps: training and detection. In the training step, machine learning techniques are applied to generate a profile of acceptable behavioral patterns. In the detection step, input events are labeled as attacks if their traces deviate significantly from that profile. The advantage is that this method works against attacks that haven’t been seen before. The disadvantage is that it is prone to a high rate of false positives. Feature selection and defense against adversarial machine learning attacks play a crucial role.
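
The two steps can be sketched with a very simple statistical profile. This is only an illustration under my own assumptions: the single feature (requests per minute), the sample numbers and the threshold of three standard deviations are invented, and a z-score stands in for whatever machine learning model a real system would train.

```python
import statistics

def train_profile(normal_values):
    """Training step: summarise normal behavior as (mean, standard deviation)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Detection step: flag values more than `threshold` standard deviations from the mean."""
    mean, std = profile
    if std == 0:
        # Degenerate profile: any value other than the mean is a deviation.
        return value != mean
    return abs(value - mean) / std > threshold

# Made-up requests-per-minute counts observed during normal operation.
normal_traffic = [98, 102, 101, 97, 100, 103, 99, 100]
profile = train_profile(normal_traffic)

for observed in (104, 107, 500):
    print(observed, "->", "ALERT" if is_anomalous(observed, profile) else "ok")
```

Lowering the threshold catches more attacks but also flags more legitimate spikes, which is where the high false positive rate comes from.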

III. Hybrid Detection

The disadvantage of signature based detection, i.e. no defense against novel attacks, is covered by anomaly detection, and the disadvantage of anomaly based detection, i.e. a high false positive rate, can be covered by signature based detection. Hence the next step is to combine both approaches so that each makes up for the other's shortcomings.
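
One possible way to combine the two, sketched below, is to run the signature check first and fall back to the anomaly check only when no signature matches. This ordering is my own assumption rather than an established design, and the signature, baseline and events are hypothetical.

```python
def hybrid_detect(event, signatures, baseline, threshold=3.0):
    """Return ("signature", name), ("anomaly", z_score) or ("normal", None)."""
    # Stage 1: signature check for traces of known attacks.
    for name, needle in signatures.items():
        if needle in event["payload"]:
            return ("signature", name)
    # Stage 2: anomaly check for events that no signature describes.
    mean, std = baseline  # assumes std > 0
    z_score = abs(event["bytes"] - mean) / std
    if z_score > threshold:
        return ("anomaly", z_score)
    return ("normal", None)

SIGNATURES = {"path_traversal": "../../"}  # hypothetical known-attack trace
BASELINE = (500.0, 50.0)                   # hypothetical mean/std of request size in bytes

events = [
    {"payload": "GET /index.html", "bytes": 510},
    {"payload": "GET /../../etc/passwd", "bytes": 480},
    {"payload": "POST /upload", "bytes": 9000},
]
for event in events:
    print(event["payload"], "->", hybrid_detect(event, SIGNATURES, BASELINE))
```

The second event looks normal in size and is caught only by the signature, while the third matches no signature and is caught only by its deviation from the baseline.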

IV. Scan Detection

Scan detection methods raise flags when adversaries scan the services or machines being monitored by the detection mechanism. Since scans are usually precursors to attacks, this technique also provides valuable early information, such as the source and target IP addresses. Although many techniques have been proposed, all such methods are plagued by high false positive rates and low true positive rates.
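
A very naive scan detector, sketched below, just counts how many distinct ports a source touches on a target and flags the pair once the count passes a threshold. The threshold of 10 ports, the connection tuple layout and the sample addresses are assumptions made for illustration.

```python
from collections import defaultdict

def detect_scans(connections, max_ports=10):
    """Flag (source, target) pairs where the source touched more than
    `max_ports` distinct ports on the target.

    `connections` is an iterable of (src_ip, dst_ip, dst_port) tuples.
    """
    ports_seen = defaultdict(set)
    for src, dst, port in connections:
        ports_seen[(src, dst)].add(port)
    return {pair: len(ports) for pair, ports in ports_seen.items() if len(ports) > max_ports}

# Made-up traffic: one host sweeping ports 1..50, one ordinary web client.
connections = [("198.51.100.7", "192.0.2.10", port) for port in range(1, 51)]
connections += [("192.0.2.20", "192.0.2.10", 80), ("192.0.2.20", "192.0.2.10", 443)]

alerts = detect_scans(connections)
for (src, dst), count in alerts.items():
    print(f"possible scan: {src} -> {dst} ({count} distinct ports)")
```

A fixed threshold like this is easy to evade with a slow scan and easy to trip with a busy legitimate client, which hints at why the error rates mentioned above are hard to bring down.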

V. Profiling Modules

Profiling modules try to group connections, processes, hosts etc. by their behavior, using clustering, association mining and so on. Profiling makes use of extraction, aggregation and visualization. This method too is highly dependent on the chosen features, that is, on the machine learning and data mining parts. Hence it hasn’t enjoyed much practical success either.
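
As a rough illustration of the clustering part, here is a tiny k-means written in plain Python that groups hosts by two aggregated features. The host names, the features (connections per hour, distinct ports contacted) and the choice of k = 2 are all invented for this sketch, and the centroids are simply initialised from the first k points to keep the run deterministic.

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means: returns (centroids, clusters) for a list of numeric tuples."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up per-host features: (connections per hour, distinct ports contacted).
hosts = {
    "desktop-a": (10, 2), "desktop-b": (12, 3), "desktop-c": (11, 2),
    "server-a": (200, 50), "server-b": (210, 55), "server-c": (190, 48),
}
centroids, clusters = kmeans(list(hosts.values()), k=2)
for centroid, cluster in zip(centroids, clusters):
    members = [name for name, features in hosts.items() if features in cluster]
    print(centroid, "->", members)
```

The grouping only makes sense because the two features happen to separate these hosts cleanly; with poorly chosen features the profiles become meaningless, which is the dependency noted above.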

We have seen a brief introduction to each of these methods. In further posts, I’ll try analyzing some papers that detail methods or approaches for each of the above.