Malware Detection Through Intelligence

With attacks on both enterprise and community networks increasing rapidly, malware detection is a growing problem, especially on mobile platforms. Since the official app stores host millions of mobile apps, it is almost impossible to examine each of them manually for malicious behaviour. Traditional approaches to malware detection rely on manual methods, such as examining the behaviour or the decompiled code of malware programs in order to design malware signatures by hand. However, these methods do not scale to a large number of applications, and new malware can be designed to evade existing signatures. For that reason, there has recently been a great deal of work on automatic malware detection using machine learning techniques.

Why use Artificial Intelligence?

In addition to the reasons given above, AI has further advantages in the field of malware detection. Malware attacks are increasing rapidly, especially on enterprise networks. Given the amount of data involved, some method is needed to understand the data and the relationships between different features, such as operations and malware families. This is where AI comes into play, because it is good at finding patterns in large volumes of data.

How is it done?

Several methods are used to detect malware through intelligence. First, attributes must be extracted from the data. These attributes are the input to the machine learning model, and how well they are chosen affects the performance of the model. Commonly used attributes are:

Byte sequence n-grams

Opcode n-grams

System/API calls

Requested permissions

Printable strings

Byte Sequence n-grams

An n-gram is a contiguous sequence of n byte values (usually written in hexadecimal) from a given file. This method was first implemented in IBM’s anti-virus scanner, which used 3-grams as features and a neural network as the classification model. Dimensionality reduction must be used with this kind of model, because the number of possible features grows exponentially with n, and a feature vector can end up holding millions of elements.
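As a minimal sketch of the idea (not the IBM implementation), byte n-grams can be counted with a sliding window over the file contents. The function name `byte_ngrams` and the sample bytes below are made up for illustration.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every contiguous n-byte window in `data`, keyed by its hex string."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A file shorter than n bytes yields an empty range and so no n-grams.
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))


# Illustrative bytes only, not taken from a real sample.
sample = bytes.fromhex("4d5a90004d5a90")
counts = byte_ngrams(sample, n=3)
print(counts.most_common(1))

# Upper bound on distinct 3-gram features: 256 possible values per byte position.
max_features = 256 ** 3
print(max_features)
```

With n = 3 there are already 256^3 = 16,777,216 possible features, which shows why the feature vector has to be reduced before it is given to a classifier.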

Opcode n-grams

An opcode (operation code) is the part of a machine language instruction that specifies the operation to be performed. Opcodes can be extracted from assembly language files. This method is very similar to the byte sequence n-grams method, except that the n-grams are built from opcodes instead of raw bytes.

API Calls and Requested Permissions

API calls and requested permissions are widely used in malware detection, especially with classic machine learning algorithms such as Decision Trees and Support Vector Machines. Since these attributes are strings, they can be used with n-gram methods as well. It should be noted that API calls and requested permissions are not sufficient on their own to reliably distinguish malicious software from benign software.
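One common way to feed requested permissions to a classic classifier is a binary vector over a fixed vocabulary of permissions. The sketch below assumes that encoding. The helper `permission_vector`, the four-entry vocabulary and the example app are hypothetical, and a real vocabulary would be built from the training set.

```python
def permission_vector(requested, vocabulary):
    """Return a 0/1 list marking which vocabulary permissions the app requests."""
    requested = set(requested)
    return [1 if permission in requested else 0 for permission in vocabulary]


# Hypothetical fixed vocabulary; a real one would be built from the training set.
VOCABULARY = [
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
    "android.permission.CAMERA",
]

# Hypothetical app that requests two of the four permissions.
app_permissions = ["android.permission.SEND_SMS", "android.permission.INTERNET"]
features = permission_vector(app_permissions, VOCABULARY)
print(features)
```

Each position in the vector corresponds to one permission, so vectors from different apps are directly comparable and can be passed to a Decision Tree or a Support Vector Machine.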

Deep Learning Methods

Deep learning is a relatively new and exciting technique for implementing machine learning. It has achieved state-of-the-art results in many areas, from academia to industry. From my point of view, when correctly implemented, deep learning models are also considerably better than classic machine learning models at detecting malware. From here on I’ll continue the topic with a paper published in the Proceedings of the ACM Conference on Data and Applications Security and Privacy (CODASPY) 2017. The PDF can be found at https://pure.qub.ac.uk/portal/files/122380314/sig_camera_ready.pdf.

The authors propose a new malware detection method that applies Convolutional Neural Networks to static analysis of raw opcode sequences. The research is done on Android applications, and the workflow for producing an opcode sequence from an Android application is as follows.

The authors disassemble each APK (Android application package) and end up with .smali (assembler) files. They then extract the opcode sequences from these .smali files.
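The extraction step can be sketched roughly as follows, assuming the APK has already been disassembled and the .smali text is available as a string. This is a simplified illustration and not the authors' code: the function `extract_opcodes`, the list of skipped directive blocks and the sample class are assumptions, and real .smali files may contain constructs this sketch does not handle.

```python
# Directives that open a multi-line block whose body is data, not instructions.
# This list is an assumption made for the sketch and may be incomplete.
BLOCK_DIRECTIVES = (".annotation", ".array-data", ".packed-switch", ".sparse-switch")


def extract_opcodes(smali_text):
    """Return the opcode of every instruction found inside .method bodies."""
    opcodes = []
    in_method = False
    in_data_block = False
    for raw_line in smali_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith(".method"):
            in_method = True
        elif line.startswith(".end method"):
            in_method = False
        elif line.startswith(".end "):
            in_data_block = False  # closes an annotation or data block
        elif line.startswith(BLOCK_DIRECTIVES):
            in_data_block = True
        elif in_method and not in_data_block and not line.startswith((".", ":")):
            # Not a directive or a label, so the first token is the opcode.
            opcodes.append(line.split()[0])
    return opcodes


# Hypothetical class written for this example.
SAMPLE = """
.class public Lcom/example/Demo;
.super Ljava/lang/Object;

.method public static add(II)I
    .registers 3
    add-int v0, p0, p1
    return v0
.end method

.method public run()V
    .registers 2
    const/4 v0, 0x1
    :loop
    invoke-static {v0, v0}, Lcom/example/Demo;->add(II)I
    move-result v0
    return-void
.end method
"""

sequence = extract_opcodes(SAMPLE)
print(sequence)
```

Operands are discarded on purpose: only the operation names are kept, which is what makes the result an opcode sequence rather than a full disassembly.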

Advantages of using Convolutional Neural Networks

The training pipeline is much simpler compared to other methods.

After training the model, the network can be executed efficiently on a GPU.

Features are automatically learned by the network. This is important both for the model and for the team doing the research. Classic malware detection techniques require hand-designed signatures, which take a lot of time and resources. Classic machine learning models also require expert analysis to extract useful features from the data. Both of these consume time and labour; with CNNs, that work is avoided.

Using neural networks also allows us to use very long n-grams.

Since the detection network can be run on a GPU, a large number of files can be scanned per second on mobile devices.

It saves time when the system is presented with new malware to recognize: the network can simply be trained again.

It is computationally efficient: training and testing time is linearly proportional to the number of malware examples.
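To see why a convolutional filter behaves like an n-gram detector, here is a toy sketch in plain Python. It is not the architecture from the paper: the opcodes are one-hot encoded and the filter weights are set by hand, whereas a real network would learn both an embedding and the filter weights during training. All names and values are illustrative.

```python
def one_hot(index, size):
    """Encode an opcode id as a vector with a single 1."""
    return [1 if i == index else 0 for i in range(size)]


def max_filter_response(opcode_ids, kernel, vocab_size):
    """Slide one filter over the sequence and keep its strongest response.

    `kernel` has one weight row per position, so a filter with k rows
    looks at k consecutive opcodes at a time, like a k-gram.
    """
    embedded = [one_hot(i, vocab_size) for i in opcode_ids]
    width = len(kernel)
    if len(embedded) < width:
        raise ValueError("sequence is shorter than the filter")
    responses = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        responses.append(
            sum(w * x
                for kernel_row, opcode_row in zip(kernel, window)
                for w, x in zip(kernel_row, opcode_row))
        )
    return max(responses)  # global max pooling over all positions


VOCAB_SIZE = 4
# Hand-set weights that respond most strongly to the opcode trigram (2, 0, 1).
kernel = [one_hot(2, VOCAB_SIZE), one_hot(0, VOCAB_SIZE), one_hot(1, VOCAB_SIZE)]

with_pattern = max_filter_response([1, 2, 0, 1, 3], kernel, VOCAB_SIZE)
without_pattern = max_filter_response([1, 2, 3, 0, 1], kernel, VOCAB_SIZE)
print(with_pattern, without_pattern)
```

The filter is applied once per position, so the cost of scanning a sequence grows linearly with its length, and widening the filter to cover a longer n-gram only adds weight rows instead of multiplying the number of features.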

Disadvantage of using Convolutional Neural Networks

In order to get good results, a large training set is required. Since it is very hard to find a very large public dataset, this method may require manually crawling for samples.

Results

Datasets Used

Benign samples (collected from the Google Play store and checked with VirusTotal by the authors) (small dataset)

Dataset provided by Intel Security (Benign: 3627, Malware: 2475) (large dataset)

On the small dataset, the paper shows state-of-the-art performance. On the large dataset, the n-grams method performs better. However, as the dataset gets larger, the f-score of the presented method improves. The authors did not run the n-grams method on the combination of the large and small datasets, because the computational resources it would need make it impractical.
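For reference, the f-score used to compare the methods combines precision and recall into one number. The sketch below computes it from confusion counts. The counts are invented for the example and are not results from the paper.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Return precision, recall and their harmonic mean (the F1 score)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 80 malware samples caught, 20 benign apps flagged by
# mistake, 10 malware samples missed.
precision, recall, f1 = f_score(true_positives=80, false_positives=20, false_negatives=10)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because it is a harmonic mean, the f-score stays low unless both precision and recall are high, which makes it a stricter measure than accuracy when the benign and malware classes are unbalanced.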