Introduction

Machine learning is a subfield of computer science that aims to give computers the ability to learn from data instead of being explicitly programmed. It leverages the petabytes of data that exist on the internet today to make decisions and perform tasks that are sometimes impossible, or simply complicated and time-consuming, for us humans.

Malware is one of the most imminent threats that companies and users face every day. Whether it arrives as a phishing email or as an exploit delivered through the browser, coupled with multiple evasion methods and other security vulnerabilities, today's defense systems often cannot keep up. Frameworks such as Veil, Shellter, and others are used by professionals when conducting pentesting work and are known to be quite effective.

Today I am going to show you that machine learning can indeed be used to detect malware without relying on either signature detection or behavioral analysis.

P.S.: Many products today, such as CylanceProtect, SentinelOne, and Carbon Black, are known to leverage these capabilities. The framework we are going to develop throughout this session is nowhere near capable of doing what these products do, and I will explain why shortly.

Machine Learning: A Brief Introduction

Machine learning mixes several domains of mathematics (mainly statistics, probability, and linear algebra) with computation (algorithms, data processing, numerical calculations) to gain insight from data. It is used to detect fraud and spam and to recommend movies, meals, and products to buy. Amazon, Facebook, and Google are just a few of the hundreds of companies that use machine learning to improve their products.

Machine learning can be split into two major methods: supervised learning and unsupervised learning. The first means that the data we work with is labeled; the second means it is unlabeled. Malware detection can be attacked using both methods, but we will focus on the first one, since our goal is to classify files.

Classification is a subdomain of supervised learning. It can be either binary (malware vs. not malware) or multi-class (cat, dog, pig, llama…); malware detection therefore falls under binary classification.

Explaining machine learning in depth is beyond the scope of this article. Plenty of resources are available today to learn more about it, and a few of them are listed in the Resources section at the end of this article.

The Problem Set

Machine learning works by defining a problem, collecting the data, processing the data to make it usable, and then feeding it to the algorithms. This is called the machine learning workflow: the minimal set of steps you need to start doing machine learning. The extensive amount of resources these steps may require is what makes machine learning hard to apply to everything.

In our case let’s define our workflow:

First, we need to collect malware samples and clean samples. Ideally, we want at least 10k samples of each, and it is advisable to use even more.

We need to extract meaningful features from our samples. These features will be the basis of our study; features are what describe something. For example, the features of a house are:

number of rooms

square footage of the house

price

After extracting these features, we need to process all our samples to build a dataset. It can be a database file or a CSV file; either way, it becomes easy to turn the samples into vectors, since the algorithms work by performing computations on vectors.

Lastly, we need metrics. In binary classification there is a multitude of metrics to benchmark the performance of an algorithm (ROC/AUC, confusion matrix…). We will use a confusion matrix, since it reports the counts of true positives and true negatives as well as false positives and false negatives.
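To make the last step concrete, here is a minimal sketch of how the four confusion-matrix counts can be tallied from true and predicted labels. The function name confusion_counts and the toy labels are illustrative assumptions (1 marks the positive class here); it is not the code used for the results later in this article, which come from scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels, with `positive` as the positive class."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive and guess == positive:
            tp += 1  # positive sample, predicted positive
        elif truth == positive:
            fn += 1  # positive sample, predicted negative
        elif guess == positive:
            fp += 1  # negative sample, predicted positive
        else:
            tn += 1  # negative sample, predicted negative
    return tn, fp, fn, tp


# Toy example with made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("TN = %d, FP = %d, FN = %d, TP = %d" % (tn, fp, fn, tp))
```

Which class counts as "positive" is a convention, so it is worth stating it explicitly whenever these four numbers are reported.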

Collecting Samples and Feature Extraction

I assume the reader knows about the PE file format; if you do not, you can read about it here. Collecting samples is quite easy: you can either use a paid service (like VirusTotal) or one of the links here.

Okay, let's start by discussing our model.

For our algorithm to learn from the data we feed it, we need to make that data understandable and clear. In our case, we will use 12 features to teach our algorithm. These features will be extracted from each binary and organized into a single CSV file.

Feature Extraction

To extract features, we will be using pefile. The first step is to install pefile; I assume you know some Python and how to use pip.

From your terminal run:

pip install pefile

Now that you have the necessary tools, let's write some code. But first, let's discuss what kind of information we want to extract. We are interested in the following fields of a PE file:

Major Image Version: used to indicate the major version number of the application; for Microsoft Excel version 4.0, it would be 4.

Virtual Address and Size of the IMAGE_DATA_DIRECTORY

OS Version

Import Address Table Address

Resources Size

Number of Sections

Linker Version

Size of Stack Reserve

DLL Characteristics

Export Table Size and Address

To make our code more organized, let's start by creating a class that represents the PE file information as one object.

import os
import pefile


class PEFile:
    """
    This class is constructed by parsing the PE file for the interesting features.
    Each PE file is an object by itself, and we extract the needed information.
    """

Now we move on to write a small method that constructs a dictionary for each PE file. Each sample will thus be represented as a Python dictionary where the keys are the features and the values are the values of each parsed field.

    # method of the PEFile class
    def Construct(self):
        sample = {}
        for attr, k in self.__dict__.items():
            # skip the raw pefile object, keep only the extracted features
            if attr != "pe":
                sample[attr] = k
        return sample

Now let's write a script that will loop through all the samples in a folder, process each one of them, and then dump all those dictionaries into one CSV file that we will use later.

import os
import pandas as pd
import pedump  # assumed to be the module that holds the PEFile class above


def pe2vec(direct):
    """
    Dirty function (handles all exceptions). For each sample found under
    the folder `direct`, it constructs a dictionary of dictionaries in the format:
    sample x : PE information
    """
    dataset = {}
    for subdir, dirs, files in os.walk(direct):
        for f in files:
            file_path = os.path.join(subdir, f)
            try:
                pe = pedump.PEFile(file_path)
                dataset[str(f)] = pe.Construct()
            except Exception as e:
                # skip files that cannot be parsed, but report why
                print(e)
    return dataset


# now that we have a dictionary let's put it in a clean csv file
def vec2csv(dataset):
    df = pd.DataFrame(dataset)
    infected = df.transpose()  # transpose to have the features as columns and samples as rows
    # utf-8 is preferred
    infected.to_csv('dataset.csv', sep=',', encoding='utf-8')
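The "samples as rows, features as columns" layout produced by the transpose above can also be sketched with the standard library's csv module, which may help to see what ends up in the file. The function dataset_to_csv and the two fake samples are hypothetical and only illustrate the layout; this is not a replacement for the pandas code.

```python
import csv
import io


def dataset_to_csv(dataset, out):
    """Write {sample_name: {feature: value}} as CSV: one row per sample, one column per feature."""
    # Collect every feature name so that samples missing a field still line up.
    features = sorted({name for sample in dataset.values() for name in sample})
    writer = csv.DictWriter(out, fieldnames=["sample"] + features, restval="")
    writer.writeheader()
    for sample_name, sample in sorted(dataset.items()):
        row = {"sample": sample_name}
        row.update(sample)
        writer.writerow(row)


# Two made-up samples, only to show the resulting layout.
fake_dataset = {
    "a.exe": {"NumberOfSections": 4, "LinkerVersion": 9},
    "b.exe": {"NumberOfSections": 3, "LinkerVersion": 6},
}
buffer = io.StringIO()
dataset_to_csv(fake_dataset, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Each line of the output is one sample, which is exactly the shape the learning algorithms expect: one vector per file.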

Okay, now we are ready to process some data. I advise you to use the code from my GitHub.

Exploring the Data

This step is not strictly needed, but it can be quite an eye-opening experience: it gives a more intuitive idea of the data as a whole.

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

malicious = pd.read_csv("bucket-set.csv")
clean = pd.read_csv("clean-set.csv")

In [3]:

print("Clean Files Statistics")
clean.describe()

Clean Files Statistics

Out[3]:

            DebugRVA      DebugSize            Dll      ExportRVA     ExportSize         IATRVA   ImageVersion
count   2.467000e+03    2467.000000    2467.000000   2.467000e+03    2467.000000   2.467000e+03    2467.000000
mean    1.009835e+05      33.970004    6305.958654   1.473796e+05    1619.046210   4.863884e+04     302.233077
std     5.217597e+05      14.873702   12392.766981   5.148365e+05    9275.796269   4.835382e+05    2484.761684
min     0.000000e+00       0.000000       0.000000   0.000000e+00       0.000000   0.000000e+00       0.000000
25%     4.416000e+03      28.000000     320.000000   4.304000e+03      74.000000   4.096000e+03       6.000000
50%     4.816000e+03      28.000000     320.000000   1.472000e+04     147.000000   4.096000e+03       6.000000
75%     2.099400e+04      56.000000    1344.000000   8.676000e+04     287.000000   4.096000e+03       6.000000
max     1.769935e+07      84.000000   49472.000000   1.019821e+07  205292.000000   1.786675e+07   21315.000000

          LinkerVersion  NumberOfSections         OSVersion           ResSize  StackReserveSize             clean
count       2467.000000       2467.000000       2467.000000      2.467000e+03      2.467000e+03            2467.0
mean           9.051885          3.978111          5.942440      1.690548e+05      3.025229e+05               1.0
std            0.651705          1.165679          0.390389      9.364935e+05      1.871939e+05               0.0
min            2.000000          1.000000          0.000000      9.040000e+02      2.621440e+05               1.0
25%            9.000000          4.000000          6.000000      1.056000e+03      2.621440e+05               1.0
50%            9.000000          4.000000          6.000000      2.040000e+03      2.621440e+05               1.0
75%            9.000000          4.000000          6.000000      2.190800e+04      2.621440e+05               1.0
max           14.000000         22.000000         10.000000      2.026722e+07      4.194304e+06               1.0

In [4]:

print("Malicious Files Statistics")
malicious.describe()

Malicious Files Statistics

Out[4]:

            DebugRVA      DebugSize            Dll      ExportRVA     ExportSize         IATRVA   ImageVersion
count    2004.000000    2004.000000    2004.000000   2.004000e+03   2.004000e+03   2.004000e+03    2004.000000
mean    15453.085828       5.182136   16616.363772   1.933029e+04   3.183463e+05   6.372132e+04      19.202096
std     50630.027056      12.926161   16693.869293   2.049653e+05   1.283018e+07   9.307602e+04     755.237241
min         0.000000       0.000000       0.000000   0.000000e+00   0.000000e+00   0.000000e+00       0.000000
25%         0.000000       0.000000       0.000000   0.000000e+00   0.000000e+00   8.192000e+03       0.000000
50%         0.000000       0.000000    1024.000000   0.000000e+00   0.000000e+00   2.867200e+04       0.000000
75%         0.000000       0.000000   33088.000000   0.000000e+00   0.000000e+00   1.187840e+05       5.000000
max    396224.000000     213.000000   59669.000000   8.273884e+06   5.704256e+08   1.327168e+06   33795.000000

          LinkerVersion  NumberOfSections         OSVersion           ResSize  StackReserveSize
count       2004.000000       2004.000000       2004.000000      2.004000e+03      2.004000e+03
mean           7.705589          4.477545         36.024451      4.882199e+04      1.078599e+06
std            8.081842          1.524306       1225.262134      7.545737e+05      1.011342e+06
min            0.000000          2.000000          1.000000      0.000000e+00      0.000000e+00
25%            6.000000          3.000000          4.000000      1.104000e+03      1.048576e+06
50%            7.000000          4.000000          4.000000      2.880000e+03      1.048576e+06
75%            9.000000          5.000000          5.000000      3.173800e+04      1.048576e+06
max          248.000000         18.000000      54034.000000      3.356242e+07      3.355443e+07

We can see the discrepancies between the two sets, especially in the first two features (DebugRVA and DebugSize). Let's plot some of these features to get a visual idea of those differences.
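If the rows of these tables are unfamiliar, the following sketch computes the same eight summary values for a single column using only the standard library. It assumes, as far as I know is the case for pandas, a sample standard deviation (n - 1) and linearly interpolated quartiles; describe_column and the input values are made up for illustration.

```python
import statistics


def describe_column(values):
    """Summarize one numeric column the way a describe()-style table does."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points,
    # treating the smallest and largest values as the 0th and 100th percentiles.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": data[-1],
    }


# Made-up values, not taken from the dataset.
summary = describe_column([1, 2, 3, 4])
for name, value in summary.items():
    print(name, value)
```

Reading the tables this way, a 50% value of 0 for DebugSize in the malicious set means that at least half of those samples report no debug data at all, while the same quartile is 28 for the clean set.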

In [6]:

# let's plot
# let's label our dataframes
malicious['clean'] = 0
clean['clean'] = 1

import seaborn
%matplotlib inline

fig, ax = plt.subplots()
x = malicious['IATRVA']
y = malicious['clean']
ax.scatter(x, y, color='r', label='Malicious')
x1 = clean['IATRVA']
y1 = clean['clean']
ax.scatter(x1, y1, color='b', label='Cleanfiles')
ax.legend(loc="right")

[Figure: scatter plot of IATRVA (x-axis) against the clean label (y-axis); malicious samples in red, clean files in blue]

We can notice the “clustering” of the malicious samples in a tight group, while the clean files are spread sparsely along the x-axis. Let's now plot other features as well to get an overall understanding of what we have here.

In [13]:

%matplotlib inline

fig, ax = plt.subplots()
x = malicious['DebugRVA']
y = malicious['clean']
ax.scatter(x, y, color='r', label='Malicious')
x1 = clean['DebugRVA']
y1 = clean['clean']
ax.scatter(x1, y1, color='b', label='Cleanfiles')
ax.legend(loc="right")

[Figure: scatter plot of DebugRVA (x-axis) against the clean label (y-axis); malicious samples in red, clean files in blue]

In [14]:

%matplotlib inline

fig, ax = plt.subplots()
x = malicious['ExportSize']
y = malicious['clean']
ax.scatter(x, y, color='r', label='Malicious')
x1 = clean['ExportSize']
y1 = clean['clean']
ax.scatter(x1, y1, color='b', label='Cleanfiles')
ax.legend(loc="right")

[Figure: scatter plot of ExportSize (x-axis) against the clean label (y-axis); malicious samples in red, clean files in blue]

The more we plot and analyze the data, the better we understand it and get a sense of its overall distribution. Of course, a problem arises: what do I do if I have a high-dimensional dataset? What we have here is fairly low-dimensional, but many techniques can be used to reduce the dimensions to the more “important” features. Algorithms like PCA and t-SNE can be used to visualize the data on 3D or even 2D plots.

Machine Learning Application

Enough with the statistics; let's do some work. So far we have not done any machine learning: what we did is only one part of the whole job. We took some data, cleaned it, and prepared it. Now, to start experimenting with machine learning, we have to do a few more things:

First, we need to merge our datasets (malicious and clean) into one DataFrame.

We need to split our DataFrame into two parts: the first one will be used for training and the second one for testing.

We will then apply a few algorithms and see what happens.

Dataset Preparation

In [22]:

dataset = pd.read_csv('malware-dataset.csv')
"""
At this point, dataset holds our data.
Great, let's split it into train/test and fix a random seed to keep our predictions constant.
"""
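The code that performs the split is not shown above. In scikit-learn this is usually done with train_test_split from sklearn.model_selection, which accepts test_size and random_state arguments; the exact call and split ratio behind the results in this article are unknown to me. To show what such a split does, here is a minimal standard-library sketch in which the function name, the default test_size of 0.33, and the toy rows are all my assumptions.

```python
import random


def train_test_split_rows(X, y, test_size=0.33, seed=0):
    """Shuffle row indices with a fixed seed, then cut them into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # fixed seed: the same split on every run
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up rows: each row is a feature vector, each label is 0 or 1.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split_rows(X, y, test_size=0.3, seed=42)
print(len(X_train), len(X_test))
```

Fixing the seed matters because every rerun with a different shuffle would give slightly different confusion-matrix counts, which makes classifiers hard to compare.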

Now we have four fairly large matrices. X_train and y_train will be used to train our different classifiers, X_test will be used to predict the labels, and y_test will be used for metrics: we are going to compare the predictions made on X_test to y_test to see how well we performed. We start with random forests, which are an ensemble version of decision trees. They work by building many decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees. They perform quite well on binary classification problems.
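The "mode of the classes" idea can be sketched in a few lines. The three hand-written rules below stand in for trained decision trees; their feature names and thresholds are invented for illustration and were not learned from the dataset. Note also that scikit-learn's random forest combines its trees by averaging predicted class probabilities rather than by a hard vote, so this shows the principle, not that implementation.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the most common class among the individual trees' predictions."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Three toy "trees" (plain threshold rules) standing in for trained decision trees.
# Label convention as in the plots above: 1 = clean, 0 = malicious.
toy_trees = [
    lambda s: 0 if s["DebugSize"] == 0 else 1,
    lambda s: 0 if s["ExportSize"] == 0 else 1,
    lambda s: 1 if s["LinkerVersion"] >= 9 else 0,
]

sample = {"DebugSize": 28, "ExportSize": 0, "LinkerVersion": 9}
prediction = forest_predict(toy_trees, sample)
print(prediction)
```

Two of the three toy trees vote for class 1 on this sample, so the ensemble answers 1 even though one tree disagrees; that tolerance to individual mistakes is what the ensemble buys us.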

In [25]:

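# let's start with random forests
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# we initiate the classifier
clf1 = RandomForestClassifier()
# training
clf1.fit(X_train, y_train)
# prediction labels for X_test
y_pred = clf1.predict(X_test)
# metrics evaluation
"""
confusion_matrix(...).ravel() returns tn, fp, fn, tp, with label 1 as the positive class.
The reading below treats "malicious" as the positive class:
tn = True Negative: a correct prediction, clean predicted as clean
fp = False Positive: a false alarm, clean predicted as malicious
tp = True Positive: a correct prediction, malicious predicted as malicious
fn = False Negative: a malicious file predicted as clean
If the label column is `clean` (1 = clean, 0 = malicious), as set in the plotting cells,
the roles of the two classes are swapped.
"""
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN = %d" % tn)
print("TP = %d" % tp)
print("FP = %d" % fp)
print("FN = %d" % fn)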

TN = 697

TP = 745

FP = 6

FN = 4

Notice anything? With no parameter tuning and no modifications, 6 false positives and 4 false negatives is quite good. We were able to classify 697 clean files and 745 malicious ones correctly. I guess our small antivirus is working :D.

Let's try another classifier this time: we will build a simple neural network and test it on another randomized split.

According to Wikipedia:

A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training the network. MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable.

A multilayer perceptron is the generalized version of the perceptron, which is the basic model of the neuron. Perceptrons are the fundamental building blocks of deep learning methods, where we meet larger and deeper networks.

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# This is a preprocessing step called feature scaling, where we transform our data onto the same scale for better predictions
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Here we build a multilayer perceptron with 6 hidden layers of 12 neurons each (one neuron per feature);
# you can use more if you want, but it will turn into a complex zoo
mlp = MLPClassifier(hidden_layer_sizes=(12, 12, 12, 12, 12, 12))
# Training the MLP on our data
mlp.fit(X_train, y_train)
predictions = mlp.predict(X_test)
# evaluating our classifier
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("TN = %d" % tn)
print("TP = %d" % tp)
print("FP = %d" % fp)
print("FN = %d" % fn)

TN = 695

TP = 731

FP = 8

FN = 18
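The scaling step in the cell above deserves a closer look, because features such as StackReserveSize (values around a million) would otherwise dominate features such as LinkerVersion (values below ten). The sketch below shows the idea behind standardization with the standard library only: the mean and standard deviation are learned from the training rows and then applied unchanged to the test rows. The helper names and the sample rows are assumptions for illustration, not the scikit-learn implementation.

```python
import statistics


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Rescale every column using the statistics learned from the training data."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows with two columns of very different magnitude.
train = [[0.0, 262144.0], [28.0, 1048576.0], [56.0, 1048576.0]]
test = [[28.0, 262144.0]]
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)
print(train_scaled[0], test_scaled[0])
```

Fitting on the training rows only, as the cell above also does, keeps information about the test set from leaking into the model.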

The almighty neural network failed to detect eighteen threats. Not only that, it classified them as clean files, which is a very, very bad problem: imagine your antivirus classifying ransomware as a clean file! This sounds like AV evasion against AI, but let's not be pessimistic. Our neural network is very primitive and can actually be made more accurate, but this is beyond the scope of this article.
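To compare the two classifiers on equal terms, the four counts printed above can be turned into rates with simple arithmetic. The sketch below uses only the numbers reported in the two outputs; the function and variable names are mine.

```python
def rates(tn, fp, fn, tp):
    """Derive accuracy and error rates from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Counts copied from the two outputs above.
forest_rates = rates(tn=697, fp=6, fn=4, tp=745)
mlp_rates = rates(tn=695, fp=8, fn=18, tp=731)
for name, result in (("Random forest", forest_rates), ("MLP", mlp_rates)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

Both classifiers were evaluated on 1,452 test files; the random forest gets about 99.3% of them right and the MLP about 98.2%, and the gap comes mostly from the MLP's higher false negative count.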

Conclusion:

This is just the beginning. I wanted to show that malware classification is indeed a solvable problem if we accept 99% as a good accuracy rate. Of course, building and deploying something like this in reality is time-consuming and requires more knowledge and more data. This was merely a preview of the infinite possibilities that machine learning, and AI in general, offer us. I hope this was educational, fun, and insightful.

Resources:

Machine Learning Course by Andrew Ng

https://fast.ai, a course that will make you a deep learning practitioner in 7 weeks; the only requirement is Python

Elements of Statistical Learning (Hastie); this is a more theoretical book, but quite insightful

Achraf Belaarch is an applied Mathematics undergraduate. In his free time, he likes challenging problems while exploring the applications of machine learning and deep learning in cybersecurity. He also enjoys programming and reading research papers.
