Over the past few years, the rapid advancement of Internet of Things (IoT) technologies has brought transformational changes to our lives. Nevertheless, the growing sophistication of IoT systems has been accompanied by a rise in the severity of cyberattacks that exploit vulnerabilities in IoT devices. In particular, the impact of a recently discovered IoT malware known as Mirai was prominent. Mirai is a worm that self-replicates by finding other IoT devices with the same vulnerability and infecting them. The attacker controls a large number of IoT devices infected with Mirai and turns them into a botnet that launches DDoS attacks by sending an enormous number of packets to target hosts.

To mitigate such a wide-scale cyberattack efficiently, it is essential to develop an approach capable of monitoring cyberattacks across the internet with a wide view. To this end, the darknet, also referred to as a network telescope, has been researched for years. The darknet is an unused address space. Because no computer is actually installed on it, no legitimate communication should take place there, yet in reality a large number of packets arrive. These packets are mainly generated by scan activity or are backscatter, i.e., reply packets originating from hosts that have been targeted by a DDoS attack. As such, it can be assumed that packets detected in the darknet result from malware activity. Consequently, by analyzing darknet packets, it is possible to detect a portion of the cyberattacks taking place over the internet.

A recently published research paper analyzed the behavior of scan attacks using packets detected in the darknet. The study focused on the TCP SYN packets that characterize scan attacks and aimed to detect statistical features within the TCP headers of these packets. To achieve this, the researchers applied association rule learning to SYN packets and discussed the dynamic characteristics of IoT malware conducting scan attacks. Previous studies had also analyzed SYN packets, considering destination port information. Some researchers applied association rule learning to the destination port numbers of SYN packets, which led to the discovery of multiple association rules associated with the Carna botnet, as well as other forms of malware.

In this article, we give an overview of the technique of darknet analysis of scan attacks via association rule learning and how it was used to identify the Mirai malware.

Darknet traffic analysis:

The darknet comprises a reachable, yet unused, IP address space on the internet. With the IPv4 protocol, which is used in most online communications at present, there are around 4.3 billion IP addresses. Nevertheless, not all of these IP addresses are assigned to host computers. Although normal internet use never sends packets to an unused IP address, a considerable number of packets are in fact received there. There are two main reasons for this:

1- Scan activity generated by malware.

2- Backscatter representing reply packets sent from a target host compromised by a DDoS attack.

A scan attack is launched by malware to check whether a target host has a security vulnerability. The SYN scan is an attack that involves sending a SYN packet in TCP communication. It is referred to as a stealth scan attack, since it is performed without leaving a log on a server. On the other hand, backscatter is a train of reply packets sent by a victim targeted by a DDoS attack. Detecting backscatter tells us that a DDoS attack has been launched.

However, in a DDoS attack, compromised bot computers send large numbers of packets to a target host, and only some of the victim's reply packets land in the darknet. As such, darknet traffic analysis detects only those DDoS attacks whose backscatter reaches the monitored address space.
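The distinction between scan packets and backscatter can be sketched in code. The snippet below is a minimal illustration, not the researchers' implementation: it assumes the common heuristic that a bare SYN arriving at a darknet is a scan probe, while a SYN-ACK or RST is a reply to a packet the darknet never sent and is therefore backscatter. The function name `classify_darknet_packet` is hypothetical.

```python
# TCP flag bit masks as laid out in the TCP header.
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10


def classify_darknet_packet(tcp_flags: int) -> str:
    """Heuristic label for a TCP packet observed on a darknet sensor.

    ASSUMPTION: bare SYN = scan, SYN-ACK or RST = backscatter.
    """
    if tcp_flags & SYN and not tcp_flags & ACK:
        return "scan"         # connection attempt probing a port
    if tcp_flags & SYN and tcp_flags & ACK:
        return "backscatter"  # reply to a SYN that no darknet host sent
    if tcp_flags & RST:
        return "backscatter"  # reset sent in reply by a victim
    return "other"


print(classify_darknet_packet(SYN))        # scan
print(classify_darknet_packet(SYN | ACK))  # backscatter
```

Only the packets labeled as scans would feed the SYN-based analysis discussed in the rest of the article.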

Association rule learning:

The problem of association rule learning was first proposed in the context of market basket data, to identify groups of items that are frequently bought together. The problem can be defined as follows:

Let D = { T1, T2,···,TN} be a set of N transactions referred to as the database.

Let I = { i1, i2,···,iM} be the universal set of M items, i.e., all items present in the database. Every transaction in the database D has its own unique transaction ID and contains a subset of the items in I.

The support supp( X ) of a set of items X (an item set, for short) is defined as the number, or proportion, of transactions in the database that contain the item set.

Frequent pattern mining identifies all patterns P ⊂ I that are present in at least a percentage S of the transactions. The percentage S is called the minimum support. It can be expressed as an absolute number, or as a fraction of the total number of transactions in the database.
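As a small worked example of the definition of support, the sketch below computes supp(X) as a proportion over a toy market-basket database. The data and the function name `support` are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Hypothetical database D with N = 4 transactions.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread"}, D))          # 0.75
print(support({"bread", "milk"}, D))  # 0.5
```

With a minimum support of S = 0.5, both item sets above would count as frequent, whereas {"bread", "milk", "butter"}, which appears in only one transaction, would not.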

An association rule is an implication of the form

X → Y, for X, Y ⊆ I, X ∩ Y = ∅

The item sets X and Y are referred to as the antecedent and the consequent of the rule, respectively. The confidence of a rule is defined as the conditional probability P( Y|X ), i.e.,

conf( X → Y ) = supp( X ∪ Y ) / supp( X )

Rules that meet both a minimum support threshold, S, and a minimum confidence threshold, C, are referred to as strong. These are the significant rules selected from the set of all possible rules.
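The confidence formula and the definition of a strong rule can be illustrated with a short sketch. The database and the helper names (`support`, `confidence`, `is_strong`) are hypothetical, and the code assumes supp(X) is non-zero.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(X, Y, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X); assumes supp(X) > 0."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)


def is_strong(X, Y, transactions, min_supp, min_conf):
    """A rule is strong if it meets both the support and confidence thresholds."""
    return (support(set(X) | set(Y), transactions) >= min_supp
            and confidence(X, Y, transactions) >= min_conf)


# Hypothetical database: "bread" occurs in 3 of 4 transactions,
# "bread" together with "milk" in 2 of 4.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(confidence({"bread"}, {"milk"}, D))  # (2/4) / (3/4), about 0.67
print(is_strong({"bread"}, {"milk"}, D, min_supp=0.5, min_conf=0.6))
```

Raising the confidence threshold C to 0.7 would make the same rule fail the test, which shows how the two thresholds prune the set of candidate rules.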

Generally speaking, association rule learning can be performed in the following two steps:

1. Frequent pattern mining: find all item sets that meet the minimum support threshold, i.e., that occur at least as frequently as S.

2. Strong association rule generation: from the frequent item sets, which already meet the minimum support, create the rules that also meet the minimum confidence threshold.
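The two steps above can be sketched as follows. This is a brute-force illustration rather than an efficient algorithm such as Apriori or FP-growth, and it is not the researchers' code. The function names and the toy transactions (sets of destination ports) are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_supp):
    """Step 1: return {itemset: support} for every item set meeting min_supp."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            s = frozenset(cand)
            supp = sum(s <= t for t in transactions) / n
            if supp >= min_supp:
                frequent[s] = supp
                found = True
        if not found:
            break  # no frequent k-item set, so no larger set can be frequent
    return frequent


def strong_rules(frequent, min_conf):
    """Step 2: split each frequent item set into X -> Y, keep confident rules."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for x in combinations(sorted(itemset), k):
                X = frozenset(x)
                # Every subset of a frequent item set is itself frequent,
                # so frequent[X] is always present.
                conf = supp / frequent[X]
                if conf >= min_conf:
                    rules.append((X, itemset - X, supp, conf))
    return rules


# Hypothetical transactions: destination ports probed by four hosts.
D = [frozenset({23, 8080}), frozenset({23, 8080}), frozenset({23}), frozenset({80})]

freq = frequent_itemsets(D, min_supp=0.5)
for X, Y, supp, conf in strong_rules(freq, min_conf=0.8):
    print(set(X), "->", set(Y), "supp =", supp, "conf =", conf)
```

On this toy data the only strong rule is {8080} → {23}: every host that probed port 8080 also probed port 23, while the reverse rule reaches a confidence of only about 0.67.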

The proposed method focuses on scan attacks that use TCP SYN packets, as they represent the majority of darknet packets. Previous studies focused on the destination port number of SYN packets to detect association rules that aided in the identification of the Carna botnet. Nonetheless, apart from destination port numbers, other data in the TCP and IP headers can characterize packet traffic, such as source port, window size, sequence number, and others.

The proposed method utilizes every header field to detect scan attacks. First, the values of each TCP and IP header field are checked for all collected SYN packets in order to define a “transaction set” for every header field. For instance, if the analysis relies on the three header fields “sequence number,” “destination port,” and “window size,” three transaction sets are defined for the obtained darknet SYN packets. Thereafter, association rule learning is performed on each transaction set. If significant association rules are obtained, we formulate a hypothesis that the behavior of a scan attack is characterized by the identified association rules, and the hypothesis is then verified against darknet packets collected on other days. If the hypothesis is verified as correct, the association rules can be used as a signature of a malware or a piece of code performing a scan attack.
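One way to picture the construction of per-field transaction sets is the sketch below. The article does not spell out how a transaction is formed, so the code makes a loud assumption: one transaction per source host, containing the distinct values that host used for the field. The packet dictionaries, field names (`src_ip`, `dst_port`, `seq`, `window`), and the function `build_transaction_sets` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical field names for destination port, sequence number, window size.
FIELDS = ("dst_port", "seq", "window")


def build_transaction_sets(packets, fields=FIELDS):
    """Return {field: {src_ip: frozenset(values)}}, one transaction set per field.

    ASSUMPTION: a transaction groups the distinct values of one header field
    sent by one source host. The source article does not confirm this.
    """
    sets = {f: defaultdict(set) for f in fields}
    for pkt in packets:
        for f in fields:
            sets[f][pkt["src_ip"]].add(pkt[f])
    return {f: {ip: frozenset(vals) for ip, vals in per_host.items()}
            for f, per_host in sets.items()}


# Made-up SYN packets using documentation-range addresses.
packets = [
    {"src_ip": "198.51.100.7", "dst_port": 23, "seq": 1000, "window": 5840},
    {"src_ip": "198.51.100.7", "dst_port": 8080, "seq": 2000, "window": 5840},
    {"src_ip": "203.0.113.9", "dst_port": 23, "seq": 3000, "window": 14600},
]

tsets = build_transaction_sets(packets)
print(tsets["dst_port"]["198.51.100.7"])
print(tsets["window"]["198.51.100.7"])
```

Each of the three resulting transaction sets could then be mined separately, and any rules found on one day checked against packets from other days.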

Analyzing darknet traffic before Mirai outbreak:

To test the proposed rule mining method, the researchers used a large set of TCP SYN packets collected between July 1st, 2016 and September 15th, 2016 by the NICT/16 darknet sensor. The collected packets totaled 1,840,973,403 and were sent from 17,928,006 unique hosts. The researchers picked “destination port,” “sequence number,” and “window size” as header fields, and association rule learning was conducted on the darknet SYN packets obtained every day throughout that period.

Figure (1) illustrates how the number of hosts that sent darknet SYN packets matching the three association rules changed over time. As Figure (1) shows, these hosts were first identified on August 2nd and disappeared on September 4th, which is only 3 days before the source code of Mirai was released. Generally speaking, there are two peaks in the number of hosts matching the obtained association rules. To check whether or not the scan activities in Figure (1) were related to Mirai, the researchers verified the following three characteristics of Mirai, which were clarified by the release of the source code:

Condition 1: sequence number = destination IP,

Condition 2: destination port = 23,

Condition 3: source port > 1024.

Interestingly, the majority of hosts that matched the window size features met the above three conditions of Mirai. As such, the scan activities in Figure (1) might indicate that attackers were performing tests or preparing for the actual distribution of the Mirai malware.
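The three conditions can be checked per packet with a few lines of code. The sketch below is illustrative only. It assumes that "sequence number = destination IP" means the 32-bit TCP sequence number equals the destination IPv4 address read as a big-endian integer. The function name `matches_mirai_conditions` and the sample values are hypothetical.

```python
import ipaddress


def matches_mirai_conditions(dst_ip: str, dst_port: int,
                             src_port: int, seq: int) -> bool:
    """Check the three Mirai scan conditions for one SYN packet.

    ASSUMPTION: the destination IP is compared to the sequence number
    as a 32-bit integer in network (big-endian) byte order.
    """
    dst_as_int = int(ipaddress.IPv4Address(dst_ip))
    return (seq == dst_as_int    # Condition 1
            and dst_port == 23   # Condition 2
            and src_port > 1024) # Condition 3


# 192.0.2.1 as a 32-bit integer is 3221225985.
print(matches_mirai_conditions("192.0.2.1", 23, 40000, 3221225985))  # True
print(matches_mirai_conditions("192.0.2.1", 23, 40000, 12345))       # False
```

Applying such a check to the hosts selected by the mined association rules is one way to picture how the researchers tied the observed scan activity to Mirai.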