Botnet dataset

Assessing the performance of any detection approach requires experimentation with data that is heterogeneous enough to simulate real traffic to an acceptable level. The lack of such datasets for evaluating botnet detection approaches is well known in the field, mostly due to a number of challenges that have been repeatedly emphasized in the literature [1], [2]. We constructed such a dataset paying close attention to the following challenges:

Generality: Unfortunately, most existing botnet datasets suffer from a generality issue, i.e., they include data from only a few botnets (usually two or three samples). Limited in nature (detectors developed in these environments reflect only a small number of characteristics describing a very specific botnet behavior), these approaches are impractical and ineffective in the face of novel threats.

Realism: The effectiveness of the developed approach in practice depends heavily on the realism of the botnet traffic traces used for its evaluation. Botnet traffic is usually generated/captured in a controlled environment. Providing a resilient environment (not detectable by the botnet) in which a botnet performs all of its intended malicious functionality is not trivial. In addition to resiliency, the collection period must be long enough to allow dormant bots to exhibit their functionality.

Representativeness: Another problem with generating botnet data is the ability of the collected network traffic traces to reflect the real environment a detector will face during deployment. Due to privacy concerns, gathering background data in a real production environment is infeasible in most cases; as a result, traffic is either simulated or gathered in a controlled environment. To overcome these challenges, we created an evaluation set combining non-overlapping subsets of the following data:

The ISOT dataset [3], created by merging several available datasets: the French chapter of the Honeynet Project [4], Ericsson Research in Hungary [5], and Lawrence Berkeley National Laboratory [6]. It contains both malicious traffic (traces of the Storm and Zeus botnets) and non-malicious traffic (gaming packets, HTTP traffic, and P2P applications such as BitTorrent). We used 15% and 25% of the ISOT dataset in our training and test datasets, respectively.

The ISCX 2012 IDS dataset [7], generated in a physical testbed implementation using real devices that produce real traffic (e.g., SSH, HTTP, and SMTP) mimicking users’ behavior. We included a subset of its normal traces in our training dataset, and a subset of its normal and IRC botnet traffic in our test dataset.

Botnet traffic generated by the Malware Capture Facility Project [8], a research project aimed at generating and capturing botnet traces over the long term. From this data we extracted four botnet traces (Neris, Rbot, Virut, and NSIS) for our training dataset and seven botnet traces (Neris, Rbot, Virut, NSIS, Menti, Sogou, and Murlo) for our test dataset.

To merge these data traces into one unified dataset we employed the so-called overlay methodology [1], one of the most popular methods for creating synthetic datasets. Malicious data is usually captured by honeypots or by infecting computers with a given bot binary in a controlled environment [9].

Botnet traces can be merged with benign data by mapping malicious data either to machines existing in the home network or to machines outside of the current network [1]. Considering the wide range of IP addresses in the traces, we mapped botnet IPs to hosts outside of the current network using the BitTwist packet generator [10]. Malicious and benign traffic were then replayed using TCPReplay [11] and captured with TCPdump [12] as a single dataset.
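The remapping step above can be illustrated conceptually. The sketch below is a simplification of what the packet rewriting accomplishes, not the BitTwist tool itself; the IP addresses and flow records are hypothetical examples. Bot addresses observed in the captured malicious traces are rewritten so that malicious flows appear to involve hosts outside the monitored network before being merged with benign traffic:

```python
# Conceptual sketch of the overlay remapping step: bot IPs observed in the
# captured malicious traces are rewritten to addresses outside the monitored
# network. All addresses below are hypothetical examples.

# Mapping from original (internal) bot IPs to external replacement IPs.
IP_MAP = {
    "192.168.1.10": "203.0.113.5",   # bot host -> address outside the network
    "192.168.1.11": "203.0.113.6",
}

def remap_flow(flow, ip_map):
    """Return a copy of a (src, dst, proto) flow tuple with bot IPs remapped."""
    src, dst, proto = flow
    return (ip_map.get(src, src), ip_map.get(dst, dst), proto)

# Toy malicious flow records standing in for packets in a botnet trace.
malicious_flows = [
    ("192.168.1.10", "10.0.0.2", "tcp"),
    ("10.0.0.2", "192.168.1.11", "udp"),
]

remapped = [remap_flow(f, IP_MAP) for f in malicious_flows]
print(remapped)
# -> [('203.0.113.5', '10.0.0.2', 'tcp'), ('10.0.0.2', '203.0.113.6', 'udp')]
```

In the actual workflow, the equivalent rewriting is performed at the packet level on the pcap traces before they are replayed and captured as a single dataset.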

Table 1: Distribution of botnet types in the training dataset

Botnet name | Type | Portion of flows in dataset
Neris | IRC | 21159 (12%)
Rbot | IRC | 39316 (22%)
Virut | HTTP | 1638 (0.94%)
NSIS | P2P | 4336 (2.48%)
SMTP Spam | P2P | 11296 (6.48%)
Zeus | P2P | 31 (0.01%)
Zeus control (C&C) | P2P | 20 (0.01%)

The resulting set was divided into training and test datasets that included 7 and 16 types of botnets, respectively. Tables 1 and 2 detail the distribution and type of botnets in each dataset. Our training dataset is 5.3 GB in size, of which 43.92% is malicious and the remainder contains normal flows. The test dataset is 8.5 GB, of which 44.97% is malicious flows. We included a greater diversity of botnet traces in the test dataset than in the training dataset in order to evaluate the novelty detection a feature subset can provide.
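The non-overlapping selection of subsets (e.g., the 15%/25% ISOT proportions mentioned earlier) can be sketched as follows. This is a minimal illustration with made-up flow identifiers, not the actual tooling used to build the dataset:

```python
import random

def split_disjoint(items, train_frac, test_frac, seed=0):
    """Draw two non-overlapping random subsets (train/test) from `items`."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    # Disjoint by construction: test starts where train ends.
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Hypothetical flow identifiers standing in for flows in a source trace.
flows = [f"flow-{i}" for i in range(1000)]
train, test = split_disjoint(flows, 0.15, 0.25)

assert not set(train) & set(test)   # the two subsets never overlap
print(len(train), len(test))        # -> 150 250
```

Taking the two subsets from disjoint slices of a single shuffled pool guarantees that no flow appears in both the training and the test data, which is what makes the evaluation of novelty detection meaningful.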