Practical experience is essential for learning how to perform Internet data analysis, and computer science students should have the opportunity, and be encouraged, to work with real Internet data. While traffic data from operational Internet links is difficult to obtain due to privacy constraints, we are able to collect samples of darkspace traffic, which is more readily amenable to capture and analysis because it does not result from meaningful bidirectional communications.

A darkspace is a segment of globally routable Internet address space that has no active hosts. All traffic arriving at such IP darkspace is unsolicited and unidirectional; it is sometimes called Internet Background Radiation (IBR). Observing and analyzing darkspace traffic can facilitate study of security-related Internet phenomena such as denial-of-service attacks from randomly spoofed sources, the automated spread of Internet worms and viruses, scanning of address space by attackers or malware looking for vulnerable targets, and various botnet activity. Darkspace traffic has also been used to analyze macroscopic Internet events unrelated to malware, such as country-level censorship of Internet communications and natural disasters affecting reachability of significant regions of Internet infrastructure. The fact that darkspace traffic is less sensitive from a privacy standpoint, but still highly relevant to many Internet research questions, makes these data ideal for educating students on Internet data analysis methods, tools, and issues.

The UCSD Network Telescope is monitoring instrumentation that collects traffic destined to a large segment of dark (unassigned) address space (also known as an Internet darkspace, darknet, or blackhole). The UCSD Telescope's darkspace is a globally routed /8 network (approximately 1/256th of all IPv4 Internet addresses) that carries almost no legitimate traffic because most IP addresses in this prefix are not assigned to any hosts. After discarding traffic to the few hosts with assigned IP addresses, the remaining packets represent a continuous sample of anomalous unsolicited traffic [5]. (Active hosts also receive unsolicited traffic but we choose not to examine it for privacy reasons.)

We store traffic data captured by the UCSD Network Telescope in pcap format, each file containing all packets observed in one hour, each packet timestamped when the telescope receives it. The timestamps stored in the pcap files are in the epoch time format representing the number of seconds elapsed since January 1, 1970 midnight (UTC).

To teach methods and tools for darkspace traffic analysis, we built this educational data kit around a specific activity relevant to global Internet security: "Patch Tuesday" (PT) [3]. Microsoft releases accumulated security patches on the second Tuesday of each month at 10:00 AM local time in Redmond, WA, which corresponds to either 17:00 UTC (when daylight saving time is in effect) or 18:00 UTC (otherwise). After Patch Tuesday, attackers sometimes use the released patch information to exploit vulnerabilities on unpatched machines, or they may check whether previously exploited security holes remain open. In general, launching new malware immediately after Patch Tuesday maximizes the potential duration of the malware's effectiveness before the next patch release. Our proposed exercises aim to investigate the impact of patching strategies on the observable characteristics of unsolicited traffic.

In April 2012, Patch Tuesday fell on April 10 at 17:00 UTC. The raw PT data set from which we curated the educational data kit consists of 720 pcap files of darkspace traffic captured by the UCSD Network Telescope throughout April 2012: 1 file per hour × 24 hours per day × 30 days. The pre- and post-release data establish a baseline for studying the effects of this PT update.

We removed the payload from the captured packets and zeroed out the first eight bits of the destination IP address to anonymize the address range of the UCSD Network Telescope darkspace. We also anonymized the source IP addresses in the data because some packets originated from victims of DDoS attacks (backscatter data); we hide the IP addresses of victim hosts to protect them from further malicious activity.

Throughout this tutorial, we demonstrate processing steps on small subsets of the data, and provide computed results for the whole data set for further analysis.

Most of the exercises can be done using the aggregated data in the FlowTuple format (see footnote 1) rather than raw data in the pcap format. Therefore, we illustrate the aggregation process on just one hour of the raw pcap data (example.pcap.gz) and provide 720 aggregated FlowTuple files prepared for the whole PT data set.

Next, we show how to compute statistics of the darkspace traffic from the aggregated data, using as an example a short list of just three FlowTuple files given in the file example_ftlist.txt. Again, we provide the computed statistics for the whole PT data set.

Table 1 shows all the files comprising the educational data kit. The files with the suffix .txt are in ASCII format and can be viewed with any standard text editor or viewer. The pcap file example.pcap.gz and the FlowTuple file example.flowtuple.cors.gz can be displayed using tcpdump and cors2ascii, respectively, as described in Section 4.1.

| Filename | Description | Size |
|---|---|---|
| example.pcap.gz | one hour of compressed raw pcap data from the PT data set | 4.6 GB |
| example.flowtuple.cors.gz | compressed FlowTuple data generated from the example pcap file | 12 MB |
| ucsd.[epoch_time].flowtuple.cors.gz | 720 compressed hourly FlowTuple files generated for the whole PT dataset (30 days of April 2012, 24 files per day) | 390 GB |
| example_ftlist.txt | a short list of three FlowTuple files from the PT dataset (to be used as an example) | 279 B |
| apr2012_ftlist.txt | complete list of the 720 FlowTuple files generated for the whole PT dataset | 65 kB |
| apr2012_pkt_count.txt | timestamp of the beginning of an hour and the number of packets per hour for the whole month of April 2012 | 16 kB |
| apr2012_src_count.txt | timestamp of the beginning of an hour and the number of unique source IP addresses per hour for the whole month of April 2012 | |

Table 1: Files comprising the educational data kit

Exercises described in this tutorial require three tools to be installed on a local host: Corsaro [9] (with the libtrace library); Octave [2] (we used v3.6.1); and tcpdump [4]. We recommend a Linux system with at least 4 GB memory, although we have tested the tools on other platforms (FreeBSD, OSX, etc.) as well.

CAIDA developed the pcap-trace processing software Corsaro in order to analyze darkspace traffic, although it would work on other types of passive traffic trace data. CAIDA has published a complete description of Corsaro features [9] and installation instructions [11]. To use the FlowTuple data provided, Corsaro should be configured using the following command:

Octave is a high-level interpreted language, primarily intended for numerical computations and also providing extensive graphics capabilities. Octave normally works through its interactive command line interface, but can also be used to write non-interactive programs. Its language is quite similar to Matlab so that most programs are easily portable. This software is distributed under the terms of the GNU General Public License.

All Octave instructions shown in our exercises are available as small scripts that one can start from the Octave command line. The option -q prevents Octave from displaying its startup message. Table 2 gives an overview of the scripts.

Raw telescope data are stored in hourly files in pcap format. To get acquainted with the data, one can display the content of a pcap file as ASCII text using the tcpdump command with the -r option. Other useful options, which make the output much faster, are -n, which prevents tcpdump from converting IP addresses to domain names, and -t, which prevents printing a timestamp for each packet. So, if the pcap file is compressed, the command:

Corsaro takes the pcap file as input and writes an aggregated FlowTuple file as output (cf. Figure 1). In the FlowTuple output file, the data is broken into intervals, each representing 60 seconds of data. Within an interval, each unique key (unique combination of the FlowTuple fields) observed in the raw pcap data is reported on a separate line in the following format:

The < value > that follows the FlowTuple fields is the number of packets in the interval whose header fields match this FlowTuple key.

In our first exercise, we process a file of pcap data example.pcap.gz into the FlowTuple file:

corsaro -o example.%P.cors.gz example.pcap.gz

The %P in the filename will be substituted with the plugin names included during the Corsaro installation. For example, if Corsaro is run with the FlowTuple plugin enabled (the default), Corsaro will create a FlowTuple output file by substituting the %P with the string flowtuple. In the example given above, Corsaro would create the output file example.flowtuple.cors.gz. By adding the extension .gz to the output file name, we ensure that Corsaro automatically compresses the output file using gzip, further reducing storage requirements.

Next, the cors2ascii command will display the FlowTuple output in a human-readable ASCII format:

Each line in this output follows the FlowTuple line format described above and shows the eight FlowTuple fields separated by |. As explained in Section 2.3, the first octet of the destination IP address is set to 0. The last value on each line shows how many packets with this unique combination of fields were in the input pcap file.

The exercises in Section 4.2 and Section 4.3 use aggregated FlowTuple files as input. To reduce bandwidth requirements for downloading the data kit, we provide 720 FlowTuple files ucsd-nt.anon.[epoch_time].flowtuple.cors.gz precomputed from the 720 original hourly pcap files of the UCSD Network Telescope Patch Tuesday data collected in April 2012. The [epoch_time] field in each file name shows the starting time of the hour of data in this file in the epoch format.
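The naming scheme can be illustrated with a short Octave sketch that generates the 720 hourly start times and the matching file names. It assumes the names follow the ucsd-nt.anon.[epoch_time].flowtuple.cors.gz pattern exactly and that the first file starts at April 1, 2012 00:00 UTC (epoch time 1333238400). The variable names are ours and are not part of the data kit.

```octave
% Sketch: hourly start times and FlowTuple file names for April 2012.
% Assumption: file names follow ucsd-nt.anon.[epoch_time].flowtuple.cors.gz.
first_hour = 1333238400;                 % April 1, 2012 00:00 UTC
epochs = first_hour + 3600 * (0:719);    % 24 hours x 30 days = 720 start times
names = arrayfun(@(t) sprintf('ucsd-nt.anon.%d.flowtuple.cors.gz', t), ...
                 epochs, 'UniformOutput', false);
printf('%s\n', names{1});                % first hourly file
printf('%s\n', names{end});              % last hourly file
```

A list built this way should correspond to the entries in apr2012_ftlist.txt, which remains the authoritative list.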

To compute hourly packet rates for the whole dataset, we need to sum the packet count values in each hourly FlowTuple file. We use the Corsaro FlowTuple aggregation tool, cors-ft-aggregate, as follows:

The option -i 3600 specifies an aggregation interval of one hour (3600 seconds). The option -v packet_cnt indicates that we want to aggregate the packet count value. To illustrate how the aggregation tool works, we input the file example_ftlist.txt, which lists 3 of the 720 FlowTuple files comprising the full PT data set.

The output file example_pkt_count.txt contains three lines for each one-hour FlowTuple file listed in the input file. The first line shows the epoch time of the start of the one-hour (3,600 s) interval. The second line shows the aggregated FlowTuple fields for this interval. All field values are equal to 0 because we aggregate over all of them; the only non-zero value (separated by a ',') is the total packet count per interval. The third line shows the end time of the interval in the epoch format.

The first value on each line shows the start of an hour interval in epoch time format. The second value shows the number of packets observed in that hour. Since the input file example_ftlist.txt lists three hourly files, the output file example_pkt_count_ts.txt has three lines.

| Type (column) | Protocol | Dest. IPs | Ports | Packets |
|---|---|---|---|---|
| μTorrent (14) | any | any | any | μTorrent packets |
| Conficker-C (15) | any | any | any | Conficker-C |
| 1 or 2 packets (16) | any | any | any | < 3 packets |
| TCP Probe (3) | TCP | one | one | all |
| TCP Vert. Scan (4) | TCP | one | multi | all |
| TCP 445 Horiz. Scan (20) | TCP | multi | 445 | all |
| TCP Horiz. Scan (5) | TCP | multi | one | all, except to dport 445 |
| TCP Backscatter (17) | TCP | any | any | TCP-ACK, TCP-RST |
| UDP Probe (7) | UDP | one | one | all |
| UDP Vert. Scan (8) | UDP | one | multi | all |
| UDP Horiz. Scan (9) | UDP | multi | one | all |
| DNS Backscatter (18) | TCP/UDP | any | any | source port 53 |
| TCP and UDP (13) | TCP/UDP | any | any | TCP and UDP |
| ICMP Backscatter (19) | ICMP | any | any | Time Exceeded, Dest. Unreach. |
| ICMP Only (11) | ICMP | any | any | all |
| TCP Unknown (6) | TCP | multi | multi | all remaining TCP |
| UDP Unknown (10) | UDP | multi | multi | all remaining UDP |
| Unclassified (2) | any | any | any | all remaining |

Table 3: Source types in descending order of classification

We pre-processed all 720 hourly FlowTuple files in the PT dataset by applying cors-ft-aggregate and cors-ft-timeseries.pl to all PT hourly files in apr2012_ftlist.txt. The resulting time series of the hourly packet rates for the whole month of April 2012 is available in the file apr2012_pkt_count.txt.

To determine the overall number of unique source IP addresses seen per hour in the data, we again use the aggregation tool cors-ft-aggregate, but with different options. This time we want to aggregate all FlowTuples using only the source IP address as a key:

Here the option -i 3600 again specifies the desired aggregation interval of one hour. The option -v src_ip indicates that we want to aggregate over the unique source IP addresses. Again, we input a small subset of FlowTuple files listed in the file example_ftlist.txt to illustrate the aggregation process.

The format of the resulting file is similar to the output in the previous exercise. There are three lines of output for each input file. These lines show the interval starting and ending times, and the number of unique source IP addresses observed during that hour:

As in Section 4.2, we use the perl script cors-ft-timeseries.pl to combine the interval start time and the number of unique sources sending traffic to the UCSD Network Telescope during that hour onto one line:

The two values per line are the start of an hour interval in epoch format, and number of unique source IP addresses observed in that hour.

To obtain hourly counts of unique source IP addresses for the whole PT dataset, we again pre-processed all 720 hourly FlowTuple files using cors-ft-aggregate, cors-ft-timeseries.pl and the full list of PT hourly files in apr2012_ftlist.txt. The resulting time series of the number of unique source IP addresses per hour for April 2012 is available in the file apr2012_src_count.txt.

Another Corsaro plugin for aggregating the raw pcap data is smee (cf. Figure 1). This plugin is a Corsaro implementation of the IATmon tool [8], which classifies sources of observed darknet traffic into 18 mutually exclusive types based on protocol and temporal patterns across a configured time interval. All packets observed in this time frame are first aggregated according to their source IP address, and then each source IP address is assigned to one of the 18 source types based on the type and pattern of packets it generates. Table 3 lists attributes of the source types.

Again, we use a small subset of the compressed original data in file example.pcap.gz to illustrate the source type classification with smee:

corsaro -p smee -o example.%P.cors.gz example.pcap.gz

The -p smee option selects the Corsaro plugin to run; the plugin name becomes part of the output file name, for example: example.smee.cors.gz (compressed using gzip).

Note: Corsaro can run with multiple plugins. In particular, one can generate both the FlowTuple aggregation (see Section 4.1) and the source type classification with a single command by including both plugins ('-p flowtuple -p smee') in the command line, producing two output files (one for each plugin).

The output file example.smee-sum.cors.gz produced above contains the source type analysis for the packets recorded in the one-hour long example pcap file example.pcap.gz: the number of source addresses per source type, the number of packets per source type, and some additional information.

To save students the effort of running the source type analysis for all pcap files in the PT dataset, we generated the file apr2012_src_types.txt, which contains the time series of the number of unique source IPs per hour for each source type. Each line of the file accounts for one hour of data and contains 22 values. The first value is the hour start time in the epoch format, while columns 2-20 contain the number of source IPs observed during this hour and attributed to a given source type. The number in parentheses in Table 3 shows which position in a line of apr2012_src_types.txt contains the data for this source type. For example, the source type 'TCP vertical scan' in the table is marked with '(4)'. In the following output we see that during the first hour of the collected data we observed 997 unique IP addresses classified as the source type 'TCP vertical scan' (column 4). The last two values in each line are the date in YYYYmmDDHHMM format (position 21) and the sum of the counts of source IPs for all types for this hour (position 22).

The objective of the following exercises is to analyze whether one can discern any unusual characteristics of unwanted darkspace traffic during or after the April 2012 Patch Tuesday. First, we consider the overall packet count, the number of sources that contribute to the unwanted traffic, and the distributions of protocols and destination ports observed in the captured packets. Next, we look into the number of packets and the number of sources per source type to find out if we observe any unusual behavior attributable to specific source types. Finally, we analyze the temporal behavior of the two time series formed by the packet count per hour and the number of sources per hour.

In this section we apply Octave scripts to the prepared files apr2012_pkt_count.txt, apr2012_src_count.txt and apr2012_src_types.txt for the data analysis and visualization of the results.

In order to check for unusual patterns in the overall amount of traffic, we analyze a discrete time series formed by the number of packets per hour observed by our darkspace monitor. In Section 4.2 we showed how to extract these hourly numbers of packets from the aggregated FlowTuple files. We performed the necessary calculations for the whole PT darkspace data set of 720 hourly files collected in April 2012 and use the prepared file apr2012_pkt_count.txt for the analysis described in this section.

We use the Octave software to plot and analyze packet rates over time. We configure the Octave environment to enable printing of numbers in a field wider than 10 characters long and also change the default font to Helvetica and font size to 20:

> format long
> set(gca, 'fontname', 'Helvetica', 'fontsize', 20);

Then we load the file apr2012_pkt_count.txt into a matrix apr2012_pkt using the Octave function csvread with the filename as an argument:

> apr2012_pkt=csvread('apr2012_pkt_count.txt');

This command creates the matrix apr2012_pkt in Octave and reads the comma separated values from the file apr2012_pkt_count.txt into the matrix. Typing the name of the matrix will display its contents:

One can also use indices to access different parts of the matrix. For example, typing apr2012_pkt(2,1) will print the matrix element in the second row and the first column. The operator `:' is used to select all elements of a row or column. We can access the entire first column containing the timestamps as apr2012_pkt(:,1) while the second column containing the number of packets per hour can be accessed by apr2012_pkt(:,2). For instance, the following command will calculate the total number of packets observed in April 2012 by summing up the values in the second column of the matrix:

> apr2012_pktcnt=sum(apr2012_pkt(:,2))
apr2012_pktcnt = 108687558356

Next, we plot the number of packets per hour vs. time. To make it easier to see when Patch Tuesday took place, we want to display the timestamps, given in the epoch format, as the actual dates and hours in a human readable form. Thus, we apply the function datenum to convert the epoch time into the datenum format, which represents the time as the number of days starting from Jan 1, 0000.

datenum (year, month, day, hour, minute, second)

To convert the epoch time to the datenum format, we set the values for year, month, day, hour and minute to Jan 1, 1970 midnight (when epoch time starts) and then add the epoch time as number of seconds. For the first element of the matrix apr2012_pkt (row=1, column=1) the corresponding datenum value is 734960:

> datenum(1970,1,1,0,0, apr2012_pkt(1,1))
ans = 734960
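The same conversion can be combined with datestr to print readable dates. The following is a minimal sketch rather than one of the data kit scripts: the helper to_datenum and the variable names are ours, and the two epoch values are the Patch Tuesday release hour (April 10, 2012 17:00 UTC) and the first hour of the following Wednesday.

```octave
% Sketch: convert epoch timestamps to readable UTC date strings.
pt_epoch = 1334077200;    % Patch Tuesday release hour, April 10, 2012 17:00 UTC
ew_epoch = 1334102400;    % first hour of Wednesday, April 11, 2012 00:00 UTC

% Epoch time counts seconds from Jan 1, 1970 midnight, so we pass the
% epoch value as the "seconds" argument of datenum.
to_datenum = @(t) datenum(1970, 1, 1, 0, 0, t);

pt_label = datestr(to_datenum(pt_epoch), 'yyyy-mm-dd HH:MM');
ew_label = datestr(to_datenum(ew_epoch), 'yyyy-mm-dd HH:MM');
disp(pt_label);
disp(ew_label);
```

In the datestr format string, lowercase mm stands for the month and uppercase MM for the minutes.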

To plot the data over time, we convert the epoch times in the entire first column apr2012_pkt(:,1) into the datenum format and then plot the number of packets stored in the second column apr2012_pkt(:,2):

The stem command creates a stem plot, which represents each value by a vertical line. The datetick function changes the tick labels on the x-axis from datenum values into a date format, here the day of the month (formats with year, month, day, hour, and minutes, or a combination thereof, are also available). We name the axes using the xlabel and ylabel commands and give the graph a title. To plot the number of packets in millions, we divide the values in the second column by 10^6. Figure 2 shows the resulting plot, revealing no unusually high volume of packets on or around Patch Tuesday.

Figure 2: Number of packets per hour vs. time

Next, we check the maximum and minimum hourly packet counts in the PT dataset using the functions max() and min(), respectively (shown below). Each function returns the maximum (or minimum) value of the dataset and the index (that is, the row of the matrix) where that value is found. So, if the maximum were in the third row of the dataset, we would get the index 3. Using the index, we can then display the whole row (with a timestamp and a packet count) and thus find out when the maximum (or minimum) was observed:

The maximum packet count per hour is 239,613,469, observed in row 93 of the matrix apr2012_pkt at epoch time 1333569600, and the minimum hourly packet count of 99,396,882 is observed in row 461 at epoch time 1334894400. The datestr function, applied to the corresponding datenum values, renders these times in a more readable format, revealing that the maximum hourly packet count in April occurred on April 4, 2012 at 20:00 while the minimum hourly packet count occurred on April 20, 2012 at 04:00. Both dates appear unrelated to Patch Tuesday (April 10).

Finally, we find the mean value of 150,954,942.16 packets per hour in April 2012 using the following:
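As an illustration of these steps, here is a minimal sketch that runs max(), min(), and mean() on a small synthetic matrix with the same two-column layout as apr2012_pkt. The packet counts are made up for the example, and the variable names are ours. With the real data, one would load apr2012_pkt_count.txt with csvread instead.

```octave
% Synthetic stand-in for apr2012_pkt (values are illustrative only):
% column 1 = epoch time of the hour start, column 2 = packets in that hour.
pkt = [1333238400 150;
       1333242000 240;
       1333245600  99;
       1333249200 151];

[max_pkt, max_row] = max(pkt(:,2));   % largest count and the row it is in
[min_pkt, min_row] = min(pkt(:,2));   % smallest count and the row it is in
mean_pkt = mean(pkt(:,2));            % mean hourly packet count

% Look up the timestamp of the maximum and make it readable.
max_time = datestr(datenum(1970,1,1,0,0, pkt(max_row,1)), 'yyyy-mm-dd HH:MM');
printf('max %d in row %d at %s\n', max_pkt, max_row, max_time);
printf('min %d in row %d, mean %.2f\n', min_pkt, min_row, mean_pkt);
```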

The goal of this exercise is to discover unusual patterns in the number of active sources that may be related to Patch Tuesday. We plot the number of sources per hour vs. time using the same Octave functions as in Section 5.1:

Figure 3 shows the resulting stem plot and reveals two interesting features: a periodic temporal pattern and an unusually high number of unique sources on April 11, 2012. We will further investigate the apparent periodicity in the hourly counts of unique sources in Section 5.6. To analyze the peak activity, we calculate the maximum, minimum, and mean values of hourly counts of unique sources and the times of their occurrence using the same Octave instructions as in Section 5.1:

The maximum hourly count of active sources, which exceeded the mean value by almost a factor of 4, occurred on April 11, 2012 at midnight, the first night after Patch Tuesday (April 10). We analyze which types of sources became active in Section 5.5.

In this exercise we analyze which protocols are used to send packets to the darkspace and, in particular, whether there are any differences in protocol usage before and after Patch Tuesday. The protocol specified in the IP packet header matches one of the fields in the FlowTuple format. Thus, we use the pre-computed FlowTuple files to generate the overall distribution of protocols for the entire month of April 2012, as well as the protocol distributions for the hour when Microsoft releases patches (Patch Tuesday 17:00) and for one hour of the subsequent day (Exploit Wednesday 00:00).

The file apr2012_ftlist.txt contains a list of all FlowTuple files for April 2012. The file PT_ftlist.txt contains the name of the FlowTuple file for April 10, 2012 17:00 (patch release hour) while the file EW_ftlist.txt contains the name of the FlowTuple file for April 11, 2012 00:00. For each of those files, we aggregate the packet counts per protocol per hour using the Corsaro aggregation tool cors-ft-aggregate:

Option -f specifies that we want to aggregate the protocol field and option -v specifies that we want to report the aggregated value of the packet count. The tool cors-ft-aggregate then sums up all packet counts that have the same protocol number in the FlowTuple interval. The corresponding output files are: apr2012_proto_dist.txt for the whole month of data, PT_proto_dist.txt for the patch release hour, and EW_proto_dist.txt for the first hour of Wednesday following the release. We do some postprocessing with the standard Unix utility sed to output only the protocol number and the associated number of packets as comma separated values.

Next, we input the results into Octave (using the csvread function described in Section 5.1) and compute packet counts per month and for the two special hours.

The last three lines in the script limit the number of ticks on the y-axis to 5 for better readability.

As expected, the three most common protocol numbers stand out: there are maxima in the plotted distributions at protocol numbers 6 (TCP), 17 (UDP), and 1 (ICMP). Yet the distributions differ: during the patch release hour on Patch Tuesday we observe 84% TCP, 12% UDP, and 3% ICMP traffic, while for the midnight hour on Exploit Wednesday the fraction of UDP traffic almost doubles: 76% TCP, 21% UDP, and 3% ICMP traffic.
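Percentages of this kind can be derived from a two-column matrix of protocol numbers and packet counts. The sketch below uses synthetic counts chosen only to illustrate the arithmetic; they are not taken from PT_proto_dist.txt, and the variable names are ours.

```octave
% Synthetic (protocol number, packet count) rows, illustrative values only.
% With the real data one would use csvread on a file such as PT_proto_dist.txt.
proto = [ 1  30;     % ICMP
          6 840;     % TCP
         17 120;     % UDP
         47  10];    % some other protocol

share = 100 * proto(:,2) / sum(proto(:,2));   % percent of all packets
[~, order] = sort(share, 'descend');          % most common protocol first
top3 = proto(order(1:3), 1)';                 % protocol numbers of the top three

for j = order'
  printf('protocol %3d: %5.1f%%\n', proto(j,1), share(j));
end
```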

Figure 4: Distribution of Protocol Numbers. All three panels show that the most frequently used protocols are 6 (TCP), 17 (UDP), and 1 (ICMP). In the top panel, the x-axis looks like a thick line because in a whole month of data, most protocol numbers are observed at least once and the corresponding cross symbols overlap. In contrast, the other two panels each show only one hour of data, and the distributions of observed protocols are sparse.

We compare TCP and UDP destination port numbers in packets collected over the month, and during the two specific hours on Patch Tuesday and Exploit Wednesday.

We use the same commands as in Section 5.3, although this time we aggregate the data by looking at two different fields: protocol number and destination port. That means that each row in the result file aggregates all packets with the same protocol number and destination port. The commands to perform analysis for the TCP packets are shown below. These same steps apply to analyzing UDP packets, but filtering with grep should be for protocol number 17 (UDP) instead of 6 (TCP).

Figure 5 shows the resulting graph. The x-axis is limited to show only the first 500 destination ports using the Octave command xlim([0,500]). One can zoom in to see less common ports using the ylim([0,limit]) command; e.g., ylim([0,5]) restricts the y-axis to values up to 5% of the total packet count, so that ports with small shares become visible.

Figure 5: Distribution of Destination Ports. There is one prominent peak at the same destination port value in all three panels, and a few smaller peaks at varying positions.

Each panel exhibits several peaks. To determine the exact port numbers at which those peaks occur, i.e., the most frequent TCP destination ports, we first sort the rows in the matrix according to the third column (number of packets) in descending order and then look at the top five port numbers in the sorted data:
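A minimal sketch of this sorting step, using sortrows on a small synthetic matrix with the layout described above (protocol number, destination port, number of packets). The counts are made up for illustration, and the variable names are ours.

```octave
% Synthetic rows: protocol number, destination port, packet count.
tcp = [6   22  40;
       6   80  75;
       6  445 900;
       6 3389 120;
       6 1433  60;
       6   23  15];

sorted = sortrows(tcp, -3);      % negative column index = descending order
top5_ports = sorted(1:5, 2)';    % destination ports of the five largest counts
disp(top5_ports);
```

Passing -3 to sortrows sorts by the third column in descending order, so the first rows of the result hold the most frequently targeted ports.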

The destination port 445 received the largest number of TCP packets for the whole month of April as well as during the individual hours on Patch Tuesday and Exploit Wednesday. TCP scans to port 445 are common, in large part due to scanning activity of Conficker-infected hosts [7].

Since we observed an unusually high hourly number of sources at midnight following Patch Tuesday, we want to analyze what types of darknet traffic sources became active. Pre-computed results of the source type analysis for the whole month of April are stored in the file apr2012_src_types.txt. Again, we use Octave and read the whole file into one matrix. We can look at the number of sources for different source types by varying the column that we select for the y-axis in the plots. In the following example, we plot the data in the fifth column, which contains the number of sources per hour of the type "TCP horizontal scans" (see Table 3 in Section 4.4 to map columns to source types).

To find out which source types were unusually active during the first hour of Exploit Wednesday (00:00), we compare the number of sources of each type during that hour with this type's mean hourly count. To do this analysis, we first need to find where in the matrix src_types the data for the Exploit Wednesday 00:00 hour is stored. We know that this hour corresponds to an epoch time of 1334102400. The Octave command find takes a logical condition on the epoch time column as input and returns the corresponding index (the row number):

> index=find(src_types(:,1)==1334102400)
index = 241

The EW hour data is stored in the 241st row.

We then use a for loop in Octave to calculate the mean and ratio for each of the columns (each source type). We store the mean value for each source type i into a vector mean_v(i) and the ratio of the number of sources during the EW hour to the mean number of sources in the vector ew_mean_ratio(i).
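The loop can be sketched as follows on a small synthetic matrix. The three count columns and their values are illustrative only (the real src_types matrix has 22 columns), and everything except the names mean_v and ew_mean_ratio is our own naming.

```octave
% Synthetic stand-in for src_types: column 1 = epoch time,
% columns 2-4 = number of sources per hour for three source types.
src_types = [1334098800 10 100 5;
             1334102400 40 110 5;
             1334106000 10  90 5];

index = find(src_types(:,1) == 1334102400);   % row of the EW 00:00 hour
ncols = size(src_types, 2);
mean_v = zeros(1, ncols);
ew_mean_ratio = zeros(1, ncols);

for i = 2:ncols                               % skip the timestamp column
  mean_v(i) = mean(src_types(:,i));           % mean hourly count of type i
  ew_mean_ratio(i) = src_types(index,i) / mean_v(i);
end
disp(ew_mean_ratio(2:end));
```

A ratio well above 1 marks a source type that was unusually active during the selected hour.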

When displaying the values stored in ew_mean_ratio, we find the highest values in columns 10 (6.26) and 16 (4.58). Column 10 contains the source type 'UDP unknown' and column 16 the source type '1 or 2 packets' (cf. Table 3). So, these two source types contribute significantly to the increase in the overall number of sources. Figure 6, produced by the following Octave code, illustrates this behavior.

This exercise is advanced since it requires basic knowledge of signal processing.

Time series of hourly counts of packets (Figure 2) and of unique source addresses (Figure 3) both seem to exhibit periodic variations, the temporal pattern being more noticeable for the latter. In this Section, we show how to analyze the temporal behavior of these darknet traffic characteristics by using the frequency spectrum of the time series signal.

The frequency spectrum represents the time series signal as a superposition of multiple sinusoidal signals. Each sinusoidal signal is defined by its amplitude, frequency, and phase shift. The Fourier transform is a mathematical procedure used to calculate the frequency spectrum and convert the signal from the time domain into the frequency domain.

Our time series is a discrete signal, as it contains one packet count (or source count) value per hour. To transform a discrete signal of finite length we use the Discrete Fourier Transform (DFT). For a time series of N data points $x_0, \ldots, x_n, \ldots, x_{N-1}$, the DFT calculates a set of N complex numbers $X_0, \ldots, X_k, \ldots, X_{N-1}$:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, \ldots, N-1$$

where each complex number $X_k$ represents a sinusoidal signal that contributes to the time series. Our time series has one data point per hour and contains data from 30 days. In total we have N = 24 × 30 = 720 data points, which means N = 720 complex numbers in the frequency domain.
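To make the definition concrete, the following sketch evaluates the DFT sum directly for a short, arbitrary real signal and compares the result with Octave's fft function. The signal values and variable names are ours and serve only as an illustration.

```octave
% Evaluate the DFT definition directly and compare with fft().
x = [4 1 3 2 5 0 2 1];             % arbitrary short real-valued signal
N = numel(x);
n = 0:N-1;                         % time index (row vector)
k = (0:N-1)';                      % frequency index (column vector)

% k*n is an N-by-N matrix of products, so each row of the exponential
% matrix holds the factors exp(-i*2*pi*k*n/N) for one value of k.
X_def = exp(-2i * pi * k * n / N) * x(:);
X_fft = fft(x(:));

max_err = max(abs(X_def - X_fft)); % should be at rounding-error level
printf('largest difference: %g\n', max_err);
```

The first coefficient (k = 0) equals the sum of all samples, which is why it reflects the constant offset of the signal.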

A fast way to calculate the DFT is the Fast Fourier Transform (FFT) algorithm. Octave provides the function fft(x) for calculating the FFT from a discrete time signal with values stored in the vector x:

The packet counts per hour are stored in the second column of the matrix apr2012_pkt; the resulting complex spectrum values are in pkt_fft. The Octave function abs() yields the amplitudes (the absolute values) of the complex numbers calculated by the FFT.

We plot the amplitudes of the calculated spectrum with the index k on the x-axis and the vertical line showing the signal amplitude at the corresponding k:

We only need to plot the first N/2 coefficients because, for a real-valued time series, the amplitude spectrum is symmetric: the amplitudes at k and N−k are equal. Also, since the first value $X_0$ at k=0 represents the signal offset, we exclude it from the plot of the frequency spectrum in order to make the other frequencies more visible. Note that Octave indexing always starts at 1, whereas k in the DFT formula starts at 0. So, to plot the data from k=1 to k=N/2 we need to use the indices ind=k+1=2 to ind=k+1=(N/2)+1.

Figure 7 shows the resulting plot. Each k corresponds to the number of cycles of the sinusoidal signal within the whole duration of the analyzed time series (720 hours). Therefore, the period of the k-th signal is $p_k = (720 \text{ hours})/k$ and the corresponding frequency is $f_k = 1/p_k = k/(720 \text{ hours})$.

In the diagram we see a high amplitude for the sinusoidal signal with k=30. The period for this signal is $p_{30} = (720 \text{ hours})/30 = 24$ hours, which corresponds to a diurnal pattern in the data. Diurnal patterns are common in Internet traffic data since many hosts are turned off at night.
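The relation between k and the period can be checked on synthetic data. The sketch below builds a 720-point hourly series with a constant offset and a single 24-hour cosine component (made-up amplitudes, not the telescope data), computes its amplitude spectrum, and locates the strongest component among k=1 to k=N/2.

```octave
% Synthetic hourly series: constant offset plus one 24-hour cycle.
N = 720;                              % 30 days x 24 hours
t = 0:N-1;                            % time in hours
x = 150 + 20 * cos(2 * pi * t / 24);  % illustrative values only

amp = abs(fft(x));                    % amplitude spectrum
k = 1:N/2;                            % skip k=0 (the offset), keep first half
[~, pos] = max(amp(k + 1));           % Octave index is k+1
k_peak = k(pos);
period_hours = N / k_peak;            % p_k = (720 hours)/k

printf('peak at k=%d, period %g hours\n', k_peak, period_hours);
```

For this synthetic input the peak falls at k=30, that is, 30 cycles in 720 hours or one cycle per day.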

Figure 7: Spectrum for number of packets per hour

We perform the same steps to calculate and plot the spectrum of hourly counts of unique source addresses (Figure 8):

The values of k and N remain the same. Again, we see a strong spectrum amplitude at k=30 that corresponds to a 24-hour pattern. We also observe a peak at k=60, which indicates a 12-hour pattern. One possible explanation for this pattern is software on the sending hosts that uses a 12-hour time format for some of its internal functions.

We have described an educational data kit curated from darkspace traffic observed by the UCSD Network Telescope in April 2012, and presented a set of exercises developed for this data. After completing the exercises, students will be familiar with the raw format used to store the captured traffic packets (pcap), various aggregations of the raw data, and basic statistics commonly used for traffic characterization. They will acquire hands-on experience in using the specialized traffic analysis software Corsaro and the high-level numerical scripting language Octave. The students should be able to apply this knowledge to analyze other packet trace data available in pcap format, including bidirectional data. One can find additional relevant reading in CAIDA's online papers directory [1] by filtering for the keyword "network telescope".

Acknowledgments

The work was supported by U.S. NSF grants II-EN-1059439 and CNS-1228994, DHS S&T Cyber Security Division (DHS S&T/CSD) Cooperative Agreement FA8750-12-2-0326 (PREDICT project) and by Fraunhofer FOKUS. This material represents the position of the authors and not of the sponsoring agencies.

Footnotes:

1 Described in Section 4.1, this format retains only a certain subset of fields from the captured packets, greatly reducing the data volume and the associated bandwidth, storage, and processing requirements.