method based on the K-Means clustering algorithm and the HITS and PageRank algorithms. A weight is extracted from the anomalous behavior detected by K-Means clustering and is used in calculating the energy rank of the second algorithm. When the K-Means clustering algorithm is applied to network monitoring data, it can be used to detect intrusions [3], and we use it to detect anomalous machines. In [4], it is said that the weight in the rank evaluation could be chosen based on different factors. We have chosen the weights more accurately than [4]. By executing the combined method, we found a larger set of IP addresses of spam machines and perceptibly increased the accuracy of the algorithms.

Keywords: spam, clustering method, K-Means clustering algorithm, HITS algorithm, anomalous behavior.

1- Introduction

Spam is a side effect of free email service and has become a serious problem that threatens every Internet user. According to a MessageLabs report [1], 60% of email traffic is spam. Although different methods for combating spam have been proposed, spam messages are still sent to users’ mailboxes. This happens because many spam detection methods use filtering.

There are different methods for preventing spam. Most organizations and Internet Service Providers (ISPs) use spam filters which are installed on mail servers. These filters extract keywords and other signatures and use statistical and heuristic methods to determine whether an email is spam. But spam senders use complicated methods for combining contents intelligently to mislead content-based filters. Thus, content-based filters do not perform well [2].

Most spam research concentrates on post-send methods, which detect spam after it has been sent, but most of the damage caused by spam occurs before these detection methods are applied. These methods are not able to reduce the overhead, bandwidth, processing power, time, and memory consumed by spam.

In this paper, we identify machines that are sending spam or machines that are compromised and are distributing spam. This work is done in two parts. First, with a clustering algorithm [3], machines are separated into normal and anomalous clusters. The extracted features are based on the volume of traffic the machines are sending (number of packets, bytes, and flows). Then, based on ranking and link analysis methods [4] and with the weight extracted from the first part, we detect spam machines. The analysis is done on one day of netflow traffic of a large-scale ISP.

In section II, we review the related work. In section III, we outline the structure of our approach.

[6] are examples of such research. [8][9][12] Numerous spam mitigation techniques try to understand spammers’ behavior. Several studies have used email sinkholes or honeypots to study spammer properties. In these methods, every received email is spam. The data is extracted from different email sinkholes of different domains, and various properties of the network-level behavior of spammers were extracted.

In [15], data was extracted from a limited sinkhole in a domain, and the structural characteristics of scam were studied. But the traces received by these methods are limited to an organizational domain. To obtain a broader view of the spam problem, open relay sinkholes were proposed in [11]. The idea of this method is to set up open relays in such a way that they can be easily detected by spammers but do not send any spam. In this way, information about the source and destination of spam is extracted.

In [9], another method was proposed by Nick Feamster et al. They proposed a method that does not detect spam based on IP address or content filtering but detects spam with behavioral analysis. They used the logs of an organization which had 115 domains and analyzed spam in multiple domains. To classify spam, they clustered IP addresses based on similar behaviors. The idea of their clustering algorithm is “bots of a botnet have similar behavior and send small

the algorithm with k=2 on network traffic data and choose three features: number of packets, number of bytes, and number of flows. So the algorithm clusters the monitoring data into normal and anomalous IP addresses based on the volume of traffic exchanged. After detection of an anomalous IP address, a weight is assigned to it, which is used in rank evaluation. In section

As defined in the PageRank algorithm [4], the weight used in the energy calculation can be assigned based on different factors. In [4], this weight is based on a pre-used value, PScore. We use the weight calculated by the clustering algorithm.

The contributions of this paper are using the K-Means clustering algorithm for detecting spam, combining the two methods with each other, and determining IP weights with the K-Means clustering algorithm and using them in the second method. The combinational method is applied to the sample traffic, and spam-sending machines have been detected.

3-1- Network monitoring traffic

A flow is a summary of the traffic traveling in a session. Each flow contains basic information about the connection, such as IP addresses, source/destination ports, number of packets/bytes transferred, protocol used, connection time, and TCP flags. A flow record does not contain payload information. Email service connections use the SMTP protocol, and their destination port is 25. Thus, the analysis is done on TCP traffic with destination port 25. Because netflow traffic information is at a medium level and does not contain the payload information of a packet, this method does not have the problems of methods that use payload data.

Our test data is the flow records of one week of a large-scale ISP. It contains 158772000 flow records. We used one day of this set and selected the records whose source or destination port is 25. This subset contains 871777 flow records.
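As an illustration, the selection of SMTP flow records and the extraction of the three volume features could be sketched in Python as follows. The record layout (field names such as src_ip and dst_port) and the grouping by source IP address are assumptions made for this example; they are not taken from the dataset described above.

```python
from collections import defaultdict

SMTP_PORT = 25


def smtp_features(flows):
    """Aggregate traffic volume per source IP over SMTP flows.

    Each flow is a dict with the hypothetical keys src_ip, src_port,
    dst_port, protocol, packets and bytes.
    Returns {src_ip: (packets, bytes, flows)}.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for flow in flows:
        # Keep only TCP records whose source or destination port is 25.
        if flow["protocol"] != "TCP":
            continue
        if SMTP_PORT not in (flow["src_port"], flow["dst_port"]):
            continue
        entry = totals[flow["src_ip"]]
        entry[0] += flow["packets"]
        entry[1] += flow["bytes"]
        entry[2] += 1  # one more flow for this source
    return {ip: tuple(entry) for ip, entry in totals.items()}
```

Each resulting tuple can then serve as the feature vector (packets, bytes, flows) of one machine.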

3-2- K-Means clustering algorithm

The K-Means clustering algorithm groups data into K clusters based on their feature values. Objects in a cluster have similar feature values. K is a positive integer that determines the number of clusters and is fixed at the beginning of the execution of the algorithm. The steps of the K-Means clustering algorithm are as follows.

1) Define the number of clusters, K.

2) Define K different centroids, one for each cluster. This is done by arbitrarily dividing the objects into K clusters, determining their centroids, and checking that these centroids are different from each other. Alternatively, the centroids can be initialized to K arbitrarily chosen, different objects.

3) Iterate over all objects and determine the distance of each object to the centroid of each cluster. Each object is assigned to the cluster of the nearest centroid.

4) Recalculate the centroids of the new clusters.

5) Repeat steps 3 and 4 until the centroids no longer change.
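The steps above can be sketched in Python as follows. This is a minimal illustration and not the implementation used in our experiments; the function names, the max_iter limit, and the fixed random seed are assumptions of the example. Centroids are initialized to K arbitrarily chosen, different objects, which is the alternative mentioned in step 2.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of feature tuples into k clusters.

    Returns (centroids, clusters), where clusters[c] lists the points
    assigned to centroids[c].
    """
    rng = random.Random(seed)
    # Step 2: pick k different objects as the initial centroids.
    centroids = rng.sample(sorted(set(points)), k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign every object to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 4: recalculate the centroid of each cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

With feature tuples of the form (packets, bytes, flows), calling kmeans(points, 2) yields the two clusters that are later interpreted as normal and anomalous.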

The distance function used in this algorithm to calculate the distance between two objects is the Euclidean distance, which is defined in formula (1).

$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$    (1)

where x = (x1, x2, …, xm) and y = (y1, y2, …, ym), and m is the number of features. In this paper, the features are the number of packets, number of bytes, and number of flows, and K is 2. We used the K-Means clustering algorithm on the training

it is important to define the number of clusters correctly. We choose K=2, on the assumption that normal and anomalous traffic form two different clusters.

The K-Means clustering algorithm calculates centroids for the normal and anomalous clusters, and these centroids are used for detecting anomalous behavior in the network monitoring traffic. New flow records are preprocessed and transformed, and their feature values are extracted. To detect anomalous behavior, two distance-based methods can be deployed: classification and outlier detection. The two are combined in this paper.

Classification method:

In this method, the distances between the new traffic and the centroids of the clusters are calculated using the Euclidean distance function. The new traffic is classified as normal if it is closer to the centroid of the normal cluster than to the centroid of the anomalous one. This distance-based classification detects the kinds of abnormal traffic whose characteristics are similar to those of the training dataset.

Outlier detection method:

An outlier is an object that differs significantly from other objects and can thus be recognized as an anomaly. For outlier detection, only the distance to the centroid of the normal traffic is calculated. If the distance between the object and this centroid is larger than a predefined threshold, dmax, the object is considered an anomaly.

Combined classification and outlier detection method:

Classification and outlier detection are used in combination to reduce the limitations of each method. When the two methods are used together, an object is considered an anomaly if it is closer to the centroid of the abnormal cluster or if its distance to the centroid of the normal cluster is larger than a predefined threshold.

The combination of classification and outlier detection is used in this paper.
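The combined decision rule can be written as a short Python sketch. The function name and argument order are assumptions of this example; which of the two centroids represents normal traffic, as well as the value of the threshold dmax, must be supplied by the caller.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def is_anomalous(x, normal_centroid, anomalous_centroid, d_max):
    """Combined classification and outlier detection for one object x."""
    d_normal = euclidean(x, normal_centroid)
    d_anomalous = euclidean(x, anomalous_centroid)
    # Classification: closer to the anomalous centroid than to the normal one.
    closer_to_anomalous = d_anomalous < d_normal
    # Outlier detection: farther from the normal centroid than d_max.
    outlier = d_normal > d_max
    return closer_to_anomalous or outlier
```

An object is thus reported as normal only if it passes both tests, which is how the combination reduces the limitations of each method.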

3-3- Email servers’ behavior formation method

Email servers receive emails from and send emails to other email servers. Thus, email servers form a community through their interactions with each other, and they form a bipartite graph. We use the email servers’ behavior to distinguish between normal and anomalous traffic. Bipartite graphs are also used in other domains, such as the web.

3-3-1- Hubs and Authorities

Bipartite graphs have been used for web mining. A bipartite core (i,j) is a bipartite subgraph in which i nodes of one set of nodes are linked to j nodes of another set of nodes. In terms of this graph concept, the i pages that point to other pages are referred to as hubs, and the j pages that are referenced are the authorities. For a set of pages related to a topic, a bipartite core that includes hubs and authorities is determined using the HITS algorithm [18]. Hubs and authorities are important because they serve as good sources of information for that topic. In the domain of email traffic flow, hubs are equivalent to machines that send emails

normal traffic between email servers are then eliminated. In this stage, only edges are removed and not the nodes. This removes the normal email servers’ behavior. The second step identifies machines that behave like servers and have a high volume of outgoing traffic that is not related to regular email connections. These machines are probably spam machines because they send emails to many machines that do not participate in normal email connections.
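The two steps can be illustrated with a small Python sketch of the HITS iteration on a directed graph of (sender, receiver) edges. The sketch makes several assumptions that are not specified in the text: the number of iterations, the L2 normalization, and in particular the rule in remove_core_edges, which treats an edge as normal server-to-server traffic when it goes from one of the top_n hubs to one of the top_n authorities.

```python
import math


def _normalize(scores):
    """Scale scores to unit L2 norm; leave all-zero scores unchanged."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {n: v / norm for n, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (src, dst) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        auth = _normalize(new_auth)
        # Hub score: sum of the authority scores of the nodes it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = _normalize(new_hub)
    return hub, auth


def remove_core_edges(edges, hub, auth, top_n):
    """Drop edges from a top hub to a top authority; nodes are kept.

    The top_n cut-off is an assumption of this sketch.
    """
    top_hubs = set(sorted(hub, key=hub.get, reverse=True)[:top_n])
    top_auths = set(sorted(auth, key=auth.get, reverse=True)[:top_n])
    return [(s, d) for s, d in edges if not (s in top_hubs and d in top_auths)]
```

Running hits again on the edges returned by remove_core_edges gives the second-step scores, in which machines with a high hub score are candidate spam senders.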

3-3-3- Rank evaluation

For each node, a rank is determined based on the email sender score; it is called the spam sending rank [4]. Another metric is then calculated based on the email sending metric and is called the email sending height (PHeight). For the ith node at time t, its height can be determined by formula (4).

$\mathrm{PHeight}_i^t = \log_2\left(1 + \frac{1}{\mathrm{PR}}\right)$    (4)

For a node with high rank, PR=1 and PHeight=1, and for a node with infinite rank, PR=∞ and PHeight=0. Then the rate of change in the rank of a node is calculated over time. The change over the time period ∆t is calculated by formula (5).

$v = \Delta\mathrm{PHeight} / \Delta t$    (5)
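Formulas (4) and (5) can be checked with a few lines of Python. The function names and the use of minutes as the time unit are assumptions of this example.

```python
import math


def pheight(pr):
    """Email sending height of a node with rank pr, as in formula (4)."""
    return math.log2(1.0 + 1.0 / pr)


def rate_of_change(height_before, height_after, delta_t):
    """Rate of change of PHeight over a time period delta_t, as in formula (5)."""
    return (height_after - height_before) / delta_t


# Boundary cases stated in the text.
top = pheight(1)            # PR = 1 gives PHeight = 1
bottom = pheight(math.inf)  # PR = infinity gives PHeight = 0

# A node that moves from rank 4 to rank 1 within 15 minutes (hypothetical values).
v = rate_of_change(pheight(4), pheight(1), 15.0)
```

A node whose rank improves has a positive v, and a node whose rank worsens has a negative v.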

Since we are interested in the magnitude of a change and not in whether it is positive or negative, we take the square of v for our analysis. We also assign a weight to each node based on the results of the K-Means clustering algorithm. This is the result of the combinational method and is the contribution of this paper. As stated in [4], a node can be weighted based on different factors. In [4], weights are chosen based on PR, but we choose weights based on the K-Means algorithm, which increases the accuracy of the rank energy. K-Means is a clustering algorithm and, with K=2, divides IP addresses into normal and anomalous clusters. The anomalous IP addresses are assigned a weight which is used in rank evaluation. The energy rank of each node is measured as in formula (6).

behavior of nodes. Rapid changes are important for the system analyst because they indicate machines that suddenly start sending spam or email servers that are going down.

4- Results evaluation

Experiments were done in three phases. These experiments were executed on one day of netflow traffic of a large ISP. First, the K-Means clustering algorithm was applied to half an hour of netflow traffic, and the information was divided into normal and anomalous clusters. The combined method was applied to 24 hours of data, every 15 minutes of each hour. First, the K-Means clustering algorithm was applied, and if the machine belonged to the anomalous cluster, a weight was assigned to it. The algorithm defined

and anomalous clusters was calculated. If the IP belonged to the anomalous cluster, a weight was assigned to it. Then we applied the HITS algorithm and calculated hub and authority scores for each machine. The relations between email servers with top hub and authority scores were removed, and the HITS algorithm was executed again. In this way, the machines with high hub rank were identified as spam senders. Then the energy rank of the internal IP addresses of the ISP was calculated twice: once based on the weight defined in [4], and once based on the weight assigned by the K-Means clustering algorithm defined in section 3-2. IP addresses with high hub scores gained high ranks.

The results are shown in Table 1. IP address X.133.201.23 has a high hub score in two hours of the day. The energy calculated for this machine with the method proposed in [4], as shown in the table, reports no abnormal behavior. The IP address X.133.203.167 has a high hub rank in 6 hours of the day. The energy calculated with the combinational method is high in the 3rd hour of the day but is not high in the other hours because there is no change in the situation of the system. The method proposed in [4] does not show high energy ranks for some of these times.

The IP address X.133.206.80 has normal behavior.

5- Conclusion

In this paper, a combined method for detecting spam machines was proposed. The combined method is based on two algorithms proposed in [3] and [4]. A weight was assigned to each machine that was identified as anomalous. This weight was used for calculating spam machine ranks in the second method. This work is limited to modeling at the single-node level. Further research can address modeling at the multiple-node level.

Table 1- The results of the combinational method and the simple method on the sample dataset