Computing data mining algorithms such as clustering techniques on massive geospatial data sets is neither feasible nor efficient today. Massive data sets are continuously produced at a rate of over several TB/day. For the case of compressing such data sets, we demonstrate the necessity for clustering algorithms that are highly scalable with regard to data size and utilization of available computing resources in order to achieve overall high-performance computation. In this paper, we introduce a data stream-based approach to clustering, the partial/merge k-means algorithm. We implemented the partial/merge k-means as a set of data stream operators, which adapt to available computing resources such as volatile memory and processors by parallelizing and cloning operators, and by computing k-means on data partitions that fit into memory and then merging the partial results. In our extensive analytical and experimental performance evaluation, we show that the partial/merge k-means outperforms a serial implementation by a large margin with regard to overall computation time, and achieves a significantly higher clustering quality.

1 Introduction

Computing data mining algorithms such as clustering techniques on massive data sets is neither feasible nor efficient today. Massive data sets are continuously produced at a rate of over several TB/day, and are typically stored on tertiary memory devices. To cluster massive data sets or subsets thereof, overall execution time and scalability are important issues.

For instance, the NASA Earth Observation System's Information System (EOSDIS) collects several TB of satellite imagery data daily, and data subsets thereof are distributed to scientists. To facilitate improved data distribution and analysis, we substitute data sets with compressed counterparts. In our compression approach, we use a technique that is able to capture the high-order interaction between the attributes of measurements as well as their non-parametric distribution, and we assume a full dependence of attributes. To do so, we partition the data set into 1 degree x 1 degree grid cells, and compress each grid cell individually using multivariate histograms [6]. For the compression of a single global coverage of the earth for the MISR instrument [22], we need to compute histograms for 64,800 individual grid cells, each of which contains up to 100,000 data points that have between 10 and 100 attributes (see Footnote 1). To achieve a dense representation within the histograms, we use non-equi-depth buckets so that the shapes, sizes, and number of buckets are able to adapt to the shape and complexity of the actual data in the high-dimensional space. To compute the histogram for a grid cell, we use a clustering technique that produces the multivariate histograms. This example motivates the necessity for an approach to implementing clustering algorithms such as k-means that are highly scalable. They need to be scalable with regard to the overall data size, the number of data points within a single grid cell, the dimensionality of data points, and the utilization of available computing resources in order to achieve an overall high-performance computation for massive data.

We focus on approaches that allow us to perform k-means in a highly scalable way in order to cluster massive data sets efficiently; with this viewpoint, we do not consider solutions that otherwise improve the k-means algorithm, as the related work in the field can readily be applied to provide further algorithmic improvements.
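To make the grid partitioning concrete, the following is a minimal sketch of binning measurements into 1 degree x 1 degree cells so that each cell can be clustered and compressed independently. This is our illustration, not the paper's implementation; the function names grid_cell and partition_into_cells, and the assumption that each point carries (lat, lon) as its first two fields, are ours.

```python
import math
from collections import defaultdict

def grid_cell(lat, lon):
    """Map a measurement's (lat, lon) to its 1 degree x 1 degree grid cell.
    Keying by the integer floor of each coordinate yields 180 x 360 = 64,800
    cells for a full global coverage, matching the count in the text."""
    return (math.floor(lat), math.floor(lon))

def partition_into_cells(points):
    """Group data points by grid cell so that each cell can be clustered
    and compressed on its own."""
    cells = defaultdict(list)
    for p in points:
        lat, lon = p[0], p[1]  # remaining fields are the measured attributes
        cells[grid_cell(lat, lon)].append(p)
    return cells
```

Each cell's point list is then small enough to be handed to a per-cell clustering step, which is where the scalability requirements below come into play.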

1.1 Requirements from a Scalability and Performance Perspective

From a scalability and performance perspective, the following criteria are relevant when designing a k-means algorithm for massive data:

Footnote 1: Typically, a global coverage of measurements is collected, depending on the instrument, every 2 to 14 days.


Figure 1. Swath of the MISR satellite instrument.

Handling large numbers of data points: For massive data sets, an algorithm should be highly scalable with regard to the number and dimensionality of the data points that need to be clustered. This includes scaling automatically from small to massive data sets.

Overall efficient response time: The algorithm should adapt to available resources such as computational resources and memory automatically, and maximize the utilization of resources in a greedy way.

High quality clustering results: The results of clustering should provide a highly faithful representation of the original data, and capture all correlations between data points.

Easily interpretable results: Compressing massive data sets into equivalents that preserve the temporal-spatial structure should also represent data using descriptions that can easily be interpreted by an end user such as a climate researcher.

1.2 Contributions of this Paper

In this paper, we propose a scalable, parallel implementation of the k-means clustering algorithm based on the data stream paradigm. A single data stream-based operator consumes one or several data items from an incoming data stream, processes the data, and produces a stream of output data items, which are immediately consumed by the next data stream operator; thus, all data stream operators process data in a pipelined fashion. Using this paradigm for massive data sets, several restrictions can be observed:

1. Each data item from the original data set should be scanned only once to avoid data I/O,
2. An operator can store only a limited amount of state information in volatile memory, and
3. There is limited control over the order of arriving data items.

Nevertheless, data streaming allows us to exploit the inherent parallelism in the k-means algorithm. We introduce the partial/merge k-means algorithm, which processes the overall set of points in 'chunks', and 'merges' the results of the partial k-means steps into an overall cluster representation. The partial k-means and the merge k-means are implemented as data stream operators that adapt to available computing resources such as volatile memory and processors by parallelizing and cloning operators, and by computing k-means on partitions of data that fit into memory. In our analytical and extensive experimental tests, we compare the scalability, performance, and clustering quality of a serial and a data stream-based k-means implementation. A serial implementation assumes that grid cells are computed serially, and it requires that all data points are present in memory. The experimental results suggest that our approach scales in an excellent way with regard to the memory bottleneck, and produces clusters that are of better quality than those generated by a serial k-means algorithm, especially for large data sets.

The remaining paper is structured as follows: Section 2 states the problem in more detail and discusses related work. Section 3 contains the detailed description of the partial/merge k-means, and Section 4 includes implementation details of our prototype. Section 5 contains our experimental results. We conclude this paper in Section 6 with a few remarks.
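The partial/merge scheme can be sketched as follows: a partial operator scans the stream once, clusters each memory-sized chunk, and emits weighted centroid summaries; a merge operator then runs a weighted k-means over all partial centroids. This is our own minimal illustration under those assumptions, not the paper's operator code (the actual operators are described in Section 3), and the helper names (lloyd, summarize) are ours.

```python
import random

def lloyd(points, k, weights=None, iters=20, seed=0):
    """Plain (weighted) k-means, used for both phases. Illustrative only:
    fixed iteration count instead of a convergence test."""
    rnd = random.Random(seed)
    weights = weights if weights is not None else [1.0] * len(points)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[j].append((p, w))
        for j, g in enumerate(groups):
            if g:  # empty clusters keep their previous centroid
                tot = sum(w for _, w in g)
                centroids[j] = tuple(
                    sum(p[d] * w for p, w in g) / tot for d in range(len(g[0][0])))
    return centroids

def summarize(chunk, centroids):
    """Assign chunk points to centroids; emit (centroid, member_count) pairs."""
    counts = [0] * len(centroids)
    for p in chunk:
        j = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
    return [(c, n) for c, n in zip(centroids, counts) if n > 0]

def partial_kmeans(stream, k, chunk_size):
    """Partial operator: one scan over the stream, k-means per chunk."""
    chunk = []
    for point in stream:
        chunk.append(point)
        if len(chunk) == chunk_size:
            yield from summarize(chunk, lloyd(chunk, k))
            chunk = []
    if chunk:  # leftover partial chunk
        yield from summarize(chunk, lloyd(chunk, min(k, len(chunk))))

def merge_kmeans(partials, k):
    """Merge operator: weighted k-means over all partial centroids."""
    centroids, weights = zip(*partials)
    return lloyd(list(centroids), k, weights=list(weights))
```

Because the partial operator only retains one chunk and its centroid summaries, memory stays bounded regardless of stream length, which is the property the restrictions above demand.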

2 Clustering Massive Data Sets

Several methods of data compression in sparse, high-dimensional temporal-spatial spaces are based on clustering. Here, a data set is subdivided into temporal-spatial grid cells, and each cell is compressed separately. A grid cell can contain up to 100,000 n-dimensional data points. During clustering, a grid cell's data points are partitioned into several disjoint clusters such that the elements of a cluster are similar, and the elements of disjoint clusters are dissimilar. Each cluster is represented by a specific element, called the cluster centroid. For non-compression applications, the goal is to partition a data set into clusters; for compression, we are interested in representing the overall data set via the cluster centroids.

Two strategies can be used to identify clusters: a) finding densely populated areas in a data set, and dissecting the data space according to the dense areas (clusters) (density estimation-based methods), or b) using an algorithm that attempts to find the clusters iteratively (k-means). In our work, we focus on the k-means approach. In detail, the k-means algorithm can be formalized as follows. K-means is an iterative algorithm that categorizes a set of data vectors with metric attributes in the following way: Given a set S of N D-dimensional vectors, form k disjoint non-empty subsets {C_1, C_2, ..., C_k} such that each vector v_ij ∈ C_i is closer to mean(C_i) than to any other mean. Intuitively, the k-means algorithm can be explained with the following steps:

1. Initialization: Select a set of k initial cluster centroids randomly, i.e., m_j, 1 ≤ j ≤ k.

2. Distance Calculation: For each data point X_i, 1 ≤ i ≤ n, compute its Euclidean distance to each centroid m_j, 1 ≤ j ≤ k, and find the closest cluster centroid. The Euclidean distance from a vector v_j to a cluster centroid c_k is defined as dis(c_k, v_j) = (Σ_{d=1..D} (c_kd − v_jd)^2)^(1/2).

3. Centroid Recalculation: For each 1 ≤ j ≤ k, compute the actual mean of the cluster C_j, which is defined as µ_j = (1/|C_j|) · Σ_{v ∈ C_j} v; the cluster centroid m_j 'jumps' to the recalculated actual mean of the cluster, and defines the new centroid.

4. Convergence Condition: Repeat (2) to (3) until the convergence criterion is met. The convergence criterion is defined as the difference between the mean square error (MSE) in the previous clustering iteration I = (n − 1) and the mean square error in the current clustering iteration I = (n). In particular, we choose (MSE(n − 1) − MSE(n)) ≤ 1 × 10^(−9) (see our experiments in Section 5.2).

The quality of the clustering process is indicated by the error function E, which is defined as E = Σ_{k=1..K} Σ_{v ∈ C_k} ||µ_k − v||^2, where K is the total number of clusters. The k-means algorithm iteratively minimizes this function. The quality of the clustering depends on the selection of k; also, the k-means algorithm is sensitive to the initial random choice of k seeds from the existing data points. In the scope of this paper, we assume that we are able to make an appropriate choice of k; however, we vary the selection of the initial random seeds. There are several improvements for step 2 that allow us to limit the number of points that have to be re-sorted; however, since this is not relevant to the scope of the paper, we will not consider it further.
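The four steps above, including the MSE-based convergence test, can be sketched in a few lines. This is our own minimal illustration, not the paper's implementation; variable names and the random-seed handling are ours.

```python
import random

def kmeans(points, k, eps=1e-9, seed=0):
    """K-means following steps 1-4: random initialization, distance
    calculation, centroid recalculation, and the MSE convergence test
    (MSE(n-1) - MSE(n) <= 1e-9). Returns (centroids, E) where E is the
    sum of squared distances for the final assignment."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)          # step 1: initialization
    prev_mse = float("inf")
    while True:
        # step 2: assign each point to its closest centroid (Euclidean)
        clusters = [[] for _ in range(k)]
        sq_err = 0.0
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            j = dists.index(min(dists))
            clusters[j].append(p)
            sq_err += dists[j]
        # step 3: each centroid 'jumps' to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
        # step 4: stop once the MSE improvement drops below eps
        mse = sq_err / len(points)
        if prev_mse - mse <= eps:
            return centroids, sq_err
        prev_mse = mse
```

Note that the returned error corresponds to the error function E evaluated for the last assignment made before convergence was detected.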

2.1 Problem Statement

Clustering temporal-spatial data in high-dimensional spaces using k-means is expensive with regard to both computational costs and memory requirements. Table 1 depicts the symbols used in the complexity analysis of the algorithms.

N   Number of data points
I   Number of iterations to converge
K   Total number of centroids
G   Total number of grid cells
R   Total number of experiment runs (with different initial k seeds)
p   Total number of chunks/partitions used in partial/merge k-means

Table 1. Symbols used in complexity analysis of the algorithms.

Computing k-means via a serial algorithm, i.e., scanning one grid cell C_j at a time, compressing it, and then scanning the next grid cell, requires that all N data points belonging to one grid cell be kept in memory. The algorithm uses I iterations to converge, and it is run R times with R different sets of randomly chosen initial seeds. In this case, the memory complexity is O(N), and the time complexity is O(GRIKN), whereby G is the number of overall grid cells. Here, both the memory and the computational resources are bottlenecks for the serial k-means.

Two aspects of the memory bottleneck need to be considered: the volatile memory that is available via virtual memory management, and the actual memory available via RAM. From a database perspective, control over RAM is essential to control any undesired paging effects. If the relevant data points do not fit into RAM, the data set has to be broken up and clustered incrementally to avoid uncontrolled paging effects by the underlying operating system. In the database literature, several approaches deal with the problem of large data sizes ([2, 5, 14, 18, 25]), a problem that is not considered in the k-means algorithms used in statistics, machine learning, and pattern recognition. We refer to this work in more detail in the related work Section 2.2.
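A back-of-envelope calculation shows why O(GRIKN) is prohibitive at the scales given in the Introduction. The values of R, I, and K below are our own hypothetical assumptions; G, N, and D follow the figures quoted in the text.

```python
# Illustrative cost of the serial algorithm, dominated by distance
# evaluations: O(G * R * I * K * N).
G = 64_800    # grid cells in one global MISR coverage (from the text)
R = 10        # runs with different initial seeds (assumed)
I = 50        # iterations to converge (assumed)
K = 64        # centroids per grid cell (assumed)
N = 100_000   # data points per grid cell (upper bound from the text)
D = 10        # attributes per point (lower bound from the text)

distance_evals = G * R * I * K * N   # dominant term of O(GRIKN)
flops = distance_evals * 3 * D       # ~3 ops per dimension per distance

print(f"{distance_evals:.3e} distance evaluations, ~{flops:.3e} flops")
```

Even under these conservative assumptions the serial approach requires on the order of 10^14 distance evaluations for a single global coverage, which motivates exploiting both parallelism and incremental, memory-bounded processing.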

Parallel implementations of k-means deal with the bottleneck of computational resources when clustering large amounts of high-dimensional data. A group of processors is utilized to solve a single computational problem. Several ways of parallelizing k-means using either massively parallel processor machines or networks of PCs can be considered (see Footnote 2).

As shown in Figure 2, Method A is a naive way of parallelizing k-means: it assigns the clustering of one grid

Footnote 2: For price/performance reasons, we consider shared-nothing or shared-disk environments as available in networks of PCs.