is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others.

Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. They are the subject of the survey.

Categories and Subject Descriptors: I.2.6. [

The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification. It represents many data objects by few clusters, and hence, it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to

To fix the context and to clarify prolific terminology, we consider a dataset $X$ consisting of data points (or synonymously, objects, instances, cases, patterns, tuples, transactions) $x_i = (x_{i1}, \ldots, x_{id}) \in A$ in attribute space $A$, where $i = 1:N$, and each component $x_{il} \in A_l$ is a numerical or nominal categorical attribute (or synonymously, feature, variable, dimension, component, field). For a discussion of attribute data types see [Han & Kamber 2001]. Such point-by-attribute data format conceptually corresponds to an $N \times d$ matrix and is used by the majority of algorithms reviewed below. However, data of other formats, such as variable length sequences and heterogeneous data, is becoming more and more popular. The simplest attribute space subset is a direct Cartesian product of sub-ranges $C = \prod C_l \subset A$, $C_l \subseteq A_l$, $l = 1:d$, called a segment (also cube, cell, region). A unit is an elementary segment whose sub-ranges consist of a single category value, or of a small numerical bin. Describing the number of data points in every unit represents an extreme case of clustering, a histogram, where no actual clustering takes place. This is a very expensive representation, and not a very revealing one. User driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains. We distinguish clustering from segmentation to emphasize the importance of the automatic learning process.
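As a minimal sketch of the point-by-attribute format and the per-unit histogram described above, the following Python snippet stores a small made-up dataset as N rows with d attributes, maps every point to its unit (a numerical bin for the first attribute, a single category value for the second), and counts the points per unit. The data values, the bin width, and the helper names `unit_of` and `unit_histogram` are illustrative assumptions, not part of the survey.

```python
from collections import Counter
import math

# Hypothetical N x d dataset (N = 6, d = 2): one numerical and one nominal attribute.
X = [
    (0.5, "red"),
    (0.7, "red"),
    (1.2, "blue"),
    (1.9, "blue"),
    (2.1, "red"),
    (2.4, "red"),
]

BIN_WIDTH = 1.0  # assumed width of a "small numerical bin"


def unit_of(point, bin_width=BIN_WIDTH):
    """Map a point to its unit: a bin index for numbers, the value itself for categories."""
    unit = []
    for value in point:
        if isinstance(value, (int, float)):
            unit.append(math.floor(value / bin_width))
        else:
            unit.append(value)
    return tuple(unit)


def unit_histogram(points, bin_width=BIN_WIDTH):
    """Count the number of data points falling into every non-empty unit."""
    return Counter(unit_of(p, bin_width) for p in points)


hist = unit_histogram(X)
```

The number of possible units is the product of the per-attribute bin counts, which is one way to see why the histogram becomes an expensive representation when the data has many attributes.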

The ultimate goal of clustering is to assign points to a finite system of $k$ subsets, called clusters. Usually the subsets do not intersect (this assumption is sometimes violated), and their union is equal to the full dataset with the possible exception of outliers.
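The conditions in this paragraph can be checked mechanically. The sketch below, with the hypothetical names `is_valid_clustering`, `clusters`, and `outliers` and made-up point identifiers, tests that the clusters are pairwise disjoint and that, together with the outliers, they cover the full dataset. It assumes hard (non-overlapping) assignment, which, as noted above, some algorithms relax.

```python
from itertools import combinations


def is_valid_clustering(dataset, clusters, outliers=frozenset()):
    """Return True if clusters are pairwise disjoint and, with outliers, cover the dataset."""
    # Clusters must not share points with each other.
    for a, b in combinations(clusters, 2):
        if a & b:
            return False
    # Outliers must not be assigned to any cluster.
    covered = set().union(*clusters) if clusters else set()
    if covered & set(outliers):
        return False
    # The union of all clusters and the outliers must equal the dataset.
    return covered | set(outliers) == set(dataset)


# Hypothetical dataset of point identifiers 0..6, split into k = 2 clusters plus one outlier.
dataset = range(7)
clusters = [{0, 1, 2}, {3, 4, 5}]
outliers = {6}
```

Calling `is_valid_clustering(dataset, clusters, outliers)` accepts this assignment, while an assignment with overlapping clusters or an uncovered point is rejected.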

[Jain & Flynn 1996]. For statistical approaches to pattern recognition see [Dempster et al. 1977] and [Fukunaga 1990]. Clustering can be viewed as a density estimation problem. This is the subject of traditional multivariate statistical estimation [Scott 1992]. Clustering is also widely used for data compression in image processing, which is also known as