Encyclopedia of Information Science and Technology, Fourth Edition (10 Volumes) Now 50% Off

Take 50% off when purchasing the Encyclopedia directly through IGI Global's Online Bookstore. Plus, receive the complimentary e-books for the first, second, and third editions with the purchase of the Encyclopedia of Information Science and Technology, Fourth Edition e-book.

InfoSci®-Journals Annual Subscription Price for New Customers: As Low As US$ 4,950

This collection of over 175 e-journals offers unlimited access to highly-cited, forward-thinking content in full-text PDF and XML with no DRM. There are no platform or maintenance fees and a guarantee of no more than 5% increase annually.

Abstract

Clustering is one of the most important techniques in data mining. This chapter presents a survey of popular approaches for data clustering, including well-known clustering techniques, such as partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering, and recent advances in clustering, such as subspace clustering, text clustering and data stream clustering. The major challenges and future trends of data clustering will also be introduced in this chapter. The remainder of this chapter is organized as follows. The background of data clustering will be introduced in Section 2, including the definition of clustering, categories of clustering techniques, features of good clustering algorithms, and the validation of clustering. Section 3 will present main approaches for clustering, which range from the classic partitioning and hierarchical clustering to recent approaches of bi-clustering and semisupervised clustering. Challenges and future trends will be discussed in Section 4, followed by the conclusions in the last section.

Background

Data clustering is sourced from pattern recognition (Theodoridis & Koutroumbas, 2006), machine learning (Alpaydin, 2004), statistics (Hill & Lewicki, 2007) and database technology (Date, 2003). Data clustering is to partition data into groups, where the data in the same group are similar to one another and the data from different groups are dissimilar (Han & Kamber, 2000). More specifically, it is to segment data into clusters so that the intra-cluster similarity is maximized and that the inter-cluster similarity is minimized. The groups obtained are a partition of data, which can be used for customer segmentation, document categorization, etc.

Clustering techniques can be “clustered” into groups in multiple ways. In terms of the membership of objects, there are two kinds of clustering, fuzzy clustering and hard clustering. Fuzzy clustering is also known as soft clustering, where an object can be in more than one cluster, but with different membership degrees. In contrast, an object in hard clustering can belong to one cluster only. Generally speaking, clustering is referred to as hard clustering implicitly. In terms of approaches, data clustering techniques can be classified into the following groups: partitioning clustering, hierarchical clustering, density-based clustering, grid-based clustering and model-based clustering. In terms of the type of data, there are spatial data clustering, text clustering, multimedia clustering, time series clustering, data stream clustering and graph clustering.

For a good clustering algorithm, it is supposed to have the following features: 1) the ability to detect clusters with various shapes and different distributions; 2) the capability of finding clusters with considerably different sizes; 3) the ability to work when outliers are present; 4) no or few parameters needed as input; and 5) scalability to both the size and the dimensionality of data.

How to evaluate the results is an important problem for clustering. For the validation of clustering results, there are many different measures, such as Compactness (Zait & Messatfa, 1997), Conditional Entropy (CE) and Normalized Mutual Information (NMI) (Strehl & Ghosh, 2002; Fern & Brodley, 2003). The validation measures can be classified into three categories, 1) internal validation, such as Compactness, Dunn’s validation index, Silhouette index and Hubert’s correlation with distance matrix, which is based on calculating the properties of result clusters, 2) relative validation, such as Figure of merit and Stability, which is based on comparisons of partitions, and 3) external validation, such as CE, NMI, Hubert’s correlation, Rand statistics, Jaccard coefficient, and Folkes and Mallows index, which is based on comparing with a known true partition of data (Halkidi et al., 2001, Brun et al., 2007).

Key Terms in this Chapter

Partitioning Clustering: It is a clustering approach which uses centers to represent clusters and then improves the partitioning by moving objects from group to group.

Data Clustering: Data clustering is to partition data into groups, where the data in the same group are similar to one another and the data from different groups are different from one another.

Text Clustering: It is to group documents based on the similarity in their topics and text.

Subspace Clustering: It is to find clusters in subspaces, where two clusters may exist in two different subspaces and the subspaces may also have different dimensionalities.

Model-Based Clustering: Model-based clustering assumes that the data are generated by a mixture of probability distributions, and attempts to learn statistical probability models from data, with each model representing one particular cluster.

Fuzzy Clustering: Also known as Soft Clustering. For fuzzy clustering, an object can be classified with fractional membership into multiple groups, in contrast to Hard Clustering where an object can be classified into one group only.

Bi-Clustering: Also known as co-clustering, it is to group objects for a subset of attributes by performing simultaneous clustering of both rows and columns.

Hierarchical Clustering: It is to build a hierarchical decomposition of data in either bottom-up or top-down way. Generally a dendrogram is generated and a user may select to cut it at a certain level to get the clusters.

Data Stream Clustering: It is to group continuously arriving data, instead of static data, into groups based on the similarity.

Grid-Based Clustering: It is to partition the whole space into cells with grids and then merge the cells to build clusters.

Semi-Supervised Clustering: Semi-supervised clustering is a partly supervised clustering which is guided with a small amount of knowledge, such as pairwise constraints and class labels for some objects.