This article gives a short introduction to clustering and then explains the K-means algorithm using a live demo in Silverlight. The demo can be used to understand how the K-means algorithm works on user-defined data points. The full source code in C# and Silverlight is available for download below.

K-Means Demo in Silverlight

The K-means algorithm is illustrated in this demo. You can enter a new set of data points and test the resultant clusters. The source code of the demo is available for download below.

What is Machine Learning and Clustering

Machine learning is a scientific discipline concerned with algorithms that learn automatically from data in order to recognize complex patterns and make intelligent decisions. This computational learning can be supervised or unsupervised. Data mining is the process of extracting useful patterns from large volumes of data. Uncovering hidden patterns in data using data mining techniques can be very useful for businesses, scientists and governments.

Clustering is the process of organizing a set of items into subsets (called clusters) so that items in the same cluster are similar. The similarity between items can be defined by a function or a formula, based on the context. For example, the Euclidean distance between two points can serve as a similarity measure for a list of points/coordinates in space: the smaller the distance, the more similar the points. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. Clustering is also referred to as automatic classification, numerical taxonomy, topological analysis, etc. For more information on clustering, see http://en.wikipedia.org/wiki/Cluster_analysis.
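As a small illustration of the Euclidean distance mentioned above, here is a minimal Python sketch. It is not taken from the article's C# source; the function name euclidean_distance and the sample points are assumptions made for this example.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2D points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical sample points: c is much closer to a than b is.
a, b, c = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)

print(euclidean_distance(a, b))                             # 5.0
print(euclidean_distance(a, c) < euclidean_distance(a, b))  # True
```

A smaller distance means the two points are more similar, so a clustering algorithm based on this measure would tend to put a and c in the same cluster before a and b.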

Data Structures for this Article

We illustrate the k-means algorithm using a set of points in 2-dimensional (2D) space. The following data-structure classes are created. The Point class represents a point in 2D space. The PointCollection class represents a set of points and is also used to represent a cluster.
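The two classes described above could look roughly like the following Python sketch. This is only an analogue written for illustration, not the article's C# code: the member names (points, centroid, add, remove, update_centroid) are assumptions, and only the class names Point and PointCollection come from the article.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Point:
    """A point in 2D space."""
    x: float
    y: float

@dataclass
class PointCollection:
    """A set of points; also used as a cluster that tracks its own centroid."""
    points: List[Point] = field(default_factory=list)
    centroid: Point = field(default_factory=lambda: Point(0.0, 0.0))

    def add(self, p: Point) -> None:
        self.points.append(p)
        self.update_centroid()

    def remove(self, p: Point) -> None:
        self.points.remove(p)  # removes the first point equal to p
        self.update_centroid()

    def update_centroid(self) -> None:
        # An empty collection keeps its previous centroid (a choice made for this sketch).
        if not self.points:
            return
        n = len(self.points)
        self.centroid = Point(sum(p.x for p in self.points) / n,
                              sum(p.y for p in self.points) / n)

cluster = PointCollection()
for p in (Point(0.0, 0.0), Point(2.0, 0.0), Point(1.0, 3.0)):
    cluster.add(p)
print(cluster.centroid)  # Point(x=1.0, y=1.0)
```

Keeping the centroid inside the collection means it can be refreshed each time a point is added or removed.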

K-means is a simple clustering algorithm that divides a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is chosen to be as close as possible (in terms of the similarity function) to all objects of that cluster; for points in space compared by Euclidean distance, it is the mean of the points in the cluster.
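For points in space, the centroid of a cluster is usually taken to be the coordinate-wise mean of its points. The following Python sketch shows that calculation; the function name centroid and the sample points are assumptions for illustration, not part of the article's source.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    # zip(*points) groups all x values together, all y values together, and so on.
    return tuple(sum(coords) / n for coords in zip(*points))

print(centroid([(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]))  # (2.0, 2.0)
```

Because the mean is taken per coordinate, the same function works unchanged for points with more than two dimensions.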

Since the number of clusters is known in advance, the objects in the input list are first divided into k random groups, that is, each object is assigned to a random cluster. The algorithm then iteratively refines the groups by moving each object from its current cluster to the cluster it is most similar to, as measured by the similarity function. Whenever an object is added to or removed from a cluster, the centroid of that cluster is recalculated. An iteration never decreases the overall similarity between the points of a cluster and their centroid, so the process converges, although possibly to a local rather than a global optimum. The refinement continues until all clusters become stable, i.e. there is no further movement of objects between clusters. For more information on the k-means algorithm, see http://en.wikipedia.org/wiki/K-means_clustering. The k-means algorithm is also referred to as Lloyd's algorithm.
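The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the article's C# implementation: points are plain (x, y) tuples, the names k_means, centroid, seed and max_passes are hypothetical, a point is moved only when another centroid is strictly closer, and a cluster is never allowed to become empty.

```python
import math
import random

def centroid(points):
    """Mean of a non-empty list of (x, y) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def k_means(points, k, seed=0, max_passes=100):
    """Divide 2D points into k clusters; returns a list of k lists of points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    # Random initial grouping: deal the shuffled points out so no cluster starts empty.
    clusters = [shuffled[i::k] for i in range(k)]
    centroids = [centroid(c) for c in clusters]

    moved, passes = True, 0
    while moved and passes < max_passes:
        moved = False
        passes += 1
        for i, cluster in enumerate(clusters):
            for p in cluster[:]:           # iterate over a copy while mutating
                if len(cluster) == 1:
                    break                  # do not empty a cluster
                nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
                if math.dist(p, centroids[nearest]) < math.dist(p, centroids[i]):
                    cluster.remove(p)
                    clusters[nearest].append(p)
                    # Recalculate the centroids of both affected clusters.
                    centroids[i] = centroid(cluster)
                    centroids[nearest] = centroid(clusters[nearest])
                    moved = True
    return clusters

sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
for group in k_means(sample, 2):
    print(group)
```

The result depends on the random initial grouping, so different seeds can produce different stable clusterings for the same input.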

The K-means algorithm can be used for grouping any set of objects whose similarity measure can be defined numerically. For example, the records of a relational-database table can be divided into clusters based on any numerical field of the table. A set of customers or employees, for instance, can be divided based on attributes/properties such as age, income, joining date, etc. In such cases, the similarity measure has to be defined based on that attribute.
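To make the idea of an attribute-based similarity measure concrete, here is a small Python sketch. The records, field names and the function attribute_distance are all invented for this example and do not come from the article.

```python
def attribute_distance(a, b, key):
    """Dissimilarity of two records on one numeric attribute (smaller means more similar)."""
    return abs(a[key] - b[key])

# Made-up customer records, for illustration only.
customers = [
    {"name": "Ann", "age": 23, "income": 31000},
    {"name": "Bob", "age": 27, "income": 78000},
    {"name": "Cid", "age": 58, "income": 35000},
]
ann, bob, cid = customers

print(attribute_distance(ann, bob, "age"))     # 4
print(attribute_distance(ann, cid, "age"))     # 35
print(attribute_distance(ann, bob, "income"))  # 47000
print(attribute_distance(ann, cid, "income"))  # 4000
```

Measured by age, Ann is closest to Bob; measured by income, she is closest to Cid. The chosen attribute therefore determines which records end up in the same cluster.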

The source code of the demo implements the K-means algorithm using the data structures described above.

The SplitList() function defined in the ListUtility class is used to split a list of objects into a given number of equal-sized groups. This is explained in more detail in this article. The FindNearestCluster() function finds the cluster that is nearest (in terms of Euclidean distance) to the given point.
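The two helper functions could be sketched in Python as shown below. The signatures of the article's SplitList() and FindNearestCluster() are not shown in the text, so the names split_list and find_nearest_cluster, their parameters, and the choice to represent each cluster by its centroid are assumptions made for illustration.

```python
import math

def split_list(items, k):
    """Split items into k consecutive groups whose sizes differ by at most one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    size, extra = divmod(len(items), k)
    groups, start = [], 0
    for i in range(k):
        # The first `extra` groups take one additional item each.
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups

def find_nearest_cluster(centroids, point):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

print(split_list([1, 2, 3, 4, 5, 6, 7], 3))                           # [[1, 2, 3], [4, 5], [6, 7]]
print(find_nearest_cluster([(0.0, 0.0), (10.0, 10.0)], (8.0, 9.0)))  # 1
```

When the list length is not a multiple of k, the groups cannot be exactly equal, so this sketch lets the leading groups hold one extra item.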

Comment by kulandai samy on Apr-28-2013: hi frds, i have one doubt.. how to determine the number of clusters for k-means.. please help me..

Comment by jos on Apr-23-2013: Great article. By reading the source code I now understand the algorithm. This is math for dummies like me. What is this web site anyway, does it contain more of these quality articles?

Comment by subrata on Jun-09-2012: i want

Comment by khanmuh2 on Apr-24-2012: using System.Windows.Controls.DataVisualization; source code compile error. I can't seem to find the dll for this.

Comment by zddream on Apr-11-2012: good applet! begin learning

Comment by Jiangxin Wu on Oct-24-2011: Remarkable

Comment by Allen on Sep-17-2011: do you have any examples using 4-dimensional observational data? let's say x1 = {x,y,z,w}