This article introduces what data clustering is, and what it is good for. It provides the simplest possible implementation of the popular k-means++ algorithm in both FORTRAN and C, and discusses a couple of example problems.

What it is

Data clustering, sometimes called cluster analysis, is the process of finding groups in data. Objects that are alike are put together; objects that are different are separated.

As a graphical example, if your data is this:

Then clustering is this:

Major tasks in Machine Learning:

Regression

Make a numerical prediction. Example: predict the height of the son given the height of the father.

Classification

Assign each object to one of several given classes. Example: separate the sheep from the goats.

Pattern recognition

Search for occurrences of a known object in a data set. Example: find a needle in a haystack.

Clustering

Assign the objects to groups which have not been defined beforehand. Example: naming species of flowers in a collection of plant specimens.

Data fusion

Decide which object from one source of information corresponds to an object from another source. Example: match satellite images of missile installations against spy reports. (Yes, the term "fusion" is used mainly by the military.)

Data clustering is a topic in statistics and one of the major tasks of machine learning.

What it's good for

Exploratory analysis:

A company's marketing department may wish to place its customers in one of several market segments. A biologist may be interested in arranging a collection of specimens as several species. Much attention in statistics is given to hypothesis testing: to prove or disprove whether a difference exists between groups. Exploratory analysis is something else: there is no hypothesis yet. The analyst is finding what groups appear to exist.

Summarize a data set:

To cope with a large amount of data, often a small sample is taken to represent the full set. This sample may be chosen at random, but it is possible that rare objects will be overlooked. Data clustering can be used to produce a sample with both common and rare objects.

Data clustering should not be confused with classification. In classification, the classes are known beforehand. Usually a human expert provides a training set where each data point has been labeled with its class. In clustering, it is not known beforehand what classes might exist. For example, suppose that the data is information on some tumor cells. A classification task would be to identify each specimen as benign or malignant. A clustering task would be to identify distinct kinds of tumor cells.

Varieties of clustering

Prototype clustering

Each cluster is represented by a prototype, an example, or an average of some kind. Each object is compared to this prototype to decide if it belongs in the cluster. The k-means algorithm is of this variety.

Hierarchical clustering

Biologists place each species into a genus, each genus into a family, each family into an order, each order into a class, and each class into a division. Thus, all living things form a hierarchy of groups. Much early work in data clustering was done by botanists and was hierarchical, of course. It will not be discussed further here.

Graph clustering

Data points are compared to other points, and there are no cluster prototypes. Graph theory is invoked: the data objects are called vertices, and the similarities are called edges. Graph clustering is then a matter of cutting up the graph by slicing a minimum of edges. Since there is no central point, the clusters may be of any shape. This makes graph clustering useful for image segmentation, which will not be discussed further here.

Specifics

It becomes necessary to specify exactly what an "object" is, and what "similarity" means.

Let an object be a vector in RP, that is, an array of type float of length P. This is the easiest kind of data to work with, and many real problems of interest can be described this way.

x = [x1 x2 … xP]

Let the prototype of each cluster be its centroid, the arithmetic mean of its vectors:

c = (1/N) Σxi

Let the dissimilarity between an object and its prototype be the squared Euclidean distance, that is, according to the Pythagorean Theorem:

d² = Σ(xj - cj)²

The components of x may be quantities of various kinds: temperature, pressure, distance, energy and so forth. Using the Euclidean distance adds apples and oranges. Is there any physical meaning to adding squared pressure to squared time? No, but there is a justification behind the use of the formula which is not crucial for the beginner to understand in detail, but ought to be mentioned. A set of measurements where most of the values are close to the average can be described by the multivariate normal distribution

f(x, μ, V) = exp(-½(x - μ)V⁻¹(x - μ)ᵀ) / ((2π)^(P/2) √|V|)

This equation gives a value f proportional to the probability of drawing an object with values x from a population with average values μ and covariance matrix V. For example, the heights and weights of adult men in a population. If the interactions between the variables are ignored, the covariance matrix reduces to a vector of variances. Then take the negative logarithm and omit constant terms:

-log f ∝ Σ((xj - μj)/σj)²

which is the squared Euclidean distance after the units of each quantity have been divided by their (within-cluster) standard deviation. So, though some statisticians cook numbers to a thick paste, we can and should have a rational justification for the operations we are making on the data!
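The centroid and squared-distance computations defined above can be sketched in C. This is a minimal illustration; the function names are mine, not taken from the article's source.

```c
/* Squared Euclidean distance between a point x and a center c in RP. */
double dist2(const float *x, const float *c, int p)
{
    double d = 0.0;
    for (int j = 0; j < p; j++) {
        double t = (double)x[j] - (double)c[j];
        d += t * t;
    }
    return d;
}

/* Centroid of n points of dimension p, stored as an array of row pointers.
   Accumulates in double precision, then stores the mean in c. */
void centroid(float **x, int n, int p, float *c)
{
    for (int j = 0; j < p; j++) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i][j];
        c[j] = (float)(s / n);
    }
}
```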

The k-means method

The k-means method has been around since 1956 or earlier, and is still widely used because it is simple, fast, and usually gives good results. Usually. Notice that formally, k-means is a method and not an algorithm, because it gives no guarantee that it will yield a correct answer.

Call the objects "points" and the prototypes "centers." The method alternates between two main steps:

Assign each point to its nearest center

Let each center be the average of its points

Quit when the cluster assignments of the points quit changing. Let N denote the number of objects, P the number of measurements per object, and K the number of clusters sought. Then the greatest computational cost is in comparing the points to the centers, which requires O(NPK) operations per iteration. [1]
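The two alternating steps can be sketched in C as follows. This is a minimal illustration under my own names and conventions (points as an array of row pointers, labels 0..K-1), not the article's actual KMPP routine; the caller should set every Z entry to -1 before the first call.

```c
#include <float.h>

/* Lloyd iterations: x is n points of dimension p, c holds k centers
   (pre-seeded by the caller), z receives cluster labels 0..k-1.
   Returns the number of update rounds performed. */
int kmeans(float **x, int n, int p, int k, float **c, int *z)
{
    int changed = 1, iter = 0;
    while (changed) {
        changed = 0;
        /* Step 1: assign each point to its nearest center. */
        for (int i = 0; i < n; i++) {
            int best = 0;
            double bestd = DBL_MAX;
            for (int h = 0; h < k; h++) {
                double d = 0.0;
                for (int j = 0; j < p; j++) {
                    double t = x[i][j] - c[h][j];
                    d += t * t;
                }
                if (d < bestd) { bestd = d; best = h; }
            }
            if (z[i] != best) { z[i] = best; changed = 1; }
        }
        if (!changed) break;   /* assignments quit changing: done */
        /* Step 2: let each center be the average of its points. */
        for (int h = 0; h < k; h++) {
            int count = 0;
            for (int j = 0; j < p; j++) c[h][j] = 0.0f;
            for (int i = 0; i < n; i++)
                if (z[i] == h) {
                    count++;
                    for (int j = 0; j < p; j++) c[h][j] += x[i][j];
                }
            if (count > 0)     /* leave an empty cluster's center at zero */
                for (int j = 0; j < p; j++) c[h][j] /= count;
        }
        iter++;
    }
    return iter;
}
```

The triple loop in step 1 is where the O(NPK) cost per iteration comes from.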

The k-means++ variant

The method as described above does not say how to choose the starting position. Traditional approaches have been to choose K points at random to be the initial centers, or to randomly assign each point to a cluster. It is possible for this to lead to very bad results, and the old advice was to run k-means several times and keep the best result.

Since 2006, new ways to choose the initial centers have been proposed that offer theoretical guarantees of approximating the optimal clustering. [2, 3] This new variant is called the k-means++ algorithm. One point is chosen uniformly at random to be the first center. The remaining K-1 centers are chosen from the other points with probability proportional to their squared distance to the nearest previously chosen center.

Here is how to choose an element with an arbitrary probability. First, we need a WORK array of length N to hold the shortest squared distance for each data point, and a total TOT holding the sum of the elements in WORK. Then: [4]
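The original listing is not reproduced here; a C sketch of the idea looks like this. The name is mine, and for clarity the uniform random draw is passed in as a parameter rather than generated inside the function.

```c
/* Choose an index 0..n-1 with probability proportional to work[i].
   r must be a uniform random number in [0, TOT), where TOT is the sum
   of the elements of work, e.g. r = TOT * rand() / (RAND_MAX + 1.0). */
int pick(const double *work, int n, double r)
{
    double run = 0.0;            /* running cumulative sum */
    for (int i = 0; i < n; i++) {
        run += work[i];          /* cumulative distribution grows each step */
        if (r < run)
            return i;
    }
    return n - 1;                /* guard against floating-point rounding */
}
```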

Because every element in WORK is positive, adding turns an arbitrary probability distribution into a cumulative probability distribution, so that at each step of the loop on I, TOT must increase. As a graphic, it plots like this:

The cumulative distribution is the wand that can turn a mere uniform random number into a random variable of any other probability distribution.

where:

X is the input data, points in RP.
P is the number of variables.
N is the number of objects.
K is the number of clusters to be formed.
C is the output centers, points in RP.
Z is an array of integers from 1…K denoting the cluster membership of each point.
WORK is used internally by the subroutine; its contents may be ignored by the user.
IFAULT is an error code that returns 0 on success.

Notice that the matrices are in the form of arrays of pointers, that double precision is eschewed, and that the error code is now the return value of the function.
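A caller might set up the arguments under that convention as sketched below. The kmpp prototype here is my guess from the parameter list above, not verified against the actual source files.

```c
#include <stdlib.h>

/* Guessed prototype, following the description: single precision,
   matrices as arrays of row pointers, error code as return value. */
int kmpp(float **x, int p, int n, int k, float **c, int *z, float *work);

/* Allocate an n-by-p matrix as an array of row pointers, zero-filled. */
float **make_matrix(int n, int p)
{
    float **m = malloc(n * sizeof *m);
    for (int i = 0; i < n; i++)
        m[i] = calloc(p, sizeof **m);
    return m;
}
```

With X built by make_matrix(n, p), C by make_matrix(k, p), and Z and WORK as arrays of length n, the call would then look something like `ifault = kmpp(x, p, n, k, c, z, work);`, with a return of 0 indicating success.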

Example: the Iris data

The Iris data set is one of the oldest and best-known test problems in statistics. [5] It was the example problem R. A. Fisher used to introduce Linear Discriminant Analysis in 1936. It consists of the measurements of the widths of petals, lengths of petals, widths of sepals, and lengths of sepals of 150 specimens of three species of iris. So N=150, P=4, K=3. The first two dimensions of the data plot as:

The t_iris example program calls KMPP and writes the results as a CSV file which is easy to import into any spreadsheet. With colors representing cluster membership and X marking the centers:

It is possible for the clusters to appear to overlap, because this plot is just a 2D projection of the 4-dimensional data set. [6] For reference, here is the Iris data colored according to class label:

There are some disagreements. Remember, clustering is not classification. And k-means is pretty good, but it is certainly not perfect.

Example: random points in the unit circle

The t_round example program generates points at random in the unit circle, N=30000, P=2, K=6. The input to KMPP is:

It is split up into clusters:

Notice that k-means obligingly partitions the data into K pieces, even if no clusters exist in the input!

Limitations

The simple routine given here

Can't cope with data that is not represented as a vector of real numbers

Can't cope with a data set from which any values are missing

Can't tell you how many clusters you ought to make

Can't measure the quality of its results

I hope in a future article to show at an Intermediate level of difficulty how these limitations may be addressed.

Notes

[1] If the centers are put into a tree data structure, then it is possible to find the nearest center to a point without comparing to all of them.

[2] David Arthur and Sergei Vassilvitskii, "k-means++: the advantages of careful seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007. Without doing any math, but using some common sense, it is obvious that because k-means++ depends on drawing random numbers, it can only "guarantee" the quality of the results in some average sense, over the long run. So it is still important to run the subroutine several times.

[4] This linear forward search could be replaced by storing the cumulative distribution in WORK, and doing a bisection search, for improved speed.
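A C sketch of that refinement follows, assuming WORK now stores the running cumulative sums rather than the individual squared distances; the name is mine.

```c
/* work[i] holds the cumulative sum of the first i+1 squared distances,
   so work[n-1] is the total. Given uniform r in [0, work[n-1]), return
   the smallest i with r < work[i]. O(log n) per draw instead of O(n). */
int pick_bisect(const double *work, int n, double r)
{
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (r < work[mid])
            hi = mid;        /* answer is at mid or to its left */
        else
            lo = mid + 1;    /* answer is strictly to the right */
    }
    return lo;
}
```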


About the Author

I work on an assembly line and have been doing amateur programming since 1992. Those rare times away from the computer I spend bicycling, canoeing the Columbia River, or cheering on the Seattle Seahawks.

Well, the R package NbClust implements 30 different statistical tests to try to choose K. The silhouette is one of them. They all seek some kind of minimum of variance, so they all fit into the framework of the elbow method. The code I have now computes the Pham/Dimov/Nguyen statistic.
I'm not familiar with the AIC or BIC. Maybe that is the better choice.

Dear sir: great article... but... please post the full source code, both in C and Fortran.
While the article is a great starting point, the absence of source code makes it less than usable, at least for those of us who are starting with the topic.
Anyway, great work. I do expect your next articles.

Hi eslipak,
Thanks for noticing this. The download link at the top of the page has been left out somehow. In the left column of the page, click on "Browse Source" and the source files are all there to view or download.

Well... I was never aware of this (Browse Code). I do have some difficulty reading (age does not help one see well...), so I expand the webpage to make the letters bigger. The options on the left of the page were hidden... I do learn something new every day.
Thanks a lot.
Be well