Efficient Algorithms for K-Means Clustering

This is a collection of C++ procedures for performing k-means
clustering based on a combination of local search and Lloyd's
algorithm (also known as the k-means algorithm). Given any set
of k centers Z, for each center z in Z, let V(z) denote its
neighborhood, that is, the set of data points for which z
is the nearest center. Each stage of Lloyd's algorithm moves
every center point z to the centroid of V(z) and then updates
V(z) by recomputing the distance from each point to its nearest
center. These steps are repeated until convergence.
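One stage of this process can be sketched in a few lines of C++. This is an illustrative sketch only, not the library's own code; the names `distSq` and `lloydStep` are hypothetical, and points are represented naively as vectors of doubles.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double distSq(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// One stage of Lloyd's algorithm: assign each data point to its
// nearest center (forming the neighborhood V(z)), then move each
// center to the centroid of its neighborhood.
void lloydStep(const std::vector<Point>& data, std::vector<Point>& centers) {
    std::size_t dim = centers[0].size();
    std::vector<Point> sums(centers.size(), Point(dim, 0.0));
    std::vector<std::size_t> counts(centers.size(), 0);

    for (const Point& p : data) {
        std::size_t best = 0;
        double bestD = distSq(p, centers[0]);
        for (std::size_t j = 1; j < centers.size(); ++j) {
            double d = distSq(p, centers[j]);
            if (d < bestD) { bestD = d; best = j; }
        }
        for (std::size_t i = 0; i < dim; ++i) sums[best][i] += p[i];
        ++counts[best];
    }
    // A center whose neighborhood is empty is left where it is.
    for (std::size_t j = 0; j < centers.size(); ++j)
        if (counts[j] > 0)
            for (std::size_t i = 0; i < dim; ++i)
                centers[j][i] = sums[j][i] / counts[j];
}
```

In practice one would repeat `lloydStep` until the centers stop moving (or the change in distortion falls below a threshold).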

However, Lloyd's algorithm can get stuck in local minima that
are far from optimal. For this reason it is
common to consider heuristics based on local search, in
which centers are swapped in and out of an existing solution
(typically at random). Such a swap is accepted if it decreases
the average distortion, and otherwise it is ignored. It is also
possible to combine these two approaches (Lloyd's algorithm and
local search), producing a type of hybrid solution.
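The accept-or-ignore rule for a single swap can be sketched as follows, again as an illustration rather than the library's actual code; `distortion` and `trySwap` are hypothetical names, and for simplicity the sketch recomputes the distortion from scratch rather than updating it incrementally.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double distSq(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Average distortion: mean squared distance from each data point
// to its nearest center.
double distortion(const std::vector<Point>& data,
                  const std::vector<Point>& centers) {
    double total = 0.0;
    for (const Point& p : data) {
        double best = std::numeric_limits<double>::max();
        for (const Point& z : centers)
            best = std::min(best, distSq(p, z));
        total += best;
    }
    return total / data.size();
}

// Try replacing centers[i] with a candidate center.  The swap is
// kept only if it decreases the average distortion; otherwise it
// is undone (i.e., ignored).
bool trySwap(const std::vector<Point>& data, std::vector<Point>& centers,
             std::size_t i, const Point& candidate) {
    double before = distortion(data, centers);
    Point old = centers[i];
    centers[i] = candidate;
    if (distortion(data, centers) < before) return true;  // accept
    centers[i] = old;                                     // reject: undo
    return false;
}
```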

This program provides a number of different algorithms for doing
k-means clustering based on these ideas and combinations.
Further information can be found in the software documentation and
in the research papers whose abstracts appear below.

Four different procedures for performing k-means have been
implemented here. They differ in how the local search swaps and
the iterations of Lloyd's algorithm are combined.

Lloyd's:

Repeatedly applies Lloyd's algorithm from a starting set of
centers, iterating until convergence.

Swap:

A local search heuristic, which works by performing swaps
between existing centers and a set of candidate centers.

EZ_Hybrid:

A simple hybrid algorithm, which does one swap
followed by some number of iterations of Lloyd's.

Hybrid:

A more complex hybrid of Lloyd's and Swap, which performs
some number of swaps followed by some number of iterations
of Lloyd's algorithm. To avoid getting trapped in local
minima, an approach similar to simulated annealing is included
as well.
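
The simulated-annealing-style escape from local minima mentioned for the Hybrid procedure can be illustrated with a standard annealing acceptance rule. This is a generic sketch under that assumption, not the criterion this software actually uses, and `acceptMove` is a hypothetical name: a worse solution is accepted with probability exp(-Δ/T), where Δ is the increase in distortion and T is a temperature that decreases over time.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Simulated-annealing-style acceptance rule (a generic sketch; the
// actual acceptance criterion used by this software may differ).
// A new configuration with distortion newDist replaces the current
// one with distortion curDist if it is better, or otherwise with
// probability exp(-(newDist - curDist) / temp).  Lowering temp
// over time makes uphill moves increasingly unlikely.
bool acceptMove(double curDist, double newDist, double temp,
                std::mt19937& rng) {
    if (newDist < curDist) return true;  // always accept improvements
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-(newDist - curDist) / temp);
}
```

At high temperature almost any swap is accepted; as the temperature falls, the rule degenerates to the plain improve-only rule of the Swap procedure.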

Software:

The current version of the software, Version 1.7.2, is provided
in a bundle (see below), which contains the copyright notice,
source code, documentation, and a test program. (This version is
functionally identical to Version 1.7.) Unfortunately, the
documentation is not very extensive at this time. A Makefile has
been provided, which builds under Unix on Sun Solaris and Red Hat
Linux using the g++ compiler, and I have also compiled the software
under Microsoft Visual Studio .NET.
The directory KMLwin32/bin contains a version compiled for Microsoft
Windows.

Paper Abstracts:

In k-means clustering we are given a set of n data points in
d-dimensional space and an integer k, and the problem is to
determine a set of k points in d-space, called centers,
so as to minimize the mean squared distance from each data point
to its nearest center. A popular heuristic for k-means
clustering is Lloyd's algorithm. We present a simple and
efficient implementation of Lloyd's k-means clustering
algorithm, which we call the filtering algorithm. This
algorithm is easy to implement, requiring a kd-tree as the only
major data structure. We establish the practical efficiency of
the filtering algorithm in two ways. First, we present a
data-sensitive analysis of the algorithm's running time, which
shows that the algorithm runs faster as the separation between
clusters increases. Second, we present a number of empirical
studies both on synthetically generated data and on real data
sets from applications in color quantization, data compression,
and image segmentation.

In k-means clustering we are given a set of n data points in
d-dimensional space and an integer k, and the problem is to
determine a set of k points to minimize the mean squared
distance from each data point to its nearest center. No exact
polynomial-time algorithms are known for this problem. Although
asymptotically efficient approximation algorithms exist, these
algorithms are not practical due to the very high constant
factors involved. There are many heuristics that are used in
practice, but we know of no bounds on their performance.

We consider the question of whether there exists a simple and
practical approximation algorithm for k-means clustering. We
present a local improvement heuristic based on swapping centers
in and out. We prove that this yields a
(9+ε)-approximation algorithm. We present an example
showing that any approach based on performing a fixed number of
swaps achieves an approximation factor of at least (9-ε)
in all sufficiently high dimensions. Thus, our approximation
factor is almost tight for algorithms based on performing a
fixed number of swaps. To establish the practical value of the
heuristic, we present an empirical study that shows that, when
combined with Lloyd's algorithm, this heuristic performs quite
well in practice.