READING

There are several drawbacks in using $k$-means clustering, among others the sensitivity to outliers, the problem of how to choose $k$ or how to initialize the $k$ centers [1]. Although the problem of initializing the centers is a well-studied problem (e.g. using $k$-means++ [2]), in practice, $k$-means is often run multiple times with different initial centers to get good results. For such approaches, the runtime of $k$-means - especially in high dimensions with a large set of samples - is crucial. Elkan proposes to use the triangle inequality in order to reduce the number of distance calculations and, thus, reduce runtime. His algorithm is based on the following lemmata:

Lemma 1: Let $d$ be a distance function. Further, let $x$ be a data point and $c, c'$ two centers. Then it holds:

if $d(c, c') \geq 2 d(x, c')$, then $d(x,c) \geq d(x, c')$.(1)

Lemma 2: Let $d$ be a distance function, $x$ be a data point and $c, c'$ centers. Then:

$d(x,c) \geq \max\{0, d(x, c') - c(c, c')\}$.

In practice, Lemma 1 is used to reduce the number of distance calculations by computing the distances $d(c, c')$ and applying equation (1). Lemma 2 is used to maintain a lower bound on the distances of each point $x$ to each center $c$, see the paper for details.