Abstract

We analyze online [5] and mini-batch [16]k-means variants. Both scale up the widely used k-means algorithm via stochastic approximation, and have become popular for large-scale clustering and unsupervised feature learning.
We show, for the first time, that starting with any initial solution, they converge to a “local optimum” at rate O(1t) (in terms of the k-means objective) under general conditions.
In addition, we show if the dataset is clusterable, when initialized with a simple and scalable seeding algorithm, mini-batch k-means converges to an optimal k-means solution at rate O(1t) with high probability.
The k-means objective is non-convex and non-differentiable: we exploit ideas from recent work on stochastic gradient descent for non-convex problems [8, 3] by providing a novel characterization of the trajectory of k-means algorithm on its solution space, and circumvent the non-differentiability problem via geometric insights about k-means update.

Lloyd’s algorithm (batch k-means) [13] is one of the most popular heuristics for clustering [9].
However, at every iteration it requires computation of the closest centroid to every point in the dataset. Even with fast implementations such as [7], which reduces the computation for finding the closest centroid of each point, the per-iteration running time still depends linearly on n, making it a computational bottleneck for large datasets.
To scale up the centroid-update phase, a plausible recipe is the “stochastic approximation” scheme [4]: the overall idea is, at each iteration, the centroids are updated using one (online [5]) or a few (mini-batch [16]) randomly sampled points instead of the entire dataset. In the rest of the paper, we refer to both as stochastic k-means, which we formally present as Algorithm 1.
Empirically, stochastic k-means has gained increasing attention for large-scale clustering and is included in widely used machine learning packages, such as Sofia-ML[16] and scikit-learn[14].
Figure 3 demonstrates the efficiency of stochastic k-means against batch k-means on the RCV1 dataset [12]. The advantage is clear, and the results raise some natural questions:
Can we characterize the convergence rate of stochastic k-means?
Why do the algorithms appear to converge to different “local optima”?
Why and how does mini-batch size affect the quality of the final solution?
Our goal is to address these questions rigorously.

Our main contribution1 is the O(1t) global convergence of stochastic k-means. Our key idea is to keep track of the distance from the algorithm’s current clustering solution to the set of all “local optima” of batch k-means: when the distance is large with respect to all local optima, the drop in k-means objective after one iteration of stochastic k-means update is lower bounded; when the algorithm is close enough to a local optimum, the algorithm will be trapped by it and maintains the O(1t) convergence rate thanks to a geometric property of local optima.

{subfigure}

0.46
{subfigure}0.5

Figure 1: Figure from [16], demonstrating the relative performance
of online, mini-batch, and batch k-means.
Figure 2: An illustration of one run of Lloyd’s algorithm:
the arrows represent k-means updates in action.
Figure 3:

Notation

The input to our clustering problem is a discrete dataset X; ∀x∈X, x∈Rd.
We use letter A to denote clusterings of X;
we use letter C to denote a set of k-centroids.
Superscripts are used to indicate a particular clustering, e.g., At denotes the clustering at the t-th iteration in Algorithm 1 (or batch k-means);
subscripts indicate individual clusters in a clustering: cr denotes the r-th centroid in C corresponding to the r-th cluster Ar.
We use letter n to denote cardinality, n=|X|, nr=|Ar|, etc.
Fix a point set Y, we let m(Y) denote the mean of Y. Each clustering A:={As,s∈[k]} induces a unique set of centroids m(A):={m(As),s∈[k]}.
Fix C={cr,r∈[k]}, we let V(cr) denote the Voronoi cell belonging to centroid cr, i.e., {x∈Rd,∥x−cr∥≤∥x−cs∥,∀s≠r}, and we use V(C) to denote the Voronoi diagram induced by a set of centroids C.
Fix C with k-centroids, we denote its k-means cost with respect to a k-clustering A by
ϕ(C,A):=∑kr=1∑x∈Ar∥x−cr∥2
(or simply ϕ(C) when A is induced by the Voronoi diagram of C).
We let ϕ∗:=ϕ(C∗), ϕt:=ϕ(Ct), and let ϕ∗r (ϕtr) denote the k-means cost of cluster A∗r (Atr). ϕopt and ϕoptr are similarly defined for the cost of an optimal k-means clustering.
We use π:[k]→[k] to denote permutations between two sets of the same cardinality.

First, we develop insights for batch k-means to facilitate our analysis of stochastic k-means.2
Batch k-means initializes the position of k centroids C0 via a seeding algorithm. Then ∀t≥1,

Obtain At by assigning each x to its closest centroid (Voronoi cell).

For all Atr that is not empty, obtain ctr:=m(Atr).

The algorithm alternates between two solution spaces: the continuous space of all k-centroids, which we denote by {C}, and the finite set of all k-clusterings, which we denote by {A}. One problem with k-means is it may produce degenerate solutions: if the solution Ct has k centroids, it is possible that data points are mapped to only k′<k centroids. To handle degenerate cases, starting with |C0|=k, we
consider an enlarged clustering space {A}[k], which is the union of all k′-clusterings
with 1≤k′≤k.
Our key idea is that {C} can be partitioned into equivalence classes by the clustering they induce on X, and the algorithm stops if and only if two consecutive iterations stay within the same equivalence class in {C}.
Specifically, for any C, let
v(C):=V(C)∩X∈{A}[k] denote the clustering they induce on X via their Voronoi diagram.
Then C1,C2 are in the same equivalence class in {C} if v(C1)=v(C2)=A.
In lieu of this construction, the algorithm goes from {C} to {A}[k] via mapping v:{C}→{A}[k];
it goes from {A}[k] to {C} via the mean operation m:{A}[k]→{C}.
Figure 3 illustrates how batch k-means alternates between
two solution spaces until convergence.
We use the pre-image v−1(A)∈{C} to denote the equivalence class induced by clustering A. When batch k-means visits a clustering At, if m(At)∉v−1(At), the algorithms jumps to another clustering At+1.
If m(At)∈v−1(At), the algorithm stops because At+1=At and m(At+1)=m(At).
We thus formalize the idea of “local optima” of batch k-means as below.

Definition 1 (Stationary clusterings).

We call A∗ a stationary clustering of X, if m(A∗)∈Cl(v−1(A∗)).
We let {A∗}[k]⊂{A}[k] denote the set of all stationary clusterings of X
with number of clusters k′∈[k].

The operator Cl(⋅) denotes the “closure” of an equivalence class v−1(A∗), which includes its “boundary points”, a set of centroids that induces ambiguous clusterings (this happens when there is a data point on the bisector of two centroids in a solution; see Appendix A for details). For each A∗, we define a matching centroidal solution C∗.

Definition 2 (Stationary points).

For a stationary clustering A∗ with k′ clusters, we define C∗={c∗r,r∈[k′]} to be a stationary point corresponding to A∗, so that ∀A∗r∈A∗, c∗r:=m(A∗r).
We let {C∗}[k] denote the corresponding set of all stationary points of X with k′∈[k].

As preluded in the introduction, defining a distance measure on {C} is important to our subsequent analysis; For C′ and C, we let
Δ(C′,C):=minπ:[k]→[k]∑rnr∥c′π(r)−cr∥2, where nr=|Ar|.
This distance is asymmetric and non-negative, and evaluates to zero if and only if two sets of centroids coincide. In addition, if C∗ is a stationary point, then for any solution C, Δ(C,C∗) upper bounds the difference of k-means objective, ϕ(C)−ϕ(C∗) (Lemma 11).

2.1 Stochastic k-means

Algorithm 1 and stochastic k-means [5, 16] are equivalent up to the choice of learning rate and sampling scheme (the proof of equivalence is in Appendix A).
In [5, 16], the per-cluster learning rate is chosen as
ηtr:=^ntr∑i≤t^nir; in our analysis, we choose a flat learning rate ηt=c′to+t for all clusters, where c′,to>0 are some fixed constants (empirically, no obvious differences are observed; see Section 3.1 for more discussion and Section 4 for empirical comparison).

At iteration t (t≥1), obtain sample St⊂X of size m uniformly at random with replacement;
set count ^ntr←0 and set Str←∅, ∀r∈[k]

fors∈Stdo

Find I(s) s.t. cI(s)=C(s)

StI(s)←StI(s)∪s; ^ntI(s)←^ntI(s)+1

endfor

forct−1r∈Ct−1do

if^ntr≠0then

ctr←(1−ηtr)ct−1r+ηtr^ctr with
^ctr:=∑s∈Strs^ntr

endif

endfor

untilconvergence_criterion is satisfied

Algorithm 1 Stochastic k-means

Unlike a usual gradient-based algorithm on a continuous domain, the discrete nature of the k-means problem causes major differences when stochastic approximation is applied.

First, for batch k-means, at every iteration t, Ct is chosen as the means, m(At), of the current k-clustering At. Since {A}[k] is finite and m(A) is unique for a fixed A, the set of “legal moves” of batch k-means is finite, while {C} is continuous.
In stochastic k-means, however, Ct can be any member of {C} due to both stochasticity and the learning rate. As such, its effective solution space on {C} is continuous and thus infinite. Our framework enables us to impose a finite structure on {C} by mapping points in {C} to points in {A}[k].

Second, in stochastic k-means, centroids are usually updated asynchronously, especially when the mini-batch size is small compared to k; in the extreme case of online k-means, centroids are updated one at a time.
Since the positions of centroids implicitly determine the clustering assignment, asynchronous k-means updates will lead to a different solution path than fully synchronized ones, even if we ignore the noise introduced by sampling and choose the learning rate to be 1.
Due to asynchronicity, it is also hard to detect when stochastic k-means produces a degenerate solution, since centroids may fail to be updated for a long time due to sampling. In practice, implementation such as scikit-learn re-locate a centroid randomly if it fails to be updated for too long. In Algorithm 1, we do not allow re-location of centroids and our analysis subsumes the degenerate cases.

When analyzing Algorithm 1, we let Ω denote the sample space of all outcomes (C0,C1,…,Ct,…) and Ft the natural filtration of the stochastic process up to t (Ω is also used in our statements of Theorems 1 and 2 as the Big-Omega).
We let
ptr(m):=Pr{ct−1r~{}is updated at~{}t~{}with mini-%
batch size~{}m}.

With two weak assumptions on the properties of stationary points of a dataset X, our first result shows stochastic k-means has an overall O(1t) convergence rate to a stationary point of batch k-means, regardless of initialization.

(A0)

We assume all stationary points are non-boundary points, i.e., ∀A∗∈{A∗}[k], m(A∗)∈v−1(A∗)
(By Lemma 5, under this assumption, ∃rmin>0 such that all stationary points are (rmin,0)-stable).

Then
starting from any initial set of k-centroids C0,
Let G denote the event
{∃T,∃A∗∗∈{A∗}[k]~{}s.t.~{}At=A∗∗,∀t≥T}.
Then
Pr(G)≥1−δ,
and there exists events parametrized by A∗∗, denoted by
Go(A∗∗), such that
Pr{∪A∗∗∈{A∗}[k]Go(A∗∗)}≥1−δ. And for any
Go(A∗∗), we have
∀t≥1,
E{ϕt−ϕ∗∗|Go(A∗∗)}=O(1t),~{}~{}where~{}~{}ϕ∗∗:=ϕ(m(A∗∗)).

This result guarantees the expected O(1t) convergence of Algorithm 1
towards a stationary point under a general condition.
However, it does not guarantee Algorithm 1 converges to the same stationary clustering as its batch counterpart, even with the same initialization. Moreover, it is even possible that the final solution becomes degenerate.

Algorithm parameters in the convergence bound

The exact convergence bound of Theorem 1, which we hide in the Big-O notation, reveals dependence of the convergence on the algorithm’s parameters.
When c′ is sufficiently large, the exact bound we obtain is likely dominated by
a2maxBβ−1(t0+2t0+1)β+11t0+t+1
where β=2c′p+min(m)(1−√~ϕtϕt)
and
amax:=c′ρ(m).
The bound becomes tighter when c′ decreases.
Since the 1t-order convergence requires a sufficiently large c′, our analysis suggests we should choose c′ to be neither too large nor too small.
The bound also becomes tighter when p+min(m) and ρ(m) becomes closer to 1. Both depend on m; less obviously, they also depend on the number of non-degenerate clusters.
Since ptr=1−(1−ntrn)m, which equals zero if and only if ntr=0,
p+min(m)=1−(1−minr,t;ptr>0ptr(1))m<1−(1−1k′)m, where k′ is the smallest number of non-degenerate centroids in a run. The similar holds for ρ(m).
This suggests the convergence rate may be slower with larger k′, which depends on the initial number of clusters k, and smaller m. For experimental results of the effect of algorithm parameters on convergence, see Section 4.
A detailed explanation on the exact bound is included in Appendix B.

The general convergence result in Theorem 1 is applicable to any seeding algorithm, including random sampling, which is probably the most scalable seeding heuristic one can use. However, it does not provide performance guarantee with respect to the k-means objective. We next show that stochastic k-means initialized by Algorithm 2 converges to a global optimum of k-means objective at rate O(1t), under additional geometric assumptions on the dataset.
The major advantage of Algorithm 2 over other choices of initialization, such as the k-means++[1] or random sampling, is that its running time is independent of the data size while providing seeding guarantee. When using it with stochastic k-means, both the seeding and the update phases are independent of the data size, making the overall algorithm highly scalable.

Let (Aopt,Copt) denote an optimal k-means clustering of X. We assume, similar to [11], that the means of each pair of clusters are well-separated and that the points from the two clusters are separated by a “margin”, that is, ∀x∈Aoptr∪Aopts, the distance from x to the bisector of coptr and copts is lower bounded.
Formally, let ¯x denote the projection of x onto the line joining coptr,copts, the margin between the two clusters is defined as
Δrs:=minx∈Aoptr∪Aopts|∥¯x−coptr∥−∥¯x−copts∥|.
In addition, we require that the size of the smallest cluster is not too small compared to the data size.
The geometric assumptions are formally defined as below.

(B1)

Mean separation: ∀r∈[k],s≠r, ∥m(Aoptr)−m(Aopts)∥≥f(α)√ϕopt(1√noptr+1√nopts), with f(α)>max{642,5α+5256α,maxr∈[k],s≠rnoptrnopts} for some α∈(0,1) that is sufficiently small.

(B2)

Existence of margin: ∀r∈[k],s≠r,
Δrs≥γ∥m(Aoptr)−m(Aopts)∥, for some γ>8√2√f.

(B3)

Cluster balance: pmin≥γ162f(α)+√α, where pmin:=minr∈[k]noptrn.

Theorem 2.

Assume (B1)(B2) (B3) hold for an optimal k-means clustering Aopt.
Fix any ξ>0, if f(α) in addition satisfies
f(α)≥5√12wminln(2ξpminln2kξ),
and we choose mo in Algorithm 2 such that
log2kξpmin<mo<ξ2exp{2(f(α)4−1)2w2min}.
And if we run Algorithm 1 initialized by Algorithm 2 and choosemini-batch size m and learning rate of the form
ηt=c′t+to so that

m>1~{}~{}~{}and~{}~{}~{}c′>12[1−√α−(1−√α)m]

and

to=Ω{[(c′)2ρ(m)2(2c′(ρ(m)−√α)−1)ln1δ]22c′(ρ(m)−√α)−1}

where
ρ(m):=1−[1−(pmin−γ162f(α))]m.
Then ∀t≥1, there exists event Gt⊂Ω s.t.

Interestingly, we cannot provide performance guarantee for online k-means (m=1) in Theorem 2.
Our intuition is, instead of allowing stochastic k-means to converge to any stationary point as in Theorem 1, it studies convergence to a fixed optimal solution; a larger m provides more stability to the algorithm and prevents it from straying away from the target solution.
The proposed clustering scheme is reminiscent to the Buckshot algorithm[6], widely used in the domain of document clustering.
Readers may wonder how can the algorithm approach ϕopt for an NP-hard problem. The reason is that our geometric assumptions softens the problem.
In this case, Algorithm 1 converges to the same (optimal) solution as its batch counterpart, provided the same initialization.

3.1 Related work and proof outline

Our major source of inspiration comes from recent advances in non-convex stochastic optimization for unsupervised learning problems [8, 3].
[8] studies the convergence of stochastic gradient descent (SGD) for
the tensor decomposition problem, which amounts to finding a local optimum of a non-convex objective function composed exclusively of saddle points and local optima.
Inspired by their analysis framework, we divide our analysis of Algorithm 1 into two phases, that of global convergence and local convergence, indicated by the distance from the current solution to stationary points, Δ(Ct,C∗).
We use Δt:=Δ(Ct,C∗) as a shorthand.

Significant decrease in k-means objective when Δt is large

In the global convergence phase, the algorithm is not close to any stationary point, i.e.,
∀C∗∈{C∗}[k], Δ(Ct,C∗) is lower bounded.
We first prove a black-box result showing that on non-stationary points, stochastic k-means decreases the k-means objective in expectation at every iteration.

Lemma 1.

By Lemma 1, the term minr,t;pt+1r(m)>0[ηt+1rpt+1r(m)]ϕt(1−√~ϕtϕt)
lower bounds the per iteration drop in k-means objective.
Since minr;pt+1r(m)>0pt+1r(m)≥1n, ϕt≥ϕopt,
and 1−~ϕtϕt=ϕt−~ϕtϕt=Δ(Ct,Ct+1)ϕt≥0 (the second equality is by Lemma 12), the drop is always lower bounded by zero.
We show that if Δ(Ct,C∗) is bounded away from zero (for any stationary point C∗), then so is Δ(Ct,Ct+1): the rough idea is, in case Ct+1 is a non-stationary point, Ct and Ct+1 must belong to different equivalence classes, as discussed in Section 2, and their distance must be lower bounded by Lemma 4; otherwise, Ct+1 is a stationary point, and by our assumption their distance is lower bounded.
Thus,
Δ(Ct,Ct+1)ϕt is lower bounded by a positive constant, and so is 1−√~ϕtϕt.
Since we chose ηt:=Θ(1t), the expected per iteration drop of cost is of order Ω(1t), which forms a divergent series; after a sufficient number of iterations the expected drop can be arbitrarily large. We conclude that Δ(Ct,C∗) cannot be bounded away from zero asymptotically, since the k-means cost of any clustering is positive. This result is presented in Lemma 2.

Before proceeding, note
the drop increases with
minr,t;pt+1r(m)>0[ηt+1rpt+1r(m)].
This means if we can set a cluster-dependent learning rate that adapts to pt+1r(m), the drop could be larger than choosing a flat rate, as we do in our analysis.
The learning rate in [16, 5], where ηtr:=^ntr∑i≤t^nir, is intuitively making this adaptation:
in the case when the clustering assignment does not change between iterations, it can be seen that
Eηtr≈1tptr(m), so the effective learning rate ηtrptr(m) is balanced for different clusters and is roughly 1t. However, in practice, the clustering assignment changes when the centroids are updated, and it is hard to analyze this adaptive learning rate due to its random dependence on all previous sampling outcomes.

Lemma 2.

Assume (A1) holds.
If we run Algorithm 1 on X with ηt=at+to,a≥12p+min(m), with p+min(m) as defined
in Theorem 1,
and to>1, and choose any initial set of k centroids C0.
Then for any δ>0, ∃t s.t. Δ(Ct,C∗)≤δ with C∗:=m(A∗) for some A∗∈{A∗}[k].

Lemma 2 suggests that, starting from any initial point C0 in {C}, the algorithm always approaches a stationary point asymptotically, ending its global convergence phase after a finite number of iterations. We next examine its local behavior around stationary points.

Local convergence to stationary points when Δt is small

To obtain local convergence result, we first define “local attractors” and “basin of attraction” for batch k-means; the natural candidate for the former is the set of stationary points; a basin of attraction is a subset of the solution space so that once batch k-means enters it, it will not escape.

Definition 3.

We call C∗ a (b0,α)-stable stationary point of batch k-means if it is a stationary point and for any clustering C such that
Δ(C,C∗)≤b′ϕ∗, b′≤b0,
the Voronoi partition induced by {cr}, denoted by {Ar}, satisfies
maxr|Aπ(r)△A∗r|n∗r≤b5b+4(1+ϕ(C)/ϕ∗), where
π is the permutation achieving Δ(C,C∗),
with b≤αb′ for some α∈[0,1).

We can view b0 as the radius of the basin and α the degree of the “slope” leading to the attractor.
A key lemma to our analysis characterizes the iteration-wise local convergence property of batch k-means around stable stationary points, whose convergence rate depends on the slope α.

Lemma 3.

Let C∗ be a (b0,α)-stable stationary point. For any C such that
Δ(C,C∗)≤b′ϕ∗, b′≤b0,
apply one step of batch k-means update on C results in a new solution C1 such that
Δ(C1,C∗)≤αb′ϕ∗.

Lemma 3 resembles the standard iteration-wise convergence statement in SGD analysis, typically via convexity or smoothness of a function [15]. Here, we have neither at our dispense (we do not even have a gradient). Instead, our analysis relies on the geometric property of Voronoi diagram and the mean operation used in a k-means iteration, similar to those in recent works on batch k-means [11, 2, 17].
Although this lemma applies only to batch k-means, our hope is that stochastic k-means has similar iteration-wise convergence behavior in expectation even in the presence of noise.

The difficulty here is, due to non-convexity, the convergence result only holds within a local basin of attraction: if the algorithm’s solution is driven off the current basin of attraction by stochastic noise at any iteration, it may converge to a different attractor, causing trouble to our analysis.
To deal with this, we exploit probability tools developed in [3]. [3] studies the convergence of stochastic PCA algorithms, where the objective function is the non-convex Rayleigh quotient, which has a plateau-like component. The tools developed there were used to show that stochastic PCA gradually escapes the plateau. Here, we adapted their analysis to show Algorithm 1 stays within a basin of attraction with high probability, and converges to the attractor at rate O(1t).
We define a nested sequence of events Ωi⊂Ω:

Ω⊃Ω1⊃⋯⊃Ωi⊃…

where Ωi:={Δt≤b0ϕ∗,∀t<i}.
Then if Algorithm 1 is within the basin of attraction of a stable stationary point at time t, the event that it escapes this local basin of attraction is contained in the
event ∪t≥i+1Ωt−1∖Ωt.
We upper bound the probability of this bad event (Proposition 1) using techniques that derive tight concentration inequality via moment generating functions from [3], which in turn implies a lower bound on the probability of Ωt, t≥i.
Then conditioning on Ωt and adapting Lemma 3 from batch to stochastic k-means proves the expected local convergence rate of O(1t).

Theorem 3.

Assume (A1) holds.
Let C∗ be a (b0,α)-stable stationary point,
and let Δi=bϕ∗ for some b≤12b0 at some iteration i
in Algorithm 1.
Let
at:=maxrptr(m)minspts(m).
Suppose we set c′ and m sufficiently large so that

Note how, in contrast to Theorem 1,
the local convergence result does not allow degeneracy here, by implicitly requiring that minr,tptr(m)>0.
This is reasonable, since a degenerate set of k centroids cannot converge to a fixed stationary point
with k centroids.

Building on the ideas introduced above, we present the key ingredients of our main theorems.

To prove Theorem 1, we first show that all stationary points satisfying (A0) must be
(rmin,0)-stable for some rmin>0 (Lemma 5).
Then we apply results from both the global and local convergence analysis of Algorithm 1.
We define the global convergence phase as when Δ(Ct,C∗∗)>12rminϕ∗, ∀C∗∗∈{C∗}[k]. As discussed, since Δ(Ct,C∗∗) is bounded away from zero, the per-iteration drop of k-means cost is of order Ω(ηt), thus we get the expected O(1t) convergence rate.
Lemma 2 suggests that this phase will eventually end and at some iteration T,
Δ(CT,C∗∗)≤12rminϕ∗∗ must hold for some stationary point C∗∗.
The first time when this happens signals the beginning of the local convergence phase:
at this point, CT is in the basin of attraction of C∗∗ since C∗∗ is (rmin,0)-stable.
Thus, applying Theorem 3 implies that in this phase the algorithm converges to C∗∗ locally at rate O(1t).
Hence, the overall global convergence rate is O(1t).

Here we only apply the local convergence result.
The key step is to show that our clusterability assumption on the dataset, (B1) and (B2), implies that its optimal k-means solution, Copt, is a (b0,α)-stable stationary point with a sufficiently large b0 (Proposition 2). Then we adapt results from [17] to show that the seeds obtained from Algorithm 2 are within the basin of attraction of Copt with high probability (Lemmas 13). Using the other geometric assumption, (B3), we apply Theorem 3 to show an O(1t) convergence to Copt with high probability.

To verify the O(1t) global convergence rate of Theorem 1, we run stochastic k-means with varying learning rate, mini-batch size, and k on RCV1[12]. The dataset is relatively large in size: it has manually categorized 804414 newswire stories with 103 topics, where each story is a 47236-dimensional sparse vector; it was used in [16] for empirical evaluation of mini-batch k-means.
We used Python and its scikit-learn package [14] for our experiments, which has stochastic k-means implemented. We disabled centroid relocation and modified their source code to allow a user-defined learning rate (their learning rate is fixed as ηtr:=^ntr∑i≤t^nir, as in [5, 16], which we refer to as BBS-rate).

Figure 5 shows the convergence in k-means cost of stochastic k-means algorithm over 100 iterations for varying m and k; fix each pair (m,k), we initialize Algorithm 1 with a same set of k random centroids and run stochastic k-means with varying learning rate parameters (c′,to), and we average the performance of each learning rate setup over 5 runs to obtain the original convergence plot.
Figure 5 is an example of a convergence plot before transformation. The dashed black line in each log-log figure is ϕ0−ϕmint, a function of order Θ(1t). To compare the performance of stochastic k-means with this baseline, we first transform the original ϕt vs t plot to that of ϕt−ϕmin vs t. By Theorem 1, E[ϕt−ϕ∗∗|G(A∗∗)]=O(1t), so we expect the slope of the log-log plot of ϕt−ϕ∗∗ vs t to be at least as large as that of Θ(1t). Although we do not know the exact cost of the stationary point, since the algorithm has reached a stable phase over 100 iterations, as illustrated by Figure 5, we simply use ϕmin as an estimate of ϕ∗∗.
Most log-log convergence graphs fluctuate around a line with a slope at least as steep as that of Θ(1t), and do not seem to be sensitive to the choice of learning rate in our experiment. Note in some plots, such as Figure 5, the initial phase is flat. This is because we force the plot to start at ϕ0−ϕmin instead of their true intercept on the y-axis. BBS-rate exhibits similar behavior to our flat learning rates.
Our experiment suggests the convergence rate of stochastic k-means may be sensitive to the ratio mk; for larger m or smaller k, faster and more stable convergence is observed.

To facilitate our analysis of mini-batch k-means, we build a framework to
track the solution path produced by batch k-means;
it alternates between two solutions spaces: the space of all k-centroids, which we denote by {C}, and the space of all k-clusterings, which we denote by {A}.
Note the latter is a finite set. Throughout the paper, we use π:[k]→[k] to denote permutations between two sets of the same cardinality.

Degenerate cases

One problem with batch k-means algorithm is it may produce degenerate solutions: if the solution Ct has k centroids, it is possible that data points are mapped to only k′<k centroids. To handle degenerate cases, starting with |C0|=k, we will
consider the enlarged clustering space {A}[k], the union of all k′-clusterings, which we denote by {A}k′,
with 1≤k′≤k.

Partitioning {C} via {A}

Our first observation is that most part of {C} (with exception discussed below) can be partitioned into equivalence classes by the clustering they induce on X;
for any C, let
v(C):=V(C)∩X∈{A}[k] to formally denote the clustering they induce via their Voronoi diagram.
For most points in {C}, there is only one such clustering so v(C) is uniquely determined. We define v−1(A) as the set of points inducing a unique k′-clustering A, with k′≤k.
Then we let C1,C2 be in the same partition in {C} if v(C1),v(C2) are both unique and v(C1)=v(C2).

Ambiguous cases

However, it is not always clear which partition C belongs to: if
∃x∈X such that ∥x−c1(x)∥=∥x−c2(x)∥, where c1(x),c2(x) denote the centroids in C that are closest and second closest to x, x can be clustered into either the cluster of centroid c1(x) or that of c2(x). Centroids with this property can induce two or more clusterings due to ambiguity of Voronoi partition.
Intuitively, they are at the boundary of v−1(Ai), for some clusterings Ai.
We formalize v−1(A) and boundary points as below.

Definition 4 (members of v−1(A)).

Fix a clustering A={A1,…,Ak}, we define
C∈v−1(A) if it contains a matching set of centroids and there exists a permutation of [k]
s.t.
∀x∈Ar,∀r∈[k],

∥x−cπ(r)∥<∥x−cπ(s)∥,∀s≠r

We say C is a boundary point of v−1(A), if
∀r∈[k],∀x∈Ar,

∥x−cπ(r)∥≤∥x−cπ(s)∥,∀s≠r

with equality attained for at least one data point x.
Let B(v−1(A)) denote the set of all boundary points of v−1(A).
Let Cl(v−1(A)):=v−1(A)∪B(v−1(A)) denote the closure of v−1(A).

Note that in the case C has k′>k centroids, C∈v−1(A) implies
all centroids in C∖{cπ(1),…,cπ(k)} are degenerate.

Representing k-means updates

For now, let C∗ denote a “local minimum” of batch k-means and
suppose C∗ is not a boundary point. Let A∗:=v−1(C∗).
One run of batch k-means can be represented as

C0v(C0)→A1m(A1)→C1→⋯m(At−1)→Ct−1v(Ct−1)→At⋯→A∗→C∗

Figure 3 illustrates how the algorithm alternates between
two solution spaces.
When batch k-means visits a clustering At, if m(At)∉Cl(v−1(At)), the algorithms jumps to another clustering At+1.
Otherwise, if m(At)∈v−1(At), the algorithm stops because At+1=At and m(At+1)=m(At).
In the special case where m(At) is a boundary stationary point,
since the algorithm arbitrarily breaks the tie, then it will continue to operate if the new clustering is chosen such that At+1≠At, or stops if At+1=At.
In practice, it is unlikely to encounter a boundary point due to perturbations in the computing system, and regardless,
a sufficient condition for batch k-means to converge at the t-th iteration is m(At)∉Cl(v−1(At)