9.2 Soft \(K\)-Means

\(K\)-means clustering is a method of clustering data represented as
\(D\)-dimensional vectors. Specifically, there will be \(N\) items to be
clustered, each represented as a vector \(y_n \in \mathbb{R}^D\). In the
“soft” version of \(K\)-means, the assignments to clusters will be
probabilistic.

Geometric Hard \(K\)-Means Clustering

\(K\)-means clustering is typically described geometrically in terms of
the following algorithm, which takes as input the number of clusters \(K\)
and the data vectors \(y_1, \ldots, y_N\).

1. For each \(n\) in \(1{:}N\), randomly assign vector \(y_n\) to a cluster in \(1{:}K\);
2. Repeat:
    a. For each cluster \(k\) in \(1{:}K\), compute the cluster centroid \(\mu_k\) by averaging the vectors assigned to that cluster;
    b. For each \(n\) in \(1{:}N\), reassign \(y_n\) to the cluster \(k\) for which the (Euclidean) distance from \(y_n\) to \(\mu_k\) is smallest;
    c. If no vectors changed cluster, return the cluster assignments.

This algorithm is guaranteed to terminate: each iteration either
reduces the total within-cluster squared distance or leaves the
assignments unchanged, and there are only finitely many possible
assignments of points to clusters.

Soft \(K\)-Means Clustering

Soft \(K\)-means clustering treats the cluster assignments as
probability distributions over the clusters. Because of the
connection between Euclidean distance and multivariate normal models
with a fixed covariance, soft \(K\)-means can be expressed (and coded in
Stan) as a multivariate normal mixture model.

In the full generative model, each data point \(n\) in \(1{:}N\) is assigned
a cluster \(z_n \in 1{:}K\) with symmetric uniform probability,

\[
z_n \sim \mathsf{Categorical}({\bf 1}/K),
\]
where \({\bf 1}\) is the \(K\)-vector of ones, so that \({\bf 1}/K\) is the symmetric point in the \(K\)-simplex, assigning each cluster probability \(1/K\). Thus the model assumes that
each data point is drawn from a hard decision about cluster
membership. The softness arises only from the uncertainty about which
cluster generated a data point.
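The generative process can be sketched as a standalone Stan simulation program, assuming (as the implementation in this section does) unit covariance for each cluster; passing fixed means \(\mu\) as data is an assumption made here purely for illustration:

```stan
data {
  int<lower=0> N;         // number of points to simulate
  int<lower=1> D;         // dimensionality
  int<lower=1> K;         // number of clusters
  array[K] vector[D] mu;  // fixed cluster means (assumed known here)
}
generated quantities {
  array[N] int<lower=1, upper=K> z;
  array[N] vector[D] y;
  for (n in 1:N) {
    // hard cluster assignment with uniform probability 1/K
    z[n] = categorical_rng(rep_vector(1.0 / K, K));
    // draw the point from a unit-covariance normal around its centroid
    y[n] = to_vector(normal_rng(mu[z[n]], 1));
  }
}
```

Running this with fixed-parameter sampling produces data in which each point is generated by a hard cluster decision, matching the description above.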

The sample implementation in this section assumes a fixed unit
covariance matrix shared by all clusters \(k\),
\[
\Sigma_k = \mbox{diag\_matrix}({\bf 1}),
\]
so that the multivariate normal density can be implemented directly, up
to a constant of proportionality, as
\[
\mathsf{MultiNormal}\left( y_n \mid \mu_k, \mbox{diag\_matrix}({\bf 1}) \right)
\propto \exp \left( - \frac{1}{2} \sum_{d=1}^D \left( \mu_{k,d} - y_{n,d}
\right)^2 \right).
\]
The spatial perspective on \(K\)-means arises by noting that the inner
sum is just the squared Euclidean distance from the cluster mean
\(\mu_k\) to the data point \(y_n\).

Stan Implementation of Soft \(K\)-Means

Consider the following Stan program for implementing \(K\)-means
clustering.
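A sketch of such a program, consistent with the model described in this section (the block layout and loop structure are reconstructed, not quoted):

```stan
data {
  int<lower=0> N;        // number of data points
  int<lower=1> D;        // number of dimensions
  int<lower=1> K;        // number of clusters
  array[N] vector[D] y;  // observations
}
parameters {
  array[K] vector[D] mu;  // cluster centroids
}
transformed parameters {
  // log of the unnormalized cluster assignment probabilities
  array[N] vector[K] soft_z;
  for (n in 1:N) {
    for (k in 1:K) {
      soft_z[n, k] = -log(K) - 0.5 * dot_self(mu[k] - y[n]);
    }
  }
}
model {
  // independent standard normal prior on the centroids
  for (k in 1:K) {
    mu[k] ~ std_normal();
  }
  // likelihood, marginalizing the discrete cluster assignments
  for (n in 1:N) {
    target += log_sum_exp(soft_z[n]);
  }
}
```

Marginalizing the discrete assignments with log_sum_exp is what makes the model fit with Stan's continuous samplers.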

There is an independent standard normal prior on the centroid parameters;
this prior could be swapped with other priors, or even a hierarchical
model to fit an overall problem scale and location.

The only parameter is mu, where mu[k] is the centroid for cluster
\(k\). The transformed parameters soft_z[n] contain the log of the
unnormalized cluster assignment probabilities. The vector soft_z[n]
can be converted back to a normalized simplex using the softmax
function (see the functions reference manual), either externally or
within the model’s generated quantities block.
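For example, a generated quantities block along the following lines recovers the normalized membership probabilities (the name z_prob is an assumption chosen here for illustration):

```stan
generated quantities {
  // normalized cluster-membership probabilities for each point
  array[N] simplex[K] z_prob;
  for (n in 1:N) {
    z_prob[n] = softmax(soft_z[n]);
  }
}
```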

Within the multivariate normal version of \(K\)-means, replacing the
unit covariance matrix with a shared covariance matrix amounts to
working with Mahalanobis distances, that is, Euclidean distances in a
space transformed by the inverse covariance matrix.

Although there is no global spatial analog, it is common to see soft
\(K\)-means specified with a per-cluster covariance matrix. In this
situation, a hierarchical prior may be used for the covariance matrices.
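A minimal sketch of per-cluster covariances, parameterized with Cholesky-factored correlation matrices; the names (L_Omega, sigma) and the particular priors shown are assumptions, not prescriptions:

```stan
parameters {
  array[K] vector[D] mu;                     // cluster centroids
  array[K] cholesky_factor_corr[D] L_Omega;  // per-cluster correlations
  array[K] vector<lower=0>[D] sigma;         // per-cluster scales
}
model {
  for (k in 1:K) {
    mu[k] ~ std_normal();
    L_Omega[k] ~ lkj_corr_cholesky(2);  // regularizes toward identity
    sigma[k] ~ lognormal(0, 1);
  }
  for (n in 1:N) {
    vector[K] lps;
    for (k in 1:K) {
      lps[k] = -log(K)
               + multi_normal_cholesky_lpdf(y[n] | mu[k],
                     diag_pre_multiply(sigma[k], L_Omega[k]));
    }
    target += log_sum_exp(lps);  // marginalize cluster assignment
  }
}
```

A hierarchical prior would replace the fixed hyperparameters above with shared parameters estimated jointly across clusters.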