k-means clustering is a very well-known algorithm, found in almost every machine-learning package, that finds cluster centers and assigns points to k different cluster bins. But the missing and, in my opinion, most important part is the choice of a correct k. What is the best value for it? And what is meant by "best"?

I use MATLAB for scientific computing, where looking at silhouette plots is given as a way to decide on k, as discussed here. However, I would be more interested in Bayesian approaches. Any suggestions are appreciated.
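
For readers outside MATLAB, here is a minimal sketch of silhouette-based selection in Python with scikit-learn; the synthetic `X` and the range of k are placeholders for your own data and search range:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in data: three well-separated blobs; replace with your own.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```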

Under visualization-for-clustering there's (ahem) a way to picture k-clusters and see the effect of various k in one shot, using MSTs.
– denis, Mar 1 '12 at 12:34

I've answered this question with half a dozen methods in R over here.
– Ben, Mar 16 '14 at 5:27


Deciding on the "best" number k of clusters implies comparing cluster solutions with different k - which solution is "better". In that respect, the task appears similar to comparing clustering methods - which is "better" for your data. The general guidelines are here.
– ttnphns, Nov 1 '17 at 17:40

11 Answers

This has been asked a couple of times on stackoverflow: here, here and here. You can take a look at what the crowd over there thinks about this question (or a small variant thereof).

Let me also copy my own answer to this question, on stackoverflow.com:

Unfortunately there is no way to automatically set the "right" K, nor is there a definition of what "right" is. There is no principled statistical method, simple or complex, that can set the "right K". There are heuristics and rules of thumb that sometimes work and sometimes don't.

The situation is more general than that, as many clustering methods have this type of parameter, and I think it is a big open problem in the clustering/unsupervised learning research community.

+1 After reading this, it seems so intuitive... but I must say I never thought about it before: that the problem of choosing the number of PCs in PCA is actually equivalent to the problem of choosing the number of clusters in k-means...
– Dov, Feb 10 '12 at 8:38


@Dov these two things are not quite equivalent. There are specific measures that can be used to examine the quality of a PCA solution (most notably reconstruction error, but also % of variance captured, etc.), and these tend to be (mostly) consistent. However in clustering there is often no one "correct answer" - one clustering may be better than another by one metric, and the reverse may be true using another metric. And in some situations two different clusterings could be equally probable under the same metric.
– tdc, Feb 10 '12 at 11:40

@Dov Yes, they are "more or less" like each other, but I was simply saying that the problem of choosing the number of clusters is much more fraught than choosing the number of PCs - i.e. they're not "equivalent".
– tdc, Feb 10 '12 at 13:25


+1 You're right. We introduce some other model or assumption to decide on the best k, but then the question becomes why that model or assumption is the best...
– petrichor, Feb 13 '12 at 12:06

First, a caveat. In clustering there is often no one "correct answer" - one clustering may be better than another by one metric, and the reverse may be true using another metric. And in some situations two different clusterings could be equally probable under the same metric.

If you begin with a Gaussian Mixture model, you have the same problem as with k-means - that you have to choose the number of clusters. You could use model evidence, but it won't be robust in this case. So the trick is to use a Dirichlet Process prior over the mixture components, which then allows you to have a potentially infinite number of mixture components, but the model will (usually) automatically find the "correct" number of components (under the assumptions of the model).

Note that you still have to specify the concentration parameter $\alpha$ of the Dirichlet Process prior. For small values of $\alpha$, samples from a DP are likely to be composed of a small number of atomic measures with large weights. For large values, most samples are likely to be distinct (concentrated). You can use a hyper-prior on the concentration parameter and then infer its value from the data, and this hyper-prior can be suitably vague so as to allow many different possible values. Given enough data, however, the concentration parameter will cease to be so important, and the hyper-prior could be dropped.
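
For a concrete (if simplified) illustration: scikit-learn's BayesianGaussianMixture implements a truncated variational approximation to this DP mixture. Note that it fixes the concentration parameter rather than placing a hyper-prior on it, so $\alpha$ is a knob you choose here:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-4, 0, 4)])

dpgmm = BayesianGaussianMixture(
    n_components=20,  # generous truncation level, not the answer
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,  # the concentration parameter alpha
    random_state=0,
).fit(X)

# Components that keep non-negligible weight are the "effective" clusters.
print(np.sum(dpgmm.weights_ > 1e-2))
```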

A Dirichlet process under what concentration parameter? It is kind of equivalent to the original question: k-means under what k? Though I agree that we understand the Dirichlet distribution better than the behavior of some complex algorithm on some real-world data.
– carlosdc, Feb 10 '12 at 1:15

@carlosdc good point, I've updated the answer to include a bit of discussion about the concentration parameter.
– tdc, Feb 10 '12 at 11:34


In my experience it is much easier to learn a continuous-valued concentration parameter like alpha than it is to determine the number of clusters in a finite mixture model. If you want to stick with a finite mixture model and take a Bayesian tack, there is reversible-jump MCMC (onlinelibrary.wiley.com/doi/10.1111/1467-9868.00095/abstract).
– yeewhye, Feb 19 '12 at 17:33

Start with K=2, and keep increasing it by 1 at each step, calculating your clusters and the cost that comes with the training. At some value of K the cost drops dramatically, and after that it reaches a plateau as you increase it further. That is the K value you want.

The rationale is that beyond this point, increasing the number of clusters only adds new clusters very near the existing ones.
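
A minimal sketch of this procedure, assuming scikit-learn; the synthetic blobs stand in for your own data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

for k in range(2, 11):
    cost = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(cost, 1))  # look for the drop followed by a plateau
```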

Cluster counts depend highly on both your data and what you're going to use the results for. If you're using your data to split things into categories, try to imagine how many categories you want first. If it's for data visualization, make it configurable, so people can see both the large clusters and the smaller ones.

If you need to automate it, you might want to add a penalty for increasing k, and calculate the optimal cluster count that way. Then you just weight the penalty depending on whether you want a ton of clusters or very few.
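
One illustrative way to encode such a penalty is to minimize inertia plus a linear term in k; the penalty form and the weight `lam` below are assumptions, not a standard criterion:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

lam = 50.0  # larger lam penalizes extra clusters more heavily
cost = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ + lam * k
        for k in range(2, 11)}
print(min(cost, key=cost.get))
```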

Essentially, this method evaluates the fit for various values of k. The resulting graph is "L" shaped, with the optimum value of k at the knee. A simple dual-line least-squares fit can be used to find the knee point.

I found the method very slow, because the iterative k-means fit has to be computed for each value of k. I also found that k-means worked best with multiple runs, choosing the best at the end. Even though each data point had only two dimensions, a simple Pythagorean distance could not be used. So that's a lot of calculating.

One thought is to skip every other value of k (say) to halve the calculations, and/or to reduce the number of k-means iterations, and then to slightly smooth the resulting curve to produce a more accurate fit. I asked about this over at StackOverflow - IMHO, the smoothing question remains an open research question.
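
For what it's worth, the dual-line least-squares knee finder is straightforward to sketch: try each breakpoint, fit a line to each side, and keep the breakpoint with the smallest total squared error. The cost curve below is made up so the knee sits at k = 4:

```python
import numpy as np

def find_knee(ks, costs):
    ks, costs = np.asarray(ks, float), np.asarray(costs, float)
    best_k, best_err = None, np.inf
    for i in range(2, len(ks) - 2):  # keep >= 3 points in each segment
        err = 0.0
        for xs, ys in ((ks[:i + 1], costs[:i + 1]), (ks[i:], costs[i:])):
            slope, intercept = np.polyfit(xs, ys, 1)
            err += np.sum((ys - (slope * xs + intercept)) ** 2)
        if err < best_err:
            best_k, best_err = ks[i], err
    return best_k

ks = list(range(2, 11))
costs = [900, 500, 200, 180, 165, 150, 140, 132, 125]
print(find_knee(ks, costs))  # -> 4.0
```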

You need to reconsider what k-means does. It tries to find the optimal Voronoi partitioning of the data set into $k$ cells. Voronoi cells are oddly shaped cells, the dual structure of a Delaunay triangulation.

But what if your data set doesn't actually fit into the Voronoi scheme?

Most likely, the actual clusters will not be very meaningful. However, they may still work for whatever you are doing. Even if a "true" cluster is broken into two parts because your $k$ is too high, the result can still work very well, for example for classification. So I'd say: the best $k$ is the one that works best for your particular task.

In fact, when you have $k$ clusters that are not equally sized and spaced (and thus don't fit into the Voronoi partitioning scheme), you may need to increase $k$ for k-means to get better results.

Although the description of k-means in the first paragraph is not wrong, it may mislead some people into equating this method with Voronoi partitioning based on the original data. This is not so: the partition is based on the locations of the cluster means, which might not (and usually will not) coincide with any of the original data.
– whuber♦, Feb 23 '12 at 15:04

Knowledge driven: you should have some idea of how many clusters you need from a business point of view. For example, if you are clustering customers, ask yourself: after getting these clusters, what do I do next? Maybe you will apply a different treatment to each cluster (e.g., advertising by email or by phone). Then how many possible treatments are you planning? In this example, selecting, say, 100 clusters would not make much sense.

Data driven: too many clusters is over-fitting and too few is under-fitting. You can always split the data in half and run cross-validation to see what number of clusters is good. Note that in clustering you still have a loss function, similar to the supervised setting.
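
A sketch of that split-in-half idea with k-means, assuming scikit-learn; `KMeans.score` returns the negative within-cluster sum of squares, so its negation acts as a held-out loss:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (-3, 0, 3)])
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    print(k, round(-km.score(X_test), 1))  # look for diminishing returns
```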

Finally, in the real world you should always combine the knowledge-driven and data-driven approaches.

As no one has pointed it out yet, I thought I would share this. There is a method called X-means (see this link) that estimates the proper number of clusters using the Bayesian information criterion (BIC). Essentially, this is like trying k-means with different values of K, calculating BIC for each K, and choosing the best K - except that the algorithm does it efficiently.

There is also a Weka implementation, details of which can be found here.
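
X-means itself is not in scikit-learn, but the brute-force selection it speeds up - fit each K and compare BIC - is easy to sketch with GaussianMixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in (-3, 0, 3)])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 9)}
print(min(bic, key=bic.get))  # lowest BIC wins
```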

Another approach is to use an evolutionary algorithm whose individuals have chromosomes of different lengths. Each individual is a candidate solution carrying the centroid coordinates. The number of centroids and their coordinates are evolved to reach the solution that yields the best clustering evaluation score.
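
A toy sketch of this idea follows; the mutation operators, the silhouette fitness, and all parameters are illustrative choices, not a specific published algorithm:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

def assign(C):
    # Label each point by its nearest centroid.
    return np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)

def fitness(C):
    labels = assign(C)
    if len(np.unique(labels)) < 2:  # silhouette needs >= 2 occupied clusters
        return -1.0
    return silhouette_score(X, labels)

def mutate(C):
    op = rng.choice(["add", "drop", "jitter"])
    if op == "add":                  # grow the chromosome
        return np.vstack([C, X[rng.integers(len(X))]])
    if op == "drop" and len(C) > 2:  # shrink it
        return np.delete(C, rng.integers(len(C)), axis=0)
    return C + rng.normal(scale=0.3, size=C.shape)  # perturb coordinates

# (mu + lambda)-style loop: a population of centroid sets of varying size.
pop = [X[rng.choice(len(X), size=k, replace=False)] for k in (2, 3, 4, 5)]
for _ in range(30):
    pop += [mutate(C) for C in pop]
    pop = sorted(pop, key=fitness, reverse=True)[:4]

print(len(pop[0]))  # evolved number of centroids
```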

Rather than use some statistical criterion (although those may be useful), I would base it on utility for the problem at hand.

Look at various solutions, then judge which one best answers your research question or business need, or which one gives you the most insight into the data. So, in one situation, you might choose k based on some other characteristic of the observations. In another, you might have a particular hypothesis to test or question to answer. And in yet another, you may be looking for hypotheses or research questions.