...How can you tell when this happens ? Here's a very elegant technique, first proposed by Kenneth Rose in his work on deterministic annealing. In the high-T regime, where every point is in the same cluster, the cluster center location can be computed by solving a simple convex optimization. As the process evolves, the positive definite matrix defining the optimization starts losing its positive definiteness, till some point when one of its eigenvalues goes to zero. This point can be computed analytically, and yields a specific temperature value at which the first clusters start to emerge.Kenneth Rose has some nice examples illustrating this behaviour. As time goes one, more and more phase transitions start to appear, as more and more clusters start to emerge. What you end up with is a hierarchy of clusterings that end with the trivial clustering where all data lie in separate clusters.This idea has been developed further with the information bottleneck method, which replaces both the distortion and entropy terms by terms involving the mutual information of the assignments. The free energy paradigm works the same way, although now the phase transition points can't be computed analytically (I'll have more to say about the information bottleneck method)What's entirely weird is that the phase transitions still happen (and I want to stress this point), data sets will split up into different numbers of clusters at the same transition point ! We wrote a paper a few years ago that proposes a particular "temperature" to look for phase transitions in, and lo and behold, we were able to recover the "correct" number of clusters from planted data sets by watching the data split at this point. I'm still not sure why this happens, but it was quite interesting, and yielded one way of identifying "natural clusters" in the data....

which leads me to Lenka Zdeborova's presentation at Paris Machine Learning Meetup #8 ( How hard is it to find a needle in a haystack? ) who tells a similar story of a connection between statistical physics and clustering [4,5,6] in finding communities within large sparse networks. In particular, she and co-authors also describe, the non-backtracking matrix a better choice for the transformation of the data/graph representation than the Laplacian or the adjacency matrix [5].

Why should we care ? because as Adam Coates and Andrew Y. Ng stated recently in [7], these clustering algorithms, much like dictionary learning in the early deep neural network papers are the first steps for feeding data into some deep learning algorithms. From [7]:

...Recently, it has been found that K-means clustering can be used as a fast alternative training method. The main advantage of this approach is that it is very fast and easily implemented at large scale. On the other hand, employing this method in practice is not completely trivial: K-means has several limitations, and care must be taken to combine the right ingredients to get the system to work well...

...I still see plenty of questions on Metaoptimize, on the Deep Learning G+ community, on the Kaggle forums and on this subreddit, from people who seem to be unaware of this "paradigm shift". They ask about training autoencoders and RBMs for unsupervised feature learning, when it is often clear that a purely supervised approach would probably work at least as well for their problem (and is conceptually much simpler and easier to understand).

I think this is because they read papers from 2010-2012 advertising unsupervised pre-training as the holy grail of deep learning. That was only 2-4 years ago, so they can't really be blamed for assuming that this approach still represents the state of the art.

Of course unsupervised pre-training still has its applications, but for many problems it has been obsoleted. So I don't think it's a bad thing to draw some attention to this fact

so quite clearly supervised learning applications in human speech or images won't need much clustering like K-means/subspace clustering but it will remain key for real exploration like understanding Dolphin communication or learning heterogenous data among other. And as the map makers did in compressive sensing, those phase transitions, when they are identified, will be key to understand these data. Given a dataset, if these algorithms fail (i.e are not robust) in these unsupervised settings, then you are either not catching the right data (and need to expand them by making the network less sparse) or you are not using the right modeling tools (metric, ...) .

PS: All these phase transitions results appear for spectral clustering and none seem to have been found in the subspace clustering arena .... yet.

PPS: I'll add this blog entry to the phase transition of the Spectral Clustering section in the Advanced Matrix Factorization Jungle page. I just split the spectral clustering section from the subspace clustering one and came up with this division:

Spectral Clustering, A = DX with unknown D and X, solve for sparse X and X_i = 0 or 1

2 comments:

To avoid any misunderstandings, I should perhaps clarify that I was talking specifically about unsupervised pre-training as a method of initialising the parameters of supervised models. K-means has never been particularly popular for that, to my knowledge (mainly autoencoders and RBMs).

When it comes to unsupervised feature learning I'm a fan of the (spherical) k-means approach as Adam Coates describes it. Its simplicity and speed of learning is unmatched. I've used it myself for the whale detection challenge on Kaggle, among other things.