Fast, High Quality, Single Pass k-means Clustering

Ted Dunning ( MapR Technology )

5Oct

16:00 5th October 2012 ( week 0, Michaelmas Term 2012 )

Lecture Theatre A

I will describe an implementation of recent results that provide high quality k-means clustering at very high speed. For
well clusterable data, this algorithm provides good bounds on quality, but practically speaking, it makes clustering practical
in many applications by providing roughly 3 orders of magnitude speedup relative to the standard algorithm based on Lloyd's
initial efforts. In addition, the algorithm is highly amenable to implementation using map-reduce and shows essentially linear
speedup. Just as significant, this new algorithm allows clustering with a very large number of clusters which makes it practical
to use as a feature extraction algorithm or set up for a nearest neighbor search.

I will provide an outline of how and why this class of algorithms work so well, and demonstrate the practical aspects of
the implementation. I will also describe how this algorithm can be integrated into an industrial scale nearest neighbor implementation.

Speaker bio

Ted Dunning has been involved with a number of startups with the latest being MapR Technologies where he is Chief Application
Architect working on advanced Hadoop-related technologies. He is also a PMC member for the Apache Zookeeper and Mahout projects.
Opinionated about software and data-mining and passionate about open source, he is an active participant of Hadoop and related
communities and loves helping projects get going with new technologies.

Prior to his work in the startup world, he worked
as a research at the Computing Research Laboratory in New Mexico under Yorick Wilks. His published work includes advances
in statistical natural language processing, genomics and machine translation. One paper, in particular, on the statistics
of rare coincidences has been widely cited academically and used in practical systems.