May 6, 2012

it has been quite a long time since my last blog post. Thanks to my work, to keep me busy all day and don't let me research on cool new things. However, over the few holidays and weekends over the last weeks I came across a very interesting algorithm called DBSCAN.
It is abbreviated for "density-based spatial clustering of applications with noise", it is a unsupervised clustering algorithm just like k-means, besides that it is much smarter in many aspects.
Another objective I'd like to solve is the parallelization of this algorithm. I've seen just some ancient papers and what buffels me is that I've seen no implementation in Mahout (for MapReduce) or other distributed frameworks.

As you may know, I'm working for Apache Hama. It is a framework for distributed computing with the BSP (bulk synchronous parallel) model. I always searching for new algorithms that could fit into the model of BSP computing, e.G. graph algorithms of all sorts, strongly iterative algorithms, real-time algorithms.
And I think that DBSCAN also fits into the BSP model, I tell you why a bit later in this post.
First off, just a little introduction of the DBSCAN algorithm itself...

The algorithm

The algorithm is very easy to understand. Actually you have a bunch of points (or vectors in higher dimensionalities) as input, then you have to parameters and some fancy output.
The two parameters are called "epsilon" and "minpoints", epsilon is the minimum distance between two vectors to connect two points strongly and minpoints is the number of points that are at least needed to build a cluster out of strongly connected vectors.
Now you are going through the graph, point by point, marking visited vectors and adding points to a cluster while they are not violating the rules defined by epsilon and minpoints.

You can read on wikipedia about how the sequential version works in detail, however I am going to propose a much more easier to understand version of the algorithm.

Distributed algorithm steps

Instead of defining a big distributed algorithm that translates the sequential version into some distributed programming model, I have assembled three main steps to get the same result as the sequential version.
However each of these steps are strongly parallelizable in every major programming model (at least I know how it works in MapReduce, BSP and MPI).

Here are the three steps:

compute a distance matrix between the vectors with a given distance measurement