Tuesday, June 17, 2014

Suppose I give you this dataset of 2D points (with the added twist that the generating code has been obfuscated, to avoid making the answer too obvious), and ask you to group them into \(k\) meaningful clusters. Is there a way to do it without knowing (or guessing) \(k\) in advance, as \(k\)-means would require? Yes, with the DBSCAN algorithm, but at the expense of introducing two additional parameters! In this post, I try to show why it's not as bad as it may sound by first describing the intuition behind DBSCAN, fiddling next with the problem of parameter estimation and finally showing an equivalent way of looking at it, in terms of graph operations.

In [1]:

import base64

random.seed(0)

# this obfuscated code just defines a 'points' dataset, as a numpy array
obfuscated_code = ('cG9pbnRzMCA9IHJhbmRvbS5tdWx0aXZhcmlhdGVfbm9ybWFs'
                   'KFswLCAwXSwgZXllKDIpLCAzMDApCnBvaW50czEgPSByYW5k'
                   'b20ubXVsdGl2YXJpYXRlX25vcm1hbChbNSwgNV0sIGV5ZSgy'
                   'KSArIDIsIDMwMCkKcG9pbnRzMiA9IHJhbmRvbS5tdWx0aXZh'
                   'cmlhdGVfbm9ybWFsKFs4LCAyXSwgZXllKDIpLCAzMDApCnBv'
                   'aW50cyA9IHZzdGFjaygocG9pbnRzMCwgcG9pbnRzMSwgcG9p'
                   'bnRzMikp')
eval(compile(base64.b64decode(obfuscated_code), '<string>', 'exec'))
# 'points' is defined at this point
random.shuffle(points)
scatter(*zip(*points), color=[.5] * 3, s=5);

Like \(k\)-means, DBSCAN is quite a simple idea. But to be honest, I find the formal introduction (from the Wikipedia article) slightly confusing, whereas the basic idea is really simple: a point is included in a cluster if either:

1. It's surrounded by at least a certain number of neighbors inside a certain radius.

2. It's within a certain distance of a point as defined in 1.

Let's call the first type core points, and the second reachable points. The remaining points (if any) are noise, and we don't care about them (contrary to \(k\)-means, DBSCAN can easily result in uncategorized points, which can be meaningful for certain applications).

So first things first, if we are to implement an algorithm based on the notions of distance and neighborhoods, we'd better have at our disposal an efficient way of retrieving the nearest neighbors of a point in terms of a certain distance metric. The naive way of doing it (computing all pairwise distances) would be \(O(n^2)\), but it turns out that for relatively low-dimensional (i.e. < 20-ish) Euclidean spaces, a k-d tree is a very reasonable and efficient way of doing it. So here's how to use a scipy.spatial.cKDTree to retrieve all the neighbors inside a radius of 1 of the first dataset point:
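In sketch form, it looks like this (a small Gaussian blob stands in for the post's points array, whose generating code is obfuscated above):

```python
import numpy as np
from scipy.spatial import cKDTree

# stand-in for the post's 'points' dataset
rng = np.random.RandomState(0)
points = rng.multivariate_normal([0, 0], np.eye(2), 300)

kdtree = cKDTree(points)
# indices of all points within radius 1 of the first point
# (note that the point itself is included, at distance 0)
neighbors = kdtree.query_ball_point(points[0], 1)
```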

We now have everything we need to define and implement the DBSCAN algorithm. Although Python is itself stylistically very close to pseudocode, the essence of the algorithm can be summarized in words as: for every unvisited point with enough neighbors, start a cluster by adding them all in, and then, for each, recursively expand the cluster if they also have enough neighbors, and stop expanding otherwise. Although I used the word "recursively", you'll notice that our implementation is actually not:
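A minimal sketch of such a non-recursive implementation (this is an assumed reconstruction of the missing code cell, expanding each cluster with an explicit queue instead of recursion, and using cKDTree for the neighborhood queries):

```python
import numpy as np
from scipy.spatial import cKDTree

def dbscan(points, eps, min_pts):
    """Return an array of cluster labels; -1 means noise."""
    kdtree = cKDTree(points)
    n = len(points)
    labels = np.full(n, -1)        # -1: noise / not yet assigned
    visited = np.zeros(n, bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = kdtree.query_ball_point(points[i], eps)
        if len(neighbors) < min_pts:
            continue               # not a core point (may still be claimed later)
        # start a new cluster and expand it with a FIFO queue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # reachable point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = kdtree.query_ball_point(points[j], eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster += 1
    return labels
```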

Now that we have an algorithm, you may ask a very legitimate question: how do we choose eps (the neighborhood radius) and min_pts (the minimal number of points inside that radius to start/expand a cluster)? If we pick an arbitrary value for min_pts (say 10), we could try looking at the distribution of distances separating every point from its 10 nearest neighbors:
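That heuristic can be sketched like this (again with a stand-in dataset; note that cKDTree.query is asked for k=11 neighbors, because the nearest "neighbor" of each point is itself):

```python
import numpy as np
from scipy.spatial import cKDTree

# stand-in for the post's 'points' dataset
rng = np.random.RandomState(0)
points = rng.multivariate_normal([0, 0], np.eye(2), 300)

kdtree = cKDTree(points)
# distances from each point to its 11 nearest neighbors (incl. itself)
dists, _ = kdtree.query(points, k=11)
# distance to the 10th real neighbor, sorted: an elbow in this
# curve suggests a reasonable value for eps
sorted_dists = np.sort(dists[:, -1])
```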

All right, much better (also note the graphical distinction between core and reachable points, the latter being a bit paler)! And it wasn't that hard to come up with reasonable parameter values... but are we really convinced yet that DBSCAN is such an improvement over \(k\)-means? Picking a value for min_pts was still pretty much arbitrary... To continue our exploration, let's create a dataset which would be very hard (in fact impossible) for \(k\)-means to make sense of:
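The original dataset isn't reproduced here, but as a hypothetical example of the kind of shape meant, two noisy concentric rings form non-convex clusters that \(k\)-means (which carves space into convex cells around centroids) cannot recover, while DBSCAN can:

```python
import numpy as np

rng = np.random.RandomState(0)

def noisy_ring(radius, n, noise=0.1):
    """Sample n points around a circle of the given radius."""
    angles = rng.uniform(0, 2 * np.pi, n)
    ring = np.c_[np.cos(angles), np.sin(angles)] * radius
    return ring + rng.normal(0, noise, (n, 2))

# two concentric rings: impossible for k-means, easy for DBSCAN
points = np.vstack([noisy_ring(1, 300), noisy_ring(4, 300)])
```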

Very reasonable and interesting results! To conclude, let's examine an alternative way of looking at the DBSCAN algorithm, from the perspective of graph theory. It turns out that DBSCAN is equivalent to finding the connected components of a graph defined as: "for every node, draw an edge to every neighbor inside a certain radius (eps), if they are in sufficient number (min_pts)". This is very easy to implement using the networkx library, and yields the expected results:
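A sketch of that graph construction (a stand-in two-blob dataset is assumed; only core points get edges, so noise points end up as isolated nodes, which we filter out of the components):

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

# stand-in dataset: two well-separated Gaussian blobs
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 0.2, (50, 2)),
                    rng.normal(5, 0.2, (50, 2))])

eps, min_pts = 1.0, 5
kdtree = cKDTree(points)
graph = nx.Graph()
graph.add_nodes_from(range(len(points)))
for i, p in enumerate(points):
    neighbors = kdtree.query_ball_point(p, eps)
    if len(neighbors) >= min_pts:          # only core points draw edges
        graph.add_edges_from((i, j) for j in neighbors if j != i)

# each non-trivial connected component is a cluster; isolated nodes are noise
clusters = [c for c in nx.connected_components(graph) if len(c) > 1]
```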

Although interesting, this equivalence of course does not yield a more efficient algorithm, because it still relies on the neighborhood-finding helper (in our case a k-d tree), which dominates the overall complexity at \(O(n \log n)\).