It just so happens that I have two different projects that need cluster analysis, applied in two different ways: one is used on maps, where a large number of items needs to be displayed quickly, while the other involves finding clusters of news items, where the distance between them is determined by their content. The most widely used clustering algorithm, and the first one a web search turns up, is k-means clustering. Its purpose is to categorize a list of items into k clusters, hence the name. Setting aside how useful the algorithm is for my purposes, the biggest problem I see is the complexity: in its general form it is at least O(n²), and most of the time a lot higher. The net abounds with scientific papers investigating the complexity of k-means and suggesting improvements, but they are highly mathematical and I didn't have the time to investigate further. So I just built my own algorithm. It is clearly fuzzy and imperfect, and it may even be wrong in some situations, but at least it is fast. I will certainly investigate this area more, maybe even try to understand the math behind it and analyse my results based on that research. When I do, I will update this post or write others. Until then, let me present my work so far.

The first problem I had was, as I said, complexity. For one million points on the map, any algorithm that takes into account the distance between every pair of items will have to make on the order of one trillion comparisons. So my solution was to limit the number of items by grouping them in a grid:

Step 1: find the min and max on each dimension (that means going through the entire item collection once, or knowing the map bounds beforehand).
Step 2: determine a number of cells that is somewhat larger than the number of clusters I need in the end (that's a decision I have to make; no algorithmic complexity).
Example: for my map I have only two dimensions, X and Y. I want to display at most 1000 clusters. Therefore I find the minimum and maximum X and Y and then split each dimension into 100 slots. That means I would cluster the items I have into 10000 cells.
Step 3: for each item, find its cell based on X and Y and add the item to the cell. This is done by simple division: (X-minX)/(maxX-minX) gives the relative position along the dimension, which is then scaled by the number of slots (again, that means going through the collection once).
Step 4: find the smallest cell (the complexity is now reduced to working with cells).
Step 5: find its smallest neighbour (the complexity of this depends on the implementation).
Step 6: merge the two cells.

While the number of cells is larger than the desired number of clusters, repeat from Step 4. In the end, the algorithm is O(n+p*log(p)), I guess, where p is the number of cells chosen at Step 2.
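As a rough illustration of Steps 1 to 3, here is a minimal sketch that finds the bounds and counts how many items land in each grid cell. The names GridBucketing, SlotIndex and CountPerCell are made up for this example and are not taken from the downloadable code. The clamping of the maximum value into the last slot is also an assumption about how the edge case could be handled.

```csharp
using System;
using System.Collections.Generic;

public static class GridBucketing
{
    // Maps a coordinate to a slot index in [0, slots - 1] (hypothetical helper).
    public static int SlotIndex(double value, double min, double max, int slots)
    {
        if (max <= min) return 0; // degenerate range: everything falls in one slot
        int index = (int)((value - min) / (max - min) * slots);
        // value == max would give index == slots, so clamp it into the last slot
        return Math.Min(index, slots - 1);
    }

    // Steps 1-3: find the bounds, then count the items that fall in each cell.
    public static Dictionary<(int X, int Y), int> CountPerCell(
        IReadOnlyList<(double X, double Y)> points, int slots)
    {
        var counts = new Dictionary<(int X, int Y), int>();
        if (points.Count == 0) return counts;

        // Step 1: one pass over the items to find min and max on each dimension
        double minX = double.MaxValue, maxX = double.MinValue;
        double minY = double.MaxValue, maxY = double.MinValue;
        foreach (var p in points)
        {
            minX = Math.Min(minX, p.X); maxX = Math.Max(maxX, p.X);
            minY = Math.Min(minY, p.Y); maxY = Math.Max(maxY, p.Y);
        }

        // Step 3: a second pass that assigns every item to its cell
        foreach (var p in points)
        {
            var key = (SlotIndex(p.X, minX, maxX, slots), SlotIndex(p.Y, minY, maxY, slots));
            counts.TryGetValue(key, out int current); // current is 0 for a new cell
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

Only occupied cells end up in the dictionary, so a sparse map with 10000 possible cells may produce far fewer entries to merge.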

Optimizations are the next issue.

How does one find the neighbours of a cell? In Step 3 we also create a list of neighbours for each new cluster by looking for clusters at the coordinates immediately above, below, left or right. When we merge two clusters, we get a cluster that is a neighbour to all the neighbours of the merged clusters.
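The neighbour bookkeeping could look roughly like the sketch below. It assumes that cells are identified by integer (X, Y) grid keys and that the set of occupied cells is known. CellNeighbours, NeighboursOf and MergedNeighbours are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CellNeighbours
{
    // The four grid positions directly left, right, above and below a cell.
    private static readonly (int Dx, int Dy)[] Offsets = { (-1, 0), (1, 0), (0, -1), (0, 1) };

    // Returns the keys of the occupied cells adjacent to the given cell.
    public static List<(int X, int Y)> NeighboursOf(
        (int X, int Y) cell, ISet<(int X, int Y)> occupied)
    {
        return Offsets
            .Select(o => (X: cell.X + o.Dx, Y: cell.Y + o.Dy))
            .Where(k => occupied.Contains(k))
            .ToList();
    }

    // After a merge, the neighbours are the union of both neighbour lists,
    // minus the keys that now belong to the merged cluster itself.
    public static List<(int X, int Y)> MergedNeighbours(
        IEnumerable<(int X, int Y)> first,
        IEnumerable<(int X, int Y)> second,
        IEnumerable<(int X, int Y)> ownKeys)
    {
        return first.Union(second).Except(ownKeys).ToList();
    }
}
```

Removing the cluster's own keys matters because two adjacent cells list each other as neighbours before they are merged.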

How does one quickly get the cluster at a certain position? We create a dictionary that has the coordinates as the key. What about when we merge two clusters? Then the new cluster will be accessible by any of the original clusters' keys (which implies that each cluster also holds a list of its keys).
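A minimal sketch of that lookup follows. The clusterDict name and the Count field follow the fragments of the original code. The Cluster class shape, its Keys list and the Merge helper are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified, hypothetical cluster: only the fields needed for the lookup.
public class Cluster
{
    public List<(int X, int Y)> Keys = new List<(int X, int Y)>();
    public int Count;
}

public static class ClusterLookup
{
    // Merges two clusters and repoints every key of both at the merged result.
    public static Cluster Merge(
        Dictionary<(int X, int Y), Cluster> clusterDict, Cluster a, Cluster b)
    {
        var merged = new Cluster
        {
            Keys = a.Keys.Concat(b.Keys).ToList(),
            Count = a.Count + b.Count
        };
        foreach (var key in merged.Keys)
        {
            // every original grid position now resolves to the merged cluster
            clusterDict[key] = merged;
        }
        return merged;
    }
}
```

With this in place, looking up any grid position that belonged to either original cluster returns the same merged object, so neighbour lists that store keys stay valid after a merge.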

How does one find the smallest cell in the cluster list? After Step 3 we sort the cluster list by the number of items each cluster contains, and each time we perform a merge we find the first cluster larger than the merged result and insert the result at that location, so that the list always remains sorted.

How do we easily find the first item larger than a given item? By employing a divide et impera (divide and conquer) method: split the list in two at each step and choose which half to look into based on the item count of the cluster at the middle position.
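The search described above is essentially a binary search for an insertion point. Here is a sketch under simplifying assumptions: the list holds plain item counts instead of cluster objects, and the names SortedInsert, FirstLargerIndex and InsertSorted are invented for the example.

```csharp
using System.Collections.Generic;

public static class SortedInsert
{
    // Returns the index of the first element larger than `count`,
    // or sortedCounts.Count if there is none. The list must be sorted ascending.
    public static int FirstLargerIndex(IReadOnlyList<int> sortedCounts, int count)
    {
        int low = 0, high = sortedCounts.Count;
        while (low < high)
        {
            int mid = low + (high - low) / 2;
            if (sortedCounts[mid] <= count)
                low = mid + 1;  // the answer lies to the right of mid
            else
                high = mid;     // mid itself may be the answer
        }
        return low;
    }

    // Inserts a value so that the list stays sorted.
    public static void InsertSorted(List<int> sortedCounts, int count)
    {
        sortedCounts.Insert(FirstLargerIndex(sortedCounts, count), count);
    }
}
```

The search itself takes O(log p) steps, but List.Insert still shifts the elements after the insertion point, which is the cost the update below refers to.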

Before you use the code, note that there are specific scenarios where this type of clustering will look a bit off, like items arranged in a long line or along the outline of an empty polygon (the cluster will appear to be in its center). But I needed speed and I got it.
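To see why the empty polygon case looks off, here is a small sketch. It assumes the displayed position of a cluster is the mean of its items (SumX / Count, SumY / Count), which is what the SumX and SumY fields in the code suggest. ClusterPosition and Centre are hypothetical names.

```csharp
using System.Collections.Generic;

public static class ClusterPosition
{
    // Mean position of a non-empty list of items.
    public static (double X, double Y) Centre(IReadOnlyList<(double X, double Y)> items)
    {
        double sumX = 0, sumY = 0;
        foreach (var p in items)
        {
            sumX += p.X;
            sumY += p.Y;
        }
        return (sumX / items.Count, sumY / items.Count);
    }
}
```

For four items on the corners of a square, (0,0), (10,0), (10,10) and (0,10), the computed position is (5,5), a spot where there is no item at all.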

Enjoy!

Update: The performance of removing items from or adding items to a List is very poor, so I created a LinkedList version that seems to be even faster. Here it is. The old List version is at the end.

        // the result holds the count of both clusters
        //Items = cluster1.Items.Union(cluster2.Items).ToList(),
        Count = cluster1.Count + cluster2.Count,
        // the neighbors are in the union of their neighbours that does not contain themselves
        Neighbors = cluster1.Neighbors.Union(cluster2.Neighbors)
            .Distinct()
            .Except(keys)
            .ToList(),
        // compute the sums for the final position
        SumX = cluster1.SumX + cluster2.SumX,
        SumY = cluster1.SumY + cluster2.SumY
    };
    foreach (var key in keys)
    {
        clusterDict[key] = cluster;
    }
    // efficiently remove first cluster since we know its index
    clusterArr.RemoveAt(cluster1Index);
    // remove this cluster from the list (perhaps some sort of caching can speed this up, too)
    clusterArr.Remove(cluster2);

    // find the index of the cluster before which we want to insert our merged result
    // the first comment is the naive implementation
    // the current implementation is a divide-et-impera quick find in a sorted list