The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code, I figure I owe it to him to show my results.

In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not exactly a revelation that Filter and FilterTest go together. I'd be tempted to play with it more, but with only about two months and 250 commits in the Apache repo, there's not really all that much data there. (Cassandra's first two years were in an internal Facebook repository.) Working with data that exists as a side effect of natural activity is fascinating.

1 comment:

The noisy large groups are indeed a problem with this technique; it seems to be a limitation of the clustering algorithm I'm using.

I have a different algorithm based on it which seems to produce better results by allowing overlapping clusters, but unfortunately I can't release it yet. (It's proprietary code.) We're hoping to release it as open source at some point, but for the moment it's a no-go. Maybe it will work better for your code base when we do. :-)