There are two types of data scientists – those who cluster and those who don’t

Today, we are lucky to have Jesse Marshall as a guest blogger on data analysis. Jesse is a PhD student at Stanford, working on analyzing the collective behavior of hundreds of neurons as they process information in the living brain.

Humans have trouble accepting randomness. We search for patterns, trends and correlations in all aspects of life, and we do it for good reason. Identifying patterns lets us learn from our mistakes and predict the future. Clustering algorithms are exploratory data analysis tools used to categorize a set of observations into a few discrete classes, or, clusters, that share common features. Cluster algorithms only need four ingredients: a set of objects,a list of quantitative or qualitative descriptors of these objects, and a metric that you can use to compare these objects, based on their features. Then you try to divide the data into a specified number of clusters and see what happens.

Say for example that you ran a website, and wanted to target ads to different users. For each user (the objects) you have a large database of web browsing history that consists of which web pages they visited and how long they spent on each site (the features). You can compare any two users based on the number of webpages that they have both visited, which is similar to the ‘Jaccard similarity‘ metric.

Targeting advertising using machine learning can reduce the number of stupid ads, thus making the world a better place. Source : Skeptoid.com

Now comes the tricky part: you have to pick a number of clusters to partition your users into, based on how many types of users you think are present. There’s no right answer for this question, which is a general problem for clustering approaches, but we’ll talk a little a bit at the end about how you can assess the performance of a clustering algorithm.

We’ll start clustering using k-means, which attempts to divide a set of objects into k clusters by iteratively reducing the sum of squares distances of points within clusters. You can read all about the algorithm used to achieve this on Wikipedia. We’ll just start exploring k-means using a classic toy example: a Gaussian mixture model. First we need to generate the data. We’ll start with well separated clusters to show how clustering works when everything is easy. Let’s start with three clusters, with means separated by 5 units and standard deviations of 1:

Let’s see how well our clustering worked by taking the centroids of the clusters, matching them up to our start points, and comparing the true positive rate:

%first identify which of the original clusters corresponds to those located by the algorithm
cluster_lookup = zeros(1,3);
for ll = 1:3
[~,cluster_lookup(ll)] = min(sum(bsxfun(@minus,cluster_means,cluster_centroids(ll,:)).^2,2));
end
true_clusters = cat(1,ones(100,1),2.*ones(100,1),3.*ones(100,1));
% now assess the classification performance
for jj = 1:3
clusters_correct = numel(intersect(find(true_clusters==cluster_lookup(jj)),find(cluster_ids==jj)));
fprintf('number of clusters correct in cluster %d: \t %d \n',jj,clusters_correct);
end

When I ran it, there was a better than 99 percent true positive rate!

Now suppose the clusters were much closer together, spaced by 1 unit, with a standard deviation of 1:

Now as you might expect, the true positive rate is much worse, and hovers closer to 50%.

Of course in real life, you don’t have the privilege of knowing how many clusters are present in the original data — you need some sort of internal evaluation procedure. Intuitively, we think of a dataset as ‘clustered’ when the amount of scatter within a cluster is much less than the distance between clusters . Makes sense, right? Now like in everything with clustering, there are a few different ways to realize this intuition into a formal mathematical procedure. For example the amount of scatter in a cluster can be quantified with a few different metrics, for example as the overall diameter of the cluster (the maximum pairwise distance) or the mean euclidean distance between points in the cluster . This is all formalized in the Davies-BouldinIndex, :

This has a convenient implementation in Matlab’s evalcluster function, but we’ll do a quick implementation below, as kmeans automatically returns most of the quantities of interest.

Here is what the index looks like for our well separated and poorly separated examples:

As you would expect, for the well separated example you see a global minima in the index at N=3, while for the poorly separated clusters, no global minima is found.

Now let’s zoom out a little bit. I’ve given you a sense of the mechanics of clustering: how to cluster data and evaluate the ‘goodness of fit’ of your clustering algorithm. Now let’s think of where you would want to use clustering algorithms. In science, arguably the biggest application domain of clustering algorithms is in genetics, and especially in studies of gene expression. In these experiments, scientists are able to measure the simultaneous expression of thousands of genes under several different conditions, and ask questions about which genes are co-regulated when and where (e.g. Here and Here). Its an ideal scenario for clustering: you have a ton of data with lots of features, and you don’t really know which features will be similar to one another. Cluster analysis can give you a starting point for identifying groups of genes that are up- and down-regulated by different perturbations and pave the way for the next step of analysis or experiments.

In neuroscience, the types of problems you can address are similar — clustering gives you a bird’s-eye view on what types of responses are present in the data. Say you have lots of time series of neural activity from different voxels in fMRI data, different cells in calcium imaging data, or different electrodes in EEG data. Then you can start to classify your data by computing the cross-correlation between the different time series and seeing if the responses are clustered.

For example, below I simulated recordings of noisy single unit data from 200 cells during a series of 100 single trials, and averaged the neuronal responses. Take a look at the raster plot on the left. We see enhancement of neuronal firing rates between 0 and 3 seconds after the initiation of the trial. Just staring at the mean raster plot we aren’t sure how many types of neuronal responses there are in the population. To get a gross idea of how many classes of responses there are, you can cluster the mean temporal waveforms of the cells using the Euclidean distance. On the right, you can see the data after cells are sorted by their cluster membership. You can see that there are three functional classes of neurons in this dataset. After this type of analysis, you can start to understand how members of these functional classes are recruited on single trials, and whether different subsets of these cells share higher covariance (as in here).

Want to know more? There are other clustering algorithms out there: I’d start with hierarchical clustering, which can allow you to build trees relating elements in a cluster from the bottom up or top down. As always, Matlab makes it easy!

Code of "There are two types of data scientists – those who cluster and those who don’t" 8.13 KB