Just another WordPress.com weblog

Introduction To Mean Shift Algorithm

Its been quite some time since I wrote a Data Mining post . Today, I intend to post on Mean Shift – a really cool but not very well known algorithm. The basic idea is quite simple but the results are amazing. It was invented long back in 1975 but was not widely used till two papers applied the algorithm to Computer Vision.

I learned this algorithm in my Advanced Data Mining course and I wrote the lecture notes on it. So here I am trying to convert my lecture notes to a post. I have tried to simplify it – but this post is quite involved than the other posts.

It is quite sad that there exists no good post on such a good algorithm. While writing my lecture notes, I struggled a lot for good resources 🙂 . The 3 “classic" papers on Mean Shift are quite hard to understand. Most of the other resources are usually from Computer Vision courses where Mean Shift is taught lightly as yet another technique for vision tasks (like segmentation) and contains only the main intuition and the formulas.

As a disclaimer, there might be errors in my exposition – so if you find anything wrong please let me know and I will fix it. You can always check out the reference for more details. I have not included any graphics in it but you can check the ppt given in the references for an animation of Mean Shift.

Introduction

Mean Shift is a powerful and versatile non parametric iterative algorithm that can be used for lot of purposes like finding modes, clustering etc. Mean Shift was introduced in Fukunaga and Hostetler [1] and has been extended to be applicable in other fields like Computer Vision.This document will provide a discussion of Mean Shift , prove its convergence and slightly discuss its important applications.

Intuitive Idea of Mean Shift

This section provides an intuitive idea of Mean shift and the later sections will expand the idea. Mean shift considers feature space as a empirical probability density function. If the input is a set of points then Mean shift considers them as sampled from the underlying probability density function. If dense regions (or clusters) are present in the feature space , then they correspond to the mode (or local maxima) of the probability density function. We can also identify clusters associated with the given mode using Mean Shift.

For each data point, Mean shift associates it with the nearby peak of the dataset’s probability density function. For each data point, Mean shift defines a window around it and computes the mean of the data point . Then it shifts the center of the window to the mean and repeats the algorithm till it converges. After each iteration, we can consider that the window shifts to a more denser region of the dataset.

At the high level, we can specify Mean Shift as follows : 1. Fix a window around each data point. 2. Compute the mean of data within the window. 3. Shift the window to the mean and repeat till convergence.

Preliminaries

Kernels :

A kernel is a function that satisfies the following requirements :

1.

2.

Some examples of kernels include :

1. Rectangular

2. Gaussian

3. Epanechnikov

Kernel Density Estimation

Kernel density estimation is a non parametric way to estimate the density function of a random variable. This is usually called as the Parzen window technique. Given a kernel K, bandwidth parameter h , Kernel density estimator for a given set of d-dimensional points is

Gradient Ascent Nature of Mean Shift

Mean shift can be considered to based on Gradient ascent on the density contour. The generic formula for gradient ascent is ,

Applying it to kernel density estimator,

Setting it to 0 we get,

Finally , we get

Mean Shift

As explained above, Mean shift treats the points the feature space as an probability density function . Dense regions in feature space corresponds to local maxima or modes. So for each data point, we perform gradient ascent on the local estimated density until convergence. The stationary points obtained via gradient ascent represent the modes of the density function. All points associated with the same stationary point belong to the same cluster.

Assuming , we have

The quantity is called as the mean shift. So mean shift procedure can be summarized as : For each point

1. Compute mean shift vector

2. Move the density estimation window by

3. Repeat till convergence

Using a Gaussian kernel as an example,

1. 2.

Proof Of Convergence

Using the kernel profile,

To prove the convergence , we have to prove that

But since the kernel is a convex function we have ,

Using it ,

Thus we have proven that the sequence is convergent. The second part of the proof in [2] which tries to prove the sequence is convergent is wrong.

Improvements to Classic Mean Shift Algorithm

The classic mean shift algorithm is time intensive. The time complexity of it is given by where is the number of iterations and is the number of data points in the data set. Many improvements have been made to the mean shift algorithm to make it converge faster.

One of them is the adaptive Mean Shift where you let the bandwidth parameter vary for each data point. Here, the parameter is calculated using kNN algorithm. If is the k-nearest neighbor of then the bandwidth is calculated as

Here we use or norm to find the bandwidth.

An alternate way to speed up convergence is to alter the data points during the course of Mean Shift. Again using a Gaussian kernel as an example,

1. 2. 3.

Other Issues

1. Even though mean shift is a non parametric algorithm , it does require the bandwidth parameter h to be tuned. We can use kNN to find out the bandwidth. The choice of bandwidth in influences convergence rate and the number of clusters. 2. Choice of bandwidth parameter h is critical. A large h might result in incorrect clustering and might merge distinct clusters. A very small h might result in too many clusters.

3. When using kNN to determining h, the choice of k influences the value of h. For good results, k has to increase when the dimension of the data increases. 4. Mean shift might not work well in higher dimensions. In higher dimensions , the number of local maxima is pretty high and it might converge to a local optima soon. 5. Epanechnikov kernel has a clear cutoff and is optimal in bias-variance tradeoff.

Applications of Mean Shift

Mean shift is a versatile algorithm that has found a lot of practical applications – especially in the computer vision field. In the computer vision, the dimensions are usually low (e.g. the color profile of the image). Hence mean shift is used to perform lot of common tasks in vision.

Clustering

The most important application is using Mean Shift for clustering. The fact that Mean Shift does not make assumptions about the number of clusters or the shape of the cluster makes it ideal for handling clusters of arbitrary shape and number.

Although, Mean Shift is primarily a mode finding algorithm , we can find clusters using it. The stationary points obtained via gradient ascent represent the modes of the density function. All points associated with the same stationary point belong to the same cluster.

An alternate way is to use the concept of Basin of Attraction. Informally, the set of points that converge to the same mode forms the basin of attraction for that mode. All the points in the same basin of attraction are associated with the same cluster. The number of clusters is obtained by the number of modes.

Computer Vision Applications

Mean Shift is used in multiple tasks in Computer Vision like segmentation, tracking, discontinuity preserving smoothing etc. For more details see [2],[8].

Comparison with K-Means

K-Means is one of most popular clustering algorithms. It is simple,fast and efficient. We can compare Mean Shift with K-Means on number of parameters.

One of the most important difference is that K-means makes two broad assumptions – the number of clusters is already known and the clusters are shaped spherically (or elliptically). Mean shift , being a non parametric algorithm, does not assume anything about number of clusters. The number of modes give the number of clusters. Also, since it is based on density estimation, it can handle arbitrarily shaped clusters.

K-means is very sensitive to initializations. A wrong initialization can delay convergence or some times even result in wrong clusters. Mean shift is fairly robust to initializations. Typically, mean shift is run for each point or some times points are selected uniformly from the feature space [2] . Similarly, K-means is sensitive to outliers but Mean Shift is not very sensitive.

K-means is fast and has a time complexity where k is the number of clusters, n is the number of points and T is the number of iterations. Classic mean shift is computationally expensive with a time complexity .

Mean shift is sensitive to the selection of bandwidth, . A small can slow down the convergence. A large can speed up convergence but might merge two modes. But still, there are many techniques to determine reasonably well.

Update [30 Apr 2010] : I did not expect this reasonably technical post to become very popular, yet it did ! Some of the people who read it asked for a sample source code. I did write one in Matlab which randomly generates some points according to several gaussian distribution and the clusters using Mean Shift . It implements both the basic algorithm and also the adaptive algorithm. You can download my Mean Shift code here. Comments are as always welcome !

Hi Saravanan !
thanks for this algorithm ! Ive already tested it successfully with my own vectors as input.
Im tring to change it a little bit in order to make image segmentation (with RGB images if possible).
Do you have any clue to modify the function sqdist ?

Looks like wordpress categorized your comment as a spam and hence the delay in the response. I do not have much exposure to computer vision and that was the reason I did not emphasize that aspect of mean shift – You can check this reference for more details on applying it to images – Dorin Comaniciu and Peter Meer, Mean Shift : A Robust approach towards feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 24 No 5 May 2002.

Replacing -K'(x) with g(x) is just a notational convenience and nothing else.

Wrt the gaussian kernel , the basic idea is that derivative of e^x is again e^x . Again some paper use the L1 norm and many other use L2 norm. In the post , I obtained the derivative of the kernel and then use the L2 norm on the result. That is the reason it has power of two on it.

Hi!
There’s an error in the article. The g(x) function is wrongly definied. It shoud be this way: g(x) = -k'(x), where k is kernel profile (not a kernel K). The profile for the Gaussian kernel is exp(-(1/2) * x)) and then the formula for calculating a next point in the mean shift algorithm using Gaussian kernel is correct. More info in the paper [2].

The way to call it is meanshift(numDimensions,useKNNToGetH) . Dimensions is the feature vector’s dimension and useKNNToGetH is a boolean – If true it uses adaptive mean shift (see above) else uses the classical mean shift.

would you like to let me know where can I find paper or documents containing below
“Mean shift might not work well in higher dimensions. In higher dimensions , the number of local maxima is pretty high and it might converge to a local optima soon.” ?

For any optimization problem, the number of local maxima increases as dimensions increases. There are two scenarios : If the bandwidth is small, the update per iteration is small and hence converges to the nearest local maxima. If the bandwidth is very large it will converge to a single point which probabilistically may not be the global maxima. Basic details can be found in Peter Meer’s paper.

About this line of the code clusterCentroids(closestCentroid,:) = 0.5 * (curDataPoint + clusterCentroids(closestCentroid,:)); I have a question. The cluster centroids change in each data is processsed. But, you take average of the recent data point that is inside of that cluster and previous cluster centroid. However, I think we should find the average of all cluster centroids. If we call the cluster centroids in time by Xi, for example (X1+X2+X3)/3 is not equal to 1/2*[(X1+X2)/2+X3] which is cluster centroid computed in your code. Am I right?

To be honest, I do not remember exactly why I alter centroids in each step. If my memory is correct, it helps to speed up the convergence. Also, if one of the point is slightly far away from the “actual” centroid, this helps to bring it closer to the actual centroid faster. And you are right in that what I do is not the averaging that happens in other clustering algorithms like K-means .

I’ve read your post and is a bit confused.
In the section “Gradient Ascent Nature of Mean Shift” you write the generic formular for gradient ascent:
x_{1}=x_{0}+\eta f'(x_{0})
which is fine…
Then you write, that you apply this to the kernel density estimator. Ok, so you differentiate (gradient) the function \hat{f} and get:
\bigtriangledown{\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K’\left(\frac{x-x_{i}}{h}\right)}

Then you write “Setting it to 0 we get”, but what are you exactly setting to zero, and how do you get the following formula?
{\displaystyle \sum_{i=1}^{n}K’\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x}=\sum_{i=1}^{n}K’\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x_{i}}}

In the paper “Mean Shift Based Clustering in High Dimensions: A Texture Classification Example”, which is last paper published by Peter Meer’s lab on the subject, the formula (5) used an additional weight 1/h_i^(d+2). I do not understand how you can omit this additional weight, especially for the KNN case, in which h_i are different for each x_i and therefore cannot be canceled out.

Could you also comment on the choice of the kernel?
For example, what is the difference between using e^(-dist/(bandwidth^2)) and e^(-1/2*dist/(bandwidth^2)). Are they exactly the same kernel, or actually somewhat different? I understand that because of the division some constants can be canceled out, but I am not sure if that’s the case with different Gaussian kernels.

I understand errors in floating point calculations may lead to different data points being shifted to slightly different positions even if they are meant to converge to the same mode. However, Matlab does arithmetic with double-precision numbers, I do not understand how the final error can be as large as 4. Why 4? why not 0.001, or 1, or 10?

It has almost been an year since I wrote this post – so my memory is a bit hazy.

But I my memory serves right, it is not possible to prove that the sequence \{y_{j}\}_{j=1,2,…} . I found some resource for the same and I also discussed with my prof.Chris Ding about it. I will check if I find any more details of our discussion.

Unfortunately, I do not much about image processing or computer vision and may not be able to help you. I think may be the matlab central has some image processing code ? Or there might be some opencv mean shift code in net ….

Thank you for this beautiful introduction, Saravanan!
If someone want to perform image segmentation, there is an open source software that does it [I discovered mean-shift trough it, and it work very well!].
It is named “Mirone”, based on matlab:

I would like to use your code. but don’t know how to use it as I don’t have good experience in program especially matlab. I copied your codes into m.file but how is it run? I have only a 3D point set so need to cluster using mean shift. can you explain how to use these codes?do I need to get neibours(KNN) for each point? so how can I give input data? I hope, this is a stupid question. sorry..Plese explain each steps to run this code.

IIRC, the code already uses a variable called numDimensions and can handle arbitrary dimensions (although it cannot plot them). The easiest way to run the code is to load the file in Matlab and press the execute button. I used Octave to test my code. Currently it generates some random points and tries to cluster them. The code has two modes. In one, the bandwidth (H) is hard coded. In other mode (which occurs when useKNNToGetH is true), the nearest k neighbours decide the bandwith. I have hard coded K as 100. If you want you can change it in line 6. Hope this helps !

hi saravanan!
thanx very much for your post, i ve got one question though:
when we apply the gradient ascent to the Kernel density estimator, why do you set the differential of the KDE to zero? is it to find a local maxima/minima?
Thank you 🙂

Hello Saravanan,
Thanks for the great explanation, the gradient ascent thing really made my life, but the derivation step has been bothering me, i am not quite getting the idea, if you could provide more details, and when you saying set to 0, do you mean you’re setting the gradient to 0 so that you find local maxima ?
Thanks
AB

Hey,
First of all thanks for the realy great explenation of the mean shift algorithm.
But I am working with openCv, and I would like to use the mean shift algorithm to cluster some datapoints. Yea I know there exists a meanshift Funktion in openCv, but as fas as i know this funktion can only be used as a histogram based tracker. Does anyone konw how to use the opencv funktion for clutstering, or even better did someone implement this matlab function in c++, or is there a good c++ implementaion of a meanshift clustering around? I am thankfull for every clue?

Hi,
Thanks for this wonderful write up.
My aim is to classify an image into 2 clusters.These clusters may not be physically together.i.e. few pixels may belong to cluster1 then few to cluster 2 ,again to cluster1 and all.
which technique will work best in this case?
Regards
Richa

I’m curious to find out what blog platform you’re using?
I’m experiencing some minor security issues with my latest website and I would like to find something more safeguarded. Do you have any recommendations?

hi dear saravanan,
please rectify the cod, i use it in my project but it has error , i do not understand what is the problem!
please help me , can you rectify your code and send me the correct one?
thank you very much

nice article saravanan. i have implemented my own version which uses a kd-tree and it works fairly well and i am applying it for some clustering problems where a model-based approach seems infeasible due to the massive number of clusters i have to deal with.

However, i have been using a fixed bandwidth so far and on some of my data i find it very hard (literally impossible) to set the bandwidth such that it works on all my data given its diversity. I am now searching for ways to make the bandwidth data-driven or locally adaptive. I like the knn idea (gives large bandwidth at sparse areas) and smaller bandwidth (at dense areas) but then i but setting k doesnt seem easier too.

What does your intuition say on the bandwidth selection issue? It would be great if you could point to some important papers in this regard with ideas that work robustly. I found lots of papers on google but i am not sure which works.

The knn idea would surely make the code run faster but in terms of achieving good clustering consistently of a different datasets (sparse to dense) is it any better than using a fixed bandwidth in terms of sensitivity of parameters?

Shouldn’t there be a normalization constant in front of the kernel(s)? Comaniciu et al used defined a kernel profile, then a kernel based on the kernel profile that satisfies the “integrates to 1” criterion, which requires a constant I think.

Hi Saravanan. u have explained it very well. I am looking for C# code of mean shift. Any one having mean shift code in C# plz let me know, coz i dont want to write it all from scratch. i am doing a research project and mean shift a bit part of it, so if any one having the code plz mail me, I will appreciate it.

Hey Saravan! Thank you for your nice and analytical post! Could you please tell me, what change should be done in your code in order to not insert the number of classes manually? I need to find the proper modes in an automated manner. Thank you in advance