Computer Vision Application with C# - Part III

Introduction

Moving on from Part II, this article presents a tracking algorithm. The Centroid tracker presented in Part II was not actually
a tracking algorithm but a fancy trick. All the "tracker" was doing was to find the centroid of the backprojected image and put a bounding box around it - this is not tracking
by definition. So this article presents a very well known tracking algorithm in Computer Vision community called Meanshift Tracking. This article is based on a series of research papers
by Comaniciu Et Al. [1],[2],[3].

Moving on, visual tracking can be described as identifying an object in a video frame and then tracking that object in subsequent frames. It does not include the detection
of the object. Tracking assumes that the object is selected either manually or automatically. In this article, we'll use manual object selection. Once the object is selected,
some kind of model(s) is created based on some attribute(s) and then using the model(s), the object is tracked in subsequent frames. Meanshift, as described in the mentioned research
papers, uses colour as the attribute and histogram for modeling this attribute. We have already covered these pre-requisites in Part I.

Meanshift Tracking - High Level View

Given an object, we create a colour histogram - let's call this target histogram. When we get the next frame, we try to find an object with a similar colour histogram
in the vicinity of the object in the previous frame. Once we find this object, we stop looking for it and wait to get another frame. This process is repeated over and over again
which results in tracking. At this point, there are two questions one must ask:

How do we know that we have found our object?

How do we search the vicinity of the object?

The answer to the first question is that we use some kind of histogram matching, such as Bhattacharyya Coefficient (see Part I). So we create the histogram of our possible
object and match it to our target histogram. If it is close enough, we assume it is the same object; if not, we make a move in the vicinity and create a histogram and do a match
again till we find a close enough object.

The way of looking in the vicinity is a technique from which the algorithm gets its name: Meanshift. Meanshift is the mode seeking algorithm. In other words, given a probability
distribution a.k.a. normalised histogram, the Meanshift algorithm seeks the peak of this histogram.

Basic Meanshift Tracking

We first pick the attribute(s) of the object to model the object. We are using colour as the attribute and the normalised histogram as our model. So in the video frame, we select the
target object by placing a bounding box, a.k.a. ROI (Region of Interest), around it. Assuming everything, including
the background, in the bounding box belongs to the target, we create the normalised colour histogram.

Now we grab the next frame and place our bounding box at the same location as the previous frame and create a candidate normalised colour histogram.

Next comes the Meanshift bit. The Meanshift algorithm iterations consists of steps 3 to 5 below:

Create a target model in the ROI in the form of a histogram based on its feature - most commonly it is colour.
Normalise the target histogram.

In the next image frame, create a candidate model histogram at the location "x"
of the target in the previous image. Normalise the candidate histogram.

Each pixel "I" in the ROI around "x" provides a weighted sample. Suppose its colour is in bin u of the histogram, then the weight contributed
by a pixel is given by:

Update the meanshift vector:

Update the current position and iterate:

The effect of the above steps is that an object is tracked in a sequence of frames. The weights are important, since otherwise the samples are simply
at the pixel locations and therefore uniformly distributed. The Meanshift algorithm works because the weighted ratio (marked in red above) is biased towards
a mode because colours are more likely to match in the correct direction. Hence the weights will on average be greater closer to the true location.
Don't be put off by the Mathematics - here is what is going on:

The table below shows the Meanshift calculations. X and Y are the coordinates, RGB is the pixel colour value, Bin is the histogram index corresponding to the RGB colour value,
is the value of the target histogram bin, and
is the value of the candidate histogram bin:

The new centroid is found as shown below. We keep on moving the centroid till we find the likely target, i.e., using the Bhattacharyya coefficient to match the histogram.

An improvement to this basic Meanshift tracking proposed by Comaniciu Et Al. is that once we find the new centroid, we create a histogram of the image under the ROI
and find the Bhattacharyya coefficient (let's call it BC1). We move the centroid in small increments and create a histogram. Then as long as the Bhattacharyya coefficient
at the moved centroid is less than BC1, we keep on inching the centroid towards the target. Furthermore, we also keep track of how much the centroid has moved from the previous
position and the number of Meanshift iterations. The centroid movement is found by using the Euclidean distance between the previous and current centroids. If the centroid
movement and the number of Meanshift iterations are less than some threshold, we keep on running the Meanshift algorithm. Using the threshold is important for practical
purposes because if the tracker gets lost, we don't want to be stuck in an infinite loop!

Kernel-based Meanshift Tracking

In kernel-based Meanshift, a kernel profile is overlaid on the ROI and the model is created based on the kernel. The tracker proposed by Comaniciu Et Al [3]
uses the Epanichnikov kernel shown below. Because of the shape of this kernel, the pixels towards the centre of the ROI would have a greater value than the pixels towards
the edges, which gives the tracker the ability to reject interfering backgrounds.

There is one key difference between a kernel-based Meanshift and basic Meanshift. When the target or candidate model is created in a kernel-based Meanshift tracker,
a suitable kernel is used in the ROI and the value of the kernel corresponding to the colour pixel value is used to create the histogram instead of the actual colour value.
The process of creating the model, i.e., histogram, in the kernel-based Meanshift tracker is the same as usual, expect that instead of incrementing the corresponding bin by 1, the value
of the kernel at that location is added to the bin. For example, if a pixel colour value falls in bin 3 and the kernel value at the pixel is 0.843, then we would
add 0.843 to bin 3 instead of incrementing the bin value. Using the Epanichnikov kernel, the bin value "k" will look something like this (the equations
in 2b and 2c come from the Epanichnikov profile):

Using the Epanichnikov kernel for an ROI of 5X5 pixels might look like shown below. We can see that the pixels near the centre have a larger value than the pixels at the edges:

Background Weighted Meanshift Tracking

The background is always of concern in visual tracking. When the target and background colours are the same, there is a high probability that the tracker will get lost.
Comaniciu Et Al. [3] have proposed a variation of the Meanshift algorithm in which the histogram of the background area around the ROI is also calculated and fused with the histogram
of the ROI to create a more reliable model. A histogram is created for an area around the ROI window. An appropriate area can be selected experimentally. For each bin, a weight
is then derived by taking the minimum between 1 and the minimum non-zero number in that bin. If is the normalised
histogram and is the smallest non-zero entry, then the weights are derived according to:

The weights are to reduce the importance of those features that are prominent in the background, i.e., with low - this is given below:

Get the smallest non-zero value from the background histogram

For each bin in the target histogram:

Find

Apply weights to the target model:

Using the Code

Step 4: Select an area in the image using mouse - Right click the mouse and drag. This will create a colour histogram of the selected area.
The target histogram is shown in red and will not change during the tracking. The red ROI is for the target and the yellow is for the background.

Step 5: Click the "Start Tracking" button to see the tracking in action.

For Meanshift, the candidate histogram is shown in blue and will change during tracking as it is the histogram of the area that the tracker "thinks" is the target.
The centroid and Bhattacharyya Coefficient are displayed on the screen. For the centroid, only the centroid information is displayed.

Moving Forward

I have not implemented scale adaptation described in the paper. Scale adaptation means the ROI should shrink and expand when the object moves away from and near
to the camera, respectively. The paper suggests evaluating Bhattacharyya coefficient at a scale smaller and larger than the current ROI and then set the ROI size
to the one that gives you the maximum coefficient. Also, I would encourage the readers to implement the CAMShift Tracker [4]. The series of articles so far has
all the information required to implement the CAMShift tracker plus the paper is easy to follow - Good luck!

Acknowledgements

For any third party code or material used, please refer to "Points of Interest" in Part I.

Comments and Discussions

Hello Arif,
You've done a good job. But i want to make a subtle suggestion.
1. Clicking the "Start Tracking" button without loading the sequence did nothing
2. After loading the sequence, clicking the "Start Tracking" button made the program fail because of a null reference exception (Target Window has not been created).

Hence i've modified the Click Event Handler of the "Start Tracking" button (btnTrack_Click) to notify the user if he has not loaded the sequence or selected a region of interest.

Hi, could you provide an explanation of how CAMSHIFT can be used to detect text in images after the texture properties of an image is detected using svm classifiers? can u plz provide an example in using CAMSHIFT for text detection plz?

CAMSHIFT is a tracking algorithm and not a detection algorithm. Something like SVM or Neural Network will do the detection for you. Given that you have already done the detection using SVM, I am not sure what you need CAMSHIFT for.

Any how, you can get some information and the link to the original CAMSHIFT paper here: http://www.gergltd.com/cse486/project5/

hi,this paper is great,thank you very much.
there is one questions,the input is a sequence of bmp picture,but in most case the input should be a video.that is if i want to track a target in video,how should i do,i am not familiar with this field,expect your help....