Data clustering is an unsupervised learning technique that attempts to find structure in data. Given a dataset, we may be interested to see whether it contains natural clusters. This is where the classic k-means algorithm kicks in: given a dataset and the desired number of clusters, the algorithm returns the center coordinates of each cluster.

To my knowledge, there are plenty of implementations of the k-means algorithm [WIKI], small and big, simple and complex. Either an implementation is useful only for small instructive datasets, or it is buried deep inside some framework. All we want is to throw our data in from MATLAB and get the results as fast as possible.

I managed to create a small MEX interface for MATLAB [WWW] which calls the k-means implementation in OpenCV [WWW]. That implementation is well tuned and leverages all your processors. If you have a recent Intel processor and have compiled OpenCV with the Intel compiler [WWW] and friends (IPP [WWW] and TBB [WWW]), you already own a powerful rocket, full of fuel, pointing to the outer regions of known space! Why not use it?

You can download the source code [HERE] along with pre-compiled MEX files for Mac OS X Lion and MATLAB 2012a 64-bit, built with the Intel compiler. Use the supplied script from the terminal to compile your own blob.

If you have installed OpenCV under the /usr/local directory, you can try to compile the MEX file from MATLAB's command prompt:
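For example (a sketch only; the source file name is a placeholder and the library list depends on your OpenCV build):

```
mex kmeans_mex.cpp -I/usr/local/include -L/usr/local/lib -lopencv_core
```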

OpenCV is a framework for image processing, machine learning and related processing of images and other data. One of its central data structures is the famous matrix, the Mat type. It can store multi-channel data, different color spaces, sparse and dense representations and who knows what else. In short, it is a big and complicated piece of software.

When using matrices created, manipulated and returned by OpenCV's functions, all is fine and simple. The trouble comes when you need to get your own data in and out of the Mat object. There are multiple types of data that can be stored in a Mat object. You are all familiar with "CV_8U", "CV_16U", etc. But how do you know what type is stored in the matrix object after some long mangling in OpenCV internals? Is it int? Or unsigned int? Or maybe even float or double?

Fortunately, the Mat class provides the function type(), which returns a type code, a mere integer. See the table of correspondences:

Type macro     Code   C/C++ type
CV_8U            0    8-bit unsigned integer (uchar)
CV_8S            1    8-bit signed integer (schar)
CV_16U           2    16-bit unsigned integer (ushort)
CV_16S           3    16-bit signed integer (short)
CV_32S           4    32-bit signed integer (int)
CV_32F           5    32-bit floating point number (float)
CV_64F           6    64-bit floating point number (double)
CV_USRTYPE1      7    User-defined type

When creating a Mat object from int, float, double or any other C/C++ type, you can use the convenience interface DataType. For example, suppose you are manipulating an array of double values in your C/C++ code and now want to put it in a Mat object. Instead of guessing which CV_*** constant to choose, do the following and let the compiler determine the type:

In Computer Vision applications, local keypoint descriptors are well known. They encode in a vector some visual information contained around a selected keypoint in the image. We say "some" since there are many assumptions and information types that can be extracted from an image patch and described compactly (e.g. color histograms, edge directions, gradient histograms, etc.).

Local descriptors have been around for at least a decade and many of them have been proposed over time. The best known is SIFT (Scale-Invariant Feature Transform) [PDF] [PDF], which attempts to capture the dominating gradients in a structure of cells forming a keypoint. A more recent one is SURF (Speeded Up Robust Features) [PDF], which improves the keypoint detection and descriptor stages by building a distribution-based descriptor that is more compact and much faster to extract.

There are many more keypoint detectors and descriptors. In this article we focus on a novel descriptor named FREAK (Fast Retina Keypoint) [PDF]. It is a binary descriptor encoding simple image intensity comparisons on a specific pattern at large and small scales. It captures visual information in a human-like manner: coarse in the peripheral regions of the retina and fine in the central (fovea) region. Its main practical interest stems from extremely fast extraction and matching, which is long awaited in embedded devices and large-scale applications. We refer the reader to the original paper for a more exhaustive discussion, comparisons and technical details.

As the FREAK descriptor is very recent and, as of this writing, was only recently integrated into OpenCV, I decided to write a MEX interface for MATLAB. For descriptor extraction there are two MEX files: a detected-feature and a dense-feature version. The first uses the SURF detector to get the keypoints, while the second samples keypoints densely over the whole image.

Additionally, I would like to put forward three additional functions to work with those binary descriptor vectors:

Bi-directional FREAK descriptor matching for two sets using the Hamming distance
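As an illustration of the matching strategy (not the MEX code itself, which uses SSE popcount instructions; the function and struct names here are mine), a portable sketch of bi-directional Hamming matching between two sets of binary descriptors:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <bitset>
#include <utility>

// Hamming distance between two binary descriptors stored as 64-bit words.
// A 512-bit FREAK descriptor corresponds to eight uint64_t words.
static int hammingDistance(const uint64_t* a, const uint64_t* b, size_t words) {
    int d = 0;
    for (size_t i = 0; i < words; ++i)
        d += static_cast<int>(std::bitset<64>(a[i] ^ b[i]).count());
    return d;
}

// Index of the descriptor in 'set' nearest to 'query' (descriptors are
// laid out contiguously, 'words' 64-bit words each).
static size_t nearest(const std::vector<uint64_t>& set, size_t words,
                      const uint64_t* query) {
    size_t best = 0;
    int bestDist = -1;
    for (size_t i = 0; i * words < set.size(); ++i) {
        int d = hammingDistance(&set[i * words], query, words);
        if (bestDist < 0 || d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

// Bi-directional (mutual nearest neighbour) matching: keep the pair (i, j)
// only if j is the nearest to i in B AND i is the nearest to j in A.
std::vector<std::pair<size_t, size_t> >
matchBidirectional(const std::vector<uint64_t>& A,
                   const std::vector<uint64_t>& B, size_t words) {
    std::vector<std::pair<size_t, size_t> > matches;
    for (size_t i = 0; i * words < A.size(); ++i) {
        size_t j = nearest(B, words, &A[i * words]);
        if (nearest(A, words, &B[j * words]) == i)
            matches.push_back(std::make_pair(i, j));
    }
    return matches;
}
```

The brute-force double loop is the simplest correct formulation; the real speed comes from the hardware popcount in the distance computation.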

The code can be downloaded [HERE]. The code is distributed under BSD license [WWW].

For compilation, you will need at least OpenCV 2.4.2 [LINK]! To get the efficient Hamming distance computation compiled and working, you will need a recent compiler and an Intel processor supporting SSE3 instructions!

Feel free to write me if you find any bugs that I have not noticed yet. Porting the whole thing to a CMake project would be a nice thing to do, among others.

EPILOG: I would like to ask one question to all: has anybody worked with large amounts of binary data and performed clustering with the Hamming distance (or any other metric suited for binary data)? I could not find any implementation apart from MATLAB's in the Statistics Toolbox, which seems not suited to large amounts of data.

With new things, the good old working things may break. This was the case for Matlab [WWW] MEX [WWW] compilation after migrating from Snow Leopard [WWW] to Lion [WWW]. For those who don't know, Matlab is an advanced scientific computation environment with many tools available in different domains (image and signal processing and well beyond). Some tasks are quite slow in Matlab; that is why it is possible to compile a binary using GNU GCC [WWW] which then runs at full speed and returns the result to the environment.

Some time ago I bought the Intel Composer XE compiler [WWW] with the TBB [WWW] and IPP [WWW] libraries. Naturally, I wanted to use the compiler in Matlab. Unfortunately, even the most recent version of Matlab (2012a as of this writing) cannot detect the compiler under Mac OS X Lion. Matlab's support confirmed that it is not supported, and Google gave no solution to my best search efforts.

Having quickly abandoned the idea of tweaking the mexopts.sh file, responsible for setting the compiler and its options on each system, I chose a different approach. After all, Matlab just calls the C/C++ compiler with a bunch of options. You could compile the MEX file [WWW] from the console like any other binary, right? So what are those options?

Intel's C++ compiler is typically named "icpc" and can be run from the console. Basically, all you need is to specify Matlab's include and library directories (remember the "mex.h" header in each MEX file) plus some special options.

I have the Matlab 2012a version installed in "/Applications/MATLAB_R2012a.app".

I have a simple C++ source file doing some very smart computations, "match.cpp".

After digging in the mexopts.sh file to understand what options are passed, I came up with the following command:
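Reconstructed from the switches mexopts.sh uses, it was of roughly this shape (paths match my 2012a install; your versions and exact flags may differ):

```
icpc -I/Applications/MATLAB_R2012a.app/extern/include \
     -L/Applications/MATLAB_R2012a.app/bin/maci64 \
     -lmx -lmex -lmat \
     -bundle -DMATLAB_MEX_FILE \
     match.cpp -o match.mexmaci64
```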

The command looks long, weird and unfriendly, but take a closer look and you will see some important information there.

Notice the -I and -L switches that point to Matlab's header and library directories. Notice the three special Matlab libraries. Notice a bunch of special switches that make the output binary a real MEX file; without them the compiler will complain about a missing main() function and some other weird stuff. Finally, notice that I had to specify the -o switch, which basically renames the a.out file to a proper file with the proper extension.

Now let's do two simple checks to see whether the binary we got is not rubbish but a valid Matlab MEX file. We will use the "file" and "otool" utilities. The latter is installed if you have Xcode and the command-line developer tools installed.
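1) Checking the file type (illustrative output; the exact wording depends on your OS version):

```
SURFmexv2 xeon$ file match.mexmaci64
match.mexmaci64: Mach-O 64-bit bundle x86_64
```

A MEX file on 64-bit Mac OS X should come out as a Mach-O 64-bit bundle.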

2) Checking the libraries used (we see the three Matlab-specific libraries and one Intel runtime library):

SURFmexv2 xeon$ otool -L match.mexmaci64
match.mexmaci64:
    @rpath/libmx.dylib (compatibility version 0.0.0, current version 0.0.0)
    @rpath/libmex.dylib (compatibility version 0.0.0, current version 0.0.0)
    @rpath/libmat.dylib (compatibility version 0.0.0, current version 0.0.0)
    mac64/libcilkrts.5.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 52.0.0)
    /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1094.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)

All is OK here too. Matlab's libraries are listed, as well as Intel's runtime library and the standard C++ library. You may have more dependencies depending on the complexity of your code.

I did not try it with more C++ source files. If there are multiple source files, all you need is to first compile the object files separately and then add the *.o files in the final compilation and linking stage. This remains to be tested.

As I am not strong with Makefiles, building a nice make project to compile MEX files without the Matlab environment would be a good exercise.

It seems trivial to verify the presence of an element (x,y) in a rectangular window. Now imagine that you have millions of elements and many thousands of windows or more. The problem quickly gets unwieldy, as it scales as badly as O(N^2)!

I propose a simple solution based on buckets. First, the whole space is split into small regions or cells, say 5x5. Given a window, it is very easy to find the top-left and bottom-right cells that belong to it, and the cells inside are easy to find too. Next, knowing the width of each cell, it is very easy to calculate which cell in the X or Y direction each element belongs to.

My implementation is a MATLAB MEX file. A single for loop puts every element into its cell. Then a for loop over the windows efficiently finds all the elements falling into each window by considering cells rather than individual elements. The size of a cell can be tuned by the user.
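The idea can be sketched in portable C++ (the class and struct names are mine; the MEX file works on MATLAB arrays instead):

```cpp
#include <cstddef>
#include <cmath>
#include <vector>

struct Point  { double x, y; };
struct Window { double xmin, ymin, xmax, ymax; };

// Simple bucket grid: the plane is cut into cellSize x cellSize cells and
// every point is dropped into its cell once.  A window query then only
// visits the cells the window overlaps, instead of scanning all N points.
class BucketGrid {
public:
    BucketGrid(const std::vector<Point>& pts, double cellSize,
               double x0, double y0, int cols, int rows)
        : pts_(pts), cell_(cellSize), x0_(x0), y0_(y0),
          cols_(cols), rows_(rows), buckets_(cols * rows) {
        for (size_t i = 0; i < pts.size(); ++i) {
            int cx = clampCol((int)std::floor((pts[i].x - x0) / cellSize));
            int cy = clampRow((int)std::floor((pts[i].y - y0) / cellSize));
            buckets_[cy * cols_ + cx].push_back(i);
        }
    }

    // Indices of the points lying inside the window.
    std::vector<size_t> query(const Window& w) const {
        std::vector<size_t> out;
        int c0 = clampCol((int)std::floor((w.xmin - x0_) / cell_));
        int c1 = clampCol((int)std::floor((w.xmax - x0_) / cell_));
        int r0 = clampRow((int)std::floor((w.ymin - y0_) / cell_));
        int r1 = clampRow((int)std::floor((w.ymax - y0_) / cell_));
        for (int r = r0; r <= r1; ++r)
            for (int c = c0; c <= c1; ++c) {
                const std::vector<size_t>& b = buckets_[r * cols_ + c];
                for (size_t k = 0; k < b.size(); ++k) {
                    const Point& p = pts_[b[k]];
                    if (p.x >= w.xmin && p.x <= w.xmax &&
                        p.y >= w.ymin && p.y <= w.ymax)
                        out.push_back(b[k]);
                }
            }
        return out;
    }

private:
    int clampCol(int c) const { return c < 0 ? 0 : (c >= cols_ ? cols_ - 1 : c); }
    int clampRow(int r) const { return r < 0 ? 0 : (r >= rows_ ? rows_ - 1 : r); }
    std::vector<Point> pts_;
    double cell_, x0_, y0_;
    int cols_, rows_;
    std::vector<std::vector<size_t> > buckets_;
};
```

Only boundary cells require the per-point bounds test; interior cells could be copied wholesale, which is where the speed-up for many windows comes from.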

The code, as well as a 64-bit MATLAB binary for Mac, can be downloaded from [HERE]. The code is licensed under the BSD license.

In Computer Vision, the OpenCV [WWW] framework is a well-known collection of high-quality, real-time functions for image processing, feature extraction, matrix operations, object detection and many more ready-to-use routines. The framework is in active development and new functionality is added on a constant basis.

The computations in Computer Vision are classically very demanding, and a lot of effort has been put into optimizing the algorithms. Intel has two library collections that push this barrier even further: Integrated Performance Primitives (IPP) [WWW] and Threading Building Blocks (TBB) [WWW].

The IPP provides over 10K highly optimized functions for multimedia, data processing and communications applications. For example, you can create thumbnail images in real-time or accelerate data compression tasks with this library.

TBB provides high-level functions to leverage the power of the multi-core systems found in even low-cost machines. The bright side is that the user does not need to be an expert and avoids many common pitfalls of parallel computation. It may be seen as a replacement for the OpenMP framework [WWW].

The issue is that I could not compile OpenCV 2.3.1 or the recent OpenCV 2.4.0 on my Mac OS X (Snow Leopard) machine. I have a relatively recent Intel Composer XE 2011 C++ [WWW] compiler with the TBB (4.0 interface 6003), IPP (7.0.205) and MKL libraries that I would like to leverage.

The standard command "cmake .." from the build directory failed with two detection errors.

The second error was easy to fix. First check which version of Python the system uses (issue "python --version"). If it is 2.6, then using MacPorts you can install the missing module by issuing the command "sudo port install py26-sphinx".

The first error is a bit more complex to fix. The problem lies in the file "cmake/OpenCVFindIPP.cmake", which CMake uses to find the IPP library and point the variables to it, basically the include and library directories. You need to modify two lines in the function set_ipp_variables() (I assume you are working on a 64-bit system):

(1) First change, replace

if(NOT EXISTS ${IPP_ROOT_DIR}/lib/intel64)

with

if(NOT EXISTS ${IPP_ROOT_DIR}/lib)

(2) Second change, replace

set(IPP_LIBRARY_DIRS ${IPP_ROOT_DIR}/lib/intel64 PARENT_SCOPE)

with

set(IPP_LIBRARY_DIRS ${IPP_ROOT_DIR}/lib PARENT_SCOPE)

And you are done.

On my system, I enabled TBB and IPP with the following command from the build directory:
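A sketch of such a command (WITH_TBB and WITH_IPP are OpenCV's standard CMake switches; depending on your install you may also need to point CMake to the library directories):

```
cmake -D WITH_TBB=ON -D WITH_IPP=ON ..
```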

Bag of Visual Words (also known as Bag-of-Words) [LINKS] is a well-known technique for describing visual content in Pattern Recognition and Computer Vision. The idea is to represent an image or an object as a histogram of visual word occurrences. Here, visual words are quantized local descriptors such as SIFT [WWW] or SURF [PDF]. Quantization of the extracted descriptors is usually done using the k-means [WWW] algorithm.

I encountered the problem of efficiently computing such histograms not over the whole image but over multiple sub-regions of an image. This requires fast selection of the features enclosed by each region and fast histogram computation. Knowing what tools are available, and after some searching on the net, I decided to implement my own Matlab MEX [WWW] version.
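The core of the per-region computation can be sketched as follows (the struct names are mine; in the MEX version the features and regions come in as MATLAB matrices):

```cpp
#include <cstddef>
#include <vector>

struct Feature { double x, y; int word; };  // keypoint with its quantized word id
struct Region  { double xmin, ymin, xmax, ymax; };

// Bag-of-Words histogram over one sub-region: count how many quantized
// descriptors (visual words) fall inside the region.
std::vector<int> regionHistogram(const std::vector<Feature>& feats,
                                 const Region& r, int vocabularySize) {
    std::vector<int> hist(vocabularySize, 0);
    for (size_t i = 0; i < feats.size(); ++i) {
        const Feature& f = feats[i];
        if (f.x >= r.xmin && f.x <= r.xmax &&
            f.y >= r.ymin && f.y <= r.ymax)
            ++hist[f.word];
    }
    return hist;
}
```

With many regions, the feature-selection step is exactly the bucket problem discussed above; combining the two avoids rescanning all features for every region.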

PDF stands for Portable Document Format [WIKI]. This format is not always that portable, sometimes due to special fonts that are not widespread and are not embedded in the PDF file. To make the document portable in that sense, we may want to embed all the fonts.

I assume you are working on a UNIX-like system. You will need Ghostscript [WWW] and Poppler [WWW].
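A typical recipe (a sketch; input.pdf and output.pdf are placeholders): Ghostscript's pdfwrite device rewrites the PDF with fonts embedded, and Poppler's pdffonts lets you verify the result:

```
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
   -sOutputFile=output.pdf input.pdf
pdffonts output.pdf
```

The "emb" column printed by pdffonts should show "yes" for every font in the output file.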

When doing advanced stuff such as image processing, it is very wasteful to reinvent the wheel. This time it is about a very common operation: the sliding window. Imagine that we have an image I and a small MxN region (called an ROI, Region of Interest) that we want to slide over the image I. What for? We may do many things:

image enhancement;

collecting statistics such as the average, standard deviation, etc.;

and so on.

In my case I wanted to collect a simple statistic: the sum of pixel values under the ROI. Instead of writing a custom function with a loop inside a loop, I devoted some time to searching the net. I felt that all the necessary tools were already on my machine, waiting for me to start using them. That was the case: a single function, "nlfilter" [REF], from MATLAB's Image Processing Toolbox does all the magic.

A few steps are necessary:

Define a function that will return a scalar for every ROI:
fun = @(x) sum( x(:) ); <- notice that the output of this function must be a scalar!

Notice that the input of the algorithm is an image (a matrix of values) and the output is an image (a matrix of values). It is up to you to make sense of the input/output values and the workings of the function!
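The steps above boil down to a two-liner (a sketch; I is your image and [M N] the ROI size):

```matlab
fun = @(x) sum(x(:));         % one scalar per ROI position
B = nlfilter(I, [M N], fun);  % slide the M-by-N window over image I
```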

Finally, depending on your needs, this function can be suboptimal. For example, if you want to do convolution or other signal-processing filtering, use conv2 instead!

As you noticed, the Image Processing toolbox [WWW] is necessary to get this operation done. If somebody has pointers to a freely available MEX implementation, please let me know.

The TF-IDF (term frequency, inverse document frequency) word weighting scheme [WWW] is one of the most used normalizations of document-term matrices in text mining and information retrieval. It down-weights very frequent words, the so-called stop-words. The idea is that stop-words are very often non-informative, e.g. "and", "the", "la", "le", etc. A classifier trained on non-weighted data may miss discriminative information brought by other, less frequent words.

You can download a simple MATLAB function [HERE] which returns a TF-IDF document-term matrix derived from a standard word-count document-term matrix. It may not be that efficient if LOTS of documents need to be processed; it would be pretty easy to create a MEX file and get full C speed.

The input to the function is an MxN matrix where each column is a vector representing a document over M words. The vector contains word counts (not frequencies!).
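For reference, one common TF-IDF variant can be sketched in a few vectorized lines (an illustration only, not necessarily the exact formula used in the downloadable function):

```matlab
% X is an M-by-N word-count matrix (rows = words, columns = documents)
tf  = bsxfun(@rdivide, X, max(sum(X, 1), 1));    % term frequencies per document
idf = log(size(X, 2) ./ max(sum(X > 0, 2), 1));  % rare words get larger weights
W   = bsxfun(@times, tf, idf);                   % TF-IDF weighted matrix
```

The max(..., 1) guards avoid divisions by zero and the NaN values they produce, the kind of nasty NaN bug mentioned below.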

If you are interested in this topic, refer to a very nice and completely FREE book on Information Retrieval [HERE].

Credits go to [WWW] from where I snatched the code, added comments and fixed a nasty NaN bug.

UPDATE: The new version of the function is completely vectorized, which enables fast processing even on large matrices!