Experiment

I evaluated this paper by indexing local patch descriptors provided from Photo Tourism project[3]. To learn metrics we gather constratins from 10K matching and non-matching patch pairs from the Trevi and Halfdome data. All methods are tested on 10K pairs from the Notre Dame data. I extracted both the raw patch intensities (4096 dim) and SIFT descriptors (128 dim) from each patch. To reduce experiment time, I reduce the dimension of raw patch intensities to 1024 dim. Then I retrieved patches with L2 baseline and learned metrics for each representation.

To determine the parameter gamma, I performed validation with 1K pairs from the training set sevral times with parameters 10^[-10:5]. I found that the best number for parameter gamma of ITML is 0.000001 for RAW data and 0.001 for SIFT data.

Discussion

The absolute values in my results are different from that of the paper. Overall performance of SIFT is higher and that of RAW is lower in my implementation. First one is because I didn't jitter the images. Second one is mainly because we reduced the dimensionality of raw data by averaging neighbor pixels. But the trend of results of the paper and mine is similar. Our learned metrics with ITML performs better in terms of search accuracy than L2 distance metric.
It seems not scalable because it takes very long time to compute new matrix A with high dimensional data (4096 dim in the RAW case). Training a metric for RAW data with 10K constraints takes about 1 hours and it increases cubically (d ^ 3) to the dimensionality of data.