Our approach uses co-occurences of visual words for large scale content-based image search. Retrieval with AND queries can be carried out very efficiently using skipped inverted files, a key component of web-scale text search engines. For each word, the inverted file provides the images in which the word appears. The images that contain a co-occurence of several words can be found by merging the individual lists. Using skip pointers the merging becomes very efficient: the retrieval time is proportional to the number of results. The efficiency of AND queries is illustrated with the following text search example:

The table above lists the number of documents retrieved by the live.com text search engine for ANDs of terms. Although each of the terms is common, the conjunction of the three words yields fewer than 1000 results, which can be recovered in linear time using skip pointers. By formulating visual queries as ORs of ANDs, we can gain the same efficiency in similar-image search.

Skip pointers allow us to define the co-occurrences at query time instead of at index time. Thus, we propose a method that adaptively chooses the OR-of-AND filter most suited to retrieve results similar to the specific query image. We cast the choice of the filter as an optimization problem: for each query image, select the OR-of-AND filter which maximizes training-set recall for an adjustable bound on the number of results. This optimization can be efficiently implemented using a linear program relaxation.

Below we show the filter selected for a query image containing a flag (first column). In this case the method chooses an OR of two conjunctions (shown in the two row-cells of the first column). Each conjunction is a visual phrase, i.e. a Boolean expression testing whether specific visual words occur more than (or less than) a certain number of times in the image. The individual tests used in each phrase are shown next to the query image, with color-coded visual words. The true positive images retrieved by each phrase are shown in the second column.

We have tested this approach on the Caltech256 data set augmented with a distractor set of almost 800K Flickr images. Comparative results are presented in the figure below which shows recall (i.e. proportion of true positives retrieved) as a function of the average number of retrieved results. Our filtering approach finds 5 times as many true positives than the method of [Chum et al, 2008], which is the only other system capable of operating under tight bounds on the number of retrieved results (the filter set size).