Maximum Variance Hashing via Column Generation

1 College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
3 School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia

Abstract

With the explosive growth of the data volume in modern applications such as web search and
multimedia retrieval, hashing is becoming increasingly important for efficient nearest neighbor (similar
item) search. Recently, a number of data-dependent methods have been developed, reflecting the great
potential of learning for hashing. Inspired by the classic nonlinear dimensionality reduction algorithm of maximum variance unfolding (MVU), we propose a novel unsupervised hashing method, named maximum
variance hashing, in this work. The idea is to maximize the total variance of the hash codes while
preserving the local structure of the training data. To solve the derived optimization problem, we propose
a column generation algorithm, which directly learns the binary-valued hash functions. We then extend
it using anchor graphs to reduce the computational cost. Experiments on large-scale image datasets
demonstrate that the proposed method outperforms state-of-the-art hashing methods in many cases.

1. Introduction

Nearest neighbor search is a fundamental problem in many applications concerned with information retrieval, including content-based multimedia retrieval [1–3], object and scene recognition [4], and image matching [5]. Thanks to rapid advances in data acquisition techniques, ever more data have been produced in recent years, so these applications suffer from expensive time and storage demands. Recently, hashing has become a popular way to address this issue in terms of both storage and speed. Hashing methods convert a high-dimensional data item, for example, an image, into a compact binary code. More items can then be loaded into main memory, and the distance between two items can be computed efficiently with bitwise XOR operations on their binary codes; hashing therefore has great potential for solving complex large-scale problems.
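As a concrete illustration of the bitwise trick mentioned above (a minimal sketch, not tied to any particular hashing method), the Hamming distance between two codes packed into machine integers is just the popcount of their XOR:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary codes packed into integers:
    XOR marks the differing bits, and the popcount tallies them."""
    return bin(code_a ^ code_b).count("1")

print(hamming_distance(0b1011, 0b0010))  # bits 0 and 3 differ -> 2
```

Because the XOR and popcount run on whole machine words, comparing two codes costs a handful of CPU instructions regardless of the data's original dimensionality.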

Seminal work on hashing, such as locality-sensitive hashing (LSH) [6], uses random projections to generate binary codes in the Hamming space. LSH has since been extended to accommodate more distance metrics [7, 8] and kernelized to capture nonlinear relationships in the data space [9, 10]. Without using any training data, LSH and its variants map close data samples to similar binary codes, and it is theoretically guaranteed that the original metrics are asymptotically preserved in the Hamming space as the code length increases. Because of the random projection, however, they need very long codes to achieve good precision in practice.
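To make the random-projection idea concrete, here is a minimal pure-Python sketch of sign-of-random-projection LSH; the dimensions, number of bits, seed, and sample points are illustrative choices, not taken from the paper:

```python
import random

def lsh_code(x, planes):
    """One bit per random hyperplane: the sign of the projection of x."""
    return [1 if sum(w * v for w, v in zip(p, x)) >= 0 else 0 for p in planes]

random.seed(0)
dim, n_bits = 4, 16
# Gaussian random hyperplanes through the origin
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = (1.0, 0.9, 1.1, 1.0)
b = (1.0, 1.0, 1.0, 1.0)   # a nearby point: most bits should agree
code_a, code_b = lsh_code(a, planes), lsh_code(b, planes)
```

Nearby points fall on the same side of most random hyperplanes, so their codes agree on most bits; but since each bit is random rather than learned, many bits are needed before the agreement rate reliably reflects the true distance.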

Data-dependent hashing methods, instead, take advantage of the available data to learn more compact codes for specific tasks, which has driven the recent endeavors in hashing. For instance, PCA hashing (PCAH) [11] generates linear hash functions from PCA projections of the training data and produces more compact codes than random projections. PCAH can be considered the simplest data-dependent hashing method, but it cannot capture the nonlinear similarity information that is available in the training data. Alternatively, spectral hashing (SH) [12] and self-taught hashing (STH) [22] generate hash codes from the low-energy spectra of data neighborhood graphs to seek nonlinear data representations. The difficulty is how to compute the code of an unseen item, known as the out-of-sample extension problem. As a result, SH has to assume that the training data are uniformly distributed in a hyperrectangle, which limits its practicality. STH addresses this problem in another way: viewing the binary codes of the training data as pseudo-labels, it learns the hash functions via an extra pseudo-supervised learning stage. Nevertheless, learning errors in the self-taught stage may collapse the manifold structure of the training data, as illustrated in Figure 1.

Figure 1: From (a) to (d), a Swiss roll and its hash codes (embedded to 3D by PCA) after applying SH, STH, and MVH-CG, respectively. MVH-CG can maintain the manifold of the Swiss roll in some sense. SH and STH fail to preserve the manifold.

Indeed, all these data-dependent methods aim at hashing the high-dimensional features of the training data into low-dimensional binary codes while preserving the underlying data structure, and they normally suffer from loss of the local geometric structure of the training data. However, viewing the problem from a different angle and removing the constraint of the Hamming space, it can be seen as a variation of the traditional dimensionality reduction problem. Among the large number of dimensionality reduction methods (see [13] for a survey), maximum variance unfolding (MVU) [14] can almost faithfully preserve the local geometric structure of the training data (e.g., the distances and angles between nearby samples).

Meanwhile, Liu et al. [15] recently proposed a scalable graph-based hashing method named anchor graph hashing (AGH). They approximate the original data by a small set of anchors and learn the hash functions using the Nyström extension [16]. However, the generalized eigenfunctions are derived only for the Laplacian eigenmaps embedding, and the performance of AGH may decline rapidly as the number of bits increases.

In summary, the main contributions of this work are as follows. (i) Inspired by MVU, we propose maximum variance hashing (MVH), which directly embeds high-dimensional data into a Hamming space of specified dimension while preserving the geometric properties of local neighborhoods. The idea is to maximize the total variance of the hash codes subject to the constraints imposed by rigid rods between nearest neighbors (NN). (ii) To address the out-of-sample extension difficulty, we propose a column generation-based solution of the derived optimization problem, named MVH-CG. (iii) As the size of the training set increases, the construction of the neighborhood graph becomes infeasible. Since the outputs of MVH-CG are a set of binary-valued functions, however, we can learn the hash functions on a small anchor set and then apply them directly to any unseen data items. This motivates the anchor version of MVH (referred to as MVH-A), which reduces the computational cost.

We put forward our algorithms and present the main results on several large-scale image datasets in the next sections.

2. Methodology

2.1. Notation

The following notation will be used throughout this paper: (i) a bold lower-case letter (x): a column vector; (ii) a bold upper-case letter (X): a matrix; (iii) a calligraphic upper-case letter (𝒳): a set; (iv) |𝒳|: the number of elements in 𝒳; (v) h(·) or d(·, ·): a function with one or two inputs; (vi) ℝ^d: the d-dimensional real space; (vii) (i, j): a pair of order numbers indexing two data samples.

2.2. Problem Definition

Given a set of n samples 𝒳 = {x_1, …, x_n} ⊂ ℝ^d, we would like to map each point x_i to a low-dimensional binary code y_i for fast nearest neighbor search. Suppose that the desired number of dimensions of the embedded binary space is r; the goal is to seek a transformation Φ: ℝ^d → {0, 1}^r such that the pairwise relationships of the points in ℝ^d are kept, in some sense, by their counterparts in the Hamming space. Here each code y_i is an r-dimensional binary vector projected from x_i using a set of binary-valued functions {h_1(·), …, h_r(·)}.

Formally, we denote the relationship of x_i and x_j by their Euclidean distance d_x(i, j) = ‖x_i − x_j‖ (any other metric can be chosen based on the nature of 𝒳, though here we use the Euclidean distance as a standard setting). Meanwhile, the relationship of y_i and y_j can naturally be defined as their Hamming distance d_h(i, j). We minimize the following objective to keep the pairwise relationships:

min Σ_{i,j} a_{ij} (d_h(i, j) − s · d_x(i, j))²,    (1)

where a_{ij}, depending on the specific application, weights how strongly the relationship of x_i and x_j should be kept during the transformation, and s is a constant scale factor. Typically, it is reasonable to define a_{ij} by the 0-1 adjacency matrix of the training data's k-nearest-neighbor (kNN) graph in order to preserve the local structure of 𝒳. That is, the distance between x_i and x_j will be kept if and only if x_i is a kNN of x_j or the other way around. The kNN graph has been successfully used in STH to represent the local similarity structure, and its sparse nature greatly reduces the computational demand of the subsequent optimization. In addition, Weinberger and Saul [14] proved that if we add a small number of edges over the kNN graph, both the distances along the edges and the angles between edges in the original graph are preserved. Accordingly, we define a_{ij} as

a_{ij} = 1 if (i, j) ∈ 𝒩, and a_{ij} = 0 otherwise,    (2)

where (i, j) ∈ 𝒩 if and only if x_i and x_j are nearest neighbors of each other or common nearest neighbors of another sample.
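A brute-force construction of this extended constraint set (a sketch with O(n²) distance computations, adequate for small training sets; function and variable names are ours) might look like:

```python
from itertools import combinations

def build_constraint_set(points, k):
    """Collect pairs (i, j) that are k-nearest neighbors of each other or
    appear together in some third sample's k-NN list (the extended edge
    set of Weinberger and Saul)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    n = len(points)
    # k nearest neighbors of each point, excluding the point itself
    knn = [sorted(range(n), key=lambda j: dist2(points[i], points[j]))[1:k + 1]
           for i in range(n)]
    pairs = set()
    for i in range(n):
        for j in knn[i]:                      # direct k-NN edges
            pairs.add((min(i, j), max(i, j)))
        for a, b in combinations(knn[i], 2):  # common-neighbor edges
            pairs.add((min(a, b), max(a, b)))
    return pairs
```

In practice a k-d tree or similar index would replace the quadratic scan, but the resulting pair set is the same.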

On the other hand, the Hamming distance used in the code space may not be as descriptive as its counterpart in ℝ^d, especially when the code length r is small (which is desirable in practice). We therefore relax the discrete Hamming distance to the real-valued weighted Hamming distance d_w(i, j) = Σ_{m=1}^{r} w_m |h_m(x_i) − h_m(x_j)|, where w_m ≥ 0 is a nonnegative weight factor associated with h_m, and h_m(x_i) is shorthand for the m-th bit of y_i. Let ε_{ij} denote the slack between d_w(i, j) and d_x(i, j) (we remove the constant scale factor by merging it into the weight factors); the objective then is

min_{w, ε} Σ_{(i,j)∈𝒩} ε_{ij}²,  s.t. d_w(i, j) = d_x(i, j) + ε_{ij}, ∀(i, j) ∈ 𝒩;  w ≥ 0.    (3)

As discussed in [14], by preserving the pairwise distances in the extended kNN constraint set 𝒩, we faithfully preserve the local manifold of 𝒳. Direct optimization of (3), however, tends to crowd points together in one location because pairs other than kNNs do not appear in the constraints, a well-known problem in manifold learning research. Various methods have been proposed in the literature to overcome it. t-SNE [17], for instance, uses the long-tailed Student's t-distribution to put all pairwise information to use, but its constraint set is then no longer sparse. Weinberger and Saul [14], instead, proposed to maximize the variance of the embedded codes: the mapped codes are pulled apart as far as possible subject to the kNN distance constraints in the embedded space. The maximum variance formulation has two advantages: it is (i) a global formulation and (ii) economical in computation. The variance of the codes measured by the weighted Hamming distance is Σ_{m=1}^{r} w_m σ_m, where σ_m is the variance of the m-th bit. It is easy to see that σ_m = n_m (n − n_m) / n², where n_m is the count of "1"s produced by h_m on 𝒳. Combining these together, the optimization objective can be written as

max_{w, ε} Σ_{m=1}^{r} w_m σ_m − C Σ_{(i,j)∈𝒩} ε_{ij}²,  s.t. d_w(i, j) = d_x(i, j) + ε_{ij}, ∀(i, j) ∈ 𝒩;  w ≥ 0,    (4)

where C is the balancing parameter of the two terms. The introduced variable ε here is nontrivial; it will play a critical role in the derivation of the dual of (4).
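The two ingredients of the objective are easy to compute directly; a small sketch (the helper names are ours, and the per-bit variance follows the n₁(n − n₁)/n² form stated above):

```python
def bit_variance(bits):
    """Variance of one hash bit over n samples: n1 * (n - n1) / n**2,
    where n1 is the number of ones the bit takes on the training set."""
    n, n1 = len(bits), sum(bits)
    return n1 * (n - n1) / n ** 2

def weighted_hamming(code_a, code_b, weights):
    """Relaxed real-valued Hamming distance with nonnegative bit weights."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, code_a, code_b))

print(bit_variance([1, 1, 0, 0]))  # balanced bit -> maximal variance 0.25
```

Note that a bit that is constant over the training set has zero variance and contributes nothing, which is exactly why maximizing total variance favors informative, balanced bits.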

2.3. Column Generation Algorithm

There are two sets of variables to be optimized in (4): the binary-valued hash functions h_m(·) and their weights w_m, m = 1, …, r. It is normally difficult to optimize them simultaneously, and the fact that the former is a group of functions adds to the difficulty. We use the column generation (CG) technique to find an approximate solution iteratively. CG has been successfully applied in boosting algorithms [18, 19], which also have to generate a series of binary-valued functions and optimize their weights at the same time. Similar to the well-known expectation-maximization (EM) algorithm, column generation has a two-step iterative framework in which one set of variables is treated as constant in each step. The aim of column generation is to iteratively reduce the gap between the primal and the dual solutions.

We first consider each h_m in (4) as a known function and w as the variable to be optimized. The Lagrangian is then

L(w, ε, u, p) = Σ_m w_m σ_m − C Σ_{(i,j)∈𝒩} ε_{ij}² + Σ_{(i,j)∈𝒩} u_{ij} (d_x(i, j) + ε_{ij} − d_w(i, j)) + p⊤w,    (5)

where u and p ≥ 0 are Lagrange multipliers. At the optimum, the first derivatives of the Lagrangian w.r.t. the primal variables must vanish:

∂L/∂w_m = σ_m − Σ_{(i,j)∈𝒩} u_{ij} |h_m(x_i) − h_m(x_j)| + p_m = 0,    (6a)
∂L/∂ε_{ij} = −2C ε_{ij} + u_{ij} = 0.    (6b)

Substituting (6a) and (6b) back into (5), the Lagrange dual function is

g(u) = Σ_{(i,j)∈𝒩} u_{ij} d_x(i, j) + (1/(4C)) Σ_{(i,j)∈𝒩} u_{ij}².    (7)

Then it is easy to obtain the Lagrange dual problem:

min_u Σ_{(i,j)∈𝒩} u_{ij} d_x(i, j) + (1/(4C)) Σ_{(i,j)∈𝒩} u_{ij}²,  s.t. Σ_{(i,j)∈𝒩} u_{ij} |h(x_i) − h(x_j)| ≥ σ(h), ∀h ∈ ℋ.    (8)

The idea of CG is to iteratively add a variable by selecting the most violated constraint of the dual and then to optimize the related variables by solving a restricted version of the original optimization problem. It works on the basis that the sequence of restricted primal problems all share the same dual, in which the most violated constraint indicates the steepest ascent direction of the dual. For (8), the subproblem generating the most violated constraint is

h★ = argmax_{h∈ℋ} σ(h) − Σ_{(i,j)∈𝒩} u_{ij} |h(x_i) − h(x_j)|,    (9)
where ℋ is the class of base binary-valued hash functions. Since there can be infinitely many functions in ℋ, we restrict it to decision stumps [20], a machine learning model widely used in ensemble learning, fitted on the training set 𝒳. (Since the decision stump is a deterministic model, the column generation process converges when all the constraints in the dual are satisfied, which means that no new hash function can be generated. In practice, however, this convergence usually happens only after the required number of iterations, typically less than 128. Moreover, with a nondeterministic model, column generation could produce new hash functions even after all constraints are satisfied. We therefore do not address convergence in Algorithm 1.) Under the restriction to decision stumps, (9) can be solved by exhaustive search within reasonable time.
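The exhaustive search over stumps can be sketched as follows, assuming each candidate is scored by its bit variance minus the multiplier-weighted disagreement on the constrained pairs (helper names and the dictionary representation of the multipliers are ours):

```python
def make_stump(d, t):
    """Decision stump: emits 1 when feature d exceeds threshold t."""
    return lambda x: 1 if x[d] > t else 0

def most_violated_stump(X, u):
    """Exhaustively search (feature, threshold) stumps for the one maximizing
    bit variance minus the u-weighted disagreement over constrained pairs.
    u maps a pair (i, j) to its current dual multiplier."""
    n = len(X)
    best, best_score = None, float("-inf")
    for d in range(len(X[0])):                 # every feature
        for t in sorted({x[d] for x in X}):    # every observed threshold
            h = make_stump(d, t)
            bits = [h(x) for x in X]
            n1 = sum(bits)
            sigma = n1 * (n - n1) / n ** 2
            penalty = sum(u_ij * abs(bits[i] - bits[j])
                          for (i, j), u_ij in u.items())
            score = sigma - penalty
            if score > best_score:
                best, best_score = h, score
    return best, best_score
```

Only thresholds between consecutive observed feature values change the induced bit pattern, which is what bounds the search to a feasible number of candidates.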

Algorithm 1: MVH-CG: Column generation for maximum variance hashing.

We summarize our MVH-CG framework in Algorithm 1. In the t-th iteration, we add a new function h_t to the restricted primal problem. Let w = (w_1, …, w_t)⊤, ε, u, and σ = (σ_1, …, σ_t)⊤ gather the corresponding scalars, and let H denote the learned hash functions' responses on 𝒳, such that the m-th column of H gathers the bits h_m(x_1), …, h_m(x_n). Then the restricted primal problem, using only the t functions generated so far, can be written as

max_{w≥0, ε} Σ_{m=1}^{t} w_m σ_m − C Σ_{(i,j)∈𝒩} ε_{ij}²,  s.t. Σ_{m=1}^{t} w_m |h_m(x_i) − h_m(x_j)| = d_x(i, j) + ε_{ij}, ∀(i, j) ∈ 𝒩.    (10)
As a quadratic program, (10) can be solved efficiently by the off-the-shelf solver Mosek [21]. The KKT condition (6b) establishes the connection between the primal and dual variables at optimality:

u_{ij} = 2C ε_{ij}, ∀(i, j) ∈ 𝒩.    (11)
The outputs of Algorithm 1 are the learned binary-valued hash functions {h_1, …, h_r}, their weights w, and the binary codes of the training set. Given a new observation x, its r-bit binary code is obtained as

y = [h_1(x), h_2(x), …, h_r(x)]⊤.    (12)

The weight vector w is a by-product of relaxing the difficult discrete problem. We simply discard it and use only the binary codes in hashing applications.
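Applying the learned functions to an unseen item is then a one-liner; the two stump functions below are hypothetical placeholders standing in for learned hash functions:

```python
def encode(x, hash_funcs):
    """Map an unseen sample to its binary code by applying each learned h_m."""
    return [h(x) for h in hash_funcs]

# two hypothetical learned stumps, thresholding features 0 and 1
hash_funcs = [lambda x: 1 if x[0] > 0.5 else 0,
              lambda x: 1 if x[1] > 0.5 else 0]
print(encode((0.9, 0.1), hash_funcs))  # -> [1, 0]
```

This direct out-of-sample encoding is exactly what the anchor-based extension in the next subsection relies on.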

2.4. Anchor Hashing Using MVH

In MVH-CG, we use the constraint set 𝒩 to preserve the manifold structure of 𝒳, and the sparse nature of the kNN matrix reduces the number of variables in (8). Yet the training set can be large; for example, the image dataset CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) has 60,000 images, and the digit recognition dataset MNIST (http://yann.lecun.com/exdb/mnist/) has 70,000 samples. To solve the hashing problem more efficiently, Liu et al. [15] proposed to represent a data point by a set of anchors, which are the cluster centers obtained by running K-means on the whole database (or a randomly selected small subsample of it). When the number of anchors is sufficiently small, the effective Laplacian eigenvector-based hashing method [12] can be run on it in linear time. The main difficulty is how to generate hash codes for unseen points, which is the out-of-sample extension problem; for this reason, [15] has to use the Nyström method [16] to learn eigenfunctions of a kernel matrix. Our MVH-CG method, instead, learns a set of binary-valued hash functions, which can be applied directly to any data point. As a result, we only need to run the MVH-CG algorithm on the anchor set and then apply the learned binary-valued functions to hash the whole dataset. The anchor version of MVH (referred to as MVH-A) is summarized in Algorithm 2.
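The anchor step only needs plain K-means; a self-contained Lloyd's-iteration sketch is below (the toy data, seed, and iteration count are illustrative choices of ours, not parameters from the paper):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; the returned centers serve as the anchor set."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest current center
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # recompute centers; keep the old center if a cluster emptied
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return centers

# two well-separated blobs -> one anchor should land near each
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
anchors = kmeans(data, k=2)
```

MVH-CG would then be trained on `anchors` instead of the full dataset, and the learned binary-valued functions applied to every original sample.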

Algorithm 2: MVH-A: Maximum variance hashing with anchors.

3. Evaluation of the Algorithm

In this section, we evaluate the hashing behavior of MVH-CG and the influence of the parameters.

We first evaluate the hashing behavior of MVH-CG on a Swiss roll toy dataset. The Swiss roll is a 2D submanifold embedded in 3D space and can be thought of as a curled piece of rectangular paper. We apply MVH-CG, spectral hashing (SH) [12], and self-taught hashing (STH) [22] to it with the same code length. To visualize the results, we embed the obtained hash codes into 3D by PCA. The results are illustrated in Figure 1. All three methods tend to keep the neighborhood relationships during the mapping. MVH-CG maps the Swiss roll into a cube and maintains its submanifold in some sense, whereas SH and STH fail to preserve the manifold. For SH, one reason may be that it attempts to keep all pairwise relationships during the mapping. Studies in dimensionality reduction point out that kNNs can achieve a good approximation of the original manifold, and a method built on a kNN kernel (as used in MVH-CG) can analyze data lying on a low-dimensional submanifold more faithfully than one built on a predefined global kernel (as used in SH) [23]. For STH, the failure may be due to learning errors in its self-taught stage.

We then take the MNIST dataset as an example to evaluate the influence of the parameters. The MNIST dataset consists of 70,000 images of handwritten digits divided into 10 classes of 28 × 28 pixel images. We use the original 784-dimension pixel representation for MNIST. There are two parameters in MVH-CG: k, the kNN size, and C, the balancing parameter between the two terms of objective (4). To eliminate the scale difference of the two terms, C is multiplied by a constant in the experiments. We randomly select 4,000 samples of the MNIST dataset, half for training and the rest for testing, to evaluate the influence of C and k. The results are summarized in Figure 2. From (a), we can see that the MAP curves rise as C grows, which indicates that the second term of (4) is somewhat more important; from (b), we see that the performance of MVH-CG does not change significantly with the number of nearest neighbors k. Based on these observations, we fix C and k accordingly for the remainder of our experiments.

Figure 2: MAP results versus the balancing parameter C ((a), with k fixed) and the number of nearest neighbors k ((b), with C fixed) for MVH-CG. The comparison is conducted on a subset of the MNIST dataset.

We also run an experiment to evaluate the influence of the anchor set size in MVH-A on the MNIST dataset. We randomly select 1,000 samples as the test set and use the others (69,000 samples) for training. As described in Algorithm 2, we first reduce the 69,000 training samples to a set of anchors by K-means clustering and then run MVH-CG on the anchor set. The resulting MAP curves in Figure 3 remain basically stable over the tested range of anchor set sizes. We therefore fix the anchor set size accordingly for MVH-A.

Figure 3: MAP results versus anchor set size on the MNIST dataset.

4. Experiments

In this section, we evaluate the proposed hashing algorithms on the large-scale image datasets MNIST and CIFAR-10. The MNIST dataset consists of 70,000 images of handwritten digits. The CIFAR-10 dataset consists of 60,000 images in 10 classes, that is, 6,000 samples per class. In our experiments, we use the original 784-dimension pixel representation for MNIST and a 512-dimension GIST [24] feature for CIFAR-10. Both datasets are split into a test set of 1,000 images and a training set containing all other samples. Since the proposed MVH method is fully unsupervised, we compare it with several unsupervised hashing algorithms, including PCA-based hashing (PCAH) [11], spectral hashing (SH) [12], self-taught hashing (STH) [22], and anchor graph hashing (AGH) [15]. Performance is measured by mean average precision (MAP) and precision-recall curves for Hamming ranking.
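For reference, the MAP measure used throughout the experiments can be computed as follows (a standard definition sketch over the returned ranking, independent of any particular hashing method):

```python
def average_precision(ranked_rel):
    """AP for one query: ranked_rel[i] is 1 if the i-th ranked item is
    relevant. Averages the precision at each relevant position."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: the mean of per-query average precisions."""
    return sum(map(average_precision, rankings)) / len(rankings)

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 = 0.8333...
```

For Hamming ranking, `ranked_rel` is obtained by sorting the database by Hamming distance to the query code and marking items sharing the query's class label as relevant.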

4.1. Results on the MNIST Dataset

We report the MAP results for Hamming ranking with code lengths from 32 to 128 bits in Figure 4(b). AGH obtains a high score at very short code lengths. Its performance, however, declines rapidly as the code length increases and falls below that of MVH-CG at longer code lengths. The performance of PCAH and STH, similar to AGH, also drops with longer bit lengths. By contrast, MVH-A and MVH-CG consistently improve as the code length grows. This property is important in very large-scale problems, where a short hash code, of 32 bits for example, is not enough to describe the whole dataset. MVH-CG is consistently superior to MVH-A, as more data are used in its learning process; yet MVH-A also catches up with AGH at longer code lengths. We then plot the precision-recall curves for the compared methods in Figure 5. The curves of AGH are relatively high at the beginning but drop rapidly when more samples are returned. We also see that our MVH methods perform better at larger code lengths, which confirms the observations in Figure 4. PCAH performs worst in this case since it simply generates hash hyperplanes by linear projection, which cannot capture the nonlinear similarity information behind the training data. SH is slightly better but still much worse than the others, because it relies on the strict uniform-data assumption.

Figure 4: Comparison of different methods using MAP for varying code lengths on CIFAR-10 (a) and MNIST (b).

Figure 5: Precision-recall curves for competing methods on the MNIST dataset for different code lengths.

4.2. Results on the CIFAR-10 Dataset

The CIFAR-10 dataset is a manually labeled subset of the well-known 80 million tiny images dataset [4]. It consists of 60,000 images from 10 classes, as in the examples shown in the top row of Figure 6. Each image is represented by a 512-dimension GIST [24] feature and then hashed by MVH-A. MVH-CG is not run here, since training decision stumps on the whole dataset is expensive. The bottom row of Figure 6 shows the returned list for an example query, where the first 8 results are correct and the last 2 are false positives. The MAP scores against code lengths are plotted in Figure 4(a). On this dataset, MVH-A yields rising performance as the number of bits increases; it outperforms all its competitors from moderate code lengths onward and achieves the highest MAP score at the longest code length tested. PCAH and SH again perform worst. Figure 7 shows the precision-recall curves of Hamming ranking for the compared methods with different code lengths. At short code lengths, MVH-A is inferior to AGH; as the code length grows, the areas under the precision-recall curves of MVH-A become much larger than those of AGH and the other methods. This trend is consistent with the MAP results.

Figure 6: (a) Samples from the CIFAR-10 dataset, one for each category. (b) The results for a query of a "horse" image returned by MVH-A with 128 bits. The last two returns are false positives.

Figure 7: Precision-recall curves for competing methods on the CIFAR-10 dataset for different code lengths.

5. Conclusion

This paper has proposed a novel unsupervised hashing method based on manifold learning, which maximizes the total variance of the hash codes while preserving the local structure of the training data. Two algorithms, MVH-CG and MVH-A, have been proposed to solve the derived optimization problem. Both can embed the input data into a binary space while maintaining the submanifold with very short hash codes. The training process of MVH-A is faster than that of MVH-CG, but the anchor representation of MVH-A may degrade the retrieval performance of the resulting hash codes. Experimental results on large-scale image datasets show that, for image retrieval, the proposed algorithms are consistently superior to state-of-the-art unsupervised methods such as PCAH, SH, and STH, and outperform AGH at relatively longer code lengths. The idea of manifold learning has great potential for large-scale hashing, and we plan to develop more efficient hashing methods based on other manifold learning approaches.

Acknowledgments

The authors gratefully acknowledge the kind help from the Academic Editor Shengyong Chen. This work was supported by the National Natural Science Foundation of China under NSFC nos. 61033008, 61272145, 60903041, and 61103080, the Research Fund for the Doctoral Program of Higher Education of China under SRFDP no. 20104307110002, the Hunan Provincial Innovation Foundation for Postgraduates under no. CX2010B028, and the Fund of Innovation in the Graduate School of NUDT under nos. B100603 and B120605.