Method for constructing image database for object recognition, processing apparatus and processing program

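Foreign patent code

F120006112

Reference number

S2008-0587

Posted date

January 6, 2012

Country of filing

United States of America

Application number

98990609

Publication number

20110164826

Publication number

8340451

Filing date

April 27, 2009

Publication date

July 7, 2011

Publication date

December 25, 2012

International application number

JP2009058285

International publication number

WO2009133856

International filing date

April 27, 2009

International publication date

November 5, 2009

Priority data

Japanese Patent Application No. 2008-117359
(2008.4.28)
JP

Japanese Patent Application No. 2008-119853
(2008.5.1)
JP

2009JP058285
(2009.4.27)
WO

Title of invention (English)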

Method for constructing image database for object recognition, processing apparatus and processing program

Summary of the invention (English)

Provided is a method for constructing an image database for object recognition, which includes a feature extraction step of extracting local descriptors from object images which are to be stored in an image database, a scalar quantization step of quantizing the numeric value of each dimension of each of the local descriptors into a predetermined number of bits, and a storing step of organizing each of the quantized local descriptors so that a nearest neighbor search can be performed on it, attaching to the local descriptor an identifier of the image from which it has been extracted, and storing the local descriptor with its identifier in the image database. The storing step stores each local descriptor so that, when a search query is given, local descriptors are extracted from the query image, each dimension is scalar-quantized, the nearest neighbor of each query local descriptor is determined from the image database, and one image is identified by majority voting among the images that include any of the determined local descriptors. The scalar quantization step comprises quantizing each dimension of each of the local descriptors into 8 bits or less. Also provided are a processing program for the method and a processing device for performing the processing.
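The scalar quantization step can be illustrated with a minimal Python sketch. It is not the patented procedure: the uniform step size, the value range [lo, hi], and the function names scalar_quantize and pack_codes are assumptions made for illustration only, since the summary does not say how the quantization boundaries are chosen.

```python
def scalar_quantize(vector, bits=2, lo=-1.0, hi=1.0):
    """Quantize each dimension independently to a code in [0, 2**bits - 1].

    Assumption: values lie in [lo, hi] and are quantized uniformly.
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    levels = 1 << bits
    step = (hi - lo) / levels
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)           # keep the value inside the range
        code = int((clipped - lo) / step)       # index of the quantization cell
        codes.append(min(code, levels - 1))     # x == hi falls into the top cell
    return codes


def pack_codes(codes, bits=2):
    """Pack the per-dimension codes into one integer, `bits` bits per dimension."""
    packed = 0
    for c in codes:
        packed = (packed << bits) | c
    return packed


descriptor = [-0.9, -0.2, 0.1, 0.8]             # toy 4-dimensional descriptor
codes = scalar_quantize(descriptor, bits=2)
packed = pack_codes(codes, bits=2)
print(codes, packed, len(codes) * 2, "bits")
```

With 2 bits per dimension, a descriptor of D dimensions occupies 2 * D bits once packed, which is where the memory saving over wider per-dimension types comes from.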

Overview of prior art and competing technologies (English)

BACKGROUND ART

Local descriptors such as SIFT (Scale-Invariant Feature Transform) enable object recognition that is relatively robust to occlusion and to variation in lighting conditions, and they are therefore attracting attention (e.g., see Non-Patent Literatures 1 and 2). A model called "Bag of Words" or "Bag of Features" is basically used for recognition. In this model, locations or co-occurrences of the local descriptors are not considered. Only the frequency of occurrence of the local descriptors is used for recognizing an object.

Here, the local descriptors represent local features of an image. The local descriptors are extracted through a predetermined procedure so as to be robust to variation of an image (geometric transformation, lighting conditions, or variation of resolution). In addition, because the local descriptors are determined from a local area of an image, they are also robust to occlusion. In the present specification, the local descriptors are also referred to as feature vectors because they are represented as vectors.

In general, the number of local descriptors extracted from an image is several hundred to several thousand, and sometimes reaches several tens of thousands. Therefore, an enormous amount of processing time is needed for matching the local descriptors, and an enormous amount of memory is needed for storing them. An important research subject is therefore how to reduce the processing time and the amount of memory while keeping the recognition accuracy at a certain level.

For example, in SIFT, a typical local descriptor, each local descriptor is represented as a 128-dimensional vector. In addition, PCA-SIFT is known, which reduces the dimensionality of the SIFT vector by principal component analysis. An example of local descriptors used in a practical PCA-SIFT is 36-dimensional vectors. Moreover, the value of each dimension is generally represented by a 32-bit float or integer type, as used for general numerical representation. When a higher accuracy is needed, a 64-bit double type is used. On the other hand, when only a limited range of values is used, or when it is desired to reduce the amount of memory even at the cost of accuracy, a 16-bit short integer type can be used. Even in PCA-SIFT using a 36-dimensional vector and the short integer type to prioritize reduction of the amount of data, each local descriptor needs a memory of 16 bits * 36 dimensions = 576 bits (72 bytes).

In general, nearest neighbor searching calculates the distance between vectors and determines the nearest local descriptor. It has been commonly considered that if the accuracy of the data of each dimension is decreased, accurate nearest neighbor searching cannot be performed, and therefore the accuracy of image recognition (recognition rate) is decreased.

Accordingly, many conventional techniques employ the following approach. Local descriptors obtained from an image for constructing a model are vector-quantized (a technique of classifying local descriptors into a predetermined number of groups such that each group includes similar local descriptors, and then expressing each local descriptor included in the same group by a representative value thereof, i.e., clustering), several thousand to several hundred thousand visual words (which correspond to the above representative values) are determined, and an image is described by using the visual words (e.g., see Non-Patent Literature 3). Upon recognition of an unknown image, local descriptors obtained from the image are converted into visual words, and their frequency and the like are measured.
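The conventional visual-word approach described above can be sketched as follows. This is only an illustration: the visual words are given directly as a hypothetical toy list rather than learned by clustering, and the helper names (to_visual_word, bag_of_words) are not taken from the cited literature.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def to_visual_word(descriptor, visual_words):
    """Vector quantization: index of the representative vector nearest to the descriptor."""
    return min(range(len(visual_words)),
               key=lambda i: squared_distance(descriptor, visual_words[i]))


def bag_of_words(descriptors, visual_words):
    """Describe an image by how often each visual word occurs among its descriptors."""
    return Counter(to_visual_word(d, visual_words) for d in descriptors)


# Hypothetical toy data: three visual words and four 2-dimensional descriptors.
visual_words = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [0.1, 0.8]]
histogram = bag_of_words(descriptors, visual_words)
print(dict(histogram))
```

Each descriptor is compared against every visual word here, which is why the cost of vector quantization grows with the size of the vocabulary.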
In such an approach, if the number of visual words is sufficiently small, high-speed processing can be expected. On the other hand, it is pointed out that, if the number of visual words is not large, a sufficient recognition rate cannot be attained (e.g., see Non-Patent Literature 4). The larger the number of visual words is, the more difficult it is to ignore the calculation time needed for vector quantization. In addition, a problem arises with respect to the amount of memory for storing the visual words.

The above advantage and problem are most prominent in the extreme case, that is, when individual local descriptors obtained from an image for constructing a model are directly used as visual words. For example, about two thousand local descriptors are extracted from a general VGA-size image. Therefore, when one hundred thousand VGA-size images are used for constructing a model, the number of visual words is two hundred million, and an enormous amount of calculation resources is needed for matching and storage. Meanwhile, when a large number of local descriptors are used for a model, highly accurate recognition can be realized.

One solution to the problem of processing time is to introduce "approximate nearest neighbor searching" in matching of local descriptors (e.g., see Non-Patent Literature 5 and Patent Literature 1). It is known that, for example, when a recognition task of the above magnitude is to be performed, "approximate nearest neighbor searching" reduces the processing time to less than 10^-6 times the processing time taken for simply matching all local descriptors, with almost no decrease in the recognition rate. On the other hand, one solution to the problem of the amount of memory is to perform vector quantization more roughly. However, this solution is not necessarily preferable because the recognition rate decreases.

Citation List

Patent Literature

Patent Literature 1: International Publication WO2008/026414

Non-Patent Literature

Non-Patent Literature 1: D. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004
Non-Patent Literature 2: J. Ponce, M. Hebert, C. Schmid, and A. Zisserman Eds., Toward Category-Level Object Recognition, Springer, 2006
Non-Patent Literature 3: J. Sivic and A. Zisserman, "Video google: A text retrieval approach to object matching in videos", Proc. ICCV2003, vol. 2, pp. 1470-1477, 2003
Non-Patent Literature 4: D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree", Proc. CVPR2006, pp. 775-781, 2006
Non-Patent Literature 5: Kazuto Noguchi, Koichi Kise, Masakazu Iwamura, "Efficient Recognition of Objects by Cascading Approximate Nearest Neighbor Searchers", Meeting on Image Recognition and Understanding (MIRU 2007), Collection of papers, pp. 111-118, July 2007

Claims (English)

[claim1]
1. A method for constructing an image database that is used for object recognition comprising the steps of: extracting, from an image showing an object and to be stored in the image database, a plurality of local descriptors each of which is a vector representing respective local features of the image; scalar-quantizing the vector on a dimension by dimension basis of the vector; and storing into the image database the image and the corresponding scalar-quantized vectors, with (1) calculating an index value for referring to a bin of a hash table from each scalar-quantized vector by using a predetermined hash function, and (2) storing (i) the value of each scalar-quantized vector dimension and (ii) an image ID for identifying the image from which each vector is extracted into the bin referred to with use of the calculated index value as an entry; wherein each of the steps is executed by a computer and the storing step stores each vector so that, when an image showing an object in question is given as a query while a plurality of images are stored in the image database, the computer extracts a plurality of query local descriptors from the query through a similar step to the feature extraction step, quantizes each query local descriptor through a similar step to the scalar quantization step, retrieves vectors as neighbor vectors of each query local descriptor, each of which is retrieved from the vectors stored in the image database by using an algorithm of approximate nearest neighbor searching, obtains the image IDs attached to the neighbor vectors and determines at least one image(s) which shows the object in question based on the obtained image IDs; and wherein the scalar quantization step quantizes each vector dimension into a scalar number of 8 bits or less and 1 bit or more.

[claim2]
2. The method according to claim 1, wherein the scalar quantization step quantizes each vector dimension into a scalar number of 2 bits or less.

[claim3]
3. The method according to claim 1 or 2, wherein the storing step stores each vector through the step of when (i) the value of each scalar-quantized vector dimension and (ii) the image ID are stored as an entry into the bin corresponding to the vector, which has been extracted from the image to be stored in the image database, eliminating every entry stored in the same bin and preventing further entry from being stored in the bin in case where the number of the entries stored in the bin exceeds a threshold.

[claim4]
4. The method according to claim 3, wherein the storing step stores each vector so that the computer determines the image(s) through the process of retrieving the neighbor vectors, and wherein the computer calculates the index value using the quantized vector dimensions, further calculates one or more other index value(s) using one or more neighbor(s) of each quantized vector dimension, and retrieves the neighbor vectors from the vectors stored in the bins referred to with use of the calculated index values.

[claim5]
5. The method according to claim 4, wherein the algorithm of the approximate nearest neighbor searching includes process of calculating a distance between each of the query local descriptors and the vectors stored in the bins referred to with use of the calculated index values; and specifies one or more vectors that are within a predetermined distance, or a vector in the shortest distance.

[claim6]
6. The method according to claim 3, wherein the algorithm of the approximate nearest neighbor searching includes process of calculating a distance between each of the query local descriptors and the vectors stored in the bins referred to with use of the calculated index values; and specifies one or more vectors that are within a predetermined distance, or a vector in the shortest distance.

[claim7]
7. The method according to claim 2, wherein the algorithm of the approximate nearest neighbor searching includes process of calculating a distance between each of the query local descriptors and the vectors stored in the bins referred to with use of the calculated index values; and specifies one or more vectors that are within a predetermined distance, or a vector in the shortest distance.

[claim8]
8. The method according to claim 1, wherein the algorithm of the approximate nearest neighbor searching includes process of calculating a distance between each of the query local descriptors and the vectors stored in the bins referred to with use of the calculated index values; and specifies one or more vectors that are within a predetermined distance, or a vector in the shortest distance.

[claim9]
9. An apparatus for processing an image database that is used for object recognition comprising: a feature extraction section for extracting, from an image showing an object and to be stored in the image database, a plurality of local descriptors each of which is a vector representing respective local features of the image; a scalar quantization section for scalar-quantizing the vector on a dimension by dimension basis of the vector; a storing section for storing into the image database the image and the corresponding scalar-quantized vectors, with (1) calculating an index value for referring to a bin of a hash table from each scalar-quantized vector by using a predetermined hash function, and (2) storing (i) the value of each scalar-quantized vector dimension and (ii) an image ID for identifying the image from which each vector is extracted into the bin referred to with use of the calculated index value as an entry; and a retrieval section, when an image showing an object in question is given as a query while a plurality of images are stored in the image database, and after the extraction section extracts a plurality of query local descriptors from the query in a similar manner as in the image to be stored and the scalar quantization section quantizes each query local descriptor in a similar manner as in the image to be stored, for retrieving neighbor vectors for respective query local descriptor among the vectors stored in the image database using an algorithm of the approximate nearest neighbor searching, obtaining the image IDs attached to the neighbor vectors, and determining at least one image(s) which shows the object in question based on the obtained image IDs, wherein the scalar quantization step quantizes each vector dimension into a scalar number of 8 bits or less and 1 bit or more.

[claim10]
10. A non-transitory computer-readable medium, upon which is stored a program for processing an image database that is used for object recognition, the apparatus causing a computer to function as: a feature extraction section for extracting, from an image showing an object and to be stored in the image database, a plurality of local descriptors each of which is a vector representing respective local features of the image; a scalar quantization section for scalar-quantizing the vector on a dimension by dimension basis of the vector; a storing section for storing into the image database the image and the corresponding scalar-quantized vectors, with (1) calculating an index value for referring to a bin of a hash table from each scalar-quantized vector by using a predetermined hash function, and (2) storing (i) the value of each scalar-quantized vector dimension and (ii) an image ID for identifying the image from which each vector is extracted into the bin referred to with use of the calculated index value as an entry; and a retrieval section, when an image showing an object in question is given as a query while a plurality of images are stored in the image database, and after the extraction section extracts a plurality of query local descriptors from the query in a similar manner as in the image to be stored and the scalar quantization section quantizes each query local descriptor in a similar manner as in the image to be stored, for retrieving neighbor vectors for respective query local descriptor among the vectors stored in the image database using an algorithm of the approximate nearest neighbor searching, obtaining the image IDs attached to the neighbor vectors, and determining at least one image(s) which shows the object in question based on the obtained image IDs, wherein the scalar quantization step quantizes each vector dimension into a scalar number of 8 bits or less and 1 bit or more.
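The storing and retrieval steps recited in claims 1 and 3 to 5 can be pictured with the following Python sketch. It is a loose illustration under stated assumptions, not the claimed implementation: the hash function, the table size, the bin threshold, the choice of probing one dimension at a time, and all names (DescriptorIndex, index_of, recognize) are hypothetical, and the vectors are assumed to be already scalar-quantized.

```python
from collections import Counter


class DescriptorIndex:
    """Hash table of scalar-quantized vectors, each stored with its image ID."""

    def __init__(self, bits=2, table_size=1024, bin_limit=3):
        self.bits = bits              # bits per dimension (1 to 8 in claim 1)
        self.table_size = table_size  # assumed number of bins
        self.bin_limit = bin_limit    # assumed threshold of claim 3
        self.bins = {}                # index value -> list of (codes, image_id)
        self.blocked = set()          # bins that overflowed and accept no entries

    def index_of(self, codes):
        # Hypothetical hash function; the claims only say "predetermined".
        h = 0
        for c in codes:
            h = (h * 31 + c) % self.table_size
        return h

    def store(self, codes, image_id):
        idx = self.index_of(codes)
        if idx in self.blocked:
            return                    # claim 3: no further entries in this bin
        entries = self.bins.setdefault(idx, [])
        entries.append((tuple(codes), image_id))
        if len(entries) > self.bin_limit:
            del self.bins[idx]        # claim 3: eliminate every entry in the bin
            self.blocked.add(idx)

    def probe_indices(self, codes):
        # Claim 4: also hash vectors in which one dimension takes a neighboring value.
        top = (1 << self.bits) - 1
        indices = {self.index_of(codes)}
        for i, c in enumerate(codes):
            for neighbor in (c - 1, c + 1):
                if 0 <= neighbor <= top:
                    shifted = list(codes)
                    shifted[i] = neighbor
                    indices.add(self.index_of(shifted))
        return indices

    def nearest(self, codes):
        # Claim 5: compute distances only to vectors in the probed bins.
        best = None
        for idx in self.probe_indices(codes):
            for stored, image_id in self.bins.get(idx, []):
                dist = sum((a - b) ** 2 for a, b in zip(codes, stored))
                if best is None or dist < best[0]:
                    best = (dist, image_id)
        return best


def recognize(index, query_codes):
    """Vote with the image ID of each query descriptor's neighbor vector."""
    votes = Counter()
    for codes in query_codes:
        hit = index.nearest(codes)
        if hit is not None:
            votes[hit[1]] += 1
    return votes.most_common(1)[0][0] if votes else None


index = DescriptorIndex(bits=2)
index.store([0, 1, 2, 3], "cup")
index.store([3, 3, 0, 0], "cup")
index.store([1, 1, 1, 1], "book")
queries = [[0, 1, 2, 2], [3, 2, 0, 0], [1, 1, 1, 1]]
print(recognize(index, queries))
```

Emptying and blocking an overfull bin drops descriptors shared by many images, which contribute little to telling images apart, and it bounds the number of distance calculations per probed bin.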