2 Answers

I don't know which algorithm Google uses. But, since you wanted a best guess, let me give some ideas on how a similar system could be constructed.

The whole field dealing with searching an image base by image content is called Content Based Image Retrieval (CBIR). The idea is to, somehow, construct an image representation (not necessarily understandable by humans) that captures the information about the image content.

Two basic approaches exist:

retrieval using low-level (local) features: color, texture, and shape at specific parts of an image (the image is then represented as a collection of local feature descriptors)

semantic approaches where an image is, in some way, represented as a collection of objects and their relations

The low-level, local approach is very well researched. The best current approach extracts local features (there is a choice of feature-extraction algorithm involved here) and uses their local descriptors (again, a choice of descriptors) to compare images.

In newer works, the local descriptors are clustered first, and the clusters are then treated as visual words -- the technique is then very similar to Google document search, but using visual words instead of letter words.

You can think of visual words as equivalents to word roots in language: for example, the words work, working, and worked all belong to the same word root.

One of the drawbacks of these kinds of methods is that they usually under-perform on low-texture images.

I've already given and seen a lot of answers detailing these approaches, so I'll just refer you to those answers.

Semantic approaches are typically based on hierarchical representations of the whole image. These approaches have not yet been perfected, especially for general image types, though there has been some success in applying them to specific image domains.

As I am currently in the middle of researching these approaches, I cannot draw any conclusions yet. That said, I explained the general idea behind these techniques in this answer.

Once again, briefly: the general idea is to represent an image with a tree-shaped structure, where the leaves contain the image details and objects can be found in the nodes closer to the root of the tree. Then, somehow, you compare the subtrees to identify the objects contained in different images.

There are several references for different tree representations. I did not read all of them, and some of them use this kind of representation for segmentation instead of CBIR, but they are still relevant.

In addition to penelope's answer, there are two approaches, perceptual hashing and the bag-of-words model, whose basic functionality is easy to implement and which are therefore nice to play with or learn from before venturing into more advanced territory.

Perceptual hashing

Perceptual hashing algorithms aim to construct a hash that, unlike a cryptographic hash, will give similar or near-similar hash values for identical images that have been slightly distorted, for example by scaling or JPEG compression. They serve a useful purpose in detecting near duplicates in an image collection.

In its most basic form, you can implement this as follows:

Convert image to grayscale

Make your image zero mean

Crush your image down to thumbnail size, say [32x32]

Run the two-dimensional Discrete Cosine Transform (DCT)

Keep the top-left [8 x 8] block: the most significant, low-frequency components

Binarize the block, based on the sign of the components

The result is a resilient 64-bit hash, because it is based on the low-frequency components of the image. A variant on this theme would be to divide each image into 64 sub-blocks, compare the global image mean to each local sub-block mean, and write out a 1 or 0 accordingly.
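Here is a minimal sketch of both the DCT-based hash and the sub-block variant in Python, assuming Pillow, NumPy and SciPy are available. The function names and the 32x32/8x8 sizes simply follow the steps above; the Hamming-distance threshold mentioned at the end is illustrative, not part of any standard:

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(path):
    # Grayscale, then crush down to a 32x32 thumbnail
    img = Image.open(path).convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    # Zero-mean the image; this only shifts the DC coefficient of the DCT
    pixels -= pixels.mean()
    # Two-dimensional DCT, applied along both axes
    freq = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    # Keep the top-left 8x8 low-frequency block and binarize on the sign
    bits = (freq[:8, :8] > 0).flatten()
    return sum(1 << i for i, b in enumerate(bits) if b)

def blockhash(path):
    # The variant: compare each of 64 sub-block means to the global mean
    img = Image.open(path).convert("L").resize((32, 32), Image.LANCZOS)
    p = np.asarray(img, dtype=np.float64)
    means = p.reshape(8, 4, 8, 4).mean(axis=(1, 3))  # 8x8 grid of 4x4 blocks
    bits = (means > p.mean()).flatten()
    return sum(1 << i for i, b in enumerate(bits) if b)

def hamming(h1, h2):
    # Near-duplicates should end up within a small Hamming distance
    return bin(h1 ^ h2).count("1")
```

Comparing two images then amounts to `hamming(phash(a), phash(b))`; a distance of only a few bits out of 64 suggests a near duplicate.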

Bag-of-words model

The bag-of-words model aims to identify an image semantically, e.g. all images with dogs in them. It does this by using certain image patches in the same spirit that one would classify a text document based on the occurrence of certain words. One could categorize the words, say "dog" and "dogs", and store them as an identifier in an inverted file, where the "dog" category now points to all documents containing either "dog" or "dogs".

In its very simplest form, one can do this with images as follows:

Deploy the so-called SIFT features, for example using the excellent vlfeat library, which will detect the SIFT feature points and compute a SIFT descriptor per point. This descriptor is basically a smartly constructed template (a histogram of local gradient orientations) of the image patch surrounding that feature point. These descriptors are your raw words.

Gather SIFT descriptors for all relevant images

You now have a huge collection of SIFT descriptors. The problem is that even for near-identical images, there will be some mismatch between descriptors. You want to group the identical ones together, more or less like treating some words, such as "dog" and "dogs", as identical, and you need to compensate for errors. This is where clustering comes into play.

Take all SIFT descriptors and cluster them, for example with an algorithm like k-means. This will find a predetermined number of clusters, with centroids, in your descriptor data. These centroids are your new visual words.

Now, for each image and its originally found descriptors, you can look at the clusters these descriptors were assigned to. From this, you know which centroids or visual words 'belong' to your image. These centroids or visual words become the new semantic descriptor of your image, which can be stored in an inverted file.
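To make the enrollment phase concrete, here is a hedged sketch in Python. Since vlfeat is a MATLAB/C library, OpenCV's SIFT and scikit-learn's k-means stand in for it here; the image paths and vocabulary size are placeholders, not recommendations, and error handling (e.g. images with no keypoints) is skipped:

```python
import cv2
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)  # one 128-d descriptor per keypoint
    return desc

image_paths = ["img1.jpg", "img2.jpg"]  # placeholder collection

# Gather SIFT descriptors for all relevant images
all_desc = np.vstack([sift_descriptors(p) for p in image_paths])

# Cluster them; the k centroids are the visual vocabulary
k = 1000  # vocabulary size, a tuning parameter
vocabulary = KMeans(n_clusters=k).fit(all_desc)

# Inverted file: visual word -> set of images containing that word
inverted = defaultdict(set)
for path in image_paths:
    for word in set(vocabulary.predict(sift_descriptors(path))):
        inverted[word].add(path)
```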

An image query, e.g. "find me images similar to this query image", is then resolved as follows:

Find the SIFT points and their descriptors in the query image

Assign the query descriptors to the centroids you found earlier in the enrollment phase. You now have a set of centroids or visual words that pertain to your query image, which you can match against the inverted file to retrieve the images sharing the most visual words.
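Continuing the sketch above, the query step would then look up the inverted file and rank candidate images by how many visual words they share with the query. This uses plain vote counting for simplicity; real systems typically weight the words, e.g. with tf-idf, just as in text retrieval:

```python
from collections import Counter

def query(path, top=5):
    # Map the query image's descriptors onto the enrolled visual words
    words = set(vocabulary.predict(sift_descriptors(path)))
    votes = Counter()
    for w in words:
        for img in inverted[w]:  # every enrolled image sharing this word
            votes[img] += 1
    return votes.most_common(top)  # images sharing the most visual words
```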

Your bag-of-words approach is basically what my links for the "local approach" lead to :) Although it is not really semantic in nature: you would never represent a single dog with one feature, nor would it be that easy to identify different dog species as dogs. But perceptual hashing is nice, I didn't know about that one. The explanations are nice. Which got me thinking... would you have any suggestions on how to apply that technique to a non-rectangular area? Or maybe provide some references to articles; I could read up a little and, if the question makes sense, open it up as a separate question.
– penelope, Nov 15 '12 at 21:12


@penelope I actually read an article, years ago, where the authors split up an image into arbitrary triangles. And there is the trace transform, which has also been used as a basis for a perceptual hash. I'll get back to you.
– Maurits, Nov 15 '12 at 22:35

Everything I want to ask you about this is well beyond the scope of this question, so I opened a new one. Any more info/references about the basic technique would still be great as well, either in this answer or that one. Looking forward :)
– penelope, Nov 15 '12 at 23:16