According to a TechCrunch interview with Google engineering director Radhika Malpani, the approach is based on indexing visual similarity, perhaps along the lines of this well-reported WWW 2008 paper co-authored by Googler Shumeet Baluja.

The feature is neat, and I think similarity search is a great fit for image search. Regular readers may have read previous posts here about Modista, a startup specializing in exploratory visual search.

Still, I do have three criticisms. First, I find that many searches don’t return enough diversity to make similarity search helpful, e.g., blackberry returns no images of the fruit. Second, the images returned aren’t organized, which seems like a lost opportunity if Google knows enough to cluster them based on pairwise visual similarity. Third, similarity is too fine-grained: I find that the results are often near-duplicates of the starting image.

Nonetheless, this is a solid launch, and I’m delighted to see Google do anything related to exploratory search.


First, I find that many searches don’t return enough diversity to make similarity search helpful

This is related to the core of my discussion/concern the other week with Peter Norvig, about having algorithms that rely only on “big data” rather than on intelligent algorithmics. My ongoing position is that you are seeing this sort of non-diversity in the result set exactly because their underlying approach relies on “big data”, find-the-best-answer methods.

I think the big data approach works for finding home pages (navigational web search), for spelling correction, and for canonical image finding. The evidence suggests to me that it does not work for diversity-generating searches.

Indeed, I would think Google would at least hide near-duplicates the way it aggregates similar news stories at Google News. Whether the news stories are interchangeable is another discussion, but here we’re talking about images that they can objectively compare on visual similarity. And the freed-up space could then be used to increase diversity.

Did you see that on the standard image search, there’s now this ‘color’ drop-down that lets you specify a few basic colors that you want the pictures to be? It’s fun searching for something like ‘oranges’ and then choosing that you want the picture to be mainly purple: I get a woman with a pencil up her nose. Anyway, I didn’t see anything official about this, but it’s great to see them adding to the ‘photo/drawing/etc.’ drop-down that they put up some time ago.

They’re probably not removing duplicates because that requires a different technique. Image similarity and image fingerprinting are two different families of algorithms and would require different supporting application code.

As for removing duplicates requiring a different technique, I’d love to hear more detail on that. Aren’t they able to compute pairwise similarity? Or is the problem that index lookup on the fingerprint is cheap, while similarity computations are comparatively expensive?

I don’t think there’s anything inherently more difficult about the fingerprinting side. I think the index lookups for both would be comparable, although it’d depend on their implementation.

I suspect the reason they haven’t done it is either a lack of inclination, or a lack of time to get the code production-ready.
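[To make the fingerprinting idea concrete, here's a toy sketch in the spirit of a "difference hash": reduce each image to a short bit string, so that near-duplicates can be flagged with a cheap bit comparison rather than an expensive pairwise similarity computation. This is an illustration of the general family, not what Google (or any production system) actually does.]

```python
def dhash(pixels):
    """Fingerprint a tiny grayscale image (a list of rows) by comparing
    each pixel to its right neighbour; returns a bit string."""
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append("1" if left > right else "0")
    return "".join(bits)

def hamming(a, b):
    """Number of differing bits between two equal-length fingerprints."""
    return sum(x != y for x, y in zip(a, b))

# A 4x4 "image" and a slightly brightened copy: brightening preserves the
# relative ordering of neighbouring pixels, so the fingerprints agree and
# the pair can be flagged as near-duplicates with one cheap comparison.
img = [[10, 20, 30, 40],
       [40, 30, 20, 10],
       [15, 15, 35, 35],
       [50, 10, 50, 10]]
brighter = [[p + 5 for p in row] for row in img]

print(hamming(dhash(img), dhash(brighter)))  # 0 -- identical fingerprints
```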

I work for Pixsta so am quite familiar with the similarity algorithms (we just launched Empora.com which makes heavy use of visual similarity). I only know about the fingerprinting stuff via academic interest.

To give you a bit of background, most techniques in that area are based around finding repeatable keypoints in an image, then finding ways to encode the notable properties of those keypoints. Then when you query with a different version of the same image you get the same set of keypoints with the same set of properties (within a degree of error). The SIFT algorithm is a good place to start: http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
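[A toy illustration of the keypoint idea described above, not SIFT itself: detect repeatable interest points (here, simply pixels brighter than their four neighbours), record a crude "descriptor" for each, and score two images by how many descriptors agree within a tolerance. Real SIFT descriptors encode local gradient structure and are scale- and rotation-invariant; everything here is simplified for illustration.]

```python
def keypoints(img):
    """Return (row, col, value) for interior pixels brighter than all
    four of their neighbours -- a crude stand-in for interest points."""
    pts = []
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            v = img[r][c]
            if v > max(img[r-1][c], img[r+1][c], img[r][c-1], img[r][c+1]):
                pts.append((r, c, v))
    return pts

def match_score(kps_a, kps_b, tol=10):
    """Fraction of A's keypoints whose 'descriptor' (here just the pixel
    value) matches some keypoint in B within the tolerance."""
    if not kps_a:
        return 0.0
    hits = sum(1 for (_, _, va) in kps_a
               if any(abs(va - vb) <= tol for (_, _, vb) in kps_b))
    return hits / len(kps_a)

# The same scene with a small uniform brightness change still yields the
# same keypoints with nearly the same descriptors -- a perfect match.
img = [[0,  0, 0,  0, 0],
       [0, 90, 0,  0, 0],
       [0,  0, 0, 80, 0],
       [0,  0, 0,  0, 0]]
noisy = [[p + 3 for p in row] for row in img]

print(match_score(keypoints(img), keypoints(noisy)))  # 1.0
```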

Apps that use SIFT-like algorithms include SnapTell (iPhone) and TinEye (web and iPhone).

Well, if it’s easy, then I suspect they’ll get around to it. I was thinking it might be cheap to get a set of images within a fairly small distance of a target, but expensive to do a lot of more precise comparisons. I’m not particularly familiar with image search techniques, so I’m speculating by analogy to text search, where working through an inverted index is generally cheaper than doing run-time document analysis.
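[The text-search analogy above can be sketched directly. Assume each image's features are quantized into discrete "visual words"; then candidate retrieval is a cheap inverted-index lookup, and expensive precise comparisons can be deferred to a small candidate set. The corpus, filenames, and bucket size below are all made up for illustration.]

```python
from collections import defaultdict

def visual_words(features, bucket=25):
    """Quantize raw feature values into coarse buckets ('visual words')."""
    return {f // bucket for f in features}

# Toy corpus: image id -> raw feature values (hypothetical, not real descriptors).
corpus = {
    "beach.jpg":  [10, 30, 55, 80],
    "sunset.jpg": [12, 33, 81, 99],
    "cat.jpg":    [140, 160, 180, 200],
}

# Build the inverted index: visual word -> set of images containing it.
index = defaultdict(set)
for img_id, feats in corpus.items():
    for w in visual_words(feats):
        index[w].add(img_id)

def candidates(features):
    """Cheap lookup: union of posting lists for the query's visual words.
    A run-time similarity computation would only run on this small set."""
    hits = set()
    for w in visual_words(features):
        hits |= index[w]
    return hits

print(sorted(candidates([11, 31, 82])))  # ['beach.jpg', 'sunset.jpg']
```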

BTW, feel free to contribute a guest post on the subject. I’m sure I’m not the only person here who would be interested!

To give you a bit of background, most techniques in that area are based around finding repeatable keypoints in an image, then finding ways to encode the notable properties of those keypoints. Then when you query with a different version of the same image you get the same set of keypoints with the same set of properties (within a degree of error).

FWIW, this is the exact same type of technique that is used in most very early stage music information retrieval systems as well. For example, that is how Shazam does its iPhone app, where you hold the phone up to a song that is playing in a crowded bar, and the app texts back to you the name and artist. The algorithm looks for specific, exact markers in the song, things that will still be there despite all the noise from the bar in the background, and uses them to do exact match retrieval.
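[A toy sketch of the landmark-hashing family that Shazam's published approach belongs to: hash pairs of spectral peaks as (freq1, freq2, time-delta) triples, so a noisy capture still matches as long as some exact peak pairs survive. The peak lists below are invented; a real system would extract them from a spectrogram.]

```python
def landmarks(peaks, fanout=3):
    """Hash each spectral peak with its next few peaks in time:
    each landmark is a (freq1, freq2, time-delta) triple."""
    peaks = sorted(peaks)  # (time, freq) tuples
    out = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fanout]:
            out.add((f1, f2, t2 - t1))
    return out

# A "song" and a noisy bar capture: some peaks are lost to noise and one
# spurious peak appears, but several exact landmarks survive intact.
song    = [(0, 440), (1, 660), (2, 880), (3, 440), (4, 220)]
capture = [(0, 440), (1, 660), (2, 880), (2, 990)]

shared = landmarks(song) & landmarks(capture)
print(len(shared) > 0)  # True -- enough exact landmarks survive to match
```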

But in both cases, image or music, it’s still just exact match retrieval. More interesting to me are methods that do true perceptual similarity, and really try to determine how similar two images are not based on subsets of exact matches, but based on sets of non-exact matches.

For example, applied to music, these methods only allow you to find the exact same song that you query for, i.e., the same version by the same artist. Often you can’t even retrieve the live version using the studio version of the song, by the same exact band, because all the markers are different. But what I am more interested in is finding the reggae version of, let’s say, Anarchy in the UK, by a band other than the Sex Pistols. Or the punk version of “Somewhere Over the Rainbow” by someone other than Judy Garland.

That is where these methods fail for music and, by analogy, for images.

@Daniel, the analogy to text search is completely valid. The only question is what your “words” mean. For example, you could encode mathematical properties of an image’s shape or colours and index those (similarity), or use a machine learning algorithm to recognise particular types of objects and index the actual name of the object type (classification), or index encoded visual keypoints (fingerprinting). BTW, always interested in writing. Drop me an email.

@jeremy, spot on. The only thing I’d add would be that although the techniques are very similar people have different uses for sounds and images. I think that over the next year or two apps like Empora will demonstrate that there’s value in these image similarity methods. I might be biased though 🙂

@Richard Marr:
Apps that use SIFT-like algorithms include SnapTell (iPhone) and TinEye (web and iPhone).

SnapTell are certainly using a keypoint method similar to SIFT, but it’s interesting to hear you say that TinEye are too. I spent an hour or two trying to reverse engineer what TinEye are doing, and from the behaviour of the searches I came to the conclusion it was probably a different kind of technique. That said, I was demoing my own engine yesterday and some of the things it matched surprised me – so perhaps my reverse engineering guess is not so reliable.

@jeremy
If you haven’t already seen it, there is a nice example of work involving perceptual similarity from Alyosha Efros’ group at CMU. That work is very much of the simple-method-big-data school, and is rather convincing. Data certainly helps.

I’ve also been arguing against the “data messiah” for a while though. I find the limits of the approach more interesting than the success.

Yes, isn’t that Hays and Efros paper the same one that Norvig quotes?

Here’s the problem I have with the method: it works really well if your task is to fill in image spaces with generic, though semantically consistent, backgrounds.

What if, however, the background should be filled with something that you know really is supposed to be there? For example, suppose you have a picture of Fred and Ina, taken in their kitchen in Wilford, Idaho. You want to remove Fred from the picture. But behind Fred is that vase that they picked up on their trip to France in the late 1970s. And just next to the vase is the old clock handed down through the generations from Ina’s family, from ancestors who used to be clockmakers in Switzerland. And below the clock is the old WWI photo of Ina’s Great Uncle.

Now, you want to remove Fred from the picture, but not replace him with any old generic kitchen kitsch shelving. You want to replace it with what is really there in the picture.

How does big data help? I don’t really think it does.

With smart algorithms, on the other hand, you could get the user to provide the algorithm with just 2-3 pictures of the same scene, taken from different angles, and then have the algorithm reconstruct what really is behind Fred.
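[A hedged toy of that smart-algorithm alternative: given a few photos of the same scene in which the occluder (Fred) moves between shots, a per-pixel median recovers what is really behind him, because the true background wins the vote at every pixel. Real systems would first have to align the photos geometrically; these tiny 2x2 "images" are assumed pre-aligned.]

```python
from statistics import median

# The true scene, and three shots in which "Fred" (value 255)
# blocks a different pixel each time.
background = [[10, 20], [30, 40]]
shots = [
    [[255, 20], [30, 40]],
    [[10, 255], [30, 40]],
    [[10, 20], [255, 40]],
]

# Per-pixel median across the aligned shots: at every pixel the true
# background value appears in a majority of shots, so it is recovered.
recovered = [
    [median(shot[r][c] for shot in shots) for c in range(2)]
    for r in range(2)
]
print(recovered == background)  # True
```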

So that’s my only point. Simple-method-big-data works great if you simply want a generic, albeit semantically meaningful, fill-in. But if you’re really trying to replace what is behind the removed element, I don’t see how big data helps you. As I often argue, it comes down to the task you are trying to solve.