I'd like to run a 'fuzzy hash' (a.k.a. 'fingerprint') algorithm over the images in our web archives, so that I can detect similar images across many millions of files. We already do this for text, using ssdeep, which has the distinct advantage of producing a plain-text string that will be similar for similar texts. This is important because it means I can just index the hash string in Solr and use the built-in support for n-grams or fuzzy searches to find related texts.

For images, I should clarify the kind of image similarity I mean. I'm interested in hearing about any work done in this area, but I'm particularly interested in trying to find very closely related images that may have been derived from one another. For example, rescaled images, cropped images, or images with minor edits/modifications.

I'm aware of some image similarity and detection methods, but they don't always work in this way, and sometimes the fingerprint is hidden internally and it's not clear how to get at it.

For example, matchbox looks great, but as far as I can tell, a pure SIFT-based feature comparison of that ilk means that the query feature set has to be compared with the feature set of every other image, one by one. i.e. the query time scales linearly: a single query against two thousand images takes a thousand times longer than a query against two images (i.e. O(N)), and therefore any attempt to cluster similar images takes much longer still (i.e. O(N^2)).

So, before spending a large amount of time attempting to reverse engineer what those packages are up to, I'd like to know if anyone out there has had any luck deploying image similarity algorithms over hundreds of millions of images. Are there other image fingerprinting systems I should look at?

One idea I was curious about is what the results might be like if you stored pHash values and implemented a Solr comparison function so you could rank results using pHash's ph_crosscorr function. That said, LIRE looks like it would be both a lot of work and possibly quite rewarding to integrate into Solr.
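To make the general idea concrete: a perceptual hash compresses an image into a few bits, so comparing two images becomes a cheap bit-level operation instead of feature matching. The sketch below is not pHash's DCT algorithm or ph_crosscorr; it's a dependency-free illustration using average hash (aHash), a simpler relative, with Hamming distance as the similarity score. It assumes images are already decoded to 2-D lists of grayscale values.

```python
# Illustration only: average hash (aHash), a simpler relative of pHash.
# Assumes "gray" is a 2-D list of grayscale values (0-255).

def average_hash(gray, size=8):
    """Downscale to size x size by block-averaging, then emit one bit
    per cell: 1 if the cell is brighter than the overall mean, else 0."""
    h, w = len(gray), len(gray[0])
    cells = []
    for by in range(size):
        for bx in range(size):
            ys = range(by * h // size, (by + 1) * h // size)
            xs = range(bx * w // size, (bx + 1) * w // size)
            block = [gray[y][x] for y in ys for x in xs]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if c > mean else 0 for c in cells]

def hamming(a, b):
    """Number of differing bits; lower means more similar."""
    return sum(x != y for x, y in zip(a, b))

# A bright-to-dark horizontal gradient vs. a slightly brightened copy:
img = [[255 - 4 * x for x in range(64)] for _ in range(64)]
tweaked = [[min(255, v + 10) for v in row] for row in img]
assert hamming(average_hash(img), average_hash(tweaked)) == 0
```

Because the fingerprint is just 64 bits, each comparison is O(1), which is what makes ranking against a large index feasible at all; the trade-off, as noted below, is that such coarse hashes often aren't discriminative enough.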

In general, though, I think John's advice about using an external service makes a lot of sense. Whether that's a commercial service, an open-source tool, or an in-house app, it seems like it'd be a good idea to figure out how you might merge results from multiple sources, since I suspect that's going to be a necessary evil for a while.

@anjackson: Yep! The liresolr plugin I linked you to on Twitter does *very* rough similarity, mostly based upon color usage. If you look at the results on their demo site (http://demo-itec.uni-klu.ac.at/liredemo/), it's really dicey, and certainly not a duplicate finder by any stretch (it's going to have problems with color changes, cropping, etc.). This might be, at most, a "nice to have" - along the lines of what the Cooper Hewitt guys are doing: http://labs.cooperhewitt.org/2013/giv-do/

@acdha: Generally speaking I've been very unimpressed with the quality of match that pHash provides. I did what you mentioned, loading pHashes into a DB and querying the results but they weren't very good so I quickly abandoned it. There simply isn't enough data being extracted from the pHash function to come up with a meaningful match, unfortunately!

2 Answers

+6 votes

Best answer

This is an incredibly complicated issue (as you're discovering). To start, what you're attempting to achieve (finding some sort of universal hash that'll be reliably query-able in Solr) is probably not going to happen. Unless the images you're working with are highly similar to each other, all properly cropped and color-balanced and of similar subject matter, any sort of fingerprinting is going to be close to a toss-up. I've done this before, generating a pHash for each image and feeding them into a Solr instance for querying, and it seemed to return pretty much random results.

It sounds like what you want is not so much a "similarity" algorithm as a "de-duplication" algorithm: finding images that depict exactly the same subject matter. Additionally, you want to do this at scale (millions of images!). There are two tools that get close to this, and I've used both.

First, imgSeek might be able to work for you, to some degree. It's open source and free. It analyzes the images, stores them in its own index, and provides you with a query interface. This will not be integrated with your Solr instance, so you'll have to do two separate queries. Personally I found that it scaled up well to around 200,000+ images and was still able to return matches relatively fast. However -- and here's the massive caveat -- it only returns "similar" images, not "duplicate" images. Thus if you query an image it'll literally return a never-ending list of all the possible other matches sorted by a match %. Because of this I had to cut off the results at a certain point (pick a percentage and not show anything beneath it). This is what I used for http://ukiyo-e.org/ initially, and the feedback that I got from users was that it was interesting but not good enough. It wasn't able to handle most changes in the cropping of images, changes in color (black-and-white vs. color, especially), changes in orientation, etc. Additionally, the match % is very ambiguous, and no matter what % I picked I always had false positives and false negatives.

I eventually moved on to using TinEye's MatchEngine service (a commercial service): https://services.tineye.com/MatchEngine In contrast with imgSeek, this is a de-duplication service and is able to find the exact same subject matter even after transformation. It's *extremely* good. I'm the first person to be hesitant about using black-box commercial services, but this solves precisely the problem that you're trying to solve (especially at scale: it can handle millions of images, and I'm not sure that imgSeek can).

I've been using MatchEngine with the http://ukiyo-e.org/ site that I built but I've also been using it with the photo archives of art research libraries, to find similar records and correct mistakes. I've written a paper about this research here:

I also go into great detail on how exactly the matching works, what it's capable of, what its shortcomings are, and exactly how effective it is. I've just received a grant from the Kress Foundation to support this work further: developing a series of open source tools for others to use in doing this sort of image analysis on photo archives (and other image databases). It'll be using the MatchEngine technology (largely because it's the best that I've found).

The TinEye company has been extremely receptive to my work. Most of their clients are not in the humanities, they're doing things like enforcing trademarks. They've been extremely interested in my research and have graciously provided me with free access to the tools. They're also extremely interested in helping to support other interesting humanities projects. If you're interested and want me to put you in touch I can connect you. Just email me at: jeresig@gmail.com

For the analysis of the Geocities torrent released by the Archive Team, I have managed to create a very primitive but effective approach to image similarity search that works on the basis of fulltext indexing.

The main goal was to find related versions of classic graphics that web users copied and modified to include in their home pages. With this it is possible to find, for example, animated versions of previously non-animated GIFs (or the other way round), track different sizes, flood-fill color changes, etc. It works well for graphical images, not so well for photographs, and will not handle object recognition or cropping.

While I do not know what kind of images you are trying to tackle, my goal was to find those that were used like an alphabet. Think "under construction" signs with modified blinking rates, a rotating globe GIF that would also turn the other way round, an animation that was scaled, a cartoon outline that was slightly changed, a static image that was later animated, ... how 1990s web culture worked.

It is not very fancy, because all of this has to work quickly on a home computer, but it has enabled a lot of research.

What I did was to explode every frame of an image file, scale these frames to a standard size, reduce the bit depth per pixel, and convert the result to a string (in the case of animation, several strings) that is attached to the image as metadata. Other important metadata is the pixel dimensions, which can be used for searching as-is or converted to a rounded edge-length ratio.

In the case of Geocities, resizing to 4x4 pixels and reducing to 3bpp (= 8 possible colors or gray scales) yielded very good results. Since fulltext search works best with a reduced set of words, this did the trick quite well.

An example fulltext search 'document' for an image could look like this:
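As a hypothetical illustration of the pipeline described above (the original used ImageMagick; this is a pure-Python sketch that assumes a frame already decoded into rows of (r, g, b) tuples, and the token naming scheme is invented for the example):

```python
# Hypothetical sketch: scale a frame to 4x4, reduce to 3 bpp (one bit
# per RGB channel = 8 possible colors), and emit one "word" per pixel
# so the result can be indexed by a fulltext engine.
# Assumes "frame" is a list of rows of (r, g, b) tuples; the
# "pXY_cN" token scheme is an invention for illustration.

def fingerprint(frame, size=4):
    h, w = len(frame), len(frame[0])
    words = []
    for by in range(size):
        for bx in range(size):
            # Average the source block that maps onto this output pixel.
            ys = range(by * h // size, (by + 1) * h // size)
            xs = range(bx * w // size, (bx + 1) * w // size)
            block = [frame[y][x] for y in ys for x in xs]
            r = sum(p[0] for p in block) // len(block)
            g = sum(p[1] for p in block) // len(block)
            b = sum(p[2] for p in block) // len(block)
            # 3 bpp: keep only the high bit of each channel.
            color = ((r >> 7) << 2) | ((g >> 7) << 1) | (b >> 7)
            words.append("p%d%d_c%d" % (bx, by, color))
    return " ".join(words)

# A solid red 40x40 frame:
frame = [[(200, 30, 30)] * 40 for _ in range(40)]
print(fingerprint(frame))
# -> "p00_c4 p10_c4 p20_c4 p30_c4 p01_c4 ..." (16 words, all color 4)
```

Because every frame reduces to 16 such words, an ordinary fulltext query over these tokens finds frames with mostly matching coarse colors, which is exactly the "reduced set of words" property mentioned above.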

I did this with a bit of ImageMagick (which crashes every few 100,000 images) for a few million image files. It integrates well with my existing way of categorizing files, didn't require a new technology stack, and kept the research inside one database, manageable by two people :)

That's a really interesting technique, thanks for posting it @despens! I suspect that this is going to work best for images that haven't been cropped or resized in weird ways (or have had dramatic changes in color). Although if you're looking for virtually identical matches (as in your case -- I love the demo with all the near-duplicate gifs!) then this will work well. I suspect that @anjackson is aspiring towards something more complicated, but this may be "good enough" to at least show almost exactly duplicate images.

Hey John, I tweaked the 'fuzziness' of the search by trying different proxy-image pixel sizes and bit depths. It can be made fuzzier by, for example, disregarding image size or edge-length ratio, or by using 3x3 or 6x6 pixels ... 4x4 at 3bpp appeared ideal for the case at hand. I ditched the grayscale version because it didn't help for this task, but found that it worked very well with photos.

Indeed, 'good enough' for some cases, almost ideal for my case. So use with caution :)