However, you wouldn’t guess that from my latest experiment. I didn’t do any reading or studying. I didn’t try to stand on the shoulders of giants. I just plowed in and tried something. Unsurprisingly, I had mixed results.

Let me build up to the problem…

An Easier Problem: Using hashes to detect duplicates

Suppose you have a large collection of files, and you suspect there are duplicates. How do you find them?

The standard solution would be to create a hash of each file and add references to a hash table, looking for duplicates as you go.
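A minimal sketch of that approach, using SHA-256 from Python's hashlib (the helper names here are my own invention):

```python
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=1 << 16):
    """Hash a file incrementally, so large files needn't fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Group paths by content hash; any group with >1 entry is a set of duplicates."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[file_digest(p)].append(p)
    return [group for group in by_hash.values() if len(group) > 1]
```

One pass over the files, one hash-table lookup each, and the duplicates fall out.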

What hash would you use for random files?

The first thing to remember is that you need a better hash than you might think, because of the Birthday Paradox: with n files there are O(n²) potential pairwise collisions, so clashes are more likely than intuition suggests.
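The usual birthday-bound approximation makes this concrete (a sketch; the function name is mine):

```python
import math

def collision_probability(n_items, hash_bits):
    """Approximate chance that at least two of n_items random hashes
    collide: 1 - exp(-n(n-1) / 2^(bits+1))."""
    return 1 - math.exp(-n_items * (n_items - 1) / 2 ** (hash_bits + 1))
```

For example, with 100,000 files a 32-bit hash collides more often than not, while a 128-bit hash is effectively collision-free.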

SHA (or some other cryptographic hash) would probably be the gold standard, but it is computationally expensive, and its security against deliberate forgery is unnecessary here.

CRC is cheaper, but is intended for checking that files haven’t been corrupted. The Python documentation for its CRC functions explicitly warns that they are not suitable as general hash algorithms. I haven’t completely understood why.

I wonder if XOR could be used as a cheap hash. Divide the file into, say, 64-bit chunks, and XOR them together. It would be hopeless at doing MD5’s job of being unforgeable, and hopeless at doing CRC’s job of detecting missing packets, but it sounds like a cheap and reasonable hash for unknown data. I haven’t investigated further than that.
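A minimal sketch of that idea (the naming is mine, and note one obvious weakness: XOR is insensitive to the order of the chunks, so shuffled files collide):

```python
import struct

def xor_hash(data: bytes) -> int:
    """XOR the 64-bit chunks of the data together -- cheap, but weak:
    reordering chunks, or adding a duplicated pair, leaves it unchanged."""
    # Pad to a multiple of 8 bytes so the final chunk unpacks cleanly.
    data += b"\x00" * (-len(data) % 8)
    h = 0
    for (chunk,) in struct.iter_unpack("<Q", data):
        h ^= chunk
    return h
```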

Using customised hashes to detect approximate matches

So, the next trick is to use special hashes (based on knowledge of the type of data) to detect approximate matches.

I last did this in anger about two years ago, when I used the Soundex hash to look over our customer database. Soundex is a hash that attempts to make like-sounding English strings end up with the same hash value, so you can fix misspellings and the like. I successfully found several cases where customers appeared twice in our database with slight variations to the spellings of their names.
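For the curious, a simplified Soundex looks roughly like this (this sketch skips the strict special-case rules for H and W, so a few edge cases differ from the official algorithm):

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: the first letter plus three digits
    coding the following consonants, so similar-sounding names collide."""
    codes = {c: d for d, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    digits = [codes.get(c) for c in name]  # vowels (and H/W/Y here) -> None
    out = [name[0]]
    prev = digits[0]
    for d in digits[1:]:
        if d is not None and d != prev:
            out.append(str(d))
        prev = d
    return ("".join(out) + "000")[:4]
```

So “Smith” and “Smyth” hash to the same code, which is exactly the sort of collision you want.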

The Real Problem: Detecting Duplicate Photos

So, I have lots of photos, and some are duplicated on several web-sites, with no reference to the original source. I want to detect matches, so I can move them (and their associated meta-data) all to a single yet-to-be-determined destination.

But here’s the snag. Some of the photos have been made into different sizes and qualities for web-viewing. It would save me time if I could automatically detect this. I need an equivalent of Soundex that works on images, so that resized versions of an image still share the same hash.

How would you do that? If you answered “Google it, and find what the standard solution is”, you might be smarter than me. This article is about what I did instead.

Remember to consider the cost of making a mistake.

If the algorithm makes a false positive (i.e. declares two photos are the same when they are not), I will notice when I sift through the results and I will correct it. Total cost: 5 seconds effort.

If it makes a false negative (i.e. fails to realise two images are the same), it won’t be as easy to find. I will end up with a duplicate photo in the final database, or I will fail to merge meta-data. Total cost: practically nothing; slightly lower quality photo site.

So, with the stakes so low, I just waded in with my eyes shut and I created my own hash function.

Hash Function, Version 1

My first version of the hash function had two parts.

The first part was the height:width ratio. I assumed that any two photos whose height:width ratio differed by more than 2% couldn’t be rescalings of the same photo.

The second part was a 5×5 array of tuples. To produce the array, I first divided each photo into 5×5 equal-sized rectangles, then summed the Red, Green and Blue values over all the pixels in each rectangle.

PIL has a histogram function, which meant the bulk of the work was written in optimised C, so the performance was fine.

In the 5×5 array, I stored the Red:Blue and Green:Blue ratios. I figured that storing colour ratios, rather than simple brightness, was better, because I feared there might sometimes be adjustments to contrast and brightness. I hoped (without evidence) that such adjustments wouldn’t affect the colour ratios. I assumed that any two photos whose colour ratios were within 5% in all 25 boxes were probably the same photo.

If you are noticing a lot of wishful thinking here, you aren’t alone.

I wrote my own “equivalence” method, so two image hashes could be compared. If the hashes revealed compatible height:width ratios and compatible red:blue and green:blue ratios, then the hashes matched.
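Putting the two parts together, the comparison might look like this (a sketch with invented names; the per-box channel sums are assumed to come from PIL’s histogram over each of the 25 crops):

```python
def ratios_close(a, b, tol):
    """True when a and b differ by no more than tol, relatively."""
    return abs(a - b) <= tol * max(abs(a), abs(b))

class ImageHash:
    """Sketch of the Version 1 hash: aspect ratio plus a grid of
    (red:blue, green:blue) ratios, one pair per box."""
    def __init__(self, width, height, box_sums):
        # box_sums: one (red_sum, green_sum, blue_sum) tuple per box.
        # NB: a zero blue sum would divide by zero; real code needs a guard.
        self.aspect = width / height
        self.ratios = [(r / b, g / b) for r, g, b in box_sums]

    def matches(self, other, aspect_tol=0.02, colour_tol=0.05):
        """Equivalence test: aspect ratios within 2%, all colour ratios within 5%."""
        if not ratios_close(self.aspect, other.aspect, aspect_tol):
            return False
        return all(
            ratios_close(rb1, rb2, colour_tol) and ratios_close(gb1, gb2, colour_tol)
            for (rb1, gb1), (rb2, gb2) in zip(self.ratios, other.ratios))
```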

I tested the result with three images. The first was a photo of a person. The second was a photo of the same person on the same day at a different time. The third was a thumbnail version of the first image.

It correctly detected that Images 1 and 2 were different and Images 1 and 3 were the same!

Not a very thorough unit test, but it passed! Yay!

However, there was a fatal flaw with this hash function: two equivalent hashes wouldn’t be bitwise-equal. These hashes couldn’t be used in a standard hash-table, so O(n²) pairwise comparisons would be required. With thousands of photos, that would be too slow. Time for a rethink.

Hash Function, Version 2

The second version of the hash function was built on top of the first one.

The height:width ratio was discarded.

Each of the 5×5 colour-ratio tuples was numbered 0 to 24, and the numbers were sorted by the size of their corresponding ratios. (I made sure the Python sort algorithm was stable.)

The hash was now two tuples (representing the red:blue and green:blue ratios) with 25 elements – each element was in the range 0 to 24.

So, if box 3 had the highest red:blue ratio, and box 5 had the second-highest red:blue ratio, the first of the tuples would be (3, 5, … ).

Those 50 integers characterised the relative colouring of the sections of the image (or so I hoped!), and were ready for bitwise comparison with other hashes.
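In Python, this ranking step is essentially a stable argsort (a sketch; the names are mine):

```python
def rank_hash(ratios):
    """Version 2 sketch: replace each box's ratios with the box indices
    ordered from highest ratio to lowest, giving a tuple that is
    bitwise-equal for images that rank their boxes the same way.
    Python's sort is stable, so tied ratios keep their box order."""
    rb = tuple(sorted(range(len(ratios)), key=lambda i: ratios[i][0], reverse=True))
    gb = tuple(sorted(range(len(ratios)), key=lambda i: ratios[i][1], reverse=True))
    return rb + gb
```

The result is a plain tuple of integers, so it can go straight into a set or a dict as a key.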

I ran my trivial unit tests again. They failed!

Within Image 1, some boxes had colour ratios very close to other boxes. Within Image 3, during the resizing, the relative ordering wasn’t preserved – a slight increase in one ratio made it larger than another. Of the 50 integers, 48 matched, but two adjacent ones were swapped. A false negative was the result.

I pondered how to fix this; it was tricky, so I started with an initial, quick hack. I dropped the number of boxes per image from 5×5 to 4×4. A smaller number of boxes, with more samples in each box, reduced (but didn’t eliminate!) the chance of the order being changed during rescaling. It should reduce false negatives, at the cost of more false positives.

I re-ran the unit-test and it passed! Woohoo!

Time for more unit-tests? Nah! Time for the system test. I used the Image Hash to run the basic duplicates search algorithm over hundreds of photos.

I did get some false positives, but many fewer than I expected. All of the false positives were black-and-white photos. Black-and-white pictures should have exactly the same colour ratio in each of the squares, making the hash meaningless.

I also had false negatives, although they were harder to detect immediately without wading through the photos by hand. I found them by further reducing the hash to a 3×3 array, which revealed a lot of the images that the 4×4 version had missed.

Conclusion

I tried to create an effective hash that detected photos that were approximately the same apart from rescaling. I hoped it would also be agnostic to simple brightness and contrast adjustments.

The result was completely ineffective for black-and-white photos, but apart from that had surprisingly few false positives. However, it did suffer from a moderate number of false negatives.

I suppose I should go read a book on Image Processing and find out how the pros do it. How tedious, compared to reinventing the wheel…

Comments

That’s pretty impressive. Well, it’s impressive to me. Maybe my intuition is far off, the way most people’s intuition is off regarding the Birthday Paradox. But your algorithm seems very successful for how simple and accessible it is. I did a very brief bit of Googling and found people talking about wavelets and bright spots and such. Much more complicated-sounding with, y’know, Real Math™ required.

Of course, their ideas or requirements about what constitutes “duplicates” may be different than yours. And I think they handle B&W. (Some of them seem geared toward sifting through newspaper photos, for example.)

I do have some questions about your method. When you discovered that a 3×3 grid helps, does that mean it replaces the 4×4, or that it’s a second pass, after the 4×4 has done its work?

Also, were you trying to match photos with slight cropping, and perhaps rotation? (I’m guessing not.) How about photos which are different but similar, like multiple posed group photos taken seconds apart, with different people blinking or smiling? (I’m also guessing not; if you had any of these, you’d probably have already picked one photo to publish from each such set, and only that one “best” one has been subjected to the resizing.)

I wonder if you can do something equally simple to handle your B&W photos, like using brightness of the grid boxes. Or maybe finding the average brightness of the whole photo, and then using “bright” instead of red:blue and “dark” instead of green:blue. Or something.

While I am not serious enough about organizing my own photos to actually implement any kind of duplicate-finding algorithm myself (yet), I suspect what would be optimal for me is a very different and vastly less visual approach. For me, I am pretty sure the most effective routines would make use of the following facts: (1) I don’t rename my image files to descriptive names. I leave them as DSC_nnnn and add little suffixes to them. (2) I let edited versions inherit EXIF information.

Hmm. I was careful this time to make sure my comments weren’t rendered obsolete by other comments that were “quicker to the buzzer,” but instead a part of my commentary is useless based on the original post. If I could delete or edit my own comments, I would get rid of the part about filenames and EXIF data, because you did mention that the images you’ve already posted did not have any available references to the originals.

I’m not an expert at this sort of thing, but I used to do some work in the image processing field. My sense is that the pros would use Wavelets in some appropriate colorspace (HSL maybe). I recall a paper by Jacobs and Salesin from SIGGRAPH in the 90s.

[As is so often the case with OddThinking posts, I can’t help wondering whether Julian just wants to get the job done, or whether he’s more interested in the thrill of the chase. Assuming the latter for now…]

OK so here’s my idea. I am not in any way a graphics expert, so try not to laugh too much.

Basically you want to re-compress each image at a known resolution (meaning pixel dimensions) and quantization. So pick an appropriate width and height, such as that of your largest image. Also pick a quantization matrix – this time probably you’ll need to use the most aggressively compressed image.

Now re-compress all your images. You *could* probably use the hash functions you mentioned above (particularly if you use the YCbCr domain instead of RGB) but I think there’s a better way, provided you are willing to get your hands dirty with JPEG internals.

Basically at the same resolution and quantization you should be able to use the DC coefficient as a rough proxy for that macroblock. So: just add the DC coefficients for all the macroblocks to produce a hash value for the channel/image as a whole.

In fact you don’t need to re-compress the entire image to get this value – just enough to determine the DC coefficient for each macroblock. This has the benefit of being a fair bit quicker and also requiring only a single quantization value and not an entire matrix. The maths up to this point seems fairly easy actually.

This gives you three hash values, one for each channel, and hence allows you to detect certain types of other transforms. As you say, you can just compare the Cb and Cr hashes to detect a change in brightness. Alternatively you could just look at the Y hashes to detect an image that has been converted to B&W.
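Since the DC coefficient of an 8×8 DCT block is just a scaled version of that block’s mean, the commenter’s scheme can be approximated without touching JPEG internals at all (a sketch with invented names, using block means as DC proxies):

```python
def channel_hash(channel, width, height, block=8):
    """Sum a per-macroblock DC proxy (the block mean) over one channel.
    `channel` is a flat, row-major list of samples, width x height."""
    total = 0.0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            samples = [channel[y * width + x]
                       for y in range(by, min(by + block, height))
                       for x in range(bx, min(bx + block, width))]
            total += sum(samples) / len(samples)  # block mean ~ DC coefficient
    return total
```

One caveat: summing the block means collapses to (roughly) the channel’s overall average, so in practice you’d probably want to keep the per-block values rather than a single sum per channel.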

In the “just get the job done” category of solutions: have you thought about doing a hash of (key fields in) the EXIF data? Or are you concerned about removal of such data as well?

A simple histogram suffers from two problems: it is likely to be distorted by simple operations to adjust the brightness, and it doesn’t allow meaningful bit-wise comparisons – you need to check for “approximately the same” rather than “exactly the same”.

In this case, I was definitely focussed on the thrill of the hunt. However, only while it meant a pleasant horse-ride across the moors. Once the fox hid amongst the thorny brambles of wavelet theory and JPEG compression (neither of which I pretend to understand in the slightest) I got bored.

The correct thing to do now, is to head off to the open-source butcher and buy some prepared code.

No, actually, the correct thing to do is to stop torturing these poor, innocent analogies.

As for EXIF data, I don’t know if Photoshop, Facebook, PIL and PHP’s graphics library all preserve them correctly during transformations. I don’t know if VueScan populates them correctly in the first place. Maybe they all do it perfectly, but I don’t have much trust in them.