Can anyone suggest a good photo duplicate detection utility that works well when I am dealing with about 100 GB of data (collected over the years)?

I would prefer something that works on Ubuntu.

Thanks in advance!

Edit: Is there a tool that will help me reorganize my collection and remove duplicates, once they have been detected?

Edit2: The hard part is figuring out what to do once I have the output consisting of thousands of duplicate files (such as the output of fdupes).

It's not obvious whether I can still safely delete a directory (i.e. whether a directory might contain unique files), which directories are subsets of other directories, and so on. An ideal tool for this problem should be able to determine file duplication and then provide a powerful means of restructuring your files and folders. Doing a merge by hardlinking (as fslint does) does indeed free up disk space, but it does not solve the underlying problem which gave rise to the duplication in the first place -- i.e. bad file/dir organization.

6 Answers

ImageMagick to the rescue. I think the first step in any solution is to reduce the size of your collection. If you want to compare the photos by their content, especially when some are slightly modified versions of one another, a very good start is to reduce them to thumbnails and then compare the thumbnails. This is particularly helpful when you want to find almost-alike photos and want to "ignore" unimportant differences during comparison.

My suggestion is, at a high level, that you:
1. Use ImageMagick's mogrify tool to reduce the photos to thumbnails. This will take some time, but it will make the actual comparison steps much faster and more accurate.
2. Use ImageMagick's compare tool, which allows you to set a threshold for comparison, i.e. it lets you find photos that are 85% alike. You would want to run a controlled experiment to find the threshold value you like most.
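The two steps above might look something like this in a shell. The paths, thumbnail size, and metric are all illustrative, and since mogrify rewrites files in place, you would work on copies:

```shell
# 1. Make small working copies; -thumbnail also strips most metadata,
#    which speeds up everything downstream.
mkdir -p thumbs
cp photos/*.jpg thumbs/
mogrify -thumbnail 128x128 thumbs/*.jpg

# 2. Compare two thumbnails. With -metric AE (absolute error),
#    compare prints the number of differing pixels on stderr,
#    so 0 means the thumbnails are pixel-identical.
compare -metric AE thumbs/a.jpg thumbs/b.jpg null: 2>&1
```

A fuzzier metric such as RMSE, combined with a cutoff you pick experimentally, would give the "85% alike" behaviour described above.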

I really like this idea of making thumbnails first. What does it do once you have found the duplicates? Does it just display a list? I have tens of thousands of duplicates, and a nice GUI to help resolve these would be very useful.
–
Fasterz Aug 10 '12 at 16:27


Since you use Ubuntu, you automatically have access to a host of specialized tools, each solving a very specific task, such as the 2 tasks I mentioned. It's a Lego game: you can do whatever you want, you just need to put the pieces together. Technically, you feed 2 photos to the 'compare' tool and it will tell you how much one resembles the other. One way to solve your problem is to group all similar photos into folders so you can go through them to filter out false positives. Then you run 'compare' again on the false positives and repeat the process until all are in their correct places.
–
cody Aug 11 '12 at 1:06

The open source photo viewer / organizer Geeqie has a powerful Find Duplicates Feature. It can use several different strategies for finding duplicates:

File name (case sensitive or insensitive)

File size

File date

Image dimensions

MD5 checksum

Similar image content (to several thresholds)

This gives a results list which can include thumbnails so you can confirm manually.

This will probably be slow for thousands of files, but I think just using it and letting it run for a few days or whatever is probably less effort overall than finding or making something tailored for the case — unless checksum match is all you need.

That sounds nice. What does it do once you have found the duplicates? Does it just display a list? I have tens of thousands of duplicates, and a nice GUI to help resolve these would be very useful.
–
Fasterz Aug 10 '12 at 16:26

I just tried fslint on a smaller set of pictures (a few gigs or so), and it's frustrating that it just sits there and spins. No progress indicator, no estimate of time left, nothing.
–
Fasterz Aug 8 '12 at 21:39


These tools appear to look for identical files. Even images that are identical pixel for pixel can have different file contents. I'm guessing you want to match up not just byte-identical files, but also the same image in different formats and sizes, including crops and other processing you have done, so as to collect all variations of the same photo in one directory. This would be a soft comparison of images that produces a confidence match factor, and could even match up different photos of the same scene.
–
Skaperen Aug 9 '12 at 3:12

@Skaperen What you suggest is great, but do such tools exist for Ubuntu? I have seen one mentioned somewhere for Windows -- but that seemed to have a hideous interface, etc.
–
Fasterz Aug 9 '12 at 16:34

ImageDupeless is a Windows app that will catch photos that look alike but have some differences. It will catch some rotations, crops, resizes, color tint changes, watermarks, etc. You have to scan your library and tell it how much difference you accept, and it will merrily show you the files. BUT it would be extraordinarily cumbersome for hundreds of files, and thousands of files would be terrible. I too am looking for a Linux equivalent to ImageDupeless: an app that does wavelets or some other imaging magic to tell when images are similar.
–
Therealstubot Aug 9 '12 at 23:37

There are a few versions of dupeGuru (standard, music & picture editions), and the picture edition lets you find visually similar images via a bitmap blocking comparison algorithm, among other methods (like the EXIF original image timestamp, or the files being simply identical).

It has a variety of other useful features like excluded folders, support for iPhoto/Aperture libraries, and considerable customisation of how it detects duplicates and what it does with them.

What do you mean by duplicate photos? Do you mean files that are identical, say just copied an extra time or two? Or do you mean photos that "look" the same?

If you mean identical files, you can run 'shasum' on all of the files, then sort the results, find the duplicated hashes with 'uniq', and run a 'diff' to see what can be eliminated. All easy in an Ubuntu shell.
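For the record, a concrete version of that pipeline might look like this (using sha256sum and GNU uniq; the photos/ path is just an example):

```shell
# Hash every file, sort so equal hashes end up adjacent, then print
# only the lines whose hash (the first 64 hex chars) occurs more than
# once, with a blank line separating each group of duplicates.
find photos -type f -exec sha256sum {} + \
  | sort \
  | uniq -w64 --all-repeated=separate
```

The -w64 flag makes uniq compare only the 64-character digest at the start of each line, ignoring the file names that follow.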

None of this is easy or convenient. fdupes, mentioned below, will already do a better job than merely calculating SHAs. Now, are there Unix tools that will look for image similarity? If so, that would be awesome.
–
Fasterz Aug 9 '12 at 16:31

Easy and convenient for someone used to the Unix tools, which is what uniq, sort, diff, shasum, etc. are. But I agree that if you don't use them regularly, they can be hard to use. I don't know of anything that can do "looks like". Everything I've seen, including in Aperture and Lightroom, does file-is-identical, which is really just an md5 or shasum.
–
Pat Farrell Aug 9 '12 at 22:33

I regularly use Unix tools and I find this answer somewhat silly. First, doing SHA blindly is slow, when a file-size comparison resolves most cases. Second, SHA or MD5 can collide -- so SHA comparisons alone aren't enough. If you factor in both of these, you get to what fdupes does.
–
Fasterz Aug 10 '12 at 16:18
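That size-before-hash ordering is easy to sketch in a shell (GNU find, awk, and xargs assumed; the photos/ path is illustrative). Only files whose sizes collide need to be hashed at all, since a file with a unique size cannot have a duplicate:

```shell
# List "size<TAB>path" for every file, count how often each size
# occurs, and feed only the files whose size repeats to sha256sum.
find photos -type f -printf '%s\t%p\n' \
  | awk -F'\t' '{ count[$1]++; line[NR] = $0 }
      END { for (i = 1; i <= NR; i++) {
              split(line[i], f, "\t")
              if (count[f[1]] > 1) print f[2]
            } }' \
  | xargs -r -d '\n' sha256sum \
  | sort | uniq -w64 --all-repeated=separate
```

On a collection where most files have unique sizes, this hashes only a small fraction of the data, which is essentially the optimization fdupes performs internally.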

Also, once you have correctly conjured the incantation that does this, the output is still not very useful. At best you get the output of fdupes, which is just a dump of similar files. In my case I have tens of thousands, and it is very hard to pick through that data to see how I can eliminate the duplicates.
–
Fasterz Aug 10 '12 at 16:22
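One way to make that dump more actionable: fdupes separates each group of duplicates with a blank line, so a few lines of awk can turn its output into a reviewable "keep the first, delete the rest" plan (the file names here are illustrative):

```shell
# fdupes -r photos/ > dupes.txt   # blank-line-separated groups
# Keep the first path of each group; print the rest as candidates.
awk 'BEGIN   { first = 1 }
     NF == 0 { first = 1; next }   # blank line: a new group starts
     first   { first = 0; next }   # first file in group: keep it
             { print }             # later files: delete candidates
' dupes.txt > delete_candidates.txt

# After reviewing the list by hand:
# xargs -d '\n' rm -- < delete_candidates.txt
```

This doesn't solve the reorganization problem from the question, but it at least reduces "thousands of duplicate lines" to one reviewable list of files that are safe to remove.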


SHAs do collide in theory, but not in practice. Yes, it takes forever. Nothing that is going to work is going to be fast, but you should be able to kick it off and come back in a day or two. It's just a suggestion; I'm not going to get into a war over it.
–
Pat Farrell Aug 11 '12 at 0:00

What does it do once you have found the duplicates? Does it just display a list? I have tens of thousands of duplicates, and a nice GUI to help resolve these would be very useful.
–
Fasterz Aug 10 '12 at 16:25