I did some testing with three documents in a document library, each with a different name but identical file content. I performed a crawl and used an Enterprise Search Center (keyword query) with a Search Core Results web part (with the Remove Duplicate Results setting checked) to perform a search.
I expected just one result when searching on a word that all three files contain, but I received all three documents as results.

I just wanted to test when an item is treated as a duplicate, i.e. what percentage of the content has to differ. Apparently I don't understand the process; can somebody explain this to me?

OK, my VM died on me, and when I tested this on my new VM the documents were displayed as duplicates. That part is solved, then. But I'm still curious what percentage of the content has to differ before a document is no longer treated as a duplicate.
– Anita Boerboom, Jun 26 '11 at 16:53

I ran into a strange and related issue today. For testing purposes I made about 130 copies of a document with unique names but the same content. I then wrote a PowerShell script that uploads the documents to a site collection and applies a unique content type and field attributes to each one. I loaded two libraries in separate site collections, for a total of over 250 documents. After reindexing/resetting my index a couple of times, I finally realized that search recognized all of them as a single document, so only a single result was returned.
– Mike Oryszak, Sep 10 '11 at 20:01

3 Answers

I will post my research as an answer (although it is not a definitive answer).

I followed the link provided by @AnitaBoerboom, and I believe it applies to the 2010 version as well.

Quote:

How is a duplicate document identified when we do a search?

Document similarity for purposes of identifying duplicates is based only on a hash of the content of the document. No file properties (e.g. file name, type, author, create and modify dates) are input to this hash. The MSSDuplicateHashes table in the SSP's search database holds, for each document, all the 64-bit hashes necessary to determine if one document is a near-duplicate of another. This is read while doing a search if duplicate collapsing is enabled.
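To illustrate the core idea of the quote, that only the bytes of the content feed the hash, here is a minimal Python sketch (not SharePoint's actual implementation): three files with different names but identical content produce the same digest, so they would collapse into one result.

```python
import hashlib

def content_hash(data: bytes) -> str:
    # Only the raw bytes are hashed; file name, author, and
    # dates never enter the digest.
    return hashlib.sha256(data).hexdigest()

body = b"The quick brown fox jumps over the lazy dog."
files = {"report-a.docx": body, "report-b.docx": body, "report-c.docx": body}

digests = {name: content_hash(data) for name, data in files.items()}
unique = set(digests.values())
print(len(files), "files,", len(unique), "unique content hash")  # 3 files, 1 unique content hash
```

Renaming a file changes nothing here, which matches the observed behavior; only editing the content produces a new hash.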

This probably explains @MikeOryszak's strange issue: he uploaded 250 documents with the same content, so to search that is just 1 document and 249 duplicates.

The 64-bit hash is something that puzzles me. This hash is computed when the document is crawled, by performing a full crawl.

Per the blog post for 2007, it uses a hash of the entire document and ignores the document properties. The tests I outlined in my comment above showed all of the documents as duplicates, even across site collections. Any documents that were changed afterwards showed up as unique documents after another crawl.