Solr is being used to search through a database of user-generated listings. These listings are imported into Solr from MySQL via the DataImportHandler.

Problem: Quite often, users repost the same listing, sometimes with minor changes to the post so that it is not easily detected as a duplicate.

How should I implement near-duplicate detection with Solr? I do not mind having near-duplicate listings in the Solr index, as long as the search results do not contain them.

I guess there are 4 possible places to do this near-duplicate detection

In a pinch, you may use Solr's "more like this" feature to flag listings for reviewers, possibly at step 3 of your workflow. It returns a coefficient that is 1.0 for a perfect match.
– aitchnyu Oct 7 '12 at 18:10
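To make the "more like this" suggestion a bit more concrete, here is a rough sketch that only builds the request URL for Solr's MoreLikeThis search component. The `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf` and `mlt.count` parameters are standard Solr parameters; the host, the core name `listings` and the field names `title` and `description` are assumptions for illustration and would need to match your own schema.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; replace host and core name with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/listings/select"


def more_like_this_url(listing_id, fields=("title", "description"), count=5):
    """Build a /select URL asking Solr for documents similar to one listing."""
    params = {
        "q": f"id:{listing_id}",      # the listing to compare against
        "mlt": "true",                # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields used to judge similarity (assumed names)
        "mlt.mintf": 1,               # minimum term frequency in the source document
        "mlt.mindf": 1,               # minimum document frequency in the index
        "mlt.count": count,           # number of similar documents to return
        "fl": "id,score",             # return only the id and the score
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(more_like_this_url(42))
```

How high a score has to be before a listing is flagged for review is something you would have to calibrate on your own data.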

As you can see, a result of 1 means that the two sentences are identical OR that they use the same words in a different order.
The smaller the value, the more unique the "sentence" is. This is a rather simple implementation. You could set a limit, for example 0.4, put the "request" in a queue if it passes this limit, and then review the listing manually. This is not efficient, but it gives you the idea; it is up to you to develop a more complex and automated system or algorithm. You may also want to take a look here.
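A minimal sketch of this idea, assuming the score is a Jaccard similarity over word sets (the measure named in the comments below). The function names and the tokenizer are made up for illustration, and 0.4 is just the example limit from above, not a recommended value.

```python
import re

REVIEW_THRESHOLD = 0.4  # example limit from the answer; tune it on real listings


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: |A & B| / |A | B| over their word sets."""
    set_a, set_b = tokenize(a), tokenize(b)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def needs_review(new_listing, existing_listings, threshold=REVIEW_THRESHOLD):
    """Return the existing listings whose similarity reaches the threshold."""
    return [old for old in existing_listings
            if jaccard(new_listing, old) >= threshold]


if __name__ == "__main__":
    print(jaccard("red bike for sale", "for sale red bike"))
    print(needs_review("red bike for sale",
                       ["red bike for sale cheap", "blue car"]))
```

Because word order is ignored, two listings with the same words in a different order score 1, which matches the behaviour described above.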

Thank you, this is an eye-opener for me. If I want something like Craigslist's near-duplicate detection, do you think running every listing in the table (200,000 rows) through an algorithm like Jaccard's would be a serious bottleneck with little processing power, say only 4 cores on the server?
– Nyxynyx Oct 7 '12 at 11:53

Well, that's the problem: I tried some lorem ipsum text (5 sentences per variable) in a for loop 200,000 times, and it took about 20 seconds on a quad-core 2.67 GHz machine (and that is without connecting to the database, which would make the process slower).
– HamZa Oct 7 '12 at 12:09

I looked around a little and found SpotSigs and LSH (Locality Sensitive Hashing) for near-duplicate detection. I can't find any PHP libraries that implement them. I wonder how Craigslist does it.
– Nyxynyx Oct 7 '12 at 12:13
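For reference, here is a rough sketch of what MinHash with LSH banding looks like, in Python rather than PHP. It is not how Craigslist or SpotSigs work, and the parameters (64 hashes, 16 bands of 4 rows, 3-word shingles) and function names are assumptions chosen for illustration. The point is that each listing is hashed once and only listings landing in the same bucket are compared, instead of comparing every listing with every other one.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 64               # signature length (assumed value)
BANDS = 16                    # number of LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS    # rows per band: 4


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token for a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash(text):
    """MinHash signature: per seed, the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(seeded_hash(seed, s) for s in sh)
                 for seed in range(NUM_HASHES))


def candidate_pairs(listings):
    """Map {id: text} to the set of id pairs that share at least one band."""
    buckets = defaultdict(set)
    for listing_id, text in listings.items():
        signature = minhash(text)
        for band in range(BANDS):
            key = (band, signature[band * ROWS:(band + 1) * ROWS])
            buckets[key].add(listing_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    sample = {
        1: "Red mountain bike for sale in good condition",
        2: "red mountain bike, for sale in good condition!",
        3: "Looking for a quiet two bedroom apartment downtown near the park",
    }
    print(candidate_pairs(sample))
```

The candidate pairs are only suspects; you would still confirm each one with an exact similarity measure such as Jaccard before hiding or flagging a listing. Calling MD5 64 times per shingle is slow, so a real implementation would use cheaper hash functions.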