I have a table in which I'd like a column to have no duplicate data. I'm familiar with the old standard of using HAVING and GROUP BY to identify exact matches, and the actual hard UNIQUE constraint for future data. That's proving not to be good enough, though.

What I'd like is a query/algorithm to identify extremely similar data (i.e. if only 2-3 characters are off in a 200-character string, I need to kill one of those rows) so that I can scrub up my data a bit better. It's making my brain spin. Any hints?

The Levenshtein functions are installed and seem to be doing what they should, so it looks like this will work, thanks again. For anyone who ends up following after: phpMyAdmin doesn't seem to play nice with stored functions. Here's exactly what I did on the command line to make this work right (max string size is adjusted to 400 characters).

This returns a percentage of how similar the two are, and is working perfectly. From the looks of it, I'll probably look to combine those that have a ratio of 80 or higher once I've gathered the data.
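The ratio described here is typically the edit distance normalized by the longer string's length, scaled to a percentage. A sketch of that calculation in Python (for illustration only; the thread itself uses a MySQL stored function):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    """Percentage similarity: 100 means identical, 0 means nothing shared."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))
```

With a ratio like this, the "combine at 80 or higher" rule is just a threshold check on the returned percentage.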

Now, is there a way to realistically check through 100k rows using purely SQL and the Levenshtein functions? I'd be eternally grateful to anyone smart enough to help me dodge the slow polling of the very PHP-heavy, multi-query cron I'm burdened with otherwise.

ScallioXTX
—
2011-03-01T22:17:53Z —
#6

What you could try is ordering all strings by length and only comparing each string to its peers; say, strings that are between 5 characters shorter and 5 characters longer than the one you're comparing to. The likelihood of two strings looking similar gets smaller as the difference between their lengths increases. Not sure if it helps since it's highly dependent on the data at hand, but it's worth a try, methinks.
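That pruning idea can be sketched as follows (Python for illustration; the ±5 window is the suggestion above and would need tuning for real data). It relies on the fact that the edit distance between two strings is at least the difference in their lengths:

```python
def candidate_pairs(strings, window=5):
    """Sort by length, then pair each string only with strings whose
    length differs by at most `window` characters."""
    ordered = sorted(strings, key=len)
    pairs = []
    for i, s in enumerate(ordered):
        for t in ordered[i + 1:]:
            if len(t) - len(s) > window:
                break  # sorted by length, so every later string is even longer
            pairs.append((s, t))
    return pairs
```

Only the surviving pairs would then be fed to the expensive Levenshtein comparison.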

qwidjib0
—
2011-03-01T23:12:25Z —
#7

That helps drastically, actually, though I'm still looking at 486 days of polling to get through all the data. The algorithm currently goes:

1. SELECT current article id, article (1 query)
2. SELECT all other articles having text length within a few characters (1 query)
3. COMPARE LEVENSHTEIN_RATIO, store duplicates if found in the DB (10k-40k queries, down from 100k+ before reducing results using string length)
4. STORE next article ID in the DB to try in another 5 minutes, starting the process over (1 query)
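One iteration of that loop can be sketched in Python against an in-memory dict standing in for the table (ids, the 80% threshold, and the length window are illustrative; `difflib`'s ratio is used here as a stand-in for the SQL Levenshtein ratio, and is not the identical metric):

```python
from difflib import SequenceMatcher

def find_duplicates_of(current_id, articles, window=5, threshold=80.0):
    """Compare one article against all others, skipping any whose
    length differs by more than `window` characters."""
    text = articles[current_id]
    dupes = []
    for other_id, other in articles.items():
        if other_id == current_id:
            continue
        if abs(len(other) - len(text)) > window:   # the length pre-filter
            continue
        similarity = SequenceMatcher(None, text, other).ratio() * 100
        if similarity >= threshold:
            dupes.append(other_id)
    return dupes
```

The cron job then just records the duplicates found, stores the next id to resume from, and exits.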

I'm at about 20 seconds to run this query when it compares against 10k other rows, and potentially in excess of a minute on others. If I could make this efficient enough to run once every minute, the polling would be done in more like a month, which I may be patient enough for.

ScallioXTX
—
2011-03-01T23:19:40Z —
#8

And if you perform all the levenshtein stuff in your programming language instead of querying MySQL for it each time? That should ~theoretically~ be faster.

qwidjib0
—
2011-03-01T23:29:18Z —
#9

ScallioXTX said:

And if you perform all the levenshtein stuff in your programming language instead of querying MySQL for it each time? That should ~theoretically~ be faster.

Sure. Final verdict: PHP levenshtein comparisons are a great solution for this. There are lots of modified versions of levenshtein() on php.net that seem to work even better, though. Here's the one I ended up going with:
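The specific php.net variant isn't reproduced here, but a common modification in those user-contributed versions is bailing out early once the distance is guaranteed to exceed a cutoff, which is what makes millions of repeated comparisons cheap. A Python sketch of that idea (the cutoff value is illustrative):

```python
def levenshtein_capped(a: str, b: str, cap: int):
    """Edit distance, but return None as soon as it must exceed `cap`."""
    if abs(len(a) - len(b)) > cap:
        return None                      # distance >= length difference
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        if min(curr) > cap:              # no later row can come back under the cap
            return None
        prev = curr
    return prev[-1] if prev[-1] <= cap else None
```

For an 80%-similarity threshold, the cap for each pair follows from the longer string's length, so most dissimilar pairs are rejected after only a few rows of the table.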

I've come up with a solution using this function that polls millions of rows of fulltext against millions of other rows, now due to finish in right around 40 days on a one-minute cron, and it's running great at this moment. Thanks again!