SQLServerCentral.com / SQL Server 2008 / T-SQL (SS2K8) / Function for similar phrases / Latest PostsInstantForum.NET v2.9.0SQLServerCentral.comhttp://www.sqlservercentral.com/Forums/notifications@sqlservercentral.comTue, 03 Mar 2015 15:44:49 GMT20RE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspxWhatever clever function you may find/create the phrase [code="sql"]'US Eastern District Court'[/code]will still be matching [code="sql"]'US Western District Court'[/code]better than [code="sql"]'United States Eastern District Court'[/code]It's a lot of general knowledge of yours which allows you to match a row to this row but not to that one.Unless you pass that knowledge to you database (yep, creating tables, storing samples and patterns) you will always program errors.Mon, 08 Apr 2013 23:58:21 GMTSergiyRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspx[quote][b]Sean Lange (4/5/2013)[/b][hr]I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.[/quote]I'm probably not giving it enough credit for whatever its intended use may have been but I've really been disappointed by it. From BOL...[i][font="Arial Black"]Vowels are ignored in the comparison. Nonalphabetic characters are used to end the comparison.[/font][/i]In other words, a dash or a space or a digit or just about any punctuation will stop the comparison. That means either it does just the first word or first part of a hyphenated word.These are probably bad examples of what I don't like about it, but it's the best I can do on short notice.SELECT DIFFERENCE('Glif', 'Geoffrey');SELECT DIFFERENCE('Glare', 'Geoffrey');Because none of the vowels (including the letter "y" in this case), are considered, these two comparisons are both given a "3 out of 4" as a match. That's usually nowhere near close enough for my purposes.Then take the following example which is just about spot on. SELECT DIFFERENCE('Geoffrey', 'Mr Geoffrey'); That's only given a "2" because only the "Mr" of the second operand is evaluated because any non-alphabetic character, including a space, will stop the comparison.Mon, 08 Apr 2013 18:20:31 GMTJeff ModenRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspxCheck out the following post that I started last July: [url]http://www.sqlservercentral.com/Forums/Topic1337370-391-1.aspx[/url][b]Chris@Home[/b] kindly provided a really good token matching algortihm and it's ultra fast as it's implemented as iTVF.SOUNDX is rubbish. Alternative to the Chri's algorithm in the above post . Also Metaphone, Double Metaphone (both free) and Metaphone3 which I purchased for $40 recently and they're all really good especially with US English/British words.Fri, 05 Apr 2013 09:36:09 GMTAbu DinaRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspx[quote][b]Jeff Moden (4/5/2013)[/b][hr][quote][b]deepzzzz (4/5/2013)[/b][hr]Thanks Sean for your post..For this example it works fine..Just consider these examples...'US Eastern District Court' 'Southern District Court of US'For this also it is giving the same value. Actually both are different ..Court District Eastern US C630Court District of Southern US C630So how can I avoid this?[/quote]SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.[/quote]I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.Fri, 05 Apr 2013 07:39:39 GMTSean LangeRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspx[quote][b]deepzzzz (4/5/2013)[/b][hr]Thanks Sean for your post..For this example it works fine..Just consider these examples...'US Eastern District Court' 'Southern District Court of US'For this also it is giving the same value. Actually both are different ..Court District Eastern US C630Court District of Southern US C630So how can I avoid this?[/quote]SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.Fri, 05 Apr 2013 07:16:03 GMTJeff ModenRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspxThanks Sean for your post..For this example it works fine..Just consider these examples...'US Eastern District Court' 'Southern District Court of US'For this also it is giving the same value. Actually both are different ..Court District Eastern US C630Court District of Southern US C630So how can I avoid this?Fri, 05 Apr 2013 03:19:47 GMTdeepzzzzRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspxYour question is almost verbatim the MSDN definition of SOUNDEX. [url=http://msdn.microsoft.com/en-us/library/ms187384.aspx]http://msdn.microsoft.com/en-us/library/ms187384.aspx[/url]Now with the two phrases you listed the SOUNDEX would not be even close because the order of the words is different. We can leverage the age old DelimitedSplit8K here quite nicely. The idea here is order all words in each phrase alphabetically so we have a consistent order of the words for SOUNDEX.[code]if OBJECT_ID('tempdb..#List') is not null drop table #List create table #List( ListID int identity, Phrase varchar(50))insert #Listselect 'US Eastern District Court' union all select 'Eastern District Court of US';with List as( --First we need to split the values on the words select * from #List cross apply dbo.DelimitedSplit8K(Phrase, ' ')), stuffedList as( --Now we sort the words alphabetically select ListID, STUFF((select Item + ' ' from List l2 where l2.ListID = l1.ListID order by Item for XML PATH('')), 1, 0, '') as FullList from List l1 group by ListID)select *, SOUNDEX(FullList) from stuffedList[/code]Please see the link in my signature about splitting strings. In there you will find the code and the logic for how the DelimitedSplit8K function works.Thu, 04 Apr 2013 08:03:45 GMTSean LangeRE: Function for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspxtwo things come to mind - full text indexingand if you have enterprise edition fuzzy grouping in SSIS advanced data flow transformations.Thu, 04 Apr 2013 07:36:20 GMTgeorge sibbaldFunction for similar phraseshttp://www.sqlservercentral.com/Forums/Topic1438793-392-1.aspxHi All, Is there any built in function available for checking the similarity of the phraseseg: US Eastern District Court is similar to Eastern District Court of US. So if I am comparing both i should get result 1.I need this for checking the duplicate entries in a table having so much data, so that we can delete the duplicates.Thanks In Advance...Thu, 04 Apr 2013 07:26:06 GMTdeepzzzz