

Why do you think the two strings in example 2 should compare equal?
– Sinan Ünür, Jun 23 '09 at 21:01

Am I missing something here? Did the original poster give any indication that he was concerned with phonetics, rather than the characters, other than the fact that the first example could be seen to imply phonetic similarity? The second example certainly does not.
– kcrumley, Jun 23 '09 at 21:10

I guess that "Similarity" and "Phonetic" are the closest terms, but they are different things. A "Similarity" check needs to use both a phonetic algorithm and a similarity algorithm to validate a text correctly.
– Zanoni, Jun 24 '09 at 11:37

@kcrumley: The second example is only hypothetical. Strings that show some similarity for each word found can be considered similar.
– Zanoni, Jun 24 '09 at 11:40

12 Answers

I don't think any of those algorithms take sounds into consideration, however, so "staq overflow" would be exactly as similar to "stack overflow" as "staw overflow" is, despite the first being closer in terms of pronunciation.
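To make that point concrete, here is a minimal sketch of a plain Levenshtein (edit) distance in Python. The function name `levenshtein` is just an illustrative choice, not taken from any of the libraries mentioned in this thread. It only counts character insertions, deletions and substitutions, so it should score both misspellings identically.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


d_staq = levenshtein("stack overflow", "staq overflow")
d_staw = levenshtein("stack overflow", "staw overflow")
print(d_staq, d_staw)  # the two distances come out equal
```

Both misspellings need one substitution and one deletion, so the distance carries no information about which one sounds closer.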

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
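For reference, a rough sketch of American Soundex in Python follows. It assumes plain ASCII input and follows the commonly described rules (keep the first letter, encode consonants as digits, collapse adjacent equal codes, treat h and w as transparent, pad to four characters); real-world variants differ in small details, so treat it as an illustration rather than a reference implementation.

```python
# Consonant groups for codes 1..6; vowels, y, h and w have no code.
SOUNDEX_GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(digit)
         for digit, group in enumerate(SOUNDEX_GROUPS, start=1)
         for ch in group}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch, "")  # vowels and y map to ""
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets this, so repeats count again
    return (result + "000")[:4]


for w in ("stack", "staq", "staw"):
    print(w, soundex(w))
```

With this sketch, "stack" and "staq" share a code while "staw" gets a different one, which is the distinction the edit-distance algorithms miss.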

FYI, if you happen to be working with the data in SQL Server, it has a SOUNDEX() function.
– Paul Draper, Apr 2 '13 at 18:45


Also, it should be noted that Soundex is an old algorithm intended mostly for English words. If you want a multi-lingual modern algorithm, consider using Metaphone. This article discusses the differences in greater detail: informit.com/articles/article.aspx?p=1848528
– Seanny123, Nov 5 '13 at 1:24

Here is some code I have written for a project I am working on. I need to know the Similarity Ratio of the strings and the Similarity Ratio based on the words of the strings.
For the latter, I want to know both the Words Similarity Ratio relative to the smaller string (so if all of its words exist and match in the larger string, the result will be 100%) and the Words Similarity Ratio relative to the larger string (which I call RealWordsRatio).
I use the Levenshtein algorithm to find the distance. The code is unoptimised so far, but it works as expected. I hope you find it useful.
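The poster's own code is not reproduced here. As a rough illustration of the ratios described above, a minimal Python sketch might look like the following. Everything in it is an assumption made for the example: the names `similarity_ratio` and `word_ratios`, normalising the distance by the longer length, picking the "smaller" string by word count, and the 0.8 threshold for treating two words as a match.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def word_ratios(a: str, b: str, threshold: float = 0.8):
    """Return (ratio vs. the shorter word list, ratio vs. the longer one)."""
    shorter, longer = sorted((a.lower().split(), b.lower().split()), key=len)
    if not shorter:
        return 0.0, 0.0
    remaining = list(longer)
    matched = 0
    for word in shorter:
        # Greedily pair each word with its closest unused counterpart.
        best = max(remaining,
                   key=lambda w: similarity_ratio(word, w), default=None)
        if best is not None and similarity_ratio(word, best) >= threshold:
            matched += 1
            remaining.remove(best)
    return matched / len(shorter), matched / len(longer)


print(word_ratios("stack overflow", "the stack overflow site"))
```

Here the first value plays the role of the ratio for the smaller string and the second corresponds to what the poster calls RealWordsRatio. The greedy pairing is a simplification; an optimal assignment would need more work.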

Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms which you might also find useful.

To deal with 'sound-alikes' you may want to look into encoding the strings with a phonetic algorithm such as Double Metaphone or Soundex. I don't know whether computing Levenshtein distances on phonetically encoded strings would be beneficial, but it might be a possibility. Alternatively, you could use a heuristic like this: convert each word in the string to its encoded form, remove any words that occur in both strings and replace them with a single representation, then compute the Levenshtein distance.
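A minimal sketch of the first idea (Levenshtein distance computed on phonetically encoded strings) is given below. It uses a simplified ASCII-only Soundex as a stand-in for Double Metaphone, purely because it is short enough to show inline; the helper names `phonetic_key` and `phonetic_distance` are hypothetical. The second heuristic, with shared words collapsed, is not attempted here.

```python
SOUNDEX_GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
CODES = {ch: str(digit)
         for digit, group in enumerate(SOUNDEX_GROUPS, start=1)
         for ch in group}


def soundex(word: str) -> str:
    """Simplified American Soundex for ASCII words."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result, last = letters[0].upper(), CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped without resetting the last code
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        last = code
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def phonetic_key(text: str) -> str:
    """Encode every whitespace-separated word and join the codes."""
    return " ".join(soundex(word) for word in text.split())


def phonetic_distance(a: str, b: str) -> int:
    """Edit distance between the phonetic keys of two strings."""
    return levenshtein(phonetic_key(a), phonetic_key(b))


print(phonetic_distance("stack overflow", "staq overflow"))
print(phonetic_distance("stack overflow", "staw overflow"))
```

Under these assumptions "staq overflow" ends up closer to "stack overflow" than "staw overflow" does, unlike with the raw character distance. Whether that helps on real data depends on the encoder and the language involved.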

Metaphone 3 is the third generation of the Metaphone algorithm. It
increases the accuracy of phonetic encoding from the 89% of Double
Metaphone to 98%, as tested against a database of the most common
English words, and names and non-English words familiar in North
America. This produces an extremely reliable phonetic encoding for
American pronunciations.

Metaphone 3 was designed and developed by Lawrence Philips, who
designed and developed the original Metaphone and Double Metaphone
algorithms.