Introduction

This article describes a way of capturing the similarity between two strings (or words). String similarity is a confidence score that reflects the relation between the meanings of two strings, each of which usually consists of multiple words or acronyms. Currently, this approach is concerned more with a measurement that reflects the relation between the patterns of the two strings than with the meaning of the words.

I implemented this algorithm when I was developing a tool to make matching between XML schemas semi-automatic.

Preparing the ground

For this article, I implemented two algorithms:

Levenshtein algorithm[1]

The Kuhn-Munkres algorithm (also known as the Hungarian method)[2]

This article does not go "deep into theory"; if you want to understand these algorithms in detail, please read about them in algorithms textbooks.

Problem

The string similarity algorithm was developed to satisfy the following requirements:

A true reflection of lexical similarity - strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.

Robustness to changes of word order - two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.

Language independence - the algorithm should work not only in English, but also in many different languages.

Solution

The similarity is calculated in three steps:

Partition each string into a list of tokens.

Compute the similarity between tokens using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).

Compute the similarity between the two token lists.

Tokenization

A string is a list of words or abbreviations, possibly composed in camel or Pascal casing without separator characters. For example, 'fileName' is tokenized into "file" and "Name". This work is done by the tokeniser.cs class.
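As a minimal illustration of the idea (a Python sketch, not the article's tokeniser.cs, which may handle more cases), camelCase/PascalCase splitting can be done with regular expressions:

```python
import re

def tokenize(s):
    """Split a string into tokens at separator characters and at
    lowercase-to-uppercase (camelCase/PascalCase) boundaries."""
    tokens = []
    # First split on separator characters such as '_', '-', '.', and spaces.
    for part in re.split(r"[^A-Za-z0-9]+", s):
        if not part:
            continue
        # Then split each part into runs of uppercase letters (acronyms),
        # capitalized words, and lowercase/digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+", part))
    return tokens

# tokenize("fileName") → ["file", "Name"]
```

The acronym alternative (`[A-Z]+(?![a-z])`) keeps runs like "XML" together instead of splitting them into single letters.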

Compute the similarity between two words

The first method uses an edit-distance string matching algorithm: Levenshtein. The string edit distance is the total cost of transforming one string into another using a set of edit rules, each of which has an associated cost. The Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-character) insertion, deletion, and substitution. In the version used here, substitutions cost two units except when the source and target characters are identical, in which case the cost is zero; insertions and deletions cost half as much as substitutions.
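The cost scheme just described (substitutions costing twice an insertion or deletion) can be sketched as a standard dynamic program. This is an illustrative Python sketch, not the article's C# code:

```python
def weighted_edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=2):
    """Cheapest total cost of transforming s into t, where a substitution
    costs twice an insertion or deletion and a match costs nothing."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost          # delete every character of s
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost          # insert every character of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,    # delete s[i-1]
                          d[i][j - 1] + ins_cost,    # insert t[j-1]
                          d[i - 1][j - 1] + cost)    # match or substitute
    return d[n][m]
```

With these weights, "kitten" to "sitting" costs 5 (two substitutions at 2 each plus one insertion), whereas the plain unit-cost Levenshtein distance would be 3.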

Similarity between two token lists

After splitting each string into token lists, we capture the similarity between two strings by computing the similarity of those two token lists, which is reduced to the bipartite graph matching problem. A related classical problem on matching in bipartite graphs is the assignment problem, which is the quest to find the optimal assignment of workers to jobs that maximizes the sum of ratings, given all non-negative ratings Cost[i,j] of each worker i to each job j.

The problem can now be described as follows:

Given a graph G(V,E), G can be partitioned into two sets of disjoint nodes X(left) and Y (right) such that every edge connects a node in X with a node in Y, and each edge has a non-negative weight.

X is the set of the first list of tokens.

Y is the set of the second list of tokens.

E is the set of edges connecting each pair of vertices (x, y); the weight of the edge connecting an x1 to a y1 is computed as the similarity of the x1 token and the y1 token (using the GetSimilarity function).

The task is to find a subset of node-disjoint edges that has the maximum total weight. The similarity of the two strings is then computed from the total weight of the matched tokens.

Initialize the weight of the edges

The results of the GetSimilarity function are used to compute the weights of the edges:

w(x,y) = GetSimilarity(token[x], token[y])

Connecting the edges to maximize the total weight

We use the Hungarian method to solve this maximum-weight bipartite matching problem. The class implementing this algorithm is MatchsMaker.cs.
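As a toy illustration of the assignment step, the sketch below finds the maximum-weight one-to-one assignment by brute force over all permutations. This is exponential and only viable for short token lists; the article's MatchsMaker.cs uses the polynomial-time Kuhn-Munkres method instead.

```python
from itertools import permutations

def max_weight_matching(weights):
    """weights[i][j] is the similarity of token i (list X) to token j (list Y).
    Returns the best total weight over all one-to-one assignments."""
    n, m = len(weights), len(weights[0])
    if n > m:
        # Transpose so X is the shorter side; every x then gets a distinct y.
        weights = [[weights[i][j] for i in range(n)] for j in range(m)]
        n, m = m, n
    best = 0.0
    for perm in permutations(range(m), n):  # perm[i] is the y matched to x_i
        best = max(best, sum(weights[i][perm[i]] for i in range(n)))
    return best
```

For two-token lists with weights [[0.9, 0.1], [0.8, 0.2]] the best assignment pairs x0 with y0 and x1 with y1, for a total of 1.1; the Hungarian method finds the same optimum without enumerating permutations.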

Future improvements

Computing semantic similarity between words using WordNet

The sections above let us get a similarity score between the patterns of strings. However, sometimes we need a semantic measurement, which leads us to look for a semantic similarity. I am currently experimenting with using WordNet[1] to compute the similarity between words in the English dictionary.

The WordNet

WordNet is a lexical database, available online, that provides a large repository of English lexical items. It was designed to establish connections between four types of POS (parts of speech): noun, verb, adjective, and adverb. The smallest unit in WordNet is the synset, which represents a specific meaning of a word. It includes the word, its explanation, and the synonyms of its meaning. The specific meaning of one word under one type of POS is called a sense; each sense of a word is in a different synset. A synset is thus equivalent to a sense: a structure containing a set of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example, the words night, nighttime, and dark constitute a single synset that has the following gloss: "the time after sunset and before sunrise while it is dark outside".

Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernymy and hyponymy for nouns, hypernymy and troponymy for verbs) constitute kind-of hierarchies, while others (holonymy and meronymy for nouns) constitute part-of hierarchies. For example, a tree is a kind of plant: tree is a hyponym of plant and plant is a hypernym of tree. Analogously, a trunk is a part of a tree: trunk is a meronym of tree and tree is a holonym of trunk.

For one word and one type of POS, if there is more than one sense, WordNet organizes the senses in order from the most frequently used to the least frequently used.

Based on WordNet and its .NET API (provided by Troy Simpson), I use synonym and hypernym relations to capture the semantic similarities of tokens. Given a pair of words, once a path that connects the two words is found, I determine their similarity based on two factors: the length of the path and the order of the senses involved in this path.

To find a path connecting two words, we use breadth-first search to check if they are synonyms. Searching for a connection between two words in WordNet is an expensive operation because of the large search space, so we define two restrictions to reduce the computational time. The first is that only synonym relations are considered (hyponym and hypernym relations will be considered later), since exhausting all the relations is too costly; this restriction is also adopted in some related work. The second is to limit the length of the search path: if no connecting path is found within the length limit, we stop searching and report that no path was found. In that case, we substitute the edit-distance measure instead.
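The depth-limited search can be sketched as a breadth-first search over a synonym map. The toy dictionary below is a hypothetical stand-in for the WordNet API, purely for illustration:

```python
from collections import deque

def synonym_path_length(graph, source, target, max_depth=3):
    """Breadth-first search over synonym links. Returns the length of the
    shortest connecting path, or None if no path exists within max_depth."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for syn in graph.get(word, ()):
            if syn == target:
                return depth + 1
            if syn not in seen:
                seen.add(syn)
                queue.append((syn, depth + 1))
    return None  # caller falls back to the edit-distance measure

# Toy synonym map (illustrative only, not real WordNet data):
toy = {"night": ["nighttime", "dark"], "dark": ["night", "dusky"]}
```

A `None` result is exactly the "no path found" case described above, where the edit distance is used as the substitute score.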

The similarity score

Considering synonyms first, we use the following formula to calculate the semantic word-similarity score:

WordSim(s, t) = SenseWeight(s) * SenseWeight(t) / PathLength

Where s and t denote the source and target words being compared.

SenseWeight: denotes a weight calculated according to the order of this sense and the count of total senses.

PathLength: denotes the length of the connection path from s to t.
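The article does not give the exact formula for SenseWeight, so the sketch below assumes a simple linear weighting in which earlier (more frequent) senses weigh more; treat `sense_weight` as one hypothetical choice, not the author's formula:

```python
def sense_weight(order, total_senses):
    """Assumed weighting: the first (most frequent) sense gets weight 1,
    and later senses get proportionally less. Illustrative only."""
    return (total_senses - order + 1) / total_senses

def word_sim(order_s, total_s, order_t, total_t, path_length):
    """WordSim(s, t) = SenseWeight(s) * SenseWeight(t) / PathLength."""
    return (sense_weight(order_s, total_s)
            * sense_weight(order_t, total_t)) / path_length
```

Two words connected through their first senses by a path of length 1 score 1.0; rarer senses or longer paths pull the score down, matching the intent of the formula above.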

(This work is an experiment and is not available in this version, it will soon be released.)

Using the code

Add the similarity project to your workspace, then call the GetSimilarity function with the two strings you want to compare.

References

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

I'm still alive... but temporarily moved to work on mobile & web stuff (J2ME/BREW/PHP/Flash... something not M$). Things have just been very busy, and probably will continue to be... so I don't have a chance to maintain & respond. I hope I'll have time to try to write again, because many ideas with WPF & Silverlight are waiting. Wish me luck.

Comments and Discussions

Dear Thanh Dao,
Hope you are doing fine. I was told to add the question here in comments for a reply.
I had gone through your details provided in the below link, which is similar to what I am looking for.
An improvement on capturing similarity between strings: http://www.codeproject.com/Articles/11157/An-improvement-on-capturing-similarity-between-str
My solution uses levenshtein and the accuracy % set and compares against the available set of names which is around 800,000 names, and it is growing. The solution takes around 2 minutes to compare a single name against the 800,000 names which I would like to improve. For improving it, I was looking at different sites when I came across your site. Would like to know if there is any assistance that you could provide us. My developer is not much experienced and has already taken a lot of time. I would appreciate if you could let us know how you could assist us in developing this logic.
Best Regards
Jackson

First of all, I am really excited about this functionality; it is much better than my own attempts to do the same thing. However, I am experiencing a strange bug. I use Visual Studio 2003, and the code works flawlessly when my build configuration is 'Debug', but when I recompile in 'Release' mode, the similarity scores are completely wrong, even though they were right before. Does anyone have any idea why this could be happening?
Thanks
-Ty

-- modified at 17:17 Tuesday 23rd May, 2006

Here are some more details.
I have isolated the build options that affect the result of the similarity calculation. In order for it to work correctly, my project needs to have the Optimize Code option set to false, and the Generate Debugging Information option set to true.

Strange Solution?
I got it working finally.. though I don't quite understand how.
I achieved success when I commented out the following lines of code in the GetTotal() method of the BipartiteMatcher class.

These lines don't seem to do anything but set the value of a variable that isn't used anywhere... yet somehow they were changing the outcome of the calculation... but only when compiled in release mode. I am happy to have this working, but no less confused as to the cause.

I am unable to run the code that is available for download with your article. It's not that there is a problem in the code; it's just that I do not know how to go about it. I am using Visual Studio .NET 2003. Please get me started. Thanks a lot.

Dear Sir,
This article ("http://www.thecodeproject.com/csharp/improvestringsimilarity.asp") presents an interesting concept and idea. It was a good read and I thoroughly enjoyed going through it. Although I have to admit I didn't understand most of the coding in the program, being an English student at uni myself, I managed to get it running in the end somehow with my brother's help.

I need to write an article in the next two weeks, and I think this would be an interesting topic for someone not from the computing field and a good discussion point, also because this is a working model: it's not just some proposal, but has actually been implemented.

I would like to request, if it is possible for you, that you send me a brief step-by-step explanation of the stages in which this program works and processes data before reaching the end result between 0 and 1, on a purely conceptual basis (simple English) in the context of this program.
I would indeed be quite grateful to you.
My mail id: paulajone@googlemail.com
Thanking you.
- Paula.

I'm glad that you enjoyed it. As you can read in the article, there are three major steps:
- Split a sentence into words/tokens (assuming that, in most natural languages, a sentence is a sequence of words and delimiters).
- Compute the similarity of two words. (You know what similarity means...)
- Compute the similarity of two bags of words.

Firstly, I want to explain a little bit about the string edit distance problem.
Let's play a simple game: you are given two words, and you have three operations you can apply to a word: replace, insert, delete. If you can find the least number of steps to transform the first word into the second one, you win. So how will we go about finding the best solution? Mr. Levenshtein has done it.

private int ComputeDistance (string s, string t)

{

int n=s.Length;

int m=t.Length;

Step 0: Create an (n+1)x(m+1) matrix, called the distance matrix, in which each element [i, j] represents the cheapest cost (least number of edit steps) of transforming the sequence of the first i characters of the s string into the sequence of the first j characters of the t string.

int[,] distance=new int[n + 1, m + 1]; // matrix

int cost=0;

if(n == 0) return m;

if(m == 0) return n;

Step 1: Initialize the first column of the distance matrix: for each row, set distance[i, 0] = i. That means the cheapest cost of transforming a string of length i into an empty string is i delete operations.

for(int i=0; i <= n; distance[i, 0]=i++);

Step 2: Initialize the first row of the distance matrix: for each column, set distance[0, j] = j. That means the cheapest cost of transforming an empty string into a string of length j is j insert operations.

for(int j=0; j <= m; distance[0, j]=j++);

Step 3: Find the solution for each subproblem of size [i, j]. The idea behind this code is the bottom-up dynamic programming technique: for each element [i, j] we compute the solution based on the previously solved, smaller subproblems, traversing the matrix from smaller problems to bigger ones until we reach the full problem of size n x m:

for(int i=1; i <= n; i++)
{
    for(int j=1; j <= m; j++)
    {
        // A substitution costs 1 here; matching characters cost nothing.
        cost = (s[i - 1] == t[j - 1]) ? 0 : 1;

        distance[i, j] = Min3(
            distance[i - 1, j] + 1,         // delete one character: e.g. from ABCD to ABC we delete "D"
            distance[i, j - 1] + 1,         // insert one character: e.g. from ABC to ABCD we append "D"
            distance[i - 1, j - 1] + cost); // match or replace: e.g. from ABCD to ABCE we replace "D" with "E"
    }
}

return distance[n, m];

}

The smaller distance[n, m] is, the more similar the two strings are.
There are several ways to compute a similarity value from it; you may use some coefficient. For example: Sim = (maxLen - distance[n, m]) / maxLen.
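Putting the two pieces of that reply together (and assuming maxLen means the longer string's length, which the reply does not state), the normalized similarity can be sketched as:

```python
def similarity(s, t):
    """Normalized edit-distance similarity in [0, 1]:
    (maxLen - distance) / maxLen, with maxLen = max(len(s), len(t))."""
    n, m = len(s), len(t)
    if max(n, m) == 0:
        return 1.0  # two empty strings are identical
    # Plain (unit-cost) Levenshtein distance:
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # match or replace
    return (max(n, m) - d[n][m]) / max(n, m)
```

Identical strings score 1.0, and the score decreases toward 0 as more edits are needed relative to the longer string's length.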

The idea of dynamic programming is quite simple: if the optimal solution of a smaller (previous) subproblem remains valid regardless of the bigger (next) problem, we can build the perfect solution of the bigger problem from the solutions of the smaller ones. Imagine the problems arranged in a hierarchical tree: a parent node is a bigger problem and its children are smaller ones. To compute the solution for a parent node you do not need to look at all of its descendants, only at its children.

Let me know if I have misunderstood your request, and bear with me. I will be back to talk about the maximum bipartite matching algorithm. (I am currently sinking into a research paper.)
