The ZenCodeo Case – Fuzzy String Solution

A simple solution to the problem of company name deduplication. No machine learning, just data prep.

3

votes

Data Understanding

The telephones and addresses don’t matter for the solution described here, because it only looks at the client names and tries to find patterns there. To explore the client names, just import the csv files in excel and delete the other columns. Remove duplicates, sort and you’ll see that a lot of the names group nicely together. There’s no need to use anything fancy to achieve that, but now we have to make the computer comprehend that some of the strings look really similar.

To do that, we can use a python library called difflib. It allows us to do approximate string matching. The solution described here involves no machine learning, just a data prep script which tries to cluster client names with similar spelling.

Data Preparation

Read the unique client name strings from the 3 source csv files.

Extract the first tokens of these strings (everything up until the first space). Nearly all of the client names start the same, the variation appears to be mainly after the first word. If we want to find similarities efficiently, we shouldn’t complicate the job of the fuzzy string matching function, by passing the full strings to it.

Do approximate string comparisons between all of the tokens. If it turns out that some tokens are really close, assign the same number to them.

Tokens which weren’t matched to anything else but themselves are assigned to cluster zero. If manual curation of the results has to be done, cluster zero is where it should start, because that’s where all the strings which didn’t fit anywhere else are.

The result is saved to a tab-delimited csv file. It has two columns, the client names and the cluster numbers assigned to them by the script.

The difflib comparison function uses a number between 0 and 1 to indicate the strength of a match.
The documentation of difflib says ratios above 0.6 indicate that the pair of strings are close. Ratios above 0.7 were used in this solution and this is the only parameter, which is intended to be tuned.

The cluster csv file can be used as a mapping table to assign unique IDs to the clients. It can be done as part of the solution, but keeping the cluster file separate has one benefit. If uploaded to a database, it can easily be joined with the client tables via the name column. The strings have been lowercased, but besides that they are the same as the originals, misspellings and all. This way there can be a buffer between the clusterization process and the operational business environment.

Possible improvements

Looking at the results, it’s obvious that there are a lot of people names mixed in with the company names. The only reliable way to identify and remove these would be to lookup each string in a list, which enumerates human names. This step isn’t part of the solution shown here, but it can be added if necessary.

Another way to improve the output would be to tokenize the words in a more sophisticated manner, instead of just splitting by space. It’s simple and efficient, but it leads to results like this: