Fuzzy Matching Logic

Fuzzy matching is one of Automated Auditors’ core strengths. It is the ability to link phrases that look or sound alike but are not spelled the same. For example, “Elizabeth Banks” and “Banks, Liz E.” are close enough to the human eye and ear that they should be counted as similar. How is fuzzy matching performed, and why is it important?

Benefits of Fuzzy Matching

Fuzzy matching is the art and science of linking disparate words and phrases with one another. Its benefits are many, but here are the top 3 reasons to harness the power of fuzzy matching:

1-Increased ability to make linkages: It is widely believed that a misspelling of deceased Boston Bomber Tamerlan Tsarnaev’s name thwarted the FBI’s efforts to track his 2012 trip to Russia. Could fuzzy matching logic have helped flag Tsarnaev’s trip in intelligence files?

2-Increased ability to merge disparate files: In a recent accounts receivable project, our analysts had to merge several accounting files that lacked a common identifier to link upon. Our data mining experts used fuzzy matching to link the files together by customer name. We were able to completely reconstruct the year-end accounts receivable reports using fuzzy matching as the key.

3-Increased ability to aggregate accurately: Aggregating by any kind of name is problematic when the spelling of that name varies. For example, suppose the vendor Hewlett Packard appears in your files under the following spellings:

Before Fuzzy Matching | After Fuzzy Matching
----------------------|---------------------
HP                    | Hewlett Packard
Hewlett-Packard       | Hewlett Packard
Hewlett-Packerd       | Hewlett Packard
H.P. Inc              | Hewlett Packard
Hewlet Packard        | Hewlett Packard
HP Corp               | Hewlett Packard


Using fuzzy matching, we consolidate these names into one standardized name so that an accurate aggregation can be made for this corporation.
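As a rough illustration of this kind of consolidation, here is a minimal Python sketch. It is not the production logic described in this article. The alias table, the list of corporate suffixes, and the 0.85 threshold are all assumptions made for the example, and Python's standard `difflib.SequenceMatcher` stands in for whichever similarity function is actually used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference data: one canonical name and its known abbreviation.
CANONICAL_NAMES = ["Hewlett Packard"]
ALIASES = {"hp": "Hewlett Packard"}
SUFFIXES = {"inc", "corp", "co", "llc", "ltd"}


def normalize(name):
    """Lowercase, turn hyphens into spaces, drop punctuation and corporate suffixes."""
    text = name.lower().replace("-", " ")
    text = re.sub(r"[^a-z0-9 ]", "", text)
    words = [w for w in text.split() if w not in SUFFIXES]
    return " ".join(words)


def standardize(name, threshold=0.85):
    """Return the canonical name for a vendor spelling, or the input if none is close."""
    key = normalize(name)
    if key in ALIASES:
        return ALIASES[key]
    best_name, best_score = name, 0.0
    for candidate in CANONICAL_NAMES:
        score = SequenceMatcher(None, key, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else name


vendors = ["HP", "Hewlett-Packard", "Hewlett-Packerd", "H.P. Inc", "Hewlet Packard", "HP Corp"]
for vendor in vendors:
    print(f"{vendor:<16} -> {standardize(vendor)}")
```

Abbreviations such as “HP” share almost no characters with the full name, so the sketch handles them with a lookup table and reserves the similarity score for misspellings.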

How Do We Perform Fuzzy Matching?

There are several ways to perform fuzzy matching. Here are the methods that Automated Auditors most commonly uses to tie disparate data together.

1) Levenshtein Distance – The Levenshtein distance measures the “distance” between two phrases by counting the minimum number of single-character insertions, deletions, and substitutions it takes to turn one phrase into the other. For example, see the Levenshtein distances calculated here between pairs of names. We use the name of deceased Boston Bomber Tsarnaev to show that misspellings can be caught and disparate databases joined, resulting in increased data completeness and accuracy. Could fuzzy matching have saved lives?

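Name 1   | Name 2            | Levenshtein Score | What the score represents
---------|-------------------|-------------------|------------------------------------------------
tsarnaev | tsarnaev          | 0                 | indicates an exact match
tsarnaev | sarnaev           | 1                 | indicates 1 deletion
tsarnaev | tsarnayev         | 1                 | indicates 1 insertion
tsarnaev | tamerlan tsarnaev | 9*                | indicates 9 insertions (8 letters and a space)

*Notice how the function fails at comparing phrases but is good for comparing words: the added first name alone accounts for the entire score.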
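The counting described above can be sketched in a few lines of Python. This is the textbook dynamic-programming formulation, shown only for illustration; it is not the SAS implementation referred to in this article, and the function name is our own.

```python
def levenshtein(a, b):
    """Count the insertions, deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the characters of a processed so far
    # (before the current one) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("tsarnaev", "tsarnaev"))   # 0
print(levenshtein("tsarnaev", "sarnaev"))    # 1
print(levenshtein("tsarnaev", "tsarnayev"))  # 1
```

Keeping only two rows of the table at a time is enough because each cell depends only on the row above it and the cell to its left.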


2) TriGram Function – The Trigram function, developed by Automated Auditors, returns the number of trigrams that two phrases have in common. A trigram is a consecutive 3-letter substring of a phrase. For example, the word “AUDITOR” has 5 trigrams:

AUD – UDI – DIT – ITO – TOR

Each word or phrase has (N-2) trigrams, where N is its length in characters. We also compare the percentage of trigrams that two phrases have in common. The PctTriGram function is particularly useful for phrase matching and address matching. The sample records below show the kind of name and address data it is applied to:

Row # | Name           | Address            | Bank Account | Phone #
------|----------------|--------------------|--------------|-------------
A     | John Roberts   | 101 S. Main St     | 10122346     | 703-356-1101
B     | Jon Roberts    | 12235 Regal Circle | 10122347     | 703-356-1101
C     | Mohammed Habib | 789 Wheeler Way    | 10122347     | 954-227-1234
D     | Haraj Tourec   | 2245-A Tamiami Tr  | 09781405     | 954-227-1234
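To make the trigram idea concrete, here is a small Python sketch applied to the names in rows A and B. The function names are hypothetical, and the percentage is computed against the shorter phrase's distinct trigrams. That denominator is an assumption, since this article does not spell out how PctTriGram defines it.

```python
def trigrams(text):
    """Return the consecutive 3-character substrings of text, in order."""
    text = text.upper()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def shared_trigram_count(a, b):
    """Count the distinct trigrams that appear in both phrases."""
    return len(set(trigrams(a)) & set(trigrams(b)))


def pct_trigram(a, b):
    """Shared distinct trigrams as a percentage of the shorter phrase's distinct trigrams.

    Assumption: this denominator is one plausible choice, not necessarily
    the one used by the PctTriGram function described in the article.
    """
    set_a, set_b = set(trigrams(a)), set(trigrams(b))
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return 100.0 * len(set_a & set_b) / smaller


print(trigrams("AUDITOR"))                                    # ['AUD', 'UDI', 'DIT', 'ITO', 'TOR']
print(shared_trigram_count("John Roberts", "Jon Roberts"))    # 7
print(round(pct_trigram("John Roberts", "Jon Roberts"), 1))   # 77.8
```

Spaces count as characters here, so trigrams such as “N R” that span two words also contribute to the match.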


3) Generalized Edit Distance – The generalized edit distance algorithm is a variation of the Levenshtein algorithm, and is used widely for comparing phrase similarity. SAS software contains a function called COMPGED that we utilize to match phrases. The COMPGED function in SAS returns the generalized edit distance between string-1 and string-2. The generalized edit distance is the minimum-cost sequence of operations for constructing string-1 from string-2.
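The idea of a minimum-cost sequence of operations can be sketched in Python as a weighted edit distance. This is an illustration only: COMPGED assigns its own costs to a larger set of operations, which this sketch does not attempt to reproduce. The function name and the three cost parameters are hypothetical.

```python
def generalized_edit_distance(source, target,
                              insert_cost=1.0, delete_cost=1.0, replace_cost=1.0):
    """Minimum total cost of the edits that turn source into target.

    The three cost parameters are hypothetical and illustrative only.
    """
    rows, cols = len(source) + 1, len(target) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete_cost   # delete every character of source
    for j in range(1, cols):
        cost[0][j] = j * insert_cost   # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + delete_cost,
                cost[i][j - 1] + insert_cost,
                cost[i - 1][j - 1] + (0.0 if same else replace_cost),
            )
    return cost[-1][-1]


# With unit costs the result equals the Levenshtein distance.
print(generalized_edit_distance("tsarnaev", "tsarnayev"))                          # 1.0
# Unequal costs make the direction of the comparison matter.
print(generalized_edit_distance("Hewlet Packard", "Hewlett Packard", insert_cost=0.2))  # 0.2
print(generalized_edit_distance("Hewlett Packard", "Hewlet Packard", insert_cost=0.2))  # 1.0
```

Because each operation carries its own cost, the score depends on which string is being constructed from which, so the order of the arguments matters.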

4) Jaro-Winkler Function – The Jaro-Winkler function is commonly utilized for matching words, but not necessarily phrases. The version of the Jaro-Winkler function we have is tailored for SAS software, and very accurately identifies similarities between two words. This function returns a value between 0 and 1, with 1 representing an exact match.
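For readers who want to see how such a score is built, here is a Python sketch of the commonly published Jaro-Winkler formulation. It is not the SAS-tailored version mentioned above. The prefix scale of 0.1 and the maximum prefix length of 4 are the conventional defaults and are assumed here.

```python
def jaro(a, b):
    """Jaro similarity between two strings, from 0 (no match) to 1 (identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, char in enumerate(a):
        start, stop = max(0, i - window), min(len(b), i + window + 1)
        for j in range(start, stop):
            if not b_matched[j] and b[j] == char:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))       # 0.9611
print(round(jaro_winkler("tsarnaev", "tsarnayev"), 4))  # 0.9778
print(jaro_winkler("tsarnaev", "tsarnaev"))             # 1.0
```

The prefix boost reflects the observation that misspellings are less common at the start of a name than later in it.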

It is important to note here that all of the above functions are useful for comparing single words, but not necessarily phrases. Our team has developed a series of fuzzy matching algorithms that compare complex phrases, text, and addresses.

5) Phrase Matching – For intricate phrase and address matching, we leverage all of these functions to provide comprehensive and accurate phrase and address fuzzy matching. Most of these functions, individually, are great for comparing single words, but not phrases. Our analysts combine them in a unique way to arrive at the best phrase matching algorithms.
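The algorithms themselves are not described in this article, but the general idea of building a phrase score out of word-level comparisons can be sketched in Python. Everything in the sketch is an assumption made for illustration: the function names, the choice to average each word's best match, and the use of Python's standard `difflib.SequenceMatcher` in place of the word-level functions above.

```python
import re
from difflib import SequenceMatcher


def tokens(phrase):
    """Split a phrase into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", phrase.lower())


def word_similarity(a, b):
    """Word-level similarity in [0, 1]; difflib stands in for the functions above."""
    return SequenceMatcher(None, a, b).ratio()


def phrase_similarity(phrase_a, phrase_b):
    """Average, over the words of the shorter phrase, of each word's best match in the other."""
    words_a, words_b = tokens(phrase_a), tokens(phrase_b)
    if not words_a or not words_b:
        return 0.0
    shorter, longer = sorted((words_a, words_b), key=len)
    best_scores = [max(word_similarity(w, other) for other in longer) for w in shorter]
    return sum(best_scores) / len(best_scores)


print(phrase_similarity("tsarnaev", "tamerlan tsarnaev"))  # 1.0
print(phrase_similarity("Elizabeth Banks", "Banks, Liz E."))
print(phrase_similarity("101 S. Main St", "101 South Main Street"))
```

Comparing word by word makes the score insensitive to word order and to extra words, which is where the single-word functions struggle when they are applied to whole phrases.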