Match datasets using Fuzzy Joins in SQL Server 2017 with R

Problem

Matching one dataset with another is a very basic and frequent operation while
designing query or data processing logic in any database management systems including
SQL Server. Mostly database systems support
exact joins like inner join, outer join,
left join, right join, etc. which depend on an exact match of one value in one or
more fields in one dataset with an exact match of the same value in another field
of another dataset. While working with a lot of textual data, it is a very normal
use-case where one may need to perform an inexact matching or in other words fuzzy
/ heuristic matching of data values. For example, while entering free text comments
or description, it is possible the users may make spelling mistakes and there may
be a need to match these words with a dictionary to find the matching words and
correct the spelling mistakes. Such data cleansing use-cases heavily rely on inexact
matching and the relational joins would not come to much use in this case. The need
is to have a joining method which can support inexact / fuzzy matching. In this
tip we will learn how to solve this problem.

Solution

Fuzzyjoin R package can be used in SQL Server 2017 and higher to perform different
kinds of fuzzy joins of data.

First, let's understand what distinct types of fuzzy joins are supported by
this package. Below is a list of distinct types of inexact matching supported by
the fuzzyjoin package along with the associated function name. We will be using
some of these functions in the upcoming steps to understand the use.

Numeric values that are within some tolerance (difference_inner_join)

Strings that are similar in Levenshtein/cosine/Jaccard distance, or other
metrics from the stringdist package (stringdist_inner_join)

A regular expression in one column matching to another (regex_inner_join).
Other Regex based join functions include regex_left_join (include all rows of
left table), regex_right_join (include all rows of right table), regex_full_join
(include all rows in each table), regex_semi_join (filter left table for rows
with matches), regex_anti_join (filter left table for rows without matches)

Genomic intervals, which include both a chromosome ID and (start, end) pairs,
that overlap (genome_inner_join)

Example of Fuzzy Matching with SQL Server 2017 and R

Now the we know the inventory of different join functions supported by the fuzzyjoin
package, we can start with the exercise to understand the working and usage of some
of these functions. Follow the steps as shown below.

Step 1 - It is assumed the SQL Server 2017 as well as R is installed
on the development machine. If you are new to R in SQL Server, you can refer the
R tutorial to get up to speed on working with R in SQL Server.

Step 2 - Its assumed that
SQL Server Management Studio (SSMS) in installed on the development
machine. We need to install the "dplyr" package for output redirection (%>%)
operator and "fuzzyjoin" package for using the above-mentioned functions. You can
read more about installing R packages in SQL Server from
here.

Step 3 - The first step is to open a new query window and ensure
that the packages are installed successfully and accessible in the SSMS environment.
Execute the below code, and if the code executes successfully, that would mean that
the library is installed correctly and accessible in SSMS. We are referencing the
packages in the library function and executing the script using sp_execute_external_script
stored procedure.

Step 4 - Now to simulate the problem in question, we need two
data frames which consists of misspellings as well as a dictionary list of correct
spellings. And then join the two to find inexact matches. In the below code we are
creating two data frames named misspelling and words which are the dictionary words,
and adding some sample data to demonstrate and test the matching. We are creating
a new vector named joined and assigning it the output of the fuzzyjoin. We are using
the stringdist_inner_join function to match the two data frames and specifying that
the distance between misspellings and dictionary words can be 1 at max by setting
value of 1 for the max_dist parameter. Execute the below code.

Step 5 - The output of the code would be as shown below. The
would "sould" got matched with multiple words where change or addition or removal
of one character would make the word match with the word in the dictionary. You
can also notice that the word "charactery" from the dictionary, and the word "nomatch"
from the misspelling is missing in the output.

In this way, we can a get a tableplot for multi-variate high-voluminous data analysis
primarily to understand distribution and relationships between different variables
in a dataset.

Step 6 - Now let's change the code a little bit and increase
the distance to 2. Execute the below code.

Step 7 - This time you would find that the word "charactery"
matched the word "charactors", "fleed" matched "field" and is in the output. But
the word "nomatch" is still missing from the misspelling list. The reason is that
there is no match for this in the dictionary list of words.

Step 8 - You can add an additional parameter distance_col = "distance"
to the stringdist_inner_join function, and the distance will be in the output as
shown below.

Step 9 - We have been doing inner joins until now. In case we
need all the words from one dataset, for example the misspellings data frame in
this case, we need to use a left join. Change the code as shown below by using the
stringdist_left_join function to get all the values from misspellings irrespective
of whether it inexactly matches any words from the dictionary list.

Step 10 - The output would look as shown below. This time the
"nomatch" word in the misspelling is in the output and the corresponding match in
the dictionary does not exists, so it shows the value as <NA>.

Step 11 - Let's say we have a situation where we anticipate
that we have a high volume of words that won't have a match in the dictionary. In
this case instead of doing a left join and filtering the data, we can use a different
type of join called an anti-join which returns data that does not match. Execute
the antijoin function as shown below.

Step 12 - The output from this code would look as shown below,
where the only word in the entire list which does not have an equivalent match in
the dictionary is shown.

In this way, distinct functions from the fuzzyjoin package can be used for inexact
matching. In a real life scenario, data would come from a table in SQL Server for
both datasets that would be joined for the fuzzy matching by using the InputDataSet
data frame to pass the data from SQL Server to the R Script. And after the fuzzy
join is applied, the output can be extracted from the R script using the OutputDataSet
data frame. If you are new to this, you can read more about this in the
R tutorial.

Next Steps

Consider using the regular expression joins which provides the maximum flexibility
for fuzzy joining of datasets.

About the author

Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name
*Email
Email me updates

Signup for our newsletter
I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.