On the Road with SQL Saturdays

Blog Administrator

This Saturday in Cleveland (SQLSAT#60), I will be presenting several code samples based on a series of articles and posts identifying similar records (Fuzzy Matching) between two different sources or grouping of records from a single source, based on existing column string of values.

We will have a booth at Cleveland. Please stop by to chat with me or get a demo of our newest Fuzzy Matching SSIS Components and a free 45-day demo DVD.

We will implement several code samples based on a series of articles and posts identifying similar records between two different sources or grouping of records from a single source, based on existing column string of values.

We will define an approach, review actual implementations with various SQL tools (TSQL, VB, SSIS and MDS). There are many facets to review. I would like to start by drawing concepts from several articles from experienced practitioners, either from academia or the commercial market.

They will be footnoted at the end of the session. Although we are discussing matching, we need to address several steps prior to getting to the actual use of matching algorithms. For this post, we will focus on the pragmatic steps required to accomplish identifying the relationships between records, and address how these steps fit into an overall data governance and data quality of MDM architecture.

We will review real world implementations of several hundred million records. The steps required are shown here in a basic diagram.

The steps are as follows:

1. Cleansing and standardization

2. Group records

3. Split records

4. Compare records and determine scores

5. Split into separate match categories

6. Analyze results of matches

7. Evaluate using match tools to determine if best algorithms have been combined. We will then implement this approach in TSQL and SSIS as well as explore the use of Fuzzy Matching in MDS.

I will also be presenting at SQLSAT #47 Phoneix and the SQL Rally in Orlando in May.

Recently I have had a client that required migrating and re-hosting a data warehouse from the Oracle environment to the Netezza Appliance. The application had hundreds of tables and almost 2900 Source to Target Mappings. This would have taken thousands of hours of manual coding. For this client, we implemented a reusable data-driven architecture, via SSIS, that relies on a metadata mart. The end result a greatly reduced TCO (development effort) for generating the code required vs. manual coding.

In addition, the process leaves behind a metadata mart to report on for auditing and other Data Governance efforts. We generated over 100,000 lines of code and over 3,000 load scripts. After implementation the application can be maintained via the tables and automatically regenerated, providing a pro-active or self-healing application architecture to respond to changes in incoming source files or target changes.

Post navigation

Share This Story

Your email address will not be published. Required fields are marked *

Comment

Name *

Email *

Website

Search

Search for :

Subscribe

Name:

Email :

Leave this field empty if you're human:

Archive

Archive

Categories

Categories

About This Site

All data goes bad (up to 25% per year), whether due to data entry errors or the simple fact that consumers change jobs, move, update email addresses, marry, etc. At Melissa, we help companies harness the value of their Big Data, legacy data, and people data (names, addresses, phone numbers, and emails) to drive insight, maintain data quality, and support global intelligence