Disclaimer:

These are my personal views and are meant for informational purposes only. Please verify the information via professional help or official references before acting on anything provided in this blog.

1) While performing the Knowledge Discovery activity:

1A: In the Discover step:

1B: Also in the Manage Domain Values step:

Profiling gives you statistics at various stages of the data cleansing or matching process, so it is important to understand what you can do with them. Here are the statistics that we can gather during the knowledge discovery activity:

Newness

Uniqueness

Validity

Completeness
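To make these four statistics a little more concrete, here is a minimal sketch of how similar figures could be computed for a single column. This is only an illustration under my own assumptions, not how DQS calculates them internally: the function name `profile_column`, the sample values, and the two-letter validity rule are all hypothetical, and the profiler's exact definitions may differ.

```python
def profile_column(values, known_values, is_valid):
    """Compute rough profiling figures for one column of source data."""
    total = len(values)
    # Assumption: None and blank strings count as missing values.
    present = [v.strip() for v in values if v is not None and v.strip()]
    unique = set(present)
    # "New" here means a distinct value not yet in the knowledge base.
    new = unique - set(known_values)
    # "Valid" here means a distinct value that passes the domain rule.
    valid = {v for v in unique if is_valid(v)}
    return {
        "completeness": len(present) / total if total else 0.0,
        "unique": len(unique),
        "new": len(new),
        "valid": len(valid),
    }


# Hypothetical "State" column and a hypothetical two-letter domain rule.
states = ["WA", "wa ", "CA", "", None, "Washington", "CA"]
stats = profile_column(
    states,
    known_values={"WA", "CA"},
    is_valid=lambda v: len(v) == 2,
)
print(stats)
```

In this toy column, two of the seven values are missing, so completeness is five sevenths, and values such as "Washington" show up as new because the hypothetical knowledge base has not seen them yet.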

2) While performing the Cleansing activity:

2A: In the Cleansing step:

2B: Also in the Manage and View Results step:

Here the profiler gives you the following statistics:

Corrected values

Suggested Values

Completeness

Accuracy

Note the invalid records under “Source Statistics” on the left side. In this case, 3 records didn’t pass the domain rule.

And a composite domain, Full Address, which contains the domains Address Line, City, State and Zip.

6) And for the composite domain Full Address, this is how you configure reference data sets:

7) After creating the Knowledge Base, start a new DQS project. Here the cleansing happens at the composite domain level, and this is a demo of how DQS uses a third-party reference data set to classify records as suggested, new, invalid, corrected or correct:

You can see that DQS corrected a few records and states that the reason in this case was Melissa Data’s address verify reference data set:

Mantresh Jain is a C-level executive at an SMB in the manufacturing domain, based in India. He has a bachelor’s degree from a business school, and he holds a special interest in how businesses can leverage the newest information technology tools to optimize business processes. He is working on a company-wide ERP implementation and is the single point of contact for the implementation process. He spends his free time on computer games of all kinds! Connect with him here: http://www.linkedin.com/pub/mantresh-jain/43/562/749

Some months ago, Mantresh approached me to ask if I knew of any tool that would help him deal with “messy” data. After a few more questions, I learned that:

– Messy data = lots of duplicates

– They use SQL Server Express and do NOT have plans to upgrade to SQL Server editions that include Data Quality Services and/or Master Data Services. Remember the context here: they are a small or medium-sized business.

– They do use Excel, a lot!

– They do not have folks with “SQL” knowledge

With these requirements in mind, I asked him to see if an add-in for Excel called “Fuzzy Lookup” would meet their needs. After trying it out, here’s Mantresh’s experience of using the Fuzzy Lookup add-in for Excel in his organization:

Summary:

In my company we are implementing ERP software. I faced the problem of migrating data from two FoxPro-based applications to SQL (for the ERP).

More Details:

The two FoxPro applications worked independently from each other, and as a result each of them had its own separate database.

Let’s call them FX1 and FX2.

I wanted to import Account Master data from them into SQL. Here are the fields in our Account Master data:

Name, Address, Bank Details, Phone Number among other fields

Problem

Both systems had issues with data duplication and data inconsistency.

To give you an example, I faced the following problems:

1) FX1 had around 3500 entries and FX2 had 2400 entries.

Of the 3500 entries in FX1, around 2000 were the same as entries in FX2.

Likewise, FX2 had around 2000 entries that were the same as entries in FX1.

I wanted to import into SQL only the unique Account Master records obtained by “combining” the two systems.

Example:

FX1 has “VMS Industires” while FX2 has “V.M.S Industries”
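As a rough sanity check on the scale of the problem, the counts quoted above suggest how many unique accounts the combined list might contain. The sketch below assumes that the roughly 2000 shared entries pair up one-to-one and that there are no duplicates inside each system; neither assumption was measured, so treat the result as an upper-end estimate only.

```python
# Approximate counts quoted in the post.
fx1_total = 3500
fx2_total = 2400
overlap = 2000  # entries that appear in both systems (approximate)

# Inclusion-exclusion: count each shared entry only once.
combined_unique = fx1_total + fx2_total - overlap
print(combined_unique)
```

Under those assumptions, the target is roughly 3900 accounts rather than the 5900 rows that a plain append of both lists would produce.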

Solution

The Fuzzy Lookup add-in for Excel.

Step 1) Import data from both databases into Excel.

Step 2) Use Fuzzy Lookup to find records that match each other, based on the matching conditions that we select.

Step 3) It reorganizes the data as:

FX1 entry

1st matching FX2 Entry

2nd Matching FX2 entry

…

…

This is how we find duplicate entries and then clean our data set.
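For readers who want to see the idea behind the three steps, here is a minimal sketch in Python. It is not the add-in's algorithm: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the names `normalize`, `similarity` and `fuzzy_lookup`, the 0.8 threshold, and every sample row except the VMS pair are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase the name and drop punctuation so "V.M.S" becomes "vms"."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Return a score between 0 and 1 for two account names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_lookup(left, right, threshold=0.8, top_n=3):
    """For each left entry, list the best-matching right entries."""
    results = {}
    for name in left:
        scored = [(similarity(name, cand), cand) for cand in right]
        # Keep only candidates above the threshold, best score first.
        matches = sorted((s for s in scored if s[0] >= threshold), reverse=True)
        results[name] = [cand for _, cand in matches[:top_n]]
    return results


# Sample rows: the VMS pair is from the post, the rest are made up.
fx1 = ["VMS Industires", "Acme Traders"]
fx2 = ["V.M.S Industries", "Acme Trading Co", "Zenith Metals"]

matches = fuzzy_lookup(fx1, fx2)
for fx1_entry, fx2_entries in matches.items():
    print(fx1_entry, "->", fx2_entries)
```

The threshold plays the role of the matching conditions in Step 2: raising it returns fewer, safer matches, while lowering it surfaces more candidates that someone then has to review by hand.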

Benefit

If not for Fuzzy Lookup, I would have had to manually match each entry against every other, which would have taken an estimated 60 to 100 man-hours. With Fuzzy Lookup, we did the job in only 24 man-hours.