Clean data is happy data

Before you can start mining your dataset for insights, it’s paramount that you have all your ducks in a row.

Recently I had the opportunity to pull together data from several systems in which the main identifier turned out to be inaccurate in some cases. That experience prompted me to put together the following steps for tackling this kind of task.

Data Merging and Cleansing Process

1. COPY – Start with a copy of the data you need
At some point in the process, you will always need to go back and make sure your assumptions still ring true. A copy is also helpful when merging multiple sources, since different combinations can be attempted to increase the match rate.
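As a minimal sketch (the record layout here is hypothetical), keep the raw pull untouched and do all of your work on a copy:

```python
import copy

# Hypothetical records pulled from one source system.
raw_customers = [
    {"id": "C-001", "name": "Acme Corp"},
    {"id": "C-002", "name": "Globex"},
]

# Work only on a deep copy; the raw snapshot stays available so you can
# re-verify assumptions later or retry the merge with a different strategy.
working = copy.deepcopy(raw_customers)
working[0]["name"] = "Acme Corporation"

print(raw_customers[0]["name"])  # the original record is unchanged
```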

2. DEFINE – Create a plan on how to match
The same matching criteria likely won't work for every record. So, it's important to define the different combinations to match on and the order in which to attempt them.

Start with absolute conditions, followed by fuzzy match conditions. Also, talk with data subject matter experts to build a list of assumptions.

Bonus points if you can assign a probability of success based on the match rule used; it will pay off later.
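A match plan can be as simple as an ordered list of rules, absolute conditions first and fuzzy conditions last, each carrying an assumed probability of success. A sketch in Python (field names, rules, and thresholds are all hypothetical):

```python
from difflib import SequenceMatcher

def same_id(a, b):
    # Absolute condition: identifiers match exactly.
    return a["id"] == b["id"]

def same_name(a, b):
    # Absolute condition: names match after trivial normalization.
    return a["name"].strip().lower() == b["name"].strip().lower()

def similar_name(a, b, threshold=0.85):
    # Fuzzy condition: names are "close enough" by character similarity.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

# Attempted in this order; each rule's assumed success probability can
# ride along with the matched row for validation later.
MATCH_PLAN = [
    ("exact id",     same_id,      0.99),
    ("exact name",   same_name,    0.90),
    ("similar name", similar_name, 0.70),
]
```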

3. MERGE – Follow the plan and pull it together
This is where things start to come together. Whether you’re using advanced Excel features or ETL flows, automate as much as you can based on the plan.
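One way to automate this in Python: for each record, walk the plan's rules in order and tag the row with the first rule that matched and its assumed probability. Everything here (field names, rules, probabilities, sample data) is illustrative:

```python
def merge_records(source, target, plan):
    """Attach the first matching target record to each source record,
    recording which rule matched and its assumed success probability."""
    merged = []
    for rec in source:
        row = {**rec, "match": None, "rule": None, "probability": 0.0}
        for rule_name, condition, probability in plan:
            hit = next((t for t in target if condition(rec, t)), None)
            if hit is not None:
                row.update(match=hit, rule=rule_name, probability=probability)
                break
        merged.append(row)
    return merged

# Illustrative plan: an absolute condition first, then a looser fallback.
plan = [
    ("exact id",    lambda a, b: a["id"] == b["id"],              0.99),
    ("name prefix", lambda a, b: b["name"].startswith(a["name"]), 0.70),
]

crm = [{"id": "C-001", "name": "Acme"}, {"id": "C-003", "name": "Initech"}]
erp = [{"id": "C-001", "name": "Acme Corp"}, {"id": "C-002", "name": "Globex"}]
rows = merge_records(crm, erp, plan)
```

Each output row now carries a `rule` and `probability` alongside the matched record, which is exactly what the validation step needs.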

4. VALIDATE – Trust nothing and ensure accuracy
You likely can't depend on manual verification, so find a way to automate. This is where a match-accuracy probability per row becomes very helpful.

Remember that in this step you are not only checking the data but also proving out the assumptions. Based on your findings, you may return to step 3 and attempt a different approach.
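One hypothetical way to automate this: if each merged row carries the probability of the rule that matched it, validation can triage rows automatically and reserve human attention for the doubtful ones. The threshold below is an assumption to tune with your data subject matter experts:

```python
def triage(merged_rows, review_threshold=0.80):
    """Split merged rows into auto-accepted, needs-review, and unmatched,
    based on the per-row match probability assigned during the merge."""
    accepted, review, unmatched = [], [], []
    for row in merged_rows:
        if row["match"] is None:
            unmatched.append(row)
        elif row["probability"] >= review_threshold:
            accepted.append(row)
        else:
            review.append(row)
    return accepted, review, unmatched

rows = [
    {"id": "C-001", "match": {"id": "C-001"}, "probability": 0.99},
    {"id": "C-002", "match": {"id": "C-007"}, "probability": 0.70},
    {"id": "C-003", "match": None,            "probability": 0.0},
]
accepted, review, unmatched = triage(rows)
```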

5. CLEAN – Remove duplicates and erroneous data
At this point, things look good. In this pass, remove any duplicates.

As well, mark and segregate anything that can't be confirmed. This happens more often than people admit and may require manual merging by a data subject matter expert.

And finally, step back and validate again.
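A sketch of the cleaning pass under the same assumptions: keep the highest-probability row per identifier, and segregate anything unconfirmed for a data subject matter expert to merge by hand:

```python
def clean(rows):
    """Deduplicate by identifier, keeping the highest-probability match,
    and set aside rows that could not be confirmed."""
    best = {}
    unconfirmed = []
    for row in rows:
        if row["match"] is None:
            unconfirmed.append(row)  # route to a data SME for manual merging
            continue
        key = row["id"]
        if key not in best or row["probability"] > best[key]["probability"]:
            best[key] = row
    return list(best.values()), unconfirmed

rows = [
    {"id": "C-001", "match": {"id": "C-001"}, "probability": 0.99},
    {"id": "C-001", "match": {"id": "C-009"}, "probability": 0.70},  # duplicate
    {"id": "C-002", "match": None,            "probability": 0.0},
]
deduped, unconfirmed = clean(rows)
```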

When done correctly, this process implements two key practices:

1. Identifying and removing major errors and inconsistencies, both within single data sources and when combining multiple sources
2. Automating matching and validation wherever possible, so manual inspection is reserved for the records that genuinely need it