Data quality, software tips & tricks, stories from the field

DQIT Series Part 2: Manual Cleansing—No More Heavy Lifting

We all know that data quality is a continuous process. You also know, at a fundamental level, that the process is going to include manual cleansing—it’s necessary and it’s unavoidable. But don’t despair, because thanks to modern tools and technology, it doesn’t have to be a dirty job.
In Part 1 of the DQIT series, we discussed some basic aspects: what an issue is, what constitutes a workflow, common issue types, how the DQIT application appears, and some features that are offered. It was stated then at the end that in Part 2 we’d begin to get our hands dirty. A promise is a promise so let’s dive in.

So there you are. A list of issues is staring you in the face. If you are like some other data stewards, the questions in your mind begin racing:

How could this have happened?

Where did it all come from?

How will I correct the problems?

Let’s first answer a logical question. How does a manual cleansing issue distinguish itself? Well, a cleansing issue type simply consists of a record requiring manual editing. DQIT stores the “faulty” input source record, the associated list of detected problems, and the recommended cleansed values.

The following figure shows a typical manual cleansing issue type entry. To answer the question of where the issue comes from, simply view the System and Entity fields, which pinpoint the exact source of the issue (applicable to all issue types).

Let’s take a closer look at this issue involving the name field data. Clicking on the issue in the Issue List Screen opens the Issue Resolution Screen below. On the left, we have our source system data. The same record is on the right but in a “cleansed” form.

One of the name conflicts that arise includes multiple pattern possibilities, as in the above example. Peter John, John Peter…? Which one is it supposed to be? Well, here’s why the issue was created, along with the fact that the last name—if it is the last name— looks to be misspelled. The DQ tool needs some guidance on decisions like this. Upon clicking Get Proposals, one can see why the system had such a tough time figuring it out on its own. Not only could the names be arranged either way, but the “John” portion of the name needs updating. It’s now up to the data steward to sort this confusion out and select the record that corrects the error(s) and best represents the applicable real-life entity. The selected data then appears in the “Corrected data” (right) column of the Issue Resolution Screen. Clicking on Validate confirms the data quality of the corrected data, once again, utilizing the DQ tool.

In another example, we have an issue within an address entity.

It’s pretty logical to assume there are multiple possibilities in New York involving “Main St.”. That’s exactly the issue that came up in the preceding example. The DQ tool could not judge which address (in this case, city/postal code) was the correct data, based on the other information in the record. The error message on the left confirms this.

Another method to correct record is to directly click on individual Corrected data fields, delete the incorrect data, and type in the correct data. In the case of a simple gender mismatch, this method would be more suitable, but for most other cases that are not so obvious, selecting from the proposals is the correct procedure.

Depending on your configured export frequency or method, the resolved issues are then sent back to their original source system to be used going forward as the model data. On rare occasion, it is possible that the issue was not fixable due to the source system not supporting the changes to the data. In this case, the data steward arranges for the “resolved” issue to be sent to an exceptions list. This exception then serves as a reference for the DQ tool in the event the issue arises in the future, preventing the issue from re-entering DQIT.

So, with DQIT, manual cleansing doesn’t have to involve hours and hours of sifting through individual records to find inconsistencies and errors; the system does it for you. It turns what had been hard labor into a quick and easy job.