Disclaimer:

These are my personal views and are meant for informational purposes only. Please verify the information with professional help or official references before acting on anything in this blog.

Data Quality Services

Missing Data:

Some attributes of a field are missing, such as the postal code in an address field.
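A quick way to spot this is to count how many records lack the attribute. The snippet below is only a minimal sketch: the `addresses` records and the `count_missing` helper are made up for illustration, and it treats `None` and blank strings as missing.

```python
# Hypothetical address records; None or "" marks a missing attribute.
addresses = [
    {"street": "1 Main St", "city": "Austin", "postal_code": "78701"},
    {"street": "22 Oak Ave", "city": "Dallas", "postal_code": ""},
    {"street": "9 Elm Rd", "city": "Houston", "postal_code": None},
]


def count_missing(records, field):
    """Count records where `field` is absent, None, or blank."""
    missing = 0
    for record in records:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing += 1
    return missing


missing_postal = count_missing(addresses, "postal_code")
print(f"{missing_postal} of {len(addresses)} addresses lack a postal code")
```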

Non-standardized:

Check whether all the values are standardized: Google, Google Inc, and Alphabet might need to be standardized and categorized as Alphabet.

Different date formats are used in the same field (MM/DD/YYYY and DD/MM/YYYY).
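Both checks can be sketched in a few lines. This is just an illustration under my own assumptions: the `CANONICAL_NAMES` mapping and both helper functions are hypothetical, and a real mapping would come from whoever owns the reference data. The date helper only reports which of the two formats can parse a value; a value that fits both (like 03/04/2021) is ambiguous and needs a human or extra context to resolve.

```python
from datetime import datetime

# Hypothetical mapping from known variants to one canonical name.
CANONICAL_NAMES = {
    "google": "Alphabet",
    "google inc": "Alphabet",
    "alphabet": "Alphabet",
}


def standardize_company(name):
    """Return the canonical name, or the trimmed input if no rule matches."""
    key = name.strip().lower().rstrip(".")
    return CANONICAL_NAMES.get(key, name.strip())


def matching_date_formats(text):
    """List which of the two formats can parse `text`."""
    formats = {"MM/DD/YYYY": "%m/%d/%Y", "DD/MM/YYYY": "%d/%m/%Y"}
    matches = []
    for label, pattern in formats.items():
        try:
            datetime.strptime(text, pattern)
            matches.append(label)
        except ValueError:
            pass  # this format cannot parse the value
    return matches


print(standardize_company("Google Inc."))
print(matching_date_formats("25/12/2021"))
print(matching_date_formats("03/04/2021"))
```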

Incomplete:

Total size of the data (number of rows and columns): Sometimes you may not have all the rows that you were expecting (e.g. 100k rows, one for each of your 100k customers). If the counts don’t match, that tells us we don’t have the complete dataset at hand.
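A row count alone tells you something is off, but comparing against the list of expected keys tells you what is missing. Here is a small sketch, assuming you have a list of expected customer IDs; the `customer_id` field name and the `missing_customers` helper are hypothetical.

```python
def missing_customers(expected_ids, rows, key="customer_id"):
    """Return expected IDs that have no row in the dataset, sorted."""
    seen = {row[key] for row in rows}
    return sorted(set(expected_ids) - seen)


# Made-up data: four customers expected, only two rows received.
expected_ids = [101, 102, 103, 104]
rows = [{"customer_id": 101}, {"customer_id": 103}]

gaps = missing_customers(expected_ids, rows)
print(f"expected {len(expected_ids)} rows, got {len(rows)}; missing: {gaps}")
```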

Erroneous:

Outlier: If someone’s age is 250, that’s an outlier, but it also points to an error somewhere in the data pipeline that needs to be fixed. Outliers can be detected by creating a quick data visualization.

Data type mismatch: If a text value appears in a field where the other entries are integers, that’s also an error.
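Besides a visualization, a simple range and type check can flag both kinds of errors. This is a rough sketch: the 0 to 120 bounds for age are my own assumption, and `find_bad_ages` is a hypothetical helper.

```python
def find_bad_ages(values, low=0, high=120):
    """Split values into type mismatches and out-of-range entries.

    Returns two lists of (index, value) pairs.
    """
    wrong_type, out_of_range = [], []
    for index, value in enumerate(values):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            wrong_type.append((index, value))
        elif not low <= value <= high:
            out_of_range.append((index, value))
    return wrong_type, out_of_range


ages = [34, 250, "forty", 61, -3]
wrong_type, out_of_range = find_bad_ages(ages)
print("type mismatches:", wrong_type)
print("out of range:", out_of_range)
```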

Duplicates:

Duplicates can be introduced into the data, e.g. the same rows repeated in the dataset, so the dataset needs to be de-duplicated.
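For exact duplicates, de-duplication can be as simple as remembering which rows you have already seen. A minimal sketch, assuming rows are dictionaries with hashable values; it does not handle near-duplicates (e.g. "Ann" vs "Anne"), which need fuzzy matching.

```python
def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence in order."""
    seen = set()
    unique_rows = []
    for row in rows:
        # Sort items so key order does not affect the comparison.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(row)
    return unique_rows


# Made-up data: the third row repeats the first.
rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"name": "Ann", "id": 1},
]
unique_rows = deduplicate(rows)
print(f"{len(rows) - len(unique_rows)} duplicate row(s) removed")
```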

What is the title these days for a person that assures data quality?
(I need to hire a person to make sure my data is as good as it can be. They need to inspect the data for issues, create logic for how it can be found and fixed, and finally, court the project through application development for a robust solution to stop it from occurring in the first place.)

Answer:

Quality of the data shouldn’t be the responsibility of just one person; ideally, you want all members of the team (and the broader business community) to care about it and own some part of it. But I like the idea of one person owning the “co-ordination” of how this gets done. It might not be a full-time gig in a small org, but I can see this as a full-time role in bigger orgs and enterprises. Some titles:

1) While performing the Knowledge Discovery activity:

1a: In the Discover step:

1b: Also in the Manage Domain Values step:

While profiling gives you statistics at various stages of the data cleansing or matching process, it is important to understand what you can do with them. With that, here are the statistics that we can garner during the knowledge discovery activity:

Newness

Uniqueness

Validity

Completeness
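To make these four numbers more concrete, here is how I would approximate them for a single column outside the tool. The definitions in the docstring are my own working assumptions, not the product's exact formulas, and `profile_column` is a hypothetical helper.

```python
def profile_column(values, known_values, is_valid):
    """Compute rough profiling ratios for one column.

    Assumed definitions (not taken from the product):
      completeness: share of all values that are non-empty
      uniqueness:   distinct non-empty values / non-empty values
      validity:     share of non-empty values passing `is_valid`
      newness:      share of distinct values absent from `known_values`
    """
    present = [v for v in values if v not in (None, "")]
    distinct = set(present)
    total = len(values)
    valid_count = sum(1 for v in present if is_valid(v))
    new_count = len(distinct - set(known_values))
    return {
        "completeness": len(present) / total if total else 0.0,
        "uniqueness": len(distinct) / len(present) if present else 0.0,
        "validity": valid_count / len(present) if present else 0.0,
        "newness": new_count / len(distinct) if distinct else 0.0,
    }


# Made-up state column; "TX" and "CA" are already known to the domain.
states = ["TX", "CA", "", "TX", "Texas", None, "NY", "CA"]
stats = profile_column(
    states,
    known_values={"TX", "CA"},
    is_valid=lambda v: len(v) == 2,  # assumed rule: two-letter code
)
for name, ratio in stats.items():
    print(f"{name}: {ratio:.0%}")
```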

2) While performing the Cleansing activity:

2a: In the Cleanse step:

2b: Also in the Manage and View Results step:

Here the profiler gives you the following statistics:

Corrected values

Suggested Values

Completeness

Accuracy

Note the invalid records under “Source Statistics” on the left side. In this case, 3 records didn’t pass the domain rule.
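The idea of records failing a domain rule can be sketched outside the tool as well. This is not how the product cleanses data internally; the `CORRECTIONS` table, the two-letter-code rule, and the sample values are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical correction table, for illustration only.
CORRECTIONS = {"Texas": "TX", "Calif.": "CA"}


def passes_rule(value):
    """Assumed domain rule: a two-letter uppercase code."""
    return len(value) == 2 and value.isalpha() and value.isupper()


def cleanse(value):
    """Return (output_value, status) for one source value."""
    if value in CORRECTIONS:
        return CORRECTIONS[value], "corrected"
    if passes_rule(value):
        return value, "correct"
    return value, "invalid"


source = ["TX", "Texas", "C4", "NY", "Calif.", "??", "tx"]
results = [cleanse(v) for v in source]
summary = Counter(status for _, status in results)
print(dict(summary))
```

Values that are neither corrected nor pass the rule end up counted as invalid, which is the kind of count the source statistics surface.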