The Caveman’s Guide to Data Profiling: Part 3

In Part 2, we discussed how to create a script to generate SQL to check for potentially invalid source data prior to beginning work on an ETL project. In our final installment, we’ll discuss some of the other data issues that you will want to evaluate and what your final meta-data script might look like.

As we said at the beginning of this series, when available, automated tools for profiling are certainly the best as they will offer you a menu of data profiling options to choose from. However, when they are not available, you will need to pick what metadata questions matter the most to your project. Here are the most likely candidates:

Range of values. Beyond the extreme dates of birth example we explored in Part 2, you may to evaluate other questions such as “What’s the biggest and smallest account?” or “Who’s been a customer the longest?” Fortunately, the template we used in Part 2 can be the starting point for all of those questions.

What is the cardinality of the data in both senses of the word? Both the relationships between parent and child data and the distribution of any code values in the data. And while it’s beyond the scope of this series, with some more work, a histogram of the source data could also be generated.

And finally, do the values in the source data match the values described in the source documentation?

Let’s explore than last bullet point first, since there are two ways to do this. The first way is to simply combine this task with the task outlined in the second bullet point and manually check for any invalid values later. Remember, you’re going to need to document your migration/ETL rules element by element so “eyeballing” the data is actually more practical than it might seem. The second way is to exclude the values listed in the documentation and display only the invalid values in the WHERE clause. The pseudo code for that will look something like this:

SELECT ‘SELECT DISTINCT’ || column_name

|| ‘FROM ‘ || v_table_name

|| ‘WHERE ‘ || column_name || ‘ NOT IN (<list of values>);’

FROM table_columns_with_datatypes – this is your data dictionary view

So let’s say your client has told you that they only employ salaried staff and outsource all other functions. But your subject matter expert says that no, housekeeping and cafeteria staff used to be employed directly. Let’s see who’s right.

SELECT DISTINCT job title

FROM product

WHERE column_name NOT IN (‘ACCOUNTANT’,’CONSULTANT’,’MANAGER’,’EXECUTIVE’);

And you get the following output.

job_title

JANITOR

COOK

CASHIER

SERVER

Whoa! It turns out that one point they did have hourly workers. You’ll need to develop a business rule for handling that legacy data. But there’s one problem with this approach. You’ll need to modify the query almost on a case by case basis. This might be useful if you have reason think there is a specific problem.

The other method is to simply combine the Cardinality check with the valid values check. The technique for that is this:

SELECT ‘SELECT ’ || column_name || ‘, COUNT(*),

|| ‘FROM ‘ || v_table_name

|| ‘GROUP BY || column_name

FROM table_columns_with_datatypes – this is your data dictionary view

Which results in ….

SELECT beverage_code, count(*)

FROM product

GROUP BY beverage_code;

And you get the following output ….

beverage_code count(*)

MANAGER 6

ACCOUNTANT 12

CONSULTANT 14

EXECUTIVE 3

JANITOR 2

COOK 4

SERVER 3

CASHIER 5

Note that if you take this approach, you’ll have to add additional logic in the WHERE clause to ensure that you are only polling look up table values. So let’s put this all together into one final script and add that logic.

There will of course be many, MANY more tweaks and updates both as you explore your data on your own, and as you elicit feedback from your team and your client, but this will get you started on the concept. Good Luck!