Subscribe

Follow Us

Data Profiling with Informatica Data Quality

One of the first steps in solving a data quality problem is to perform data profiling. As seen in Jason Hover’s article, Data Profiling: What, Why and How?, data profiling allows you to analyze your data to determine what it looks like and what problems exist in the data. Manual data profiling can be performed; however, using software such as Informatica Data Quality allows both data stewards and developers to collaboratively profile the data in a common repository more quickly, often yielding a more thorough analysis.

Informatica Data Quality offers profiling capabilities within both the Informatica Analyst and Informatica Developer tools. Within Analyst, data stewards can quickly create a profile of the data:

Data Profiling With Informatica Data Quality Image (A.1)

As seen in image (A.1), a data steward can see that there are 4 records with a NULL value in column ADDR2. The data steward can also see that the values are mixed case. Here, the data steward may determine that NULL values are not valid and that the values should be standardized to uppercase. The values of San Diego, SD, SAN DIEGO should also all be SAN DIEGO. The data steward can add this as a comment for the developer to create a rule (A.2).

Data Profiling With Informatica Data Quality Image (A.2)

In the Developer tool, the same profile can be viewed with the comments provided by the data steward (A.3).

Data Profiling With Informatica Data Quality Image (A.3)

The developer can now apply profile rules to the ADDR2 column based on the business rules provided. Applying profiling rules allows the data steward to identify possible inconsistencies in the data and potentially correct the data in the source system. Note: The output of the profile rule is a virtual column in the profile; the rule has not been applied to the data physically until a mapping has been created in Developer.

Informatica Data Quality provides a number of pre-built rules, which can be applied directly to the profile within either the Analyst or Developer tool. In this example, we will use the pre-built rules of rule_Completeness and rule_UpperCase and create a new rule to standardize the values for SAN DIEGO.

Before creating the rule to standardize values for SAN DIEGO, we need to create a list of valid values. In Informatica Data Quality, the list of valid values can be created as a reference table. The reference table will be used as a lookup to determine what value should be returned. TIP: A quick way to create a reference table with valid values is to add the values from the data that has been profiled. Right click on the value and select Add to Reference Table. (A.4).

Data Profiling With Informatica Data Quality Image (A.4)

Once the table is created, the list of valid values can be added (A.5). The ADDR2 column serves as the valid value. If the value in column1, column2, or column3 exists, it will return the valid value found in column ADDR2. The reference table can then be maintained through the Analyst tool by the data steward (where an audit trail is captured).

Data Profiling With Informatica Data Quality Image (A.5)

TIP: If the data steward does not want to edit the table to include additional columns, the same valid value in column ADDR2 can be added and the additional values can be listed in column1, column2, column3 as highlighted in Image (A.6).

Data Profiling With Informatica Data Quality Image (A.6)

Now we can create a rule in the Developer tool to use the reference table. After applying the rule in the profile, we can see that there are now 3 records with the value of SAN DIEGO (A.7).

Data Profiling With Informatica Data Quality Image (A.7)

Although we are using 3 different profile rules, we can take it one step further and combine the rules into one profile rule. The rules are seen as mapplets in Developer. In this example, we will create one mapplet named rule_ADDR2 that will include 3 mapplets (one for each rule previously identified) (A.8).

Data Profiling With Informatica Data Quality Image (A.8)

Taking the output from rule_Completeness, we will determine if the record is valid (A.9).

Data Profiling With Informatica Data Quality Image (A.9)

TIP: Make sure to validate the mapplet as a Rule so that it will appear in the Analyst tool (A.10).

Data Profiling with Informatica Data Quality Chart (A.10)

After this rule is applied in the profile and re-run, the data steward can see how many records are valid or invalid. The data steward can then drilldown on the invalid records.

Data Profiling with Informatica Data Quality Chart (A.11)

As seen in Image (A.11), there are 4 invalid records because they are NULL. The data steward can then export this data, perform further analysis, update the source system, update the reference table, and/or work with the developer to update the rule. Once profiling of the data is complete and all rules have been identified/updated, then the rules can be physically applied to the data and cleansed.

This data profiling example illustrates one of many techniques that can be used within Informatica Data Quality. Collaboration amongst the data steward and developer during the data profiling process are key in taking the first steps in solving a data quality problem.

PARTNER WITH US

Do you have a current Data Quality initiative? How is your Data Quality project progressing? We’d like to hear from you. Take a few minutes to send us an email and let us know about your project. If you’d like further assistance with your current project, feel free to contact us directly. We’d be happy to talk with you and develop a proposal around your project.