AdSense

Tuesday, 14 January 2014

Data Quality in ETL and BI - Reasons Impacts Solutions and Operations

The amount of Data that an organization stores and processes in the
current time has increased many folds. This has also exposed the
associated problems of poor quality of data. If the quality of data is
bad then the information created by that data is not useful. Lot of
efforts and money is being put in by organizations to improve the
quality of data.
The Quality of data refers to the following

Accuracy

Consistency

Integrity

Uniqueness

Reasons for data quality issues:

Inaccurate data entry

No process or rules in application to validate data entry

Lack of Master Data Management (MDM) strategy

Examples:

Phone number having values as 1111111111, 0000010100

Customer Name as "ABC", "ZZZZ"

Two records in a table like below:

Timothy, Kenny. 10 East Avenue

Tim, Kenny. 10 East Avenue

Impact of poor data quality:

Incomplete and misleading analysis

Increase in spending on incorrect data

Financial impacts when data is related to accounts and finance

Targeted market campaigns impacted adversely

Purchasing of data quality tools like First Logic, Trillium, etc.

Complicated ETL

Additional Cleansing in ETL process results in longer time to complete ETL cycle

Solutions for data quality:

Stringent data validations by means of rules in applications at source

Avoiding duplicate master entries by use of MDM solutions

If above not done, then using ETL process to handle data quality issues

Send feedback for bad data quality and correct at source, then reload

Maintain audit for data quality issues emerging from source systems

Data Cleansing Operations:

Removing invalid characters

remove extra and special characters from addresses, phone numbers, etc