Case Study

Data Cleansing and Validation

Challenge

A Fortune 500 financial organization had a legacy data pipeline that performed poorly, required multiple teams to keep it operating, and was costly to maintain. The company therefore decided to revamp the pipeline so that it would also perform data validation and correction. In-house Java developers built the modernized pipeline on CDAP alongside several other technologies, with the goal of data cleansing and validation, and they tested and ran the replacement pipeline using the drag-and-drop visual interface in CDAP. The new pipeline required only limited coding to integrate custom regular expressions, and with it over three billion records were processed using the following procedure:

1. Using CDAP, extract data from Netezza and other SQL sources, perform complex joins and transformations, and load the results into HDFS.
2. Run further aggregations and joins to generate the final report.
3. Load the final report data back into Netezza, which proved seamless.

The company's in-house team built the data pipeline in less than a week using the drag-and-drop visual interface in CDAP, scheduled it to run daily, and had it report on errors, giving them the visibility into the data they needed. Beyond those capabilities, they built a pipeline-level dashboard that provided deep insight into how the offloading and report-generation process was functioning.
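The case study does not publish the customer's validation rules, but the custom regular-expression cleansing it mentions might look something like the following Java sketch. The field names and patterns here are illustrative assumptions, not the organization's actual rules:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Minimal sketch of record validation driven by custom regular expressions. */
public class RecordValidator {

    // Illustrative field rules; a real pipeline would load these from configuration.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("accountId", Pattern.compile("\\d{10}"));          // 10-digit account number
        RULES.put("amount",    Pattern.compile("-?\\d+\\.\\d{2}"));  // amount with two decimals
        RULES.put("currency",  Pattern.compile("[A-Z]{3}"));         // ISO 4217 currency code
    }

    /** Returns true only if every rule-covered field is present and matches its pattern. */
    public static boolean isValid(Map<String, String> record) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            String value = record.get(rule.getKey());
            if (value == null || !rule.getValue().matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> good = Map.of(
            "accountId", "0123456789", "amount", "42.50", "currency", "USD");
        Map<String, String> bad = Map.of(
            "accountId", "12AB", "amount", "42.5", "currency", "usd");

        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false
    }
}
```

In a pipeline, records failing validation would typically be routed to an error output for the daily error report rather than discarded silently.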
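The report-generation step aggregates the cleansed records. As a rough sketch of what one such aggregation could look like in plain Java (the record shape and grouping key are assumptions for illustration, not the actual report schema):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch of a report-style aggregation: total transaction amount per account. */
public class ReportAggregator {

    // Hypothetical record shape for a cleansed transaction.
    record Txn(String accountId, double amount) {}

    /** Groups transactions by account and sums their amounts. */
    public static Map<String, Double> totalsByAccount(List<Txn> txns) {
        return txns.stream()
                   .collect(Collectors.groupingBy(Txn::accountId,
                            Collectors.summingDouble(Txn::amount)));
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
            new Txn("A1", 10.00), new Txn("A1", 5.00), new Txn("B2", 7.50));
        System.out.println(totalsByAccount(txns)); // prints totals per account
    }
}
```

At the three-billion-record scale described above this logic would run as a distributed aggregation inside the CDAP pipeline rather than in a single JVM; the sketch only illustrates the shape of the computation.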

Benefits of Cask Solution

Rapid Time to Value

The in-house team built, tested, and deployed the custom data pipeline in just three days.

The development team required only four hours of training on CDAP before launching the project.