The 4 Paradigms of Data Prep for Analytics and Machine Learning

Mike White


Data preparation has long helped business leaders, analysts, and data scientists ready the data needed for analytics, operations, and regulatory requirements. Today, the technology is becoming even more critical to deriving insights, as most enterprise data is still not ready for use by machine learning (ML) applications and requires significant effort to make it usable. In fact, most analytics or data science exercises still require data professionals to spend up to 80 percent of their time on tasks such as ingesting, profiling, cleaning, transforming, combining, and shaping data. While unfortunate, this considerable investment is necessary to ensure that raw data can be converted into reliable, useful information to drive business decisions, support operations, meet regulatory requirements, or predict optimal outcomes.

More recently, data preparation technology has evolved into a valuable tool for ML and data science workflows, enhancing applications with machine intelligence and enabling data to be transformed into information on demand. With the help of built-in intelligence and smart algorithms, business users who are closest to the data can prepare datasets quickly and accurately. These users work within an intuitive, visual application to access, explore, shape, collaborate on, and publish data with clicks, not code, under complete governance and security. IT professionals, meanwhile, can manage data volume and variety across both enterprise and cloud data sources to support business scenarios with immediate and repeatable data services.

However, not all approaches to data prep are the same, so it is important to understand the following four data prep paradigms before choosing the optimal data prep style for your organization.

Paradigm 1: Workflow vs. Spreadsheet UI

Data practitioners considering a data preparation solution are confronted with many options, but the first step in the process should focus on whether the solution adopts a workflow-oriented user interface or a spreadsheet-like one. Knowing your data persona type (or the skill set of the user base) along with the type and variability of the data at hand will help you determine the ideal user interface paradigm.


A workflow-based interface, in the style of traditional ETL (Extract, Transform, Load) tools, provides a canvas for placing components, or icons, each representing a configurable data preparation task, and for connecting those components with lines that represent dependencies and lineage within the workflow. Because of this abstraction layer, the data content is not viewable until the pre-defined workflow or job is run and the output is browsed. It is important to note that this paradigm assumes the required transformations and joins are known at the time of creation; multiple iterations and test phases may be needed to validate that the output meets end-user requirements.
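The mechanics described above can be sketched as a small dependency graph of data prep steps: the job is defined up front, the steps execute in dependency order, and the output only becomes browsable once the run completes. This is a minimal, hypothetical illustration; the step names and data are invented, not drawn from any particular product.

```python
from collections import defaultdict, deque

def topo_order(nodes, edges):
    """Order workflow components so each runs after its dependencies."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

# Each node is a configured step; edges encode dependencies and lineage.
steps = {
    "extract": lambda _: ["  Alice ", "Bob", "  Alice "],
    "trim":    lambda rows: [r.strip() for r in rows],
    "dedupe":  lambda rows: sorted(set(rows)),
}
edges = [("extract", "trim"), ("trim", "dedupe")]

def run(steps, edges):
    data = None
    for name in topo_order(list(steps), edges):
        data = steps[name](data)
    return data

# Data is only visible after the whole pre-defined job has run.
print(run(steps, edges))   # ['Alice', 'Bob']
```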

Alternatively, a spreadsheet-based interface gives end users a direct view into the data itself, presenting each data attribute as a column, often with embedded visual cues for data sparsity, uniqueness, data type mismatches, and other anomalies. This view allows the result of each transformation or step to be seen dynamically throughout the data prep process, and built-in data profiling and data quality issue detection facilitates immediate resolution within the environment. This paradigm inherently reduces the number of iterations and accelerates data prep cycles where interactive data validation and transformation are critical to the use case.
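The per-column cues described above (sparsity, uniqueness, type mismatches) amount to simple profiling statistics computed over each column. A minimal sketch, using an invented dataset and an assumed validity check for the age column:

```python
# Hypothetical records with the kinds of anomalies a spreadsheet-style
# tool would flag: a missing value, a duplicate id, a type mismatch.
rows = [
    {"customer_id": "C001", "age": "34",  "state": "CA"},
    {"customer_id": "C002", "age": "n/a", "state": "CA"},   # type mismatch
    {"customer_id": "C003", "age": "51",  "state": None},   # missing value
    {"customer_id": "C003", "age": "29",  "state": "NY"},   # duplicate id
]

def profile_column(rows, col, is_valid=lambda v: True):
    """Compute the visual-cue statistics for one column."""
    values = [r.get(col) for r in rows]
    present = [v for v in values if v is not None]
    return {
        "sparsity": 1 - len(present) / len(values),          # share missing
        "uniqueness": len(set(present)) / max(len(present), 1),
        "type_mismatches": sum(1 for v in present if not is_valid(v)),
    }

print(profile_column(rows, "state"))             # sparsity 0.25
print(profile_column(rows, "customer_id"))       # uniqueness 0.75
print(profile_column(rows, "age", str.isdigit))  # one mismatch ("n/a")
```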

Paradigm 2: Clicks vs. Code-based Approach

With the proliferation of point-and-click, drag-and-drop business intelligence tools, “ease of use” has become a key differentiator when considering data preparation software. However, the code-based approach remains a popular option for technical data users who prefer flexibility and lower software costs compared to purpose-built applications, which tend to resist customization and carry a larger price tag. At the same time, a code-based approach inherently requires a higher cost of skilled resources and maintenance: every change to the code must go through a life cycle of development, test, quality assurance, and production.
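To make the trade-off concrete, here is a minimal sketch of what the code-based approach looks like for a couple of routine cleaning steps (trimming whitespace, normalizing case, de-duplicating). The field names and rules are hypothetical; the point is that each change to such a script must pass through the development, test, QA, and production cycle described above, where a clicks-based tool would apply the same steps interactively.

```python
def clean_record(record):
    """Trim stray whitespace and normalize the state code."""
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}
    cleaned["state"] = (cleaned.get("state") or "").upper() or None
    return cleaned

def prepare(records):
    """Clean every record, then de-duplicate on customer_id."""
    seen, out = set(), []
    for rec in map(clean_record, records):
        key = rec.get("customer_id")
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

raw = [
    {"customer_id": "c001 ", "state": "ca"},
    {"customer_id": "c001",  "state": "ca"},   # duplicate after trimming
    {"customer_id": "c002",  "state": None},
]
print(prepare(raw))
```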

Paradigm 3: Sample vs. Full Data Perspective


There are use cases which require a complete data population, such as master data migration, regulatory reporting, and fraud analysis. Likewise, there are use cases which are best performed using a relevant sample or subset of the data, such as predictive analytics and marketing segmentation. The business needs and data characteristics of the use case should ultimately drive the decision when adopting a data preparation solution or approach within the organization.

For instance, a sample-based approach increases the risk of missing data quality issues, which can have a significant impact depending on the use case, and the size and sophistication of the data sample vary from product to product. Some tools enforce a hard-coded sample limit, while others allow you to select a sample size suited to your use case and available processing resources.
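The sampling risk is easy to demonstrate: a rare malformed value that falls outside the sample window is simply invisible to profiling. A minimal sketch with invented data and a hypothetical hard-coded 1,000-row sample limit:

```python
# 10,000 hypothetical records with a single bad value near the tail.
records = [{"amount": "10.00"} for _ in range(9_999)]
records.append({"amount": "ten dollars"})   # malformed amount

def count_invalid(rows):
    """Count amounts that do not parse as numbers."""
    def is_bad(v):
        try:
            float(v)
            return False
        except ValueError:
            return True
    return sum(is_bad(r["amount"]) for r in rows)

sample = records[:1_000]          # a hard-coded 1,000-row sample
print(count_invalid(sample))      # 0 -- the issue goes undetected
print(count_invalid(records))     # 1 -- full-data profiling finds it
```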

A full data perspective, which provides the ability to work with all data records and column attributes in a given dataset, enables a comprehensive approach to data profiling and data quality. Combined with the user’s understanding of the business context and the use case scenario, full dataset visibility can significantly improve data accuracy and speed the delivery of reliable information.

Paradigm 4: Stand-alone Application vs. Vendor Add-on

Another often overlooked factor is whether the solution exists as a stand-alone offering or as part of a pre-existing BI or analytics application, data science tool, or ETL environment. Selecting a data preparation offering merely because it is available as an add-on to an in-house application carries risk: its capabilities may be too limited to meet specific needs, and you must determine whether that risk is offset by the benefits of an integrated solution. In these cases, the comparison factors and considerations described above should be applied in the same way.

Regardless of which data preparation paradigm makes sense for your organization, it is critical to understand the relative strengths and challenges of the four primary styles of data prep. Giving careful consideration to each strength and challenge in light of the use case scenarios, the data characteristics, and the individuals performing the data work will ensure the highest chance of success.

Here’s a chart showing how the strengths and challenges of the different approaches stack up:

Data Prep Approach: Excel-based Data Prep
Strengths: Most advantageous for finance & accounting; one-off scenarios for data profiling and data cleaning; and ad-hoc use for scenarios that fit within one million rows.

Organizations today continue to look for ways to prepare data more quickly and accurately to solve their data challenges and to enable machine learning. Data prep technology helps business analysts, data scientists, and ML practitioners rapidly prepare and annotate their data, extending its value across the enterprise for analytic workloads. Regardless of which paradigms are currently in use at your organization, self-service data preparation solutions enable ML and data science workflows that enhance applications with machine intelligence. More importantly, they make it possible to transform data into information on demand, empowering every person, process, and system in the organization to be more agile and intelligent.

About the author: Mike White is a Certified Business Intelligence Professional (CBIP) who is currently focused on developing data analyst-centric solutions and content at Paxata. In his former role as a Senior Solutions Consultant, he worked with several prospective customers and partners to demonstrate the value Paxata can provide to their organizations. Prior to joining Paxata in 2014, he worked for Ernst & Young as an enterprise analytics consultant where he managed teams focused on risk and compliance analytics and business performance reporting for clients in the high tech, media & entertainment, and healthcare industries. Mike has a Master’s in Management Information Systems from Brigham Young University.