How to Automate Your Data Pipeline

In this special guest feature, Taylor Barstow, CEO & Co-Founder of Bedrock Data, discusses new tools that have begun to automate how data is prepared for analytics, starting with connecting data sources. These tools address the needs of data scientists, most of whom say they view data preparation (getting raw data ready for analysis) as the least enjoyable part of their work. Taylor is a technology innovator with a proven track record as both a developer and an engineering executive. He is a co-founder of Bedrock Data, prior to which he was VP of Engineering for the enterprise graphics workflow solution Workflow by Design. Taylor has been developing his wide-ranging technical skill set for the past two decades and has a penchant for bringing business-focused solutions to market to solve real-world problems.

It’s a universal truth in technology that if you take a manual process, you can save time or money — or create other business value — by automating it. And yet surveys show that manual data preparation and ingestion still consume up to 80% of the time spent on an analytics project.

Even in 2018, IT teams find themselves extracting data from various SaaS applications by hand, transforming formats with custom code, and loading the data into siloed systems. Seventy-six percent of data scientists say they view data preparation (getting raw data ready for analysis) as the least enjoyable part of their work.

And it’s not just about making “data people happier” — disparate data holds companies back. 55% of B2B companies cite the “inability to merge data from disparate sources in a timely manner” as the biggest challenge their organization faces in leveraging data to achieve their go-to-market goals.

To solve this problem, new tools have begun to automate how data is prepared for analytics, starting with connecting your data sources.

Connect Data Sources

At most organizations, data is spread across many systems. The average small-to-medium enterprise uses 12+ SaaS systems to manage customer data, including marketing automation tools like HubSpot, Marketo, and Salesforce Pardot; CRMs like Salesforce, Microsoft Dynamics, and NetSuite; and customer support tools like Zendesk and Help Scout, in addition to finance, event management, online meetings and webinars, and more. Large enterprises get into the hundreds of applications.

Each of these systems has both its own set of objects (e.g. Lead, Opportunity, Ticket, Campaign, Contact, Account, Activity, Company, Product, Order, User) and relationships to other objects. New tools that automate how data sources connect will keep objects and their relationships intact, both of which are essential for analysis.

Consolidate & Normalize

Now that you have your discrete data sets, you need to consolidate and normalize them into a fully integrated, unified data set. Normalized data makes for simpler, faster queries by organizing records consistently across all tables.

To avoid bringing duplicates into your consolidated data set, be sure to de-duplicate data first. As you automate the de-duplication of data, you will also want to account for inconsistent formats. To create consistency, look at how data is input and create rules for standardizing the formats across systems. This will reduce the amount of normalization needed later.
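To make the idea concrete, here is a minimal Python sketch of rule-based standardization followed by de-duplication. The field names and normalization rules are illustrative assumptions, not any particular vendor's implementation:

```python
import re

def normalize_phone(raw):
    """Strip punctuation so '(555) 123-4567' and '555.123.4567' match."""
    return re.sub(r"\D", "", raw)

def normalize_email(raw):
    """Lower-case and trim so 'Ann@Example.com ' and 'ann@example.com' match."""
    return raw.strip().lower()

def deduplicate(contacts):
    """Keep the first record seen for each normalized email address."""
    seen = {}
    for c in contacts:
        key = normalize_email(c["email"])
        if key not in seen:
            seen[key] = {**c, "email": key,
                         "phone": normalize_phone(c.get("phone", ""))}
    return list(seen.values())

contacts = [
    {"email": "Ann@Example.com", "phone": "(555) 123-4567"},
    {"email": "ann@example.com ", "phone": "555.123.4567"},
]
deduped = deduplicate(contacts)
print(deduped)  # one record survives, phone normalized to '5551234567'
```

Standardizing formats before comparing records is what lets two superficially different rows collapse into one.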

But what if a customer recorded in two different systems has a different address in each? Because such conflicts are so common, resolving them by hand is impractical. Luckily, there are two automated approaches: System of Record (SoR) and Most Recently Updated (MRU). With SoR, you rank systems based on your business priorities and automate which system overrides the others, ensuring one record is used for any given piece of information. In contrast, MRU uses whichever value was modified last across systems, for a given field. So if a customer’s phone number differs between your CRM and your support system, the most recently updated value wins.
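The two conflict-resolution strategies can be sketched in a few lines of Python. The system names, ranking, and timestamps below are invented for illustration:

```python
from datetime import datetime

# The same customer as seen in two systems; 'updated' marks last modification.
records = [
    {"system": "crm",     "phone": "555-0100", "updated": datetime(2018, 3, 1)},
    {"system": "support", "phone": "555-0199", "updated": datetime(2018, 5, 12)},
]

# System of Record: rank systems by business priority (lower = higher priority).
SOR_RANK = {"crm": 1, "support": 2}

def resolve_sor(records, field):
    """The highest-ranked system's value wins."""
    winner = min(records, key=lambda r: SOR_RANK[r["system"]])
    return winner[field]

def resolve_mru(records, field):
    """The most recently modified value wins."""
    winner = max(records, key=lambda r: r["updated"])
    return winner[field]

print(resolve_sor(records, "phone"))  # '555-0100' (CRM outranks support)
print(resolve_mru(records, "phone"))  # '555-0199' (support edited later)
```

Note the two strategies can disagree, as they do here, which is why choosing between them is a business decision, not a technical one.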

In addition to consistency in formats, matching like records across systems with common identifiers is also important for modeling and crafting a standard schema. When matching Contact records, for instance, an email address is a common identifier that offers the highest probability of a unique match across systems. As you automate your data pipeline, you might also incorporate multi-level de-duplication keys such as Name, Company, and Address.
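One way to implement that priority order is a function that returns the strongest available match key, falling back to a composite when no email is present. The field names here are illustrative assumptions:

```python
def match_key(record):
    """Return the strongest identifier available for cross-system matching.

    Email is preferred because it offers the highest probability of a
    unique match; the name/company/address composite is a fallback."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    parts = (record.get("name", ""), record.get("company", ""),
             record.get("address", ""))
    return ("composite", "|".join(p.strip().lower() for p in parts))

crm_rec = {"email": "Ann@Example.com", "name": "Ann"}
support_rec = {"email": "ann@example.com", "name": "Ann B."}
no_email_rec = {"name": "Ann", "company": "Acme", "address": "1 Main St"}

print(match_key(crm_rec) == match_key(support_rec))  # True: same contact
print(match_key(no_email_rec))  # falls back to the composite key
```

Records that share a key are candidates for merging, with conflicts then resolved by SoR or MRU rules.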

Warehouse the Data

Oftentimes, data from each system will then enter its own warehouse. But this siloed approach constrains analysts. Rather than running SQL queries against all of these systems in one place, analysts need to be familiar with how each separate database is organized. Parsing which idiosyncrasies belong to which system makes joining tables time-consuming and analyzing data laborious. Even if your data are immaculate, the BI dashboard you use for visualization can typically connect to only one of your many SaaS applications at a time.

One approach worth considering is a fused database, which gives access to multiple datasets simultaneously. This way, when data inputs invariably change, connected data sources won’t get out of sync across silos.
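As a toy illustration of the payoff, the sketch below lands two sources in one SQLite database so a single JOIN answers a question that would otherwise require querying two systems. The table and column names are invented:

```python
import sqlite3

# In-memory stand-in for a fused warehouse: both sources land in one
# database, so a single JOIN replaces two per-system queries.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE crm_contacts (email TEXT PRIMARY KEY, name TEXT);
CREATE TABLE support_tickets (email TEXT, subject TEXT);
INSERT INTO crm_contacts VALUES ('ann@example.com', 'Ann');
INSERT INTO support_tickets VALUES ('ann@example.com', 'Login issue');
""")

rows = db.execute("""
    SELECT c.name, t.subject
    FROM crm_contacts c
    JOIN support_tickets t ON c.email = t.email
""").fetchall()
print(rows)  # [('Ann', 'Login issue')]
```

Because both tables share one normalized join key, a BI tool pointed at this single database sees customer and support data together.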

Another advantage of this approach is the ability to create a universal schema of all objects and their relationships to one another. Mapping the relationships between matched records into a single schema is best done with a graph database, which users can query directly without having to map the relationships themselves.
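The shape of such a schema can be sketched with a plain adjacency list: nodes are (object type, id) pairs and edges are named relationships. A real graph database would persist and index this, but the structure is the same; the objects and relationship names below are illustrative:

```python
# Minimal adjacency-list sketch of a universal schema.
graph = {}

def relate(src, rel, dst):
    """Record a directed, named relationship between two objects."""
    graph.setdefault(src, []).append((rel, dst))

relate(("Account", "acme"), "has_contact", ("Contact", "ann@example.com"))
relate(("Contact", "ann@example.com"), "opened", ("Ticket", "T-1"))

def neighbors(node, rel):
    """Follow a relationship from one object to its targets."""
    return [dst for r, dst in graph.get(node, []) if r == rel]

print(neighbors(("Account", "acme"), "has_contact"))
```

Traversals like "all tickets opened by contacts at an account" then become chained neighbor lookups instead of hand-written multi-table joins.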

Feed Analytics, Reports & Dashboards

Unlike the traditional analytics stack, a master data set combines data prep, ETL, warehousing, and modeling functions into a single layer, so data is instantly ready for analysis.

This is especially helpful for IT directors and data analysts because customer data are accurate, up-to-date, and much easier to join. Automating your data pipeline therefore has several major advantages. It connects siloed data sources, cleans data, saves teams from the traditionally tedious processes of data integration, preparation and ingestion, and gives the entire business quick access to dashboards and business intelligence (BI) tools they can trust.
