Building a Modern Data Pipeline Means Making Big Decisions

In this special guest feature, John Hammink, Developer Advocate at Aiven.io, discusses how there are numerous ways to go about designing and maintaining a viable data pipeline, and why there is no silver-bullet solution for every organization. John Hammink was an early engineer in a variety of roles at Skype and F-Secure, and after stints at Nokia, Cisco and Mozilla, eventually became a content creator and developer evangelist at Arm Treasure Data. Since then he’s focused on producing content for early-stage startups, including Algorithmia, RadixDLT, and Alooma. He recently became the Developer Advocate at Aiven.io.

Data pipelines aren’t much use unless they connect to where the data is actually housed, a frustration that engineers know all too well.

Here’s how this might look: imagine running a site that tracks commodities, but you’re limited to batch-importing the commodity prices once every 24 hours. No savvy trader will trust your platform to inform accurate, well-timed decisions about when to buy or sell.

Or picture being tasked with developing a mobile game that monetizes players’ progress on a level-by-level basis. What if the game suddenly rockets to the top of the app store, and you quickly need to collect and accommodate variable-length event data such as level progress along with points, player name, rank and position, each matched with a timestamp? Can the MySQL server sitting under your desk really keep the pipeline connected and flowing to keep up with demand?
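
To make that concrete, here’s a minimal sketch in Python of what one such variable-length event record might look like. The field names are hypothetical, not taken from any particular game or API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PlayerEvent:
    """One gameplay event; 'extras' carries the variable-length portion."""
    player_name: str
    level: int
    points: int
    rank: int
    position: tuple  # e.g. (x, y) coordinates within the level
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    # Fields that vary by event type go here, so the schema can flex.
    extras: dict = field(default_factory=dict)

# A level-completion event with extra, event-specific fields attached.
event = PlayerEvent("ada", level=7, points=4200, rank=12,
                    position=(3.5, 9.0),
                    extras={"boss_defeated": True, "combo": 14})
```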

These scenarios should sound familiar. Data from disparate sources flowing into countless silos with piecemeal permissions is a challenge nearly every data engineer or developer encounters. Until recently, viewing, interpreting and analyzing in-transit data wasn’t possible; instead, data was collected and processed in batches, nowhere near real time.

The volume and velocity of data have accelerated far faster than homespun pipelines were built to handle. This has led to painful restart cycles, inconsistent data formats and a host of other challenges with serious consequences for organizations: insights end up stale and inconclusive, latency becomes untenable and overall performance is bottlenecked.

Developers simply want to get their data where it needs to go: a singular, canonical store primed for analysis. But in the face of limited resources and higher priorities for engineering teams, this isn’t always possible.

The top three challenges in creating and maintaining a modern data pipeline

Before we can identify the challenges that developers and engineers must tackle to build and optimize a data pipeline, we need to define the term “data pipeline.” For our purposes, it’s a set of automated workflows that extract data from multiple sources, and connect to those sources with a level of elasticity and schema flexibility that enables data mobility, transformation and visualization.

Within the pipeline, what, where and how data is collected should be clearly defined, as should the process for automatically extracting, transforming, combining and preparing the data for deeper analysis and visualization.
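
As a rough illustration (not a production recipe), one automated stage of such a workflow reduces to an extract-transform-load skeleton. Everything here is hypothetical: the source URL, the field names and the file-based “store” are stand-ins:

```python
import csv
import io
import json
import urllib.request

SOURCE_URL = "https://example.com/prices.csv"  # hypothetical data source

def extract(url: str) -> list:
    """Pull raw CSV from a source and parse each row into a dict."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows: list) -> list:
    """Normalize types and drop incomplete records."""
    return [
        {"symbol": row["symbol"], "price": float(row["price"])}
        for row in rows
        if row.get("symbol") and row.get("price")
    ]

def load(rows: list, path: str) -> None:
    """Write the prepared records to the canonical store (a file, here)."""
    with open(path, "w") as f:
        json.dump(rows, f)

if __name__ == "__main__":
    load(transform(extract(SOURCE_URL)), "prices.json")
```

Real pipelines layer scheduling, retries, monitoring and schema management on top, but the extract, transform and load steps keep this basic shape.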

Keeping this (complex!) definition in mind, let’s examine some common challenges to consider when designing a data pipeline.

1. Getting data where (and how) you need it to be

Achieving a complete picture of your data means getting it to a state where you can draw insights from the combined information. Your tools must support connections to as many data formats and sources as possible, including unstructured data. The challenge here is to identify the data you need in the first place, and the strategy you’ll use to ingest, combine and augment it within your pipeline.
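
For instance, combining a structured CSV export with semi-structured JSON events might look like the sketch below; the file names and fields are made up for illustration:

```python
import csv
import json

# Hypothetical inputs: a CSV of accounts and a JSON-lines file of raw events.
with open("accounts.csv") as f:
    accounts = {row["account_id"]: row for row in csv.DictReader(f)}

combined = []
with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        account = accounts.get(event.get("account_id"), {})
        # Augment each event with whatever account attributes are available.
        combined.append({**event, "region": account.get("region")})
# 'combined' now holds enriched records ready for the rest of the pipeline.
```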

2. Finding a home for your data

Once a pipeline exists, it has to take all of this newly combined data somewhere. Will it deposit it in an on-prem location? If so, there’s a litany of choices you’ll have to make: where precisely the data will reside and in what format, whether there should be redundancy within the system, the performance benchmark needed to meet the service level agreement (SLA), and so on.

Your data solution could also use managed services, which can be more expensive and are far less variable (and customizable) than running on-prem. Managed cloud services do, however, offer more tailored support along with scalable storage and memory.

3. Future-proofing your pipeline

Some enterprises are still importing data in all-or-nothing static batches. But looking to the future is critical when building pipelines that handle data, especially considering the mind-blowing rate at which data is being created.

Your organization might currently draw data from one device, system or set of sensors to power your applications, but it’s unlikely that this will be the case forever. Again, if you choose to host your solution on-prem, you have to consider the viability of that system far into the future.

There are numerous ways to go about designing and maintaining a viable data pipeline, and there is no silver-bullet solution for every organization. Those that choose a managed-service route to address these challenges may realize the most future-proof and direct path to a solution, while those that keep things on-prem can, with the right competencies and resources, architect the most custom path forward. But whichever path you choose, it’s important to make decisions quickly and strategically.
