What is a data pipeline?

With so many choices now available for moving data, the key differentiators can get lost in the noise. The question ‘what is a data pipeline?’ becomes especially relevant when you need to move mission-critical data out of transactional data stores. It grows even more relevant in the cloud, where requests for copies of data become more frequent.

These 5 tips may help clarify which attributes of a data pipeline are most important in your next project.

Tip #1: Without transaction semantics, you have bad data

Kafka is a good example of a data pipeline that compromises on transaction semantics when it carries database changes: it guarantees event ordering only within a partition, not across partitions or topics. Many of its sources and targets don’t need the strict event ordering required to keep transactional data correct, which works well for click data and various asynchronous message streams. However, a data pipeline that fails to honor transaction semantics can corrupt transactional data and produce misleading results. This is true even for analytics applications and NoSQL targets.

For transactional databases, Griddable preserves the order of transactions throughout its data pipeline. Its relay service reads transactions in original order using the system change numbers (SCN) from the source database redo log. The Change History service persists changes, eliminating the need for multiple reads of the redo log. The consumer component of the data pipeline uses the same SCN to pull changes in event order. The bottom line: the target on a Griddable data pipeline is always timeline-consistent with the source.
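The ordering idea can be illustrated with a minimal sketch. This is not Griddable's implementation; the `Change` record, its fields, and `apply_in_commit_order` are hypothetical names chosen for illustration, and the target is modeled as a plain dictionary. The sketch assumes every change carries the SCN of its transaction, applies changes in ascending SCN order, and skips anything at or below the last SCN already applied so that a replay does not reapply old changes.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass(frozen=True)
class Change:
    """Hypothetical change record; field names are assumptions."""
    scn: int             # SCN of the transaction that owns this change
    seq: int             # position of the change inside its transaction
    table: str
    op: str              # "INSERT", "UPDATE" or "DELETE"
    key: int
    value: object = None


def apply_in_commit_order(changes, target, last_applied_scn=0):
    """Apply changes one transaction at a time, in ascending SCN order.

    `target` is a dict of {table: {key: value}} standing in for a database.
    Returns the highest SCN applied, to be stored as the new checkpoint.
    """
    pending = sorted(
        (c for c in changes if c.scn > last_applied_scn),
        key=attrgetter("scn", "seq"),
    )
    for scn, transaction in groupby(pending, key=attrgetter("scn")):
        for change in transaction:
            rows = target.setdefault(change.table, {})
            if change.op == "DELETE":
                rows.pop(change.key, None)
            else:  # INSERT and UPDATE both write the latest value
                rows[change.key] = change.value
        last_applied_scn = scn  # advance only after the whole transaction
    return last_applied_scn
```

If the update with SCN 102 arrives before the insert with SCN 101, the sketch still applies the insert first, so the target ends with the updated value rather than the stale one.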

Tip #2: Capacity must be elastic

Once in the cloud, demand for data grows rapidly because users are no longer limited by infrastructure. Yet many data pipelines are one-to-one, point-to-point tools that simply do not scale. Many use cases share data from one source with many targets simultaneously and add targets dynamically as demand arises. Data pipeline capacity must be elastic: it should grow with new data subscriptions and shrink when projects complete.
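As a rough illustration of the one-to-many pattern, the sketch below fans each event from a single source out to every current subscriber and lets targets join or leave at run time. The `FanOut` class and its method names are hypothetical and exist only for this example; a real pipeline would add buffering, retries, and back-pressure.

```python
class FanOut:
    """Hypothetical one-source, many-target dispatcher."""

    def __init__(self):
        self._targets = {}  # target name -> delivery callable

    def subscribe(self, name, deliver):
        """Add a target while the pipeline is running."""
        self._targets[name] = deliver

    def unsubscribe(self, name):
        """Remove a target, for example when a project completes."""
        self._targets.pop(name, None)

    def publish(self, event):
        """Deliver one event to every current target; return the target count."""
        deliveries = list(self._targets.values())
        for deliver in deliveries:
            deliver(event)
        return len(deliveries)


# Usage: two targets share one source, then one of them leaves.
warehouse, cache = [], []
pipeline = FanOut()
pipeline.subscribe("warehouse", warehouse.append)
pipeline.subscribe("cache", cache.append)
pipeline.publish({"id": 1})
pipeline.unsubscribe("cache")
pipeline.publish({"id": 2})
```

After this run, `warehouse` holds both events and `cache` holds only the first.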

To fit the dynamic cloud environment, the Griddable data pipeline runs on a portable and elastic Kubernetes infrastructure. With cluster autoscaling enabled, a Kubernetes cluster adds nodes on demand to provide additional compute capacity when required. Further, the Kubernetes graphical UI provides visibility into the operational metrics of the cluster.

Tip #3: Look for redundancy

A data pipeline will quickly become one of the most important items in your cloud infrastructure, and an interruption in real-time data could be critical to production. Look for data pipelines that are designed for failure, with high availability built directly into the product. Redundancy is the key. Data pipelines with active or passive (standby) failover components keep operating when the underlying infrastructure fails.

The Griddable data pipeline sets up relays close to data sources and consumers close to data targets. Both relays and consumers support scalability and high availability by activating pairs or multiples of similar components. These additional relays and consumers can run in standby or active mode. When they are active, a Griddable policy partitions traffic across the relays or consumers so that their throughput adds up. This linear scaling of Griddable components is how the data pipeline handles extremely large source or target databases.
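One common way to partition traffic across several active components is to hash a stable key and take the result modulo the number of active components. The sketch below shows that general technique only; it is not Griddable's policy engine, and `pick_consumer`, `partition`, and the consumer names are hypothetical. It uses SHA-256 rather than Python's built-in `hash()` so the assignment is stable across processes.

```python
import hashlib


def pick_consumer(key, active):
    """Map a key to one of the active consumers, deterministically."""
    if not active:
        raise ValueError("at least one active consumer is required")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(active)
    return active[index]


def partition(events, active):
    """Split (key, payload) events across consumers, preserving arrival order."""
    assignment = {name: [] for name in active}
    for key, payload in events:
        assignment[pick_consumer(key, active)].append((key, payload))
    return assignment


# Usage: every event for the same key lands on the same consumer.
consumers = ["consumer-a", "consumer-b", "consumer-c"]
events = [("acct-1", "debit"), ("acct-2", "credit"), ("acct-1", "credit")]
plan = partition(events, consumers)
```

Because all changes for a given key go to one consumer, per-key ordering is preserved. The trade-off of plain modulo hashing is that most keys move when the number of active consumers changes; consistent hashing is a common alternative when that matters.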

Dynamic scale and high availability are possible only because of the Griddable Kubernetes infrastructure. The Griddable Kubernetes cluster automatically adds compute nodes to grow capacity when required. As migration projects complete and container resources are freed, the cluster in turn releases infrastructure as well.

Tip #4: Point-to-point is just the beginning

Many modernization use cases today require moving data to multiple targets simultaneously. Whether you are replacing a legacy Oracle database or re-architecting for microservices, a data pipeline must replicate efficiently to multiple destinations. In addition, it must support unique data customizations for each destination. With Griddable, schema and data migration are handled by the Griddable policy engine. Griddable rearchitects a monolithic database containing multiple schemas into many databases, each of which contains only the relevant data.
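The idea of per-destination customization can be sketched as a small routing table. This is an illustrative example only, not the Griddable policy format: the `POLICIES` structure, the target names, and the `route` function are all assumptions. Each target lists the schemas it accepts and the columns it should not receive.

```python
# Hypothetical per-target policies: accepted schemas and columns to drop.
POLICIES = {
    "billing_db": {
        "schemas": {"billing"},
        "drop_columns": {"internal_notes"},
    },
    "orders_service": {
        "schemas": {"orders", "inventory"},
        "drop_columns": set(),
    },
}


def route(change, policies=POLICIES):
    """Return {target: row} for every target whose policy accepts the change."""
    deliveries = {}
    for target, policy in policies.items():
        if change["schema"] not in policy["schemas"]:
            continue  # this target does not subscribe to the schema
        deliveries[target] = {
            column: value
            for column, value in change["row"].items()
            if column not in policy["drop_columns"]
        }
    return deliveries


# Usage: a change in the monolith's billing schema reaches only billing_db.
change = {
    "schema": "billing",
    "row": {"invoice_id": 7, "amount": 120, "internal_notes": "call back"},
}
result = route(change)
```

Here `result` contains a single entry for `billing_db`, with the `internal_notes` column removed.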

Tip #5: Protect regulated data

Today nearly every database contains regulated data, because all personally identifiable information must be kept secure. The penalties for data breaches are becoming increasingly severe, and the public exposure is even worse. Even with legal safeguards, it’s simply not advisable to copy personal data out of its country of origin. The data pipeline must mask, encrypt, selectively replace or remove regulated data without impacting the operation of the data pipeline.
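A minimal sketch of three of these actions (mask, replace, remove) applied to a row in flight is shown below. The rule names, the `RULES` mapping, and the helper functions are hypothetical, and encryption is left out to keep the example within the standard library. Replacement is done here with a keyed HMAC-SHA256 so the same input always yields the same token, which keeps joins on the replaced column working.

```python
import hashlib
import hmac


def mask_email(value):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def pseudonymize(value, secret):
    """Replace a value with a keyed, repeatable token (HMAC-SHA256)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-column rules; unlisted columns pass through unchanged.
RULES = {"email": "mask", "ssn": "remove", "customer_id": "pseudonymize"}


def protect(row, rules, secret):
    """Return a copy of the row with regulated columns protected."""
    protected = {}
    for column, value in row.items():
        action = rules.get(column)
        if action == "remove":
            continue
        if action == "mask":
            protected[column] = mask_email(value)
        elif action == "pseudonymize":
            protected[column] = pseudonymize(str(value), secret)
        else:
            protected[column] = value
    return protected


# Usage: the secret would come from a key store, never from source code.
row = {"customer_id": 42, "email": "ann@example.com", "ssn": "000-00-0000", "plan": "pro"}
safe = protect(row, RULES, secret=b"demo-only-secret")
```

In `safe`, the `ssn` column is gone, `email` becomes `a***@example.com`, `customer_id` is a 64-character hex token, and `plan` is unchanged.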