Most data scientists and database experts know that transactional data is different: it is valuable only within an ecosystem that preserves transactional semantics. Yet transactional data is still sometimes shared over buses or queues that treat it like click data or asynchronous messages.

Pipeline components

Each Griddable data pipeline is built from one or more instances of three basic components: relays, consumers, and change history servers.

First, relays pull change data using a source-dependent protocol, such as JDBC with LogMiner for Oracle or the binlog protocol for MySQL and MariaDB. The relay then applies replication policies to each event to determine which events to transmit to consumers. Finally, the relay publishes the selected events to a circular in-memory buffer, from which consumers pull them for entry into downstream target databases.
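The relay's filter-and-publish step can be sketched as below. The event shape, the policy format, and the buffer size are illustrative assumptions, not Griddable's actual API; a `deque` with `maxlen` stands in for the circular in-memory buffer.

```python
from collections import deque

BUFFER_SIZE = 10_000  # assumed capacity; oldest events fall off when full


class Relay:
    def __init__(self, policies):
        self.policies = policies                 # predicates over change events
        self.buffer = deque(maxlen=BUFFER_SIZE)  # circular in-memory buffer

    def on_change_event(self, event):
        # Apply replication policies: publish only events every policy accepts.
        if all(policy(event) for policy in self.policies):
            self.buffer.append(event)


# Example policy: replicate only changes to the "orders" table.
relay = Relay(policies=[lambda ev: ev["table"] == "orders"])
relay.on_change_event({"table": "orders", "scn": 101, "op": "INSERT"})
relay.on_change_event({"table": "audit_log", "scn": 102, "op": "INSERT"})
print(len(relay.buffer))  # 1 -- only the "orders" event was published
```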

Consumers continuously pull replication events from relays over HTTP and store them in consumer event buffers until they are processed and applied to the target database. A consumer maintains the original transaction order as it pulls events from the relay. When a relay no longer contains a required event, the consumer pulls the change from a persistent change history server instead. As it pulls and processes transactions, the consumer records its state as the last successfully processed transaction.
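The consumer's pull loop, including the fallback from relay to change history server, can be sketched as follows. `EventSource`, `get_after`, and `run_once` are hypothetical names standing in for the HTTP calls, and SCNs are contiguous here for simplicity; only the control flow is the point.

```python
class EventSource:
    """Toy stand-in for a relay or a change history server."""

    def __init__(self, events):
        self.events = sorted(events, key=lambda e: e["scn"])

    def get_after(self, scn):
        # Report a miss when the caller's position is behind our oldest
        # retained event, i.e. the events it needs were already evicted.
        if self.events and scn < self.events[0]["scn"] - 1:
            return None
        return next((e for e in self.events if e["scn"] > scn), None)


class Consumer:
    def __init__(self, relay, history):
        self.relay = relay      # fast, bounded in-memory source
        self.history = history  # persistent change history server
        self.last_scn = 0       # state: last successfully processed transaction

    def run_once(self, apply_to_target):
        # Prefer the relay; fall back to the history server when the relay
        # no longer contains the next required event.
        event = self.relay.get_after(self.last_scn)
        if event is None:
            event = self.history.get_after(self.last_scn)
        if event is not None:
            apply_to_target(event)        # write into the target database
            self.last_scn = event["scn"]  # advance only after success


# The relay retains only SCN 3; older changes live in the history server.
consumer = Consumer(
    relay=EventSource([{"scn": 3}]),
    history=EventSource([{"scn": 1}, {"scn": 2}, {"scn": 3}]),
)
applied = []
for _ in range(3):
    consumer.run_once(applied.append)
print([e["scn"] for e in applied])  # [1, 2, 3] -- order preserved across sources
```

Note that the consumer advances `last_scn` only after the event is applied, so a crash and restart resumes from the last successfully processed transaction.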

Preserving transaction consistency

All components in Griddable.io’s data pipeline architecture follow the commit timeline defined by the source database. This means that they see changes in the source database commit order and preserve transaction boundaries. Pull-based data pipeline architectures have several advantages:

Resilience to unavailable, slow or faulty components

Easy sharing of state with downstream components

Dramatic scalability by adding additional components where capacity is needed

Following this model, every component in the data pipeline architecture expresses its state as simply the last transaction successfully processed. Each database assigns change numbers slightly differently. For example, Oracle databases assign a sequential System Change Number (SCN) to each transaction. Thus, the SCN succinctly describes every component’s progress in processing the incoming stream of change data.
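Because a component's durable state is just one number, restart logic reduces to skipping everything at or below the checkpointed SCN. A minimal sketch, with illustrative names (`process_stream`, the in-memory checkpoint) that are assumptions rather than Griddable's API:

```python
def process_stream(events, checkpoint_scn, apply):
    """Resume from a checkpoint: skip anything at or below the last SCN."""
    for event in sorted(events, key=lambda e: e["scn"]):
        if event["scn"] <= checkpoint_scn:
            continue  # already processed before the restart
        apply(event)
        checkpoint_scn = event["scn"]  # persisted atomically in practice
    return checkpoint_scn


stream = [{"scn": n} for n in (101, 102, 103, 104)]
applied = []
new_checkpoint = process_stream(stream, checkpoint_scn=102, apply=applied.append)
print(new_checkpoint)  # 104 -- only SCNs 103 and 104 were applied after restart
```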

The Griddable data pipeline architecture evolves gracefully as requirements for scalability and availability change. Additional relay instances work together to replicate high transaction volumes from larger and faster source databases. Likewise, additional consumer instances work together, each processing a portion of the incoming replicated events. The architecture scales by adding consumers because each consumer operates completely independently of the relays.
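One common way to split a replicated stream across consumer instances, sketched below, is to assign each event to a consumer by a stable hash of its table (or key). The hashing scheme and event shape are assumptions for illustration; the point is that each consumer sees a disjoint slice of the stream in the original commit (SCN) order.

```python
import zlib


def partition_for(event, num_consumers):
    # Stable hash of the table name picks the consumer; events for the same
    # table always land on the same consumer instance.
    return zlib.crc32(event["table"].encode()) % num_consumers


# Events arrive in source commit order (ascending SCN).
events = [
    {"scn": 1, "table": "orders"},
    {"scn": 2, "table": "users"},
    {"scn": 3, "table": "orders"},
]

slices = {i: [] for i in range(2)}  # two consumer instances
for ev in events:
    slices[partition_for(ev, 2)].append(ev)

# Each consumer's slice preserves the source commit order.
for evs in slices.values():
    assert [e["scn"] for e in evs] == sorted(e["scn"] for e in evs)
```

Partitioning by table or key keeps related changes on one consumer, which preserves ordering where it matters while letting unrelated tables be applied in parallel.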

To deploy additional relays or consumers on demand, the Griddable data pipeline architecture runs on an elastic Kubernetes (K8S) infrastructure. Kubernetes resizes the infrastructure automatically as needed and makes it portable across all major public clouds. Further, the Kubernetes graphical management dashboard shows all nodes in the cluster and their operational state.

Next step

Push the “Live demo” button to see the Griddable data pipeline architecture in action.