The success of quantitative analysis is largely dependent on the ability to capture, store, and process data. Providing timely and trustworthy insight to business decision makers increases the chances of success for a big data project.

Batch processing: Batch processing is applied to large volumes of static (bounded) data in a scalable, distributed manner.

Real-time processing: Real-time processing handles continuous, unbounded streams of high-velocity data in a distributed fashion.

Hybrid computation model: This model combines batch and real-time processing to handle data that is both high-volume and high-velocity.

Big data engineering is time-consuming and requires niche skills to address data acquisition and data processing, aspects that are necessary for most solutions. Pivotal introduced Spring XD and Spring Cloud Data Flow to reduce this engineering overhead. This article provides a brief overview of Spring XD and a more detailed look at the latest iteration of this technology, Spring Cloud Data Flow.

The first round of innovation came in the form of Spring XD, which provides a readily consumable solution for common tasks related to data processing. Spring XD is built on top of proven Spring technology and provides support for data ingestion, movement, processing, deep analytics, stream processing, and batch processing.

Spring XD provides a sophisticated, stable, scalable framework for real-time and batch processing. Picking up data and moving it from various sources to targets is much easier with Spring XD.

Spring XD-based Architecture:

Spring XD-based architecture is depicted in the diagram below. With the help of the modules described below, we can create, run, deploy, and destroy data pipelines and perform any kind of data processing on them.

The main components of Spring XD are the Admin and the Container.

The Admin UI sends requests to the server, which processes each request using the relevant module. A module here is a component that creates a Spring application context.

All modules run inside an XD container, which executes the tasks those modules define.

Following are the key modules in the Spring XD architecture:

Source: The creation of a stream always starts with a source module. A source can use a polling or an event-driven mechanism and produces only output.

Processor: Takes an input message, performs some type of processing on it, and produces an output message.

Sink: As the name suggests, this module terminates the stream and writes the output to an external resource, e.g. HDFS.
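As a hypothetical illustration of how these three module types compose (assuming the out-of-the-box http, filter, and hdfs modules), a stream is wired together in the XD shell with a pipe-style DSL:

```
# Create and deploy a stream: an HTTP source, a filter processor, and an HDFS sink.
xd:> stream create --name errorlog --definition "http --port=9000 | filter --expression=payload.contains('error') | hdfs" --deploy

# Post a message into the stream's HTTP source.
xd:> http post --target http://localhost:9000 --data "error: disk full"
```

Messages that do not satisfy the filter expression are simply dropped before reaching the sink.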

The changing nature of applications and requirements has revealed gaps in Spring XD and the need for a new round of innovation. Below are the most important requirements driving the need for a new framework:

Cloud technology has helped tremendously in meeting operational and non-functional requirements at the platform level, but the engineering effort required to satisfy non-functional requirements (NFRs) at the application level remains a challenge.

Today’s platforms increasingly need the ability to migrate to a cloud vendor of choice. A microservice-based cloud architecture is better suited to this objective, but Spring XD does not directly support microservice-based architectures.

Spring XD supports big data scenarios, but there is still a huge portion of projects that do not require Hadoop for storing and processing data.

As a second round of innovation, Pivotal has introduced Spring Cloud Data Flow as a replacement for Spring XD. Spring Cloud Data Flow inherits the advantages of Spring XD and provides a more scalable solution by leveraging a cloud-native approach. It implements a hybrid computation model that unifies stream and batch data processing. Developers can leverage Spring Cloud Data Flow to create and orchestrate data pipelines for common use cases like data ingestion, real-time analytics, and batch processing. Spring Cloud Data Flow is intended to make data engineering easy so that energy can be focused on analysis and the specific problem at hand. Spring Cloud Data Flow offers a managed service model only.

Spring Cloud Data Flow Architecture

Spring XD has been revamped into Spring Cloud Data Flow, with fundamental changes in how the functionality is structured and how it helps scale applications using a cloud-native architecture.

Spring Cloud Data Flow moves away from the traditional component-based architecture and adopts a message-driven microservices architecture that is more suitable for cloud-native applications. Spring XD modules are now replaced with microservices deployed on the cloud.

Major changes are observed in the following areas:

To take advantage of cloud native platforms, a new service provider interface (SPI) has been introduced in Spring Cloud Data Flow which replaces the Spring XD runtime layer.

User-interfacing and integration elements like the Admin REST API, shell, and UI layer remain the same as in Spring XD, but the underlying architecture has been revamped.

The service provider interface (SPI) replaces the ZooKeeper-based runtime. The SPI now coordinates with systems such as Pivotal Cloud Foundry or Yarn to launch and monitor microservice-based applications.

The components of Spring Cloud Data Flow are described below:

Core domain modules: The primary building blocks of any data flow, including source, sink, stream, and task modules for stream processing and batch jobs. All of these modules are Spring Boot microservice applications.

Module registry: Maintains the catalog of available modules using Maven.

Module deployer SPI: An abstraction layer for deploying modules across different runtime environments like Lattice, Cloud Foundry, Yarn, and Local.

Admin: A Spring Boot application that provides the REST API and UI.

Shell: Lets us run DSL commands to create, process, and destroy streams and perform other simple tasks; the shell in turn connects to the Admin's REST API.
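For example (the stream name here is hypothetical; the http and log modules ship out of the box), a typical shell session to create, inspect, and tear down a stream looks like:

```
dataflow:> stream create --name httptest --definition "http --server.port=9000 | log" --deploy
dataflow:> stream list
dataflow:> stream destroy --name httptest
```

Each command is translated by the shell into a call against the Admin's REST API.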

The above diagram depicts a typical data flow created using the Spring Cloud Data Flow model.

Source, job, sink, and processor are Spring Boot microservices that can be deployed on a Cloud Foundry, Lattice, or Yarn cluster. Using these microservices deployed on a cloud-native platform, we can create data pipelines and ingest data into Yarn-, Lattice-, or Cloud Foundry-based targets. A platform-specific SPI (Service Provider Interface) handles microservice binding, discovery, and channel binding based on the deployment platform.

Use Case

The real benefit of Spring Cloud Data Flow is the ability to quickly set up, configure, and build data ingestion and processing steps using a unified framework, so that developer bandwidth can be focused on the specific problem at hand.

We will take a high-level look at the kind of changes required to construct a use case around a source that does not exist out of the box: Facebook data. The objective of this exercise is to analyze Facebook posts. We do not have a Facebook data source readily available among the Spring Cloud Stream modules, so we need to create custom modules for the Facebook source. To create a data stream, three main microservices are required: source, processor, and sink. The Source, Processor, and Sink interfaces are already provided.

The @EnableBinding(Source.class) annotation detects the binder implementation on the application classpath (e.g., Redis), and the binder then creates the channel adapters. All the microservices will be developed as Spring Boot applications for simpler dependency management.
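A minimal sketch of what such a custom source might look like (FBSource is a hypothetical name, the polling method is a placeholder for a real Facebook Graph API call, and a Spring Cloud Stream binder dependency is assumed on the classpath):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.integration.annotation.InboundChannelAdapter;

// Hypothetical Facebook source: polls for new posts and publishes them
// on the Source.OUTPUT channel through the configured binder (e.g. Redis).
@SpringBootApplication
@EnableBinding(Source.class)
public class FBSourceApplication {

    public static void main(String[] args) {
        SpringApplication.run(FBSourceApplication.class, args);
    }

    // Emits one message per poll cycle; a real implementation would call
    // the Facebook Graph API here instead of returning a canned payload.
    @InboundChannelAdapter(Source.OUTPUT)
    public String fetchLatestPost() {
        return "{\"post\": \"sample Facebook post\"}";
    }
}
```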

The sink microservice receives the data originating from the Facebook stream and writes it to the console. Sink.class is passed as a parameter to @EnableBinding, and @ServiceActivator connects the input channel to the console endpoint.
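A sketch of such a sink (FBSink is a hypothetical name; the same binder assumption applies) could look like this:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Sink;
import org.springframework.integration.annotation.ServiceActivator;

// Hypothetical sink: consumes messages from the Sink.INPUT channel and
// writes each payload to the console.
@SpringBootApplication
@EnableBinding(Sink.class)
public class FBSinkApplication {

    public static void main(String[] args) {
        SpringApplication.run(FBSinkApplication.class, args);
    }

    @ServiceActivator(inputChannel = Sink.INPUT)
    public void logPost(String post) {
        System.out.println(post);
    }
}
```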

The processor microservice filters the Facebook posts coming from the FBSource microservice based on a SpEL expression given as input. The output of the processor microservice becomes the input of FBSink.
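One way to sketch such a processor (FBProcessor and the filter.expression property are hypothetical; the SpEL expression is evaluated with each post as the root object):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.expression.spel.standard.SpelExpressionParser;
import org.springframework.integration.annotation.Filter;

// Hypothetical processor: passes along only the posts matching a SpEL
// expression supplied as a configuration property.
@SpringBootApplication
@EnableBinding(Processor.class)
public class FBProcessorApplication {

    // e.g. --filter.expression="contains('spring')", evaluated against the post
    @Value("${filter.expression:length() > 0}")
    private String filterExpression;

    private final SpelExpressionParser parser = new SpelExpressionParser();

    public static void main(String[] args) {
        SpringApplication.run(FBProcessorApplication.class, args);
    }

    // Posts for which the expression evaluates to false are dropped; the
    // rest flow on to Processor.OUTPUT and from there to FBSink.
    @Filter(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    public boolean accept(String post) {
        return Boolean.TRUE.equals(
                parser.parseExpression(filterExpression).getValue(post, Boolean.class));
    }
}
```

Once all three custom modules are registered, the pipeline could be wired up from the shell with something like `stream create --name fbstream --definition "fbsource | fbprocessor | fbsink" --deploy` (module names here are hypothetical).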

Conclusion

Spring Cloud Data Flow uses Spring Cloud Stream modules, with which we can create and run messaging microservices as Spring Boot applications that can be deployed on different platforms, run independently, and interact with each other. Spring Cloud Data Flow acts as the glue when creating a data pipeline from Spring Cloud Stream modules.

There are currently many standalone open-source projects for managing data ingestion, real-time analysis, and data loading. Spring Cloud Data Flow provides unified, distributed, and extensible services for data ingestion, real-time analytics, batch processing, and data export.