Getting Started with Spring Cloud Data Flow

In this article, I will show you how you can get started with Spring Cloud Data Flow. Spring Cloud Data Flow is an amazing platform for building data integration and processing pipelines. It has a very user-friendly graphical dashboard where you can define your streams, making your work with data an absolute pleasure.

The goal of this article is to have you learn to build some simple data pipelines by the time you are finished reading. Before we get started there are a few system requirements:

You should have JDK 8 installed, as at the time of writing Spring Cloud Data Flow is somewhat tricky to get to work with JDK 9 (missing JAXB libraries)

You should have Docker installed. If you are not sure why this is useful, I have written an article explaining docker use as a development tool. If you still don’t want to install docker, you need to be able to get MySql, Redis, and RabbitMQ accessible from your machine.

Getting Spring Cloud Data Flow Server up and running

As I have mentioned, in order to get the platform running, you need some middleware. The first is RabbitMQ, you could use Kafka for your stream communication, but for the simplicity of this tutorial we are going to go with RabbitMQ:

Running this command will start RabbitMQ Docker container on your machine exposed on the default ports. You will also get a management console that will let you check on the status of your broker.

In order to get analytics from Spring Cloud Data Flow, you will need Redis as well. This is not 100% required, but since it is not much hassle- let’s get it started. If you are running Data Flow in a production deployment you will definitely want it:

docker run --name dataflow-redis -p 6379:6379 -d redis

The last pre-requisite is a MySql instance. If you do not have it, you will end up with an in-memory H2 database powering Data Flow. The problem with that is that you will lose all your data on a server restart. This could be actually desirable for testing, but incredibly frustrating if you invest some time in configuring your Streams only to lose them on a server restart. While creating the container, we will set a custom password and create a database for Data Flow:

We have downloaded the local version of the server. That means that the different applications composing our Streams will be deployed as local Java processes. There are Cloud Foundry and Kubernetes versions of the server available if you want something more production-ready.

Time to start the server. We will pass MySQL and RabbitMQ parameters in this starting command. Redis default properties are good enough:

Spring Cloud Data Flow – First Look

Hopefully, your server started without a problem and you are seeing something like this in your console:

The beautiful logo on the start of Spring Cloud Data FlowData Flow successfully started

If you want you can look into your MySQL instance where you should see a bunch of tables created:

Data Flow creates multiple tables to keep track of things

Time to see Spring Cloud Data Flow itself! Go to http://localhost:9393/dashboard to see the dashboard:

Spring Cloud Data Flow dashboard

It looks a bit empty! This is because we did not load any starter apps. Spring Cloud Stream App Starters is a project that provides a multitude of ready-to-go starter apps for building Streams. You can read from FTP, HTTP, JDBC, twitter and more, process and save to a multitude of sources. There are three main concepts to which each Application can belong:

Source – These are the available sources of data. You start your streaming pipelines from them.

Processor – These take data and send them further in the processing pipeline. They sit in the middle.

Sink – They are the endpoints for the streams. This is where the data ends in the end.

These are being constantly added, and you can see the up-to-date list on the official project site. Currently we have:

Source

Processor

Sink

file

aggregator

aggregate-counter

ftp

bridge

cassandra

gemfire

filter

counter

gemfire-cq

groovy-filter

field-value-counter

http

groovy-transform

file

jdbc

header-enricher

ftp

jms

httpclient

gemfire

load-generator

pmml

gpfdist

loggregator

python-http

hdfs

mail

python-jython

hdfs-dataset

mongodb

scriptable-transform

jdbc

mqtt

splitter

log

rabbit

tasklaunchrequest-transform

mongodb

s3

tcp-client

mqtt

sftp

tensorflow

pgcopy

syslog

transform

rabbit

tcp

twitter-sentiment

redis-pubsub

tcp-client

router

time

s3

trigger

sftp

triggertask

task-launcher-cloudfoundry

twitterstream

task-launcher-local

task-launcher-yarn

tcp

throughput

websocket

This is an impressive list! So how do we get them into the Spring Cloud Data Server? Could not be easier! First, we are going to use the RabbitMQ + Maven flavour for the starters, as this is how we set up the server. From the project website, the URL for the stable release is http://bit.ly/Celsius-SR1-stream-applications-rabbit-maven. We can supply this to the Data Flow server. First, click:

And then populate the URI and click the Import button:

Bulk Importing the starter apps

If all went well, then you should see multiple starter apps available.

Multiple starter apps available for use in Data Flow

Building our first Data Flow Stream

We are now ready to build the first Data Flow Stream. To do this we will head to the Streams tab on the Dashboard and click the Create Stream button:

Click the highlighted button to create a new Stream

Here, we will create a stream that reads from HTTP endpoint, upper-cases the content and saves it all to a file in c:/dataflow-output (if you are on Windows, otherwise you can choose different directory). The aim of this exercise is to show you how Source, Processor and Sink connect together and how seamless it all is! Let’s drag and drop the following into the workspace:

Source – HTTP

Processor – transform

Sink – file

You should see the following:

The three components that will create our stream

As you can see there are red exclamation marks displayed. That means that the Stream is not healthy. You should click on the tiny squares in the graphical representation to connect the streams, or alternatively, in the text field, you can specify how the Stream should be composed:

http | transform | file

With that, we just need to configure our stream accordingly. This can be done by either clicking on the graphical representation cog-wheel icon that appears when selected:

Click on the highlighted icon to configure the component

Or by using the text field. One thing that you get from that graphical interface is quite a nice way of inputting the properties. For example to configure that HTTP source we can simply set the port like that:

We can see what other properties are configurable

Lets set the remaining properties via the text-field. The final Stream description should look like this:

Great! We have the first stream, now lets click Create Stream button visible just above the input text-field and set the stream name to Upper-Case-Stream:

Creating our first stream

I have ticked the Deploy stream(s) box to have the stream automatically deployed. The stream should be deployed shortly:

You can see the stream successfully deployed

Trying out the Stream

It would not be much fun to just create the stream and not try it! To do that you can get Postman running to send a few requests to that HTTP endpoint:

Sending example request

You will quickly see that there are relevant queues and exchanges created in the connected RabbitMQ instance:

Queues are created, and so are Exchanges. Other processes could listen to those!

And finally looking into that file and directory that we wanted to save the results of our Stream:

The upper-cased results of processing in all its glory!

Congratulations! You have made it through the creation of your first Spring Cloud Data Flow Stream!

Where from here?

I hope that reading this introduction got you excited about using Spring Cloud Data Flow- I certainly enjoyed writing about it! You should be aware that there is also Spring Cloud Data Flow Shell available if you need to work with the platform in a shell only environment (or if you prefer to!).

There is much more to Spring Cloud Data Flow. You can create your own Sources, Processors, and Sinks. You can create Tasks (run on demand) processes rather than Streams. There are complicated processing workflows that you can design. This all is to be discovered and used- hopefully with the knowledge from this article you are ready to start exploring yourself.

9 thoughts on “Getting Started with Spring Cloud Data Flow”

Thanks for useful article for version 1.5.2.RELEASE bulk application import url do not work. It is asking “Please provide a valid URI pointing to the respective properties file.” Do you know what is the reason?

Hi Bartosz, thanks for your comment finally I fixed that by using this url : http://bit.ly/Darwin-GA-stream-applications-rabbit-maven. I found this in the updated documentation. I have other question regarding that we are using docker image , each time after restart it will undo the changes (bulk import and stream definitions). Do you have any suggestion about how I can easily apply changes each time in docker image ? by the way if Spring Cloud Data Flow using mysql database to store data why always it reseting after restart?

I mean based on what I found in the documentation I need to pass database parameters for jar file like this :
java -jar spring-cloud-dataflow-server-local/target/spring-cloud-dataflow-server-local-{project-version}.jar \
–spring.datasource.url=jdbc:mysql://localhost:3306/mydb \
–spring.datasource.username= \
–spring.datasource.password= \
–spring.datasource.driver-class-name=org.mariadb.jdbc.Driver
But what if I want to set mysql as default database and do not need to pass always this parameters?

Sorry for asking so many questions , I faced this issue also after deploying the stream when I want to send the request to http using postman I’m getting java.net.ConnectException: Connection refused (Connection refused), Do you have any idea what’s going wrong?

Hey shachi, it seems that you are trying to use the Kafka binder rather than RabbitMQ as in this tutorial. If you are planning on using Kafka, that’s fine, but you will need to learn about this technology. If you are going to use RabbitMQ, you will need to make sure to put the correct properties as in this tutorial or the documentation: https://cloud.spring.io/spring-cloud-stream-app-starters/

Post navigation

Privacy & Cookies: This site uses cookies. By continuing to use this website, you agree to their use.
To find out more, including how to control cookies, see here:
Cookie Policy

About

E4developer is a place where I share my open and honest views on software development, technology and working with people. The name – e4 comes from a chess move, this is how I start most of my games. Follow me on twitter – @e4developer