Reference Guide

Introduction

Overview

Spring XD is a unified, distributed, and extensible service for data ingestion, real-time analytics, batch processing, and data export. The Spring XD project is an open source project, licensed under the Apache 2.0 License, whose goal is to tackle big data complexity. Much of the complexity in building real-world big data applications is related to integrating many disparate systems into one cohesive solution across a range of use cases.
Common use cases encountered in creating a comprehensive big data solution include:

High-throughput distributed data ingestion from a variety of input sources into a big data store such as HDFS or Splunk

Download Spring XD

If you want to try out the latest build of Spring XD, you can download the snapshot distribution from the Spring snapshots repository. You can also build the project from source if you wish. The wiki content is also kept up to date with the current snapshot, so if you are reading this on the GitHub website, things may have changed since the last stable release.

Unzip the distribution, which will unpack to a single installation directory. All the commands below are executed from this directory, so change into it before proceeding.

$ cd spring-xd-1.3.3.BUILD-SNAPSHOT

Install Spring XD

Spring XD can be run in two different modes. There’s a single-node runtime option for testing and development, and there’s a distributed runtime which supports distribution of processing tasks across multiple nodes. This document will get you up and running quickly with a single-node runtime. See Running Distributed Mode for details on setting up a distributed runtime.

You can also install Spring XD using Homebrew on OS X and RPM on RedHat/CentOS.

Start the Runtime and the XD Shell

The single-node option is the easiest to get started with. It runs everything you need in a single process. To start it, cd into the xd directory and run the following command:

xd/bin>$ ./xd-singlenode

In a separate terminal, cd into the shell directory and start the XD shell, which you can use to issue commands.

The shell is a more user-friendly front end to the REST API which Spring XD exposes to clients. The URL of the currently targeted Spring XD server is shown at startup.

Note

If the server could not be reached, the prompt will read

server-unknown:>

You can then use the admin config server <url> command to attempt to reconnect to the admin REST endpoint once you’ve figured out what went wrong:

admin config server http://localhost:9393

You should now be able to start using Spring XD.

Tip

Spring XD uses ZooKeeper internally, which typically runs as an external process. XD singlenode runs with an embedded ZooKeeper server and assigns a random available port. This keeps things very simple. However, if you already have a ZooKeeper ensemble set up and want to connect to it, you can edit xd/config/servers.yml:
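A sketch of what that edit might look like (the zk.client.connect key follows the convention used in shipped servers.yml files; treat the exact keys as an assumption and verify them against the comments in your copy):

#xd/config/servers.yml - connect to an existing ZooKeeper ensemble
zk:
  client:
    connect: zkhost1:2181,zkhost2:2181,zkhost3:2181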

Also, it is sometimes useful in troubleshooting to connect the ZooKeeper CLI to the embedded server. The assigned server port is listed in the console log, but you can also set the port directly by setting the property zk.embedded.server.port in servers.yml, or by setting JAVA_OPTS before starting xd-singlenode.

$ export JAVA_OPTS=-Dzk.embedded.server.port=<port>

Create a Stream

In Spring XD, a basic stream defines the ingestion of event-driven data from a source to a sink, passing through any number of processors. You can create a new stream by issuing a stream create command from the XD shell. Stream definitions are built from a simple DSL. For example, execute:

xd:> stream create --name ticktock --definition "time | log" --deploy

This defines a stream named ticktock based on the DSL expression time | log. The DSL uses the "pipe" symbol, |, to connect a source to a sink. The stream server finds the time and log definitions in the modules directory and uses them to set up the stream. In this simple example, the time source simply sends the current time as a message each second, and the log sink outputs it using the logging framework at the WARN logging level. Since the --deploy flag was provided, this stream is deployed immediately. In the console where you started the server, you will see log output similar to that listed below.

To stop the stream and remove the definition completely, you can use the stream destroy command:

xd:>stream destroy --name ticktock

It is also possible to stop and restart the stream instead, using the undeploy and deploy commands. The shell supports command completion so you can hit the tab key to see which commands and options are available.

Explore Spring XD

Learn about the modules available in Spring XD in the Sources, Processors, and Sinks sections of the documentation.

ZooKeeper - Provides all runtime information for the XD cluster: it tracks running containers, in which containers modules and jobs are deployed, stream definitions, deployment manifests, and the like. See XD Distributed Runtime for an overview of how XD uses ZooKeeper.

Spring Batch Job Repository Database - An RDBMS is required for jobs. The XD distribution comes with HSQLDB, but this is not appropriate for a production installation. XD supports any JDBC-compliant database.

A Message Broker - Used for data transport. XD data transport is designed to be pluggable. Currently XD supports RabbitMQ and Redis for messaging during stream and job processing, and Kafka for messaging during stream processing only. Note that job processing using Kafka as the transport is not currently available. A production installation must configure one of these transport options.

xd-admin command line args:

analytics - The data store that will be used to store the analytics data. The default is redis.

help - Displays help for the command args. Help information may be accessed with a -? or -h.

httpPort - The http port for the REST API server. Defaults to 9393.

mgmtPort - The port for the management server. Defaults to the admin server port.

Also, note that it is recommended to use a fixed HTTP port for the XDAdmin server(s). This makes it easy to know which admin server addresses the REST clients (shell, web UI) can point to. If a random port is chosen (with server.port or $PORT set to 0), then you need to search the log to find the port on which the admin server’s Tomcat instance started.

xd-container command line args:

analytics - How to persist analytics such as counters and gauges. The default is redis.

groups - The assigned group membership for this container as a comma delimited list

hadoopDistro - The Hadoop distribution to be used for HDFS access. HDFS is not available if not set.

help - Displays help for the command args. Help information may be accessed with a -? or -h.

mgmtPort - The port for the management server. Defaults to the container server port.

Setting up a RDBMS

The distributed runtime requires an RDBMS. The XD distribution comes with an HSQLDB in-memory database for testing purposes, but an alternative is expected for production. To start HSQLDB:

$ cd hsqldb/bin
$ ./hsqldb-server

To configure XD to connect to a different RDBMS, have a look at the spring:datasource section of xd/config/servers.yml for details. Note that spring.batch.initializer.enabled is set to true by default, which will initialize the Spring Batch schema if it is not already set up. However, if those tables have already been created, they will be unaffected.

If the provided schemas are customized, other values may need to be customized as well. In xd/config/servers.yml, the following block exposes database-specific values for the batch job repository.

A special handler for large objects. The default is usually fine, except for some (usually older) versions of Oracle. The default is determined from the database type.

Used to determine which ID incrementer to use. The default is usually fine, except when the type returned by the datasource should be overridden (GemFire XD, for example).

Configures how large a message can be and still be stored in a VARCHAR type field.

Prefix for repository tables.

Flag to determine whether to check for an existing transaction when a JobExecution is created. Defaults to true, because creating a JobExecution inside an existing transaction is usually a mistake, leading to problems with restartability and to deadlocks in multi-threaded steps.

Flag that indicates if the database tables should be created on startup.

Setting up ZooKeeper

Currently XD does not ship with ZooKeeper. At the time of this writing, the compliant version is 3.4.6, which you can download from the Apache ZooKeeper website. Please refer to the ZooKeeper Getting Started Guide for more information. A ZooKeeper ensemble consisting of at least three members is recommended for production installations, but a single server is all that is needed to get XD up and running.

You can configure the root path in ZooKeeper where an XD cluster’s top-level nodes will be created. This allows you to run multiple independent XD clusters that share a single ZooKeeper instance. Add the following to servers.yml to configure it. You can also set it as an environment variable or system property in the standard manner.
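For example, the following servers.yml entry places this cluster’s nodes under a custom root path (the zk.namespace key is assumed here; verify it against the comments in your servers.yml):

#Root path for this XD cluster's nodes in ZooKeeper
zk:
  namespace: xd-cluster-1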

Additionally, various time-related settings may optionally be configured for ZooKeeper:

Setting up Redis

Redis is the default transport when running in distributed mode.

Installing Redis

If you already have a running instance of Redis, it can be used for Spring XD. By default Spring XD will try to use a Redis instance running on localhost using port 6379. You can change that in the servers.yml file residing in the config/ directory.
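For example, to point XD at a Redis server on another host (host name and port are illustrative):

spring:
  redis:
    host: redis-host.example.com
    port: 6379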

If you don’t have a pre-existing installation of Redis, you can use the instance Spring XD provides for Linux and Mac, which is included in the .zip download. If you are installing using brew or rpm, you should install Redis using those installers or download the source tarball and compile Redis yourself. If you used the .zip download, then inside the Spring XD installation directory (spring-xd) run:

$ cd redis/bin
$ ./install-redis

This will compile the Redis source tarball and add the Redis executables under redis/bin:

redis-check-dump

redis-sentinel

redis-benchmark

redis-cli

redis-server

You are now ready to start Redis by executing:

$ ./redis-server

Tip

For further information on installing Redis in general, please check out the Redis Quick Start guide. If you are using Mac OS, you can also install Redis via Homebrew.

Troubleshooting

Redis on Windows

Presently, Spring XD does not ship Windows binaries for Redis (see XD-151). However, Microsoft is actively working on supporting Redis on Windows, and Windows Redis binaries are available for download.

Redis is not running

If you try to run Spring XD and Redis is NOT running, you will see the following exception:

11:26:37,830 ERROR main launcher.RedisContainerLauncher:85 - Unable to connect to Redis on localhost:6379; nested exception is com.lambdaworks.redis.RedisException: Unable to connect
Redis does not seem to be running. Did you install and start Redis? Please see the Getting Started section of the guide for instructions.

Using RabbitMQ

Installing RabbitMQ

If you already have a running instance of RabbitMQ, it can be used for Spring XD. By default Spring XD will try to use a Rabbit instance running on localhost using port 5672. The default account credentials of guest/guest are assumed. You can change these in the servers.yml file residing in the config/ directory.
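For example, a servers.yml fragment for a non-default Rabbit instance might look like the following (host and credentials are illustrative; the property names match those described later in this guide):

spring:
  rabbitmq:
    addresses: rabbit-host.example.com:5672
    username: xd
    password: xdpassword
    virtual_host: /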

If you don’t have a RabbitMQ installation already, head over to http://www.rabbitmq.com and follow the instructions. Packages are provided for Windows, Mac and various flavors of Unix/Linux.

Starting Spring XD in Distributed Mode

You can start the xd-container and xd-admin servers individually as follows:

xd/bin>$ ./xd-admin
xd/bin>$ ./xd-container

Choosing a Transport

Spring XD uses a data transport for sending data from the output of one module to the input of the next. In general, this requires remote transport between container nodes. The Admin server also uses the data bus to launch batch jobs by sending a message to the job’s launch channel. Since the same transport must be shared by the Admin server and all Containers, the transport configuration is centralized in xd/config/servers.yml.
The default transport is redis. Open servers.yml with a text editor and you will see the transport configuration near the top. To change the transport, uncomment this section and change the transport to rabbit or any other supported transport. Any changes to the transport configuration must be replicated to every XD node in the cluster.

Note

XD singlenode also supports a --transport command line argument, useful for testing streams under alternate transports.

#xd:
# transport: redis

Note

If you have multiple XD instances running that share a single RabbitMQ server for transport, you may encounter issues if each system contains streams of the same name. We recommend using a different RabbitMQ virtual host for each system. Update the spring.rabbitmq.virtual_host property in $XD_HOME/config/servers.yml to point XD at the correct virtual host.
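For example (the virtual host name is illustrative):

spring:
  rabbitmq:
    virtual_host: xd-system-a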

Choosing an Analytics provider

By default, the xd-container will store Analytics data in redis. At the time of writing, this is the only supported option when running in distributed mode. Use the --analytics option to specify another backing store for Analytics data.

The back off multiplier (previous interval x multiplier = next interval)

The maximum number of retry attempts

Other Options

There are additional configuration options available for these scripts:

To specify the location of the Spring XD install other than the default configured in the script

export XD_HOME=<Specific XD install directory>

To specify the http port of the XDAdmin server,

xd/bin>$ ./xd-admin --httpPort <httpPort>

The XDContainer nodes by default start up with server.port 0 (which means they will scan for an available HTTP port). You can disable the HTTP endpoints for the XDContainer by setting server.port=-1. Note that in this case HTTP source support will not work in a PaaS environment, because a PaaS typically requires XD to bind to a specific port. Both the XDAdmin and XDContainer processes bind server.port to $PORT (i.e. an environment variable, if one is available, as is typical in a PaaS).

Using Hadoop

Spring XD supports the following Hadoop distributions:

hadoop27 - Apache Hadoop 2.7.1 (default)

phd21 - Pivotal HD 2.1 and 2.0

phd30 - Pivotal HD 3.0

cdh5 - Cloudera CDH 5.3.0

hdp22 - Hortonworks Data Platform 2.2

To specify the distribution libraries to use for Hadoop client connections, use the option --hadoopDistro for the xd-container and xd-shell commands:
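For example, to run against a Cloudera cluster, using the distribution name from the list above:

xd/bin>$ ./xd-container --hadoopDistro cdh5
shell/bin>$ ./xd-shell --hadoopDistro cdh5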

XD-Shell in Distributed Mode

If you wish to use an XD-Shell that is on a different machine than where you deployed your admin server, follow these steps:

1) Open your shell

shell/bin>$ ./xd-shell

2) From the xd shell use the "admin config server" command i.e.

admin config server <yourhost>:9393

Running on YARN

Introduction

The Spring XD distributed runtime (DIRT) supports distribution of
processing tasks across multiple nodes. See
Running Distributed Mode for
information on running Spring XD in distributed mode. One option is to
run these nodes on a Hadoop YARN cluster rather than on VMs or
physical servers managed by you.

Running YARN on some Ubuntu distributions and Mac OS X has been shown to have issues when YARN applications are killed. The kill command doesn’t always successfully kill the corresponding OS process, and you can end up with application processes still running. See HADOOP-9752 for more details.

You need a supported transport; see Running Distributed Mode for installation of Redis or RabbitMQ. Spring XD on YARN currently uses Redis as the default data transport.

You also need ZooKeeper running. If your Hadoop cluster doesn’t have ZooKeeper installed, you need to install and run it specifically for Spring XD. See the Setting up ZooKeeper section of the "Running Distributed Mode" chapter.

Lastly, you need an RDBMS to support batch jobs and JDBC operations.

Download Spring XD on YARN binaries

In addition to the regular spring-xd-<version>-dist.zip files we
also distribute a zip file that includes all you need to deploy on
YARN. The name of this zip file is spring-xd-<version>-yarn.zip. You
can download the zip file for the current release from
Spring release repo or a milestone build from the
Spring milestone repo. Unzip the downloaded file and you should see a
spring-xd-<version>-yarn directory.

Configure your deployment

Configuration options are contained in a config/servers.yml file in
the Spring XD YARN install directory. You need to configure the hadoop
settings, the transport choice plus redis/rabbit settings, the
zookeeper settings and the JDBC datasource properties.

Depending on the distribution used, you might need to change the siteYarnAppClasspath and siteMapreduceAppClasspath. We have provided basic settings for the supported distros; you just need to uncomment the ones for the distro you use.

XD options

For Spring XD you need to define how many admin servers and containers you need, using the properties spring.xd.adminServers and spring.xd.containers respectively. You also need to define the HDFS location, using the property spring.yarn.applicationDir, where the Spring XD binary and config files will be stored.

Setting the Hadoop topology.script.file.name property is mandatory if more sophisticated container placement is used to allocate XD admins or containers from specific hosts or racks. If this property is not set to match the one used in the Hadoop cluster, allocations using hosts and racks will simply fail.

XD Admin port

By default, the property server.port, which defines the port used by the embedded server, is set to 9393, but it can be overridden by changing the value in servers.yml.

#Port that admin-ui is listening on
#server:
# port: 9393

On YARN it is recommended that you simply set the port to 0, meaning that the server will automatically choose a random port. This is advisable because it prevents port collisions, which are usually a little difficult to track down in a cluster. See the section Connect xd-shell to YARN runtime managed admins for instructions on how to connect xd-shell to admins managed by YARN.

#Port that admin-ui is listening on
server:
  port: 0

Adding custom modules

The recommended approach for custom modules is to define the module registry location as a directory in HDFS. This allows the most flexibility, and the modules will automatically be available to all XD containers running in the Hadoop cluster. The xd.customModule.home property is by default set to the value ${spring.hadoop.fsUri}/xd/yarn/custom-modules for YARN deployments. This can be modified, but we recommend keeping it at a location on HDFS within the same Hadoop cluster.

Customizing module configurations

The configurations for all standard XD modules can be customized by modifying the
file modules.yml in the config directory and then adding it to the modules-config.zip
archive in the same directory.

You can run the following command from the config directory to
achieve this:

jar -uf modules-config.zip modules.yml

Modify container logging

Logging configuration for XD admins and containers is defined in the files config/xd-admin-logger.properties and config/xd-container-logger.properties respectively. These two files are copied over to HDFS during the deployment. If you want to modify the logging configuration, either modify the source files and deploy again, or modify the files in HDFS directly.

Control XD YARN application lifecycle

Change your current directory to the spring-xd-<version>-yarn directory that was created when you unzipped the distribution. To read about runtime configuration and more sophisticated features, see the section Working with container groups.

Pay attention to the APPLICATION ID listed in the output, because that is the id used in most of the control commands to communicate with a specific application instance. For example, you may have multiple XD YARN runtime instances running.

Configuring YARN memory reservations

The YARN NodeManager continuously tracks how much memory is used by individual YARN containers. If containers use more memory than the configuration allows, they are simply killed by the NodeManager. The application master controlling the app lifecycle is given a little more freedom, meaning that the NodeManager is not as aggressive when deciding whether a container should be killed.

Let’s take a quick look at the memory-related settings in a YARN cluster and in YARN applications. The configuration below shows what a default vanilla Apache Hadoop installation uses for memory-related settings. Other distributions may have different defaults.

yarn.nodemanager.pmem-check-enabled

Enables a check of the physical memory of a process. This check, if enabled, directly tracks the amount of memory requested for a YARN container.

yarn.nodemanager.vmem-check-enabled

Enables a check of the virtual memory of a process. This is the setting that usually causes containers of custom YARN applications to be killed by a node manager. Usually the actual ratio between physical and virtual memory is higher than the default 2.1, or bugs in an OS cause an incorrect calculation of the used virtual memory.

yarn.nodemanager.vmem-pmem-ratio

Defines the ratio of allowed virtual memory compared to physical memory. This ratio simply defines how much virtual memory a process can use, but the actual tracked size is always calculated from the physical memory limit.

yarn.scheduler.minimum-allocation-mb

Defines the minimum allocated memory for a container.

Note

This setting also indirectly defines the actual physical memory limit requested during a container allocation. The actual physical memory limit is always rounded up to a multiple of this setting. For example, if this setting is left at the default 1024 and a container is requested with 512M, 1024M is used. However, if the requested size is 1100M, the actual size is set to 2048M.

yarn.scheduler.maximum-allocation-mb

Defines the maximum allocated memory for a container.

yarn.nodemanager.resource.memory-mb

Defines how much memory a node controlled by a node manager is allowed to allocate. This setting should be set to the amount of memory the OS is able to give to YARN-managed processes without causing the OS to swap, etc.

Tip

If you are testing the XD YARN runtime on a single computer with a multi-VM Hadoop cluster, a pro tip is to set both yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled to false, set yarn.scheduler.minimum-allocation-mb much lower (to either 256 or 512), and set yarn.nodemanager.resource.memory-mb 15%-20% below the defined VM memory.

There are three memory settings for the components participating in the XD YARN runtime. You can use the configuration properties spring.xd.appmasterMemory, spring.xd.adminMemory and spring.xd.containerMemory respectively.
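For example (the values are illustrative; check the comments in your version’s servers.yml for the expected units and format):

spring:
  xd:
    appmasterMemory: 512M
    adminMemory: 512M
    containerMemory: 768M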

Working with container groups

Container grouping and clustering is a more sophisticated feature that allows better control of XD admins and containers at runtime. Its basic features are:

Control the members in a group.

Control the lifecycle state of a group as a whole.

Create groups dynamically.

Restart failed containers.

The XD YARN runtime has a few built-in groups to get you started. Two groups, admin and container, are created by default; each launches exactly one container chosen randomly from the YARN cluster.

In the example above we used the option -w, which is a shortcut for defining a YARN allocation that uses wildcard requests, allowing containers to be requested from any host.

Create a new group

You typically create a new group when you need to add new XD admin or container nodes to a current system with different settings. These settings usually differ in the colocation of containers. For more about built-in group configuration, refer to the section Built-in group configurations.

Introduction to YARN resource allocation

This section describes some background on how YARN resource allocation works, what its limitations are, and more importantly how it is reflected in the XD YARN runtime.

Note

More detailed information on resource allocation can be found in the Spring for Apache Hadoop reference documentation.

YARN, having strong roots in the original MapReduce framework, imposes relatively strange concepts of where containers are to be executed. In the MapReduce world, every map and reduce task is executed in its own container, where colocation is usually determined by the physical location of the HDFS file blocks the map or reduce tasks are accessing. This introduces the concepts of allocating containers on any hosts, specific hosts, or specific racks. Usually YARN tries to place a container as close as possible to the physical location to minimize network IO; i.e. if a host cannot be chosen, a rack is chosen instead, assuming the whole rack is connected together with a fast switch.

For custom YARN applications like the XD YARN runtime this doesn’t necessarily make that much sense, because we’re not hard-tied to HDFS file blocks. What does make sense is that we can still place containers on different racks to get better high availability in case a whole rack goes down, or place specific containers on specific hosts to access custom physical or network resources. A good example of a need to execute something on a specific host is disk access, or outbound internet access if the cluster is highly secured.

One other YARN resource allocation concept worth mentioning is the relaxation of container locality. This simply means that if resources are requested from hosts or racks, YARN will relax those requests if the resources cannot be allocated immediately. Turning the relax flag off guarantees that containers will be allocated from the requested hosts or racks, though such requests will then wait forever if the allocation cannot be done.

Application Configuration

Introduction

There are two main parts of Spring XD that can be configured: servers and modules.

The servers (xd-singlenode, xd-admin, xd-container) are Spring Boot applications and are configured as described in the Spring Boot Reference documentation. In the simplest case this means editing values in the YAML-based configuration file servers.yml. The values in this configuration file override the values in the default application.yml file that is embedded in the XD jar.

Note

The use of YAML is an alternative to using property files. YAML is a superset of JSON, and as such is a very convenient format for specifying hierarchical configuration data.

For modules, each module has its own configuration file located in its own directory, for example source/http/http.properties. Shared configuration values for modules can be placed in a common modules.yml file.

For both server and module configuration, you can have environment specific settings through the use of application profiles and the ability to override values in files by setting OS environment variables.

In this section we will walk though how to configure servers and modules.

Server Configuration

The startup scripts for xd-singlenode, xd-admin, and xd-container will by default look for the file $XD_HOME/config/servers.yml as a source of externalized configuration information.

The location and name of this resource can be changed by using the environment variables XD_CONFIG_LOCATION and XD_CONFIG_NAME. The startup script takes the value of these environment variables to set the Spring Boot properties spring.config.location and spring.config.name. Note that for XD_CONFIG_LOCATION you can reference any Spring Resource implementation, most commonly denoted using the prefixes classpath:, file: and http:.

It is common to keep your server configuration separate from the installation directory of XD itself. To do this, here is an example environment variable setting:
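For example (the directory is hypothetical; note the trailing "/"):

export XD_CONFIG_LOCATION=file:/opt/xd/custom-config/
export XD_CONFIG_NAME=servers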

Note: the file path separator ("/") at the end of XD_CONFIG_LOCATION is necessary.

Profile support

Profiles provide a way to segregate parts of your application configuration and change their availability and/or values based on the environment. This lets you have different configuration settings for qa and prod environments and to easily switch between them.

To activate a profile, set the OS environment variable SPRING_PROFILES_ACTIVE to a comma delimited list of profile names. The server looks to load profile specific variants of the servers.yml file based on the naming convention servers-{profile}.yml. For example, if SPRING_PROFILES_ACTIVE=prod the following files would be searched for in the following order.

XD_CONFIG_LOCATION/servers-prod.yml

XD_CONFIG_LOCATION/servers.yml

You may also put multiple profile specific configuration in a single servers.yml file by using the key spring.profiles in different sections of the configuration file. See Multi-profile YAML documents for more information.
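A minimal sketch of such a file (the port values are illustrative): the first document applies to all profiles, and the second overrides it when the prod profile is active.

server:
  port: 9393
---
spring:
  profiles: prod
server:
  port: 9494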

Database Configuration

Spring XD saves the state of the batch job workflows in a relational database. When running xd-singlenode an embedded HSQLDB database instance is started automatically. When running in distributed mode a standalone HSQLDB instance can be used. A startup script hsqldb-server is provided in the installation directory under the folder hsqldb/bin. It is recommended to use HSQLDB only for development and learning.

When deploying in a production environment, you will need to select another database. Spring XD is primarily tested on HSQLDB, MySQL and Postgres, but Apache Derby and Oracle are supported as well. All batch workflow tables are automatically created, if they do not exist, when you use HSQLDB, MySQL, Postgres, Apache Derby or Oracle. The JDBC driver jars for HSQLDB and Postgres are already on the XD classpath. If you use MySQL, Apache Derby or Oracle for your batch repository database, you will need to copy the corresponding JDBC jar into the lib directory under $XD_HOME ($XD_HOME/lib) before starting Spring XD.

Note

If you access any database other than HSQLDB or Postgres in a stream module then the JDBC driver jar for that database needs to be present in the $XD_HOME/lib directory.

The provided configuration file servers.yml located in $XD_HOME/config has commented out configuration for some commonly used databases. You can use these as a basis to support your database environment. XD also utilizes the Tomcat jdbc connection pool and these settings can be configured in the servers.yml.

Note

Until full schema support is added for Sybase and other databases, you will need to put a .jar file in the xd/lib directory that contains the equivalent functionality as these DDL scripts.

Note

There was a schema change in version 1.0 RC1. Use or adapt the sample migration class to update your schema.

HSQLDB

When in distributed mode and you want to use HSQLDB, you need to change the value of spring.datasource properties. As an example,
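A sketch of such settings (the hsql.server placeholder names and default values are illustrative; verify them against the comments in your servers.yml):

spring:
  datasource:
    url: jdbc:hsqldb:hsql://${hsql.server.host:localhost}:${hsql.server.port:9101}/${hsql.server.dbname:xdjob}
    username: sa
    password:
    driverClassName: org.hsqldb.jdbc.JDBCDriver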

The properties under hsql.server are substituted into the spring.datasource.url property value. This lets you create short variants of existing Spring Boot properties. Using this style, you can override the value of these configuration variables by setting an OS environment variable, such as xd_server_host. Alternatively, you can avoid placeholders and set spring.datasource.url directly to known values.

MySQL

When in distributed mode and you want to use MySQL, you need to change the value of spring.datasource.* properties. As an example,
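A sketch of such settings (the database name and credentials are illustrative):

spring:
  datasource:
    url: jdbc:mysql://localhost:3306/xdjob
    username: xd
    password: yourpassword
    driverClassName: com.mysql.jdbc.Driver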

A comma-separated list of RabbitMQ server addresses (a single entry when not clustering).

A comma-separated list of RabbitMQ management plugin URLs - only used when nodes contains more than one entry.
Entries in this list must correspond to the corresponding entry in addresses.

A comma-separated list of RabbitMQ node names; when more than one entry, used to locate the server address where
a queue is located.
Entries in this list must correspond to the corresponding entry in addresses.

The user name.

The password.

The virtual host.

True to use SSL for the AMQP protocol.

The location of the SSL properties file, when certificate exchange is used.

Discrete SSL configuration properties as an alternative to a properties file.

The channel cache size; this should be set to a value large enough to avoid opening and closing channels at a high rate.
Examine the channel creation via the RabbitMQ Admin UI to determine a suitable value. Defaults to 100.

To override these settings set an OS environment variable such as spring_rabbitmq_host to the value you require.

When configuring a clustered environment, with
High Availability Queues, it is possible to configure the
bus so that it consumes from the node where the queue is located.
This is facilitated by the LocalizedQueueConnectionFactory which determines the node for a queue.
To enable this feature, add the list of nodes to the spring.rabbitmq.nodes property.
These nodes correspond to the broker addresses in the corresponding place in the spring.rabbitmq.addresses property.
The size of these lists must be identical (when the nodes property has more than one entry).
The spring.rabbitmq.adminAddresses property contains the corresponding URLs for the admins on those same nodes.
Again, the property list must be the same length.
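A sketch of such a clustered configuration follows; host and node names are placeholders, and the entries in the three lists are positionally aligned:

```yaml
# Illustrative three-node cluster configuration (host names are placeholders)
spring:
  rabbitmq:
    addresses: host1:5672,host2:5672,host3:5672
    adminAddresses: http://host1:15672,http://host2:15672,http://host3:15672
    nodes: rabbit@host1,rabbit@host2,rabbit@host3
    username: guest
    password: guest
```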

In addition, the following default settings for the rabbit message bus can be modified in servers.yml…​

When the bus (or a stream module deployment) is configured to compress messages, specifies the compression level. See java.util.zip.Deflater for available values; defaults to 1 (BEST_SPEED)

RabbitMQ headers longer than this value are not converted to String; instead they are made available as a
DataInputStream; these are currently not properly re-converted during output conversion.
If you expect headers longer than this, increase this setting appropriately if you wish them to pass to downstream
modules.

When true, the bus will automatically declare dead letter queues and binding for each bus queue. The user is responsible for setting a policy on the broker to enable dead-lettering; see Message Bus Configuration for more information. The bus will configure a dead-letter-exchange (<prefix>DLX) and bind a queue with the name <original queue name>.dlq and route using the original queue name

The time in milliseconds before retrying a failed message delivery

The maximum time (ms) to wait between retries

The back off multiplier (previous interval x multiplier = next interval)

When batching is enabled, the size of the buffer that will cause a batch to be released (overrides batchSize)

True to enable message batching by producers

The number of messages in a batch (may be preempted by batchBufferLimit or batchTimeout)

The minimum number of consumer threads receiving messages for a module

When true queues for subscriptions to publish/subscribe named channels (tap:, topic:) will be declared as durable and are eligible for dead-letter configuration according to the autoBindDLQ setting.

The maximum number of delivery attempts. Setting this to 1 disables the retry mechanism and requeue must be set to false if you wish failed messages to be rejected or routed to a DLQ. Otherwise deliveries
will be attempted repeatedly, with no termination. Also see republishToDLQ

The maximum number of consumer threads receiving messages for a module

A prefix applied to all queues and exchanges so that policies (HA etc.) can be applied

The number of messages to prefetch for each consumer

Determines which reply headers will be transported

By default, failed messages after retries are exhausted are rejected. If a dead-letter queue (DLQ) is configured, rabbitmq will route the failed message (unchanged) to the DLQ. Setting this property to true instructs the bus to republish failed messages to the DLQ, with additional headers, including the exception message and stack trace from the cause of the final failure. Note that the republish will occur even if maxAttempts is only set to 1. Also see autoBindDLQ

Determines which request headers will be transported

Whether rejected messages will be requeued by default

Whether the channel is to be transacted

The number of messages to process between acks (when ack mode is AUTO).
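As a hedged sketch, overriding a few of these defaults in servers.yml might look like the following; the exact nesting under xd.messagebus.rabbit.default is an assumption, so check the commented-out examples in your servers.yml for the authoritative key names:

```yaml
# Illustrative override of selected rabbit message bus defaults
# (key nesting is an assumption; verify against your servers.yml)
xd:
  messagebus:
    rabbit:
      default:
        autoBindDLQ: false
        compressionLevel: 1
        maxAttempts: 3
        prefetch: 1
        requeue: true
        transacted: false
```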

Kafka

If you want to use Kafka as a data transport, the following connection settings, as well as defaults for the kafka
message bus can be modified in servers.yml. Starting with release 1.2, Spring XD only supports Kafka 0.8.2 or higher.

Note

To ensure the proper functioning of the Kafka Message Bus, you must enable log cleaning in your Kafka
configuration. This is set using the configuration variable log.cleaner.enable=true.
See the Kafka documentation for additional
configuration options for log cleaning.

Note

At this time, the Kafka message bus does not support job processing.

Note

The Kafka message bus does not support count=0 for module deployments, and therefore, it does not support
direct binding of modules. This feature will be available in a future release. In the meantime, if direct communication
between modules is necessary for Kafka deployments, composite modules should be used instead.

How the bus handles headers and serialization: embeddedHeaders supports Spring Integration header embedding and
bus-managed serialization based on embedded content type headers, whereas raw mode will operate only with byte array
data, will not embed headers and will leave the handling of serialization to the user.

Where the bus stores offsets: kafkaTopic is similar to pre-1.3 behaviour of using a dedicated Kafka topic, and
kafkaNative relies on the native topic-based Kafka offset storage support compatible with Kafka 0.8.2 and later.

A list of custom headers to be transported by the bus.

The name of the topic where the Kafka Message Bus will store offsets (must be a compacted topic - Spring XD will
attempt to create a compacted topic by default).

The segment size for the offset topic if offsetManagement is kafkaTopic.

The retention time for the offset topic if offsetManagement is kafkaTopic.

The number of required acks for the offset topic if offsetManagement is kafkaTopic.

The maximum fetch size when reading from the offset topic if offsetManagement is kafkaTopic.

The batch size (in bytes) for the producers writing to the offset topic if offsetManagement is kafkaTopic.

Upper bound for the offset topic producer delay for batching

The frequency (in milliseconds) with which offsets are saved (mutually exclusive with offsetUpdateCount)

The frequency (in message counts) with which offsets are saved (mutually exclusive with offsetUpdateTimeWindow)

The timeout for shutting down offset management and ensuring that the latest offset updates have been pushed.

The amount of data (in bytes) that the producer will try to buffer before sending data to brokers.

Timeout (in milliseconds) for batching data on the producer side. A value of zero (default) means that data will be
sent out immediately as available.

The replication factor of the topics created by the message bus. At least as many brokers must be in the cluster
when the topic is being created.

The maximum number of consumer threads receiving messages for a module. The total number of threads actively
consuming partitions across all the instances of a specific module cannot be larger than the partition count of a
transport topic - therefore, if such a situation occurs, some modules instances will, in fact, use less consumer
threads.

The number of required acks when producing messages, i.e. how many brokers have committed data to the logs and
acknowledged this to the leader. Special values are -1, meaning all in-sync replicas, and 0 indicating that no
acks are necessary.

Enables compression for the bus and sets the compression codec.

The maximum size of the internal message queue (in messages), per consumer processing thread. It must be a power
of 2.

The maximum amount of time that the consumers will wait to fetch data from a broker (if less than fetchSize is
available)

The maximum amount of data that the consumers will try to fetch, per broker, in one polling cycle.

The minimum number of partitions that will be used by a bus topic.

When true, a synchronous producer is used

If the synchronous producer is enabled, the timeout to wait for Kafka delivery (in ms). If <= 0, wait forever.

Admin Server HTTP Port

The default HTTP port of the xd-admin server is 9393. To change the value use the following configuration setting
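A sketch of such a servers.yml setting, following standard Spring Boot server.port conventions (the port value is a placeholder):

```yaml
server:
  port: 9876
```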

Admin Server Security

By default, the Spring XD admin server is unsecured and runs on an unencrypted HTTP connection. You can secure your administration REST endpoints, as well as the Admin UI by enabling HTTPS and requiring clients to authenticate.

Enabling HTTPS

By default, the administration, management, and health endpoints, as well as the Admin UI use HTTP as a transport. You can switch to HTTPS easily, by adding a certificate to your configuration in servers.yml
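A hypothetical sketch of such a servers.yml section follows; the key names mirror the option descriptions below but are assumptions, and all paths and passwords are placeholders:

```yaml
# Hypothetical HTTPS configuration sketch for the admin server profile
# (key names mirror the option descriptions below; values are placeholders)
---
spring:
  profiles: admin
security:
  ssl:
    sslEnabled: true
    keyAlias: yourKeyAlias
    keyStore: path/to/keystore
    keyStorePassword: yourKeyStorePassword
    keyPassword: yourKeyPassword
    trustStore: path/to/trust-store
    trustStorePassword: yourTrustStorePassword
```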

The settings are applicable only to the admin server (regardless whether it’s started in single-node mode or as a separate instance).

The alias (or name) under which the key is stored in the keystore.

The path to the keystore file. Classpath resources may also be specified, by using the classpath prefix: classpath:path/to/keystore

The password of the keystore.

The password of the key.

The path to the truststore file. Classpath resources may also be specified, by using the classpath prefix: classpath:path/to/trust-store

The password of the trust store.

Note

If HTTPS is enabled, it will completely replace HTTP as the protocol over which the REST endpoints and the Admin UI interact. Plain HTTP requests
will fail - therefore, make sure that you configure your Shell accordingly.

Enabling authentication

By default, the REST endpoints (administration, management and health), as well as the Admin UI do not require authenticated access. By turning on authentication on the admin server:

the REST endpoints will require Basic authentication for access;

the Admin UI will be accessible after signing in through a web form.

Note

When authentication is set up, it is strongly recommended to enable HTTPS as well, especially in production environments.

You can turn on authentication by adding the following to the configuration in servers.yml:
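A sketch of such a configuration follows; the security.user.name and security.user.password keys follow Spring Boot conventions and correspond to the username and password options described below, but the exact layout should be verified against your servers.yml:

```yaml
# Hypothetical sketch enabling basic authentication on the admin server
# (key names follow Spring Boot conventions; credentials are placeholders)
---
spring:
  profiles: admin
security:
  basic:
    enabled: true
  user:
    name: admin
    password: whosThere
```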

The username for authentication (must be used by REST clients and in the Admin UI). Will default to user if not explicitly set.

The password for authentication (must be used by REST clients and in the Admin UI). If not explicitly set, it will be auto-generated, as described in the Spring Boot documentation.

LDAP authentication

Spring XD also supports authentication against an LDAP server, in both direct bind and "search and bind" modes. When the LDAP authentication option is activated, the default single user mode is turned off.

In direct bind mode, a pattern is defined for the user’s distinguished name (DN), using a placeholder for the username.
The authentication process derives the distinguished name of the user by replacing the placeholder, and uses it, along with the supplied password, to authenticate the user against the LDAP server.
You can set up LDAP direct bind as follows:
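A hypothetical sketch of such a configuration (the userDnPattern key name is an assumption matching the option described below; URL and DN pattern are placeholders):

```yaml
# Hypothetical LDAP direct bind sketch (key names and values are placeholders)
security:
  basic:
    enabled: true
  authentication:
    ldap:
      enabled: true
      url: ldap://ldap.example.com:389
      userDnPattern: uid={0},ou=people,dc=example,dc=com
```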

The distinguished name (DN) pattern for authenticating against the server.

The "search and bind" mode involves connecting to an LDAP server, either anonymously or with a fixed account, and searching
for the distinguished name of the authenticating user based on its username, and then using the resulting value and the supplied password for binding to the LDAP server.
This option is configured as follows:
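A hypothetical sketch of a "search and bind" configuration; the key names are assumptions and all server, DN, and filter values are placeholders:

```yaml
# Hypothetical LDAP "search and bind" sketch (key names and values are placeholders)
security:
  basic:
    enabled: true
  authentication:
    ldap:
      enabled: true
      url: ldap://ldap.example.com:389
      managerDn: cn=admin,dc=example,dc=com
      managerPassword: managerSecret
      userSearchBase: ou=people,dc=example,dc=com
      userSearchFilter: uid={0}
```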

Each map "value" is made of a password and one or more roles, comma separated

Customizing authorization

All of the above deals with authentication, i.e. how to assess the identity of the user. Irrespective of the option chosen, you can
also customize authorization, i.e. who can do what.

The default scheme uses three roles to protect the REST endpoints that Spring XD exposes:

ROLE_VIEW for anything that relates to retrieving state

ROLE_CREATE for anything that involves creating, deleting or mutating the state of the system

ROLE_ADMIN for boot management endpoints.

All of those defaults are written out in application.yml, which you can choose to override via servers.yml. This takes the form
of a YAML list (as some rules may have precedence over others), so you’ll need to copy/paste the whole list and tailor it to your needs (as there is no way to merge lists). Always refer to your version of application.yml, as the snippet reproduced below may be outdated. The default rules are as such:
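As an illustrative subset only (the rule format and endpoints shown are reconstructed, not verbatim; always copy the full list from your application.yml):

```yaml
# Illustrative subset of authorization rules - not the complete default list
security:
  authorization:
    rules:
      - GET /streams/** => hasRole('ROLE_VIEW')
      - POST /streams/** => hasRole('ROLE_CREATE')
      - GET /management/** => hasRole('ROLE_ADMIN')
```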

Each rule is made up of an HTTP method, a URL pattern and a security attribute (a SpEL expression), each of those separated by one or several blank characters (spaces, tabs, etc.)

Be mindful that the above is indeed a YAML list, not a map (thus the use of - dashes at the start of each line) that lives under the security.authorization.rules key.

Cross-origin resource sharing (CORS)

Cross-origin resource sharing (CORS) is a mechanism that allows restricted resources (e.g. fonts) on a web page to be requested from another domain outside the domain from which the resource originated.

We do set a default value of http://localhost:9889 in the internal application.yml file that is embedded inside the Spring XD jars.

application.yml

xd:
  …
  ui:
    …
    allow_origin: "http://localhost:9889"
    …

In order to customize this, set the xd.ui.allow_origin property in your servers.yml file for the admin server profile by adding the following section:

servers.yml

---
spring:
  profiles: admin
xd:
  ui:
    allow_origin: "*"
---

For example, if you set the value to "*" (asterisk), Spring XD should accept requests from any domain. Please make sure to wrap the asterisk with double quotes.

Under the hood the value will set the CORS Access-Control-Allow-Origin header in the AccessControlInterceptor via the RestConfiguration class.

Local transport

Local transport uses a QueueChannel to pass data between modules. There are a few properties you can configure on the QueueChannel

xd.local.transport.named.queueSize - The capacity of the queue, the default value is Integer.MAX_VALUE

xd.local.transport.named.polling - Messages that are buffered in a QueueChannel need to be polled to be consumed. This property controls the fixed rate at which polling occurs. The default value is 1000 ms.
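A sketch of overriding these two properties in servers.yml, using the property names listed above mapped to YAML nesting (the values shown are arbitrary examples):

```yaml
# Sketch: local transport QueueChannel tuning (values are examples only)
xd:
  local:
    transport:
      named:
        queueSize: 1000
        polling: 100
```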

Serialization

Serialization is used by remote transport. Please see the section on Optimizing Serialization for a
detailed discussion of configuration options.

Module Configuration

Modules are configured by placing property files in a nested directory structure based on their type and name. The root of the nested directory structure is by default XD_HOME/config/modules. This location can be customized by setting the OS environment variable XD_MODULE_CONFIG_LOCATION, similar to how the environment variable XD_CONFIG_LOCATION is used for configuring the server. If XD_MODULE_CONFIG_LOCATION is set explicitly, then it is necessary to add the file path separator ("/") at the end of the path.

Note

If XD_MODULE_CONFIG_LOCATION is set to an explicit location, make sure to copy the entire directory structure from the default module config location xd/config/modules into the new module config location. The XD_MODULE_CONFIG_LOCATION can reference any Spring Resource implementation, most commonly denoted using the prefixes classpath:, file: and http:.

As an example, if you wanted to configure the twittersearch module, you would create the file XD_MODULE_CONFIG_LOCATION/source/twittersearch/twittersearch.properties.

Values in XD_MODULE_CONFIG_LOCATION/<type>/<name>/<name>.properties can be property placeholder references to keys defined in another resource location. By default the resource is the file XD_MODULE_CONFIG_LOCATION/modules.yml. You can customize the name of the resource by setting the OS environment variable XD_MODULE_CONFIG_NAME before running a server startup script.

The modules.yml file can be used to specify the values of keys that should be shared across different modules. For example, it is common to use the same twitter developer credentials in both the twittersearch and twitterstream modules. To avoid repeating the same credentials in two property files, you can use the following setup.
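A sketch of such a setup follows; the key names under twitter and the exact property keys expected by the modules are illustrative, so check each module's options for the real key names:

```
# XD_MODULE_CONFIG_LOCATION/modules.yml  (shared values; key names illustrative)
twitter:
  consumerKey: yourConsumerKey
  consumerSecret: yourConsumerSecret

# XD_MODULE_CONFIG_LOCATION/source/twittersearch/twittersearch.properties
consumerKey=${twitter.consumerKey}
consumerSecret=${twitter.consumerSecret}

# XD_MODULE_CONFIG_LOCATION/source/twitterstream/twitterstream.properties
consumerKey=${twitter.consumerKey}
consumerSecret=${twitter.consumerSecret}
```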

Profiles

When resolving property file names, the server will look to load profile-specific variants based on the naming convention <name>-{profile}.properties. For example, given the OS environment variable spring_profiles_active=default,qa the following configuration file names for the twittersearch module would be searched in this order:

XD_MODULE_CONFIG_LOCATION/source/twittersearch/twittersearch.properties

XD_MODULE_CONFIG_LOCATION/source/twittersearch/twittersearch-default.properties

XD_MODULE_CONFIG_LOCATION/source/twittersearch/twittersearch-qa.properties

Also, the shared module configuration file is referenced using profile variants, so given the OS environment variable spring_profiles_active=default,qa the following shared module configuration files would be searched for in this order

XD_MODULE_CONFIG_LOCATION/modules.yml

XD_MODULE_CONFIG_LOCATION/modules-default.yml

XD_MODULE_CONFIG_LOCATION/modules-qa.yml

Batch Jobs or modules accessing JDBC

Another common case is access to a relational database from a job or the JDBC Sink module.

As an example, to provide the properties for the batch job jdbchdfs the file XD_MODULE_CONFIG_LOCATION/job/jdbchdfs/jdbchdfs.properties should contain
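a set of datasource properties along the following lines; the key names follow the datasource conventions used elsewhere in this chapter and may differ for your version, and all connection values are placeholders:

```
# Hypothetical jdbchdfs.properties (key names and values are placeholders)
driverClassName=com.mysql.jdbc.Driver
url=jdbc:mysql://yourDBhost:3306/yourDB
username=yourUsername
password=yourPassword
```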

A property file with the same keys, but likely different values, would be located in XD_MODULE_CONFIG_LOCATION/sink/jdbc/jdbc.properties.

Encrypted Properties

If you wish to encrypt passwords and other secret values stored in application configuration files, you must provide a Spring bean that implements
TextEncryptor and
implements the decrypt method. The bean, annotated with @Component, or a @Configuration class providing the bean definition,
must be present under the base package spring.xd.ext.encryption.

DSL Guide

Introduction

Spring XD provides a DSL for defining a stream. Over time the DSL is likely to evolve significantly as it gains the ability to define more and more sophisticated streams as well as the steps of a batch job.

Pipes and filters

A simple linear stream consists of a sequence of modules. Typically an Input Source, (optional) Processing Steps, and an Output Sink. As a simple example consider the collection of data from an HTTP Source writing to a File Sink. Using the DSL the stream description is:

http | file

A stream that involves some processing:

http | filter | transform | file

The modules in a stream definition are connected together using the pipe symbol |.

Module parameters

Each module may take parameters. The parameters supported by a module are defined by the module implementation. As an example the http source module exposes port setting which allows the data ingestion port to be changed from the default value.

http --port=1337

It is only necessary to quote parameter values if they contain spaces or the | character. Here the transform processor module is being passed a SpEL expression that will be applied to any data it encounters:

transform --expression='new StringBuilder(payload).reverse()'

If the parameter value needs to embed a single quote, use two single quotes:
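For instance, a hypothetical filter whose SpEL expression compares the payload to the literal string good (the module and expression are illustrative):

```
filter --expression='payload == ''good'''
```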

Named channels

Instead of a source or sink it is possible to use a named channel. Normally the modules in a stream are connected
by anonymous internal channels (represented by the pipes), but by using explicitly named channels it becomes
possible to construct more sophisticated flows. In keeping with the unix theme, sourcing/sinking data from/to a particular channel uses the > character. A named channel is specified by using a channel type, followed by a : followed by a name. The channel types available are:

queue - this type of channel has point-to-point (p2p) semantics

topic - this type of channel has pub/sub semantics

Here is an example that shows how you can use a named channel to share a data pipeline driven by different input sources.

queue:foo > file

http > queue:foo

time > queue:foo

Now if you post data to the http source, you will see that data intermingled with the time value in the file.

The opposite case, the fanout of a message to multiple streams, is planned for a future release. However, taps are a specialization of named channels that do allow publishing data to multiple sinks. For example:

tap:stream:mystream > file

tap:stream:mystream > log

Once data is received on mystream, it will be written to both file and log.

Support for routing messages to different streams based on message content is also planned for a future release.

Labels

Labels provide a means to alias or group modules. Labels are simply a name followed by a :
When used as an alias a label can provide a more descriptive name for a
particular configuration of a module and possibly something easier to refer to in other streams.

Refer to this section of the Taps chapter to see how labels facilitate the creation of taps in these cases where a stream contains ambiguous modules.

Single quotes, Double quotes, Escaping

Spring XD is a complex runtime that involves a lot of systems when you look at the complete picture. There is a Spring Shell based client that talks to the admin that is responsible for parsing. In turn, modules may themselves rely on embedded languages (like the Spring Expression Language) to accomplish their behavior.

Those three components (shell, XD parser and SpEL) have rules about how they handle quotes and how syntax escaping works, and when stacked with each other, confusion may arise. This section explains the rules that apply and provides examples to the most common situations.

Note

It’s not always that complicated

This section focuses on the most complicated cases, when all 3 layers are involved. Of course, if you don’t use the XD shell (for example if you’re using the REST API directly) or if module option values are not SpEL expressions, then escaping rules can be much simpler

Spring Shell

Arguably, the most complex component when it comes to quotes is the shell. The rules can be laid out quite simply, though:

a shell command is made of keys (--foo) and corresponding values. There is a special, key-less mapping though, see below

a value can not normally contain spaces, as space is the default delimiter for commands

spaces can be added though, by surrounding the value with quotes (either single ['] or double ["] quotes)

if surrounded with quotes, a value can embed a literal quote of the same kind by prefixing it with a backslash (\)

Other escapes are available, such as \t, \n, \r, \f and unicode escapes of the form \uxxxx

Lastly, the key-less mapping is handled in a special way, in the sense that it does not need quoting to contain spaces

For example, the XD shell supports the ! command to execute native shell commands. The ! accepts a single, key-less argument. This is why the following works:

xd:>! rm foo

The argument here is the whole rm foo string, which is passed as is to the underlying shell.

As another example, the following commands are strictly equivalent, and the argument value is foo (without the quotes):

This uses single quotes to protect the whole argument, hence actual single quotes need to be doubled

But SpEL recognizes String literals with either single or double quotes, so this last method is arguably the best

Please note that the examples above are to be considered outside of the Spring XD shell. When entered inside the shell, chances are that the whole stream definition will itself be inside double quotes, which would need escaping. The whole example then becomes:
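As a hypothetical illustration (the stream name and expression are placeholders), a definition wrapped in the shell's double quotes, with the SpEL literal's single quotes doubled for the XD parser, might look like:

```
xd:>stream create --name mystream --definition "filter --expression='payload == ''good'''"
```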

SpEL syntax and SpEL literals

The last piece of the puzzle is about SpEL expressions. Many modules accept options that are to be interpreted as SpEL expressions, and as seen above, String literals are handled in a special way there too. Basically,

literals can be enclosed in either single or double quotes

quotes need to be doubled to embed a literal quote. Single quotes inside double quotes need no special treatment, and vice versa

As a last example, assume you want to use the transform module. That module accepts an expression option which is a SpEL expression. It is to be evaluated against the incoming message, with a default of payload (which forwards the message payload untouched).
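Assuming the expression should evaluate to the literal string Hello, illustrative variants (reconstructed for this guide, not verbatim from the original examples) are:

```
transform --expression='''Hello'''
transform --expression="'Hello'"
transform --expression='"Hello"'
```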

This uses single quotes around the string (at the XD parser level), but they need to be doubled because we’re inside a string literal (very first single quote after the equals sign)

use single and double quotes respectively to encompass the whole string at the XD parser level. Hence, the other kind of quote can be used inside the string. The whole thing is inside the --definition argument to the shell though, which uses double quotes. So double quotes are escaped (at the shell level)
== Interactive Shell

Introduction

Spring XD includes an interactive shell that you can use to create, deploy, destroy and query streams and jobs. There are also commands to help with common tasks such as interacting with HDFS, the UNIX shell, and sending HTTP requests. In this section we will introduce the main commands and features of the shell.

Using the Shell

When you start the shell you can type help to show all the commands that are available. Note that since the XD shell is based on Spring Shell, you can contribute your own commands into the shell. The general groups of commands are related to the management of

Modules

Streams

Jobs

Analytics (Counters, Aggregate Counters, Gauges, etc.)

HDFS

For example to see what modules are available issue the command

xd:>module list

Tip

The list of all Spring XD specific commands can be found in the Shell Reference

The shell also provides extensive command completion capabilities. For example, if you type mod and hit TAB, you will be presented with all the matching commands.

Suppose we want to create a stream that uses the http source and file sink. How do we know what options are available to use? There are two ways to find out. The first is to use the command module info. Pressing TAB after typing module info will complete the command with the --name option and then present all the modules prefixed by their type.

xd:>module info --name sink:file
Information about sink module 'file':
Option Name Description Default Type
----------- ----------------------------------------------------------------- ----------------- --------
binary if false, will append a newline character at the end of each line false boolean
charset the charset to use when writing a String payload UTF-8 String
dir the directory in which files will be created /tmp/xd/output/ String
mode what to do if the file already exists APPEND Mode
name filename pattern to use ${xd.stream.name} String
suffix filename extension to use <none> String
inputType how this module should interpret messages it consumes <none> MimeType

Note that the default value ${xd.stream.name} will be resolved to the name of the stream that contains the module.

Tab completion for Job and Stream DSL definitions

When creating a stream definition, tab completion after --definition will enable you to see all the options that are available for a given module as well as a list of candidate modules for the subsequent module in the stream. For example, hitting TAB after http will show completions for its options.

Entering the port number and also the pipe | symbol and hitting TAB will show completions for candidate processor and sink modules. The same process of tab completion for module options applies to each module in the chain.

Executing a script

You can execute a script by either passing in the --cmdfile argument when starting the shell or by executing the script command inside the shell. When using scripts it is common to add comments using either // or ; characters at the start of the line for one-line comments, or use /* and */ for multiline comments.

Single quotes, Double quotes, Escaping

There are often three layers of parsing when entering commands into the shell. The shell parses the command to recognize -- options; inside the body of a stream/job definition the values are parsed until the first space character; and inside some command options SpEL is used (e.g. router). The interaction between these layers can cause some confusion. The DSL Guide section on quotes and escaping will help you if you run into any issues.

Admin UI

Introduction

Spring XD provides a browser-based GUI which currently has 4 sections:

Containers Provides the XD cluster view with the list of all running containers

Streams Deploy/undeploy Stream Definitions

Jobs Perform Batch Job related tasks

Analytics Create data visualizations for the various analytics modules

Containers

The Containers section of the admin UI shows the containers that are in the XD cluster. For each container the group properties and deployed modules are shown. More information on the container (hostname, pid, ip address) and on the modules (module options and deployment properties) is available by clicking on the respective links. You can also shut down a container (in distributed mode) by clicking on the shutdown button. You will be asked for confirmation if you choose to shut down.

Figure 2. List of Containers

Streams

The Streams section of the admin UI contains the Definitions tab, which provides a listing of Stream definitions. There you have the option to deploy or undeploy those streams. Additionally you can remove the definition by clicking on destroy.

Figure 3. List of Stream Definitions

Jobs

The Jobs section of the admin UI currently has four tabs specific for Batch Jobs

Modules

Definitions

Deployments

Executions

Modules

Modules encapsulate a unit of work into a reusable component. Within the XD runtime environment Modules allow users to create definitions for Streams as well as Batch Jobs. Consequently, the Modules tab within the Jobs section allows users to create Batch Job definitions. In order to learn more about Modules, please see the chapter on Modules.

List available batch job modules

This page lists the available batch job modules.

Figure 4. List Job Modules

On this screen you can perform the following actions:

View details such as the job module options.

Create a Job Definition from the respective Module.

Create a Job Definition from a selected Job Module

On this screen you can create a new Job Definition. At a minimum you must provide a name for the new definition. Optionally you can select whether the new definition shall be automatically deployed. Depending on the selected module, you will also have the option to specify various parameters that are used during the deployment of the definition.

Figure 5. Create a Job Definition

View Job Module Details

Figure 6. View Job Module Details

On this page you can view the details of a selected job module. The page lists the available options (properties) of the module.

List job definitions

This page lists the XD batch job definitions and provides actions to deploy, un-deploy or destroy those jobs.

Figure 7. List Job Definitions

List job deployments

This page lists all the deployed jobs and provides options to launch or schedule the deployed job.

Figure 8. List Job Deployments

Launching a batch Job

Once a job is deployed, it can be launched through the Admin UI as well. Navigate to the Deployments tab. Select the job you want to launch and press Launch. The following modal dialog should appear:

Figure 9. Launch a Batch Job with parameters

Using this screen, you can define one or more job parameters. Job parameters can be typed and the following data types are available:

String (The default)

Date (The default date format is: yyyy/MM/dd)

Long

Double

Schedule Batch Job Execution

Figure 10. Schedule a Batch Job

When clicking on Schedule, you have the option to run the job:

using a fixed delay interval (specified in seconds)

on a specific date/time

using a valid CRON expression

Job Deployment Details

On this screen, you can view additional deployment details. Besides viewing the stream definition, the available Module Metadata is shown as well, e.g. the Container to which the definition has been deployed.

Figure 11. Job Deployment Details

List job executions

This page lists the Batch Job Executions and provides the option to restart or stop a specific job execution, provided the operation is available.
Furthermore, you have the option to view the Job execution details.

Figure 12. List Job Executions

The list of Job Executions also shows the state of the underlying Job Definition. The following states can be shown.

The underlying Job Definition was undeployed.

The underlying Job Definition was deleted.

This is a composed Job Definition.

As an example, we have created the following composed job using the Spring XD shell:

This includes a link back to the Job History UI of the Hadoop Cluster.

Figure 18. Job History UI

Important

In case of exceptions, the Exit Description field will contain additional error information. Please be aware, though, that this field can only have a maximum of 2500 characters. Therefore, in case of long exception stacktraces, trimming of error messages may occur. In that case, please refer to the server log files for further details.

Step execution history

Figure 19. Step Execution History

On this screen, you can view various metrics associated with the selected step such as duration, read counts, write counts etc.

Analytics

The Analytics section of the admin UI provides dashboarding and data visualization capabilities for the various analytics modules available in Spring XD:

Counters

Aggregate Counters

Field-Value Counters

Gauges

Rich Gauges

For example, if you have created the springtweets stream and the corresponding counter in the Counter chapter, you can now easily create the corresponding graph from within the Dashboard tab:

Under Metric Type, select Counters from the select box

Under Stream, select tweetcount

Under Visualization, select the desired chart option, Bar Chart

You will see a visualization similar to the following:

Figure 20. Counter Bar Chart

Under Visualization, select the desired chart option, Graph Chart

You will see a visualization similar to the following:

Figure 21. Counter Graph

Using the icons to the right, you can add additional charts to the dashboard, re-arrange the order of the created charts, or remove data visualizations.

The remaining three tabs, Counters, Gauges and Rich-Gauges, provide default visualizations for all Spring XD counters within the system. For example, if you have created the Simple Tap Example (Rich Gauge) in the Counter chapter, you will see a visualization similar to the following under the Rich-Gauges tab:

Figure 22. Rich Gauge

Architecture

Introduction

Spring XD is a unified, distributed, and extensible service for data ingestion, real time analytics, batch processing, and data export. The foundations of XD’s architecture are based on more than 100 man-years of work that have gone into the Spring Batch, Integration and Data projects. Building upon these projects, Spring XD provides servers and a configuration DSL that you can immediately use to start processing data. You do not need to build an application yourself from a collection of jars to start using Spring XD.

Spring XD has two modes of operation - single and multi-node. The first is a single process that is responsible for all processing and administration. This mode helps you get started easily and simplifies the development and testing of your application. The second is a distributed mode, where processing tasks can be spread across a cluster of machines and an administrative server reacts to user commands and runtime events managed within a shared runtime state to coordinate processing tasks executing on the cluster.

Runtime Architecture

The key components in Spring XD are the XD Admin and XD Container Servers. Using a high-level DSL, you post the description of the required processing tasks to the Admin server over HTTP. The Admin server then maps the processing tasks into processing modules. A module is a unit of execution and is implemented as a Spring ApplicationContext. A distributed runtime is provided that will assign modules to execute across multiple XD Container servers. A single XD Container server can run multiple modules. When using the single node runtime, all modules are run in a single XD Container and the XD Admin server is run in the same process.

DIRT Runtime

A distributed runtime, called Distributed Integration Runtime, aka DIRT, will distribute the processing tasks across multiple XD Container instances. The XD Admin server breaks up a processing task into individual module definitions and assigns each module to a container instance using ZooKeeper (see XD Distributed Runtime). Each container listens for module definitions to which it has been assigned and deploys the module, creating a Spring ApplicationContext to run it.

Modules share data by passing messages using a configured messaging middleware (Rabbit, Redis, or Local for single node). To reduce the number of hops across messaging middleware between them, multiple modules may be composed into larger deployment units that act as a single module. To learn more about that feature, refer to the Composing Modules section.

Support for other distributed runtimes

In the 1.0 release, you can run Spring XD natively, in which case you are responsible for starting up the XD Admin and XD Container instances. Alternatively, you can run Spring XD on Hadoop’s YARN; see Running XD on YARN. Pivotal Cloud Foundry support is planned for a future release. If you are feeling adventurous, you can also take a look at our scripts for deploying Spring XD to EC2. These are used as part of our system integration tests.

Single Node Runtime

A single node runtime is provided that runs the Admin and Container servers, ZooKeeper, and HSQLDB in the same process. The single node runtime is primarily intended for testing and development purposes, but it may also be appropriate to use in small production use-cases. Communication with the XD Admin server is over HTTP, and the XD Admin server communicates with the in-process XD Container using an embedded ZooKeeper server.

Figure 24. Single Node Runtime

Admin Server Architecture

The Admin Server uses an embedded servlet container and exposes REST endpoints for creating, deploying, undeploying, and destroying streams and jobs, querying runtime state, analytics, and the like. The Admin Server is implemented using Spring’s MVC framework and the Spring HATEOAS library to create REST representations that follow the HATEOAS principle. The Admin Server and Container Servers monitor and update runtime state using ZooKeeper (see XD Distributed Runtime).

Container Server Architecture

The key components of data processing in Spring XD are

Streams

Jobs

Taps

Streams define how event driven data is collected, processed, and stored or forwarded. For example, a stream might collect syslog data, filter it, and store it in HDFS.

Jobs define how coarse grained and time consuming batch processing steps are orchestrated; for example, a job could be defined to coordinate performing HDFS operations and the subsequent execution of multiple MapReduce processing tasks.

Taps are used to process data in a non-invasive way as data is being processed by a Stream or a Job. Much like wiretaps used on telephones, a Tap on a Stream lets you consume data at any point along the Stream’s processing pipeline. The behavior of the original stream is unaffected by the presence of the Tap.

Figure 25. Taps, Jobs, and Streams

Streams

The programming model for processing event streams in Spring XD is based on the well known Enterprise Integration Patterns as implemented by components in the Spring Integration project. The programming model was designed so that it is easy to test components.

A Stream consists of the following types of modules:
* An Input source
* Processing steps
* An Output sink

An Input source produces messages from an external source. XD supports a variety of sources, e.g. syslog, tcp, http. The output from a module is a Spring Message containing a payload of data and a collection of key-value headers. Messages flow through message channels from the source, through optional processing steps, to the output sink. The output sink delivers the message to an external resource. For example, it is common to write the message to a file system, such as HDFS, but you may also configure the sink to forward the message over tcp, http, or another type of middleware, or route the message to another stream.

A stream that consists of an input source and an output sink is shown below.

Figure 26. Foundational components of the Stream processing model

A stream that incorporates processing steps is shown below

Figure 27. Stream processing with multiple steps

For simple linear processing streams, an analogy can be made with the UNIX pipes and filters model. Filters represent any component that produces, processes or consumes events. This corresponds to the modules (source, processing steps, and sink) in a stream. Pipes represent the way data is transported between the Filters. This corresponds to the Message Channel that moves data through a stream.
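The analogy can be made concrete with an actual UNIX pipeline, where echo plays the role of a source, tr a processing filter, and a file the sink:

```shell
# source | filter > sink: produce a line, transform it, store it
echo "hello world" | tr 'a-z' 'A-Z' > /tmp/stream-out.txt
cat /tmp/stream-out.txt   # prints HELLO WORLD
```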

A simple stream definition using UNIX pipes and filters syntax that takes data sent via an HTTP POST and writes it to a file (with no processing done in between) can be expressed as

http | file

The pipe symbol represents a message channel that passes data from the HTTP source to the File sink. The message channel implementation can either be backed with a local in-memory transport, Redis queues, or RabbitMQ. The message channel abstraction and the XD architecture are designed to support a pluggable data transport. Future releases will support other transports such as JMS.

Note that the UNIX pipes and filter syntax is the basis for the DSL that Spring XD uses to describe simple linear flows. Non-linear processing is partially supported using named channels which can be combined with a router sink to effectively split a single stream into multiple streams (see Dynamic Router Sink). Additional capabilities for non-linear processing are planned for future releases.

The programming model for processing steps in a stream originates from the Spring Integration project and is included in the core Spring Framework as of version 4. The central concept is a Message Handler class, which relies on simple coding conventions to map incoming messages to processing methods. For example, using an http source you can process the body of an HTTP POST request using the following class
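The original class listing is not reproduced here; a minimal sketch of such a handler class (the class name is illustrative) might look like:

```java
// A simple processing component. Spring XD passes the payload of the
// incoming Message (here, the body of the HTTP POST, as a String) to
// the process method; the non-void return value becomes the payload
// of the Message sent to the next module in the stream.
public class SimpleProcessor {

    public String process(String payload) {
        return payload.toUpperCase();
    }
}
```

Because the class has no framework dependencies, it can be unit tested in isolation.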

The payload of the incoming Message is passed as a string to the method process. The content of the payload is the body of the HTTP request, since we are using an http source. The non-void return value is used as the payload of the Message passed to the next step. These programming conventions make it very easy to test your Processor component in isolation. There are several processing components provided in Spring XD that do not require you to write any code, such as a filter and transformer that use the Spring Expression Language or Groovy. For example, adding a processing step, such as a transformer, in a stream processing definition can be as simple as

http | transformer --expression=payload.toUpperCase() | file

For more information on processing modules, refer to the Processors section.

Stream Deployment

The Container Server listens for module deployment events initiated from the Admin Server via ZooKeeper. When the container node handles a module deployment event, it connects the module’s input and output channels to the data bus used to transport messages during stream processing. In a single node configuration, the data bus uses in-memory direct channels. In a distributed configuration, the data bus communications are backed by the configured transport middleware. Redis and Rabbit are both provided with the Spring XD distribution, but other transports are envisioned for future releases.

Figure 28. A Stream Deployed in a single node server

Figure 29. A Stream Deployed in a distributed runtime

In the http | file example, the Admin assigns each module to a separate Container instance, provided there are at least two Containers available. The file module is deployed to one container and the http module to another. The definition of a module is stored in a Module Registry. A module definition consists of a Spring XML configuration file, some classes used to validate and handle options defined by the module, and dependent jars. The module definition contains variable placeholders, corresponding to DSL parameters (called options) that allow you to customize the behavior of the module. For example, setting the http listening port would be done by passing in the option --port, e.g. http --port=8090 | file, which is in turn used to substitute a placeholder value in the module definition.

The Module Registry is backed by the filesystem and corresponds to the directory <xd-install-directory>/modules. When a module deployment is handled by the Container, the module definition is loaded from the registry and a new Spring ApplicationContext is created in the Container process to run the module. Dependent classes are loaded via the Module Classloader, which first looks at jars in the module's /lib directory before delegating to the parent classloader.

Using the DIRT runtime, the http | file example would map onto the following runtime architecture

Figure 30. Distributed HTTP to File Stream

Data produced by the HTTP module is sent over a Redis Queue and is consumed by the File module. If there were a filter processing module in the stream definition, e.g. http | filter | file, it would map onto the following DIRT runtime architecture.

Figure 31. Distributed HTTP to Filter to File Stream

Jobs

The creation and execution of Batch jobs builds upon the functionality available in the Spring Batch and Spring for Apache Hadoop projects. See the Batch Jobs section for more information.

Taps

Taps provide a non-invasive way to consume the data that is being processed by either a Stream or a Job, much like a real time telephone wire tap lets you eavesdrop on telephone conversations. Taps are recommended as a way to collect metrics and perform analytics on a Stream of data. See the section Taps for more information.

Distributed Runtime

Introduction

This document describes what’s happening "under the hood" of the Spring XD Distributed Runtime (DIRT) and in particular, how the runtime architecture achieves high availability and failover in a clustered production environment. See Running in Distributed Mode for more information on installing and running Spring XD in distributed mode.

This discussion focuses on Spring XD’s core runtime components and the role of ZooKeeper in managing the state of the Spring XD cluster and enabling automatic recovery from failures.

Configuring Spring XD for High Availability (HA)

A production Spring XD environment is typically distributed among multiple hosts in a clustered environment. Spring XD scales horizontally when you add container instances. In the simplest case, all containers are replicas, that is, each instance is running on an identically configured host and modules are deployed to any available container in a round-robin fashion. However, this simplifying assumption does not address real production scenarios in which more control is required in order to optimize resource utilization. To this end, Spring XD supports a flexible algorithm which allows you to match module deployments to specific container configurations. The container matching algorithm will be covered in more detail later, but for now, let’s assume the simple case. Running multiple containers not only enables horizontal scalability, but also enables failure recovery. If a container becomes unavailable due to an unrecoverable connection loss, any modules currently deployed to that container will be deployed automatically to the other available instances.

Spring XD requires that a single active Admin server handle interactions with the containers, such as stream deployment requests, as these types of operations must be processed serially in the order received. Without a backup, the Admin server becomes a single point of failure. Therefore, two (or more for the risk averse) Admin servers are recommended for a production environment. Note that every Admin server can handle all requests via REST endpoints, but only one instance, the "Leader", will actually perform requests that update the runtime state. If the Leader goes down, another available Admin server will assume the role. Leader Election is an example of a common feature for distributed systems provided by the Curator Framework, which sits on top of ZooKeeper.

An HA Spring XD installation also requires that the external servers needed for running Spring XD in distributed mode - ZooKeeper, messaging middleware, and data stores - be configured for HA as well. Please consult the product documentation for specific recommendations regarding each of these external components. Also see Message Bus Configuration for tips on configuring the MessageBus for HA, error handling, etc.

ZooKeeper Overview

In the previous section, we claimed that if a container goes down, Spring XD will redeploy any modules deployed on that instance to another available container. We also claimed that if the Admin Leader goes down, another Admin server will assume that role. ZooKeeper is what makes this all possible. ZooKeeper is a widely used Apache project designed primarily for distributed system management and coordination. This section will cover some basic concepts necessary to understand its role in Spring XD. See The ZooKeeper Wiki for a more complete overview.

ZooKeeper is based on a simple hierarchical data structure, formally a tree, and conceptually and semantically similar to a file directory structure. As such, data is stored in nodes. A node is referenced via a path, for example, /xd/streams/mystream. Each node can store additional data, serialized as a byte array. In Spring XD, all data is a java.util.Map serialized as JSON. The following figure shows the Spring XD schema:
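The figure itself is not reproduced here; based on the paths discussed in this section, a simplified sketch of the top-level nodes is:

```
/xd
├── admins
├── containers
├── streams
├── jobs
└── deployments
    └── modules
        ├── requested
        └── allocated
```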

A ZooKeeper node is either ephemeral or persistent. An ephemeral node exists only as long as the session that created it remains active. A persistent node is, well, persistent. Ephemeral nodes are appropriate for registering Container instances. When a Spring XD container starts up, it creates an ephemeral node, /xd/containers/<container-id>, using an internally generated container id. When the container’s session is closed (for example, because of a connection loss or because the container process terminates), its node is removed. The ephemeral container node also holds metadata such as its hostname and IP address, runtime metrics, and user defined container attributes. Persistent nodes maintain state needed for normal operation and recovery. This includes data such as stream definitions, job definitions, deployment manifests, module deployments, and deployment state for streams and jobs.

Obviously ZooKeeper is a critical piece of the Spring XD runtime and must itself be HA. To that end, ZooKeeper supports a distributed architecture called an ensemble. The details are beyond the scope of this document, but for the sake of this discussion it is worth mentioning that there should be at least three ZooKeeper server instances running (an odd number is always recommended) on dedicated hosts. The Container and Admin nodes are clients to the ZooKeeper ensemble and must connect to ZooKeeper at startup. Spring XD components are configured with a zk.client.connect property which may designate a single <host>:<port> or a comma-separated list. The ZooKeeper client will attempt to connect to each server in order until it succeeds. If it is unable to connect, it will keep trying. If a connection is lost, the ZooKeeper client will attempt to reconnect to one of the servers. The ZooKeeper cluster guarantees consistent replication of data across the ensemble. Specifically, ZooKeeper guarantees:

Sequential Consistency - Updates from a client will be applied in the order that they were sent.

Atomicity - Updates either succeed or fail. No partial results.

Single System Image - A client will see the same view of the service regardless of the server that it connects to.

Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.

Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

ZooKeeper maintains data primarily in memory backed by a disk cache. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

In addition to performing basic CRUD operations on nodes, a ZooKeeper client can register a callback on a node to respond to any events or state changes to that node or any of its children. Such node operations and callbacks are the mechanism that controls the Spring XD runtime.

The Admin Server Internals

Assuming more than one Admin instance is running, each instance requests leadership at startup. If there is already a designated leader, the instance will watch the /xd/admins node to be notified if the Leader goes away. The Leader is designated using the Leader Selector recipe provided by Curator, a ZooKeeper client library that implements some common patterns. Curator also provides some Listener callback interfaces that the client can register on a node. The AdminServer creates the top level nodes, depicted in the figure above:

/xd/admins - children are ephemeral nodes for each available Admin instance and used for Leader Selector

The admin leader creates a DeploymentSupervisor which registers listeners on /xd/deployments/modules/requested to handle module deployment requests related to stream and job deployments, and on /xd/containers to be notified when containers are added to and removed from the cluster. Note that any Admin instance can handle user requests. For example, if you enter the following commands via the XD shell,

xd>stream create ticktock --definition "time | log"

This command will invoke a REST service on its connected Admin instance to create a new node, /xd/streams/ticktock.

xd>stream deploy ticktock

Assuming the deployment is successful, this will result in the creation of several nodes used to manage deployed resources, for example, /xd/deployments/streams/ticktock. The details are discussed in the example below.

If the Admin instance connected to the shell is not the Leader, it will perform no further action. The Leader’s DeploymentSupervisor will attempt to deploy each module in the stream definition, in accordance with the deployment manifest, to an available container, and update the runtime state.

Example

Let’s walk through the simple example above. If you don’t have a Spring XD cluster set up, this example can be easily executed running Spring XD in a single node configuration. The single node application includes an embedded ZooKeeper server by default and allocates a random unused port. The embedded ZooKeeper connect string is reported in the console log for the single node application:

For our purposes, we will use the ZooKeeper CLI tool to inspect the contents of ZooKeeper nodes reflecting the current state of Spring XD. First, we need to know the port to connect the CLI tool to the embedded server. For convenience, we will assign the ZooKeeper port (5555 in this example) when starting the single node application. From the XD install directory:
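Once the single node application is running with the embedded ZooKeeper server listening on port 5555, the standard ZooKeeper CLI that ships with any ZooKeeper distribution can connect to it and list the Spring XD nodes, for example:

```
$ zkCli.sh -server localhost:5555
[zk: localhost:5555(CONNECTED) 0] ls /xd
```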

The above reflects the initial state of Spring XD with a running admin and container instance. Nothing is deployed yet and there are no existing stream or job definitions. Note that /xd/deployments/modules/allocated has a persistent child corresponding to the id of the container at /xd/containers. If you are running in a distributed configuration and connected to one of the ZooKeeper servers in the same ensemble that Spring XD is connected to, you might see multiple nodes under /xd/containers and /xd/admins. Because the external ensemble persists the state of the Spring XD cluster, you will also see any deployments that existed when the Spring XD cluster was shut down.

Note the deployment state shown for the stream’s status node is deployed, meaning the deployment request was satisfied. Deployment states are discussed in more detail here.

Spring XD decomposes stream deployment requests to individual module deployment requests. Hence, we see that each module in the stream is associated with a container instance. The container instance in this case is the same since there is only one instance in the single node configuration. In a distributed configuration with more than one instance, the stream source and sink will each be deployed to a separate container. The node name itself is of the form <module_type>.<module_name>.<module_sequence_number>.<container_id>, where the sequence number identifies a deployed instance of a module if multiple instances of that module are requested.

Module Deployment

This section describes how the Spring XD runtime manages deployment internally. For more details on how to deploy streams and jobs see Deployment.

To process a stream deployment request, the StreamDeploymentListener invokes its ContainerMatcher to select a container instance for each module and records the module’s deployment properties under /xd/deployments/modules/requested/. If a match is found, the StreamDeploymentListener creates a node for the module under /xd/deployments/modules/allocated/<container_id>. The Container includes a DeploymentListener that monitors the container node for new modules to deploy. If the deployment is successful, the Container writes the ephemeral nodes status and metadata under the new module node.

When a container departs, the ephemeral nodes are deleted so its modules are now undeployed. The ContainerListener responds to the deleted nodes and attempts to redeploy any affected modules to another instance.

Example: Automatic Redeployment

For this example we start two container instances and deploy a simple stream:
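The shell commands themselves are not shown above; judging from the log output below, the stream is the familiar ticktock example:

```
xd:> stream create ticktock --definition "time | log" --deploy
```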

Now if we kill the remaining container, we see warnings in the xd-admin log:

14:36:07,593 WARN DeploymentSupervisorCacheListener-0 server.DepartingContainerModuleRedeployer - No containers available for redeployment of log for stream ticktock
14:36:07,599 WARN DeploymentSupervisorCacheListener-0 server.DepartingContainerModuleRedeployer - No containers available for redeployment of time for stream ticktock

Batch Jobs

Introduction

One of the features that XD offers is the ability to launch and monitor batch jobs based on Spring Batch. The Spring Batch project was started in 2007 as a collaboration between SpringSource and Accenture to provide a comprehensive framework to support the development of robust batch applications. Batch jobs have their own set of best practices and domain concepts, which have been incorporated into Spring Batch, building upon Accenture’s consulting experience. Since then Spring Batch has been used in thousands of enterprise applications and is the basis for the recent JSR standardization of batch processing, JSR-352.

Spring XD builds upon Spring Batch to simplify creating batch workflow solutions that span traditional use-cases such as moving data between flat files and relational databases as well as Hadoop use-cases where analysis logic is broken up into several steps that run on a Hadoop cluster. Steps specific to Hadoop in a workflow can be MapReduce jobs, executing Hive/Pig scripts or HDFS operations. Spring XD ships with a number of useful job modules out of the box. This section presents an overview for creating, deploying, and running batch jobs.

Workflow

The concept of a workflow translates to a Job, not to be confused with a MapReduce job. A Job is a directed graph; each node of the graph is a processing Step. Steps can be executed sequentially or in parallel, depending on the configuration. Jobs can be started, stopped, and restarted. Restarting jobs is possible since the progress of executed steps in a Job is persisted in a database via a JobRepository. The following figure shows the basic components of a workflow.

A Job that has steps specific to Hadoop is shown below.

A JobLauncher is responsible for starting a job and is often triggered via a scheduler. Other options to launch a job are through Spring XD’s RESTful administration API, the XD web application, or in response to an external event from an XD stream definition, e.g. file polling using the file source.

Features

Spring XD allows you to create and launch jobs. The launching of a job can be triggered using a cron expression or in reaction to data on a stream. When jobs are executing, they are also a source of event data that can be subscribed to by a stream. There are several types of events sent during a job’s execution, the most common being the status of the job and the steps taken within the job. This bi-directional communication between stream processing and batch processing allows for more complex chains of processing to be developed.

As a starting point, jobs for the following use-cases are provided out of the box:

Poll a Directory and import CSV files to HDFS

Import CSV files to JDBC

HDFS to JDBC Export

JDBC to HDFS Import

HDFS to MongoDB Export

These are described in the section below.

The purpose of this section is to show you how to create, schedule and monitor a job.

The Lifecycle of a Job in Spring XD

Before we dive deeper into the details of creating batch jobs with Spring XD, we need to understand the typical lifecycle for batch jobs in the context of Spring XD:

Register a Job Module

Create a Job Definition

Deploy a Job

Launch a Job

Job Execution

Un-deploy a Job

Destroy a Job Definition

Register a Job Module

Register a Job Module with the Module Registry using the XD Shell module upload command. See registering a module for details.

Create a Job Definition

Create a Job Definition from a Job Module by providing a definition name as well as properties that apply to all Job Instances. At this point the job is not yet deployed.

Deploy the Job

Deploy the Job Definition to one or more Spring XD containers. This will initialize the Job Definitions on those containers. The jobs are now "live" and a job can be created by sending a message to a job queue that contains optional runtime Job Parameters.

Launch a Job

Launch a job by sending a message to the job queue with Job Parameters. A Job Instance is created, representing a specific run of the job. A Job Instance is the Job Definition plus the runtime Job Parameters. You can query for the Job Instances associated with a given job name.

Job Execution

The job is executed creating a Job Execution object that captures the success or failure of the job. You can query for Job Executions associated with a given job name.

Un-deploy a Job

This removes the job from the Spring XD container(s), preventing the launching of any new Job Instances. For reporting purposes, you will still be able to view historic Job Executions associated with the job.

Destroy a Job Definition

Destroying a Job Definition will not only un-deploy any still deployed Job Definitions but will also remove the Job Definition itself.

Creating Jobs - Additional Options

When creating jobs, the following options are available to all job definitions:

dateFormat

The optional date format for job parameters (default: yyyy-MM-dd)

numberFormat

Defines the number format when parsing numeric parameters (default: NumberFormat.getInstance(Locale.US))

makeUnique

Whether job parameters should be made unique (default: true)

Also, similar to the stream create command, the job create command has an optional --deploy option to create the job definition and deploy it. The --deploy option is false by default.

Below is an example of some of these options combined:

job create myjob --definition "fooJob --makeUnique=false"

Remember that you can always find out about available options for a job by using the module info command.

Deployment manifest support for job

When deploying a batch job you can provide a deployment manifest. Deployment manifest properties for jobs are the same as for streams; you can declare

The number of job modules to deploy

The criteria expression to use for matching the job to available containers
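As a sketch (the job name fooJob and the group name are illustrative, and the property syntax mirrors the stream deployment manifest), such a deployment manifest might look like:

```
xd:> job deploy --name fooJob --properties "module.fooJob.count=3,module.fooJob.criteria=groups.contains('hdfs-containers-group')"
```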

The above deployment manifest would deploy three fooJob modules into containers whose group name matches "hdfs-containers-group".

When a batch job is launched/scheduled, the job module that picks up the job launching request message executes the batch job. To support partitioning of the job across multiple containers, the job definition needs to define how the job will be partitioned. The type of partitioning depends on the type of the job; for example, a job reading from JDBC would partition the data in a table by dividing up the number of rows, and a job reading files from a directory would partition on the number of files available.

The FTP to HDFS and FILE to JDBC jobs support partitioning. To add partitioning support to your own jobs, you should import singlestep-partition-support.xml in your job definition. This provides the infrastructure so that the job module that processes the launch request can communicate as the master with the other job modules that have been deployed. You will also need to provide an implementation of the Partitioner interface.

Launching a job

XD uses triggers as well as regular event flow to launch batch jobs. In this section we will cover how to:

Launch the Batch Job Ad-hoc

Launch the Batch Job using a named Cron-Trigger

Launch the Batch Job as a sink

Ad-hoc

To launch a job one time, use the launch option of the job command. So going back to our example above, we’ve created a job module instance named helloSpringXD. Launching that Job Module Instance would look like:
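A sketch of the launch command for that job would be:

```
xd:> job launch helloSpringXD
```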

To pause/stop future scheduled jobs from running for this stream, the stream must be undeployed, for example:

xd:> stream undeploy --name cronStream

Launch the Batch using a Fixed-Delay-Trigger

A fixed-delay-trigger is used to launch a Job on a regular interval. Using the --fixedDelay parameter you can set the number of seconds between executions. In the example below we are running myXDJob every 10 seconds and passing it a payload containing a single attribute.
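A sketch of such a stream (the payload attribute name and value are illustrative), using the trigger source and a job-queue named channel as the sink:

```
xd:> stream create --name fdStream --definition "trigger --fixedDelay=10 --payload={\"param1\":\"myValue\"} > queue:job:myXDJob" --deploy
```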

To pause/stop future scheduled jobs from running for this stream, you must undeploy the stream, for example:

xd:> stream undeploy --name fdStream

Launch job as a part of event flow

A batch job always acts as a sink; as such, it can receive messages from sources (other than triggers) and processors. In the case below, the user has created an http source (the http source receives http posts and passes the payload of the http message to the next module in the stream) that will pass the http payload to the "myHttpJob".
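A sketch of that stream definition (job and stream names are illustrative):

```
xd:> stream create --name jobStream --definition "http > queue:job:myHttpJob" --deploy
```

Posting data to the http source's port would then launch the job with the posted payload as its parameters.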

Retrieve job notifications

Spring XD offers facilities to capture the notifications that are sent from the job as it is executing.
When a batch job is deployed, by default it registers the following listeners along with pub/sub channels that these listeners send messages to.

Job Execution Listener

Chunk Listener

Item Listener

Step Execution Listener

Skip Listener

Along with the pub/sub channels for each of these listeners, there will also be a pub/sub channel that the aggregated events from all these listeners are published to.

In the following example, we set up a Batch Job called myHttpJob. Afterwards we create a stream that taps into the pub/sub channels that were implicitly generated when the myHttpJob job was deployed.

To receive aggregated events
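A sketch of such a tap stream (assuming the aggregated events channel can be tapped by the job name; the stream name is illustrative):

```
xd:> stream create --name aggregatedEvents --definition "tap:job:myHttpJob > log" --deploy
```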

The stream receives aggregated event messages from all the default batch job listeners and sends those messages to the log.

Removing Batch Jobs
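A batch job can be removed entirely with the job destroy command, for example using the helloSpringXD job from earlier:

```
xd:> job destroy helloSpringXD
```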

Alternatively, one can just undeploy the job, keeping its definition for a future redeployment:

xd:> job undeploy helloSpringXD

Streams

Introduction

In Spring XD, a basic stream defines the ingestion of event driven data from a source to a sink that passes through any number of processors. Stream processing is performed inside the XD Containers and the deployment of stream definitions to containers is done via the XD Admin Server. The Getting Started section shows you how to start these servers and how to start and use the Spring XD shell.

Sources, sinks and processors are predefined configurations of a module. Module definitions are found in the xd/modules directory. Module definitions are standard Spring configuration files that use existing Spring classes, such as Input/Output adapters and Transformers from Spring Integration that support general Enterprise Integration Patterns.

A high level DSL is used to create stream definitions. The DSL to define a stream that has an http source and a file sink (with no processors) is shown below

http | file

The DSL mimics a UNIX pipes and filters syntax. Default values for ports and filenames are used in this example but can be overridden using -- options, such as

http --port=8091 | file --dir=/tmp/httpdata/

To create these stream definitions you make an HTTP POST request to the XD Admin Server. More details can be found in the sections below.

Creating a Simple Stream

The XD Admin server exposes a full RESTful API for managing the lifecycle of stream definitions, but the easiest way is to use the XD shell. Start the shell as described in the Getting Started section.

New streams are created by posting stream definitions. The definitions are built from a simple DSL. For example, let’s walk through what happens if we execute the following shell command:

xd:> stream create --definition "time | log" --name ticktock

This defines a stream named ticktock based off the DSL expression time | log. The DSL uses the "pipe" symbol |, to connect a source to a sink.

Then to deploy the stream execute the following shell command (or alternatively add the --deploy flag when creating the stream so that this step is not needed):

xd:> stream deploy --name ticktock

The stream server finds the time and log definitions in the modules directory and uses them to set up the stream. In this simple example, the time source simply sends the current time as a message each second, and the log sink outputs it using the logging framework.

If you would like to have multiple instances of a module in the stream, you can include a property with the deploy command:

xd:> stream deploy --name ticktock --properties "module.time.count=3"

You can also include a SpEL Expression as a criteria property for any module. That will be evaluated against the attributes of each currently available Container. Instances of the module will only be deployed to Containers for which the expression evaluates to true.

Other Source and Sink Types

Let’s try something a bit more complicated and swap out the time source for something else. Another supported source type is http, which accepts data for ingestion over HTTP POSTs. Note that the http source accepts data on a different port (default 9000) from the Admin Server (default 8080).

To create a stream using an http source, but still using the same log sink, we would change the original command above to
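A sketch of the modified definition, along with posting sample data via the shell's http post command (the stream name and target port are illustrative, with 9000 being the http source default):

```
xd:> stream create --definition "http | log" --name myhttpstream --deploy
xd:> http post --target http://localhost:9000 --data "hello"
```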

Of course, we could also change the sink implementation. You could pipe the output to a file (file), to hadoop (hdfs) or to any of the other sink modules which are provided. You can also define your own modules.

Simple Stream Processing

As an example of a simple processing step, we can transform the payload of the HTTP posted data to upper case using the stream definitions
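A sketch of such a stream, using the transform processor with a SpEL expression (the stream name is illustrative):

```
xd:> stream create --definition "http | transform --expression=payload.toUpperCase() | log" --name myprocstream --deploy
```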

DSL Syntax

In the examples above, we connected a source to a sink using the pipe symbol |. You can also pass parameters to the source and sink configurations. The parameter names will depend on the individual module implementations, but as an example, the http source module exposes a port setting which allows you to change the data ingestion port from the default value. To create the stream using port 8000, we would use
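A sketch of that definition (the stream name is illustrative):

```
xd:> stream create --definition "http --port=8000 | log" --name myhttpstream --deploy
```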

If you know a bit about Spring configuration files, you can inspect the module definition to see which properties it exposes. Alternatively, you can read more in the source and sink documentation.

Advanced Features

In the examples above, simple module definitions are used to construct each stream. However, modules may be grouped together in order to avoid duplication and/or reduce the amount of chattiness over the messaging middleware. To learn more about that feature, refer to the Composing Modules section.

If directed graphs are needed instead of the simple linear streams described above, two features are relevant. First, named channels may be used as a way to combine multiple flows upstream and/or downstream from the channel. The behavior of that channel may either be queue-based or topic-based depending on what prefix is used ("queue:myqueue" or "topic:mytopic", respectively). To learn more, refer to the Named Channels section. Second, you may need to determine the output channel of a stream based on some information that is only known at runtime. To learn about such content-based routing, refer to the Dynamic Router section.

Module Labels

When a stream is comprised of multiple modules with the same name, they must be qualified with labels. See Labels.

Modules

Introduction

Spring XD supports data ingestion by allowing users to define streams. Streams are composed of modules which encapsulate a unit of work into a reusable component. A job in Spring XD must also be implemented as a module.

Modules are categorized by type, typically representing the role or function of the module. Current Spring XD module types include source, sink, processor, and job. The type determines how the modules may be composed in a stream, or used to deploy a batch job. More precisely:

A source polls an external resource, or is triggered by an event and only provides output. The first module in a stream must be a source.

A processor performs some type of task, consuming a message as input and producing a new message, so it requires both input and output.

A sink consumes input messages and outputs data to an external resource to terminate the stream.

A job module implements a Spring Batch job enabled for Spring XD.

Available Modules

Spring XD ships with a number of pre-packaged modules ready to use for running batch jobs or assembling streams to perform common stream processing tasks using files, HDFS, Spark, Kafka, http, twitter, syslog, GemFire, and more. You can easily assemble these modules into streams to build complex big data applications declaratively, without having to write Java code or know the underlying Spring products on which Spring XD is built. You can use these modules out of the box or as a basis for building your own custom modules.

Modules Included with Spring XD

The following pages provide a detailed description, along with configuration options, and examples for all modules included in the Spring XD Distribution:

In addition to the standard modules included with the Spring XD distribution, you will find community-contributed modules in the spring-xd-modules GitHub repository.

If you are looking for a technical deep dive or want to develop your own modules, the sections below contain the relevant details. If you are interested in developing your own modules, some knowledge of Spring, Spring Integration or Spring Batch is essential. The remainder of this document assumes the reader has some familiarity with these topics.

Stream Modules

Sources, processors, and sinks are built using Spring Integration and typically perform a single task so that they may be easily reused in streams. Alternatively, a custom module may be required to perform a specific function, such as integration with a legacy service. In Spring Integration terms:

A source is a valid message flow that contains a direct channel named output which is fed by an inbound adapter, either configured with a poller, or triggered by an event.

A processor is a valid message flow that contains a direct channel named input and a subscribable channel named output (direct or publish subscribe). It typically performs some type of transformation on the message, using its input channel’s message to create a new message on its output channel.

A sink is a valid message flow that contains a direct channel named input and an outbound adapter, or service activator used to provide the message to an external resource, HDFS for example.

For example, take a look at the file source which simply polls a directory using a file inbound adapter and file sink which appends an incoming message payload to a file using a file outbound adapter. On the surface, there is nothing special about these components. They are plain old Spring XML bean definition files.

Notice that modules adhere to an important convention: The input and output channels are always named input and output, in keeping with the KISS principle (let us know if you come up with some simpler names). The Spring XD runtime uses these names to bind these channels to the message transport.

Module Packaging

A module is a packaged component containing artifacts used to create a Spring application context. In general, a module is not aware of its runtime environment. Each module’s application context is configured and connected to other modules via Plugins in order to support distributed processing. In this respect, modules may potentially be applied to purposes other than stream processing. The module types described here (source, processor, sink, and job) are specific to Spring XD, but the Module type is designed to act as a core component of any micro-service architecture built with Spring.

Physically, a Module is somewhat analogous to a war file in a Servlet container. The Spring XD container configures and starts a module when it is deployed. Deploying a module in Spring XD terms means activating an instance for processing, not to be confused with deploying a web application in a Servlet container. Here we use the terms install or register to refer to uploading the module jar to make it available to the Spring XD runtime. Consistent with the war analogy, a module uses a separate class loader to load its resources, notably the files found in its config and lib directories. Another feature in common with a war file is that web applications are installed in a configured location and must conform to a standard layout. Artifacts are installed in a known location, either in expanded form or as a single archive. Spring XD modules work the same way. The module’s layout has evolved significantly as new features have been added to support custom module development. This evolution has generally led to increased flexibility with respect to individual artifacts. However, the module’s packaging structure is well defined:

For historical reasons, all modules included with the Spring XD distribution are provided in expanded form and are commonly configured using XML bean definition files (<module-name>.xml) and property files (<module-name>.properties). This is subject to change as this convention is no longer required. Meanwhile the out-of-the-box modules provide copious examples of module configuration and packaging.

A module’s contents typically includes:

Application context configuration: If either config/<any_name>.xml or config/<any_name>.groovy are present, it will be used as the source for the module’s application context. At most one of these may be present. If using an @Configuration class, neither of these files should be present.

Module properties file: If the module declares options (e.g. property placeholders whose values must be supplied for each instance when creating a stream), the properties file config/<any_name>.properties may contain an options_class property referencing the fully qualified class name of a Module Options Metadata class. Alternately the properties file may provide in-line Module Option descriptors (see Module Options below). If using @Configuration, the properties file must include a base_packages property containing a comma delimited list of package names to enable Spring component scanning scoped to the module. Note that base_packages will be ignored if a configuration resource (config/*.xml or config/*.groovy) is present.

Note

As of Spring XD 1.1, the names of the module’s bean definition resource (xml or groovy) and properties file are arbitrary. This provides additional flexibility over requiring a conventional file name, as was the case in prior releases. Currently, the required top level config directory is the convention. This carries the constraint that no other matching file types may be present in config. Multiple xml, groovy, or properties files matching the pattern, for example, config/*.xml, will result in an exception. If you want to combine bean definitions from multiple resources, you may use import declarations; the imported resources must be somewhere else in the module’s class path, such as a subdirectory of config or any other arbitrary location.

Custom code:
Any root level .class files packaged as in a typical jar file. This may include an @Configuration class, a Module Options Metadata class, and any dependent classes required by the module.

Dependent jar files:
Any required runtime dependencies that are not provided by the Spring XD runtime (in $XD_INSTALL_DIR/xd/lib) are loaded from the module’s /lib directory.

As mentioned previously, a Spring XD module can be installed as an expanded directory tree or an archive. If the module requires dependent jars, which is the typical case, it may be packaged as an uber jar compatible with the Spring Boot layout, and conforming to the above structure. The next section describes Spring XD’s support for module packaging and development.

Creating a Module Project

Spring XD (1.1.x or later) provides build tools for creating a module project to test and package the module either with Maven or Gradle. As described in the above sections, the module jar must export any dependencies that are not provided by the Spring XD container. The build tools address these concerns, packaging your module as an uber jar by wrapping the Spring Boot Maven Plugin or the Spring Boot Gradle Plugin, respectively. The plugins are configured with the MODULE layout for Spring Boot packaging. This does not build an executable jar file, as is normally done, and ensures provided dependencies will not be included in the uber jar.

In addition, the build tools provide Spring XD dependencies necessary to compile and test the module. Specifically, spring-xd-dirt and spring-xd-test provide some useful features for module development. As you would expect, the Spring XD versions match the specified parent pom or plugin version. These provide support for:

In-container testing - You can start an embedded single node container in a test class, create a stream designed to test your module using a Spring XD test framework, deploy the stream, and validate the results.

Note

If your module has no additional dependencies, a plain old jar file conforming to the module layout shown above will work. In this case, you may still benefit from using the build tools to simplify development and testing.

Module dependency management

Normally a module should export only the dependencies not provided by the Spring XD runtime. Runtime dependencies provided by the module are loaded using a separate module class loader when the module is deployed. This can potentially cause class version conflicts. Spring XD build tools are designed to prevent this and allow you to override the default exclusion rules if necessary. Generally we don’t recommend this unless you have a specific requirement for an alternate version. If your module introduces version conflicts, you will see errors such as NoClassDefFoundError or NoSuchMethodError when you deploy the module. If you encounter such errors, you should manually check the contents of $XD_INSTALL_DIR/xd/lib against the contents of the module jar, or use the dependency analysis tools provided by Maven or Gradle, and make the necessary changes to your build script.

Porting to another Spring XD version

A module project’s build script is configured for a specific Spring XD version. With each new release of Spring XD, its runtime dependencies are subject to change and this directly affects which dependencies will be exported to the module jar. Deploying an existing module to a different Spring XD runtime version may result in version conflicts or unsatisfied dependencies. For this reason, we highly recommend that you rebuild any custom modules to match your target runtime environment. In many cases, this is simply a matter of updating the spring XD version in your build.

Maven requires the parent pom version to be hard coded. Hence, you must edit the build script to target an alternate version of
Spring XD.

To build the module:

$mvn clean package

The parent adds many of the transitive dependencies of spring-xd-dirt (provided) and spring-xd-test (test). Some transitive dependencies are excluded, such as Hadoop. An easy way to determine which dependencies are included is to run a maven dependency goal, e.g.:

$mvn dependency:list

Provided dependencies need not be declared as module dependencies; in any case, they will be excluded from the module jar by default.

If you must provide an alternate version of an existing Spring XD dependency, configure the Boot plugin explicitly in your pom to override the default, for example:

Building with Gradle

Start by creating a gradle.properties file defining springXdVersion.

springXdVersion = [the Spring XD version]

This property is required by the spring-xd-module plugin to resolve Spring XD dependencies, and to configure dependent libraries that your module project will need. This property should be used to reference the Spring XD version where needed.

Note

Defining this value in gradle.properties helps with dependency-management when porting to other Spring XD versions. We also recommend adding any additional version references here.

The plugin adds many of the transitive dependencies of spring-xd-dirt (provided) and spring-xd-test (test). Some transitive dependencies are excluded, such as Hadoop. An easy way to determine which dependencies are included is to run one of Gradle’s dependency tools, e.g.:

$./gradlew dependencies

To build the module:

$./gradlew clean build

This configuration allows you to override springXdVersion on the command line:

$./gradlew clean build -PspringXdVersion=1.3.2.RELEASE

Note

Overriding the property on the Gradle command line does not work if the springXdVersion is hard coded in build.gradle itself, e.g., in an ext closure.

If you must provide an alternate version of an existing Spring XD dependency, override the exported configuration and the configureModule task in build.gradle, for example:

Registering a Module

A module must be registered in the Spring XD Module Registry before it may be deployed as part of a stream or job. Once you have packaged your module, following the instructions in the above section, you can register it using the Spring XD Shell module upload command:
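As a sketch (the jar path, module name, and type are illustrative):

```
xd:> module upload --file /path/to/mymodule-1.0.jar --name mymodule --type processor
```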

All modules included with Spring XD out-of-the-box are located in the xd/modules directory where Spring XD is installed. The Module Registry organizes modules by type in corresponding sub-directories, so a directory listing will look something like:

modules
├── job
├── processor
├── sink
└── source

Spring XD provides a strategy interface ModuleRegistry used to locate a module of a given name and type. Currently Spring XD implements a ResourceModuleRegistry which is configured to locate modules in the following locations in this order:

The file path given by xd.module.home (${xd.home}/modules by default)

classpath:/modules/ (Spring XD does not provide any module definitions here)

The file path given by xd.customModule.home (${xd.home}/custom-modules by default)

Custom Module Registry

Custom modules are located separately from out-of-the-box modules. The location is given by xd.customModule.home in servers.yml. The location defaults to ${xd.home}/custom-modules but we strongly recommend setting this to an external location on a network file system or using the replicating registry if you are using custom modules in production. There are two reasons for doing this. First, custom modules must be accessible to all nodes on the Spring XD cluster, including the XD Admin node. This allows any container instance to deploy the module. Second, if custom modules are registered within the Spring XD installation, they will not survive an upgrade to the Spring XD distribution and will need to be reinstalled.
By default Spring XD expects MD5 hash files to be present next to the custom module jar. This is done to ensure the module upload has completed successfully before the module is used. Hash files are created automatically when installing modules via the module upload command. If you wish to disable this requirement, set xd.customModule.requiresHashFiles to false in servers.yml. This will allow you to manually copy module jars to a file based custom module registry or reuse an existing custom module registry that may not include hash files. This setting does not apply to hdfs registries.

Note

An alternative way of specifying the location of custom modules (instead of servers.yml) is the environment variable XD_CUSTOMMODULE_HOME, which must point to the custom modules location.

In cases where you want to start e.g. a single-node runtime with a custom module location, you can also define the environment variable right before the executable, like this: XD_CUSTOMMODULE_HOME=file\:/path/to/custom-modules bin/xd-singlenode

If you manually deploy your custom modules to XD_CUSTOMMODULE_HOME, since Spring XD 1.2.x you will need to calculate an MD5 hash (md5 -q mymodule.jar > mymodule.jar.md5) and place it next to your module. Otherwise the module will not be loaded or displayed by the Spring XD shell command module list.
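As a sketch of generating the hash file on a Linux host (mymodule.jar is a placeholder name; the stand-in jar is created here only for demonstration — on OS X, md5 -q replaces the md5sum pipeline):

```shell
# Create a stand-in module jar for demonstration (placeholder content).
printf 'abc' > mymodule.jar
# Generate the MD5 hash file Spring XD expects next to the module jar.
# md5sum prints "<hash>  <filename>"; keep only the hash.
md5sum mymodule.jar | awk '{print $1}' > mymodule.jar.md5
cat mymodule.jar.md5
```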

Replicating Module Registry

When running in distributed mode, an alternative to using a shared file system for custom modules is to use the replicating module registry.

If the value of xd.customModule.home does not use the file: protocol, then Spring XD will automatically set up a replicating registry that proxies that remote registry to the local filesystem. This is all done transparently and by default, files are copied down from the central repository only if their contents have changed.

At the time of writing, only the hdfs: protocol is supported. Setting this up is straightforward:

xd:
  customModule:
    home: hdfs://somehost/root/path/of/registry

Files will be replicated on the local filesystem in a temporary directory, on demand and loaded from there. The XD Admin process will need to have write access to that shared HDFS directory. Intermediary paths (/root/path/of/registry in the example above) are created at startup if they don’t exist yet.

Module Class Loading

Modules use a separate class loader that will first load classes from jars in the module’s /lib (and any class files located in the module’s root path). If not found, the class will be loaded from the parent ClassLoader that Spring XD normally uses (which includes everything under $XD_HOME/lib). Still, there are a couple of caveats to be aware of:

Avoid putting into the module’s lib/ directory any jar files that are already in Spring XD’s class path or you may end up with ClassCastExceptions or other class loading issues.

When not using local transport, any class that is directly or indirectly referenced from the payload type of your messages (i.e. any type in transit from module to module) must be referenced by both the producing and consuming modules and thus should be installed into xd/lib.

Occasionally, a class’s dependencies are not resolved correctly even though all the required jars appear to be on the module classpath. Consider a scenario in which class A depends on class B, and B depends on class C. If A and C are visible to the module class loader but only B is visible to the parent class loader, then you will get a NoClassDefFoundError for class C if it has not already been loaded, because the parent class loader cannot resolve C. Unfortunately, an automated strategy to resolve this situation is difficult. A workaround is to install the jar containing class C into xd/lib.

Dynamic Module ClassLoader

Starting with Spring XD 1.2, a module can selectively add libraries from paths that are derived from module options. The aim is to support alternate implementations in the same module. This works like the following:

In the module .properties file, specify a value for the module.classloader key. The default is /lib/*.jar,/lib/*.zip, which is consistent with what has been exposed earlier.

The value for that key is a comma separated list of paths (possibly using Ant-style patterns) that will be searched for additional libraries to add to the module ClassLoader (in addition to the module "Archive" itself, which is always included).

paths that start with a / (as /lib/*.jar in the example above) are considered internal resources to the archives (e.g. nested jars in the über-jar)

paths that do not start with a / (and in particular paths that start with a protocol, such as file:) are loaded with a regular Spring resource pattern resolver

Those paths can contain placeholders of the form ${foo}. Those will be resolved against the visible module options (and other inherited properties). Paths containing unresolvable placeholders are silently ignored.

This allows constructions like this (assuming for example that we want to create a jpa module that supports several JPA providers):
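A sketch of such a module.properties entry (assuming the module declares a provider option; the jpa directory layout is hypothetical):

```
module.classloader = /lib/*.jar, /lib/${provider}/*.jar, ${xd.home}/lib/jpa/${provider}/*.jar
```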

Where the metadata class includes a provider option (of type String) that will take e.g. the values hibernate or eclipse-link. Note the presence of a third ${xd.home}/lib/jpa/${provider}/*.jar entry that can be used for unforeseen provider implementations.

Module Options

Each module instance is configured using property placeholders which are bound to the module’s options defined via Module Options Metadata. Options may be required or optional, where optional properties must provide a default value. Module Options Metadata may be provided within the module’s properties file or in a Java class provided by the module or one of its dependencies. In addition to binding module options to properties in the module’s application context, options may also be used to activate Spring environment profiles.

For example, here is part of the Spring configuration for the twittersearch source that runs a query against Twitter:

Note the Spring properties such as query, language, consumerKey and consumerSecret. Spring XD will bind the values provided as options for each module instance to these properties. The options exposed for this module are defined in TwitterSearchOptionsMetadata.java

For example, we can create two different streams, each using the twittersearch source providing different option values.

In addition to options, modules may reference Spring beans such that each module instance may inject a different implementation of a bean. The ability to deploy the same module definition with different configurations is only possible because each module is created in its own application context. This results in some very useful features, such as the ability to use standard bean ids such as input and output and simple property names without having to worry about naming collisions.

Observe the use of property placeholders with sensible defaults where possible in the above example. Sometimes, a sensible default is derived from the stream name, module name, or some other runtime context. For example, the file source requires a directory. An appropriate strategy is to define a common root path for XD input files (At the time of this writing it is /tmp/xd/input/. This is subject to change, but illustrates the point). A stream definition using the file source may specify the directory name by providing a value for the dir option. If not provided, it will default to the stream name, which is contained in the xd.stream.name property bound to the module by the Spring XD runtime, see file source metadata. The module info command illustrates this point:

How module options are resolved

As we’ve seen so far, a module is a re-usable Spring Integration or Spring Batch application context that can be dynamically configured through the use of module options.

A module option is any value that may be configured within a stream or job definition. Preferably, the module provides metadata to describe the available options. This section explains how default values are computed for each module option.

In a nutshell, actual values are resolved from the following sources, in order of precedence:

a key named <optionname> in the properties file <root>/<moduletype>/<modulename>/<modulename>.properties

a key named <moduletype>.<modulename>.<optionname> in the YAML file <root>/<module-config>.yml

where

<root>

is the value of the xd.module.config.location system property (driven by the XD_MODULE_CONFIG_LOCATION env var when using the canonical Spring XD shell scripts). This property defaults to ${xd.config.home}/modules/

<module-config>

is the value of the xd.module.config.name system property (driven by the XD_MODULE_CONFIG_NAME env var). Defaults to xd-module-config

Note that YAML is particularly well suited for hierarchical configuration: rather than repeating a flat <moduletype>.<modulename>.<optionname> key for each option, the module type, module name, and option names can be nested as sections.
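As a sketch (the option value is illustrative), a flat key such as

source.ftp.username=myuser

can be written hierarchically in the YAML file as:

source:
  ftp:
    username: myuser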

Note that options in the .properties files can reference values that appear in the modules.yml file (this makes sharing common configuration easy). Also, the values that are used to configure the server runtimes (in servers.yml) are visible to modules.yml and .properties file (but the inverse is not true).

Composing Modules

As described above, a stream is defined as a sequence of modules, minimally a source module followed by a sink module. Sometimes streams may want to share a common processing chain. For example, consider the following two streams:

Aside from the source, the two stream definitions are the same. Composite Modules provide a way to avoid this type of duplication by allowing the filter processor and file sink to be combined into a single composite module. Perhaps more importantly, composite modules may improve performance. Each module within a stream represents a unit of deployment. Therefore, stream1 and stream2, as defined above, are each comprised of three such units (a source, a processor, and a sink). In a singlenode runtime with local transport, creating a composite module won’t affect performance since the communication between modules in this case already uses in-memory channels. However, when deploying a stream to a distributed runtime environment, the communication between adjacent modules typically occurs via messaging middleware, as modules are, by default, distributed evenly among the available containers. Often a stream will perform better when adjacent modules are co-located and can avoid middleware "hops", and object marshalling. In such cases, composing modules allows the composite module to behave as a single "black box." In other words, if "foo | bar" are composed to create a new module named "baz", the input and/or output to "baz" will still go over the middleware, but foo and bar will be co-located in a single container instance and wired to communicate via local memory.
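A composed sink can be created with the module compose shell command; a hedged reconstruction (the module name and filter expression are illustrative):

xd:> module compose foo --definition "filter --expression=payload.contains('good') | file"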

Notice that the composed module shows up in the list of sink modules. That is because logically it acts as a sink: It provides an input channel (which is bridged to the filter processor’s input channel), but it provides no output channel (since the file sink has no output). Also notice that the module has a small (c) prefixed to it, to indicate that it is a composed module.

If a module were composed of two processors, it would be classified as a processor:
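For instance (a hypothetical sketch; the processor names are only illustrative of modules that both have input and output channels):

xd:> module compose myprocessor --definition "splitter | filter"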

Based on the logical type of the composed module, it may be used in a stream as if it were a simple module instance. For example, to redefine the two streams from the first problem case above, now that the foo sink module has been composed, you can issue the following shell commands:
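A hedged reconstruction of those two stream definitions, with the sources shown purely as examples:

xd:> stream create --name stream1 --definition "http | foo" --deploy
xd:> stream create --name stream2 --definition "time | foo" --deploy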

When you no longer need a composed module, you may delete it with the module delete shell command. However, if that composed module is currently being used in one or more stream definitions, Spring XD will not allow you to delete it until those stream definitions are destroyed. In this case, module delete will fail as shown below:

When creating a module, if you duplicate the name of an existing module for the same type, you will receive an error. In the example below the user tried to compose a tcp module, however one already exists:

However, you can create a module of a given type even though a module of that name exists as a different type. For example, you can create a sink module named filter, even though filter already exists as a processor.

Finally, it’s worth mentioning that in some cases duplication may be avoided by reusing an actual stream rather than a composed module. This is possible when named channels are used in the source and/or sink position of a stream definition. For example, the same overall functionality as provided by the two streams above could also be achieved as follows:
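A hedged sketch of this approach (the stream names, sources, and filter expression are illustrative): the shared processing chain is deployed once as a stream consuming from a named queue channel, and the individual streams publish into that channel.

xd:> stream create --name shared-chain --definition "queue:trunk > filter --expression=payload.contains('good') | file" --deploy
xd:> stream create --name feed1 --definition "http > queue:trunk" --deploy
xd:> stream create --name feed2 --definition "time > queue:trunk" --deploy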

This approach is more appropriate for use-cases where individual streams on either side of the named channel may need to be deployed or undeployed independently. Whereas the queue typed channel will load-balance across multiple downstream consumers, the topic: prefix may be used if broadcast behavior is needed instead. For more information about named channels, refer to the Named Channels section.

Getting Information about Modules

To view the available modules use the module list command. Modules appearing with a (c) marker are composed modules. For example:

Sources

This section describes the source modules included with Spring XD. A source implements a data provider to originate a stream. To run the examples shown here, start the XD Container as instructed in the
Getting Started page.

Future releases will provide support for other currently available Spring Integration Adapters. For information on how to adapt an existing Spring Integration Adapter for use in Spring XD see the section Creating a Source Module.

The following sections show a mix of Spring XD shell and plain Unix shell commands, so if you are trying them out, you should open two separate terminal prompts, one running the XD shell and one to enter the standard commands for sending HTTP data, creating directories, reading files and so on.

File

The file source provides the contents of a File as a byte array by default. However, this can be
customized using the --mode option:

ref Provides a java.io.File reference

lines Will split files line-by-line and emit a new message for each line

contents The default. Provides the contents of a file as a byte array

When using --mode=lines, you can also provide the additional option --withMarkers=true.
If set to true, the underlying FileSplitter will emit additional start-of-file and end-of-file marker messages before and after the actual data.
The payload of these 2 additional marker messages is of type FileSplitter.FileMarker. The option withMarkers defaults to false if not explicitly set.
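For example, a stream that splits files into lines and emits the marker messages might be defined as follows (the stream name is illustrative):

xd:> stream create --name linestest --definition "file --mode=lines --withMarkers=true | log" --deploy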

To log the contents of a file create a stream definition using the XD shell

xd:> stream create --name filetest --definition "file | log" --deploy

The file source by default will look into a directory named after the stream, in this case /tmp/xd/input/filetest

Note the above will log the raw bytes. For text files, it is normally desirable to output the contents as plain text. To do this, set the outputType parameter:
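A sketch of such a definition (the stream name is illustrative):

xd:> stream create --name filetest --definition "file --outputType=text/plain | log" --deploy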

initialDelay

an initial delay when using a fixed delay trigger, expressed in TimeUnits (seconds by default) (int, default: 0)

maxMessages

the maximum messages per poll; -1 for unlimited (long, default: -1)

mode

specifies how the file is being read. By default the content of a file is provided as byte array (FileReadingMode, default: contents, possible values: ref,lines,contents)

pattern

a filter expression (Ant style) to accept only files that match the pattern (String, default: *)

preventDuplicates

whether to prevent the same file from being processed twice (boolean, default: true)

timeUnit

the time unit for the fixed and initial delays (String, default: SECONDS)

withMarkers

if true emits start of file/end of file marker messages before/after the data. Only valid with FileReadingMode 'lines' (Boolean, no default)

The ref option is useful in some cases in which the file contents are large and it would be more efficient to send the file path.

FTP

This source module supports transfer of files using the FTP protocol.
Files are transferred from the remote directory to the local directory where the module is deployed.
Messages emitted by the source are provided as a byte array by default. However, this can be
customized using the --mode option:

ref Provides a java.io.File reference

lines Will split files line-by-line and emit a new message for each line

contents The default. Provides the contents of a file as a byte array

When using --mode=lines, you can also provide the additional option --withMarkers=true.
If set to true, the underlying FileSplitter will emit additional start-of-file and end-of-file marker messages before and after the actual data.
The payload of these 2 additional marker messages is of type FileSplitter.FileMarker. The option withMarkers defaults to false if not explicitly set.
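A hedged example of an ftp stream (host, credentials, and directory are placeholders; remoteDir is assumed to be the option naming the directory to poll):

xd:> stream create --name ftptest --definition "ftp --host=ftp.example.com --username=myuser --password=mypass --remoteDir=/data | log" --deploy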

Options

The ftp source has the following options:

autoCreateLocalDir

local directory must be auto created if it does not exist (boolean, default: true)

timeUnit

the time unit for the fixed and initial delays (String, default: SECONDS)

tmpFileSuffix

extension to use when downloading files (String, default: .tmp)

username

the username for the FTP connection (String, no default)

withMarkers

if true emits start of file/end of file marker messages before/after the data. Only valid with FileReadingMode 'lines' (Boolean, no default)

GemFire Continuous Query (gemfire-cq)

Continuous query allows client applications to create a GemFire query using the Object Query Language (OQL) and register a CQ listener which subscribes to the query and is notified every time the query’s result set changes. The gemfire-cq source registers a CQ which will post CQEvent messages to the stream.

Options

The gemfire-cq source has the following options:

host

host name of the cache server or locator (if useLocator=true). May be a comma delimited list (String, no default)

port

port of the cache server or locator (if useLocator=true). May be a comma delimited list (String, no default)

query

the query string in Object Query Language (OQL) (String, no default)

useLocator

indicates whether a locator is used to access the cache server (boolean, default: false)

The example is similar to that presented for the gemfire source above, and requires an external cache server as described in the above section. In this case the query provides a finer filter on data events. In the example below, the cqtest stream will only receive events matching a single ticker symbol, whereas the gftest stream example above will receive updates to every entry in the region.
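A hedged reconstruction of such a stream (the OQL query and region name follow the sample Stocks configuration; note the doubled single quotes used to escape the symbol value inside the definition string):

xd:> stream create --name cqtest --definition "gemfire-cq --query='Select * from /Stocks where symbol=''VMW''' | file" --deploy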

GemFire Source (gemfire)

This source configures a client cache and client region, along with the necessary subscriptions enabled, in the XD container process along with a Spring Integration GemFire inbound channel adapter, backed by a CacheListener that outputs messages triggered by an external entry event on the region. By default the payload contains the updated entry value, but may be controlled by passing in a SpEL expression that uses the EntryEvent as the evaluation context.

Tip

If native gemfire properties are required to configure the client cache, e.g., for security, place a gemfire.properties file in $XD_HOME/config.

Options

host

host name of the cache server or locator (if useLocator=true). May be a comma delimited list (String, no default)

port

port of the cache server or locator (if useLocator=true). May be a comma delimited list (String, no default)

regionName

the name of the region for which events are to be monitored (String, default: <stream name>)

useLocator

indicates whether a locator is used to access the cache server (boolean, default: false)

Example

Use of the gemfire source requires an external process (or a separate stream) that creates or updates entries in a GemFire region configured for a cache server. Such events may feed a Spring XD stream. To support such a stream, the Spring XD container must join a GemFire distributed client-server grid as a client, creating a client region corresponding to an existing region on a cache server. The client region registers a cache listener via the Spring Integration GemFire inbound channel adapter. The client region and pool are configured for a subscription on all keys in the region.

The following example creates two streams: One to write http messages to a Gemfire region named Stocks, and another to listen for cache events and record the updates to a file. This works with the Cache Server and sample configuration included with the Spring XD distribution:
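A hedged reconstruction of the two streams (the gemfire-json-server sink, port, and keyExpression are assumptions based on the sample Cache Server configuration):

xd:> stream create --name stocks --definition "http --port=9090 | gemfire-json-server --regionName=Stocks --keyExpression=payload.getField('symbol')" --deploy
xd:> stream create --name gftest --definition "gemfire --regionName=Stocks | file" --deploy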

The useLocator option is intended for integration with an existing GemFire installation in which the cache servers are configured to use locators in accordance with best practice. GemFire supports configuration of multiple locators (or direct server connections), specified by supplying comma-delimited values for the host and port options. You may specify a single value for either of these options; otherwise, each must contain a list of the same size. The following examples are valid for multiple connection addresses:

For example, --host=myhost1,myhost2,myhost3 --port=10334,10335,10336 creates connections to myhost1:10334, myhost2:10335, and myhost3:10336

Note

You may also configure default Gemfire connection settings for all gemfire modules in config/modules.yml:

gemfire:
  useLocator: true
  host: myhost1,myhost2
  port: 10334

Tip

If you are deploying on Java 7 or earlier and need to deploy more than 4 Gemfire modules, be sure to increase the PermGen size of the singlenode or container process, e.g., JAVA_OPTS="-XX:PermSize=256m"

Launching the XD GemFire Server

This source requires a cache server to be running in a separate process, and its host and port, or a locator host and port, must be configured. The XD distribution includes a GemFire server executable suitable for development and test purposes. This is a Java main class that runs with a Spring configured cache server. The configuration is passed as a command line argument to the server’s main method. The configuration includes a cache server port and one or more configured regions. XD includes a sample cache configuration called cq-demo. This starts a server on port 40404 and creates a region named Stocks. A logging cache listener is configured for the region to log region events.

Run the GemFire cache server by changing to the gemfire/bin directory and executing

messageConverterClass

the name of a custom MessageConverter class, to convert HttpRequest to Message; must have a constructor with a 'MessageBuilderFactory' parameter (String, default: org.springframework.integration.x.http.NettyInboundMessageConverter)

port

the port to listen to (int, default: 9000)

sslPropertiesLocation

location (resource) of properties containing the location of the pkcs12 keyStore and pass phrase (String, no default)

When using https, you may either provide a properties file that references a pkcs12 key store (containing the server certificate(s)) and its passphrase, or set keyStore and keyStorePassphrase explicitly.
Setting --https=true enables https:// and the module uses SSL properties configured in config/modules/source/http/http.properties. By default, the resource classpath:httpSSL.properties is used.
This location can be overridden in config/modules/source/http/http.properties or with the --sslPropertiesLocation property. For example:
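A sketch of such a properties file (the key names follow those mentioned below; the path and passphrase are placeholders):

keyStore=file:/secret/mykey.p12
keyStorePassphrase=mysecret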

Since this properties file contains sensitive information, it will typically be secured by the operating system with the XD container process having read access.

Note

If you set keyStore and keyStorePassphrase in config/modules/source/http/http.properties in lieu of using an external properties file, the passPhrase may be encrypted. See Encrypted Properties for more details.

JDBC Source (jdbc)

This source module supports the ability to ingest data directly from various databases.
It does this by querying the database and sending the results as messages to the stream.

In the example above, the user polls the testfoo table to retrieve rows that have a "tag" of zero. The update sets the value of tag to 1 for the rows that were retrieved, so rows that have already been retrieved will not be included in future queries.

Note

If you access any database other than HSQLDB or Postgres in a stream module then the JDBC driver jar for that database needs to be present in the $XD_HOME/lib directory.

The jdbc source has the following options:

abandonWhenPercentageFull

connections that have timed out won't get closed and reported up unless the number of connections in use is above the percentage (int, default: 0)

Run this as a Java application; each time you hit <enter> in the console, it will send a message to queue jmstest.

The out-of-the-box configuration is set up to use ActiveMQ. To use another JMS provider you will need to update a few files in the XD distribution. There are sample files for HornetQ in the distribution as an example for you to follow. You will also need to add the appropriate libraries for your provider in the JMS module lib directory or in the main XD lib directory.
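The stream discussed above can be sketched as follows; the destination defaults to the stream name, so this consumes from the jmstest queue:

xd:> stream create --name jmstest --definition "jms | file" --deploy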

JMS with Options

The jms source has the following options:

acknowledge

the session acknowledge mode (String, default: auto)

clientId

an identifier for the client, to be associated with a durable topic subscription (String, no default)

destination

the destination name from which messages will be received (String, default: <stream name>)

durableSubscription

when true, indicates the subscription to a topic is durable (boolean, default: false)

provider

the JMS provider (String, default: activemq)

pubSub

when true, indicates that the destination is a topic (boolean, default: false)

subscriptionName

a name that will be assigned to the topic subscription (String, no default)

Note

the selected broker requires an infrastructure configuration file jms-<provider>-infrastructure-context.xml in modules/common. This is used to declare any infrastructure beans needed by the provider. See the default (jms-activemq-infrastructure-context.xml) for an example. Typically, all that is required is a ConnectionFactory. The activemq provider uses a properties file jms-activemq.properties which can be found in the config directory. This contains the broker URL.

Kafka

This source module ingests data from a single Kafka topic or a comma-separated list of topics.
When using a single-topic configuration, one can also specify an explicit partition list and initial offsets to fetch data from.
Also note that for a stream with a given name, or a kafka source with a given groupId, the offsets for the configured topics aren’t deleted when the stream is undeployed/destroyed. This
allows a re-deployed stream to read from where it left off when it was undeployed/destroyed.
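A hedged example (the ZooKeeper address and topic name are placeholders; zkconnect and topic are assumed to be the options naming the ZooKeeper connection and the topic to consume):

xd:> stream create --name kafkatest --definition "kafka --zkconnect=localhost:2181 --topic=mytopic | log" --deploy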

The kafka source has the following options:

autoOffsetReset

strategy to reset the offset when there is no initial offset in ZK or if an offset is out of range (AutoOffsetResetStrategy, default: smallest, possible values: smallest,largest)

encoding

string encoder to translate bytes into string (String, default: UTF8)

fetchMaxBytes

max messages to attempt to fetch for each topic-partition in each fetch request (int, default: 1048576)

fetchMaxWait

max wait time before answering the fetch request (int, default: 100)

fetchMinBytes

the minimum amount of data the server should return for a fetch request (int, default: 1)
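The following paragraphs describe the mail source. A sketch of a stream that polls an IMAPS account and writes messages to a file (the host, credentials, and the percent-encoded @ in the username are illustrative):

xd:> stream create --name mailstream --definition "mail --host=imap.example.com --username=me%40example.com --password=secret --protocol=imaps --port=993 | file" --deploy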

Then send an email to yourself and you should see it appear inside a file at /tmp/xd/output/mailstream

Note: If the username or password contains special characters such as @ or <space>, you need to enter their percent-encoded equivalents.
For example, the character @ can be specified with its percent-encoded form %40, as in the above definition.

The full list of options for the mail source is below:

The mail source has the following options:

charset

the charset used to transform the body of the incoming emails to Strings (String, default: UTF-8)

delete

whether to delete the emails once they’ve been fetched (boolean, default: true)

fixedDelay

the polling interval used for looking up messages (s) (int, default: 60)

folder

the folder to take emails from (String, default: INBOX)

host

the hostname of the mail server (String, default: localhost)

markAsRead

whether to mark emails as read once they’ve been fetched (boolean, default: false)

maxMessages

the maximum messages per poll; -1 for unlimited (long, default: 1)

password

the password to use to connect to the mail server (String, no default)

port

the port of the mail server (int, default: 25)

properties

comma separated JavaMail property values (String, no default)

propertiesFile

file to load the JavaMail properties (String, no default)

protocol

the protocol to use to retrieve messages (MailProtocol, default: imap, possible values: imap,imaps,pop3,pop3s)

usePolling

whether to use polling or not (no polling works with imap(s) only) (boolean, default: false)

username

the username to use to connect to the mail server (String, no default)

Warning

Pay special attention to the markAsRead and delete options; by default, emails are deleted once they are consumed. It is hard to come up with a sensible default for this (please refer to the Spring Integration documentation section on mail handling for a discussion), so just be aware that the default for XD is to delete incoming messages.

MongoDB Source (mongodb)

The MongoDB source allows one to query a MongoDB collection and emit a message for each matching result.
This source works by regularly polling MongoDB and emitting the result list, as independent objects. If split is set to
false, the whole list is emitted as payload.
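The next paragraphs describe the rabbit source, which ingests messages from a RabbitMQ queue. A hedged example (the queue to consume from defaults to the stream name):

xd:> stream create --name rabbittest --definition "rabbit | file --binary=true" --deploy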

This receives messages from a queue named rabbittest and writes them to the default file sink (/tmp/xd/output/rabbittest.out). It uses the default RabbitMQ broker running on localhost, port 5672.

The queue(s) must exist before the stream is deployed. We do not create the queue(s) automatically. However, you can easily create a Queue using the RabbitMQ web UI. Then, using that same UI, you can navigate to the "rabbittest" Queue and publish test messages to it.

Notice that the file sink has --binary=true; this is because, by default, the data emitted by the source will be bytes. This can be modified by setting the content_type property on messages to text/plain. In that case, the source will convert the message to a String; you can then omit the --binary=true and the file sink will then append a newline after each message.

A Note About Retry

Note

With the default ackMode (AUTO) and requeue (true) options, failed message deliveries will be retried indefinitely. Since there is not much processing in the rabbit source, the risk of failure in the source itself is small. However, when using the LocalMessageBus or Direct Binding, exceptions in downstream modules will be thrown back to the source. Setting requeue to false will cause messages to be rejected on the first attempt (and possibly sent to a Dead Letter Exchange/Queue if the broker is so configured). The enableRetry option allows configuration of retry parameters such that a failed message delivery can be retried and eventually discarded (or dead-lettered) when retries are exhausted. The delivery thread is suspended during the retry interval(s). Retry options are enableRetry, maxAttempts, initialRetryInterval, retryMultiplier, and maxRetryInterval. Message deliveries failing with a MessageConversionException (perhaps when using a custom converterClassName) are never retried; the assumption being that if a message could not be converted on the first attempt, subsequent attempts will also fail. Such messages are discarded (or dead-lettered).

Reactor IP (reactor-ip)

The reactor-ip source acts as a server and allows a remote party to connect to XD and submit data over a raw TCP or UDP socket. The reactor-ip source differs from the standard tcp source in that it is based on the Reactor Project and can be configured to use the LMAX Disruptor RingBuffer library allowing for extremely high ingestion rates, e.g. ~ 1M/sec.
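A sketch of such a stream definition:

xd:> stream create --name tcpReactor --definition "reactor-ip | file" --deploy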

This will create the reactor TCP source and send data read from it to the file named tcpReactor.

The reactor-ip source has the following options:

codec

codec used to transcode data (String, default: string)

dispatcher

type of Reactor Dispatcher to use (String, default: shared)

framing

method of framing the data (String, default: linefeed)

host

host to bind the server to (String, default: 0.0.0.0)

lengthFieldLength

byte precision of the number used in the length field (int, default: 4)

port

port to bind the server to (int, default: 3000)

transport

whether to use TCP or UDP as a transport protocol (String, no default)

SFTP

This source module supports transfer of files using the SFTP protocol.
Files are transferred from the remote directory to the local directory where the module is deployed.

Messages emitted by the source are provided as a byte array by default. However, this can be
customized using the --mode option:

ref Provides a java.io.File reference

lines Will split files line-by-line and emit a new message for each line

contents The default. Provides the contents of a file as a byte array

When using --mode=lines, you can also provide the additional option --withMarkers=true.
If set to true, the underlying FileSplitter will emit additional start-of-file and end-of-file marker messages before and after the actual data.
The payload of these 2 additional marker messages is of type FileSplitter.FileMarker. The option withMarkers defaults to false if not explicitly set.
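A hedged example of an sftp stream using the options listed below (host and credentials are placeholders; allowUnknownKeys=true avoids the need for a known hosts file in a test setup):

xd:> stream create --name sftptest --definition "sftp --host=sftp.example.com --user=myuser --password=secret --allowUnknownKeys=true | log" --deploy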

Options

The sftp source has the following options:

allowUnknownKeys

true to allow connecting to a host with an unknown or changed key (boolean, default: false)

autoCreateLocalDir

if local directory must be auto created if it does not exist (boolean, default: true)

deleteRemoteFiles

delete remote files after transfer (boolean, default: false)

fixedDelay

fixed delay in SECONDS to poll the remote directory (int, default: 1)

host

the remote host to connect to (String, default: localhost)

initialDelay

an initial delay when using a fixed delay trigger, expressed in TimeUnits (seconds by default) (int, default: 0)

knownHostsExpression

a SpEL expression evaluating to the location of the known hosts file; required if 'allowUnknownKeys' is false; examples: systemProperties["user.home"]+"/.ssh/known_hosts", "/foo/bar/known_hosts" (String, no default)

localDir

set the local directory the remote files are transferred to (String, default: /tmp/xd/output)

maxMessages

the maximum messages per poll; -1 for unlimited (long, default: -1)

mode

specifies how the file is being read. By default the content of a file is provided as byte array (FileReadingMode, default: contents, possible values: ref,lines,contents)

timeUnit

the time unit for the fixed and initial delays (String, default: SECONDS)

tmpFileSuffix

extension to use when downloading files (String, default: .tmp)

user

the username to use (String, no default)

withMarkers

if true emits start of file/end of file marker messages before/after the data. Only valid with FileReadingMode 'lines' (Boolean, no default)

Stdout Capture

There isn’t actually a source named "stdin" but it is easy to capture stdin by redirecting it to a tcp source. For example if you wanted to capture the output of a command, you would first create the tcp stream, as above, using the appropriate sink for your requirements:

You can then capture the output from commands using the netcat command:

$ cat mylog.txt | netcat localhost 1234

Syslog

Three syslog sources are provided: reactor-syslog, syslog-udp, and syslog-tcp. The reactor-syslog adapter uses TCP and builds upon the functionality available in the Reactor project, providing improved throughput over the syslog-tcp adapter.

The reactor-syslog source has the following options:

port

the port on which the system will listen for syslog messages (int, default: 5140)

The syslog-udp source has the following options:

port

the port on which to listen (int, default: 5140)

rfc

the format of the syslog (String, default: 3164)

The syslog-tcp source has the following options:

nio

use nio (recommend false for a small number of senders, true for many) (boolean, default: false)

Tail Status Events

Some platforms, such as Linux, send status messages to stderr. The tail module sends these events to a logging adapter, at WARN level; for example:

[message=tail: cannot open `/tmp/xd/input/tailtest' for reading: No such file or directory, file=/tmp/xd/input/tailtest]
[message=tail: `/tmp/xd/input/tailtest' has become accessible, file=/tmp/xd/input/tailtest]

TCP

The tcp source acts as a server and allows a remote party to connect to XD and submit data over a raw tcp socket.

To create a stream definition in the server, use the following XD shell command

xd:> stream create --name tcptest --definition "tcp | file" --deploy

This will create the default TCP source and send data read from it to the tcptest file.

TCP is a streaming protocol and some mechanism is needed to frame messages on the wire. A number of decoders are available, the default being CRLF which is compatible with Telnet.
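As a hedged sketch of selecting a different decoder (the decoder and port options are assumed to mirror those documented for the tcp-client module below; the port value is a placeholder):

xd:> stream create --name tcptest2 --definition "tcp --decoder=LF --port=1235 | file" --deploy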

TCP Client (tcp-client)

The tcp-client source module uses raw TCP sockets, like the tcp module, but acts as a client rather than a server. Whereas the tcp module opens a listening socket and waits for connections from a remote party, the tcp-client initiates the connection to a remote server and emits as messages whatever that remote server sends over the wire. As an optional feature, the tcp-client can itself emit messages to the remote server, so that a simple conversation can take place.

TCP Client options

The tcp-client source has the following options:

bufferSize

the size of the buffer (bytes) to use when encoding/decoding (int, default: 2048)

charset

the charset used when converting from bytes to String (String, default: UTF-8)

close

whether to close the socket after each message (boolean, default: false)

decoder

the decoder to use when receiving messages (Encoding, default: CRLF, possible values: CRLF,LF,NULL,STXETX,RAW,L1,L2,L4)

encoder

the encoder to use when sending messages (Encoding, default: CRLF, possible values: CRLF,LF,NULL,STXETX,RAW,L1,L2,L4)

expression

a SpEL expression used to transform messages (String, default: payload.toString())

fixedDelay

the rate at which stimulus messages will be emitted (seconds) (int, default: 5)

Implementing a simple conversation

That "stimulus" counter concept bears some explanation. By default, the module will emit (at interval set by fixedDelay) an incrementing number, starting at 1. Given that the default is to use an expression of payload.toString(), this results in the module sending 1, 2, 3, ... to the remote server.

By using another expression, or more likely a script, one can implement a simple conversation, assuming it is time based. As an example, let’s assume we want to join some kind of chat server where one first needs to authenticate, then specify which rooms to join. Lastly, all clients are supposed to send some keepalive commands to make sure that the connection is open.

If the script option is set, the script file’s modified timestamp is checked for changes every 60 seconds by
default; this can be changed with the refreshDelay deployment property: --refreshDelay=30000 (every 30 seconds or
30,000ms), --refreshDelay=-1 to disable refresh.

Time

The time source will simply emit a String with the current time every so often.

The time source has the following options:

fixedDelay

time delay between messages, expressed in TimeUnits (seconds by default) (int, default: 1)

format

how to render the current time, using SimpleDateFormat (String, default: yyyy-MM-dd HH:mm:ss)

initialDelay

an initial delay when using a fixed delay trigger, expressed in TimeUnits (seconds by default) (int, default: 0)

maxMessages

the maximum messages per poll; -1 for unlimited (long, default: 1)

timeUnit

the time unit for the fixed and initial delays (String, default: SECONDS)
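For example, the following stream (a sketch using the options above; the stream name is arbitrary) emits the time every five seconds and logs it:

```shell
xd:> stream create --name ticktock --definition "time --fixedDelay=5 | log" --deploy
```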

Trigger Source (trigger)

The trigger source emits a message or messages according to the provided trigger configuration.
The message payload is a simple literal value, provided in the payload property.

The trigger source has the following options:

cron

cron expression specifying when the trigger should fire (String, no default)

date

a one-time date when the trigger should fire; only applies if 'fixedDelay' and 'cron' are not provided (String, default: The current time)

dateFormat

the format specifying how the 'date' should be parsed (String, default: MM/dd/yy HH:mm:ss)

fixedDelay

time delay between executions, expressed in TimeUnits (seconds by default) (Integer, no default)

initialDelay

an initial delay when using a fixed delay trigger, expressed in TimeUnits (seconds by default) (int, default: 0)

maxMessages

the maximum messages per poll; -1 for unlimited (long, default: 1)

payload

the message that will be sent when the trigger fires (String, default: ``)

timeUnit

the time unit for the fixed and initial delays (String, default: SECONDS)
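As an illustrative sketch combining these options, the following hypothetical definition fires the given payload every 30 seconds:

```shell
xd:> stream create --name trigtest --definition "trigger --fixedDelay=30 --payload='Hello World' | log" --deploy
```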

Twitter Search (twittersearch)

The twittersearch source runs a continuous query against Twitter.

The twittersearch source has the following options:

connectTimeout

the connection timeout for making a connection to Twitter (ms) (int, default: 5000)

consumerKey

a consumer key issued by twitter (String, no default)

consumerSecret

consumer secret corresponding to the consumer key (String, no default)

To get a consumerKey and consumerSecret you need to register a Twitter application. If you don't already have one set up, you can create an app at the Twitter Developers site to get these credentials.

Tip

For both twittersearch and twitterstream you can put these keys in a module properties file instead of supplying them in the stream definition. If both sources share the same credentials, it is easiest to configure the required credentials in config/modules/modules.yml. Alternatively, each module has its own properties file. For twittersearch, the file would be config/modules/source/twittersearch/twittersearch.properties.

To create and deploy a stream definition in the server using the XD shell:
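For example (a sketch; replace the placeholders with your own credentials, and note that the query option shown here is an assumption about the module's search option, which is not in the list above):

```shell
xd:> stream create --name tweettest --definition "twittersearch --consumerKey=<your-key> --consumerSecret=<your-secret> --query='spring' | file" --deploy
```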

Twitter Stream (twitterstream)

This source ingests data from Twitter’s streaming API v1.1. It uses the sample and filter stream endpoints rather than the full "firehose" which needs special access. The endpoint used will depend on the parameters you supply in the stream definition (some are specific to the filter endpoint).

You need to supply all keys and secrets (both consumer and accessToken) to authenticate for this source, so it is easiest to add these to XD_HOME/config/modules/modules.yml or the XD_HOME/config/modules/source/twitterstream/twitterstream.properties file.
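With the credentials configured in one of those files, a minimal sketch of a stream using this source could be:

```shell
xd:> stream create --name tweets --definition "twitterstream | file" --deploy
```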

Processors

This section describes the processor modules included with Spring XD. A processor implements a processing task within a stream. A stream may chain multiple processors sequentially as needed. To run the examples shown here, start the XD Container as instructed in the
Getting Started page.

Aggregator

The aggregator module does the opposite of the splitter, and builds upon the concept of the same name found in Spring Integration. By default, it will consider all incoming messages from a stream to belong to the same group:

This is an example that is operating on a JSON payload of tweets as consumed from the twitter search module.
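A hedged sketch of such a stream, assuming the aggregator supports a count option (a standard Spring Integration release setting) to group every 10 tweets before passing them on:

```shell
xd:> stream create --name tweetagg --definition "twittersearch --consumerKey=<your-key> --consumerSecret=<your-secret> --query='spring' | aggregator --count=10 | log" --deploy
```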

Filter with Groovy Script

For more complex filtering, you can pass the location of a Groovy script using the script option. If you want to pass variable values to your script, you can statically bind values using the variables option or optionally pass the path to a properties file containing the bindings using the propertiesLocation option. All properties in the file will be made available to the script as variables. Note that payload and headers are implicitly bound to give you access to the data contained in a message.

By default, Spring XD will search the classpath for custom-filter.groovy and custom-filter.properties. You can place the script in ${xd.home}/modules/processor/scripts and the properties file in ${xd.home}/config to make them available on the classpath. Alternatively, you can prefix the script and properties-location values with file: to load from the file system.

In the following stream definitions, the filter will pass only the first message:
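A sketch of such definitions (the script name and the threshold variable are illustrative; the script is assumed to compare a payload value against threshold):

```shell
xd:> stream create --name filter1 --definition "http --port=9001 | filter --script=custom-filter.groovy --variables='threshold=5' | log" --deploy
xd:> stream create --name filter2 --definition "http --port=9002 | filter --script=custom-filter.groovy --propertiesLocation=custom-filter.properties --variables='threshold=5' | log" --deploy
```

In the second definition, the threshold bound via variables overrides any threshold from propertiesLocation.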

Note the last example demonstrates that values specified in variables override values from propertiesLocation

Tip

The script file’s modified timestamp is checked for changes every 60 seconds by default; this can be changed with
the refreshDelay deployment property: --refreshDelay=30000 (every 30 seconds or 30,000ms), --refreshDelay=-1 to
disable refresh.

Header Enricher (header-enricher)

The header-enricher processor provides a basic header enricher allowing a stream to add runtime state in one or more message headers. Message headers are preserved across the entire stream flow and may be referenced by downstream modules using SpEL, e.g., expression=headers['foo'].

Header expressions are provided using the headers module option which expects a JSON string.

Literal Strings with Embedded Spaces: SpEL expects literals to be enclosed in single quotes. This is straightforward when the literal does not contain embedded spaces, as in the Multiple Headers example above. Using embedded spaces requires wrapping the headers value in single quotes and escaping the single quotes around the literal string

The header-enricher processor has the following options:

headers

a JSON document representing headers in which values are SpEL expressions, e.g {"h1":"exp1","h2":"exp2"} (String, no default)

overwrite

set to true to overwrite any existing message headers (Boolean, default: false)
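Putting the options together, a minimal sketch that adds a literal header and logs the enriched message (the exact shell escaping of the JSON value may vary by environment):

```shell
xd:> stream create --name headertest --definition "http --port=9000 | header-enricher --headers={\"foo\":\"'bar'\"} | log" --deploy
```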

HTTP Client (http-client)

The http-client processor acts as a client that issues HTTP requests to a remote server, submitting the message payload it receives to that server and in turn emitting the server's response to the next module down the line.

For example, the following command will result in an immediate fetching of earthquake data and it being logged in the container:
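A sketch of such a stream; the URL here is a placeholder, and the url and httpMethod option names are assumptions (the tripled quotes embed a literal string inside the SpEL expression the url option expects):

```shell
xd:> stream create --name quakes --definition "trigger --fixedDelay=60 | http-client --url='''http://example.com/earthquakes''' --httpMethod=GET | log" --deploy
```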

Object to JSON (object-to-json)

The object-to-json processor converts a Java object payload into a JSON string. It has no particular options (in addition to the options shared by all modules).

Script

The script processor contains a Service Activator that invokes a specified Groovy script. This is a slightly more generic way to accomplish processing logic, as the provided script may simply terminate the stream as well as transform or filter Messages.

To use the module, pass the location of a Groovy script using the script attribute. If you want to pass variable values to your script, you can statically bind values using the variables option or optionally pass the path to a properties file containing the bindings using the propertiesLocation option. All properties in the file will be made available to the script as variables. Note that payload and headers are implicitly bound to give you access to the data contained in a message. See the Filter example for a more detailed discussion of script variables.

By default, Spring XD will search the classpath for custom-processor.groovy and custom-processor.properties. You can place the script in ${xd.home}/modules/processor/scripts and the properties file in ${xd.home}/config to make them available on the classpath. Alternatively, you can prefix the location and properties-location values with file: to load from the file system.

Tip

The script file’s modified timestamp is checked for changes every 60 seconds by default; this can be changed with
the refreshDelay deployment property: --refreshDelay=30000 (every 30 seconds or 30,000ms), --refreshDelay=-1 to
disable refresh.

Shell

The shell processor forks an external process by running a shell command to launch a process written in any language. The process should implement a continual loop that waits for input from stdin and writes a result to stdout in a request-response manner. The process will be destroyed when the stream is undeployed. For example, it is possible to invoke a Python script within a stream in this manner. Since the shell processor relies on low-level stream processing there are some additional requirements:

Input and output data are expected to be Strings; the charset is configurable.

The shell process must not write out-of-band data to stdout, such as a start-up message or prompt.

Anything written to stderr will be logged as an ERROR in Spring XD but will not terminate the stream.

Responses written to stdout must be terminated using the configured encoder (CRLF, i.e. "\r\n", is the default) for the module and must not exceed the configured bufferSize.

Any external software required to run the script must be installed on the container node to which the module is deployed.
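For example, a minimal worker satisfying these requirements could be a small POSIX shell script that upper-cases each request line. This sketch assumes the module's default CRLF encoder; nothing but the replies is written to stdout:

```shell
#!/bin/sh
# Request-response loop for use with the shell processor:
# read one request per line from stdin, reply on stdout,
# terminating each response with CRLF (the default encoder).
while IFS= read -r line; do
  printf '%s\r\n' "$line" | tr '[:lower:]' '[:upper:]'
done
```

A stream could then launch it via the processor's command option, e.g. shell --command='sh upper.sh' (the script path is hypothetical).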

This transform will convert all message payloads to upper case. If you send the word "foo" to the HTTP endpoint, you should see "FOO" in the XD log:

xd:> http post --target http://localhost:9003 --data "foo"

As part of the SpEL expression you can make use of the pre-registered JSON Path function. The syntax is #jsonPath(payload,<json path expression>)
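For instance, a sketch extracting a field from a JSON payload (the field name is illustrative; note the doubled single quotes used to embed a SpEL string literal inside the DSL):

```shell
xd:> stream create --name jsontest --definition "http --port=9004 | transform --expression='#jsonPath(payload,''$.firstName'')' | log" --deploy
```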

Transform with Groovy Script

For more complex transformations, you can pass the location of a Groovy script using the script option. If you want to pass variable values to your script, you can statically bind values using the variables option or optionally pass the path to a properties file containing the bindings using the propertiesLocation option. All properties in the file will be made available to the script as variables. Note that payload and headers are implicitly bound to give you access to the data contained in a message. See the Filter example for a more detailed discussion of script variables.

By default, Spring XD will search the classpath for custom-transform.groovy and custom-transform.properties. You can place the script in ${xd.home}/modules/processor/scripts and the properties file in ${xd.home}/config to make them available on the classpath. Alternatively, you can prefix the script and properties-location values with file: to load from the file system.

Tip

The script file’s modified timestamp is checked for changes every 60 seconds by default; this can be changed with
the refreshDelay deployment property: --refreshDelay=30000 (every 30 seconds or 30,000ms), --refreshDelay=-1 to
disable refresh.

Sinks

This section describes the sink modules included with Spring XD. A sink terminates a stream to persist data or push it to an external consumer. To run the examples shown here, start the XD Container
as instructed in the Getting Started page.

Additionally, Spring XD provides a number of counters and gauges, which are specialized sinks useful for real time analytics.

See the section Creating a Sink Module for information on how to create sink modules using other Spring Integration Adapters.

Dynamic Router (router)

The Dynamic Router support allows for routing Spring XD messages to named channels based on the evaluation of SpEL expressions or Groovy Scripts.

SpEL-based Routing

In the following example, two streams are created that listen for messages on the foo and bar channels. Furthermore, we create a stream that receives messages via HTTP and then delegates the received messages to a router:
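The three definitions might be sketched as follows, using named queue channels as the router targets (the routing expression is illustrative):

```shell
xd:> stream create --name f --definition "queue:foo > log" --deploy
xd:> stream create --name b --definition "queue:bar > log" --deploy
xd:> stream create --name r --definition "http | router --expression=payload.contains('a')?'queue:foo':'queue:bar'" --deploy
```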

You can also use Groovy scripts located on your classpath by specifying:

--script='org/my/package/router.groovy'

If you want to pass variable values to your script, you can statically bind values using the variables option or optionally pass the path to a properties file containing the bindings using the propertiesLocation option. All properties in the file will be made available to the script as variables. You may specify both variables and propertiesLocation, in which case any duplicate values provided as variables override values provided in propertiesLocation. Note that payload and headers are implicitly bound to give you access to the data contained in a message.

If the script option is set, the script file’s modified timestamp is checked for changes every 60 seconds by
default; this can be changed with the refreshDelay deployment property: --refreshDelay=30000 (every 30 seconds or
30,000ms), --refreshDelay=-1 to disable refresh.

File Sink (file)

Another simple option is to stream data to a file on the host OS. This can be done using the file sink module to create a stream.
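For example, the following stream writes whatever is posted to the HTTP endpoint into a file named after the stream:

```shell
xd:> stream create --name myfilestream --definition "http --port=8000 | file" --deploy
xd:> http post --target http://localhost:8000 --data "hello"
```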

For the filename, it will do the same thing as explained previously. For the directory name it will use
the content of the file (trimmed) concatenated with dir- (in this case: "/tmp/test/dir-hello.txt").
If the destination directory does not exist, it and any non-existing parent directories are created
automatically.

When you use the nameExpression option you have to use the dirExpression option (not the dir option) to
specify the destination directory name, even if it’s a simple string (e.g. 'mydir').

File with Options

The file sink has the following options:

binary

if false, will append a newline character at the end of each line (boolean, default: false)

charset

the charset to use when writing a String payload (String, default: UTF-8)

dir

the directory in which files will be created (String, default: /tmp/xd/output/)

dirExpression

spring expression used to define directory name (String, no default)

mode

what to do if the file already exists (Mode, default: APPEND, possible values: APPEND,REPLACE,FAIL,IGNORE)

name

filename pattern to use (String, default: <stream name>)

nameExpression

spring expression used to define filename (String, no default)

suffix

filename extension to use (String, no default)

FTP Sink (ftp)

The ftp sink is a simple option to push files to an FTP server from incoming messages.

It uses an ftp-outbound-adapter, so incoming messages can be either a java.io.File object, a String (the content of the file),
or an array of bytes (also the file content).

To use this sink, you need a username and a password to log in. Once you have these you can stream
data from, for instance, a file source to the ftp sink:
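A hedged sketch of such a stream, with placeholder connection details (the host option name is an assumption, as it is not in the options list below):

```shell
xd:> stream create --name ftppush --definition "file --dir=/tmp/in | ftp --host=<ftp-host> --username=<user> --password=<pass> --remoteDir=/upload" --deploy
```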

On the ftp server, you should see the file test.txt with the content hello.

To pass the filename to the module you can use the header file_name with the filename you wish to be used.

NOTE:
By default Spring Integration will use o.s.i.file.DefaultFileNameGenerator if none is specified. DefaultFileNameGenerator determines the file name
based on the value of the file_name header (if it exists) in the MessageHeaders, or, if the payload of the Message is already a java.io.File, it
uses the original name of that file.

FTP with Options

The ftp sink has the following options:

autoCreateDir

remote directory must be auto created if it does not exist (boolean, default: true)

mode

what to do if the file already exists (Mode, default: REPLACE, possible values: APPEND,REPLACE,FAIL,IGNORE)

password

the password for the FTP connection (Password, no default)

port

the port for the FTP server (int, default: 21)

remoteDir

the remote directory to transfer the files to (String, default: /)

remoteFileSeparator

file separator to use on the remote side (String, default: /)

temporaryRemoteDir

temporary remote directory that should be used (String, default: /)

tmpFileSuffix

extension to use on server side when uploading files (String, default: .tmp)

useTemporaryFilename

use a temporary filename while transferring the file and rename it to its final name once it's fully transferred (boolean, default: true)

username

the username for the FTP connection (String, no default)

GemFire Server

Currently XD supports GemFire's client-server topology. A sink that writes data to a GemFire cache requires at least one cache server to be running in a separate process and may also be configured to use a Locator. While GemFire configuration is outside the scope of this document, details are covered in the GemFire product documentation. The XD distribution includes a standalone GemFire server executable suitable for development and test purposes, bootstrapped using a Spring configuration file provided as a command line argument. The GemFire jar is distributed freely under GemFire's development license and is subject to the license's terms and conditions. Sink modules provided with the XD distribution that write data to GemFire create a client cache and client region. No data is cached on the client.

Tip

If native gemfire properties are required to configure the client cache, e.g., for security, place a gemfire.properties file in $XD_HOME/config.

Launching the XD GemFire Server

To start the GemFire cache server included in the Spring XD distribution, go to the XD install directory:

$ cd gemfire/bin
$ ./gemfire-server ../config/cq-demo.xml

The command line argument is the path of a Spring Data GemFire configuration file containing a configured cache server and one or more regions. A sample cache configuration, cq-demo.xml, is provided in the config directory. Note that Spring interprets the path as a relative path unless it is explicitly preceded by file:. The sample configuration starts a server on port 40404 and creates a region named Stocks.

Gemfire sinks

There are 2 implementations of the gemfire sink: gemfire-server and gemfire-json-server. They are identical except the latter converts JSON string payloads to a JSON document format proprietary to GemFire and provides JSON field access and query capabilities. If you are not using JSON, the gemfire-server module will write the payload using java serialization to the configured region. Both modules accept the same options.

The gemfire-server sink has the following options:

host

host name of the cache server or locator (if useLocator=true). May be a comma delimited list (String, no default)

keyExpression

a SpEL expression which is evaluated to create a cache key (String, default: '<stream name>')

port

port of the cache server or locator (if useLocator=true). May be a comma delimited list (String, no default)

regionName

name of the region to use when storing data (String, default: <stream name>)

useLocator

indicates whether a locator is used to access the cache server (boolean, default: false)
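For example, assuming the sample Stocks region from the server section, a sketch that caches posted JSON quotes keyed by their symbol field (the keyExpression shown relies on GemFire's JSON document field access; treat it as an illustration):

```shell
xd:> stream create --name stocks --definition "http --port=9090 | gemfire-json-server --regionName=Stocks --keyExpression=payload.getField('symbol')" --deploy
```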

Tip

The keyExpression, as its name suggests, is a SpEL expression. Typically, the key value is derived from the payload. The default of '<stream name>' (mind the quotes) will overwrite the same entry for every message received on the stream.

Note

The useLocator option is intended for integration with an existing GemFire installation in which the cache servers are configured to use locators in accordance with best practice. GemFire supports configuration of multiple locators (or direct server connections), specified by supplying comma-delimited values for the host and port options. You may specify a single value for either of these options; otherwise, each must contain a list of the same size. The following examples are valid for multiple connection addresses:

This will write an entry to the GemFire Stocks region with the key FAKE. Do not put spaces when separating the JSON key-value pairs, only a comma.

You should see a message on STDOUT for the process running the GemFire server like:

INFO [LoggingCacheListener] - updated entry FAKE

Tip

If you are deploying on Java 7 or earlier and need to deploy more than 4 GemFire modules, be sure to increase the PermSize of the singlenode or container, e.g., JAVA_OPTS="-XX:PermSize=256m".

GPFDIST

The gpfdist sink allows you to stream data in parallel to either Pivotal Greenplum DB
or Pivotal HAWQ. Internally, this sink creates a custom HTTP listener that supports
the gpfdist protocol and schedules a task that orchestrates a gpload session in the
same way it is done natively in Greenplum.

No data is written into temporary files and all data is kept in stream buffers waiting
to get inserted into Greenplum DB or HAWQ. If there are no existing load sessions from Greenplum,
the sink will block until such sessions are established.

Now create the stream definition and deploy. You should ensure that your pg_hba.conf (e.g. /data/master/gpsne-1/pg_hba.conf) is configured to allow a connection from your host where you are running the gpfdist sink. (an entry such as host all gpadmin 192.168.70.128/32 trust)

In this XD stream we send 10M messages from the load-generator-string source to the gpfdist sink.
We keep the load session alive for roughly 5 seconds, flushing data after 2 seconds or 200 entries,
whichever comes first, and sleep 0 seconds in between load sessions.
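That stream might be sketched as follows (the load-generator-string messageCount option and the host and table values are assumptions for illustration; the batch and flush options map to the settings described above):

```shell
xd:> stream create --name gpstream --definition "load-generator-string --messageCount=10000000 | gpfdist --dbHost=mdw --table=test --batchTimeout=5 --flushTime=2 --flushCount=200 --batchPeriod=0" --deploy
```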

You will see log output (you will probably need to set the log level of the package log4j.logger.org.springframework.xd.greenplum to INFO.)

Performance Notes

On a Lenovo W540 running Spring XD singlenode, load-generator-string | gpfdist inserted data at ~540K/sec.
The underlying message handler in the gpfdist sink is able to achieve ~1.2M/sec, which is comparable to
the native gpload client. Additional performance optimizations for use within an XD stream
are on the roadmap.

Implementation Notes

Within a gpfdist sink we have a Reactor based stream where data is published from the incoming SI channel.
This channel receives data from the Message Bus. The Reactor stream is then connected to Netty based
http channel adapters so that when a new http connection is established, the Reactor stream is flushed and balanced among
existing http clients. When Greenplum does a load from an external table, each segment will initiate
a http connection and start loading data. The net effect is that incoming data is automatically spread
among the Greenplum segments.

GPFDIST with Options

The options flushCount and flushTime are used to determine when to flush
data that is buffered in an internal
Reactor stream to
the http connection. Data is flushed based on if the count value has been reached
or the time specified has elapsed. Note that with too high a value, memory consumption
will go up. Too small a value combined with a low ingestion rate will result in data
being inserted into the database less frequently.

batchCount defines the maximum count of aggregated windows the client
takes before the internal Reactor stream and http channel is closed.

batchTimeout defines how many seconds each http connection should be
kept alive if no data is streamed to a client. Use this together with
batchCount to estimate how long each loading session should last.

batchPeriod defines how many seconds a task running the load operation
should sleep in between loads.

mode defines the database load logic, either INSERT or
UPDATE; INSERT is the default mode. Similar to the control file
GPLOAD.OUTPUT.MODE property. MERGE is not currently supported.

columnDelimiter defines the data delimiter character within a line of
data. Defaults to the tab character. Similar to the control file
GPLOAD.SOURCE.DELIMITER property.

updateColumns defines the columns to update and is required with mode
UPDATE. Similar to the control file GPLOAD.OUTPUT.UPDATE_COLUMNS property.

matchColumns defines the columns to match on and is required with mode
UPDATE. Similar to the control file GPLOAD.OUTPUT.MATCH_COLUMNS property.

sqlBefore defines a simple SQL clause to be run before every load
operation. Similar to the control file GPLOAD.SQL.BEFORE property.

sqlAfter defines a simple SQL clause to be run after every load
operation. Similar to the control file GPLOAD.SQL.AFTER property.

delimiter is used to postfix incoming data with a line termination
because Greenplum expects line terminated data.

controlFile can be used to introduce more parameters for a load
operation. For simple use cases, the table property can be used.

rateInterval, if set, enables logging of the transfer rate through the sink.

The gpfdist sink has the following options:

batchCount

batch count (int, default: 100)

batchPeriod

batch period (int, default: 10)

batchTimeout

batch timeout (int, default: 4)

columnDelimiter

column delimiter (Character, no default)

controlFile

path to yaml control file (String, no default)

dbHost

database host (String, default: localhost)

dbName

database name (String, default: gpadmin)

dbPassword

database password (String, default: gpadmin)

dbPort

database port (int, default: 5432)

dbUser

database user (String, default: gpadmin)

delimiter

data line delimiter (String, default: newline)

flushCount

flush item count (int, default: 100)

flushTime

flush item time (int, default: 2)

matchColumns

match columns with update (String, no default)

mode

mode, either insert or update (String, no default)

port

gpfdist listen port (int, default: 0)

rateInterval

enable transfer rate interval (int, default: 0)

sqlAfter

sql to run after load (String, no default)

sqlBefore

sql to run before load (String, no default)

table

target database table (String, no default)

updateColumns

update columns with update (String, no default)

Cassandra

The cassandra sink writes into a Cassandra table. Here is a simple example:
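A hedged sketch (the keyspace, table, and the contactPoints/ingestQuery option names are assumptions for illustration):

```shell
xd:> stream create --name cassandrastream --definition "http --port=9005 | cassandra --contactPoints=localhost --keyspace=mykeyspace --ingestQuery='insert into book (isbn, title) values (?, ?)'" --deploy
```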

Hadoop (HDFS) (hdfs)

If you do not have Hadoop installed, you can install Hadoop as described in our separate guide. Spring XD supports 4 Hadoop distributions, see using Hadoop for more information on how to start Spring XD to target a specific distribution.

Once Hadoop is up and running, you can then use the hdfs sink when creating a stream
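For example (the stream name matches the one destroyed later in this section):

```shell
xd:> stream create --name myhdfsstream1 --definition "time | hdfs" --deploy
```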

In the above example, we've scheduled the time source to automatically send ticks to hdfs once every second. If you wait a little while for data to accumulate, you can then list the files in the Hadoop filesystem using the shell's built-in hadoop fs commands. Before making any access to HDFS in the shell you first need to configure the shell to point to your name node. This is done using the hadoop config command.

xd:>hadoop config fs --namenode hdfs://localhost:8020

In this example the hdfs protocol is used but you may also use the webhdfs protocol. Listing the contents in the output directory (named by default after the stream name) is done by issuing the following command.

While the file is being written to it will have the tmp suffix. When the data written exceeds the rollover size (default 1GB) it will be renamed to remove the tmp suffix. There are several options to control in-use file naming: --inUsePrefix and --inUseSuffix set the file name prefix and suffix, respectively.

When you destroy a stream

xd:>stream destroy --name myhdfsstream1

and list the stream directory again, the in-use file suffix is gone.

In the above examples we didn't yet go through why the file was written into a specific directory and why it was named in this specific way. The default location of a file is defined as /xd/<stream name>/<stream name>-<rolling part>.txt. These can be changed using the options --directory and --fileName respectively. An example is shown below.

It is also possible to control the size of the files written into HDFS. The --rollover option can be used to control when the file currently being written is rolled over and a new file opened, by providing the rollover size in bytes, kilobytes, megabytes, gigabytes, or terabytes.

Often a stream of data may not have a high enough rate to roll over files frequently, leaving the file in an opened state. This prevents users from reading a consistent set of data when running mapreduce jobs. While one can alleviate this problem by using a small rollover value, a better way is to use the idleTimeout option, which will automatically close the file if there were no writes during the specified period of time. This feature is also useful in cases where a burst of data is written into a stream and you'd like that data to become visible in HDFS.

Note

The idleTimeout value should not exceed the timeout values set on the Hadoop cluster. These are typically configured using the dfs.socket.timeout and/or dfs.datanode.socket.write.timeout properties in the hdfs-site.xml configuration file.

In the above example we changed the source to http in order to control what we write into the hdfs sink. We defined a small rollover size and a timeout of 10 seconds. Now we can simply post data into this stream via the source endpoint using the command below.

xd:> http post --target http://localhost:8000 --data "hello"

If we repeat the command very quickly and then wait for the timeout, we should be able to see that some files were closed before the rollover size was met and some were rolled because the rollover size was reached.

Files can be automatically partitioned using a partitionPath expression. If we create a stream with idleTimeout and a partitionPath with the simple format yyyy/MM/dd/HH/mm, we should see writes ending up in their own files within every minute boundary.
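Such a stream might be sketched as (the option values are illustrative):

```shell
xd:> stream create --name parttest --definition "time | hdfs --idleTimeout=10000 --partitionPath=dateFormat('yyyy/MM/dd/HH/mm')" --deploy
```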

Partitioning can also be based on defined lists. In the example below, we simulate feeding data by using time and transform elements. The data passed to the hdfs sink has the content APP0:foobar, APP1:foobar, APP2:foobar or APP3:foobar.

Partitioning can also be based on defined ranges. In the example below, we simulate feeding data by using time and transform elements. The data passed to the hdfs sink has content ranging from APP0 to APP15. We simply parse the number part and use it to partition with the ranges {3,5,10}.

Partitioning using a dateFormat can be based on the content itself. This is a good use case if old log files need to be processed where partitioning should happen based on the timestamp of a log entry. We create fake log data with a simple date string ranging from 1970-01-10 to 1970-01-13.

rollover

threshold in bytes when the file will be automatically rolled over (String, default: 1G)

Note

In the context of the fileOpenAttempts option, an attempt is either one rollover request or a failed stream open request for a path (if another writer came up with the same path and already opened it).

Partition Path Expression

The SpEL expression is evaluated against a Spring Messaging Message passed internally into the HDFS writer. This allows the expression to use the headers and payload of that message. While you could do custom processing within a stream and add custom headers, the timestamp is always going to be there. The data to be written is then available in the payload.

Accessing Properties

Using payload simply returns whatever is currently being written. Access to headers is via the headers property. Any other property is automatically resolved from the headers if found; for example, headers.timestamp is equivalent to timestamp.

Custom Methods

In addition to the normal SpEL functionality, a few custom methods have been added to make it easier to build partition paths. These custom methods can be used to work with normal partitioning concepts like date formatting, lists, ranges and hashes.

path

path(String... paths)

Concatenates paths together with the delimiter /. This method can be used to make the expression less verbose compared with using native SpEL functionality to combine path parts together. To create the path part1/part2, the expression 'part1' + '/' + 'part2' is equivalent to path('part1','part2').

dateFormat

Creates a path using date formatting. Internally this method delegates to SimpleDateFormat and needs a Date and a pattern. By default, if no parameter to be used for conversion is given, timestamp is expected. Effectively dateFormat('yyyy') equals dateFormat('yyyy', timestamp) or dateFormat('yyyy', headers.timestamp).

The method signature with three parameters can be used to create a custom Date object which is then passed to the SimpleDateFormat conversion using the dateformat pattern. This is useful in use cases where the partition should be based on a date or time string found in the payload content itself. The default dateformat pattern, if omitted, is yyyy-MM-dd.

Parameters

pattern

Pattern compatible with SimpleDateFormat to produce a final output.

epoch

Timestamp as Long which is converted into a Date.

date

A Date to be formatted.

dateformat

Secondary pattern to convert datestring into a Date.

datestring

Date as a String

Return Value

A path part representation which can be a simple file or directory name or a directory structure.

list

list(Object source, List<List<Object>> lists)

Creates a partition path part by matching a source against lists denoted by lists.

Let's assume that data is being written and it's possible to extract an appid either from the headers or the payload. We can automatically do a list-based partition by using the partition method list(headers.appid,{{'1TO3','APP1','APP2','APP3'},{'4TO6','APP4','APP5','APP6'}}). This method would create three partitions: 1TO3_list, 4TO6_list and list. The latter is used if no match is found in the partition lists passed to lists.

Parameters

source

An Object to be matched against lists.

lists

A definition of list of lists.

Return Value

A path part prefixed with a matched key i.e. XXX_list or list if no match.

range

range(Object source, List<Object> list)

Creates a partition path part by matching a source against a list denoted by list using a simple binary search.

The partition method takes a source as the first argument and a list as the second argument. Behind the scenes this uses the JVM's binarySearch, which works on an Object level, so we can pass in anything. Remember that a meaningful range match only works if the passed-in Object and the types in the list are of the same type, like Integer. The range is defined by binarySearch itself, so it mostly matches against an upper bound, except for the last range in the list. Having a list of {1000,3000,5000} means that everything above 3000 will be matched with 5000. If that is an issue, then simply adding Integer.MAX_VALUE as the last range would overflow everything above 5000 into a new partition. The created partitions would then be 1000_range, 3000_range and 5000_range.

Parameters

source

An Object to be matched against list.

list

A definition of list.

Return Value

A path part prefixed with a matched key i.e. XXX_range.

hash

hash(Object source, int bucketcount)

Creates a partition path part by calculating a hash key using the source's hashCode and bucketcount. Using the partition method hash(timestamp,2) would then create partitions named 0_hash, 1_hash and 2_hash. The number suffixed with _hash is simply calculated using Object.hashCode() % bucketcount.

Parameters

source

An Object whose hashCode will be used.

bucketcount

A number of buckets

Return Value

A path part prefixed with a hash key i.e. XXX_hash.
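As an illustrative sketch (the stream name and the choice of sink options here are hypothetical, not from the reference example set), the custom methods above can be combined in a single partition path expression on the hdfs sink:

```
xd:> stream create --name partitionedticks --definition "time | hdfs --partitionPath=path(dateFormat('yyyy/MM/dd'),hash(payload,2))" --deploy
```

With this expression, each message would land under a date-based directory such as 2014/08/11 with a hash-bucket subdirectory appended, relative to the sink's base directory.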

HDFS Dataset (Avro/Parquet) (hdfs-dataset)

The HDFS Dataset sink is used to store Java classes that are sent as the payload on the stream. It uses the Kite SDK Data Module's Dataset implementation to store the payload data serialized in either Avro or Parquet format. The Avro schema is generated from the Java class that is persisted. For Parquet the Java object must follow JavaBean conventions with properties for any fields to be persisted. The fields can only be simple scalar values like Strings and numbers.

The HDFS Dataset sink requires that you have a Hadoop installation that is based on Hadoop v2 (Hadoop 2.2.0, Pivotal HD 1.0, Cloudera CDH4 or Hortonworks HDP 2.0), see using Hadoop for more information on how to start Spring XD to target a specific distribution.

Once Hadoop is up and running, you can then use the hdfs-dataset sink when creating a stream
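A minimal stream definition consistent with the description that follows might look like this (the stream name mydataset matches the directory example below; the --batchSize value is an assumption chosen to match the batches of 20 payloads mentioned later):

```
xd:> stream create --name mydataset --definition "time | hdfs-dataset --batchSize=20" --deploy
```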

In the above example, we’ve scheduled time source to automatically send ticks to the hdfs-dataset sink once every second. The data will be stored in a directory named /xd/<streamname> by default, so in this example it will be /xd/mydataset. You can change this by supplying a --basePath parameter and/or --namespace parameter. The --basePath defaults to /xd and the --namespace defaults to <streamname>. The Avro format is used by default and the data files are stored in a sub-directory named after the payload Java class. In this example the stream payload is a String so the name of the data sub-directory is string. If you have multiple Java classes as payloads, each class will get its own sub-directory.

Let the stream run for a minute or so. You can then list the contents of the hadoop filesystem using the shell's built-in hadoop fs commands. You will first need to configure the shell to point to your name node using the hadoop config command. We use the hdfs protocol to access the hadoop name node.
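Assuming a name node on localhost port 8020 (adjust the host and port for your installation), the shell commands would look something like:

```
xd:> hadoop config fs --namenode hdfs://localhost:8020
xd:> hadoop fs ls /xd/mydataset
```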

You can see that the sink has created two files containing the first two batches of 20 stream payloads each. There is also a .metadata directory created that contains the metadata that the Kite SDK Dataset implementation uses as well as the generated Avro schema for the persisted type.

The hdfs-dataset sink has the following options:

namespace

the sub-directory under the basePath where files will be written (String, default: <stream name>)

partitionPath

the partition path strategy to use, a list of KiteSDK partition expressions separated by a '/' symbol (String, default: ``)

writerCacheSize

the size of the cache to be used for partition writers (10 if omitted) (int, default: -1)

About null values

If allowNullValues is set to true then each field in the generated schema will use a union of null and the data type of the field. You can also set allowNullValues to false and instead annotate fields in a POJO using Avro’s org.apache.avro.reflect.Nullable annotation to create a schema using a union with null for that annotated field.

About partitionPath

The partitionPath option lets you specify one or more paths that will be used to partition the files that the data is written to based on the content of the data. You can use any of the FieldPartitioners that are available for the Kite SDK project. We simply pass in what is specified to create the corresponding partition strategy. You can separate multiple paths with a / character. The following partitioning functions are available:

year, month, day, hour, minute creates partitions based on the value of a timestamp and creates directories named like "YEAR=2014" (works well with fields of datatype long)

specify function plus field name like: year('timestamp')

dateformat creates partitions based on a timestamp and a dateformat expression provided - creates directories based on the name provided (works well with fields of datatype long)

specify function plus field name, a name for the partition and the date format like: dateFormat('timestamp', 'Y-M', 'yyyyMM')

range creates partitions based on a field value and the upper bounds for each bucket that is specified (works well with fields of datatype int and string)

specify function plus field name and the upper bounds for each partition bucket like: range('age',20,50,80,T(Integer).MAX_VALUE) (Note that you can use SpEL expressions like we just did for the Integer.MAX_VALUE)

identity creates partitions based on the exact value of a field (works well with fields of datatype string, long and int)

specify function plus field name, a name for the partition, the type of the field (String or Integer) and the number of values/buckets for the partition like: identity('region','R',T(String),10)

hash creates partitions based on the hash calculated from the value of a field divided into a number of buckets that is specified (works well with all data types)

specify function plus field name and number of buckets like: hash('lastname',10)

Multiple expressions can be specified by separating them with a / like: identity('region','R',T(String),10)/year('timestamp')/month('timestamp')
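For example, a stream using some of these partitioning functions could be defined as follows (the stream name and the field names timestamp and region are hypothetical and must exist in your payload):

```
xd:> stream create --name partitionedDataset --definition "http | hdfs-dataset --partitionPath=year('timestamp')/month('timestamp')" --deploy
```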

JDBC

The JDBC sink can be used to insert message payload data into a relational database table. By default it inserts the entire payload into a table named after the stream name in the HSQLDB database that XD uses to store metadata for batch jobs. To alter this behavior, the jdbc sink accepts several options that you can pass using the --foo=bar notation in the stream, or change globally. There is also a config/init_db.sql file that contains the SQL statements used to initialize the database table. You can modify this file if you’d like to create a table with your specific layout when the sink starts. You should also change the initializeDatabase property to true to have this script execute when the sink starts up.

The payload data will be inserted as-is if the names option is set to payload. This is the default behavior. If you specify any other column names, the payload data will be assumed to be a JSON document that will be converted to a hash map. This hash map will be used to populate the data values for the SQL insert statement. Column names with underscores, such as user_name, will be matched to camel-case keys, such as userName, in the hash map. One insert statement is executed for each message.

To create a stream using a jdbc sink relying on all defaults you would use a command like
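A sketch of such a command, using the stream name mydata that the following paragraph refers to:

```
xd:> stream create --name mydata --definition "time | jdbc" --deploy
```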

This will insert the time messages into a payload column in a table named mydata. Since the default is using the XD batch metadata HSQLDB database we can connect to this database instance from an external tool. After we let the stream run for a little while, we can connect to the database and look at the data stored in the database.

You can query the database with your favorite SQL tool using the following database URL: jdbc:hsqldb:hsql://localhost:9101/xdjob with sa as the user name and a blank password. You can also use the HSQL provided SQL Tool (download from HSQLDB) to run a quick query from the command line:
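As a sketch (the location of the hsqldb jar is installation-specific and shown here as a placeholder), a quick query with the HSQL SqlTool could look like:

```
$ java -cp /path/to/hsqldb.jar org.hsqldb.cmdline.SqlTool --inlineRc url=jdbc:hsqldb:hsql://localhost:9101/xdjob,user=sa,password= --sql "SELECT payload FROM mydata;"
```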

Note

If you access any database other than HSQLDB or Postgres in a stream module then the JDBC driver jar for that database needs to be present in the $XD_HOME/lib directory.

Kafka

The Kafka sink has the following options:

ackTimeoutOnServer

the maximum amount of time the server will wait for acknowledgments from followers to meet the acknowledgment requirements the producer has specified with the acks configuration (int, default: 30000)

batchBytes

batch size in bytes, per partition (int, default: 16384)

blockOnBufferFull

whether to block or not when the memory buffer is full (boolean, default: true)

brokerList

comma separated broker list (String, default: localhost:9092)

bufferMemory

the total bytes of memory the producer can use to buffer records waiting to be sent to the server (int, default: 33554432)

compressionCodec

compression codec to use (String, default: none)

maxBufferTime

the amount of time, in ms that the producer will wait before sending a batch to the server (int, default: 0)

maxRequestSize

the maximum size of a request (int, default: 1048576)

maxSendRetries

number of attempts to automatically retry a failed send request (int, default: 3)

receiveBufferBytes

the size of the TCP receive buffer to use when reading data (int, default: 32768)

reconnectBackoff

the amount of time to wait before attempting to reconnect to a given host when a connection fails (long, default: 10)

requestRequiredAck

producer request acknowledgement mode (int, default: 0)

retryBackoff

the amount of time to wait before attempting to retry a failed produce request to a given topic partition (long, default: 100)

sendBufferBytes

the size of the TCP send buffer to use when sending data (int, default: 131072)

topic

kafka topic name (String, default: <stream name>)

topicMetadataFetchTimeout

the maximum amount of time to block waiting for the metadata fetch to succeed (int, default: 60000)

topicMetadataRefreshInterval

the period of time in milliseconds after which a refresh of metadata is forced (int, default: 300000)

Log

Probably the simplest option for a sink is just to log the data. The log sink uses the application logger to output the data for inspection. The log level is set to WARN and the logger name is created from the stream name. To create a stream using a log sink you would use a command like
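A sketch of such a command (the stream name is arbitrary):

```
xd:> stream create --name mylogstream --definition "time | log" --deploy
```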

Mail

The "mail" sink allows sending of messages as emails, leveraging Spring Integration mail-sending channel adapter. Please refer to Spring Integration documentation for the details, but in a nutshell, the sink is able to handle String, byte[] and MimeMessage messages out of the box.
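A sketch of a stream definition (the host, port and address are placeholders; note how the to parameter is escaped as a SpEL String literal):

```
xd:> stream create --name mailstream --definition "http --port=9090 | mail --to='\"your.email@example.com\"' --host=your.smtp.server --subject=payload+' world'" --deploy
xd:> http post --target http://localhost:9090 --data Hello
```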

You would then receive an email whose body contains "Hello" and whose subject is "Hello world". Of special note here is the way you need to escape strings for most of the parameters, because they're actually SpEL expressions (so here, for example, we used a String literal for the to parameter).

The mail sink has the following options:

bcc

the recipient(s) that should receive a blind carbon copy (SpEL) (String, default: null)

cc

the recipient(s) that should receive a carbon copy (SpEL) (String, default: null)

contentType

the content type to use when sending the email (SpEL) (String, default: null)

from

the sender address of the email (SpEL) (String, default: null)

host

the hostname of the mail server (String, default: localhost)

password

the password to use to connect to the mail server (String, no default)

port

the port of the mail server (int, default: 25)

replyTo

the address that will become the recipient if the original recipient decides to "reply to" the email (SpEL) (String, default: null)

subject

the email subject (SpEL) (String, default: null)

to

the primary recipient(s) of the email (SpEL) (String, default: null)

username

the username to use to connect to the mail server (String, no default)

Mongo

The Mongo sink writes into a Mongo collection. Here is a simple example
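A sketch of such a stream (the stream name, database and collection names are illustrative):

```
xd:> stream create --name attendees --definition "http | mongodb --databaseName=test --collectionName=names" --deploy
```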

the username to use when connecting to the broker (String, default: guest)

Note

The defaults are set up to connect to the RabbitMQ MQTT adapter on localhost.

Null Sink (null)

Null sink can be useful when the main stream isn’t focused on a specific destination but drives taps used for analytics etc.
It is also useful to iteratively add in steps to a stream without worrying about where the data will end up.

Rabbit

This sends the time, every 3 seconds, to the default (no-name) Exchange for a RabbitMQ broker running on localhost, port 5672.

The routing key will be the name of the stream by default; in this case: "rabbittest". Since the default Exchange is a direct-exchange to which all Queues are bound with the Queue name as the binding key, all messages sent via this sink will be passed to a Queue named "rabbittest", if one exists. We do not create that Queue automatically. However, you can easily create a Queue using the RabbitMQ web UI. Then, using that same UI, you can navigate to the "rabbittest" Queue and click the "Get Message(s)" button to pop messages off of that Queue (you can choose whether to requeue those messages).

Redis

Options

collectionType

the collection type to use for the given key (CollectionType, default: LIST, possible values: LIST,SET,ZSET,MAP,PROPERTIES)

database

database index used by the connection factory (int, default: 0)

hostname

redis host name (String, default: localhost)

key

name for the key (String, no default)

keyExpression

a SpEL expression to use for keyExpression (String, no default)

maxActive

max number of connections that can be allocated by the pool at a given time; negative value for no limit (int, default: 8)

maxIdle

max number of idle connections in the pool; a negative value indicates an unlimited number of idle connections (int, default: 8)

maxWait

max amount of time (in milliseconds) a connection allocation should block before throwing an exception when the pool is exhausted; negative value to block indefinitely (int, default: -1)

minIdle

target for the minimum number of idle connections to maintain in the pool; only has an effect if it is positive (int, default: 0)

password

redis password (String, default: ``)

port

redis port (int, default: 6379)

queue

name for the queue (String, no default)

queueExpression

a SpEL expression to use for queue (String, no default)

sentinelMaster

name of Redis master server (String, default: ``)

sentinelNodes

comma-separated list of host:port pairs (String, default: ``)

topic

name for the topic (String, no default)

topicExpression

a SpEL expression to use for topic (String, no default)

Shell Sink (shell)

The shell sink forks an external process by running a shell command to launch a process written in any language. The process should implement a continual loop that waits for and consumes input from stdin. The process will be destroyed when the stream is undeployed. For example, it is possible to invoke a Python script within a stream in this manner. Since the shell sink relies on low-level stream processing there are some additional requirements:

Input data is expected to be a String, the charset is configurable.

Anything written to stderr will be logged as an ERROR in Spring XD but will not terminate the stream.

All messages must be terminated using the configured encoder (CRLF or "\r\n" is the default) for the module and must not exceed the configured bufferSize (see the detailed description of encoders in the TCP section).

Any external software required to run the script must be installed on the container node to which the module is deployed.
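For example, a hypothetical Python script that loops over stdin could be wired in like this (the script name and path are illustrative):

```
xd:> stream create --name pytest --definition "time | shell --command='python echo.py'" --deploy
```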

Note the 4 byte length field preceding the data generated by the L4 encoder.

Counters and Gauges

Counters and gauges are analytical data structures collectively referred to as metrics. Metrics can be used directly in place of a sink just as if you were creating any other stream, but you can also analyze data from an existing stream using a tap. We’ll look at some examples of using metrics with taps in the following sections. As a prerequisite start the XD Container as instructed in the Getting Started page.

Spring XD supports these metrics and analytical data structures as a general purpose class library that works with several backend storage technologies. The 1.0 release provides in memory and Redis implementations.

Tip

As of Spring XD 1.2 you can now create data visualizations for the various counters and gauges using the Admin UI. Please see the Admin UI Analytics Chapter for more details.

Counter

A counter is a Metric that associates a unique name with a long value. It is primarily used for counting events triggered by incoming messages on a target stream. You create a counter with a unique name and optionally an initial value then set its value in response to incoming messages. The most straightforward use for counter is simply to count messages coming into the target stream. That is, its value is incremented on every message. This is exactly what the counter module provided by Spring XD does.
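A sketch of the counter module in a stream (the stream name, counter name and twittersearch query are illustrative, and twittersearch requires Twitter API credentials to be configured):

```
xd:> stream create --name tweetcount --definition "twittersearch --query=spring | counter --name=tweetcount" --deploy
xd:> counter display tweetcount
```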

Field Value Counter (field-value-counter)

The twittersearch source produces JSON strings which contain the user id of the tweeter in the fromUser field. The field-value-counter sink parses the tweet and updates a field value counter named fromUserCount in Redis. To view the counts:

From xd-shell,
xd:> field-value-counter display fromUserCount

Aggregate Counter (aggregate-counter)

The aggregate counter differs from a simple counter in that it not only keeps a total value for the count, but also retains the total count values for each minute, hour, day and month of the period for which it is run. The data can then be queried by supplying a start and end date and the resolution at which the data should be returned.

Creating an aggregate counter is very similar to a simple counter. For example, to obtain an aggregate count for our spring tweets stream:
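A sketch using the tap syntax against a hypothetical existing stream named tweetstream:

```
xd:> stream create --name tweettap --definition "tap:stream:tweetstream > aggregate-counter" --deploy
```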

Rich Gauge (rich-gauge)

A rich gauge is a Metric that holds a double value associated with a unique name. In addition to the value, the rich gauge keeps a running average, along with the minimum and maximum values and the sample count.

The rich-gauge sink provided with XD expects a numeric value as a payload, typically this would be a decimal formatted string, and keeps its value in a store.

Accessing Analytics Data over the RESTful API

Spring XD has a discoverable RESTful API based on the Spring HATEOAS library. You can discover the resources available by making a GET request on the root resource of the Admin server. Here is an example where we navigate down to find the data for a counter named httptap that was created by these commands
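Assuming the Admin server is running on its default port 9393, discovery starts with a GET on the root resource:

```
$ curl http://localhost:9393/
```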

Jobs

This section describes the job modules included with Spring XD. For a general overview of creating, deploying, and launching batch jobs in Spring XD, see the Batch Jobs section. To run the examples shown here, start the XD Container
as instructed in the Getting Started page.

Import CSV Files to HDFS (filepollhdfs)

This module is designed to be driven by a stream polling a directory. It imports data from CSV files and requires that you supply a list of named columns for the data using the names parameter. For example:
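A sketch of the job and the driving stream (the job name, directory and field names are illustrative):

```
xd:> job create myhdfsjob --definition "filepollhdfs --names=forename,surname,address" --deploy
xd:> stream create --name csvStream --definition "file --ref=true --dir=/mycsvdir --pattern=*.csv > queue:job:myhdfsjob" --deploy
```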

The filepollhdfs job has the following options:

directory

the directory to write the file(s) to in HDFS (String, default: /xd/<job name>)

fileExtension

the file extension to use (String, default: csv)

fileName

the filename to use in HDFS (String, default: <job name>)

fsUri

the URI to use to access the Hadoop FileSystem (String, default: ${spring.hadoop.fsUri})

names

the field names in the CSV file (String, no default)

restartable

whether the job should be restartable or not in case of failure (boolean, default: false)

rollover

the number of bytes to write before creating a new file in HDFS (int, default: 1000000)

Import CSV Files to JDBC (filejdbc)

A module which loads CSV files into a JDBC table using a single batch job. By default it uses the internal HSQL DB which is used by Spring Batch. Refer to how module options are resolved for further details on how to change defaults (one can of course always use --foo=bar notation in the job definition to achieve the same effect).

Note

If you access any database other than HSQLDB or Postgres in a job module then the JDBC driver jar for that database needs to be present in the $XD_HOME/lib directory.

The filejdbc job has the following options:

abandonWhenPercentageFull

connections that have timed out won't get closed and reported up unless the number of connections in use is above the percentage (int, default: 0)

tableName

the database table to which the data will be written (String, default: <job name>)

testOnBorrow

indication of whether objects will be validated before being borrowed from the pool (boolean, default: false)

testOnReturn

indication of whether objects will be validated before being returned to the pool (boolean, default: false)

testWhileIdle

indication of whether objects will be validated by the idle object evictor (boolean, default: false)

timeBetweenEvictionRunsMillis

number of milliseconds to sleep between runs of the idle connection validation/cleaner thread (int, default: 5000)

url

the JDBC URL for the database (String, no default)

useEquals

true if you wish the ProxyConnection class to use String.equals (boolean, default: true)

username

the JDBC username (String, no default)

validationInterval

avoid excess validation, only run validation at most at this frequency - time in milliseconds (long, default: 30000)

validationQuery

sql query that will be used to validate connections from this pool (String, no default)

validatorClassName

name of a class which implements the org.apache.tomcat.jdbc.pool.Validator (String, no default)

The job should be defined with the resources parameter defining the files which should be loaded. It also requires a names parameter (for the CSV field names) and these should match the database column names into which the data should be stored. You can either pre-create the database table or the module will create it for you if you use --initializeDatabase=true when the job is created. The table initialization is configured in a similar way to the JDBC sink and uses the same parameters. The default table name is the job name and can be customized by setting the tableName parameter. As an example, if you run the command
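A sketch of such a command, consistent with the "people" table example described next (the resources path is a placeholder):

```
xd:> job create myjob --definition "filejdbc --resources=file:///mycsvdir/*.csv --names=forename,surname,address --tableName=people --initializeDatabase=true" --deploy
```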

it will create the table "people" in the database with three varchar columns called "forename", "surname" and "address". When you launch the job it will load the files matching the resources pattern and write the data to this table. As with the filepollhdfs job, this module also supports the deleteFiles parameter which will remove the files defined by the resources parameter on successful completion of the job.

Launch the job using:

xd:> job launch myjob

Tip

The connection pool settings for xd are located in servers.yml (i.e. spring.datasource.* )

Import FTP to HDFS (ftphdfs)

Copies files from an FTP directory into HDFS. The job is partitioned in such a way that each
separate file copy is executed in its own partitioned step.
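A sketch of creating and launching such a job (the host is a placeholder; the launch parameters shown are an assumption chosen to match the resulting HDFS paths below):

```
xd:> job create myftpjob --definition "ftphdfs --host=ftp.example.com" --deploy
xd:> job launch myftpjob --params {"remoteDirectory":"/pub/files","hdfsDirectory":"/ftp"}
```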

The full path is preserved, so the above command would result in the files in HDFS shown below:

/ftp/pub/files
/ftp/pub/files/file1.txt
/ftp/pub/files/file2.txt

The ftphdfs job has the following options:

fsUri

the URI to use to access the Hadoop FileSystem (String, default: ${spring.hadoop.fsUri})

host

the host name for the FTP server (String, default: localhost)

partitionResultsTimeout

time (ms) that the partition handler will wait for results (long, default: 3600000)

password

the password for the FTP connection (Password, no default)

port

the port for the FTP server (int, default: 21)

restartable

whether the job should be restartable or not in case of failure (boolean, default: false)

username

the username for the FTP connection (String, no default)

Running gpload as a batch job (gpload)

The gpload utility can be deployed and launched from Spring XD as a batch job. The gpload job uses a GploadTasklet that submits a gpload job as an external process. The Spring XD gpload batch job aims to support most of the gpload functionality.

We need to provide the following required options:

gploadHome - this must be the path to where gpload utility is installed. This is usually /usr/local/greenplum-loaders-<version>.

controlFile - this file defines the gpload options in effect for this load job and is documented in the Greenplum Load Tools Reference documentation.

password or passwordFile - you can either specify the password or provide a password file that must follow the general format for a PostgreSQL password file.

Here is an example of a basic load job definition. Please note that some options like host, port, database and username could have been specified in the control file as well.
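A sketch of a basic load job definition (the paths, host, database and username are placeholders):

```
xd:> job create gploadjob --definition "gpload --gploadHome=/usr/local/greenplum-loaders-<version> --controlFile=payload.yml --host=gpdbhost --database=test --username=gpadmin --passwordFile=/home/gpadmin/.pgpass" --deploy
```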

tableName

the database table to which the data will be written (String, default: <job name>)

testOnBorrow

indication of whether objects will be validated before being borrowed from the pool (boolean, default: false)

testOnReturn

indication of whether objects will be validated before being returned to the pool (boolean, default: false)

testWhileIdle

indication of whether objects will be validated by the idle object evictor (boolean, default: false)

timeBetweenEvictionRunsMillis

number of milliseconds to sleep between runs of the idle connection validation/cleaner thread (int, default: 5000)

url

the JDBC URL for the database (String, no default)

useEquals

true if you wish the ProxyConnection class to use String.equals (boolean, default: true)

username

the JDBC username (String, no default)

validationInterval

avoid excess validation, only run validation at most at this frequency - time in milliseconds (long, default: 30000)

validationQuery

sql query that will be used to validate connections from this pool (String, no default)

validatorClassName

name of a class which implements the org.apache.tomcat.jdbc.pool.Validator (String, no default)

Tip

The connection pool settings for xd are located in servers.yml (i.e. spring.datasource.* )

Export HDFS to MongoDB (hdfsmongodb)

Exports CSV data from HDFS and stores it in a MongoDB collection which defaults to the job name. This can be overridden with the collectionName parameter. Once again, the field names should be defined by supplying the names parameter. The data is converted internally to a Spring XD Tuple and the collection items will have an id matching the tuple’s UUID. You can override this by setting the idField parameter to one of the field names if desired.

Import JDBC to HDFS (jdbchdfs)

Performs the reverse of the previous module. The database configuration is the same as for filejdbc but without the initialization options since you need to already have the data to import into HDFS. When creating the job, you must either supply the select statement by setting the sql parameter, or you can supply both tableName and columns options (which will be used to build the SQL statement).

You can customize how the data is written to HDFS by supplying the options directory (defaults to /xd/(job name)), fileName (defaults to job name), rollover (in bytes, default 1000000) and fileExtension (defaults to csv).

Launch the job using:

xd:> job launch myjob

Note

If you access any database other than HSQLDB or Postgres in a job module then the JDBC driver jar for that database needs to be present in the $XD_HOME/lib directory.

If you want to partition your job across multiple XD containers you can provide the partitionColumn and partitions option. When the job is launched the partitioner will query the database for the range of values and evenly divide the load between the partitions. This assumes that there is an even distribution of column values in the table. When using the partitioning support you must also use the tableName and columns options instead of the sql option. This is so the partitioner can construct the queries with the appropriate where clauses for the different partitions.

When using the partitioning support you cannot use the sql option. Use tableName and columns instead.

You can perform incremental imports using this job by defining a column to check against. Currently the column must be numeric (similar to how the partitionColumn works). An example of launching a job that performs incremental imports would look like the following:
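A sketch of such a job (the table, column names and the --checkColumn option value are illustrative):

```
xd:> job create myincrementaljob --definition "jdbchdfs --tableName=mytable --columns=id,name --checkColumn=id" --deploy
xd:> job launch myincrementaljob
```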

There are two things to keep in mind when using incremental imports with this job:

When using incremental imports, the sql option is not available. Use tableName and columns instead.

If an import fails, it must be rerun to completion before running the next import. Without this, inconsistent data may result. Since HDFS is a non-transactional store, failed records may not be rolled back. An administrator may need to check HDFS for completeness and the last imported value.

The jdbchdfs job has the following options:

abandonWhenPercentageFull

connections that have timed out won't get closed and reported up unless the number of connections in use is above the percentage (int, default: 0)

testOnBorrow

indication of whether objects will be validated before being borrowed from the pool (boolean, default: false)

testOnReturn

indication of whether objects will be validated before being returned to the pool (boolean, default: false)

testWhileIdle

indication of whether objects will be validated by the idle object evictor (boolean, default: false)

timeBetweenEvictionRunsMillis

number of milliseconds to sleep between runs of the idle connection validation/cleaner thread (int, default: 5000)

url

the JDBC URL for the database (String, no default)

useEquals

true if you wish the ProxyConnection class to use String.equals (boolean, default: true)

username

the JDBC username (String, no default)

validationInterval

avoid excess validation, only run validation at most at this frequency - time in milliseconds (long, default: 30000)

validationQuery

sql query that will be used to validate connections from this pool (String, no default)

validatorClassName

name of a class which implements the org.apache.tomcat.jdbc.pool.Validator (String, no default)

Tip

The connection pool settings for xd are located in servers.yml (i.e. spring.datasource.* )

Running Spark application as a batch job (sparkapp)

A Spark application can be deployed and launched from Spring XD as a batch job. A SparkTasklet submits the Spark application to the Spark cluster manager using org.apache.spark.deploy.SparkSubmit. Through this approach, you can also launch a Spark application with specific criteria via a Spring XD stream (for instance, a real-time scoring algorithm through an MLlib Spark job can be triggered based on streaming data events). To get started, please refer to the Spark examples here: https://spark.apache.org/examples.html.
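A sketch of creating and launching such a job (the jar path and main class are placeholders; SparkPi ships with the standard Spark examples jar):

```
xd:> job create sparkpi --definition "sparkapp --appJar=/path/to/spark-examples.jar --mainClass=org.apache.spark.examples.SparkPi --master=local" --deploy
xd:> job launch sparkpi
```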

Once the job is launched, go to Spring XD admin-ui to verify the job results.
Jobs → Executions → Select the job to verify that execution context holds the log for Spark application results. If you launch the Spark application through Spark Master, then the results and application status can be verified from SparkUI as well.

The sparkapp job has the following options:

appJar

path to a bundled jar that includes your application and its dependencies - excluding spark (String, no default)

files

comma separated list of files to be placed in the working directory of each executor (String, default: ``)

mainClass

the main class for Spark application (String, no default)

master

the master URL for Spark (String, default: local)

name

the name of the Spark application (String, default: ``)

programArgs

program arguments for the application main class (String, default: ``)
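Putting these options together, a hedged sketch of a sparkapp job definition (the jar path is a placeholder; the main class borrows Spark's bundled SparkPi example):

```
xd:> job create --name sparkpi --definition "sparkapp --mainClass=org.apache.spark.examples.SparkPi --master=local --appJar=/path/to/spark-examples.jar" --deploy
xd:> job launch sparkpi
```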

Running Sqoop as a batch job (sqoop)

A Sqoop job can be deployed and launched from Spring XD as a batch job. The Sqoop job uses a SqoopTasklet and a SqoopRunner that submits a Sqoop job using org.apache.sqoop.Sqoop.runTool. The Spring XD Sqoop batch job aims to support most of the Sqoop functionality, but at this point we have only tested a subset:

import

export

codegen

merge

job

list-tables

Note

The current release supports Sqoop 1.4.5

The intention is to eventually support all features of the Sqoop tool. See Sqoop User Guide for full documentation of the Sqoop features.

Note

If you access any database other than HSQLDB or Postgres in a job module then the JDBC driver jar for that database needs to be present in the $XD_HOME/lib directory.

The definition uses the provided job named sqoop, and the --command option names the Sqoop command we want to run, which in this case is "list-tables".
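For reference, such a definition might look like the following sketch (the job name listTablesJob is arbitrary):

```
xd:> job create --name listTablesJob --definition "sqoop --command=list-tables" --deploy
xd:> job launch listTablesJob
```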

Once the job is launched, go to Spring XD admin-ui to verify the job results.
Jobs → Executions → Select the job to verify that the step execution context holds the log of the Sqoop Tool execution results. You should see some tables listed there. Since we didn’t provide any connection arguments, Spring XD will by default use the batch repository database for the Sqoop Tool execution. We could specify a different database using the --url, --username and --password options for the job:
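A hedged sketch of such a definition (the URL and credentials are placeholders):

```
xd:> job create --name listTablesJob --definition "sqoop --command=list-tables --url=jdbc:mysql://localhost:3306/test --username=myuser --password=mypasswd" --deploy
```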

Here we connect to a local MySQL database. It’s important to note that you need to provide the MySQL JDBC driver jar in the Spring XD lib directory for this to work.

There is also an option to specify connection arguments using the --args option. This allows you to use the same arguments that you are used to providing on the command line when running the Sqoop Tool directly. To connect to the same MySQL database as above using --args we would use:
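A sketch of the equivalent definition using --args (values are placeholders):

```
xd:> job create --name listTablesJob --definition "sqoop --command=list-tables --args='--connect jdbc:mysql://localhost:3306/test --username myuser --password mypasswd'" --deploy
```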

In this example we provided the connection arguments using the --args option. We could also have used the --url, --username and --password options as we did above for the "list-tables" example. The "import" command will use the spring.hadoop.fsUri that is specified when Spring XD starts up. You can override this by providing the --fsUri option when defining the job. The same is true for spring.hadoop.resourceManagerHost and spring.hadoop.resourceManagerPort, which you can override with the --resourceManagerHost and --resourceManagerPort options.
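As a sketch, an export job relying on the default connection might be defined like this (the table name and HDFS directory are placeholders):

```
xd:> job create --name exportJob --definition "sqoop --command=export --args='--table mytable --export-dir /xd/mytable-data'" --deploy
```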

Here we rely on the connection options to default to the same database used for the batch repository. Note that Sqoop requires that the table to export data into must already exist.

Note

If your Sqoop args are more complex, as is the case when you provide a query expression or a where clause, then you will need to use escaping for double quotes used within the --args option. A quick example of using a where clause:
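A hedged sketch (the table name and predicate are placeholders; note the escaped double quotes around the where clause):

```
xd:> job create --name importJob --definition "sqoop --command=import --args='--table mytable --target-dir /xd/mytable --where \"id < 100\"'" --deploy
```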

(For this example we have omitted the equal sign for the individual Sqoop arguments within the --args option. Either style works fine.)

Note

If your Sqoop args use escape sequences (common when working with Hive data) then you should provide double back-slash characters when working with the XD Shell (this effectively escapes the escape character and only one back-slash will be passed on). Here is a brief example:
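A hedged sketch assuming a tab field terminator (\t), which must be written with a double back-slash in the XD Shell:

```
xd:> job create --name importJob --definition "sqoop --command=import --args='--table mytable --fields-terminated-by \\t'" --deploy
```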

Some JDBC drivers and some compression codecs require additional jars to work properly. If this is the case, then you can use the --libjars option to provide a comma separated list of jars to be added to the job execution. You should only specify the name of the jar and not the full path. The jars will be looked up from the classpath and included in the job submitted to the Hadoop cluster.

Note

When using Sqoop’s --as-avrodatafile argument we will automatically include the Avro jars in the Sqoop job submission. No need to specify them as part of the --libjars option.

Note

Advanced Hadoop configuration options can be provided in one of several configuration files. The hadoop-site.xml file is only used by the Sqoop job while the other configuration files are used by all Hadoop related jobs and streams:

$XD_HOME/config/hadoop.properties — just add the property you would like to set:
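For example, to lower the HDFS replication factor used by jobs, one might add the following (assuming the spring.hadoop.config prefix used by Spring XD for passing Hadoop configuration properties):

```
spring.hadoop.config.dfs.replication=1
```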

Using Sqoop’s metastore

It is possible to use Sqoop’s metastore with some restrictions.

Warning

Sqoop ships with HSQLDB version 1.8 and Spring XD ships with HSQLDB version 2.3. Since these two versions are not compatible, you cannot use a Sqoop metastore
that uses HSQLDB. This is unfortunate since HSQLDB version 1.8 is the only database fully supported by Sqoop for the metastore. We can, however, use another database
for the metastore as long as we apply some workarounds.

Note

You can use PostgreSQL for the Sqoop metastore. We recommend that you create and initialize the tables to be used by the Sqoop metastore before running any jobs that use it.

You can now modify the sqoop-site.xml file in the Spring XD config directory. Add the JDBC URL, username and password to use for connecting to the PostgreSQL database
that hosts the Sqoop metastore tables. You need to provide the following properties:

sqoop.metastore.client.autoconnect.url

sqoop.metastore.client.autoconnect.username

sqoop.metastore.client.autoconnect.password
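A sketch of the corresponding sqoop-site.xml entries (the URL and credentials are placeholders):

```xml
<configuration>
    <property>
        <name>sqoop.metastore.client.autoconnect.url</name>
        <value>jdbc:postgresql://localhost:5432/sqoopmetastore</value>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.username</name>
        <value>sqoop</value>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.password</name>
        <value>secret</value>
    </property>
</configuration>
```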

Note

In addition to the above configurations you need to use a --password-file option when creating the Sqoop job definitions. If you don’t then Sqoop will prompt for a password
as Spring XD runs the job. This will cause the job to hang.

Here is an example of defining a Sqoop job using Spring XD’s sqoop job:
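A hedged sketch of such a definition (connection values and the HDFS password file location are placeholders; it creates a saved Sqoop job named myimport):

```
xd:> job create --name sqoopJob --definition "sqoop --command=job --args='--create myimport -- import --table mytable --connect jdbc:postgresql://localhost:5432/test --username myuser --password-file /user/me/.password'" --deploy
```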

Taps

Introduction

A Tap allows you to "listen" to data while it is processed in an existing stream and process the data in a separate stream. The original stream is unaffected by the tap and isn’t aware of its presence, similar to a phone wiretap. (WireTap is included in the standard catalog of EAI patterns and implemented in the Spring Integration EAI framework used by Spring XD).

Simply put, a Tap is a stream that uses a point in another stream as a source.

Example

The following XD shell commands create a stream foo1 and a tap named foo1tap:
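A minimal sketch, assuming a time source and log/file sinks:

```
xd:> stream create --name foo1 --definition "time | log" --deploy
xd:> stream create --name foo1tap --definition "tap:stream:foo1 > file" --deploy
```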

Since a tap is a type of stream, use the stream create command to create the tap. The tap source is specified using the named channel syntax and always begins with tap:. In this case, we are tapping the stream named foo1, specified by tap:stream:foo1.

Note

stream: is required in this case as it is possible to tap alternate XD targets such as jobs. This tap consumes data at the source of the target stream.

A tap can consume data from any point along the target stream’s processing pipeline. XD provides a few ways to tap a stream after a given processor has been applied:

Example - tap after a processor has been applied

If the module name is unique in the target stream, use tap:stream:<stream_name>.<module_name>
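For example, to tap a stream named mystream after its (unique) transform module and write the tapped data to a file (the sink choice is arbitrary):

```
xd:> stream create --name mystreamtap --definition "tap:stream:mystream.transform > file" --deploy
```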

A primary use case for a Tap is to perform real-time analytics at the same time as data is being ingested via its primary stream. For example, consider a stream of data that is consuming Twitter search results and writing them to HDFS. A tap can be created before the data is written to HDFS, and the data piped from the tap to a counter that tracks the number of times specific hashtags were mentioned in the tweets.

Creating a tap on a named channel, a stream whose source is a named channel, or a label is not yet supported. This is planned for a future release.

In cases where multiple modules have the same module name, a label must be specified on the module to be tapped. For example, if you want to tap the second transform:
http | transform --expression=payload.toLowerCase() | tapMe: transform --expression=payload.substring(3) | file

Tap Lifecycle

A side effect of a stream being unaware of any taps on its pipeline is that deleting the stream will not automatically delete the taps. The taps have to be deleted separately. However if the tapped stream is re-created, the existing tap will continue to function.

Analytics

Introduction

Spring XD provides support for the real-time evaluation of various machine learning scoring algorithms as well as simple real-time data analytics using various types of counters and gauges. The analytics functionality is provided via modules that can be added to a stream. In that sense, real-time analytics is accomplished via the same exact model as data ingestion. It’s possible that the primary role of a stream is to perform real-time analytics, but it’s quite common to add a tap to initiate a secondary stream where analytics, e.g. a field-value-counter, is applied to the same data being ingested through a primary stream. You will see both approaches in the examples below.

Predictive analytics

Spring XD’s support for implementing predictive analytics by scoring analytical models that leverage machine learning algorithms begins with an extensible class library foundation upon which implementations can be built, such as the PMML Module that we describe here.

That module integrates with the JPMML-Evaluator library that provides support for a wide range of model types and is interoperable with models exported from R, Rattle, KNIME, and RapidMiner. For counter and gauge analytics, in-memory and Redis implementations are provided.

Incorporating the evaluation of machine learning algorithms into stream processing is as easy as using any other processing module. Here is a simple example
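A hedged sketch of such a stream (the model location and field mappings are placeholders):

```
xd:> stream create --name pmml-scoring --definition "http | analytic-pmml --location=/models/my-model.pmml.xml --inputFieldMapping='field1,field2' --outputFieldMapping='Predicted_Field:predictedField' | log" --deploy
```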

The http source converts posted data to a Tuple. The analytic-pmml processor loads the model from the specified file and creates two mappings so that fields from the Tuple can be mapped into the input and output model names. The log sink writes the payload of the event message to the log file of the XD container.

The next section on analytical models goes into more detail on the general infrastructure.

Analytical Models

We provide some core abstractions for implementing analytical models in stream processing applications.
The main interface for integrating analytical models is Analytic. Some analytical
models need to adjust the domain input and the model output in some way, therefore we provide a special base class MappedAnalytic
which has core abstractions for implementing that mapping via InputMapper and OutputMapper.

Since Spring XD 1.0.0.M6 we support the integration of analytical models, also called statistical models or mining models, that are defined via PMML.
PMML is the abbreviation for Predictive Model Markup Language and is a standard XML representation that allows specifications of different mining models, their ensembles, and associated preprocessing.

Note

PMML is maintained by the Data Mining Group (DMG) and supported by several state-of-the-art statistics and data mining software tools such as InfoSphere Warehouse, R / Rattle, SAS Enterprise Miner, SPSS®, and Weka.
The current version of the PMML specification is 4.2 at the time of this writing.
Applications can produce and consume PMML models, thus allowing an analytical model created in one application to be implemented and used for scoring or prediction in another.

PMML is just one of many technologies that can be integrated to implement analytics; more will follow in upcoming releases.

Modeling and Evaluation

Analytical models are usually defined by a statistician (also known as a data scientist or quant) using a statistical tool to analyze the data and build an appropriate model.
In order to implement those models in a business application they are usually transformed and exported in some way (e.g. in the form of a PMML definition).
This model is then loaded into the application, which evaluates it against a given input (event, tuple, example).

Modeling

Analytical models can be defined in various ways. For the sake of brevity we use R from the r-project to demonstrate
how easy it is to export an analytical model to PMML and use it later in stream processing.

For our example we use the iris example dataset in R to generate a classifier for iris flower species by applying the Naive Bayes algorithm.

Evaluation

The above defined PMML model can be evaluated in a Spring XD stream definition by using the analytic-pmml module as a processor
in your stream definition. The actual evaluation of the PMML is performed via the PmmlAnalytic which uses the jpmml-evaluator library.

Model Selection

The PMML standard allows multiple models to be defined within a single PMML document.
The model to be used can be configured through the modelName option.

NOTE The PMML standard also supports other ways of selecting models, e.g. based on a predicate. This is currently not supported.

In order to perform the evaluation in Spring XD you need to save the generated PMML document to some folder, typically with the extension "pmml.xml".
For this example we save the PMML document under the name iris-flower-classification-naive-bayes-1.pmml.xml.

In the following example we set up a stream definition with an http source that produces iris-flower-records
that are piped to the analytic-pmml module which applies our iris flower classifier to predict the species of a given flower record.
The result is a new record extended by a new attribute predictedSpecies, which is simply sent to a log sink.

The definition of the stream, which we call iris-flower-classification, looks as follows:
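A sketch of that definition, assuming the PMML file is saved under /models and mapping the Tuple field names to the model's iris attribute names:

```
xd:> stream create --name iris-flower-classification --definition "http --outputType=application/x-xd-tuple | analytic-pmml --location=/models/iris-flower-classification-naive-bayes-1.pmml.xml --inputFieldMapping='sepalLength:Sepal.Length,sepalWidth:Sepal.Width,petalLength:Petal.Length,petalWidth:Petal.Width' --outputFieldMapping='Predicted_Species:predictedSpecies' | log" --deploy
```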

The location parameter can be used to specify the exact location of the pmml document. The value must be a valid spring resource location

The inputFieldMapping parameter defines a mapping of domain input fields to model input fields. It is just a list of fields or optional field:alias mappings to control which fields and how they are going to end up in the model-input. If no inputFieldMapping is defined then all domain input fields are used as model input.

The outputFieldMapping parameter defines a mapping of model output fields to domain output fields with semantics analog to the inputFieldMapping.

The optional modelName parameter of the analytic-pmml module can be used to refer to a particular named model within the PMML definition. If modelName is not defined the first model is selected by default.

NOTE Some analytical models, for instance association rules, require a different type of mapping. You can implement your own custom mapping strategies by implementing a custom InputMapper and OutputMapper
and defining a new PmmlAnalytic or TuplePmmlAnalytic bean that uses your custom mappers.

After the stream has been successfully deployed to Spring XD we can eventually start to throw some data at it by issuing the following http request via the XD-Shell (or curl, or any other tool):
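A sketch of such a request, using the measurement values from the first test record shown below (the port is assumed to be the http source default):

```
xd:> http post --target http://localhost:9000 --data '{"sepalLength":6.4,"sepalWidth":3.2,"petalLength":4.5,"petalWidth":1.5}'
```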

Note that our example record contains no information about which species the example belongs to - this will be added by our classifier.

NOTE the generated field predictedSpecies which now identifies our input as belonging to the iris species versicolor.

We verify that the generated PMML classifier produces the same result as R by issuing the following commands in R:

> datasets$testset[,1:4][1,]
# This is the first example record that we sent via the http post.
Sepal.Length Sepal.Width Petal.Length Petal.Width
52 6.4 3.2 4.5 1.5
#Predict the class for the example record by using our naiveBayes model.
> predict(model, datasets$testset[,1:4][1,])
[1] versicolor

Counters and Gauges included with Spring XD

See the Counters and Gauges section for a list of available metrics included with Spring XD, along with configuration options and examples.

Tuples

Introduction

The Tuple class is a central data structure in Spring XD. It is an ordered list of values that can be retrieved by name or by index. Tuples are created by a TupleBuilder and are immutable. The values that are stored can be of any type and null values are allowed.

The underlying Message class that moves data from one processing step to the next can have an arbitrary data type as its payload. Instead of creating a custom Java class that encapsulates the properties of what is read or set in each processing step, the Tuple class can be used instead. Processing steps can be developed that read data from specific named values and write data to specific named values.

There are accessor methods that perform type conversion to the basic primitive types as well as BigDecimal and Date. This saves you from having to cast the values to specific types. Instead you can rely on the Tuple’s type conversion infrastructure to perform the conversion.

The Tuple’s types conversion is performed by Spring’s Type Conversion Infrastructure which supports commonly encountered type conversions and is extensible.

There are several overloads for getters that let you provide default values for primitive types should the field you are looking for not be found. Date format patterns and Locale aware NumberFormat conversion are also supported. A best effort has been made to preserve the functionality available in Spring Batch’s FieldSet class that has been extensively used for parsing String based data in files.

Creating a Tuple

The TupleBuilder class is how you create new Tuple instances. The most basic case is

Tuple tuple = TupleBuilder.tuple().of("foo", "bar");

This creates a Tuple with a single entry, a key of foo with a value of bar. You can also use a static import to shorten the syntax.
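With a static import of TupleBuilder.tuple, the example above shortens to:

```java
import static org.springframework.xd.tuple.TupleBuilder.tuple;

Tuple tuple = tuple().of("foo", "bar");
```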

To customize the underlying type conversion system you can specify the DateFormat to use for converting String to Date as well as the NumberFormat to use based on a Locale. For more advanced customization of the type conversion system you can register an instance of a FormattingConversionService. Use the appropriate setter methods on TupleBuilder to make these customizations.

You can also create a Tuple from a list of String field names and a List of Object values.

Gradle Dependencies

If you wish to use Spring XD Tuples in your project, add the following dependencies:

//Add this repo to your repositories if it does not already exist.
maven { url "http://repo.spring.io/libs-snapshot"}
//Add this dependency
compile 'org.springframework.xd:spring-xd-tuple:1.3.3.BUILD-SNAPSHOT'

Type Conversion

Introduction

Spring XD allows you to declaratively configure type conversion in stream definitions using the inputType and outputType module options. Note that general type conversion may also be accomplished easily within a transformer or a custom module. Currently, Spring XD natively supports the following type conversions commonly used in streams:

Object to/from byte[] : Either the raw bytes serialized for remote transport, bytes emitted by a module, or converted to bytes using Java serialization (requires the object to be Serializable)

String to/from byte[]

Object to plain text (invokes the object’s toString() method)

Object to/from JSON

Here JSON represents either a byte array or String payload containing JSON. Currently, Objects may be converted from a JSON byte array or String. Converting to JSON always produces a String. Registration of custom type converters is covered in this section.

MIME types

inputType and outputType values are parsed as media types, e.g., application/json or text/plain;charset=UTF-8. MIME types are especially useful for indicating how to convert to String or byte[] content. Spring XD also uses MIME type format to represent Java types, using the general type application/x-java-object with a type parameter. For example, application/x-java-object;type=java.util.Map or application/x-java-object;type=com.bar.Foo can be used in an inputType conversion: --inputType='application/x-java-object;type=com.bar.Foo'. Note that the type is quoted with '' to avoid parsing errors produced by the ; character or other invalid characters. For convenience, you can use the class name by itself and Spring XD will translate a valid class name to the corresponding MIME type. In addition, Spring XD provides custom MIME types, notably, application/x-xd-tuple to specify a Tuple.

Stream Definition Examples

POJO to JSON

Type conversion will likely come up when implementing a custom module which produces or consumes a custom domain object. For example, you want to create a stream that integrates with a legacy system that includes custom domain types in its API. To process custom domain types directly minimally requires these types to be defined in Spring XD’s class path. This approach will be cumbersome to maintain when the domain model changes. The recommended approach is to convert such types to JSON at the source, or back to POJO at the sink. You can do this by declaring the required conversions in the stream definition:
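A hedged sketch of such a definition (the module names my-legacy-source, p1, p2, and my-legacy-sink are hypothetical; com.bar.Foo stands in for the domain type):

```
my-legacy-source --outputType=application/json | p1 | p2 | my-legacy-sink --inputType='application/x-java-object;type=com.bar.Foo'
```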

Note that the sink above does require the declared type to be in the module’s classpath to perform the JSON to POJO conversion. Generally, POJO to JSON does not require the Java class. Once the payload is converted to JSON, Spring XD provided transformers and filters (p1, p2, etc.) can evaluate the payload contents using JsonPath functions in SpEL expressions. Alternately, you can convert the JSON to a Tuple, as shown in the following example.

JSON to Tuple

Sometimes it is convenient to convert JSON content to a Tuple in order to evaluate and access individual field values.
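A sketch of a stream that does this (the field name hello is arbitrary):

```
xd:> stream create --name jsontuple --definition "http | filter --inputType=application/x-xd-tuple --expression=payload.hasFieldName('hello') | transform --expression=payload.getString('hello') | file" --deploy
```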

Note inputType=application/x-xd-tuple on the filter module will cause the payload to be converted to a Tuple at the filter’s input channel. Thus, subsequent expressions are evaluated on a Tuple object. Here we invoke the Tuple methods hasFieldName('hello') on the filter and getString('hello') on the transformer. The output of the http source is expected to be JSON in this case. We set the Content-Type header to tell Spring XD that the payload is JSON.

Java Serialization

The following serializes a java.io.Serializable object to a file. Presumably the foo module outputs a Serializable type. If not, this will result in an exception. If remote transport is configured, the output of foo will be marshalled using Spring XD’s internal serialization. The object will be unmarshalled in the file module and then converted to a byte array using Java serialization.

foo | --inputType=application/x-java-serialized-object file

MIME types and Java types

The use of MessageConverter for data type channels was introduced in Spring Integration 4; it passes the Message to the converter method to allow access to the Message’s content-type header. This provides greater flexibility. For example, it is now possible to support multiple strategies for converting a String or byte array to a POJO, based on the content-type header.

When Spring XD deploys a module with a declared type conversion, it modifies the module’s input and/or output channel definition to set the required Java type and registers MessageConverters associated with the target MIME type and Java type to the channel. The type conversions Spring XD provides out of the box are summarized in the following table:

| Source Payload | Target Payload | content-type header | outputType/inputType | Comments |
| POJO | JSON String | ignored | application/json | |
| Tuple | JSON String | ignored | application/json | JSON is tailored for Tuple |
| POJO | String (toString()) | ignored | text/plain, java.lang.String | |
| POJO | byte[] (java.io serialized) | ignored | application/x-java-serialized-object | |
| JSON byte[] or String | POJO | application/json (or none) | application/x-java-object | |
| byte[] or String | Serializable | application/x-java-serialized-object | application/x-java-object | |
| JSON byte[] or String | Tuple | application/json (or none) | application/x-xd-tuple | |
| byte[] | String | any | text/plain, java.lang.String | will apply any Charset specified in the content-type header |
| String | byte[] | any | application/octet-stream | will apply any Charset specified in the content-type header |

Caveats

Note that inputType and outputType parameters only apply to payloads that require type conversion. For example, if a module produces an XML string with outputType=application/json, the payload will not be converted from XML to JSON. This is because the payload at the module’s output channel is already a String so no conversion will be applied at runtime.

Developing Modules and Extensions

Creating a Source Module

Introduction

As outlined in the modules document, Spring XD currently supports four types of modules: source, processor, and sink for stream processing, and job for batch processing. This document walks through the creation of a custom source module.

The first module in a stream is always a source. Source modules are built with Spring Integration and are responsible for producing messages originating from an external data source on its output channel. These messages can then be processed by the downstream modules in a stream. A source module is often fed data by a Spring Integration inbound channel adapter, configured with a poller.

Spring Integration provides a number of adapters out of the box to integrate with various transports and data stores, such as JMS, File, HTTP, Web Services, Mail, and more. Typically, it is straightforward to create a source module using an existing inbound channel adapter.

The adapter is configured to poll an RSS feed at a fixed rate (e.g., every 5 seconds). Note that auto-startup is set to false. This is a requirement for Spring XD modules. When a stream is deployed, the Spring XD runtime will create and start stream modules in reverse order to ensure that all modules are initialized before the source starts emitting messages. When an RSS Entry is retrieved, it will create a message with a com.rometools.rome.feed.synd.SyndEntry payload type and send it to a message channel called output. The name output is a Spring XD convention indicating the module’s output channel. Any messages on the output channel will be consumed by the downstream processor or sink in any stream that uses this module.

The module is configurable so that it may pull data from any feed URL, such as http://feeds.bbci.co.uk/news/rss.xml. Spring XD will automatically register a PropertyPlaceholderConfigurer in the module’s application context. These properties correspond to module options defined for this module (discussed below). Users supply option values when creating a stream using the DSL.

Users must provide a url option value when creating a stream that uses this source. The polling rate and maximum number of entries retrieved for each poll are also configurable and for these properties we should provide reasonable default values. The module’s properties file in the config resource directory contains Module Options Metadata including a description, type, and optional default value for each property. The metadata supports features like auto-completion in the Spring XD shell and option validation:
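A sketch of such a metadata file (the option names and default values are illustrative; the options.<name>.* keys follow the module options metadata properties convention):

```
options.url.description = the URL of the RSS feed
options.url.type = java.lang.String

options.fixedDelay.description = the delay in seconds between feed polls
options.fixedDelay.type = int
options.fixedDelay.default = 5

options.maxEntries.description = the maximum number of entries retrieved per poll
options.maxEntries.type = int
options.maxEntries.default = 20
```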

Alternately, you can write a POJO to define the metadata. Using a Java class provides better validation along with additional features and requires that the class be packaged as part of the module.

Create a Module Project

This section covers the setup of a standalone project containing the module configuration and some code for testing the module. This example uses Maven but Spring XD supports Gradle as well.

Take a look at the pom file for this example. You will see it declares spring-xd-module-parent as its parent and declares a dependency on spring-integration-feed, which provides the inbound channel adapter. The parent pom provides everything else you need. We also need to configure repositories to access the parent pom and any other dependencies. The required XML file containing the bean definitions and the properties file are located in src/main/resources/config. In this case, we have elected to use a custom transformer to convert the output of the feed inbound adapter to a JSON string.

The project README contains a detailed explanation of why this transformer is needed, but such things are easily accomplished with Spring Integration.

Create a Spring Integration test

The first level of testing should ensure that the module’s Application Context is loaded and that the message flow works as expected independent of Spring XD. In this case, we need to wrap the module application context in a test context that provides a property placeholder (the Spring XD runtime does this for you). In addition, it is convenient to override the module’s output channel with a queue channel so that the test will block until a message is received from the feed.

Add the following configuration in the appropriate location under src/test/resources/:
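A sketch of such a wrapper context (the property value and import path are assumptions based on the module layout described above; the output channel is redefined as a queue channel so the test can block on receive()):

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:util="http://www.springframework.org/schema/util"
       xmlns:int="http://www.springframework.org/schema/integration"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
           http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd
           http://www.springframework.org/schema/integration http://www.springframework.org/schema/integration/spring-integration.xsd">

    <!-- the Spring XD runtime normally provides this placeholder resolution -->
    <context:property-placeholder properties-ref="moduleOptions"/>

    <util:properties id="moduleOptions">
        <prop key="url">http://feeds.bbci.co.uk/news/rss.xml</prop>
    </util:properties>

    <import resource="classpath:config/spring-module.xml"/>

    <!-- override the module's output channel with a queue channel -->
    <int:channel id="output">
        <int:queue/>
    </int:channel>

</beans>
```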

The above test configures and starts an embedded Spring XD runtime (SingleNodeApplication) to deploy a stream that uses the module under test.

The SingleNodeProcessingChainConsumer can test a stream that does not include a sink. The chain itself provides an in-memory sink to access the stream’s output directly. In this case, we use the chain to test the source in isolation. The above test is equivalent to deploying the following stream definition:
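A sketch of that definition (the queue name is supplied by the chain internally and is shown here only for illustration):

```
feed --url='http://feeds.bbci.co.uk/news/rss.xml' > queue:consumer
```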

and the chain consumes messages on the named queue channel. At the end of each test method, the chain should be destroyed to release these internal resources and restore the initial state of the Spring XD container.

Note

The spring-xd-module-parent Maven pom includes a task to install a local message bus implementation under lib in the project root, enabling a local transport provider for the embedded Spring XD container. It is necessary to run maven process-resources or a downstream goal (e.g., compile, test, package) once in order for this test to work correctly.

Install the Module

We have implemented and tested the module using Spring Integration directly and also by deploying the module to an embedded Spring XD container. Time to install the module to Spring XD!

The next step is to package the module as an uber-jar using maven:

$ mvn package

This will build an uber-jar in target/rss-feed-source-1.0.0.BUILD-SNAPSHOT.jar. If you inspect the contents of this jar, you will see it includes the module configuration files, custom transformer class, and dependent jars.
Fire up the Spring XD runtime if it is not already running and,
using the Spring XD Shell, install the module as a source named feed using the module upload command:
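The command might look like this (the jar path is relative to the project root):

```
xd:> module upload --file target/rss-feed-source-1.0.0.BUILD-SNAPSHOT.jar --name feed --type source
```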

Creating a Data Stream Processor

Introduction

This section covers how to create a processor module that uses stream processing libraries and runtimes. Spring XD 1.2 provides integration with Project Reactor Stream, RxJava Observables, and Spark Streaming. Creating a data stream processor in XD allows you to use a functional programming model to filter, transform and aggregate data in a very concise and performant way. This section walks through implementing a custom processor module using each of these libraries.

Reactor Streams

Project Reactor provides a Stream API that is based on the Reactive Streams specification. The specification was jointly developed by twenty people from a dozen companies (Pivotal included) and has the goal of creating a standard for asynchronous stream processing with non-blocking back pressure on the JVM.

Messages that are delivered on the Message Bus are accessed from the input Stream, which can be directly composed. The return value is the output Stream that is the result of applying various operations to the input stream. The content of the output Stream is sent to the message bus for consumption by other processors or sinks.

Examples of operations you can perform on the Stream are map, flatMap, buffer, window, and reduce. The parameterized data type can be a org.springframework.messaging.Message, org.springframework.xd.tuple.Tuple, java.lang.Map or any other POJO. The following example uses the Tuple object to compute the average value of a measurement from a sample size of 5.
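The averaging code itself belongs to the original sample; as a dependency-free illustration of the logic the Reactor Stream pipeline applies (buffer five values, then reduce each buffer to its mean), here is a plain-Java sketch. The class and method names are ours, not part of Spring XD or Reactor:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the windowed-average logic: group measurements into
// batches of sampleSize, then reduce each batch to its mean.
public class MovingAverage {
    public static List<Double> averages(List<Double> measurements, int sampleSize) {
        List<Double> result = new ArrayList<>();
        for (int i = 0; i + sampleSize <= measurements.size(); i += sampleSize) {
            double sum = 0;
            for (int j = i; j < i + sampleSize; j++) {
                sum += measurements.get(j);
            }
            result.add(sum / sampleSize);
        }
        return result;
    }
}
```

In the actual module, the buffering and reduction would be expressed with the Stream API's buffer and reduce operations rather than explicit loops.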

You can now create unit tests for the Processor module just as you would for any other Java class. The module application context file can be in XML or in Java using a @Configuration class. The XML version is shown below.

Examples of unit and integration testing a module are available in the reactor sample project. The sample project also shows how you can package your module into a single jar and upload it to the admin server.

RxJava Observables

Messages that are delivered on the Message Bus are accessed from the Observable input stream. The return value is the Observable output stream that contains the results of applying various operations to the input stream. The content of the output stream is sent to the message bus for consumption by other processors or sinks.

Examples of operations you can perform on the Stream are map, flatMap, buffer, window, and reduce. The parameterized data type can be a org.springframework.messaging.Message, org.springframework.xd.tuple.Tuple, java.lang.Map or any other POJO.

When used in combination with Data Partitioning on the Message Bus, this allows you to create a streaming application where Stream state is calculated per partition where necessary.

In this deployment the data that is sent to the RxJava processing modules from the HTTP sources is partitioned such that the red data always goes to the red stream processing module and so on for the other colors. The next hop of processing, where writing to HDFS occurs, does not require data partitioning, so the message load can be shared across the HDFS sink instances.

There can be as many layers of RxJava Stream processing as you require, allowing you to collocate specific functional operations as you see fit within a single JVM or to distribute across multiple JVMs.

The following example uses the Tuple object to compute the average value of a measurement from a sample size of 5.

You can now create unit tests for the Processor module as you would for any other Java class. The module application context file can be in XML or in Java using a @Configuration class. The XML version is shown below.

Scheduling

There are two MessageHandler implementations that you can choose from, SubjectMessageHandler and MultipleSubjectMessageHandler.

SubjectMessageHandler uses a single SerializedSubject to process messages that were received from the Message Bus. This subject, downcast to Observable, is what is passed into the process method. Using SubjectMessageHandler has the advantage that the state of the Observable input stream can be shared across all the Message Bus dispatcher threads that are invoking onNext. It has the disadvantage that the processing and consumption of the Observable output stream (that sends messages to the Message Bus) will execute serially on one of the dispatcher threads. Note that you can modify what thread the Observable output stream will use by calling observeOn before returning the output stream from your processor.

MultipleSubjectMessageHandler uses multiple Subjects to perform processing. A Spring Expression Language (SpEL) expression is used to map the incoming message to a specific Subject to use for processing. Using MultipleSubjectMessageHandler has the advantage that it can use all Message Bus dispatcher threads. It has the disadvantage in that each Observable input stream has its own state, which may not be desirable for certain types of aggregate calculations that should see all of the data. A common partition expression to use is T(java.lang.Thread).currentThread().getId() so that a Subject will be created per thread.
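To make the per-partition state idea concrete, here is a dependency-free sketch (not the Spring XD API; all names are ours) of routing each message to its own per-key stream, the way MultipleSubjectMessageHandler maps messages to Subjects via a partition expression:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Conceptual stand-in for MultipleSubjectMessageHandler: the partition
// expression picks a key for each message, and each key gets its own
// stream (here, just a list) holding that partition's state.
public class PartitionedRouter<T> {
    private final Map<Object, List<T>> streams = new HashMap<>();
    private final Function<T, Object> partitionExpression;

    public PartitionedRouter(Function<T, Object> partitionExpression) {
        this.partitionExpression = partitionExpression;
    }

    public void onNext(T message) {
        streams.computeIfAbsent(partitionExpression.apply(message), k -> new ArrayList<>())
               .add(message);
    }

    public List<T> stream(Object key) {
        return streams.getOrDefault(key, new ArrayList<>());
    }
}
```

In Spring XD the partition expression is written in SpEL (for example, T(java.lang.Thread).currentThread().getId()) and each key maps to an RxJava Subject rather than a list.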

This satisfies the contract of single-threaded access to a Subject. Another interesting partition expression to use in the case of the Kafka Message Bus is header['kafka_partition_id']. This will create a Subject per Kafka partition, each representing an ordered sequence of events. The XD Kafka Message Bus statically maps partitions to dispatcher threads, so there is only single-threaded access to a Subject.

Spark streaming

Spring XD integrates with Spark streaming so that the streaming data computation logic can be run on a Spark cluster. Spring XD runs the Spark driver as an XD module (processor or sink) in the XD container, while the Spark streaming reliable receiver and the data computation are done on the Spark cluster.

This has the advantage of connecting to various streaming sources while running the computation logic on the Spark cluster. Running the Spark driver on the XD container also provides automatic failover capabilities in case of driver failure.

With Spark Streaming, events are processed at the micro batch level via DStreams, which represent a continuous flow of partitioned RDDs. Setting up a Spark Streaming module within XD can be beneficial when adding streaming data computation logic for a tapped XD stream. While the primary stream processes events one at a time (through the regular XD modules), the tapped stream will become a source for the Spark Streaming module.

Let's discuss a real-world scenario of collecting data and performing some analytics on it.

In the above set of streams, consider a primary stream that collects data one record at a time from various sensors and stores that raw data into HDFS after only some basic filtering. At the same time, a few other streams perform analytics on the collected data at the micro-batch level. Here, the tapped stream's source can be reliable or durable based on the message bus implementation, and this data is processed (at the micro-batch level) by the Spark Streaming module. This allows the developer to choose the stream data processing based on the use case.

One can also add a tap at the output of the Spark streaming processor module.
For instance, adding a tap at the output of sparkstream2's Spark stream processor would be:

Writing a spark streaming module

Spring XD provides Java and Scala based interfaces which expose a process method that the Spark streaming developer implements. This method processes the input DStream received by the Spark streaming receiver. In the case of an XD processor module, this method returns an output DStream. In the case of an XD sink module, it writes the computed data to the file system, HDFS, etc. (for example via saveAsTextFiles() or saveAsHadoopFiles() using Spark APIs).

For Java based implementations, Spring XD defines the interface org.springframework.xd.spark.streaming.java.Processor.

When creating an XD processor/sink module, the developer implements this interface and makes the module archive (along with its dependencies) available in the module registry.

To set Spark configuration properties when developing a Spark streaming module, the developer can place the org.springframework.xd.spark.streaming.SparkConfig annotation on a method that returns java.util.Properties.

To add default Spark streaming command line options for the module, and to let the XD admin know this is a Spark streaming module, the following entry should be added to the module config properties in the module registry (for example: modules/processor/spark-wordcount/config/spark-wordcount.properties):
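
The entry itself is missing from this copy; reconstructed here from the Spring XD 1.2 samples, so treat it as an assumption and verify against your release:

module.execution.framework=spark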

Developers can extend this to provide additional custom command line options. By default, the following module options are supported for a Spark streaming module:

batchInterval (the time interval in milliseconds for batching the stream events)

storageLevel (the streaming data persistence storage level)

Note

If you are using Java 7 to run Spring XD, make sure to set JAVA_OPTS to increase -XX:MaxPermSize to avoid PermGen issues on the XD container where the Spark driver will be running.

How this works

When a Spark streaming module (a processor or a sink) that implements the Processor interface above is deployed, the SparkDriver sets up the streaming context and runs as an XD module inside the XD container.

This sets up a Spark streaming receiver (for both processor and sink modules) in the Spark cluster that connects to the upstream XD module's output channel on the message bus. Note that this receiver is a reliable Spark streaming receiver out of the box if you use Rabbit or Kafka as the message bus; this is implemented using manual acknowledgement on Rabbit and explicit offset management on Kafka.
The MessageBusReceiver makes the incoming messages available for computation in the Spark cluster as DStreams. If the streaming module is of XD processor type, the computed messages are pushed to the downstream module by the MessageBusSender. The MessageBusSender binds to the downstream module's input channel, which subsequently connects to any of the XD processor or sink modules.

It is important to note that the MessageBusReceiver, the streaming processor computation, and the MessageBusSender all run on the Spark cluster.

Failover and recovery

Spark streaming integration supports automatic failover on Spark driver failure. In case of driver module failure, the module is automatically re-deployed to another available XD container.
Also, the underlying Spark streaming receiver is a reliable receiver when using RabbitMQ or Kafka as the message bus. This makes sure all messages are always acknowledged and processed reliably in Spark.

Module Type Conversion

Spark streaming modules can benefit from Spring XD's out of the box module type conversion support. A Spark streaming processor module can specify inputType and outputType, while a Spark streaming sink module can specify inputType, to denote the contentType of the incoming/outgoing messages before they are ingested into or written out of the Spark streaming module.

Creating a Processor Module

Introduction

As outlined in the modules document, Spring XD currently supports four types of modules: source, sink, and processor for stream processing and job for batch processing. This document walks through implementing a custom processor module.

One or more processors can be included in a stream definition to modify the data as it passes on its way from the source to the sink. The architecture section covers the basics of stream processing. Processor modules provided out of the box are covered in the processors section.

Here we’ll look at how to create a simple processor module from scratch. This module will extract the text field from input messages from a twittersearch source. The steps are essentially the same regardless of the module’s functionality. Note that Spring XD can perform this type of transformation without requiring a custom module. Rather than using the built-in functionality, we will implement a custom processor and wire it up with Spring Integration. The complete code for this example is here.

Write the Transformer Code

The tweet messages from twittersearch contain quite a lot of data (id, author, time, hash tags, and so on). The transformer we’ll write extracts the text of each tweet and outputs this as a string. The output messages from the twittersearch source are also strings, rendering the tweet data as JSON. We first load this into a map using Jackson library code, then extract the text field from the map.
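The sample's transformer uses Jackson; to keep this sketch dependency-free, the JSON parsing is replaced by a naive regex that assumes a well-formed, unescaped text field (the class name is ours). The contract is the same as the module's transformer: JSON string in, tweet text out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: the real sample loads the JSON into a Map with
// Jackson and reads the "text" key, which handles escaping correctly.
public class TweetTextExtractor {
    private static final Pattern TEXT_FIELD = Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    // Same shape as the module's transform method: String in, String out.
    public String transform(String tweetJson) {
        Matcher m = TEXT_FIELD.matcher(tweetJson);
        return m.find() ? m.group(1) : null;
    }
}
```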

Alternately, you can create the application context using an @Configuration class. In the example below, we’ve combined the configuration and the transformer into a single Java file for simplicity. Note that TweetTransformer now includes Spring Integration annotations:

To use @Configuration, you must also tell Spring which packages to scan in the module’s properties file spring-module.properties:

base_packages=my.custom.transformer

Write a Test

Writing a test to deploy the module in an embedded single node container requires the spring-xd-dirt and spring-xd-test libraries and a few other things. See the project pom or the gradle build script for details. The following code snippets are from TweetTransformerIntegrationTest

Note

See test a module for some important tips regarding in-container testing.

First we start the SingleNodeApplication and register the module under test by adding a SingletonModuleRegistry providing the module name and type. This looks in the root classpath by default, so will find the module configuration in src/main/resources/config. SingleNodeIntegrationTestSupport provides programmatic access to major beans in the Admin and Container application contexts, as well as the contexts themselves.

To implement this test, we will use the SingleNodeProcessingChain test fixture. The chain is a partial stream definition, represented as Spring XD DSL, which may be a single module or a chain of processors separated by |. In this case we are testing a single module. The chain binds local message handlers that act as source and sink to complete the stream. Thus we can deploy the stream, send messages directly to the source, and receive messages directly from the sink:

We could, in theory, test against the actual twittersearch source, but this is not advised because it would depend on connecting to Twitter, providing credentials, etc. So we will save that for when the module is actually installed to the target Spring XD runtime. Instead, we can simply send a message with a sample tweet and verify that we get the content of the text property as output, as expected.

If you make changes and need to re-install, you must first delete the existing module:

xd:>module delete processor:tweet-transformer

Note

A simple jar file works in this case because the module requires no additional library dependencies since the Spring XD class path already includes Jackson and Spring Integration. See Module Packaging for more details.

If you haven’t already used twittersearch, read the sources section for more details. This command should stream tweets to the file /tmp/xd/output/javatweets but, unlike the normal twittersearch output, you should just see the text of the tweet rather than the full JSON document.

Creating a Sink Module

Introduction

As outlined in the modules document, Spring XD currently supports four types of modules: source, sink, and processor for stream processing and job for batch processing. This document walks through implementing a custom sink module.

The last module in a stream is always a sink. A sink module is built with Spring Integration to consume messages on its input channel and send them to an external resource to terminate the stream.

Spring Integration provides a number of outbound channel adapters to integrate with various transports such as TCP, AMQP, JMS, Kafka, HTTP, web services, mail, or data stores such as file, Redis, MongoDB, JDBC, Splunk, Gemfire, and more. It is straightforward to create a sink module using an existing outbound channel adapter. Such outbound channel adapters are typically used to integrate streams with external data stores or legacy systems. Alternately, you may need to invoke a third party Java API to provide data to an external system; in this case, the sink can easily invoke a Java method using a Service Activator.

The adapter, as required by Spring XD, is configured as an endpoint on a channel named input. When a message is consumed, the Redis Store outbound channel adapter will write the payload to a Redis list with a key given by the ${collection} property. By default, the Redis Store outbound channel adapter uses a bean named redisConnectionFactory to connect to the Redis server. Here the connection factory is configured with property placeholders ${host}, ${port} which will be provided as module options in stream definitions that use this sink. Note that auto-startup is set to false. This is a requirement for Spring XD modules. When a stream is deployed, the Spring XD runtime will create and start the modules in the correct order to ensure that everything is initialized before the stream starts processing messages.

Note

By default, the adapter uses a StringRedisTemplate. Therefore, this module will store all payloads directly as Strings. You may configure a RedisTemplate with a different value Serializer to serialize other data types, such as Java objects, to the Redis collection.

Spring XD automatically registers a PropertyPlaceholderConfigurer in your application context, so there is no need to declare one here. These properties correspond to the module options defined for this module (discussed below). Users supply option values when creating a stream using the DSL.

The module’s properties file in the config resource directory contains Module Options Metadata including a description, type, and optional default value for each property. The metadata supports features like auto-completion in the Spring XD shell and option validation:

Note that the collection defaults to the stream name, referencing a common property provided by Spring XD.

Alternately, you can write a POJO to define the metadata. Using a Java class provides better validation along with additional features and requires that the class be packaged as part of the module.

Create a module project

This section covers creating the module as a standalone project containing some code to test the module. This example uses Maven, but Spring XD supports Gradle as well.

Take a look at the pom file for this example. You will see it declares spring-xd-module-parent as its parent and declares a dependency on spring-integration-redis, which provides the outbound channel adapter. The parent pom provides everything else you need. We also need to configure repositories to access the parent pom and any other dependencies. The xml file containing the bean definitions and the properties file are located in src/main/resources/config.

Create the Spring integration test

The main objective of the test is to ensure that messages are stored in a Redis list once the module’s Application Context is loaded. In order to test the module stand-alone, we need to enhance the module context with property values and a RedisTemplate to retrieve the stored messages.

The test will load the module application context using our test context and send a message to the module’s input channel. It will fail if the input payload "hello" is not added to the Redis list within 5 seconds.

Run the test

The test requires a running Redis server. See Getting Started for information on installing and starting Redis.

Install the module

The next step is to package the module as an uber-jar using maven:

$ mvn package

This will build an uber-jar in target/redis-store-sink-1.0.0.BUILD-SNAPSHOT.jar. If you inspect the contents of this jar, you will see it includes the module configuration files and dependent jars (spring-integration-redis in this case).
Fire up the Spring XD runtime if it is not already running and,
using the Spring XD Shell, install the module as a sink named redis-store using the module upload command:
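
The command listing was elided from this copy; a plausible invocation, assuming the jar path from the package step above (verify against your build output), is:

xd:>module upload --file target/redis-store-sink-1.0.0.BUILD-SNAPSHOT.jar --name redis-store --type sink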

Creating a Job Module

Introduction

As outlined in the modules document, XD currently supports four types of modules: source, sink, and processor for stream processing and job for batch processing. This document walks through creation of a simple job module.

Developing your Job

The Job definitions provided as part of the Spring XD distribution, as well as those included in the Spring XD Samples repository, can be used as a basis for building your own custom Jobs. The development of a Job largely follows the development of a Spring Batch job, for which there are several references.

Creating a Simple Job

First we’ll look at how to create a job module from scratch. The complete working example is here.

Create a Module Project

This section covers the setup of a standalone project containing the module configuration and custom code. This example uses Maven but Spring XD supports Gradle as well.

Take a look at the pom file for this example. You will see it declares spring-xd-module-parent as its parent. The parent pom provides support for building and packaging Spring XD modules, including spring-batch libraries. We also need to configure repositories to access the parent pom and its dependencies.

First create a java project for your module, named batch-simple in your favorite IDE.

Create the Spring Batch Job Definition

Create the job definition file in src/main/resources/config. In this case, we use a custom Tasklet. In this example there is only one step, and it simply prints out the job parameters.

Modules can reside in an expanded directory named after the module, e.g. modules/job/myjob, or as a single uber-jar, e.g. modules/job/myjob.jar. See module packaging and registering a module for more details.

Creating a read-write processing Job

To create a job in the XD shell, execute the job create command specifying:

name - the "name" that will be associated with the Job

definition - the name of the job module

Often a batch job will involve reading batches of data from a source, transforming or processing that data, and then writing the batch of data to a destination. This kind of flow is implemented using chunk-oriented processing, represented in the job configuration by the <chunk/> element containing reader, writer, and optional processor elements. Other attributes define the size of the chunk and various policies for handling failure cases.
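
As a sketch of what such a configuration looks like (using the standard Spring Batch XML namespace; the bean names itemReader, itemProcessor, and itemWriter are placeholders for your own beans):

<batch:job id="myJob">
    <batch:step id="step1">
        <batch:tasklet>
            <batch:chunk reader="itemReader" processor="itemProcessor"
                         writer="itemWriter" commit-interval="100"/>
        </batch:tasklet>
    </batch:step>
</batch:job>

Here commit-interval controls the chunk size: the reader is called 100 times, each item is passed through the processor, and the writer then receives the whole chunk in one call.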

You will usually be able to reuse existing reader and writer implementations. The filejdbc job provided with the Spring XD distribution shows an example of this using the standard File reader and JDBC writer.

The processor is based on the ItemProcessor interface. It has a generic signature that lets you operate on one record at a time. The batch of records is handled as a collection in reader and writer implementations. In the filejdbc job, the reader converts each input record into a Spring XD Tuple. The tuple serves as a generic data structure, but you can also use or write another converter to convert the input record to your own custom POJO.
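To make the one-record-at-a-time contract concrete, here is a self-contained sketch. The ItemProcessor signature matches Spring Batch's, reproduced inline so the snippet compiles on its own; the Score POJO and the name,value CSV format are invented for illustration:

```java
// The Spring Batch contract, reproduced inline for illustration.
interface ItemProcessor<I, O> {
    O process(I item) throws Exception;
}

// Hypothetical POJO for a "name,value" input record.
class Score {
    final String name;
    final int value;
    Score(String name, int value) { this.name = name; this.value = value; }
}

// Operates on one record at a time; the framework deals with the
// chunk as a whole when handing records to the reader and writer.
class ScoreProcessor implements ItemProcessor<String, Score> {
    public Score process(String record) {
        String[] fields = record.split(",");
        return new Score(fields[0], Integer.parseInt(fields[1]));
    }
}
```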

Orchestrating Hadoop Jobs

There are several Tasklet implementations that will run various types of Hadoop jobs.

The Spring Hadoop Samples project provides examples of how to create batch jobs that orchestrate various Hadoop jobs at each step. You can also mix and match steps related to work that is executed on the Hadoop cluster with work that is executed on the Spring XD cluster.

Creating a Python Module

Introduction

Spring XD provides support for processor and sink modules that invoke an external shell command. You can use these to integrate a Python script with a Spring XD stream. The following echo.py script is a simple example which can implement a processor to simply echo the input.

Python must be installed on the host of any container to which the processor module is deployed.

You should see the time messages echoed in the Spring XD container log. The shell processor works by binding its message channels to the external process's stdin and stdout. Behind the scenes, the shell modules use java.lang.ProcessBuilder to connect to the shell process. As you can see, most of echo.py is boilerplate code. To make things easier, Spring XD provides a python module to handle all of the low level I/O.

As you can see, this creates a Processor object which has a start method to which you may pass any function that accepts a single argument and returns a value. Currently, both the input and output data must be strings. Processor uses Encoders.CRLF (\r\n) by default. This is how the Spring XD module delimits individual messages in the stream. Encoders.LF is also supported. The shell command processor also uses CRLF by default.

The stream module also provides a similar Sink object which accepts a function that need not return a value (Sink will ignore the returned value).

Note

In order to import the springxd.stream module into your script, you must include it in your Python module search path. Python provides several ways to do this as described here. Spring XD python modules are included in the distribution in the python directory. The stream module is designed to be version agnostic and has been tested against Python 2.7.6 and Python 3.4.2.

Providing Module Options Metadata

Introduction

Each available module can expose metadata about the options it accepts. This is useful to enhance the user experience, and is the foundation of advanced features like contextual help and code completion.

For example, provided that the file source module has been enriched with options metadata (and it has), one can use the module info command in the shell to get information about the module:

For this to be available, module authors have to provide a little bit of extra information, known as "Module Options Metadata". That metadata can take two forms, depending on the needs of the module: one can either use the "simple" approach, or the "POJO" approach. If one does not need advanced features like profile activation, validation or options encapsulation, then the "simple" approach is sufficient.

Using the "Simple" approach

To use the simple approach, simply create a file named <module>.properties right next to the <module>.xml file for your module.

Declaring and documenting an option

In that file, each option <option> is declared by adding a line of the form

options.<option>.description = the description

The description for the option is the only required part, and is a very important piece of information for the end user, so pay special attention to it (see also Style remarks).

That sole line in the properties file makes a --<option>= construct available in the definition of a stream using your module.

Note

About plugin provided options metadata

Some options are automatically added to a module, depending on its type. For example, every source module automatically inherits an outputType option, which controls the type conversion feature between modules. You don’t have to do anything for that to happen.

Note that there is support for both wrapper types (e.g. Integer) and primitive types (e.g. int). Although this is used for documentation purposes only, the primitive type would typically be used to indicate a required option (null being prohibited).
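Pulling these pieces together, a hypothetical <module>.properties for a module with a required port option and an optional host option might read as follows (the type and default keys shown follow the same options.<option>.* pattern; verify the exact key names against your release's documentation):

options.port.description = the port to connect to
options.port.type = int
options.host.description = the host to connect to
options.host.default = localhost

Here port is documented with the primitive int to signal that it is required, while host advertises a default and so may be omitted from the stream definition.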

Using the "POJO" approach

To use advanced features such as profile activation driven by the values provided by the end user, one would need to leverage the "POJO" approach.

Instead of writing a properties file, you will need to write a custom java class that will hold the values at runtime. That class is also introspected to derive metadata about your module.

Declaring options to the module

For the simplest cases, the class you need to write does not need to implement or inherit from anything. The only thing you need to do is to reference it in a properties file named after your module (the same file location you would have used had you been leveraging the "simple" approach):

options_class = fully.qualified.name.of.your.Pojo

Note that the key is options_class, with an s and an underscore (not to be confused with options.<option>, which is used in the "simple" approach)

For each option you want available using the --<option>= syntax, you must write a public setter annotated with @ModuleOption, providing the option description in the annotation.

The type accepted by that setter will be used as the documented type.

That setter will typically be used to store the value in a private field. How the module application can get ahold of the value is the topic of the next section.

Exposing values to the context

For a provided value to be used in the module definition (using the ${foo} syntax), your POJO class needs to expose a getFoo() getter.

At runtime, an instance of the POJO class will be created (it requires a no-arg constructor, by the way) and values given by the user will be bound (using setters). The POJO class thus acts as an intermediate PropertySource to provide values to ${foo} constructs.

Providing defaults

To provide default values, one would most certainly simply store a default value in the backing field of a getter/setter pair. That value (actually, the result of invoking the getter matching a setter on a newly instantiated object) is what is advertised as the default.

Encapsulating options

Although one would typically use the combination of a foo field and a getFoo(), setFoo(x) pair, one does not have to.

In particular, if your module definition requires some "complex" (all things being relative here) value to be computed from "simpler" ones (e.g. a suffix value would be computed from an extension option, that would take care of adding a dot, depending on whether it is blank or not), then you’d simply do the following:
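The original code listing is missing from this copy; a minimal sketch of the pattern might look like the following (the class name is ours, and the @ModuleOption annotation is shown as a comment so the snippet stays dependency-free):

```java
public class FileNameOptions {
    private String extension = "";

    // In the real module this setter would carry
    // @ModuleOption(value = "the file extension to use", defaultValue = "")
    public void setExtension(String extension) {
        this.extension = extension;
    }

    // Surfaced to the module definition as ${suffix}.
    // Deliberately no getExtension(): only the computed value is exposed.
    public String getSuffix() {
        return (extension == null || extension.isEmpty()) ? "" : "." + extension;
    }
}
```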

This would expose a --extension= option, being surfaced as a ${suffix} placeholder construct.

The astute reader will have realized that the default cannot be computed then, because there is no getExtension() (and there should not be, as it could be mistakenly used in ${extension}). To provide the default value, use the defaultValue attribute of the @ModuleOption annotation.

Using profiles

The real benefit of using a POJO class for options metadata comes with advanced features though, one of which is dynamic profile activation.

If the set of beans (or xml namespaced elements) you would define in the module definition file depends on the value that the user provided for one or several options, then you can make your POJO class implement ProfileNamesProvider. That interface brings one contract method, profilesToActivate() that you must implement, returning the names of the profiles you want to use (this method is invoked after user option values have been bound, so you can use any logic involving those to compute the list of profile names).

As an example of this feature, see e.g. TriggerSourceOptionsMetadata.

Using validation

Your POJO class can optionally bear JSR303 annotations. If it does, validation occurs after values have been successfully bound (note that binding itself can fail early due to type mismatches; that check comes for free and does not require JSR303 annotations).

This can be used to validate a set of options passed in (some are often mutually exclusive) or to catch misconfiguration earlier than deployment time (e.g. a port number cannot be negative).

Metadata style remarks

To provide a uniform user experience, it is better if your options metadata information adheres to the following style:

option names should follow the camelCase syntax, as this is easier with the POJO approach. If we later decide to switch to a more unix-style, this will be taken care of by XD itself, with no change to the metadata artifacts described here

description sentences should be concise

descriptions should start with a lowercase letter and should not end with a dot

use primitive types for required numbers

descriptions should mention the unit for numbers (e.g ms)

descriptions should not describe the default value, to the best extent possible (this is surfaced through the actual default metadata awareness)

options metadata should know about the default, rather than relying on the ${foo:default} construct

Extending Spring XD

Introduction

This document describes how to customize or extend the Spring XD Container. Spring XD is a distributed runtime platform delivered as executable components including XD Admin, XD Container, and XD Shell. The XD Container is a Spring application combining XML resources, Java @Configuration classes, and Spring Boot auto configuration for its internal configuration, initialized via the Spring Boot SpringApplicationBuilder. Since Spring XD is open source, the curious user can see exactly how it is configured. However, all Spring XD’s configuration is bundled in jar files and therefore not directly accessible to end users. Most users do not need to customize or extend the XD Container. For those that do, Spring XD provides hooks to:

The following sections provide an overview of XD Container internals and explain how to extend Spring XD for each of these scenarios. The reader is expected to have a working knowledge of both the Spring Framework and Spring Integration.

Spring XD Application Contexts

The diagram below shows how Spring XD is organized into several Spring application contexts. Some understanding of the Spring XD application context hierarchy is necessary for extending XD. In the diagram, solid arrows indicate a parent-child relationship. As with any Spring application a child application context may reference beans defined in its parent application context, but the parent context cannot access beans defined in the child context. It is important to keep in mind that a bean definition registered in a child context with the same id as a bean in the parent context will create a separate instance in the child context. Similarly, any bean definition will override an earlier bean definition in the same application context registered with the same id (Sometimes referred to as "last one wins").

Spring XD’s primary extension mechanism targets the Plugin Context highlighted in the diagram. Using a separate convention, it is also possible to register an alternate MessageBus implementation in the Shared Server Context.

Figure 32. The Spring XD Application Context Hierarchy

While this arrangement of application contexts is more complex than the typical Spring application, XD is designed this way for the following reasons:

Bean isolation - Some beans are "global" in that they are shared by all XD runtime components: Admin, Container, and Modules. Those allocated to the Shared Server Context are shared only by Admin and Container. Some beans must be available to Plugins, which are used to configure Modules. However, Plugins and Modules should be isolated from critical internal components. While complete isolation has proven difficult to achieve, the intention is to minimize any undesirable side effects when introducing extensions.

Bean scoping - To ensure that single node and distributed configurations of the Spring XD runtime are logically equivalent, the Spring configuration is identical in both cases, avoiding unnecessary duplication of bean definitions.

Lifecycle management - Plugins and other beans used to configure these application contexts are also Spring beans which Spring XD dynamically "discovers" during initialization. Such components must be fully instantiated prior to the creation of the application context to which they are targeted. To ensure initialization happens in the desired order, such beans may be either defined in an isolated application context (i.e., not part of the hierarchy) or in a parent context which Spring initializes before any of its descendants.

Plugin Architecture

The XD Container at its core is simply a runtime environment for hosting and managing micro Spring applications called Modules. Each module runs in its own application context (Module Context). The Module Context is a child of the Global Context, as modules share some bean definitions, but is otherwise logically isolated from beans defined in the XD Container. The Module Context is fundamental to the Spring XD design. In fact, this is what allows each module to define its own input and output channels, and in general, enables beans to be uniquely configured via property placeholders evaluated for each deployed instance of a Module. The Module interface and its default implementation provide a thin wrapper around a Spring Application Context for which properties are bound, profiles activated, and beans added or enhanced in order to "plug" the module into the XD Container.

The ModuleDeployer, shown in the diagram, is a core component of the Container Context, responsible for initializing modules during deployment, and shutting them down during undeployment. The ModuleDeployer sees the module as a "black box", unaware of its purpose or runtime requirements. Binding a module’s channels to XD’s data transport, for instance, is the responsibility of the MessageBus implementation configured for the transport. The MessageBus binding methods are actually invoked by the StreamPlugin during the initialization of a stream module. To support jobs, XD provides a JobPlugin to wire the Spring Batch components defined in the module during deployment. The JobPlugin also invokes the MessageBus to support communications between XD and job modules. These, and other functions critical to Spring XD are performed by classes that implement the Plugin interface. A Plugin operates on every deployed Module which it is implemented to support. Thus the ModuleDeployer simply invokes the deployment life cycle methods provided by every Plugin registered in the Plugin Context.

The ModuleDeployer discovers registered Plugins by calling getBeansOfType(Plugin.class) on the Plugin Context (its parent context). This means that adding your own Plugin amounts to registering it as a bean in the Plugin Context, as described in the following section.
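As a sketch, a Plugin that logs each module deployment might look like the following (the method names reflect the Plugin interface; verify the exact signatures against the spring-xd-module javadoc for your release):

```java
import org.springframework.xd.module.core.Module;
import org.springframework.xd.module.core.Plugin;

// Sketch of a Plugin that logs every module as it is deployed.
public class LoggingPlugin implements Plugin {

    @Override
    public boolean supports(Module module) {
        return true; // apply to every deployed module
    }

    @Override
    public void preProcessModule(Module module) {
        // invoked by the ModuleDeployer before the module context is refreshed
        System.out.println("deploying module: " + module.getName());
    }

    @Override
    public void postProcessModule(Module module) { }

    @Override
    public void removeModule(Module module) { }

    @Override
    public void beforeShutdown(Module module) { }
}
```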

How to Add a Spring bean to the XD Container

This section applies to adding a Plugin, which is generally useful since a Plugin has access to every module as it is being deployed (see the previous section on Plugin Architecture). Furthermore, this section describes a generic mechanism for adding any bean definition to the Plugin Context. Spring XD uses both Spring Framework’s class path component scanning and resource resolution to find any components that you add to specified locations in the class path. This means you may provide Java @Configuration and/or any classes annotated with the @Component stereotype in a configured base package in addition to bean definitions defined in any XML or Groovy resource placed under a configured resource location. These locations are given by the properties xd.extensions.locations and xd.extensions.basepackages, optionally configured at the bottom of servers.yml.
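For example, the corresponding entries in servers.yml might look like this (the package name is hypothetical):

```yaml
---
xd:
  extensions:
    basepackages: com.acme.xd.extensions
    locations: META-INF/spring-xd/ext
```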

As the pluralization of these property names suggests, you may represent multiple values as a comma-delimited string. Also note that there is no default for xd.extensions.basepackages, so if you want to use annotation based configuration, you must first set up one or more base package locations. The resource location(s) define the root locations where any XML or Groovy Spring bean definition file found in the given root or any of its subdirectories will be loaded. The root location defaults to META-INF/spring-xd/ext.

The Container loads any bean definitions found in these configured locations on the class path and adds them to the Plugin Context. This is the appropriate application context since in order to apply custom logic to modules, you will most likely need to provide a custom Plugin.

Note

The extension mechanism is very flexible. In theory, one can define BeanPostProcessors, BeanFactoryPostProcessors, or ApplicationListeners to manipulate Spring XD application contexts. Do so at your own risk as the Spring XD initialization process is fairly complex, and not all beans are intended to be extensible.

Extensions are packaged in a jar file which must be added to Spring XD’s class path. Currently, you must manually copy the jar to $XD_HOME/lib for each container instance. To implement a Plugin, you will need to include a compile time dependency on spring-xd-module in your build. To access other container classes and to test your code in a container you will also require spring-xd-dirt.

Providing A new Type Converter

Spring XD supports automatic type conversion to convert payloads declaratively. For example, to convert an object to JSON, you provide the module option --outputType=application/json to a module used in a stream definition. The conversion is enabled by a Plugin that binds a Spring MessageConverter to a media type. The default type converters are currently configured in streams.xml, packaged in spring-xd-dirt-<version>.jar. If you look at that file, you can see an empty list registered as customMessageConverters.

<!-- Users can override this to add converters.-->
<util:list id="customMessageConverters"/>

So registering new type converters is a matter of registering an alternate list as customMessageConverters to the application context. Spring XD will replace the default empty list with yours. xd.messageConverters and customMessageConverters are two lists injected into the ModuleTypeConversionPlugin to build an instance of CompositeMessageConverter which delegates to the first converter in list order that is able to perform the necessary conversion. The Plugin injects the CompositeMessageConverter into the module’s input or output MessageChannel, corresponding to the inputType or outputType options declared for any module in the stream definition (or defined as the module’s default inputType).
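For example, an extension might register its own list as follows (the converter class shown is hypothetical, and the usual beans/util namespace declarations are omitted):

```xml
<util:list id="customMessageConverters">
    <!-- hypothetical converter extending AbstractFromMessageConverter -->
    <bean class="com.acme.xd.converter.GpbMessageConverter"/>
</util:list>
```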

The CompositeMessageConverter is desirable because a module does not generally know what payload type it will get from its predecessor. For example, the converters that Spring XD provides out of the box can convert any Java object, including a Tuple and a byte array to a JSON String. However the methods for converting a byte array or a Tuple are each optimized for the respective type. The CompositeMessageConverter for --outputType=application/json must provide all three methods and the Data Type channel chooses the first converter that applies to both the incoming payload type and the media type (e.g., application/json). Note that the order that the converters appear in the list is significant. In general, converters for specific payload types precede more general converters for the same media type. The customMessageConverters are added after the standard converters in the order defined. So it is generally easier to add converters for new media types than to replace existing converters.

For example, a member of the Spring XD community inquired about Spring XD’s support for Google protocol buffers. This user was interested in integrating Spring XD with an existing messaging system that uses GPB heavily and needed a way to convert incoming and outgoing GPB payloads to interoperate with XD streams. This could be accomplished by providing a customMessageConverters bean containing a list of required message converters. Writing a custom converter to work with XD requires extending AbstractFromMessageConverter provided by spring-xd-dirt. It is recommended to review the existing implementations listed in streams.xml to get a feel for how to do this. In addition, you would likely define a custom MimeType such as application/gpb.

Note

It is worth mentioning that GPB is commonly used for marshaling objects over the network. In the context of Spring XD marshaling is treated as a separate concern from payload conversion. In Spring XD, marshaling happens at the "pipe" indicated by the | symbol using a different serialization mechanism, described below. In this case, the GPB payloads are produced and consumed by systems external to Spring XD and need to be converted in order that a GPB payload can work with XD streams. In this scenario, if the GPB is represented as a byte array, the bytes are transmitted over the network directly and marshaling is unnecessary.

As an illustration, suppose this user has developed a source module that emits GPB payloads from a legacy service. Spring XD provides transform and filter modules that accept SpEL expressions to perform their respective tasks. These modules are useful in many situations but the SpEL expressions generally require a POJO representing a domain type, or a JSON string. In this case it would be convenient to support stream definitions such as
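A hypothetical definition (the module name, domain class, and expression are illustrative only) might be:

```
gpb-source --outputType=application/x-java-object;type=com.acme.MyDomainType | transform --expression=payload.name | log
```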

where gpb-source represents a custom module that emits a GPB payload and expression references some specific object property. The media type application/x-java-object is a convention used by XD to indicate that the payload should be converted to a Java type embedded in the serialized representation (GPB in this example). Alternately, converting to JSON could be performed if the stream definition were:

gpb-source --outputType=application/json | transform --expression=...

To convert an XD stream result to GPB to be consumed by an external service might look like:

source | P1 ... | Pn | gpb-sink --inputType=application/gpb

These examples would require registering custom MessageConverters to handle the indicated conversions.
Alternately, this may be accomplished by writing custom processor modules to perform the required conversion. The above examples would then have stream definitions that look more like:
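For instance, with hypothetical gpb-to-pojo and json-to-gpb processor modules, the equivalent definitions might read:

```
gpb-source | gpb-to-pojo | transform --expression=... | log
source | P1 ... | Pn | json-to-gpb | gpb-sink
```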

While custom processor modules are easier to implement, they add unnecessary complexity to stream definitions that use them. If such conversions are required everywhere, enabling automatic conversion may be worth the effort. Also, note that using a separate module generally requires additional network hops (at each pipe). If a processor module is necessary only to perform a common payload conversion, it is more efficient to install a custom converter.

Adding a New Data Transport

Spring XD offers Redis and Rabbit MQ for data transport out of the box. Transport is configured simply by setting the property xd.transport to redis or rabbit. In addition, xd-singlenode supports a --transport command line option that can also accept local (the single-node default). This simple configuration mechanism is supported internally by an import declaration that binds the transport implementation to a name.

The above snippet is from an internal Spring configuration file loaded into the Shared Server Context. Spring XD provides MessageBus implementations in META-INF/spring-xd/transports/redis-bus.xml and META-INF/spring-xd/transports/rabbit-bus.xml.

This makes it relatively simple for Spring XD developers and advanced users to provide alternate MessageBus implementations to enable a new transport and activate that transport by setting the xd.transport property. For example, to implement a JMS MessageBus you would add a jar containing /META-INF/spring-xd/transports/jms-bus.xml in the class path. This file must register a bean of type MessageBus with the ID messageBus. A jar providing the above configuration file along with the MessageBus implementation and any dependencies must be installed in $XD_HOME/lib.

When implementing a MessageBus, it is advisable to review and understand the existing implementations which extend MessageBusSupport. This base class performs some common tasks including payload marshaling. Spring XD uses the term codec to connote a component that performs both serialization and deserialization and provides a bean with the same name. In the example above, the JMS MessageBus configuration /META-INF/spring-xd/transports/jms-bus.xml might look something like:
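A minimal sketch of such a file (the bean class and constructor arguments are hypothetical; only the messageBus bean ID is required by XD):

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- must be registered with the ID 'messageBus' -->
    <bean id="messageBus" class="com.acme.xd.jms.JmsMessageBus">
        <constructor-arg ref="connectionFactory"/>
        <!-- 'codec' is the serialization bean XD provides under that name -->
        <constructor-arg ref="codec"/>
    </bean>
</beans>
```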

where JmsMessageBus extends MessageBusSupport and the developer is responsible for configuring any dependent JMS resources appropriately.

Optimizing Serialization

Introduction

Spring XD is configured by default to use
Kryo to serialize and deserialize POJO
message payloads when using a remote transport such as Redis, Rabbit, or Kafka.
Note that if the payload is already in the form of a byte array, then XD will
transport it as is. Also if the payload is a String, XD will use getBytes() and
bypass Kryo. Kryo performs favorably compared to other JVM serialization
libraries but may become a bottleneck in high-throughput applications. In rare
cases, custom serialization may be needed to address functional issues. This
section offers some tips and techniques for customizing and optimizing
serialization to address such situations. In accordance with the usual caveats
about premature optimization, don’t apply these techniques unless you really
need to. Furthermore, the time it takes to serialize and deserialize an object
is proportional to factors such as the number of fields, composition, and
collection sizes. Before applying the techniques presented here, consider
colocating modules (via local binding or module composition), or using lighter
payloads.

Serialization Performance

Published serialization
benchmarks have shown that Kryo serialization operations take less than 1000
nanoseconds for reasonably sized objects. The Spring XD team has independently
verified these results. Serializing large objects may take many orders of
magnitude longer. To put this in perspective, let’s suppose that serialization
and deserialization of your payload each take 5000 nanoseconds. For a stream
such as some-source | some-sink running in a distributed XD runtime, moving the
payload from the source to the sink requires serialization at the producer and
deserialization at the consumer. Together, both operations account for 10,000 ns, or
10 microseconds. So the maximum theoretical throughput for this simple stream,
with no partitioning or parallel deployment, is at most 100,000 messages/sec.
This does not account for network latency (transporting the serialized bytes),
overhead in the messaging middleware, or computations performed by the modules.
If your application requires a higher level of throughput than this, serialization
is a potential bottleneck. If your throughput requirements are less demanding,
serialization performance becomes less of an issue. In this hypothetical
scenario, at a throughput of 10,000 messages/second, each message has a budget of
100 microseconds, so the 10 microseconds spent on serialization is only 10% of the
time available to process the message. However, if you are trying to stream large
objects at this rate, serialization time, along with transport and processing
time, will be significantly higher.
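The arithmetic above can be sketched as a quick back-of-the-envelope calculation:

```java
public class SerializationBudget {

    // Upper bound on messages/sec for a single unpartitioned stream,
    // counting only serialization at the producer and deserialization
    // at the consumer (network and middleware overhead ignored).
    static long maxMessagesPerSecond(long serializeNs, long deserializeNs) {
        long perMessageNs = serializeNs + deserializeNs;
        return 1_000_000_000L / perMessageNs;
    }

    public static void main(String[] args) {
        // 5000 ns each way -> 10,000 ns per message -> 100,000 messages/sec
        System.out.println(maxMessagesPerSecond(5_000, 5_000)); // prints 100000
    }
}
```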

Serialization in XD

In Spring XD 1.2.x, MessageBus implementations inject a bean of type
MultiTypeCodec,
where the term Codec refers to a component providing methods to serialize and
deserialize objects. MultiType means that the component can handle any object. Spring
XD is internally configured with a
PojoCodec
that delegates to Kryo and provides hooks to register custom Kryo serializers for
specific types.

Customizing Kryo

By default, Kryo delegates unknown Java types to its FieldSerializer.
Kryo also registers default serializers for each primitive type along with
String, Collection and Map serializers. FieldSerializer uses reflection
to navigate the object graph. A more efficient approach is to implement a custom
serializer that is aware of the object’s structure and can directly serialize
selected primitive fields:
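For example, a hand-written serializer for a hypothetical Trade type might look like this (the Kryo 3.x API, as used by Spring XD at the time, is assumed):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

// Hypothetical domain type with a few simple fields.
class Trade {
    final String symbol;
    final double price;
    final long timestamp;

    Trade(String symbol, double price, long timestamp) {
        this.symbol = symbol;
        this.price = price;
        this.timestamp = timestamp;
    }
}

// Writes the known fields directly, avoiding FieldSerializer's
// reflective traversal of the object graph.
public class TradeSerializer extends Serializer<Trade> {

    @Override
    public void write(Kryo kryo, Output output, Trade t) {
        output.writeString(t.symbol);
        output.writeDouble(t.price);
        output.writeLong(t.timestamp);
    }

    @Override
    public Trade read(Kryo kryo, Input input, Class<Trade> type) {
        return new Trade(input.readString(), input.readDouble(), input.readLong());
    }
}
```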

The Serializer interface exposes Kryo, Input, and Output which provide
complete control over which fields are included and other internal settings as
described in the documentation.

Disabling References

A simple setting may be applied to boost performance if the payload types do
not contain any cyclical references. Kryo enables its references setting by
default so that it can handle object graphs containing cyclical references,
which requires some extra bookkeeping. If your payloads contain no such
references, disabling this setting can have a measurable positive impact on
performance. In Spring XD this is controlled by the
property xd.codec.kryo.references in servers.yml. Set this property to
false to disable references.

Registering a Custom Kryo Serializer in XD

If custom serialization is indicated, please consult the
Kryo documentation since you will be
using the native API. The general requirement is to provide a jar on the Spring XD class
path (copy it to <XD_INSTALL_DIR>/xd/lib). This may be the same jar that
contains the domain objects or a separate jar that contains the custom Kryo
serializer(s) along with a bit of Spring configuration to be imported into the
XD runtime.

First provide one or more classes that extend
com.esotericsoftware.kryo.Serializer. Next provide a Spring @Configuration in
the package spring.xd.bus.ext. For example:
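A sketch of such a configuration follows (the Trade/TradeSerializer types and the registration ID are hypothetical; KryoRegistrationRegistrar ships with Spring XD, so check the javadoc for your release for its exact package and constructor):

```java
package spring.xd.bus.ext;

import java.util.Collections;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.esotericsoftware.kryo.Registration;

// NOTE: import KryoRegistrationRegistrar from the spring-xd jars; its
// package varies by release.
@Configuration
public class CustomKryoRegistrarConfig {

    @Bean
    public KryoRegistrationRegistrar myRegistrar() {
        // 62 is an arbitrary registration ID; it must be unique across all
        // KryoRegistrars on the class path and identical in every container.
        return new KryoRegistrationRegistrar(Collections.singletonList(
                new Registration(Trade.class, new TradeSerializer(), 62)));
    }
}
```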

The above example works by configuring a
KryoRegistrationRegistrar.
This class holds a list of com.esotericsoftware.kryo.Registration instances, each of which
associates a Java class with a Serializer and a unique integer. The integer is the
registration ID for the type which allows Kryo to encode the serialized type
as an integer instead of writing the fully qualified class name. This significantly
reduces the size of the serialized payload. Spring XD will inject any beans of
type KryoRegistrar found on the class path into the PojoCodec. Hence a jar
containing a Spring configuration similar to the above and installed in xd/lib
will register the custom serializer. One caveat is that multiple
KryoRegistrars may contain conflicting registrations. The ID assigned must be
unique, and there may not be multiple registrations for the same class.
PojoCodec will merge and validate all registrations during container
initialization so that any conflicts will result in an exception during
container initialization.

Note

The codec must be configured exactly the same way in every container
instance. Thus it is important to keep custom jars and other related runtime
configuration consistent across containers. The container logs include the Kryo
registration settings.

Implementing KryoSerializable

If you have write access to the domain object source code, it may implement
KryoSerializable as described
here. In this case
the class provides the serialization methods itself and no further configuration
is required for Spring XD. This has the advantage of being much simpler to use
with XD; however, benchmarks have shown this is not quite as efficient as
registering a custom serializer explicitly.

Using DefaultSerializer Annotation

If you have write access to the domain object, annotating it with Kryo’s
DefaultSerializer annotation may be a simpler way to specify a custom
serializer. Note this does not register the class with an
ID, so your mileage may vary. This may be combined with using a
KryoClassMapRegistrar
or
KryoClassListRegistrar
to register objects if necessary, but then there is less benefit to using the
annotation.

Replacing PojoCodec

It is also possible to replace PojoCodec with an implementation of
MultiTypeCodec that uses another serialization library in place of Kryo. XD
does not provide an alternate implementation, but if you were inclined to write
one, a similar configuration, placed in the spring.xd.bus.ext package, would be
required.

Benchmarking

Prior to adding any serialization configuration to XD, we highly recommend
running some benchmark tests to measure serialization of your data in isolation.
It is important to first establish a baseline measurement. Once the baseline
performance is known, you can readily measure the impact of optimizations.
Serialization has been measured on the order of few hundred nanoseconds. At
this scale, it is important to test in an environment which does not have
external processes competing for resources. This type of microbenchmark must
also account for JVM optimizations and garbage collection by "warming up"
(letting the test run for a while before starting the timer), requesting GC and
pausing between runs, and the like. Such tests can also be run with a JVM
profiling tool such as Yourkit to get to the finest level of detail.
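A minimal warm-up-then-measure skeleton might look like the following (java.io serialization is used here only as a stand-in for the codec under test; substitute your own domain type and serializer):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationBench {

    // Stand-in payload; replace with your domain type.
    static class Payload implements Serializable {
        final long id;
        final String name;
        Payload(long id, String name) { this.id = id; this.name = name; }
    }

    static long serializeOnce(Payload p) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(p);
            }
            return bos.size(); // return a value so the JIT cannot elide the work
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Average nanoseconds per serialization, measured after a warm-up phase.
    static double averageNanos(int warmup, int iterations) {
        Payload p = new Payload(42L, "benchmark");
        for (int i = 0; i < warmup; i++) {
            serializeOnce(p); // let the JIT compile the hot path first
        }
        System.gc(); // request GC so a collection is less likely mid-measurement
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += serializeOnce(p);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 0) {
            throw new IllegalStateException("unexpected"); // keeps 'sink' live
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        System.out.printf("avg ns/op: %.0f%n", averageNanos(10_000, 100_000));
    }
}
```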

An excellent resource on JVM serialization benchmarks is the
jvm-serializers project which
incidentally demonstrates that manually optimized Kryo is the fastest among the
libraries tested. The Spring XD samples repository includes a
serialization-benchmarks
project that has co-opted some of the jvm-serializer techniques and contains
sample benchmarks, including one which closely matches results for the
jvm-serializers kryo_manual test using XD’s PojoCodec as an entry point. You
can use one of these samples as a template to benchmark your custom serializer.

where "host" is the container (launcher) host where the syslog module is deployed.

2) Add log rule to log message sources:

log {
source(<message_source>); destination(<destinationName>);
};

3) Make sure to restart the service after the change:

sudo service syslog-ng restart

Now, the syslog messages from the syslog message sources are written into HDFS /xd/<stream-name>/

Configuration Guidelines

Overview

When running a distributed Spring XD runtime, there are a number of considerations related to performance and reliability. In most cases, these involve settings that have tradeoffs, but in this section we provide some background so you know what the options are and how to configure them.

In the Deployment section that follows, we provide detailed information about various properties that can be passed along with the stream deploy command. That section also describes a scenario that is common for minimizing network hops, where direct binding can occur between modules rather than having each pipe within a stream correspond to a send and receive over the Message Bus. For more detail see the Direct Binding subsection.

Another relevant topic for minimizing network hops is the ability to compose modules. That is a useful technique where a subset of the stream’s contiguous modules can be grouped together as if a single module. All of the pipes within the composed module will rely upon a local transport rather than sending and receiving via the Message Bus. For more detail read the Composing Modules section.

When configuring a RabbitMQ Message Bus, you will also want to consider several performance settings. For example, unless strict sequential ordering is required, the prefetch and concurrency values should be overridden (the default for each is 1). That can lead to a significant performance improvement. In the less likely case that performance concerns completely outweigh reliability, you can disable acknowledgements and even disable the persistence of messages. For a listing of these settings and more, refer to the RabbitMQ Configuration section. Several performance related configuration settings exist on the broker itself, and those are well-documented in the RabbitMQ Admin Guide. For example, the vm_memory_high_watermark and vm_memory_high_watermark_paging_ratio are both explained within the Flow Control subsection of the guide.

If you are using the HTTP source module in a stream and want to scale, you can deploy multiple instances by specifying the module.http.count property as described in the Deployment Properties section. Keep in mind that each instance will share the same port value. The default is 9000, but that can be overridden, for all instances, by including --port as an option for the HTTP module in the stream definition. That means you would want to ensure that each container that may be a candidate for deploying one of the HTTP module instances (taking into account the criteria deployment property, if provided) is running on a different host, either physically or on separate virtual machines. Of course, in a production environment, you would likely want to add a load balancer in front of those HTTP endpoints.

Also when using the HTTP source module, you may want to consider enabling support for HTTPS. An example is provided in the documentation for that module’s options.

Deployment Manifest

A stream is composed of modules. Each module is deployed to one or more Container instance(s). In this way, stream processing is distributed among multiple containers. By default, deploying a stream to a distributed runtime configuration uses simple round robin logic. For example, if there are three containers and three modules in a stream definition, s1= m1 | m2 | m3, then Spring XD will attempt to distribute the workload evenly among each container. This is a very simplistic strategy and does not take into account things like:

server load - how many modules are already deployed to a container? What is the current memory and CPU utilization?

server affinity - some containers may have external software installed and specific modules will benefit from co-location. For example, an hdfs sink might be deployed only to hosts running Hadoop. Or perhaps a file sink should be deployed to hosts configured with extra disk space.

scalability - Suppose the stream s1, above, can achieve higher throughput with multiple instances of m2 running, so we want to deploy m2 to every available container.

Generally, more complex deployment strategies are needed to tune and operate XD. We must also consider various features and constraints when deploying to a PaaS, YARN, or some other cluster manager. In addition, Spring XD supports Stream Partitioning and Direct Binding.

To address such deployment concerns, Spring XD provides a Deployment Manifest which is submitted with the deployment request, in the form of in-line deployment properties (or potentially a reference to a separate document containing deployment properties).

Deployment Properties

When you execute the stream deploy shell command, you can optionally provide a comma delimited list of key=value pairs known as deployment properties. Examples for the key include module.[modulename].count and module.[modulename].criteria (for a full list of properties, see below). The value for the count is a positive integer, and the value for criteria is a valid SpEL expression. The Spring XD runtime matches an available container for each module according to the deployment manifest.

The deployment properties allow you to specify deployment instructions for each module. Currently this includes:

The number of module instances

A target server or server group

MessageBus attributes required for a specific module

Stream Partitioning

Direct Binding

History Tracking

Spring XD Shell interaction

When using the Spring XD Shell, there are two ways to provide deployment properties: either inline or via a file reference. The two ways are mutually exclusive and are documented below:

Inline properties

use the --properties shell option and list properties as a comma separated list of key=value pairs, like so:
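For example, using the same properties as the file-based example shown below:

```
stream deploy foo --properties "module.transform.count=2,module.log.criteria=groups.contains('group1')"
```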

Properties file reference

use the --propertiesFile option and point it to a local Java .properties file (i.e. one that lives in the filesystem of the machine running the shell). Being read as a .properties file, normal rules apply (ISO 8859-1 encoding, =, <space> or : delimiter, etc.) although we recommend using = as a key-value pair delimiter for consistency:

stream deploy foo --propertiesFile myprops.properties

where myprops.properties contains

# this is a comment
module.transform.count=2
module.log.criteria = groups.contains('group1')

Those two options apply to the stream deploy and job deploy commands.

General Properties

Note

You can apply criteria to all modules in the stream by using the wildcard * for [modulename]

module.[modulename].trackHistory

A boolean value indicating whether history should be tracked in a message header for this module. Usually used during stream development or for debugging, with module.*.trackHistory=true to track all modules. The xdHistory message header contains an entry for each module that processes the message; each entry includes useful information including the stream name, module label, host, container id, thread name, etc. This enables the determination of exactly how a message was processed through the stream(s).

The number of delivered messages between acknowledgements (when ackMode=AUTO) (default 1)

module.[modulename].consumer.durableSubscription

When true, publish/subscribe named channels (tap:, topic:) will be backed by a durable queue and will be eligible for dead-letter configuration, according to the autoBindDLQ setting. Note that, since RabbitMQ doesn’t permit queue attributes to be changed, changing the durableSubscription property from true to false between deployments, without first removing the queue, will not have any effect. If a stream is deployed with durableSubscription=true, and you wish to change it to a non-durable subscription, you will need to remove the queue from RabbitMQ before redeploying. Spring XD will create the queue with the appropriate settings, unless the queue exists already. Changing from a non-durable subscription to a durable subscription will not have this problem because, for a non-durable subscription, the queue is automatically deleted when the stream is undeployed.

module.[modulename].producer.deliveryMode

The delivery mode of messages sent to RabbitMQ (PERSISTENT or NON_PERSISTENT) (default PERSISTENT)

module.[modulename].producer.requestHeaderPatterns

Controls which message headers are passed between modules (default STANDARD_REQUEST_HEADERS,*)

module.[modulename].producer.replyHeaderPatterns

Controls which message headers are passed between modules (only used in partitioned jobs) (default STANDARD_REPLY_HEADERS,*)

module.[modulename].consumer.autoBindDLQ

When true, the bus will automatically declare dead letter queues and bindings for each bus queue. The user is responsible for setting a policy on the broker to enable dead-lettering; see Message Bus Configuration for more information. The bus will configure a dead-letter exchange (<prefix>DLX) and bind a queue with the name <original queue name>.dlq, routed using the original queue name.

module.[modulename].consumer.republishToDLQ

By default, failed messages are rejected after retries are exhausted. If a dead-letter queue (DLQ) is configured, RabbitMQ will route the failed message (unchanged) to the DLQ. Setting this property to true instructs the bus to republish failed messages to the DLQ, with additional headers, including the exception message and stack trace from the cause of the final failure. Note that the republish will occur even if maxAttempts is set to 1. Also see autoBindDLQ (default false)
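For example, to provision DLQs automatically and republish failures with diagnostic headers for all modules in a (hypothetical) stream:

```
stream deploy mystream --properties "module.*.consumer.autoBindDLQ=true,module.*.consumer.republishToDLQ=true"
```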

module.[modulename].producer.batchingEnabled

Batch messages sent to the bus (default false)

module.[modulename].producer.batchSize

The normal batch size; may be preempted by batchBufferLimit or batchTimeout (default 100)

module.[modulename].producer.batchBufferLimit

If a batch will exceed this limit, the batch will be sent prematurely (default 10000)

module.[modulename].producer.batchTimeout

If no messages are received in this time (ms), the batch will be sent (default 5000)
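The batching properties are typically combined in a single deployment; for example (the stream name and values are illustrative, not recommendations):

```
stream deploy mystream --properties "module.*.producer.batchingEnabled=true,module.*.producer.batchSize=200,module.*.producer.batchTimeout=10000"
```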

Stream Partitioning

Note

Partitioning is not supported with the local transport.

A common pattern in stream processing is to partition the data as it is streamed. This entails deploying multiple instances of a message consuming module and using content-based routing so that messages containing the identical data value(s) are always routed to the same module instance. You can use the Deployment Manifest to declaratively configure a partitioning strategy to route each message to a specific consumer instance.

module.[modulename].producer.partitionKeyExpression

A SpEL expression, evaluated against the message, to determine the partition key; only applies if partitionKeyExtractorClass is null. If both are null, the module is not partitioned (default null)

module.[modulename].producer.partitionSelectorClass

The class name of a PartitionSelectorStrategy(default null)

module.[modulename].producer.partitionSelectorExpression

A SpEL expression, evaluated against the partition key, to determine the partition index to which the message will be routed. The final partition index will be the return value (an integer) modulo [nextModule].count. If both the class and expression are null, the bus’s default PartitionSelectorStrategy will be applied to the key (default null)

In summary, a module is partitioned if its count is > 1 and the previous module has a partitionKeyExtractorClass or partitionKeyExpression (the class takes precedence). When a partition key is extracted, the partitioned module instance is determined by invoking the partitionSelectorClass, if present, or by evaluating partitionSelectorExpression % partitionCount, where partitionCount is count in the case of Redis and RabbitMQ, and the underlying partition count of the topic in the case of Kafka (see the Message Bus section on Kafka partition configuration for details). If neither a partitionSelectorClass nor a partitionSelectorExpression is present, the result is key.hashCode() % partitionCount.

For Redis and Rabbit, the use of partitionKeyExpression and partitionKeyExtractorClass is restricted to sending data to modules that have count > 1. Any other use (i.e. sending data to a module with count = 1, or to a named channel) will result in an error at deployment time.

In the case of Kafka, partitionKeyExpression and partitionKeyExtractorClass may be used for sending data to any modules, including the ones with count = 1, as well as to named channels, since partitioning is based on the partition count of the target topic, and not the receiving module count.

Direct Binding

Sometimes it is desirable to allow co-located, contiguous modules to communicate directly, rather than using the configured remote transport, to eliminate network latency. Spring XD creates direct bindings by default only in cases where every "pair" of producer and consumer (modules bound on either side of a pipe) are guaranteed to be co-located.

Currently Spring XD implements no conditional logic to force modules to be co-located. The only way to guarantee that every producer-consumer pair is co-located is to specify that the pair be deployed to every available container instance; in other words, the module counts must be 0. The figure below illustrates this concept. In the first hypothetical case, we deploy one instance (the default) of producer m1, and two instances of the consumer m2. In this case, enabling direct binding would isolate one of the consumer instances. Spring XD will not create direct bindings in this case. The second case guarantees co-location of the pairs and will result in direct binding.

In addition, direct binding requires that the producer is not configured for partitioning since partitioning is implemented by the Message Bus.

Using module.*.count=0 is the most straightforward way to enable direct binding. Direct binding may be disabled for the stream using module.*.producer.directBindingAllowed=false. Additional direct binding deployment examples are shown below.

Deployment States

The ability to specify criteria to match container instances and deploy multiple instances for each module leads to one of several possible deployment states for the stream as a whole. Consider a stream in an initial undeployed state.

After executing the stream deployment request, the stream will be one of the following states:

Deployed - All modules deployed successfully as specified in the deployment manifest.

Incomplete - One of the requested module instances could not be deployed, but at least one instance of each module definition was successfully deployed. The stream is operational and can process messages end-to-end but the deployment manifest was not completely satisfied.

Failed - At least one of the module definitions was not deployed. The stream is not operational.

Note

The state diagram above represents these states as final. This is an over-simplification since these states are affected by container arrivals and departures that occur during or after the execution of a deployment request. Such transitions have been omitted intentionally but are worth considering. Also, there is an analogous state machine for undeploying a stream, initially in any of these states, which is left as an exercise for the reader.

If there are only two container instances available, only two instances of transform will be deployed. The stream deployment state is incomplete and the stream is functional. However the unfulfilled deployment request remains active and the third instance will be deployed if a new container comes on line that matches the criteria.

Container Attributes

The SpEL context (root object) for module.[modulename].criteria is ContainerAttributes, basically a map derivative that contains some standard attributes:

id - the generated container ID

pid - the process ID of the container instance

host - the host name of the machine running the container instance

ip - the IP address of the machine running the container instance

ContainerAttributes also includes any user-defined attribute values configured for the container. These attributes are configured by editing xd/config/servers.yml; the file included in the XD distribution contains some commented-out sections as examples. In this case, the container attributes configuration looks something like:

xd:
  container:
    groups: group2
    color: red

Groups

Groups may be assigned to a container via the optional command line argument --groups or by setting the environment variable XD_CONTAINER_GROUPS. As the property name suggests, a container may belong to more than one group, represented as a comma-delimited string. The concept of server groups is an especially useful convention for targeting sets of servers for deployment in many common scenarios, so it enjoys special status. Internally, groups is simply a user-defined attribute.

IP Address

The IP address of the container can also be optionally set via the command argument --containerIp or by setting the environment variable XD_CONTAINER_IP. If not specified, the IP address will be automatically set. Please be aware of the limitations, though, particularly in cases where the physical machine has multiple IP addresses assigned.

For the automatic assignment of the IP address, XD internally loops through the available network interfaces and their assigned IP addresses, picking the first available IPv4 address that is not a loopback address.

Depending on your underlying server or network infrastructure, you may prefer specifying the IP address explicitly.

Hostname

The hostname of the container can be optionally set as well via the command argument --containerHostname or by setting the environment variable XD_CONTAINER_HOSTNAME. If not specified, the hostname will be automatically set. Please be aware of the limitations, though. You may prefer specifying the hostname address explicitly.

Tip

While there is no command line option to set the container hostname and IP address when running in Single Node mode, you can still specify the values via environment variables or by customizing the respective settings in application.yml

Stream Deployment Examples

To illustrate how to use the Deployment Manifest, we will use a runtime configuration with 3 container instances, as displayed in the XD shell:

We can see that three instances of the transform processor have been deployed, one to each container instance. Also the log module has been deployed to the container assigned to group1. Now we can undeploy and deploy the stream using a different manifest:

Now there are only two instances of the log module deployed. We asked for three; however, the deployment criteria specify that only containers not in group1 are eligible. The log module is deployed only to the two containers matching the criteria. The deployment status of stream test1 is shown as incomplete. The stream is functional even though the deployment manifest is not completely satisfied. If we fire up a new container that is not in group1, the DeploymentSupervisor will handle any outstanding deployment requests by comparing xd/deployments/modules/requested to xd/deployments/modules/allocated; it will deploy the third log instance and update the stream state to deployed.

Partitioned Stream Deployment Examples

Using SpEL Expressions

The hypothetical SpEL function expensiveTransformation represents a resource intensive processor which we want to load balance by running on multiple containers. In this case, we also want to partition the stream so that payloads containing the same customerId are always routed to the same processor instance. Perhaps the processor aggregates data by customerId and this step needs to run using co-located resources.
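One way to express this scenario as a stream definition and deployment manifest (the stream name, SpEL function registration, and module names are assumptions; adapt them to your actual stream):

```
stream create partitioned --definition="jms | transform --expression=#expensiveTransformation(payload) | log"
stream deploy partitioned --properties "module.jms.producer.partitionKeyExpression=payload.customerId,module.transform.count=3"
```

Note that the partitionKeyExpression is a producer-side property, set on jms, the module feeding the partitioned transform.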

In this example three instances of the transformer will be created (with partition index of 0, 1, and 2). When the jms module sends a message it will take the customerId property on the message payload, invoke its hashCode() method and apply the modulo function with the divisor being the transform.count property to determine which instance of the transform will process the message (payload.getCustomerId().hashCode() % 3). Messages with the same customerId will always be processed by the same instance.

Direct Binding Deployment Examples

In the simplest case, we enforce direct binding by setting the instance count to 0 for all modules in the stream. A count of 0 means deploy the module to all available containers:
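For example (a minimal sketch; the stream definition is an assumption):

```
stream create stream1 --definition="time | log"
stream deploy stream1 --properties "module.*.count=0"
```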

Note that we have two containers, with an instance of each module deployed to each container. Spring XD automatically sets the bus properties needed to allow direct binding, i.e. producer.directBindingAllowed=true on the time module.

Suppose we only want one instance of this stream and we want it to use direct binding. Here we can add deployment criteria to restrict the available containers to group1.

Direct binding eliminates latency between modules but sacrifices some of the resiliency provided by the messaging middleware. In the scenario above, if we lose one of the containers, we lose messages. To disable direct binding when module counts are set to 0, set module.*.producer.directBindingAllowed=false.

Finally, we can still have the best of both worlds by enabling guaranteed delivery at one point in the stream, usually the source. If the tail of the stream is co-located and the source uses the message bus, the message bus may be configured so that if a container instance goes down, any unacknowledged messages will be retried until the container comes back or its modules are redeployed.

TBD: A realistic example

An alternate scenario with similar characteristics would be if the stream uses a rabbit or jms source. In this case, guaranteed delivery would be configured in the external messaging system instead of the Spring XD transport.

Troubleshooting

Debugging a distributed system to diagnose problems can be challenging. The following are some problems commonly encountered while using Spring XD, along with recommendations and likely causes.

Reason: ZooKeeper requires a heartbeat at a regular interval to test liveness of connected processes. Full "stop the world" GCs can result in connection and session timeouts from ZooKeeper. While verbose, GC logs are helpful for diagnosing this and other performance issues.

Debugging Slowness

Reason: Examination of thread dumps can reveal stuck or slow moving threads. This data is useful for determining the root cause of a slow or unresponsive application.

File Descriptors and limit violation

Problem: java.io.FileNotFoundException: (Too many open files)

Recommendation: The default ulimit setting in most UNIX-based operating systems is 1024. Raise the ulimit setting to at least 10000.

Reason: Stream and job modules in Spring XD are loaded and unloaded dynamically on demand. When a module is unloaded, the associated class loaders may not be garbage collected right away, resulting in open file handles for the jar files used by the module. Depending on the number of modules in use, the file handle limit of 1024 may be exceeded.

Message Bus Configuration

Introduction

This section contains additional information about configuring the Message Bus, including High Availability, SSL,
Error handling and partitioning.

Rabbit Message Bus High Availability (HA) Configuration

Introduction

First, use the addresses property in servers.yml to include the host/port for each server in the cluster. See Application Configuration.

By default, queues and exchanges declared by the bus are prefixed with xdbus. (this prefix can be changed as described in Application Configuration).

To configure the entire bus for HA, create a policy:

rabbitmqctl set_policy ha-xdbus "^xdbus\." '{"ha-mode":"all"}'

Connection Management and HA Queues

When consuming from HA queues, there might be some performance advantage in consuming from the node that actually hosts
the queue.
Starting with version 1.2, it is possible to configure the Rabbit Message Bus to do that.

Caution

To utilize this mechanism, the rabbit management plugin must be enabled on each node in the cluster.
The plugin’s REST API is used to determine the location of the queue.

This feature is enabled by adding more than one node to the spring.rabbitmq.node property.
See RabbitMQ Configuration for configuration details.

When a node fails and a queue is moved to one of the mirrors, the bus will automatically reconnect to the right node.

Error Handling (Message Delivery Failures)

RabbitMQ Message Bus

Note

The following applies to normally deployed streams. When direct binding between modules is being used, exceptions thrown by the consumer are thrown back to the producer.

When a consuming module (processor, sink) fails to handle a message, the bus will retry delivery based on the module (or default bus) retry configuration. The default configuration will make 3 attempts to deliver the message. The retry configuration can be modified at the bus level (in servers.yml), or for an individual stream/module using the deployment manifest.

When retries are exhausted, by default, messages are discarded. However, using RabbitMQ, you can configure such messages to be routed to a dead-letter exchange/dead letter queue. See the RabbitMQ Documentation for more information.

Note

The following configuration examples assume you are using the default bus prefix used for naming rabbit elements: "xdbus."

The first pipe (by default) will be backed by a queue named xdbus.foo.0, the second by xdbus.foo.1. Messages are routed to these queues using the default exchange (with routing keys equal to the queue names).

To enable dead lettering just for this stream, first configure a policy:
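A sketch of such a policy (the policy name is arbitrary; the exchange name foo.dlx matches the binding example discussed below):

```
rabbitmqctl set_policy foo.dlx "^xdbus\.foo\." '{"dead-letter-exchange":"foo.dlx"}'
```

This applies a dead-letter exchange only to the queues backing stream foo.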

The next step is to declare the dead letter exchange, and bind dead letter queues with the appropriate routing keys.

For example, for the second "pipe" in the stream above we might bind a queue foo.sink.dlq to exchange foo.dlx with a routing key xdbus.foo.1 (remember, the original routing key was the queue name).

Now, when the sink fails to handle a message, after the configured retries are exhausted, the failed message will be routed to foo.sink.dlq.

There is no automated mechanism provided to move dead lettered messages back to the bus queue.

Automatic Dead Lettering Queue Binding

Starting with version 1.1, the dead letter queue and binding can be automatically configured by the system. A new property autoBindDLQ has been added; it can be set at the bus level (in servers.yml) or using deployment properties, e.g. --properties module.*.consumer.autoBindDLQ=true for all modules in the stream. When true, the dead letter queue will be declared (if necessary) and bound to a dead letter exchange named xdbus.DLX (again, assuming the default prefix) using the queue name as the routing key.

In the above example, where we have queues xdbus.foo.0 and xdbus.foo.1, the system will also create xdbus.foo.0.dlq, bound to xdbus.DLX with routing key xdbus.foo.0 and xdbus.foo.1.dlq, bound to xdbus.DLX with routing key xdbus.foo.1.

Note

Starting with version 1.2, any queues that are deployed with autoBindDLQ will automatically be configured to enable dead-lettering, routing to the DLX with the proper routing key. It is no longer necessary to use a policy to set up dead-lettering when using autoBindDLQ.

Also, starting with version 1.2, the provision of dead-lettering on publish/subscribe named channels (tap: or topic:) depends on a new deployment property durable.
This property is similar to a JMS durable subscription to a topic and is false by default.
When false (default), the queue backing such a named channel is declared auto-delete and is removed when the stream is undeployed.
A DLQ will not be created for such queues.
When true, the queue becomes permanent (durable) and is not removed when the stream is undeployed.
Also, when true, the queue is eligible for DLQ provisioning, according to the autoBindDLQ deployment property.
durable can be set at the bus level, or in an individual deployment property, such as:
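For example, at the deployment-property level, using the consumer property documented under General Properties (the stream and module names are hypothetical):

```
stream deploy mytapstream --properties "module.log.consumer.durableSubscription=true"
```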

Redis Message Bus

When Redis is the transport, failed messages (after retries are exhausted) are LPUSHed to a LIST named ERRORS:<stream>.n (e.g. ERRORS:foo.1 in the above example in the RabbitMQ Message Bus section).

This is unconditional; the data in the ERRORS LIST is in "bus" format; again, as with the RabbitMQ Message Bus, some external mechanism would be needed to move the data from the ERRORS LIST back to the bus’s foo.1 LIST.

Note

When moving errored messages back to the main stream, it is important to understand that these messages contain binary data and are unlikely to survive conversion to and from Unicode (such as with Java String variables). If you use Java to move these messages, we recommend that you use a RedisTemplate configured as follows:

Rabbit Message Bus Secure Sockets Layer (SSL)

First configure the broker as described in the RabbitMQ SSL documentation. The message bus is a client of the broker and supports both of the described configurations for connecting clients (SSL without certificate validation and with certificate validation).

To use SSL without certificate validation, simply set

spring:
  rabbitmq:
    useSSL: true

in servers.yml (and set the port(s) in the addresses property appropriately).

The sslProperties property is a Spring resource (file:, classpath:, etc.) that points to a properties file. Typically, this file would be secured by the operating system (and readable by the XD container) because it contains security information. Specifically:
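A sketch of such a properties file, assuming the key/trust store property names used by the bus’s SSL support; the file locations and passphrases are hypothetical:

```
keyStore=file:/secret/client/keycert.p12
keyStore.passPhrase=secret
trustStore=file:/secret/trustStore
trustStore.passPhrase=secret
```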

Where the pkcs12 keystore contains the client certificate and the truststore contains the server’s certificate as described in the rabbit documentation. The key/trust store properties are Spring resources.

Alternatively, you may specify these properties in-line in lieu of using an external properties file.
In this case, the passphrase properties may be encrypted.
See Encrypted Properties for more details.
When both sslProperties and in-line ssl properties are configured, the in-line properties take precedence.

Note

By default, the rabbit source and sink modules inherit their default configuration from the container, but it
can be overridden, either using modules.yml or with specific module definitions.

Rabbit Message Bus Batching and Compression

Removing RabbitMQ MessageBus Resources

When a stream or job is undeployed, the broker resources (queues, exchanges) are NOT removed from RabbitMQ.
This is because a stream might be undeployed only temporarily, and retaining the resources avoids message loss.

If you wish to completely remove these resources, a REST API is provided for this purpose. In addition, the
SpringXDTemplate provides a Java binding for this REST API via its streamOperations().cleanBusResources(String name)
and jobOperations().cleanBusResources(String name) methods.

Kafka Message Bus Partition Control

This section describes how topic partitioning functions when using Kafka as transport.

Controlling the partition count of a transport topic

The KafkaMessageBus will attempt to set the number of partitions in a transport topic to consumerCount * consumerConcurrency,
either by creating the topic with the required number of partitions, or by repartitioning it if it already exists.

For example, let’s consider a stream with the following definition:

stream create ingest --definition="http | hdfs"

A default deployment will result in the creation of a single topic with a single partition.

stream deploy ingest

A deployment (or redeployment) of the same stream with a different module count and concurrency will result in
6 partitions, evenly distributed across the 3 module instances:
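A sketch of such a redeployment (the property values are illustrative): three hdfs instances, each consuming with a concurrency of 2, yields 3 × 2 = 6 partitions:

```
stream deploy ingest --properties "module.hdfs.count=3,module.hdfs.consumer.concurrency=2"
```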

Besides relying on defaults, you can customize the number of Kafka partitions used by transport topics by indicating a
minimum value to be used by deployments (if the minimum is smaller than consumerCount * consumerConcurrency, the
latter value is used instead).

This can be done globally by changing the xd.messagebus.kafka.default.minPartitionCount property in
servers.yml:

xd:
  messagebus:
    kafka:
      default:
        minPartitionCount: 5

This will result in creating at least 5 partitions for each transport topic.

Alternatively, and for more granular control, the property can be specified for specific deployments and modules,
through the producer.minPartitionCount property in the deployment manifest, as in the following example, where
10 partitions will be created:
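For instance, a deployment manifest along these lines (a sketch, assuming the ingest stream shown earlier) requests at least 10 partitions for the topic between the two modules:

```
stream deploy ingest --properties "module.http.producer.minPartitionCount=10"
```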

Overpartitioning can serve a number of purposes, such as load balancing and distributing data among brokers, as well as
allowing for scaling up by increasing the number of concurrent consumers in the future.

Note

If the Kafka topic already exists and it already has a number of partitions larger than either minPartitionCount
or consumerCount * consumerConcurrency, its partition count will remain unchanged, and the Kafka transport will operate
with all the existing partitions.

Administration

Monitoring and Management

Spring XD uses Spring Boot’s monitoring and management support over HTTP and JMX along with Spring Integration’s MBean Exporters.

Monitoring XD Admin, Container and Single-node servers

JMX is disabled by default. To enable JMX, set XD_JMX_ENABLED=true. JMX is disabled by default due to performance issues when message rates are over 100K (for ~100 byte messages). Performance-related issues will be addressed in a future release.

Spring Integration components are exposed over JMX using the IntegrationMBeanExporter.

Once JMX is enabled, all the available MBeans can be accessed over HTTP using Jolokia.

If you want to disable Jolokia endpoints but still want to use JMX, then you can set this property in config/servers.yml:

endpoints:
  jolokia:
    enabled: false

To enable boot provided management endpoints over HTTP

The Spring Boot management endpoints are exposed over HTTP.

When starting the admin, container, or singlenode server, the command-line option --mgmtPort can be specified to use an explicit port for the management server. Given a valid management port,
the management endpoints can be accessed from that port. Please refer to the Spring Boot documentation for more details on the endpoints.

For instance, once XD admin is started on localhost and the management port set to use the admin port (9393)
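For example, following the command style used elsewhere in this guide:

```
xd/bin>$ ./xd-admin --mgmtPort 9393
```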

To enable the container shutdown operation in the UI

Add the following configuration to config/servers.yml. This configuration is available as a commented section in config/servers.yml.

---
spring:
  profiles: container
management:
  port: 0

To disable boot endpoints over HTTP

Set management.port=-1 for both default and container profiles in config/servers.yml

Management over JMX

All the boot endpoints are exposed over JMX with the domain name org.springframework.boot.
The MBeans exposed at the XD admin and container server level are available with the domain names xd.admin (for the XD admin), xd.container (for the XD container), xd.shared.server, and xd.parent, the latter two representing the application contexts common to both the XD admin and container. A singlenode server exposes all of these domain names.
When a stream/job is deployed into the XD container, the stream/job MBeans are exposed with a specific domain/object naming strategy.

Monitoring deployed modules in XD container

When a module is deployed (with JMX enabled on the XD container), the IntegrationMBeanExporter is injected into the module’s context via the MBeanExportingPlugin, which exposes all the Spring Integration components inside the module. For a given module, the IntegrationMBeanExporter uses a specific object naming strategy that assigns the domain name as xd.<stream/job name> and the object name as <module name>.<module index>.

Streams

A stream named mystream with the DSL http | log will have:

MBeans with domain name xd.mystream and two objects, http.0 and log.1

Source, processor, and sink modules will generally have the following attributes and operations

Searching via the Jolokia endpoint for MBeans of type MessageChannel will list all the MessageChannel MBeans exposed in the XD container.
Apart from this, other available domains and types can be accessed via the Jolokia endpoints.

REST API

Introduction

The Spring XD Administrator process (Admin) provides a REST API to access various Spring XD resources such as streams, jobs, metrics, modules, Spring batch resources, and container runtime information. The REST API is used internally by the XD Shell and Admin UI and can support any custom client application that requires interaction with XD.

The HTTP port is configurable and may be set as a command line argument when starting the Admin server, or set in $XD_HOME/config/servers.yml. The default port is 9393: