SDC RPC Pipelines

SDC RPC Pipeline Overview

Data Collector Remote Protocol Call pipelines, also known as SDC RPC pipelines, are a set of
StreamSets pipelines that pass data from one pipeline to another without writing to an
intermediary system.

SDC RPC pipelines can write to each other on the same machine, over a local network, or over
the public internet. For example, you might use SDC RPC pipelines to send data securely
between two data centers.

Typically, a pipeline uses a standard origin such as Directory and writes to a standard
destination such as HBase. An SDC RPC pipeline includes an SDC RPC destination or an SDC RPC
origin to communicate with another SDC RPC pipeline.

To use SDC RPC pipelines, you create an origin pipeline and a destination pipeline. The
origin pipeline uses an SDC RPC destination to write directly to an SDC RPC origin in the
destination pipeline.

The SDC RPC destination and SDC RPC origin enable you to pass data securely from one pipeline
to another, which is effectively like creating a single pipeline that spans a network.

Pipeline Types

You can create two kinds of SDC RPC pipelines:

origin pipeline

Processes data from the origin system and passes it to the destination pipeline.

Uses an SDC RPC destination to pass data to a destination pipeline. To provide
redundancy and load-balancing, you can define connections to multiple destination
pipelines.

destination pipeline

Processes data from the origin pipeline and passes it to the destination system.

Uses an SDC RPC origin to process data from the origin pipeline.

Deployment Architecture

When using SDC RPC pipelines, consider your needs and environment carefully as you
design the deployment architecture.

Note that the origin pipeline writes data to one destination pipeline at a time, but can
round-robin through multiple destination pipelines. By using multiple destination pipelines,
you can provide redundancy and avoid bottlenecks with a high-volume origin pipeline.
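The round-robin behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Data Collector code: the connection strings, the `send` callable, the `DestinationUnavailable` exception, and the choice to rotate once per batch and skip an unreachable destination are all assumptions made for the example.

```python
from itertools import cycle


class DestinationUnavailable(Exception):
    """Raised by the hypothetical send function when a destination is unreachable."""


def make_round_robin_writer(connections, send):
    """Return a function that writes each batch to exactly one connection.

    connections: list of "host:port" strings (illustrative values only).
    send: callable(connection, batch) that raises DestinationUnavailable on failure.
    """
    if not connections:
        raise ValueError("at least one destination pipeline is required")
    rotation = cycle(connections)

    def write(batch):
        # Try each connection at most once per batch, starting with the next in rotation.
        for _ in range(len(connections)):
            target = next(rotation)
            try:
                send(target, batch)
                return target
            except DestinationUnavailable:
                continue  # Redundancy: move on to the next destination pipeline.
        raise DestinationUnavailable("no destination pipeline accepted the batch")

    return write


# Demo with a fake transport: the second destination is down.
received = {}
down = {"dc2-b:20000"}


def fake_send(connection, batch):
    if connection in down:
        raise DestinationUnavailable(connection)
    received.setdefault(connection, []).append(batch)


write = make_round_robin_writer(
    ["dc2-a:20000", "dc2-b:20000", "dc2-c:20000"], fake_send
)
targets = [write(f"batch-{i}") for i in range(4)]
print(targets)
```

In this sketch, batches alternate between the two reachable destinations while the unreachable one is skipped, which is the combination of load distribution and redundancy that multiple destination pipelines are meant to provide.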

If you have multiple pipelines with similar data, you might deploy several instances of
this model to provide load-balancing and redundancy.

Configuring the Delivery Guarantee

The delivery guarantee determines when a pipeline commits the offset. When configuring
the delivery guarantee for SDC RPC pipelines, use the same option in origin and destination
pipelines.

A set of SDC RPC pipelines processes data like a single pipeline: the origin pipeline creates
a batch, passes it through the pipeline, then passes it to the destination pipeline. Data
Collector commits the offset only when the destination pipeline writes the batch to its
destination systems. As with standard pipelines, you can use the delivery guarantee property
to define how Data Collector commits the offset:

Use At Least Once in both pipelines to ensure the pipelines process all data.

Use At Most Once in both pipelines to avoid the possible duplication of data.

Note: If the SDC RPC pipelines are configured to use different delivery guarantees, the resulting
behavior is At Most Once.
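
The trade-off between the two options, and the rule for mismatched settings, can be illustrated with a small Python model. This is a generic, simplified model of the two guarantees and not Data Collector's implementation: the constant names, `effective_guarantee`, `simulate`, and the single simulated crash are assumptions introduced for the example. It assumes At Most Once commits the offset before the write is confirmed, and At Least Once commits only after the write.

```python
AT_LEAST_ONCE = "AT_LEAST_ONCE"
AT_MOST_ONCE = "AT_MOST_ONCE"


def effective_guarantee(origin_setting, destination_setting):
    """Return the resulting behavior for a pair of SDC RPC pipelines.

    Mismatched settings behave as At Most Once, as stated in the note above.
    """
    for setting in (origin_setting, destination_setting):
        if setting not in (AT_LEAST_ONCE, AT_MOST_ONCE):
            raise ValueError(f"unknown delivery guarantee: {setting!r}")
    if origin_setting == destination_setting == AT_LEAST_ONCE:
        return AT_LEAST_ONCE
    return AT_MOST_ONCE


def simulate(batches, guarantee, crash_at):
    """Replay batches with one simulated crash while batch index crash_at is in flight.

    Returns the list of batches that reached the destination system.
    """
    written = []
    offset = 0          # Last committed position; a restart resumes from here.
    crashed = False
    while offset < len(batches):
        hit_crash = offset == crash_at and not crashed
        if guarantee == AT_MOST_ONCE:
            batch = batches[offset]
            offset += 1                      # Commit the offset first.
            if hit_crash:
                crashed = True               # Crash before the write lands: batch is lost.
                continue
            written.append(batch)
        else:
            written.append(batches[offset])  # Write first.
            if hit_crash:
                crashed = True               # Crash before the commit: batch is re-read.
                continue
            offset += 1                      # Commit only after the write.
    return written


print(effective_guarantee(AT_LEAST_ONCE, AT_MOST_ONCE))
print(simulate(["a", "b", "c"], AT_MOST_ONCE, crash_at=1))
print(simulate(["a", "b", "c"], AT_LEAST_ONCE, crash_at=1))
```

Under these assumptions, the At Most Once run drops the in-flight batch but never repeats one, while the At Least Once run delivers every batch and repeats the one that was in flight during the crash.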