What's New

What's New in 3.1.1.0

Directory origin enhancements - The Max Files in Directory property has been
renamed to Max Files Soft Limit. As the name indicates, the property is now
a soft limit rather than a hard limit: if the directory contains more files
than the configured Max Files Soft Limit, the origin can temporarily exceed
the soft limit and the pipeline can continue running.

Previously, this
property was a hard limit. When the directory contained
more files than the limit, the pipeline failed.

The origin also includes a new Spooling Period property that
determines the number of seconds to continue adding files to
the processing queue after the Max Files Soft Limit has
been exceeded.

Destinations

Einstein Analytics destination enhancement - The Append
Timestamp to Alias property is now disabled by default for new
pipelines. When disabled, the destination can append, delete,
overwrite, or upsert data to an existing dataset. When enabled, the
destination creates a new dataset for each upload of data.

The
property was added in version 3.1.0.0 and was enabled by
default. Pipelines upgraded from versions earlier than 3.1.0.0
have the property enabled by default.

What's New in 3.1.0.0

Data Collector
version 3.1.0.0 includes the following new features and enhancements:

Data Synchronization Solution for Postgres

This release includes a beta version of the Data Synchronization Solution for Postgres. The solution uses
the new Postgres Metadata processor to detect drift in incoming data and
automatically create or alter corresponding PostgreSQL tables as needed
before the data is written. The solution also leverages the JDBC Producer
destination to perform the writes.

As a beta feature, use the Data Synchronization Solution for Postgres for
development or testing only. Do not use the solution in production
environments.

Support for additional databases is planned for future releases. To state a
preference, leave a comment on this issue.

HTTP Client origin enhancement - You can now configure
the origin to use the Link in Response Field pagination type. After
processing the current page, this pagination type uses a field in
the response body to access the next page.

New Field Replacer processor - The Field Replacer processor replaces the
Value Replacer processor, which has been deprecated. The Field Replacer
processor lets you define more complex conditions for replacing values.
For example, it can replace values that fall within a specified range,
which the Value Replacer cannot do.

Aggregator processor enhancement - You can now specify the root field for
event records, using a String or Map root field. Upgraded pipelines retain
the previous behavior, writing aggregation data to a String root
field.

JDBC Lookup processor enhancement - The processor includes the
following enhancements:

You can now configure a Missing Values Behavior property that defines
processor behavior when a lookup returns no value. Upgraded
pipelines continue to send records with no return value to
error.

You can now enable the Retry on Cache Miss property so that the
processor retries lookups for known missing values. By
default, the processor always returns the default value for
known missing values to avoid unnecessary lookups.

Kudu Lookup processor enhancement - The processor no
longer requires that you add a primary key column to the Key Columns
Mapping. However, adding only non-primary keys can slow the
performance of the lookup.

runtime:loadResource() - This function has been changed to
trim any leading or trailing whitespace characters from the
file before returning the value in the file. Previously, the
function did not trim whitespace characters - you had to
avoid including unnecessary characters in the file.

runtime:loadResourceRaw() - New function that returns the
value in the specified file, including any leading or
trailing whitespace characters in the file.

Data Collector classpath validation - Data Collector now
performs a classpath health check at startup. The results of
the health check are written to the Data Collector log. When
necessary, you can configure Data Collector to skip the health check
or to stop upon errors.

Support bundle Data Collector property - You can
configure a property in the Data Collector configuration file to
have Data Collector automatically upload support bundles when
problems occur. The property is disabled by default.

Scripting processors enhancement - The Groovy Evaluator, JavaScript Evaluator,
and Jython Evaluator processors can use a new boolean sdcFunctions.isPreview()
method to determine whether the pipeline is in preview mode.
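
For example, a Jython Evaluator script can branch on the new method to avoid side
effects while previewing. A minimal sketch, assuming the processor's standard script
bindings (records, output, error, sdcFunctions); the field name is illustrative:

    # Jython Evaluator script sketch: skip external side effects during preview.
    for record in records:
        try:
            # Mark records instead of calling an external system while in preview mode.
            record.value['previewOnly'] = sdcFunctions.isPreview()
            output.write(record)
        except Exception as e:
            error.write(record, str(e))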

What's New in 3.0.2.0

Data Collector
version 3.0.2.0 includes the following enhancement:

SFTP/FTP Client origin enhancement - The origin can now generate events when
starting and completing processing for a file and when all available files have
been processed.

Previously, StreamSets provided a single RPM package used to install Data Collector on any of these operating systems.

Edge Pipelines

You can now design and run edge
pipelines to read data from or send data to an edge device. Edge
pipelines are bidirectional. They can send edge data to other Data Collector pipelines for further processing. Or, they can receive data from other
pipelines and then act on that data to control the edge device.

Edge pipelines run in edge execution mode on StreamSets Data Collector Edge (SDC Edge). SDC Edge is a lightweight agent without a UI that runs pipelines on edge devices.
Install SDC Edge on each edge device where you want to run edge pipelines.

You design edge pipelines in Data Collector, export the edge pipelines, and then use commands to run the edge
pipelines on an SDC Edge installed on an edge device.

Origins

New
Amazon SQS Consumer origin - An origin that reads
messages from Amazon Simple Queue Service (SQS). It can create multiple
threads to enable parallel processing in a multithreaded
pipeline.

New UDP
Multithreaded Source origin - The origin listens for UDP
messages on one or more ports and queues incoming packets on an
intermediate queue for processing. It can create multiple threads to
enable parallel processing in a multithreaded pipeline.

Oracle CDC Client origin enhancements - The origin now uses local buffering
instead of Oracle LogMiner buffering by default. Upgraded pipelines require no changes.

The origin now supports reading the Timestamp with Timezone data type. When reading
Timestamp with Timezone data, the origin includes the offset
with the datetime data in the Data Collector Zoned Datetime data type. It does not include the time
zone ID.

SQL Server CDC Client origin enhancements - You can now perform the
following tasks with the SQL Server CDC Client origin:

In addition, a new Capture Instance Name
property replaces the Schema and Table Name Pattern
properties from earlier releases.

You can simply use the schema name and table
name pattern for the capture instance name. Or, you can
specify the schema name and a capture instance name
pattern, which lets you select specific CDC tables
to process when you have multiple CDC tables for a
single data table.

Upgraded pipelines require no changes.

UDP Source origin enhancement - The Enable Multithreading property
that enabled using multiple epoll receiver threads is now named Use
Native Transports (epoll).

Processors

New
Aggregator processor - A processor that aggregates data
within a window of time. Displays the results in Monitor mode and
can write the results to events.

New Delay
processor - A processor that can delay processing a batch
of records for a specified amount of time.

HTTP Client processor enhancement - The Rate Limit now
defines the minimum amount of time between requests in milliseconds.
Previously, it defined the time between requests in seconds.
Upgraded pipelines require no changes.

JDBC Lookup and JDBC Tee processor enhancements - You can now specify an
Init Query to be executed after establishing a connection to the
database, before performing other tasks. This can be used, for
example, to modify session attributes.

Kudu Lookup processor enhancement - The Cache Kudu Table
property is now named Enable Table Caching. The Maximum Entries to
Cache Table Objects property is now named Maximum Table Entries to
Cache.

Salesforce Lookup processor enhancement - You can use a new Retrieve lookup mode to look up data for a set of
records instead of record-by-record. The mode provided in previous
releases is now named SOQL Query. Upgraded pipelines require no
changes.

HTTP Client destination enhancement - You can now use the HTTP
Client destination to write Avro, Delimited, and Protobuf data in
addition to the previous data formats.

JDBC Producer destination enhancement - You can now
specify an Init Query to be executed after establishing a connection
to the database, before performing other tasks. This can be used,
for example, to modify session attributes.

Kudu destination enhancement - If the destination
receives a change data capture log from the following source
systems, you now must specify the source system in the Change Log
Format property so that the destination can determine the format of
the log: Microsoft SQL Server, Oracle CDC Client, MySQL Binary Log,
or MongoDB Oplog.

Salesforce destination enhancement - When using the
Salesforce Bulk API to update, insert, or upsert data, you can now
use a colon (:) or period (.) as a field separator when defining the
Salesforce field to map the Data Collector field to. For example,
Parent__r:External_Id__c or
Parent__r.External_Id__c are both valid
Salesforce fields.

Wave Analytics destination rename - With this release, the Wave
Analytics destination is now named the Einstein Analytics destination, following the recent
Salesforce rebranding. All of the properties and functionality of
the destination remain the same.

JDBC Query executor enhancement - You can now specify an
Init Query to be executed after establishing a connection to the
database, before performing other tasks. This can be used, for
example, to modify session attributes.

Cloudera Navigator

Cloudera Navigator integration is now released as part of the StreamSets
Commercial Subscription. The beta version included in earlier releases is no
longer available with Data Collector. For information about the StreamSets Commercial Subscription, contact us.

CyberArk - Data Collector now provides a credential
store implementation for CyberArk Application Identity Manager. You
can define the credentials required by external systems - user names
or passwords - in CyberArk. Then you use credential expression
language functions in JDBC stage properties to retrieve those
values, instead of directly entering credential values in stage
properties.

Supported stages - You can now use the credential
functions in all stages that require you to enter sensitive
information. Previously, you could only use the credential functions
in JDBC stages.

Data Collector Configuration

By default when Data Collector restarts, it automatically restarts all pipelines that were running
before Data Collector shut down. You can now disable the automatic restart
of pipelines by configuring the runner.boot.pipeline.restart
property in the $SDC_CONF/sdc.properties file.

Dataflow Performance Manager / StreamSets Control Hub

StreamSets Control Hub - With this release, we have created a new product called StreamSets Control Hub™ (SCH) that includes a number of new cloud-based dataflow
design, deployment, and scale-up features. Since this is now
our core service for controlling dataflows, we have renamed the
StreamSets cloud experience from "Dataflow Performance Manager
(DPM)" to "StreamSets Control Hub (SCH)".

DPM now refers to the performance management functions
that reside in the cloud such as live metrics and data SLAs.
Customers who have purchased the StreamSets Enterprise Edition
will gain access to all SCH functionality and continue to have
access to all DPM functionality as before.

Legacy stage libraries - Stage libraries that are more
than two years old are no longer included with Data Collector. Though not recommended, you can still download and install the
older stage libraries as custom stage libraries.

If you have
pipelines that use these legacy stage libraries, you will need
to update the pipelines to use a more current stage library or
install the legacy stage library manually. For more information,
see Update Pipelines using Legacy Stage
Libraries.

New Data Collector metrics - JVM metrics have been renamed Data Collector Metrics and now include general Data Collector metrics in addition to JVM metrics. The JVM Metrics menu item has
also been renamed SDC Metrics.

Pipeline error records - You can now write error records
to Google Pub/Sub, Google Cloud Storage, or an MQTT broker.

Snapshot enhancements:

Standalone pipelines can now automatically take a failure snapshot when the pipeline fails due to
a data-related exception.

Time zone enhancement - Time zones have been organized and updated
to use JDK 8 names. This should make it easier to select time zones
in stage properties. In the rare case that your pipeline uses a
format not supported by JDK 8, edit the pipeline to select a
compatible time zone.

Oracle CDC Client origin enhancement - You can now
configure a JDBC Fetch Size property to determine the minimum number
of records that the origin waits for before passing data to the
pipeline. When writing to the destination is slow, use the default
of 1 record to improve performance. Previously, the origin used the
Oracle JDBC driver default of 10 records.

Executor

New MapR FS File Metadata executor - The new executor
can change file metadata, create an empty file, or remove a file or
directory in MapR each time it receives an event.

What's New in 2.7.1.0

Data Collector
version 2.7.1.0 includes the following new features and enhancements:

What's New in 2.7.0.0

Data Collector
version 2.7.0.0 includes the following new features and enhancements:

Credential Stores

Data Collector now has a credential
store API that integrates with the following credential store
systems:

Java keystore

Hashicorp Vault

You define the credentials required by external systems - user names,
passwords, or access keys - in a Java keystore file or in Vault. Then you
use credential expression language functions in JDBC stage
properties to retrieve those values, instead of directly entering credential
values in stage properties.

The following JDBC stages can use the new credential functions:

JDBC Multitable Consumer origin

JDBC Query Consumer origin

Oracle CDC Client origin

SQL Server CDC Client origin

SQL Server Change Tracking origin

JDBC Lookup processor

JDBC Tee processor

JDBC Producer destination

JDBC Query executor

Publish Pipeline Metadata to Cloudera Navigator (Beta)

Data Collector now provides beta support for publishing metadata about running pipelines
to Cloudera Navigator. You can then use Cloudera Navigator to explore the
pipeline metadata, including viewing lineage diagrams of the metadata.

Feel free to try out this feature in a development or test Data Collector, and send us your feedback. We are continuing to refine metadata publishing
as we gather input from the community and work with Cloudera.

New Hadoop user impersonation property - When you enable
Data Collector to impersonate the current Data Collector user when
writing to Hadoop, you can now also configure Data Collector to make the
username lowercase. This can be helpful with case-sensitive
implementations of LDAP.

New Java security properties - The Data Collector
configuration file now includes properties with a "java.security."
prefix, which you can use to configure Java security properties.

SDC_HEAPDUMP_PATH enhancement - The new default file name,
$SDC_LOG/sdc_heapdump_${timestamp}.hprof, includes
a timestamp so you can write multiple heap dump files to the specified
directory.

Dataflow Triggers

Pipeline events - The event framework now generates pipeline
lifecycle events when the pipeline stops and starts. You can pass each
pipeline event to an executor or to another pipeline for more complex
processing. Use pipeline events to trigger tasks before pipeline
processing begins or after it stops.

Directory origin event enhancements - The Directory origin
can now generate no-more-data events when it completes processing all
available files and the batch wait time has elapsed without the arrival
of new files. Also, the File Finished event now includes the number of
records and files processed.

Hadoop
FS origin enhancement - The Hadoop FS origin now allows you
to read data from other file systems using the Hadoop FileSystem
interface. Use the Hadoop FS origin in cluster batch pipelines.

HTTP
Client origin enhancement - The HTTP Client origin now allows
time functions and datetime variables in the request body. It also
allows you to specify the time zone to use when evaluating the request
body.

JDBC Multitable Consumer origin enhancement - You can now use the origin to
perform multithreaded processing of partitions within a table. Use partition
processing to handle even larger volumes of data. This enhancement also
includes new JDBC header attributes.

By default, all new pipelines use
partition processing when possible. Upgraded pipelines use
multithreaded table processing to preserve previous
behavior.

You can now configure the behavior for the origin when it
encounters data of an unknown data type.

New last-modified time record header attribute - Directory, File Tail, and SFTP/FTP Client origins now include the last modified time
for the originating file for a record in an mtime record header
attribute.
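
As a sketch of how a downstream script might use the new attribute - assuming the
Jython Evaluator's standard bindings, and that header attributes are exposed to scripts
as record.attributes (an assumption to verify against the bundled sample scripts):

    # Jython Evaluator sketch: copy the originating file's last-modified time into a field.
    # record.attributes is assumed to expose record header attributes as a dict-like map.
    for record in records:
        mtime = record.attributes.get('mtime')
        if mtime is not None:
            record.value['sourceFileMtime'] = mtime   # stored as provided by the origin
        output.write(record)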

Processors

New Data
Parser processor - Use the new processor to extract NetFlow
or syslog messages as well as other supported data formats that are
embedded in a field.

Solr
destination enhancement - You can now configure the
destination to skip connection validation when the Solr configuration
file, solrconfig.xml, does not define the default
search field (“df”) parameter.

Executors

New Amazon
S3 executor - Use the Amazon S3 executor to create new Amazon
S3 objects for the specified content or add tags to existing objects
each time it receives an event.

Writing
XML - You can now use the Google Pub/Sub Publisher, JMS
Producer, and Kafka Producer destinations to write XML documents to
destination systems. Note the record structure requirement before you
use this data format.

Avro:

Origins now write the Avro schema to an avroSchema record header
attribute.

Origins now include precision and scale field attributes for every Decimal field.

Data Collector now supports the time-based logical types added
to Avro in version 1.8.

Delimited - Data Collector can now continue processing records with
delimited data when a row has more fields than the header. Previously,
rows with more fields than the header were sent to error.

list:join() - Merges elements in a List field into a
String field, using the specified separator between elements.

list:joinSkipNulls() - Merges elements in a List field
into a String field, using the specified separator between elements
and skipping null values.

str:indexOf() - Returns the index within a string of the
first occurrence of the specified subset of characters.

Miscellaneous

Global bulk edit mode - In any property where you would
previously click an Add icon to add additional configurations, you can
now switch to bulk edit mode to enter a list of configurations in JSON
format.

What's New in 2.6.0.1

Data Collector
version 2.6.0.1 includes the following enhancement:

Kinesis
Consumer origin - You can now reset the origin for Kinesis Consumer
pipelines. Resetting the origin for Kinesis Consumer differs from other origins,
so please note the requirement and guidelines.

What's New in 2.6.0.0

Data Collector
version 2.6.0.0 includes the following new features and enhancements:

Installation

MapR prerequisites - You can now run the
setup-mapr command in interactive or
non-interactive mode. In interactive mode, the command prompts you
for the MapR version and home directory. In non-interactive mode,
you define the MapR version and home directory in environment
variables before running the command.

New buffer size configuration - You can now use a new
parser.limit configuration property to increase the Data Collector
parser buffer size. The parser buffer is used by origins to
process many data formats, including Delimited, JSON, and XML. The
parser buffer size limits the size of the records that origins can
process. The Data Collector parser buffer size is 1048576 bytes by
default.

Hive Metadata processor data format property -
Use the new data format property to indicate the data format
to use.

Parquet support in the Hive Metastore destination - The destination can
now create and update Parquet tables in Hive. The
destination no longer includes a data format property since
that information is now configured in the Hive Metadata
processor.

When you run a job on multiple Data Collectors, a
remote pipeline instance runs on each of the Data Collectors. To
view aggregated statistics for the job within DPM, you must
configure the pipeline to write the statistics to a Kafka
cluster, Amazon Kinesis Streams, or SDC RPC.

Update
published pipelines - When you update a published
pipeline, Data Collector now displays a red asterisk next to the
pipeline name to indicate that the pipeline has been updated since
it was last published.

Origins

New CoAP
Server origin - An origin that listens on a CoAP endpoint
and processes the contents of all authorized CoAP requests. The
origin performs parallel processing and can generate multithreaded
pipelines.

New TCP
Server origin - An origin that listens at the specified
ports, establishes TCP sessions with clients that initiate TCP
connections, and then processes the incoming data. The origin can
process NetFlow, syslog, and most Data Collector data formats as
separated records. You can configure custom acknowledgement messages
and use a new batchSize variable, as well as other expressions, in
the messages.

Offset management - Both the REST API and command line interface can now retrieve the last-saved
offset for a pipeline and set the offset for a pipeline when it is
not running. Use these commands to implement pipeline failover using
an external storage system. Otherwise, pipeline offsets are managed
by Data Collector and there is no need to update the offsets.
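
For illustration, a Python sketch of retrieving a pipeline's last-saved offset over the
REST API so that it can be stored in an external system. The endpoint path, credentials,
and response shape are assumptions - confirm them against the RESTful API reference
available from your Data Collector:

    # Hypothetical sketch: fetch a pipeline's committed offset from the Data Collector REST API.
    # The /rest/v1/pipeline/<id>/committedOffsets path is an assumption -- verify it in the
    # REST API reference (available from the Data Collector UI).
    import requests

    SDC_URL = 'http://localhost:18630'      # default Data Collector address
    PIPELINE_ID = 'MyPipelineId'            # illustrative pipeline ID

    resp = requests.get(
        '%s/rest/v1/pipeline/%s/committedOffsets' % (SDC_URL, PIPELINE_ID),
        auth=('admin', 'admin'),            # default file-based authentication credentials
        headers={'X-Requested-By': 'sdc'},
    )
    resp.raise_for_status()
    print(resp.json())                      # offset document to persist externally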

Support bundles - You can now use Data Collector to
generate a support bundle. A support bundle is a ZIP file that
includes Data Collector logs, environment and configuration
information, pipeline JSON files, resource files, and pipeline
snapshots.

You upload the generated file to the StreamSets
support team so that we can use the information to troubleshoot
your support tickets.

TLS property enhancements - Stages that support
SSL/TLS now provide the following enhanced set of properties
that enable more specific configuration:

Keystore and truststore type - You can now choose
between Java Keystore (JKS) and PKCS #12 (p12).
Previously, Data Collector only supported JKS.

Transport protocols - You can now specify the transport
protocols that you want to allow. By default, Data
Collector allows only TLSv1.2.

Cipher suites - You can now specify the cipher suites to
allow. Data Collector provides a modern set of default
cipher suites. Previously, Data Collector always allowed
the default cipher suites for the JRE.

To avoid upgrade impact, all SSL/TLS/HTTPS properties in
existing pipelines are preserved during upgrade.

Precondition enhancement - Stages with user-defined
preconditions now process all preconditions before passing a record
to error handling. This allows error records to include all
precondition failures in the error message.

Pipeline import/export enhancement - When you export multiple pipelines,
Data Collector now includes all pipelines in a single zip file. You
can also import multiple pipelines from a single zip file.

What's New in 2.5.1.0

Data Collector
version 2.5.1.0 includes the following enhancement:

New
stage library - Data Collector now supports the Cloudera CDH version 5.11 distribution of Hadoop and the
Cloudera version 5.11 distribution of Apache Kafka 2.1.

Maximum pipeline runners - You can now configure a
maximum number of pipeline runners to use in a pipeline, which allows you
to tune performance and resource usage. Previously, Data Collector
generated pipeline runners based on the number of threads created
by the origin. By default, Data Collector still generates runners based
on the number of threads that the origin uses.

Stop
pipeline execution - You can configure pipelines to transfer
data and automatically stop execution based on an event such as reaching
the end of a table. The JDBC and Salesforce origins can generate events
when they reach the end of available data that the Pipeline Finisher
uses to stop the pipeline. Click here for a case study.

Pipeline runtime parameters - You can now define runtime
parameters when you configure a pipeline, and then call the parameters
from within that pipeline. When you start the pipeline from the user
interface, the command line, or the REST API, you specify the values to
use for those parameters. Use pipeline parameters to represent any stage
or pipeline property with a value that must change for each pipeline run
- such as batch sizes and timeouts, directories, or URIs.

In previous
versions, pipeline runtime parameters were named pipeline constants.
You defined the constant values in the pipeline, and could not pass
different values when you started the pipeline.
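
As an illustration of supplying parameter values when starting a pipeline through the
REST API, here is a rough Python sketch. The endpoint path, the request body shape, and
the parameter names are assumptions - check the REST API reference for your Data
Collector version:

    # Hypothetical sketch: start a pipeline and pass runtime parameter values.
    # Endpoint path and body shape are assumptions -- verify against the REST API reference.
    import requests

    SDC_URL = 'http://localhost:18630'
    PIPELINE_ID = 'MyPipelineId'                                 # illustrative

    resp = requests.post(
        '%s/rest/v1/pipeline/%s/start' % (SDC_URL, PIPELINE_ID),
        json={'BATCH_SIZE': 5000, 'OUTPUT_DIR': '/data/out'},    # illustrative parameters
        auth=('admin', 'admin'),
        headers={'X-Requested-By': 'sdc'},
    )
    resp.raise_for_status()
    print(resp.json().get('status'))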

Pipeline ID enhancement - Data Collector now prefixes the pipeline ID with the alphanumeric characters entered
for the pipeline title. For example, if you enter “Oracle To HDFS” as
the pipeline title, then the pipeline ID has the following value:
OracleToHDFStad9f592-5f02-4695-bb10-127b2e41561c.

Webhooks for pipeline state changes and alerts - You can now configure
pipeline state changes and metric and data alerts to call webhooks in
addition to sending email. For example, you can configure an incoming
webhook in Slack so that an alert can be posted to a Slack channel. Or,
you can configure a webhook to start another pipeline when the pipeline
state is changed to Finished or Stopped.

Salesforce origin and Salesforce Lookup processor enhancements - You can now
specify SELECT * FROM <object> in a SOQL query. The origin or
processor expands * to all fields in the Salesforce object
that are accessible to the configured user.

The origin and processor generate Salesforce field
attributes that provide additional information about each
field, such as the data type of the Salesforce field.

The origin and processor can now additionally retrieve
deleted records from the Salesforce recycle bin.

The origin can now generate events when it completes
processing all available data.

Salesforce destination - The destination can now use a
CRUD operation record header attribute to indicate the operation to
perform for each record. You can also configure the destination to
use a proxy to connect to Salesforce.

Wave Analytics destination - You can now configure the
authentication endpoint and the API version that the destination
uses to connect to Salesforce Wave Analytics. You can also configure
the destination to use a proxy to connect to Salesforce.

Origins

New
Elasticsearch origin - An origin that reads data from an
Elasticsearch cluster. The origin uses the Elasticsearch scroll API to
read documents using a user-defined Elasticsearch query. The origin
performs parallel processing and can generate multithreaded pipelines.

New
WebSocket Server origin - An origin that listens on a
WebSocket endpoint and processes the contents of all authorized
WebSocket requests. The origin performs parallel processing and can
generate multithreaded pipelines.

HTTP Client
origin enhancements - When using pagination, the origin can
include all response fields in the resulting record in addition to the
fields in the specified result field path. The origin can now also
process the following new data formats: Binary, Delimited, Log, and SDC
Record.

HTTP Server origin enhancement - The origin requires that
HTTP clients include the application ID in all requests. You can now
configure HTTP clients to send data to a URL that includes the
application ID in a query parameter, rather than including the
application ID in request headers.
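
For example, a client might post JSON records to the origin's listening port with the
application ID in the URL. A Python sketch under the assumption that the query parameter
is named sdcApplicationId - verify the parameter name in the HTTP Server origin
documentation:

    # Hypothetical client sketch: send data to an HTTP Server origin, passing the
    # application ID as a query parameter instead of a request header.
    # The sdcApplicationId parameter name is an assumption -- confirm it in the docs.
    import requests

    url = 'http://sdc-host:8000/?sdcApplicationId=myAppId'   # illustrative host, port, and ID
    payload = '{"id": 1, "name": "example"}'                 # one JSON record

    resp = requests.post(url, data=payload,
                         headers={'Content-Type': 'application/json'})
    resp.raise_for_status()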

JDBC Multitable Consumer origin enhancements - You can now configure the quote
character to use around table, schema, and column names in the query. You can
also configure the number of times a thread tries to read a batch of data
after receiving an SQL error.

MongoDB origin and MongoDB Oplog origin enhancements - The origins can now use
LDAP authentication in addition to username/password authentication to
connect to MongoDB. You can also now include credentials in the MongoDB
connection string.

Processors

New Field
Order processor - A processor that orders fields in a map or
list-map field and outputs the fields into a list-map or list root
field.

Field Flattener enhancement - You can now flatten a field in place to
raise it to the parent level.

Groovy,
JavaScript, and Jython
Evaluator processor enhancement - You can now develop an
initialization script that the processor runs once when the pipeline
starts. Use an initialization script to set up connections or resources
required by the processor.

You can also develop a destroy script that
the processor runs once when the pipeline stops. Use a destroy
script to close any connections or resources opened by the
processor.
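
For instance, a Jython Evaluator could open a resource once in the initialization script,
reuse it from the main script through the shared state map, and release it in the destroy
script. A minimal sketch, assuming the processor's standard bindings (state, records,
output); the file path is illustrative, and the three parts belong in the initialization,
main, and destroy script properties, respectively:

    # Initialization script: runs once when the pipeline starts.
    state['audit'] = open('/tmp/audit.log', 'a')     # illustrative resource

    # Main processing script: reuses the resource stored in state.
    for record in records:
        state['audit'].write('%s\n' % record.value)
        output.write(record)

    # Destroy script: runs once when the pipeline stops.
    if state.get('audit') is not None:
        state['audit'].close()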

JDBC Lookup processor enhancement - Default value date formats: when
the default value data type is Date, use the format yyyy/MM/dd.
When the default value data type is Datetime, use the
format yyyy/MM/dd HH:mm:ss.

Spark Evaluator processor enhancement - The processor now provides beta support
for cluster mode pipelines. In a development or test environment, you
can use the processor in pipelines that process data from a Kafka or
MapR cluster in cluster streaming mode. Do not use the Spark
Evaluator processor in cluster mode pipelines in a production
environment.

Cassandra destination enhancements - The destination now supports
the Cassandra uuid and timeuuid data types. And you can now specify
the Cassandra batch type to use: Logged or Unlogged. Previously, the
destination used the Logged batch type.

MongoDB destination enhancements - The destination can now
use LDAP authentication in addition to username/password authentication
to connect to MongoDB. You can also now include credentials in the
MongoDB connection string.

SDC RPC destination enhancements - The Back Off Period value
that you enter now increases exponentially after each retry, until it
reaches the maximum wait time of 5 minutes. Previously, there was no
limit to the maximum wait time. The maximum value for the Retries per
Batch property is now unlimited - previously it was 10 retries.

Solr
destination enhancement - You can now configure the action
that the destination takes when it encounters missing fields in the
record. The destination can discard the fields, send the record to
error, or stop the pipeline.

Executors

New Spark
executor - The executor starts a Spark application on a YARN
or Databricks cluster each time it receives an event.

New
Pipeline Finisher executor - The executor stops the pipeline
and transitions it to a Finished state when it receives an event. Can be
used with the JDBC Query Consumer, JDBC Multitable Consumer, and
Salesforce origins to perform batch processing of available data.

Data Collector Hadoop impersonation enhancement - You can
use the stage.conf_hadoop.always.impersonate.current.user Data Collector
configuration property to ensure that Data Collector uses the current Data Collector user to read from or write to Hadoop systems.

When enabled, you
cannot configure alternate users in the following Hadoop-related
stages:

Hadoop FS origin and destination

MapR FS origin and destination

HBase Lookup processor and destination

MapR DB destination

HDFS File Metadata executor

MapReduce executor

Stage precondition property enhancement - Records that do not meet all
preconditions for a stage are now processed based on error handling
configured in the stage. Previously, they were processed based on error
handling configured for the pipeline. See Evaluate Precondition Error Handling for information about
upgrading.

XML parsing enhancements - You can include field XPath expressions and
namespaces in the record with the Include Field XPaths property. And use the new Output Field Attributes property to write XML attributes and
namespace declarations to field attributes rather than including them in
the record as fields.

Wrap long lines in properties - You can now configure Data Collector to wrap long lines of text that you enter in properties, instead of
displaying the text with a scroll bar.

What's New in 2.4.1.0

Data Collector
version 2.4.1.0 includes the following new features and enhancements:

Salesforce origin enhancement - When the origin processes existing
data and is not subscribed to notifications, it can now repeat the specified
query at regular intervals. The origin can repeat a full or incremental
query.

Log data
display - You can stop and restart the automatic display of the most
recent log data on the Data Collector Logs page.

New
time function - The time:createDateFromStringTZ
function enables creating Date objects adjusted for time zones from string
datetime values.

New stage library stage-type icons - The stage library now displays icons to
differentiate between different stage types.

Note: The Hive Drift Solution is now known as the "Drift Synchronization Solution
for Hive" in the documentation.

What's New in 2.4.0.0

Data Collector
version 2.4.0.0 includes the following new features and enhancements:

Pipeline Sharing and Permissions

Data Collector now provides pipeline-level permissions. Permissions
determine the access level that users and groups have on pipelines. To
create a multitenant environment, create groups of users and then share
pipelines with the groups to grant different levels of access.

With this change, only the pipeline owner and users with the Admin role can
view a pipeline by default. If upgrading from a previous version of Data
Collector, see the following post-upgrade task, Configure Pipeline Permissions.

Permissions transfer - You can transfer all pipeline
permissions associated with a user or group to a different user or
group. Use pipeline transfer to easily migrate permissions after
registering with DPM or after a user or group becomes obsolete.

Dataflow Performance Manager (DPM)

Register Data Collectors with DPM - If Data Collector
uses file-based authentication and if you register the Data
Collector from the Data Collector UI, you can now create DPM user
accounts and groups during the registration process.

Aggregated statistics for DPM - When working with DPM,
you can now configure a pipeline to write aggregated statistics to
SDC RPC. Write statistics to SDC RPC for development purposes only.
For a production environment, use a Kafka cluster or Amazon Kinesis
Streams to aggregate statistics.

Origins

Dev SDC RPC with Buffering origin - A new development
stage that receives records from an SDC RPC destination, temporarily
buffering the records to disk before passing the records to the next
stage in the pipeline. Use as the origin in an SDC RPC destination
pipeline.

Amazon S3 origin enhancement - You can configure a new
File Pool Size property to determine the maximum number of files
that the origin stores in memory for processing after loading and
sorting all files present on S3.

Install external libraries using the Data Collector user
interface - You can now use the Data Collector user
interface to install external libraries to make them available to
stages. For example, you can install JDBC drivers for stages that
use JDBC connections. Or, you can install external libraries to call
external Java code from the Groovy, JavaScript, and Jython Evaluator
processors.
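
For instance, once a library is installed for the Jython Evaluator's stage library, a
script can import its Java classes directly. A sketch with a purely hypothetical class
name:

    # Jython Evaluator sketch: call Java code from an installed external library.
    # com.example.text.Normalizer is hypothetical -- substitute a class your library provides.
    from com.example.text import Normalizer

    normalizer = Normalizer()
    for record in records:
        record.value['name'] = normalizer.normalize(record.value['name'])
        output.write(record)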

Custom header enhancement - You can now use HTML in the
ui.header.title configuration property to configure a custom header
for the Data Collector UI. This allows you to specify the look and feel for any text
that you use, and to include small images in the header.

Groovy enhancement - You can configure the processor to
use the invokedynamic bytecode instruction.

Pipeline renaming - You can now rename a pipeline by clicking
directly on the pipeline name when editing the pipeline, in addition
to editing the Title general pipeline property.

What's New in 2.3.0.1

Data Collector
version 2.3.0.1 includes the following new features and enhancements:

Oracle CDC Client
origin enhancement - The origin can now track and adapt to schema
changes when reading the dictionary from redo logs. When using the dictionary in
redo logs, the origin can also generate events for each DDL that it reads.

HTTP
Server origin - Listens on an HTTP endpoint and
processes the contents of all authorized HTTP POST requests. Use
the HTTP Server origin to receive high volumes of HTTP POST
requests using multiple threads.

Enhanced runtime statistics - Monitoring a pipeline
displays aggregated runtime statistics for all threads in the
pipeline. You can also view the number of runners - that is, threads
and pipeline instances - being used.

CDC/CRUD Enhancements

With this release, certain Data Collector stages enable you to easily
process change data capture (CDC) or transactional data in a pipeline. The
sdc.operation.type record header attribute is now used by all CDC-enabled
origins and CRUD-enabled stages:

The JDBC Tee processor and JDBC Producer can now
process changed data based on CRUD operations in record headers.
The stages also include a default operation and unsupported
operation handling.

The MongoDB and Elasticsearch destinations now look for
the CRUD operation in the sdc.operation.type record header
attribute. The Elasticsearch destination includes a default
operation and unsupported operation handling.
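
For instance, a script upstream of a CRUD-enabled destination can set the attribute
itself. A rough Jython Evaluator sketch; the numeric operation codes in the comments are
commonly documented values, so verify them against the CRUD documentation for your
version:

    # Jython Evaluator sketch: tag each record with a CRUD operation for a CRUD-enabled
    # destination. Verify the operation codes (commonly 1 = INSERT, 2 = DELETE, 3 = UPDATE,
    # 4 = UPSERT) against the CRUD documentation.
    for record in records:
        if record.value.get('deleted'):
            record.attributes['sdc.operation.type'] = '2'   # DELETE
        else:
            record.attributes['sdc.operation.type'] = '1'   # INSERT
        output.write(record)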

Multitable Copy

You can use the new JDBC
Multitable Consumer origin when you need to copy multiple tables
to a destination system or for database replication. The JDBC Multitable
Consumer origin reads database data from multiple tables through a JDBC
connection. The origin generates SQL queries based on the table
configurations that you define.

Configuration

Groups for file-based authentication - If you use
file-based authentication, you can now create groups of users when
multiple users use Data Collector. You configure groups in the
associated realm.properties file located in the Data Collector
configuration directory, $SDC_CONF.

If you use file-based
authentication, you can also now view all user accounts granted
access to the Data Collector, including the roles and groups
assigned to each user.

LDAP authentication enhancements - You can now
configure Data Collector to use StartTLS to make secure
connections to an LDAP server. You can also configure the
userFilter property to define the LDAP user attribute used to
log in to Data Collector. For example, a username, uid, or email
address.

Java garbage collector logging - Data Collector now
enables logging for the Java garbage collector by default. Logs
are written to $SDC_LOG/gc.log. You can disable the logging if
needed.

Heap dump for out of memory errors - Data Collector now
produces a heap dump file by default if it encounters an out of
memory error. You can configure the location of the heap dump file
or you can disable this default behavior.

Modifying the log level - You can now use the Data
Collector UI to modify the log level to display messages at another
severity level.

Pipelines

Pipeline renaming - You can now rename pipelines by
editing the Title general pipeline property.

Field attributes - Data Collector now supports
field-level attributes. Use the Expression Evaluator to add
field attributes.

Origins

New HTTP Server origin - A multithreaded origin that
listens on an HTTP endpoint and processes the contents of all
authorized HTTP POST requests. Use the HTTP Server origin to read
high volumes of HTTP POST requests using multiple threads.

New
HTTP to Kafka origin - Listens on an HTTP endpoint and
writes the contents of all authorized HTTP POST requests
directly to Kafka. Use to read high volumes of HTTP POST
requests and write them to Kafka.

JDBC
Query Consumer origin enhancements - The JDBC Consumer
origin has been renamed to the JDBC Query Consumer origin. The
origin functions the same as in previous releases. It reads database
data using a user-defined SQL query through a JDBC connection. You
can also now configure the origin to enable auto-commit mode for the
JDBC connection and to disable validation of the SQL query.

MongoDB
origin enhancements - You can now use a nested field as
the offset field. The origin supports reading the MongoDB BSON
timestamp for MongoDB versions 2.6 and later. And you can configure
the origin to connect to a single MongoDB server or node.

Processors

Field Type Converter processor enhancement - You can now
configure the processor to convert timestamp data in a long field to
a String data type. Previously, you had to use one Field Type
Converter processor to convert the long field to a datetime, and
then use another processor to convert the datetime field to a
string.

HTTP Client processor enhancements - You can now
configure the processor to use the OAuth 2 protocol to connect
to an HTTP service. You can also configure a rate limit for the
processor, which defines the maximum number of requests to make
per second.

JDBC Lookup processor enhancements - You can now
configure the processor to enable auto-commit mode for the JDBC
connection. You can also configure the processor to use a
default value if the database does not return a lookup value for
a column.

XML
Parser enhancement - A new Multiple Values Behavior
property allows you to specify the behavior when you define a
delimiter element and the document includes more than one value:
Return the first value as a record, return one record with a
list field for each value, or return all values as records.

Azure Data Lake Store destination enhancement - You
can now use the destination in cluster batch pipelines. You can
also process binary and protobuf data, use record header
attributes to write records to files and roll files, and
configure a file suffix and the maximum number of records that
can be written to a file.

Elasticsearch destination enhancement - The
destination now uses the Elasticsearch HTTP API. With this API,
the Elasticsearch version 5 stage library is compatible with all
versions of Elasticsearch. Earlier stage library versions have
been removed. Elasticsearch is no longer supported on Java 7.
You’ll need to verify that Java 8 is installed on the Data
Collector machine and remove this stage from the blacklist
property in $SDC_CONF/sdc.properties before you can use it.

You can also now configure the destination to perform any of the
following CRUD operations: create, update, delete, or index.

record:fieldAttributeOrDefault (<field path>,
<attribute name>, <default value>) - Returns the
value of the specified field attribute. Returns the
default value if the attribute does not exist or
contains no value.

What's New in 2.2.0.0

Data Collector version
2.2.0.0 includes the following new features and enhancements:

Event Framework

The Data Collector event framework enables the pipeline to trigger tasks in
external systems based on actions that occur in the pipeline, such as
running a MapReduce job after the pipeline writes a file to HDFS. You can
also use the event framework to store event information, such as when an
origin starts or completes reading a file.

Java requirement - Oracle Java 7 is supported but now
deprecated. Oracle announced the end of public updates for
Java 7 in April 2015. StreamSets recommends migrating to
Java 8, as Java 7 support will be removed in a future Data
Collector release.

Core installation includes the basic stage library only
- The core RPM and tarball installations now include the basic stage
library only, to allow Data Collector to use less disk space.
Install additional stage libraries using the Package Manager for
tarball installations or the command line for RPM and tarball
installations.

Previously, the core installation also included
the Groovy, Jython, and statistics stage libraries.

LDAP authentication - If you use LDAP authentication,
you can now configure Data Collector to connect to multiple LDAP
servers. You can also configure Data Collector to support an LDAP
deployment where members are defined by uid or by full DN.

Java garbage collector - Data Collector now uses the
Concurrent Mark Sweep (CMS) garbage collector by default. You can
configure Data Collector to use a different garbage collector by
modifying Java configuration options in the Data Collector
environment configuration file.

SDC_JAVA_OPTS - Includes configuration options for all Java
versions.

SDC_JAVA7_OPTS - Includes configuration options used only
when Data Collector is running Java 7.

SDC_JAVA8_OPTS - Includes configuration options used only
when Data Collector is running Java 8.

New time zone property - You can configure the Data
Collector UI to use UTC, the browser time zone, or the Data
Collector time zone. The time zone property affects how dates and
times display in the UI. The default is the browser time zone.

New Salesforce origin - Reads data from Salesforce. The
origin can execute a SOQL query to read existing data from
Salesforce. The origin can also subscribe to the Force.com Streaming
API to receive notifications for changes to Salesforce data.

Directory origin enhancement - You can configure the
Directory origin to read files from all subdirectories when using
the last-modified timestamp for the read order.

JDBC Query Consumer and Oracle CDC Client origin enhancement - You can now
configure the transaction isolation level that the JDBC Query
Consumer and Oracle CDC Client origins use to connect to the
database. Previously, the origins used the default transaction
isolation level configured for the database.

Processors

New Spark
Evaluator processor - Processes data based on a Spark
application that you develop. Use the Spark Evaluator processor to
develop a Spark application that performs custom processing within a
pipeline.

Field Flattener processor enhancements - In addition to
flattening the entire record, you can also now use the Field
Flattener processor to flatten specific list or map fields in the
record.

Field Type Converter processor enhancements - You can
now use the Field Type Converter processor to change the scale of a
decimal field. Or, if you convert a field with another data type to
the Decimal data type, you can configure the scale to use in the
conversion.

Field
Pivoter processor enhancements - The List Pivoter
processor has been renamed to the Field Pivoter processor. You can
now use the processor to pivot data in a list, map, or list-map
field. You can also use the processor to save the field name of the
first-level item in the pivoted field.

JDBC Lookup and JDBC Tee processor enhancement - You can now configure
the transaction isolation level that the JDBC Lookup and JDBC Tee
processors use to connect to the database. Previously, the origins
used the default transaction isolation level configured for the
database.

Scripting processor enhancements - The Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors can generate event records
and work with record header attributes. The sample scripts now
include examples of both and a new tip for generating unique record
IDs.
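
A rough Jython Evaluator sketch of both capabilities; sdcFunctions.createEvent() and
sdcFunctions.toEvent() follow the bundled sample scripts, but treat the exact calls as
assumptions and compare them against the samples shipped with your Data Collector:

    # Jython Evaluator sketch: set a header attribute on each record and emit a summary
    # event once the batch is processed.
    count = 0
    for record in records:
        record.attributes['processedBy'] = 'jython-evaluator'   # record header attribute
        output.write(record)
        count = count + 1

    event = sdcFunctions.createEvent('batch-summary', 1)   # event type and version
    event.value = {'recordCount': count}
    sdcFunctions.toEvent(event)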

XML Flattener processor enhancement - You can now
configure the XML Flattener processor to write the flattened data to
a new output field. Previously, the processor wrote the flattened
data to the same field.

XML Parser processor enhancement - You can now generate
records from XML documents using simplified XPath expressions. This
enables reading records from deeper within XML documents.

File suffix enhancement - You can now configure a file suffix, such
as txt or json, for output files generated by the Hadoop FS, Local
FS, MapR FS, and Amazon S3 destinations.

JDBC Producer destination enhancement - You can now
configure the transaction isolation level that the JDBC Producer
destination uses to connect to the database. Previously, the
destination used the default transaction isolation level configured
for the database.

Kudu destination enhancement - You can now configure the
destination to perform one of the following write operations:
insert, update, delete, or upsert.

Data Formats

XML processing enhancement - You can now generate
records from XML documents using simplified XPath expressions with origins that process
XML data and the XML Parser processor. This enables reading records
from deeper within XML documents.

Add labels to pipelines from the Home page - You can now add labels
to multiple pipelines from the Data Collector Home page. Use labels
to group similar pipelines. For example, you might want to group
pipelines by database schema or by the test or production
environment.

Reset the origin for multiple pipelines from the Home page - You can
now reset the origin for multiple pipelines at the same time from
the Data Collector Home page.

Rules and Alerts

Metric
rules and alerts enhancements - The gauge metric type can
now provide alerts based on the number of input, output, or error
records for the last processed batch.

Expression Language Functions

New file functions - You can use the following new file
functions to work with file paths:

file:fileExtension(<filepath>) - Returns the file
extension from a path.

file:fileName(<filepath>) - Returns a file name from a
path.

file:parentPath(<filepath>) - Returns the parent path of
the specified file or directory.

file:pathElement(<filepath>, <integer>) - Returns the
portion of the file path specified by a positive or negative
integer.

file:removeExtension(<filepath>) - Removes the file
extension from a path.

New pipeline functions - You can use the following new
pipeline functions to determine information about a pipeline:

pipeline:name() - Returns the pipeline name.

pipeline:version() - Returns the pipeline version when the
pipeline has been published to Dataflow Performance Manager
(DPM).

New time functions - You can use the following new time
functions to transform datetime data:

time:extractLongFromDate(<Date object>, <string>) -
Extracts a long value from a Date object, based on the
specified date format.

time:extractStringFromDate(<Date object>, <string>) -
Extracts a string value from a Date object, based on the
specified date format.

time:millisecondsToDateTime(<long>) - Converts an epoch
or UNIX time in milliseconds to a Date object.