When you upgrade an installation from the RPM package, the new version uses the default configuration, data, log,
and resource directories. If the previous version used the default directories, the new version has access to the
files created in the previous version.

When an external system is upgraded to a new version, you can continue to use existing Data Collector pipelines that connected to the previous version of the external system. You simply configure the pipelines to work
with the upgraded system.

Pre Upgrade Tasks

In some situations, you must complete tasks before you upgrade.

Upgrade to Spark 2.1 or Later

Data Collector
version 3.3.0 introduces cluster streaming mode with support for Kafka security features
such as SSL/TLS and Kerberos authentication using Spark 2.1 or later and Kafka 0.10.0.0 or
later.

As a result, Spark 1.x is no longer supported. Using Spark 1.x for cluster streaming mode,
the Spark Evaluator processor, and the Spark executor was deprecated in version 3.2.0.0,
and support is removed in version 3.3.0. If you are using cluster streaming mode, the
Spark Evaluator processor, or the Spark executor, you must upgrade to Spark 2.1 or
later. In addition, if you are using cluster streaming mode for Kafka, you must also
upgrade to Kafka 0.10.0.0 or later.

Note: You can continue to use Kafka 0.9.0.0 in
standalone pipelines. To keep using Kafka 0.9.0.0 in cluster pipelines, continue to run
an earlier version of Data Collector until you can upgrade Kafka.
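
The version requirements above can be sketched as a small pre-upgrade check. This is an illustrative helper, not part of Data Collector: the function names are hypothetical, and it assumes you already know the Spark and Kafka versions running on the cluster and pass them in as dotted strings.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.10.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum versions stated in the upgrade notes for version 3.3.0.
MIN_SPARK = parse_version("2.1")
MIN_KAFKA = parse_version("0.10.0.0")


def cluster_streaming_blockers(spark_version, kafka_version=None):
    """List the upgrades needed before running cluster streaming mode.

    kafka_version is optional because only cluster streaming pipelines
    that read from Kafka need the Kafka check.
    """
    blockers = []
    if parse_version(spark_version) < MIN_SPARK:
        blockers.append("upgrade Spark to 2.1 or later")
    if kafka_version is not None and parse_version(kafka_version) < MIN_KAFKA:
        blockers.append("upgrade Kafka to 0.10.0.0 or later")
    return blockers


if __name__ == "__main__":
    # Hypothetical cluster still on Spark 1.6 and Kafka 0.9.
    for item in cluster_streaming_blockers("1.6.3", "0.9.0.0"):
        print(item)
```

An empty list means that neither Spark nor Kafka blocks the upgrade. The check does not cover standalone pipelines, which can keep using Kafka 0.9.0.0.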

Because Spark 1.x is no longer supported and Kafka 0.9.0.0 is no longer supported in
cluster pipelines, the following stage libraries have changed:

Category: New stage libraries

The following new stage libraries include the Kafka Consumer origin for cluster mode pipelines:

    streamsets-datacollector-cdh-spark_2_1-lib
    streamsets-datacollector-cdh-spark_2_2-lib
    streamsets-datacollector-cdh-spark_2_3-lib

Category: Changed stage libraries

The following stage library no longer includes the Kafka Consumer origin for cluster mode pipelines:

    streamsets-datacollector-hdp_2_4-lib

The following stage libraries were upgraded to use Spark 2.1:

    streamsets-datacollector-hdp_2_6-lib
    streamsets-datacollector-mapr_5_2-lib
    streamsets-datacollector-mapr_6_0-mep4-lib

Category: Removed stage libraries

The following stage libraries are removed:

    streamsets-datacollector-cdh_5_8-cluster-cdh_kafka_2_0-lib
    streamsets-datacollector-cdh_5_9-cluster-cdh_kafka_2_0-lib
    streamsets-datacollector-cdh_5_10-cluster-cdh_kafka_2_1-lib
    streamsets-datacollector-cdh_5_11-cluster-cdh_kafka_2_1-lib
    streamsets-datacollector-cdh_5_12-cluster-cdh_kafka_2_1-lib
    streamsets-datacollector-cdh_5_13-cluster-cdh_kafka_2_1-lib
    streamsets-datacollector-cdh_5_14-cluster-cdh_kafka_2_1-lib

During the upgrade process, these removed stage libraries are replaced with the new
streamsets-datacollector-cdh-spark_2_1-lib stage library.

Category: Removed legacy stage libraries

The following legacy stage libraries are removed:

    streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_2-lib
    streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_3-lib
    streamsets-datacollector-cdh_5_5-cluster-cdh_kafka_1_3-lib
    streamsets-datacollector-cdh_5_7-cluster-cdh_kafka_2_0-lib

Category: Changed legacy stage libraries

The following legacy stage libraries no longer include the Spark Evaluator processor:

    streamsets-datacollector-cdh_5_4-lib
    streamsets-datacollector-cdh_5_5-lib
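
To see how the library changes above affect a given installation, you can compare them against the stage libraries you have installed. The following is a minimal sketch: the library names come from the lists above, but the function name and the idea of passing in a list of installed library names are assumptions made for illustration, not a Data Collector API.

```python
# Replacement named in the upgrade notes for the removed CDH cluster Kafka libraries.
REPLACEMENT = "streamsets-datacollector-cdh-spark_2_1-lib"

REMOVED = {
    "streamsets-datacollector-cdh_5_8-cluster-cdh_kafka_2_0-lib",
    "streamsets-datacollector-cdh_5_9-cluster-cdh_kafka_2_0-lib",
    "streamsets-datacollector-cdh_5_10-cluster-cdh_kafka_2_1-lib",
    "streamsets-datacollector-cdh_5_11-cluster-cdh_kafka_2_1-lib",
    "streamsets-datacollector-cdh_5_12-cluster-cdh_kafka_2_1-lib",
    "streamsets-datacollector-cdh_5_13-cluster-cdh_kafka_2_1-lib",
    "streamsets-datacollector-cdh_5_14-cluster-cdh_kafka_2_1-lib",
}

REMOVED_LEGACY = {
    "streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_2-lib",
    "streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_3-lib",
    "streamsets-datacollector-cdh_5_5-cluster-cdh_kafka_1_3-lib",
    "streamsets-datacollector-cdh_5_7-cluster-cdh_kafka_2_0-lib",
}

NO_SPARK_EVALUATOR = {
    "streamsets-datacollector-cdh_5_4-lib",
    "streamsets-datacollector-cdh_5_5-lib",
}


def audit_stage_libraries(installed):
    """Map each affected installed library to a short note; skip unaffected ones."""
    report = {}
    for name in installed:
        if name in REMOVED:
            report[name] = "replaced by " + REPLACEMENT
        elif name in REMOVED_LEGACY:
            report[name] = "removed"
        elif name in NO_SPARK_EVALUATOR:
            report[name] = "no longer includes the Spark Evaluator processor"
    return report


if __name__ == "__main__":
    # Hypothetical list of installed stage libraries.
    installed = [
        "streamsets-datacollector-cdh_5_10-cluster-cdh_kafka_2_1-lib",
        "streamsets-datacollector-cdh_5_4-lib",
        "streamsets-datacollector-hdp_2_6-lib",
    ]
    for name, note in audit_stage_libraries(installed).items():
        print(name + ": " + note)
```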

To continue to use cluster streaming mode, you must upgrade to a newer Cloudera CDH or
Hortonworks Hadoop distribution and to Kafka 0.10.0.0 or later. The major Hadoop
distribution vendors provide a means for Spark 1.x and Spark 2.x to coexist on the same
cluster, so you can use both versions in your clusters. Data Collector
supports the following Spark 2.x versions for the Hadoop distribution vendors:

Cloudera - Cloudera Distribution of Spark 2.1 release 1
or later is supported. For more information, see Spark 2 Requirements.

Hortonworks - Hortonworks Data Platform (HDP) 2.6 or
later includes Spark 2.2.0. For more information, see the HDP 2.6 Release Notes.

In addition to selecting the upgraded stage library version for each stage that connects
to the upgraded CDH, HDP, or Kafka system, you might need to perform additional tasks
for the following stages:

Spark Evaluator processor - If the Spark application was
previously built with Spark 2.0 or earlier, you must rebuild it with Spark 2.1.
Likewise, if you wrote the custom Spark class in Scala and compiled the application
with Scala 2.10, you must recompile it with Scala 2.11.

Spark executor - If the Spark application was previously
built with Spark 2.0 or earlier, you must rebuild it with Spark 2.1 and Scala
2.11.

Verify Installation Requirements

The minimum requirements for Data Collector
can change with each version. Before you upgrade to a new Data Collector
version, verify that the machine meets the latest minimum requirements as described in
Installation Requirements.

Migrate to Java 8

Data Collector version 2.5.0.0 and later requires Java 8. If your previous Data Collector
version ran on Java 7, you must migrate to Java 8 before upgrading to the latest Data Collector
version.

All services that use Data Collector JAR files must also run on Java 8. This means that
your Hadoop cluster must run on Java 8 if you are using cluster pipelines, the Spark
executor, or the MapReduce executor.

To migrate to Java 8, complete the following steps before upgrading to the latest Data Collector
version:

1. Shut down Data Collector.

2. Install Java 8 on the Data Collector machine.

3. If you customized Java configuration options in the SDC_JAVA7_OPTS environment
variable and if those options are valid in Java 8, migrate those customizations to
the SDC_JAVA8_OPTS environment variable.

4. Restart Data Collector and verify that it works as expected.

5. If any pipelines include the JavaScript Evaluator processor, open the pipelines and
validate the scripts on Java 8.
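
Migrating the SDC_JAVA7_OPTS customizations means carrying over only the options that are still valid in Java 8. As a hedged sketch of that filtering step, the helper below drops the PermGen sizing options, which Java 8 no longer honors because the permanent generation was removed. The function name is hypothetical and the list of dropped options is illustrative, not exhaustive, so review the result before setting SDC_JAVA8_OPTS.

```python
import shlex

# Options that Java 8 ignores because PermGen was removed in Java 8.
# Illustrative only: other Java 7 options may also need review.
JAVA8_UNSUPPORTED_PREFIXES = ("-XX:PermSize=", "-XX:MaxPermSize=")


def migrate_java_opts(java7_opts):
    """Return the Java 7 options string without options that Java 8 dropped."""
    kept = [
        opt
        for opt in shlex.split(java7_opts)
        if not opt.startswith(JAVA8_UNSUPPORTED_PREFIXES)
    ]
    # Re-quote each option so values containing spaces survive a shell export.
    return " ".join(shlex.quote(opt) for opt in kept)


if __name__ == "__main__":
    # Hypothetical value previously set in SDC_JAVA7_OPTS.
    old_opts = "-Xmx1024m -XX:MaxPermSize=256m -XX:+UseG1GC"
    print("SDC_JAVA8_OPTS=" + migrate_java_opts(old_opts))
```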

Upgrade Cluster Streaming Pipelines

If you use cluster pipelines that run in cluster streaming mode and you are upgrading
from a version earlier than 2.3.0.0, you must upgrade to Data Collector version 2.3.0.0
before upgrading to the latest version.

Prior to 2.3.0.0, Data Collector
used the Spark checkpoint mechanism to recover cluster pipelines after a failure.
Starting in version 2.3.0.0, Data Collector maintains the state of cluster pipelines
without relying on Spark checkpoints.

Warning: If you upgrade from a version
earlier than 2.3.0.0 directly to the latest version, without first upgrading to
version 2.3.0.0, cluster pipelines fail when starting.

Before you upgrade to the latest version, complete the following general tasks:

1. Upgrade to Data Collector version 2.3.0.0.

2. Start the upgraded Data Collector version 2.3.0.0 and run the cluster pipelines
so that they process some data.

3. After verifying that the upgrade to Data Collector version 2.3.0.0 was successful,
upgrade to the latest version.
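
The two-stage rule above can be expressed as a small planning function. This is a sketch only: the function names are hypothetical, and it assumes that version strings are dotted integers such as 2.2.1.0.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.3.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# First version that tracks cluster pipeline state without Spark checkpoints.
INTERMEDIATE = "2.3.0.0"


def upgrade_path(current, target, uses_cluster_streaming):
    """Return the ordered list of versions to install, ending with the target."""
    steps = []
    needs_stop = (
        uses_cluster_streaming
        and parse_version(current) < parse_version(INTERMEDIATE)
    )
    if needs_stop:
        # Run the cluster pipelines on this version before moving on.
        steps.append(INTERMEDIATE)
    steps.append(target)
    return steps


if __name__ == "__main__":
    # Hypothetical installation on 2.2.1.0 that runs cluster streaming pipelines.
    print(upgrade_path("2.2.1.0", "3.3.0", True))
```

The intermediate stop applies only when cluster streaming pipelines are in use. Installations already on 2.3.0.0 or later go straight to the target version.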