DataStax Enterprise 5.0 Analytics includes integration with Apache Spark. Use DSE Analytics to analyze huge databases. Starting with this version, Hadoop is deprecated for use with DataStax Enterprise: DSE Hadoop and BYOH (Bring Your Own Hadoop) are deprecated and will be removed in DataStax Enterprise 5.1.

You can also run analytics on Cassandra data using the Hadoop component that is integrated into DataStax Enterprise. This component enables analytics to run across the DataStax Enterprise distributed, shared-nothing architecture, but, like BYOH, it is deprecated.

About Spark

Information about Spark architecture and capabilities.

Spark is the default mode when you start an analytics node in a packaged installation. Spark runs locally on each node and executes in memory when possible: it can cache intermediate RDDs in RAM, on disk, or both, and it stores files for chained iteration in memory instead of using temporary storage in HDFS as Hadoop does. Spark uses multiple threads instead of multiple processes to achieve parallelism on a single node, avoiding the memory overhead of several JVMs.
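
The following Spark shell (Scala) snippet is a minimal sketch of this caching behavior; the file path and value names are illustrative, not part of DataStax Enterprise:

import org.apache.spark.storage.StorageLevel

// sc is the SparkContext that the Spark shell predefines.
// Persist an intermediate RDD in memory, spilling to disk if it does not fit.
val events = sc.textFile("cfs:///data/events.txt")
val parsed = events.map(_.split(",")).persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()  // the first action computes and caches the partitions
parsed.first()  // later actions reuse the cached partitions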

Spark architecture

The software components for a single DataStax Enterprise analytics node
are:

Spark Worker, on all nodes

Cassandra File System (CFS)

DataStax Enterprise File System (DSEFS)

Cassandra

A Spark Master controls the workflow, and a Spark Worker launches executors that are
responsible for executing part of the job that is submitted to the Spark Master. Spark
architecture is described in the Apache documentation.

Spark supports multiple applications. A single application can spawn multiple jobs and the
jobs run in parallel. An application reserves some resources on every node and these
resources are not freed until the application finishes. For example, every session of Spark
shell is an application that reserves resources. By default, the scheduler tries to spread
an application across as many different nodes as possible. For example, if an application
declares that it needs four cores and there are ten servers, each offering two cores, the
application most likely gets four executors, each on a different node, each consuming a
single core. However, the application can also get two executors on two different nodes,
each consuming two cores. You can configure the application scheduler. Spark Workers and
Spark Master are spawned as separate processes and are very lightweight. Workers spawn other
memory-heavy processes that are dedicated to handling queries. Memory settings for those
additional processes are fully controlled by the administrator.
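
As a sketch of such configuration, the following Scala snippet caps the total cores and the executor memory that an application reserves. The properties spark.cores.max and spark.executor.memory are standard Spark settings; the application name and values are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Limit what this application reserves so that a long-running session
// does not hold resources on every node in the cluster.
val conf = new SparkConf()
  .setAppName("capped-analytics-app")
  .set("spark.cores.max", "4")        // total cores reserved across the cluster
  .set("spark.executor.memory", "1g") // memory for each executor process

val sc = new SparkContext(conf)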

As you run Spark, you can access data in the Hadoop Distributed File System (HDFS) or the
Cassandra File System (CFS) by using the URL scheme for one or the other.
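
For example, a minimal sketch in the Spark shell (Scala); the host name, port, and paths are illustrative:

// CFS on the local DSE cluster and an external HDFS cluster,
// distinguished only by the URL scheme.
val fromCfs  = sc.textFile("cfs:///user/data/input.txt")
val fromHdfs = sc.textFile("hdfs://namenode.example.com:8020/user/data/input.txt")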

Highly available Spark Master

The Spark Master High Availability mechanism uses a special table in the
dse_system keyspace to store information required to recover Spark
workers and the application. Unlike the high availability mechanism mentioned in Spark
documentation, DataStax Enterprise does not use ZooKeeper.

If the original Spark Master fails, the reserve Spark Master automatically takes over. To
find the current Spark Master, run:

dse client-tool spark master-address
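
The command prints the URL of the current Spark Master; the address shown here is illustrative:

spark://10.10.1.1:7077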

If you enable password authentication in Cassandra, DataStax Enterprise creates special
users. The Spark Master process accesses Cassandra through the special users, one per
analytics node. The user names begin with the name of the node, followed by an encoded node
identifier. The passwords are randomized. Do not remove these users or change their passwords.