Docker IV: Spark for Cassandra Data Analysis

Spark is a general cluster-computing framework, and in our case we will use it to process data from the Cassandra cluster. As we saw in Part I, we cannot run arbitrary queries on a Cassandra table. But by running a Spark worker on each host running a Cassandra node, we can efficiently read and analyse all of its data in a distributed way. Each Spark worker (slave) will read the data from its local Cassandra node and send the results back to the Spark driver (master).

The Docker Image

No official image exists for Spark, so we need to create our own. The image I created is very basic: it is simply a Debian-with-Python base image on top of which I installed a Java 8 JDK. The Spark 2.0.1 binaries are simply extracted from the original release tarball to the /app/ folder.

There is just a small subtlety for running the container. The Spark startup scripts run in background mode, so they cannot be used as the main container process; otherwise the container would die immediately. The main process of the container is the entrypoint.sh script, and to keep it alive forever I simply run tail -f on the log folder. This way the logs are also written to standard output and can be seen by running docker logs <container>.
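As a sketch, such an entrypoint script could look like the following. The paths, the SPARK_ROLE/SPARK_MASTER_HOST variables and the exact layout are assumptions based on the description above (Spark extracted to /app/spark), not the actual script:

```shell
#!/bin/bash
# Hypothetical entrypoint.sh sketch: start the Spark daemon (master or
# worker, chosen via an environment variable), then tail the log folder
# so the container's main process never exits and the logs reach stdout.

if [ "$SPARK_ROLE" = "master" ]; then
    /app/spark/sbin/start-master.sh
else
    /app/spark/sbin/start-slave.sh "spark://$SPARK_MASTER_HOST:7077"
fi

# tail -F keeps following the logs even as new log files appear
exec tail -F /app/spark/logs/*
```

Using exec makes tail PID 1 of the container, so docker stop signals it directly.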

Another problem is port mapping. Spark opens a lot of random ports for communication between the master and the slaves. This is difficult to manage with Docker, so we will simply use the --net=host network mode to let the containers open any port they want directly on the host.

Running the Spark Containers

We will install the Spark containers on top of our existing Cassandra cluster from Part I, as illustrated in the introduction above.

To start a cluster with the master (driver) on ubuntu0 and slaves (workers) on ubuntu[1-3]:
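The exact commands depend on how the image is tagged; assuming a placeholder image name spark and the SPARK_ROLE/SPARK_MASTER_HOST variables sketched earlier (both are illustrative, not the actual names), starting the cluster could look like this:

```shell
# On ubuntu0: start the master. --net=host lets Spark open its ports
# (7077, the web UI, and the random communication ports) on the host.
docker run -d --name sparkmaster --net=host -e SPARK_ROLE=master spark

# On ubuntu1, ubuntu2 and ubuntu3: start a worker pointing at the master.
docker run -d --name sparkslave --net=host \
    -e SPARK_ROLE=slave -e SPARK_MASTER_HOST=ubuntu0 spark
```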

The Cassandra Data

In part I we had 3 Cassandra nodes on which we created a table called posts. We saw that rows of this table are partitioned (or sharded) across the 3 nodes based on the primary key, which is the username. If the replication factor is 2, then each partition will have 2 copies (on separate nodes).

For this example I reused the same posts table but set a replication factor of 1. This can be updated by altering the keyspace:
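Assuming the keyspace is named posts_db (the name used with the connector later) and uses SimpleStrategy, the statement would look like this in cqlsh:

```sql
ALTER KEYSPACE posts_db
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
```

After lowering the replication factor, running nodetool cleanup on each node removes the replicas that the nodes no longer own.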

nicolas | 399dfa90-a5bf-11e6-a21b-6d2c86545d91 | Python has the best syntax ! It looks clean and simple. And it is easy to understand for non-programmers.

nicolas | 39aa56a0-a5bf-11e6-a21b-6d2c86545d91 | Java, Scala and Groovy are compiled to Java-Bytecode, which is the instruction set of the Java Virtual Machine. This bytecode and can be run on any kind of platform which has JRE installed !

nicolas | 61f60af0-a5bf-11e6-a21b-6d2c86545d91 | PHP is lame ! I don't want to talk about it... But my blog is run by PHP :(

nicolas | 681ee2d0-a5bf-11e6-a21b-6d2c86545d91 | HTML and CSS are declarative languages. They describe what the end result must be, but not how to achieve that goal.

lepellen | 30621650-a5bf-11e6-a21b-6d2c86545d91 | Paris is a nice city with lots of cultural stuff to do and lots of wine to drink

lepellen | 306fd1f0-a5bf-11e6-a21b-6d2c86545d91 | Singapore is really clean, but it is too small.

lepellen | 313bd480-a5bf-11e6-a21b-6d2c86545d91 | Tokyo is great !

(13 rows)

So the data is composed of posts:

By lepellen, about travelling, stored in the cass1 container on the ubuntu1 machine.

By arnaud, about jazz music, stored in the cass2 container on the ubuntu2 machine.

By nicolas, about programming, stored in the cass3 container on the ubuntu3 machine.

In the next section we will use this data to see if the deployed workers take advantage of data locality.

Using the Cassandra Connector

First let’s run a Spark shell (Scala) in the master container:

dock@ubuntu0:~$ docker exec -ti sparkmaster bash
root@ubuntu0:/app# ./spark/bin/spark-shell \
    --master spark://ubuntu0:7077 \
    --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 \
    --conf spark.cassandra.connection.host=ubuntu1

There are 3 options:

--master: this time we specify the master URL, to run the shell in standalone mode against the master and all its workers.

--packages: downloads the given library (in this case the Datastax Cassandra connector) from https://spark-packages.org/ along with its dependencies.

--conf: specifies a Cassandra node in a property that the connector will use to auto-discover all the other nodes.

The startup output shows that the connector library is downloaded from spark-packages, and then Apache Ivy takes care of downloading all its dependencies from the Maven Central repository:

We can then use the input format provided by the connector to load the table into a distributed DataFrame:

scala> :paste
val posts = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "posts", "keyspace" -> "posts_db"))
  .load()

Now we can perform distributed computing tasks such as a word count. But let’s do something more interesting and create an inverted index which maps each word in all the posts to the IP address of the worker which processed it:
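A minimal sketch of such a job, building on the posts DataFrame loaded above, could look like this (the post-body column name and the word-splitting logic are assumptions for illustration):

```scala
import java.net.InetAddress

// For each partition, record the IP of the worker processing it, then
// map every word of every post body to that IP. Thanks to locality,
// each partition should be processed on the node that stores it.
val index = posts.select("body").rdd
  .mapPartitions { rows =>
    val ip = InetAddress.getLocalHost.getHostAddress // this worker's IP
    rows.flatMap(row => row.getString(0).split("\\s+").map(word => (word, ip)))
  }
  .distinct()
  .collect()

index.sorted.foreach { case (word, ip) => println(s"$word -> $ip") }
```

mapPartitions is used instead of map so that the worker's IP is resolved once per partition rather than once per row.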

Below are the results of this job. If you compare the mapping with the data described in the previous section, you can see that each worker processed the data from the Cassandra node on the same host. The connector did its job well!

Conclusion

Using Spark we can query and process Cassandra data any way we want. Now that we have all the containers we need, let’s move on to the orchestration part and see how we can organize their deployment in a cluster.