The easiest way to try out Apache Spark from Python on SherlockML is
in local mode, where all processing happens on a single server. You
still benefit from parallelisation across all the cores of that
server, but not across several servers.

Spark runs on the Java virtual machine and exposes Python, R and
Scala interfaces. You can use all of these on SherlockML, but the
installation procedure differs slightly for each.

To use PySpark on SherlockML, create a custom environment that
installs PySpark. Your custom environment should include:

- openjdk-8-jdk in the system section;
- pyspark in the Python section, under pip.

Start a new Jupyter server with this environment. Unfortunately,
PySpark does not play well with Anaconda environments. You therefore
need to set environment variables telling Spark which Python
executable to use. Add these lines to the top of your notebook:
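A minimal sketch of these lines, assuming the pip-installed pyspark
package; the thread count (4) and driver memory (4g) are hard-coded
placeholder values:

    import os
    import sys

    # Point both the Spark driver and the workers at the notebook's
    # own Python executable, rather than the system Python.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

    from pyspark.sql import SparkSession

    # Run Spark in local mode with four worker threads and 4 GB of
    # driver memory.
    spark = (
        SparkSession.builder
        .master("local[4]")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )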

This example hard-codes the number of threads and the memory. You may
want to set these dynamically based on the size of the server. You can
use the NUM_CPUS and AVAILABLE_MEMORY_MB environment variables
to determine the size of the server the notebook is currently running
on:
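A sketch of the same session creation with dynamic sizing; the 80%
memory fraction is an arbitrary choice that leaves headroom for the
operating system:

    import os

    from pyspark.sql import SparkSession

    # Server size, as reported by SherlockML.
    num_cpus = int(os.environ["NUM_CPUS"])
    available_memory_mb = int(os.environ["AVAILABLE_MEMORY_MB"])

    # Use all available cores and 80% of the available memory.
    driver_memory = "{}m".format(int(available_memory_mb * 0.8))

    spark = (
        SparkSession.builder
        .master("local[{}]".format(num_cpus))
        .config("spark.driver.memory", driver_memory)
        .getOrCreate()
    )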

Apply the environment described above to a Jupyter or RStudio
server. If you then open a new terminal, you can run spark-shell to
start a Spark shell.
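For example, a quick smoke test in the shell might look like this
(the computation itself is arbitrary):

    $ spark-shell
    scala> sc.parallelize(1 to 1000).sum()
    res0: Double = 500500.0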

While the Spark shell allows for rapid prototyping and iteration, it
is not suitable for larger Scala programs. The normal route for
developing these is to write a Scala application, package it as a
jar, and run it with spark-submit. To build a Scala application, you
will need sbt, the Scala build tool. You can install sbt reproducibly
by creating an environment with the following commands in the scripts
section:
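A sketch of what the scripts section might contain, following the
Debian installation instructions from the sbt documentation; verify
the repository URL and signing key against the current sbt docs, and
prefix the commands with sudo if the scripts do not run with root
privileges:

    # Register the sbt Debian repository and its signing key.
    echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" > /etc/apt/sources.list.d/sbt.list
    apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
    apt-get update
    apt-get install -y sbt

Once the application is packaged with sbt package, you can run the
resulting jar with spark-submit; the class and jar names below are
placeholders:

    spark-submit --class com.example.Main target/scala-2.11/my-app_2.11-0.1.jar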

To use SparkR from an RStudio server on SherlockML, create the
environment that installs Spark, as outlined in the previous
section. After you have applied that environment to an RStudio
server, you should be able to access Spark by executing the following
lines in your R session:
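A minimal sketch of those lines; the /opt/spark path is an assumption
and should be replaced with wherever the environment installed Spark:

    # Tell SparkR where to find the Spark installation.
    Sys.setenv(SPARK_HOME = "/opt/spark")

    # Load the SparkR package bundled with Spark.
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

    # Start a Spark session in local mode, using all available cores.
    sparkR.session(master = "local[*]")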

Spark runs a dashboard that gives information about currently
executing jobs. To access this dashboard, use the sml command line
client from your local computer to open a tunnel to the server:
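A sketch of the tunnel command, assuming the sml shell subcommand
forwards SSH-style port-forwarding options (<project> and <server>
are placeholders for your project and server names):

    sml shell <project> <server> -L 4040:localhost:4040

The dashboard is served on port 4040 by default, so once the tunnel
is open you can view it at http://localhost:4040 in your local
browser.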