To interact with the Spark cluster, Apache Livy must be installed on the
cluster. Contact your cluster administrator to arrange installation;
installation instructions are available in the Apache Livy documentation.

We also strongly recommend using Spark 2, which provides a much
easier-to-use interface for data science than Spark 1. Indeed, MLlib, the
Spark machine learning library, has already put its RDD-based (Spark 1)
interface into maintenance mode. You can check your Spark version by running
print(sc.version) inside a Spark session as described below. Contact
your cluster administrator to install Spark 2 and configure Apache Livy to
use it.
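
If you want the version check above to act as a guard rather than something you read by eye, a minimal sketch follows. The helper names are illustrative and not part of Spark or sparkmagic, and a literal version string stands in for sc.version so that the snippet runs outside a Spark session.

```python
def spark_major_version(version: str) -> int:
    """Return the major component of a Spark version string such as '2.4.5'."""
    return int(version.split(".")[0])


def is_spark_2_or_later(version: str) -> bool:
    """True when the version string reports Spark 2 or newer."""
    return spark_major_version(version) >= 2


# Inside a Spark session you would pass sc.version instead of this literal.
example_version = "2.4.5"
print(is_spark_2_or_later(example_version))
```

Inside a Spark session you could call is_spark_2_or_later(sc.version) at the top of your code and stop early if it returns False.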

Any libraries or other dependencies needed by your code must be installed
on the Spark cluster, not on your SherlockML server. Using
sparkmagic/pylivy and Apache Livy, the code you run inside a %spark
cell is run inside the external cluster, not in your notebook.

The -u/--url option sets the URL where Livy is running - you can get
this from your system administrator. You can create R and Scala Spark sessions
by setting the -l/--language option to r or scala respectively.
For documentation of all options run %spark?.
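
To make the URL and language options more concrete, the sketch below builds the kind of HTTP request that Livy's REST API expects when a session is created: a POST to the /sessions endpoint with a "kind" field. It is only an illustration of what these options roughly correspond to. The function name and the example URL are hypothetical, the request is built but never sent, and sparkmagic or pylivy may send additional fields.

```python
import json
import urllib.request

# Livy session kinds for the language names used above.
LIVY_KINDS = {"python": "pyspark", "r": "sparkr", "scala": "spark"}


def build_create_session_request(livy_url: str, language: str = "python") -> urllib.request.Request:
    """Build (but do not send) a Livy 'create session' request."""
    payload = {"kind": LIVY_KINDS[language]}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical Livy URL; use the one your administrator gives you.
request = build_create_session_request("http://livy.example.com:8998", "r")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```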

You can also list running sessions with:

%spark list

and delete sessions with:

%spark delete sessionname

sparkmagic also provides the %manage_spark command, which returns a widget
for managing Spark sessions on the Livy server. You may prefer it to the
commands above.

Once you’ve created a Spark session as above, run a cell on the cluster by
starting it with the %%spark cell magic (note the two % characters):

%%spark
print('I am being executed on the external cluster')

Note

It’s important to bear in mind the distinction between code executed on the
external Spark cluster and code executed in the notebook on your
SherlockML server. Cells that have the %%spark magic are executed on
the external cluster and only see variables that exist there. Cells
without that magic are executed on your SherlockML server and only see
variables that exist there.

If you get errors like NameError: name 'df' is not defined, it may be
because the variable you meant exists in the other context.

Transfer of data between the external cluster and SherlockML notebook must
be done explicitly, as described below.
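
The separation described in this note can be mimicked locally. The sketch below is a toy model only: two plain Python dictionaries stand in for the cluster session and the notebook kernel, and no Spark or Livy is involved. It shows why a name defined in one context raises a NameError in the other.

```python
# Two independent namespaces: one for the "cluster", one for the "notebook".
cluster_namespace = {}
notebook_namespace = {}

# A cell with the %%spark magic would define df on the cluster side.
exec("df = [1, 2, 3]", cluster_namespace)

# A cell without the magic runs on the notebook side and cannot see df.
try:
    exec("print(len(df))", notebook_namespace)
    error_message = None
except NameError as exc:
    error_message = str(exc)

print(error_message)
```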

The spark SparkSession (Spark 2 only) and sc SparkContext objects will
be inserted into the session automatically. For example, to create a Spark
DataFrame from a CSV file in the cluster’s HDFS filesystem:

%%spark
df = spark.read.csv('hdfs:///data/sample_data.csv')

Variables created in one cell persist in the session and are available in
later cells. For example, we can run a second cell that counts the
number of rows in the Spark DataFrame created above:

%%spark
print(df.count())

Any output generated by your code in the cluster will be retrieved and
displayed as the output of the notebook cell in SherlockML.

Often, you’ll want to retrieve the contents of a Spark DataFrame from the
cluster so you can do additional processing and modelling in your normal
Jupyter notebook. You can do this with the -o option:

%spark -o df

This will evaluate and collect the Spark DataFrame df on the external
Spark cluster, and save its data into a Pandas DataFrame in your SherlockML
notebook, also called df.

Note

Using %spark -o will attempt to load all of the values from a Spark
DataFrame into memory on your SherlockML server. If the DataFrame is very
large, as is often the case with Spark DataFrames, this may crash your
server by exhausting its memory!
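
One way to reduce this risk is to check the row count on the cluster and collect only a capped copy of the DataFrame. The sketch below generates the PySpark lines for that, using the standard DataFrame methods count() and limit(). The helper name, the df_small variable and the row cap are illustrative assumptions, not part of sparkmagic.

```python
MAX_ROWS = 1000  # hypothetical cap; pick one that fits your server's memory


def limited_copy_code(source: str, target: str, max_rows: int = MAX_ROWS) -> str:
    """Return PySpark code that stores at most max_rows rows of source in target."""
    return "\n".join([
        f"print({source}.count())",
        f"{target} = {source}.limit({max_rows})",
    ])


cell_body = limited_copy_code("df", "df_small")
compile(cell_body, "<spark cell>", "exec")  # confirm the lines are valid Python syntax
print(cell_body)
```

You could paste the printed lines into a %%spark cell and then run %spark -o df_small to collect only the capped copy.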

You can also use -o with a %%spark cell magic. For example, a cell that
starts with %%spark -o top_ten and whose body is top_ten = df.limit(10)
creates a Spark DataFrame in the external cluster called top_ten, then
collects it into the SherlockML notebook as the Pandas DataFrame top_ten.

pylivy provides the LivySession class, which creates a Spark session and
shuts it down automatically when finished. To execute code in the session, pass
it as a string to the run() method on the session.
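
As a rough sketch of the run() pattern described above, the helper below dedents a multi-line block of code and submits it to a session. The helper name is illustrative, and RecordingSession is a stand-in (not part of pylivy) that only records what it is given, so the example runs without a cluster. The submitted code assumes a Spark DataFrame named df already exists in the session.

```python
import textwrap


def run_on_cluster(session, code: str):
    """Dedent a block of code and run it in the given session.

    `session` is assumed to be an open pylivy LivySession, or any object
    with a compatible run(code) method.
    """
    return session.run(textwrap.dedent(code).strip())


class RecordingSession:
    """Stand-in for a real session so the sketch runs without a cluster."""

    def __init__(self):
        self.submitted = []

    def run(self, code):
        self.submitted.append(code)


demo_session = RecordingSession()
run_on_cluster(demo_session, """
    row_count = df.count()
    print(row_count)
""")
print(demo_session.submitted[0])
```

With pylivy installed, you would pass a real session in place of RecordingSession. How the session is opened depends on the pylivy version: recent releases use LivySession.create(url) as a context manager, while older ones construct LivySession(url) directly, so check the documentation for the version you have.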

The read() method on the session allows you to evaluate and retrieve the
contents of a Spark DataFrame. Pass it the name of the Spark DataFrame you want
to read, and it will return it as a Pandas DataFrame.
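
A minimal sketch of combining run() and read() follows. It caps the number of rows on the cluster before reading, so that only a bounded amount of data is copied into the notebook. The helper name and the _limited suffix are illustrative, and FakeSession is a stand-in (not part of pylivy) that records calls instead of contacting a cluster; a real session's read() would return a Pandas DataFrame.

```python
def read_limited(session, name: str, max_rows: int = 1000):
    """Copy at most max_rows rows of the Spark DataFrame `name` into the notebook."""
    limited_name = f"{name}_limited"
    # Create the capped DataFrame on the cluster, then read it back by name.
    session.run(f"{limited_name} = {name}.limit({max_rows})")
    return session.read(limited_name)


class FakeSession:
    """Stand-in that records calls instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def run(self, code):
        self.calls.append(("run", code))

    def read(self, dataframe_name):
        self.calls.append(("read", dataframe_name))
        return f"<pandas DataFrame for {dataframe_name}>"


demo = FakeSession()
result = read_limited(demo, "df", max_rows=10)
print(demo.calls)
```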