Running Spark

Spark Local: your computer

From the Spark download page, choose version 2.4.4 (the version we'll be using on the cluster) and the package “Pre-built for Hadoop 2.7 and later”, then click the “Download Spark” link. Unpack the archive somewhere convenient, and set an environment variable so you can find it easily later:
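The exact path depends on where you unpacked the archive; assuming it unpacked to spark-2.4.4-bin-hadoop2.7 under your home directory (adjust to your setup), something like this in your shell profile would do:

```shell
# Point SPARK_HOME at the unpacked Spark directory (this path is an assumption:
# use wherever you actually unpacked it).
export SPARK_HOME="${HOME}/spark-2.4.4-bin-hadoop2.7"
# Optionally, put Spark's commands (pyspark, spark-submit) on your PATH too:
export PATH="${SPARK_HOME}/bin:${PATH}"
```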

Cluster

On the cluster, you will need a newer version of Spark than the default. Each time you log in, enable it with:

module load spark

Spark's commands will then be on your PATH, and you can get started:

pyspark
spark-submit sparkcode.py

Debugging & Exceptions

On the cluster, you won't be able to see any exceptions your code throws, since they happen on some node out there in the cluster. There are two ways to deal with this.

After a run, you can (as always) see each process' output:

yarn logs -applicationId application_1234567890_1234 | less
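Paging through everything can be tedious; if you're looking specifically for a Python traceback, one way (a sketch; the application id is a placeholder, as above) is to filter the logs with grep:

```shell
# Show any Python tracebacks in the logs, with some surrounding context.
yarn logs -applicationId application_1234567890_1234 | grep -B 2 -A 10 'Traceback'
```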

Or you can run a job on YARN in client deploy mode. This moves the driver (your code) to the login node, so any exceptions thrown there are visible. Please don't do this often, since it shifts work from the cluster back to the login node (which can run out of memory if everybody is doing this and running pyspark):

spark-submit --master yarn --deploy-mode client sparkcode.py

Monitoring Jobs

In the YARN web front end (http://localhost:8088 if you have your ports forwarded as in the Cluster instructions), you can click your app while it's running, then the “ApplicationMaster” link.

If you're on campus, the link will work. If not, you can replace “nml-cloud-199.cs.sfu.ca” with “localhost” in the URL and it should load. (Or, if you really want, add the line 127.0.0.1 nml-cloud-199.cs.sfu.ca to your OS's /etc/hosts file and the links will work.)

After the job has finished, you can also use the yarn logs command to get the stdout and stderr from your jobs, as described in the Cluster instructions.

Spark and PyPy

PyPy is a Python implementation with a just-in-time compiler, and it can be astonishingly fast. It can be used with Spark to speed up execution of the Python parts of your job. (In Python Spark, the work is split between the Scala/JVM implementation of the core logic and the Python implementation of your logic and parts of the PySpark API.)

In general for Spark, you need to set the PYSPARK_PYTHON environment variable to the command that starts the PyPy executable.
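Outside the cluster, that might look like this (assuming PyPy is installed and its executable is named pypy3; the command name and path depend on your install):

```shell
# Make PySpark start its Python workers with PyPy instead of CPython.
# "pypy3" is an assumption: use whatever command starts PyPy on your machine.
export PYSPARK_PYTHON=pypy3
```

After this, pyspark and spark-submit started from the same shell will use PyPy for the Python side.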

On the cluster, you can do this:

module load spark
module load pypy3

This sets the PYSPARK_PYTHON variable to point to a recent PyPy version, and sets SPARK_YARN_USER_ENV so that PYSPARK_PYTHON is also set on the executors.