Re: Loading Python libraries into Spark

For third-party libraries the simplest way is to use Puppet [1], Chef [2], or any similar automation tool to install the packages (either from PyPI via pip or from your distribution's repository). This is easy because if you manage your cluster's software you are most probably already using one of these automation tools, and thus only need to write one more recipe to keep your Python packages healthy.

For your own libraries it may be inconvenient to publish them to PyPI just to deploy them to your servers. Instead, you can "attach" your Python lib to the SparkContext via the "pyFiles" option or the "addPyFile" method. For example, to attach a single Python file, you can do the following:

from pyspark import SparkConf, SparkContext

conf = SparkConf()                 # set master, app name, etc. as needed
sc = SparkContext(conf=conf)
sc.addPyFile("/path/to/yourmodule.py")

And for entire packages (in Python, a package is any directory with an "__init__.py" and possibly several more "*.py" files) you can use a trick and pack them into a zip archive. I used code along these lines for my library:
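Something like the following (a sketch, not verbatim; the helper name "ziplib", the library path, and the "/tmp" location are illustrative):

import zipfile

def ziplib(libpath, zippath):
    # Pack a package directory into a zip archive that Python can import from.
    # PyZipFile.writepy() compiles and adds all *.py files under the package.
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.writepy(libpath)
        return zippath
    finally:
        zf.close()

zip_path = ziplib('/path/to/mylib', '/tmp/mylib.zip')
sc.addPyFile(zip_path)   # attach the whole archive instead of a single file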

Python has one nice feature: it can import code not only from modules (simple "*.py" files) and packages, but also from a variety of other formats, including zip archives. So when your distributed code says something like "import mylib", Python finds the "mylib.zip" attached to the SparkContext and imports the required modules.
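Outside of Spark you can see the same mechanism with plain Python (assuming "mylib.zip" was built as above):

import sys
sys.path.insert(0, '/tmp/mylib.zip')   # zip archives work as import locations
import mylib                           # loaded straight from the archive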

Re: Loading Python libraries into Spark

Hi Andrei,

Thank you for your help! Just to make sure I understand: when I run the command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged into the master node and have my Python files somewhere, is that correct?

Re: Loading Python libraries into Spark

Roughly, the following happens when you run a Spark application:

1. You start your driver program (your Python application) on one of the hosts.

2. You create a SparkContext, which initializes your application. At this point the application connects to the master and asks for resources.

3. You modify the SparkContext object to include everything you want to make available to mappers on other hosts, e.g. other "*.py" files.

4. You create an RDD (e.g. with "sc.textFile") and run actual operations on it ("map", "filter", etc.). The SparkContext knows about your additional files, so these operations are aware of your library code.

So, yes, in this setting you need to create "sc" (the SparkContext object) beforehand and make the "*.py" files available on the application's host.
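Putting the four steps together, a minimal sketch (the app name, file paths, and "mylib.transform" are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("myapp")   # steps 1-2: driver starts and creates the context
sc = SparkContext(conf=conf)

sc.addPyFile("/tmp/mylib.zip")           # step 3: make the library available to workers

def transform(line):
    import mylib                         # resolved from the attached archive on each worker
    return mylib.transform(line)

rdd = sc.textFile("/path/to/data.txt")   # step 4: create an RDD and run operations
print(rdd.map(transform).take(5))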

With the pyspark shell you already have an "sc" object initialized for you (try running "pyspark" and typing "sc" + Enter; the shell will print the Spark context details). You can also use spark-submit [1], which will initialize the SparkContext from command line options. But essentially the idea is always the same: there's a driver application running on one host that creates the SparkContext, collects dependencies, and controls the program flow, and there are workers, i.e. applications on the slave hosts, that use the created SparkContext and all the serialized data to perform the driver's commands. The driver should know about everything and let the workers know whatever they need to know (e.g. your library code).
