Usage

Option 1. Mesos-mastered Spark Jobs

<strong>Install Mesos with Docker Containerizer and Docker Images</strong>: Install a Mesos cluster configured to use the Docker containerizer, which enables the Mesos slaves to execute Spark tasks within a Docker container.

A. <strong>Scripted Installation</strong>: Install/configure the cluster:<pre><code>./mesos/1-setup-mesos-cluster.sh</code></pre>Optional: run <code>./1-build.sh</code> if you prefer to build the Docker images from scratch rather than having the script pull them from Docker Hub.
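To confirm the cluster came up, you can query the Mesos master's state endpoint (a quick sketch; substitute your master's FQDN, and note that 5050 is the default Mesos master port rather than anything specific to these scripts):<pre><code># Check that the Mesos master responds and lists its registered slaves
curl -s http://mesos-master-fqdn:5050/master/state.json | python -m json.tool | head</code></pre>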

B. <strong>Manual Installation</strong>: Follow the general steps in mesos/1-setup-mesos-cluster.sh to install and configure the cluster manually.

<strong>Run the client container on a client host</strong> (replace 'username-for-sparkjobs' and 'mesos-master-fqdn' below): <pre><code>./5-run-spark-mesos-dockerworker-ipython.sh username-for-sparkjobs mesos://mesos-master-fqdn:5050</code></pre>Note: the client container creates the user username-for-sparkjobs when it starts, which lets you submit Spark jobs as a specific user and/or deploy separate IPython servers for different users.
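As a quick smoke test from inside the client container, you can submit the stock SparkPi example against the same Mesos master (a sketch only; the path to the examples jar depends on your Spark version and install location):<pre><code># Submit the bundled SparkPi example to the Mesos master (jar path is illustrative)
spark-submit --class org.apache.spark.examples.SparkPi \
  --master mesos://mesos-master-fqdn:5050 \
  $SPARK_HOME/lib/spark-examples-*.jar 100</code></pre>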

Option 2. Spark Standalone Mode

Installation and Deployment - Build each Docker image and run each container on a separate, dedicated host

<div><strong>Tip</strong>: Build a common/shared host image with all necessary configurations and pre-built containers, which you can then use to deploy each node. When starting each node, you can pass the container run scripts as <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html">User data</a> to initialize that container at boot time.</div>
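For example, a minimal EC2 user-data script for a worker host might look like the following (a sketch only; it assumes the repo is baked into the host image at /opt/ipython-spark-docker and that the worker run script name and arguments match your checkout):<pre><code>#!/bin/bash
# Runs at first boot: start the Spark worker container and point it at the standalone master
cd /opt/ipython-spark-docker
./3-run-spark-worker.sh spark://master.domain.com:7077</code></pre>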

<strong>Prerequisites</strong>

Deploy a Hadoop/HDFS cluster. Spark uses a cluster to distribute analysis of data pulled from multiple sources, including the Hadoop Distributed File System (HDFS). The ephemeral nature of Docker containers makes them ill-suited for persisting long-term data in a cluster. Instead of attempting to store data within the Docker containers' HDFS nodes or mounting host volumes, it is recommended you point this cluster at an external Hadoop deployment. Cloudera provides complete resources for installing and configuring its distribution (CDH) of Hadoop. This repo has been tested using CDH5.
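Once the external Hadoop cluster is running, it is worth verifying that the hosts which will run the Spark containers can reach HDFS (a sketch; namenode.domain.com and port 8020 are placeholders for your CDH5 NameNode):<pre><code># Verify HDFS is reachable from a Spark host (requires the Hadoop client tools)
hdfs dfs -ls hdfs://namenode.domain.com:8020/</code></pre>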

Update the Hadoop configuration files in runtime/cdh5/<hadoop|hive>/<multiple-files> with the correct hostnames for your Hadoop cluster. Use <code>grep FIXME -R .</code> to find hostnames to change.
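One way to make the edits in bulk (a sketch; namenode.domain.com is a placeholder, and if your cluster uses different hostnames for different services you should edit the files individually instead):<pre><code># List files containing the FIXME placeholder, then replace it with the real hostname
grep -Rl FIXME runtime/cdh5
grep -Rl FIXME runtime/cdh5 | xargs sed -i 's/FIXME/namenode.domain.com/g'</code></pre>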

Generate a new SSH keypair (dockerfiles/base/lab41/spark-base/config/ssh/id_rsa and dockerfiles/base/lab41/spark-base/config/ssh/id_rsa.pub), adding the public key to dockerfiles/base/lab41/spark-base/config/ssh/authorized_keys.
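A minimal way to do this from the repo root (a passwordless key is assumed here so the containers can SSH to each other without prompting):<pre><code># Generate a passwordless RSA keypair for inter-container SSH
ssh-keygen -t rsa -N "" -f dockerfiles/base/lab41/spark-base/config/ssh/id_rsa
# Authorize the new public key
cat dockerfiles/base/lab41/spark-base/config/ssh/id_rsa.pub >> dockerfiles/base/lab41/spark-base/config/ssh/authorized_keys</code></pre>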

(optional) Comment out any unwanted Python packages in the base image Dockerfile at dockerfiles/base/lab41/python-datatools/Dockerfile.

<strong>Get Docker images</strong>:

<div>Option A: If you prefer to pull from Docker Hub:<pre><code>docker pull lab41/spark-master
docker pull lab41/spark-worker
docker pull lab41/spark-client-ipython</code></pre></div><div>Option B: If you prefer to build from scratch yourself:<pre><code>./1-build.sh</code></pre></div><div>If you are creating common/shared host images, this would be the point to snapshot the host image for replication.</div>
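Either way, you can confirm the images are present before snapshotting the host image:<pre><code># List the Lab41 Spark images that were pulled or built
docker images | grep lab41/spark</code></pre>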

<strong>Deploy cluster nodes</strong><div>Ensure each host has a Fully-Qualified-Domain-Name (e.g. master.domain.com, worker1.domain.com, ipython.domain.com) so the Spark nodes can properly associate with one another</div>
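Before starting containers, it can help to confirm each host reports and resolves the names you expect (a sketch; the example names match those above):<pre><code># Each host should report its fully-qualified name, e.g. master.domain.com
hostname -f
# The other nodes' FQDNs should resolve from every host
getent hosts master.domain.com worker1.domain.com ipython.domain.com</code></pre>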

<strong>Run the master container on the master host</strong>: <pre><code>./2-run-spark-master.sh</code></pre>
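Once the master container is running, the standalone master's web UI should be reachable (a sketch; 8080 is the Spark standalone default and may differ if the run script overrides it):<pre><code># Confirm the standalone master is up and serving its web UI
curl -s http://master.domain.com:8080 | grep -i "spark master"</code></pre>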