This is mainly a guide for myself to reference again at a later date, but I figured it would be useful to others. There are a few guides similar to this one floating around, but a lot of them have not been updated in a while. This guide assumes some basic familiarity with the command line and AWS console.

Step 1: Create an Amazon EC2 Instance

For simplicity we will use the Free Tier micro instance running 64-bit Ubuntu. Use all the default settings, with one exception: when editing the instance's security group, set the rule type to All Traffic instead of SSH. This guide can be extended to multiple instances or larger instances.

Step 2: SSH into that instance (Windows)

For Windows users, you’ll need to use PuTTY. Amazon has a really good set of instructions located here. Follow those to the point where you are connected to the Ubuntu console.

Step 3: SSH into that instance (Linux/Mac OS)

Mac OS and Linux usually come with an SSH client already built in. You can connect to your EC2 instance with a straightforward ssh command, using the .pem file that you downloaded. Amazon also has instructions on this here.
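As a rough illustration of the shape of that ssh command, here is a small Python sketch that assembles it. The key file name my-key.pem and the host name are placeholders rather than values from this guide, and the helper name build_ssh_command is made up for this example. The ubuntu user is the default login on Ubuntu AMIs.

```python
import shlex


def build_ssh_command(pem_path, public_dns, user="ubuntu"):
    """Return the ssh argument list for logging in with a .pem key file."""
    return ["ssh", "-i", pem_path, f"{user}@{public_dns}"]


# Placeholder values: substitute your own key file and Public DNS.
command = build_ssh_command(
    "my-key.pem", "ec2-xx-xx-xxx-xxx.us-west-2.compute.amazonaws.com"
)

# Print the command as a single line you could paste into a terminal.
print(shlex.join(command))
```

If ssh refuses the key because its permissions are too open, restricting the file with chmod 400 usually resolves it.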

You should now have successfully connected to the command line of your virtual Ubuntu instance running on EC2. The rest of the guide will tell you commands to put into this terminal.

Step 4: Download and Install Anaconda

Next we will download and install Anaconda for our Python. You can replace the version numbers here with whatever version you prefer (2 or 3).

You’ll get asked some general questions after running that last line. Just fill them out with some general information.

Step 8: Edit Configuration File

Next we need to finish editing the Jupyter Configuration file we created earlier. Change directory to:

$ cd ~/.jupyter/

Then we will use the visual editor (vi) to edit the file. Type:

$ vi jupyter_notebook_config.py

You should see a bunch of commented-out Python code. This is where you can either uncomment lines or add your own (adding password protection, for example, is an option here). We will keep things simple.

Press i on your keyboard to activate -INSERT-. Then at the top of the file type:

c = get_config()

# Notebook config: this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'

# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'

# Don't open browser by default
c.NotebookApp.open_browser = False

# Fix port to 8888
c.NotebookApp.port = 8888

Once you’ve typed/pasted this code in your config file, press Esc to stop inserting. Then type a colon : and then type wq to write and quit the editor.
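If you want to double-check what you typed, a short script can read the settings back. The following is only a sketch: the helper name read_notebook_settings is hypothetical, it only understands simple one-line c.NotebookApp.* assignments, and the sample text stands in for the real file at ~/.jupyter/jupyter_notebook_config.py.

```python
import ast
import re

# Matches uncommented lines such as: c.NotebookApp.port = 8888
SETTING = re.compile(r"^\s*c\.NotebookApp\.(\w+)\s*=\s*(.+?)\s*$")


def read_notebook_settings(config_text):
    """Return the uncommented c.NotebookApp.* assignments as a dict."""
    settings = {}
    for line in config_text.splitlines():
        match = SETTING.match(line)
        if not match:
            continue  # comments and unrelated lines are skipped
        name, raw_value = match.groups()
        try:
            # Turn the right-hand side into a real Python value.
            settings[name] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            settings[name] = raw_value  # keep non-literal values as text
    return settings


# Sample text standing in for the contents of the config file.
sample = """
c = get_config()
# c.NotebookApp.port = 9999
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
"""

for name, value in read_notebook_settings(sample).items():
    print(name, "=", repr(value))
```

To run it against the real file, you could replace the sample string with the text returned by open() and read() on the config path.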

Step 9: Check that Jupyter Notebook is working

Now you can check that Jupyter Notebook is working. In your Ubuntu console type:

$ jupyter notebook

You’ll see output saying that Jupyter Notebook is running on all IP addresses at port 8888. Open your own web browser (Google Chrome suggested) and type in https:// followed by the Public DNS of your Amazon EC2 instance and then :8888. It should be in the form:

https://ec2-xx-xx-xxx-xxx.us-west-2.compute.amazonaws.com:8888

After putting that into your browser you’ll probably get a warning about an untrusted certificate. Go ahead and click through it and connect anyway; you trust the site. (Hopefully! After all, you are the one who made it.)
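The same check can be scripted from your own machine. The sketch below is one possible approach, not part of the original setup: it builds the notebook URL and tries to open it with certificate verification switched off, which mirrors clicking through the browser warning for your own self-signed certificate. The function names are invented for this example, and the host name is a placeholder.

```python
import ssl
import urllib.error
import urllib.request


def notebook_url(public_dns, port=8888):
    """Build the HTTPS address of the notebook server."""
    return f"https://{public_dns}:{port}"


def unverified_context():
    """SSL context that accepts a self-signed certificate."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be turned off before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def is_reachable(url, timeout=5):
    """Return True if a server answers at the URL, False otherwise."""
    try:
        with urllib.request.urlopen(
            url, timeout=timeout, context=unverified_context()
        ):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except OSError:
        return False  # refused, timed out, DNS failure, or TLS failure


# Placeholder host: substitute the Public DNS of your own instance.
url = notebook_url("ec2-xx-xx-xxx-xxx.us-west-2.compute.amazonaws.com")
print(url)
# print(is_reachable(url))  # uncomment once the real host name is filled in
```

Only skip verification like this for a certificate you created yourself.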

You should be able to see Jupyter Notebook running on your EC2 instance. Great! Now we need to go back and install Scala, Java, Hadoop, and Spark on that same instance to get PySpark working correctly. Use Ctrl-C in your EC2 Ubuntu console to kill the Jupyter Notebook process. Clear the console with clear and move on to the next steps to install Spark.

Step 10: Install Java

Next we need to install Java in order to install Scala, which we need for Spark. Back at your EC2 command line type:

$ sudo apt-get update

Then install Java with:

$ sudo apt-get install default-jre

Check that it worked with:

$ java -version

Step 11: Install Scala

Now we can install Scala:

$ sudo apt-get install scala

Check that it worked with:

$ scala -version
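The Java and Scala checks above can also be done from Python, which is handy if you later want a notebook cell that confirms the tools are visible. This is a minimal sketch; missing_tools is a name chosen for this example, and it only checks that the commands are on the PATH, not which versions they are.

```python
import shutil


def missing_tools(names):
    """Return the command names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means both commands were found.
print(missing_tools(["java", "scala"]))
```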

(Optional: You can install specific versions of Scala with the following, just replace the version numbers):

Step 12: Install py4j

We need to install the Python library py4j. In order to do this, we need to make sure that pip is connected to our Anaconda installation of Python instead of Ubuntu’s default. In the console we will export the path for pip: