Azure Big Data: Spark 2.3 Centos VM Standalone

As of late I’ve been investigating the options for running a Spark big data platform on Azure using Blob and Data Lake storage. So far I’ve poked around with the following – which I may blog about if I get time:

IaaS Linux VM Build (standalone and clustered)

HDInsight

Databricks

Spark on Azure Container Cluster (AKS preview) i.e. Kubernetes

This is a basic how-to for installing Spark 2.3 on a standalone CentOS VM in Azure: the latest and greatest build of Spark 2.3, CentOS 7.4 (Linux), Scala 2.11.12 and Java 8+. There are later versions of Scala, but Spark 2.3 supports Scala 2.11 at most, as covered here:

Preparing Your Client Machine

Install bash client

Create an SSH RSA key – we need this before creating the Azure VM

We’re setting up a Linux server to run Spark on a CentOS VM in Azure. I’m not going to bother with a Linux desktop or remote desktop, but we’ll need a client Bash terminal to connect to the machine in order to:

Administer CentOS & install software

Use the Spark terminal

I run macOS and Windows 10 – mostly Windows 10. If you’re on a Mac you don’t need a Bash client terminal since you already have one. Windows, however, does.

There is a new Windows Subsystem for Linux available in the Windows 10 Fall Creators Update, but I hit some issues with it so wouldn’t advise it yet. It’s not just a Bash client; it emulates a local Linux subsystem, which brings some irritating complications. The best experience by far I’ve had is with Git Bash, so go ahead and install that if you’re using Windows.

Once we have Bash we need to create a public/private key pair so that we can use the Secure Shell (SSH) command to securely connect to our Linux VM.

Open Git Bash and execute the following:

ssh-keygen -t rsa -b 2048

This creates a 2048-bit RSA public/private key pair that we can use to connect to our Azure VM. You’ll be prompted for a filename and passphrase. See here:
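If you’d rather skip the prompts, the same step can be run non-interactively – the demokey file name, the scratch directory and the empty passphrase below are illustrative assumptions, not requirements:

```shell
# Non-interactive version of the key generation step (the demokey file
# name and empty passphrase are illustrative assumptions)
KEYDIR=$(mktemp -d)    # scratch directory so this is safe to run anywhere
ssh-keygen -t rsa -b 2048 -f "$KEYDIR/demokey" -N ""
ls -l "$KEYDIR/demokey" "$KEYDIR/demokey.pub"
```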

As we can see it says that it created the key in:

C:\Users\shaun\.ssh\id_rsa

In my case it didn’t, however; demokey.pub and demokey can be found in my Bash home directory:

C:\Users\shaun\demokey.pub (this is the public key)
C:\Users\shaun\demokey (no extension; this is the private key)

Review these files using Notepad. Copy the private key into the .ssh folder and rename it to id_rsa:
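The copy/rename can also be scripted from Git Bash. This sketch uses a scratch directory and a dummy key file so it is safe to run anywhere – swap in your real home directory and the key you generated:

```shell
# Sketch of the copy/rename step (the scratch dir and dummy key are
# stand-ins for your real Git Bash home and generated private key)
H=$(mktemp -d)                      # stand-in for C:\Users\<you>
printf 'dummy private key\n' > "$H/demokey"
mkdir -p "$H/.ssh"
cp "$H/demokey" "$H/.ssh/id_rsa"    # ssh looks for ~/.ssh/id_rsa by default
chmod 600 "$H/.ssh/id_rsa"          # private keys must not be readable by others
```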

C:\Users\shaun\.ssh\id_rsa

Keep a note of the public key – a single line starting with ssh-rsa – because this lives on the Linux server and we need it when creating the Linux VM in Azure. Also don’t forget the passphrase you entered, because we need that to log in using the ssh command.

Create Centos VM in Azure

Log in to the Azure Portal, click Create a Resource and search for CentOS. Choose the CentOS-based 7.4 image and hit Create.

Fill in the necessaries to create your VM, choosing the most affordable and appropriate machine. For a demo/learning standalone I tend to go for about 4 CPUs and 32 GB of RAM (remember, Spark is an in-memory-optimised big data platform). The important bit is to copy and paste our public RSA key into the SSH Public Key input box so it can be placed on the VM when provisioned. When Azure has provisioned your VM it leaves it up and running.

Connect to CentOS VM

So hopefully that all went well and we’re now ready to connect. You can give your VM a DNS name (see the docs), but I tend to just connect using the IP. Navigate to the VM in the portal and click the Connect button. This shows the SSH command with the server address that we can enter into a Bash client in order to connect.

Enter the SSH command, enter the passphrase and we’re good to go:
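The command looks something like this – the user name and address are placeholders, and the portal’s Connect button shows the exact values for your VM:

```shell
# Placeholder values -- take the real user and IP from the portal's
# Connect button; you'll be prompted for your key passphrase
ssh azureuser@<public-ip-address>
```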

Patch the OS

Ensure the OS is patched. The reboot will kick you out of your SSH session, so you’ll need to sign back in.

sudo yum update -y
sudo reboot

Install Java 8

Install OpenJDK 1.8 and validate the install:

sudo yum install java-1.8.0-openjdk.x86_64
java -version

Set the following home paths in your .bash_profile so that every time we log in our paths are set accordingly. To do this we’ll use the nano text editor:

sudo nano ~/.bash_profile

Add the following path statements, since they’re required by the Scala config:
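For an OpenJDK 1.8 install on CentOS they look something like this – the exact JAVA_HOME path is an assumption, so check yours with readlink -f $(which java):

```shell
# Illustrative ~/.bash_profile additions (the JAVA_HOME path is an
# assumption for OpenJDK 1.8 on CentOS -- verify it on your own VM
# with: readlink -f $(which java))
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin
```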

The Scala script looks for a release directory in the $JAVA_HOME path, and there isn’t one in this OpenJDK layout; see a more thorough explanation here. It’s not vitally necessary, but I got around it by just creating a release directory at $JAVA_HOME.
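A minimal sketch of that workaround – it uses a scratch directory as a stand-in for JAVA_HOME so it is safe to run as-is; on the real VM point it at your JDK and add sudo if the path is system-owned:

```shell
# Workaround sketch: create the release directory the Scala script
# expects (scratch dir is a stand-in for your real $JAVA_HOME; on the
# VM run: sudo mkdir "$JAVA_HOME/release")
JH=$(mktemp -d)
mkdir -p "$JH/release"
ls -ld "$JH/release"
```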

Install Spark 2.3.0

Final step! Install Spark. Download the Spark 2.3.0 package (a .tgz). We’ll use wget to download the package from one of the mirror URLs listed on this page. I’m using the first listed mirror, but adjust as you see fit.
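A sketch of the download and unpack, assuming the hadoop2.7 build and the Apache archive as the source – substitute whichever mirror URL you picked:

```shell
# Sketch: fetch, unpack and smoke-test Spark 2.3.0 (the archive.apache.org
# URL and the hadoop2.7 build are assumptions -- use your chosen mirror)
SPARK_PKG=spark-2.3.0-bin-hadoop2.7.tgz
wget "https://archive.apache.org/dist/spark/spark-2.3.0/$SPARK_PKG"
tar -xzf "$SPARK_PKG"
cd spark-2.3.0-bin-hadoop2.7
./bin/spark-shell   # Scala REPL -- a working prompt means the install is good
```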