In the last few posts of our Kubernetes series, we discussed the various abstractions available in the framework. In the next set of posts, we will be
building a Spark cluster using those abstractions. As part of the cluster setup, we will discuss how to use the various configurations available
in Kubernetes to achieve some of the important features of clustering. This is the fifth blog of the series, where we will discuss building a Spark
2.1.0 Docker image for running a Spark standalone cluster. You can access all the posts in the series here.

Need for Custom Spark Image

Kubernetes has already documented creating a Spark cluster on GitHub. But it currently uses an old version of Spark, and it has some configurations which are specific to Google Cloud. These configurations are not needed in most use cases. So in this blog, we will develop a simple Spark image which is based on the Kubernetes one.

This Spark image is built for standalone Spark clusters. From my personal experience, Spark standalone mode is better suited for containerization
than YARN or Mesos.

Dockerfile

The first step of creating a Docker image is to write a Dockerfile. In this section, we will discuss how to write the Dockerfile needed
for Spark.

The below are the different steps of the Dockerfile.

Base Image

FROM java:openjdk-8-jdk

The above statement in the Dockerfile defines the base image. We are using
a base image which gives us a Debian system with Java installed. We need
Java for all Spark services.

Define Spark Version

ENV spark_ver 2.1.0

The above line defines the version of Spark. Using ENV, we can define a variable and use it in different places in the script. Here we are building the image with Spark version 2.1.0. If you want another version, change this configuration.
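The FROM and ENV lines above are only the beginning of the Dockerfile; the complete file lives in the repository we clone later in this post. As a rough sketch, the remaining steps download the Spark binary release from the Apache archive and unpack it. The archive URL and the /opt/spark install path below are assumptions for illustration; treat the repository's Dockerfile as the authoritative version.

```dockerfile
# Sketch of the remaining Dockerfile steps (assumed URL and install path):
# download the Spark release matching ${spark_ver}, unpack it under /opt,
# and add a stable /opt/spark symlink so scripts need not hardcode the version.
RUN mkdir -p /opt && \
    cd /opt && \
    curl -sSL "https://archive.apache.org/dist/spark/spark-${spark_ver}/spark-${spark_ver}-bin-hadoop2.6.tgz" \
      | tar -zx && \
    ln -s "spark-${spark_ver}-bin-hadoop2.6" spark

WORKDIR /opt/spark
```

Because the download URL is built from the spark_ver variable, switching Spark versions only requires editing the single ENV line.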

Now we have our Dockerfile ready. To build an image from it, we need Docker.

Installing Docker

You can install Docker on your machine using the steps here. I am using Docker version 1.10.0.

Using the Minikube Docker Environment

Whenever we use Docker, it normally runs a daemon on our machine. This daemon is used for building and pulling Docker images. Even though we can build our Docker image on our local machine, it is not that useful, because our Kubernetes cluster runs in a VM. In that case, we would need to push our Docker image into the VM before we could use the image in Kubernetes.

An alternative approach is to use the minikube Docker daemon. This way, we can build our Docker images directly on the virtual machine.

To access minikube docker daemon, run the below command

eval $(minikube docker-env)
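This works because minikube docker-env prints shell export statements that point the local Docker client at the daemon inside the VM. The output looks roughly like the following; the IP address and paths are examples and will differ on your machine.

```shell
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_CERT_PATH="/home/user/.minikube/certs"
```

Evaluating these exports in your shell means every subsequent docker command in that session talks to the minikube VM instead of your local daemon.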

Now you can run

docker ps

You should see all the Kubernetes containers listed as Docker containers. This means you have successfully connected to the minikube Docker environment.

Building the Image

Clone the code from GitHub as below:

git clone https://github.com/phatak-dev/kubernetes-spark.git

cd to the docker folder, then run the below Docker command:

cd docker
docker build -t spark-2.1.0-bin-hadoop2.6 .

In the above command, we are tagging (naming) the image as spark-2.1.0-bin-hadoop2.6.
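Once the build finishes, you can verify the image with the commands below. The /opt/spark path in the smoke test is an assumption about where the Dockerfile unpacks Spark; adjust it to match the repository's Dockerfile if it installs to a different location.

```shell
# confirm the image is in minikube's local image cache
docker images spark-2.1.0-bin-hadoop2.6

# optional smoke test: print the Spark version from inside the container
docker run --rm spark-2.1.0-bin-hadoop2.6 /opt/spark/bin/spark-submit --version
```

Because the image was built against the minikube daemon, it is immediately visible to Kubernetes pods without any push to an external registry.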

Now our image is ready to deploy Spark 2.1.0 on Kubernetes.

Conclusion

In this post, we discussed how to build a Spark 2.1.0 Docker image from scratch. Having our own image gives us more flexibility than using
off-the-shelf ones.

What’s Next?

Now we have our Spark image ready. In the next blog, we will discuss how to use this image to create a two-node cluster in Kubernetes.