Kubernetes, Hadoop, Persistent Volumes and vSAN

At VMworld 2018, one of the sessions I presented on was running Kubernetes on vSphere, and specifically using vSAN for persistent storage. In that presentation (which you can find here), I used Hadoop as a specific example, primarily because there are a number of moving parts to Hadoop. For example, there is the concept of a Namenode and a Datanode. Put simply, a Namenode provides the lookup for blocks, whereas Datanodes store the actual blocks of data. Namenodes can be configured as an HA pair with a standby Namenode, but this requires a lot more configuration and resources, and introduces additional components such as journal nodes and ZooKeeper to provide high availability. There is also the option of a secondary Namenode, but despite the name this does not provide high availability. Datanodes, on the other hand, have their own built-in replication. The presentation showed how we could use vSAN to provide additional resilience to a Namenode, but consume less capacity and fewer resources for a component like the Datanode that has its own built-in protection.

A number of people have asked me how they could go about setting up such a configuration for themselves. They were especially interested in how to consume different policies for the different parts of the application.

In this article I will take a very simple Namenode and Datanode configuration that uses persistent volumes with different policies on our vSAN datastore.

We will also show how, through the use of a StatefulSet, we can very easily scale the number of Datanodes (and their persistent volumes) from 1 to 3.

Helm Charts

To begin with, I tried to use the stable Hadoop Helm chart to deploy Hadoop. You can find it here. While this was somewhat successful, I was unable to figure out a way to scale the deployment’s persistent storage. Each attempt to scale the StatefulSet successfully created new Pods, but all of the Pods tried to share the same persistent volume. Although Kubernetes does provide a ReadWriteMany access mode (multiple Pod access) as an alternative to ReadWriteOnce (single Pod access), this is not supported on vSAN; at the time of writing, Pods cannot share the same PV on vSAN.

However, if you simply want to deploy a single Namenode and a single Datanode with a persistent volume, this stable Hadoop Helm chart will work just fine for you. It is also quite possible that scaling is achievable with the Helm chart, but my limited knowledge of Helm meant that, in order to create unique PVs for each Pod as I scaled my Datanodes, I had to look for an alternate method. This led me to a Hadoop Cluster on Kubernetes using flokkr docker images that was already available on GitHub.

Hadoop Cluster on Kubernetes

Let’s talk about the flokkr Hadoop cluster. In this solution there are only two YAML files. The first is config.yaml, which passes a set of Hadoop configuration files (core-site.xml, yarn-site.xml, etc.) to our Hadoop deployment via a configMap (more on this shortly). The second holds details of the Services and StatefulSets for the Namenode and Datanode. I will modify the StatefulSets so that the /data directory on the nodes is placed on persistent volumes rather than on a local filesystem within the container (which is not persistent).

Use of configMap

I’ll be honest – this was the first time I had used the configMap construct, but it looks pretty powerful. In the config.yaml, there are entries for 5 configuration files required by Hadoop when the environment is bootstrapped – core-site.xml, hdfs-site.xml, the log4j.properties file, mapred-site.xml and finally yarn-site.xml. These 5 different pieces of data can be seen when we query the configMap.
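To give a sense of its shape, here is a heavily abbreviated sketch of that ConfigMap – the real config.yaml carries the full contents of each file under data:

```yaml
# Abbreviated sketch only; the actual config.yaml embeds the complete
# Hadoop configuration files as the values of these keys.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoopconf
data:
  core-site.xml: |
    <configuration>
      ...
    </configuration>
  hdfs-site.xml: |
    ...
  log4j.properties: |
    ...
  mapred-site.xml: |
    ...
  yarn-site.xml: |
    ...
```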

root@srvr:~/hdfs# kubectl get configmaps
NAME         DATA   AGE
hadoopconf   5      88m

When we look at the hdfs.yaml, we will see how this configMap is referenced, and how these configuration files are made available in a specific directory in the application’s containers when the application is launched. First, let’s look at the StatefulSet.spec.template.spec.containers.volumeMounts entry:

volumeMounts:
  - name: config
    mountPath: "/opt/hadoop/etc/hadoop"

This is where the files referenced in the config.yaml entries are going to be placed when the container/application is launched. If we look at that volume in more detail in StatefulSet.spec.template.spec.volumes (volumes is a sibling of containers, not nested within it), we see the following:

volumes:
  - name: config
    configMap:
      name: hadoopconf

So the configMap in config.yaml, which is named hadoopconf, will place these 5 configuration files in “/opt/hadoop/etc/hadoop” when the application launches. The application contains an init/bootstrap script which deploys Hadoop using the configuration in these files. A little bit complicated, but sort of neat. You do not need to make any changes here. Instead, we want to change the /data folder to use persistent volumes. Let’s see how to do that next.

Persistent Volumes – changes to hdfs.yaml

Now there are a few changes needed to the hdfs.yaml file to get it to use persistent volumes. We will need to make some changes to the StatefulSets for both the Datanode and the Namenode. First, we need to add a new mount point for the PV, which for Hadoop will of course be /data. This appears in StatefulSet.spec.template.spec.containers.volumeMounts and looks like the following:

volumeMounts:
  - name: data
    mountPath: "/data"
    readOnly: false

Next we will need to make some specifications around the volume that we are going to use. For the PV, we are not using a StatefulSet.spec.template.spec.volumes entry as used by the configMap. Instead we will use StatefulSet.spec.volumeClaimTemplates. First we have the Datanode entry, and then the Namenode entry. The differences are the storage class entries and of course the volume sizes. This is how we will use different storage policies on vSAN for the Datanode and the Namenode.
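As a sketch (the claim name matches the data volumeMount above, but the sizes here are illustrative rather than taken from my actual deployment), the two volumeClaimTemplates entries would look something like this:

```yaml
# Datanode StatefulSet -- silver storage class, since HDFS replicates
# the blocks itself (illustrative size)
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: silver
      resources:
        requests:
          storage: 10Gi

# Namenode StatefulSet -- gold storage class, protected by vSAN
# (illustrative size)
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gold
      resources:
        requests:
          storage: 2Gi
```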

As you can see, the Namenode storageClass uses a gold policy on the vSAN datastore. If we compare it to the Datanode storageClass, you will see that this is the only difference (other than the name). The provisioner is the VMware vSphere Cloud Provider (VCP), which is currently included in K8s distributions but will soon be decoupled, along with the other in-tree drivers, as part of the CSI initiative.
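For reference, a pair of StorageClass definitions along these lines would look like the following. Note that the storagePolicyName values (gold, silver) must match SPBM policies that already exist in vCenter, and the datastore name here is illustrative:

```yaml
# Gold class: maps to a vSAN SPBM policy with protection (e.g. RAID-1)
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gold
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  storagePolicyName: gold
  datastore: vsanDatastore
---
# Silver class: maps to a vSAN SPBM policy with no protection (RAID-0)
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: silver
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  storagePolicyName: silver
  datastore: vsanDatastore
```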

vSAN Policies – storagePolicyName

Now you may be wondering why we are using two different policies. This is the beauty of vSAN. As we shall see shortly, the Datanodes have their own built-in replication mechanism (3 copies of each block are stored). Thus, it is possible to deploy the Datanode volumes on vSAN without any underlying protection from vSAN (e.g. RAID-0) simply by specifying an appropriate policy (silver). Even if a Datanode fails, there are still two copies of the data blocks.

The Namenode, however, has no such built-in replication or protection feature. Therefore we can protect the underlying persistent volume using an appropriate vSAN policy (e.g. RAID-1, RAID-5). In my example, the gold policy provides this extra protection for the Namenode volume.

Deployment of Hadoop

Now there are only a few steps to deploying the Hadoop application: (1) create the storage classes, (2) create the configMap in config.yaml and (3) create the Services and StatefulSets in hdfs.yaml. All of these can be done with kubectl create -f <file.yaml> commands.
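For example (the storage class file names are my own; use whatever names you saved them under):

```shell
# Storage classes first, so the volumeClaimTemplates can find them,
# then the configMap, then the Services and StatefulSets
kubectl create -f storageclass-gold.yaml
kubectl create -f storageclass-silver.yaml
kubectl create -f config.yaml
kubectl create -f hdfs.yaml
```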

Post Deployment Checks

The following are a bunch of commands that can be used to validate the state of the constituent components of the application after deployment.

Post deploy Hadoop check

Since this is Hadoop, we can very quickly use some of the Hadoop utilities to check the state of our Hadoop cluster running on Kubernetes. One such command generates a report on the HDFS filesystem and also reports on the Datanodes. Note the capacity at the beginning, as we will return to this after scale out.
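The report can be run from inside the Namenode Pod (I am assuming the StatefulSet names the Pod namenode-0; adjust to match your deployment):

```shell
# Generate the HDFS filesystem report, including per-Datanode details
kubectl exec -it namenode-0 -- hdfs dfsadmin -report
```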

Post deploy check on config files

We mentioned that the purpose of the configMap in config.yaml is to put in place a set of configuration files that are used to bootstrap Hadoop. Here is how to verify that this step has indeed occurred (should you need to troubleshoot at any point). First we open a bash shell to the Namenode, and then we navigate to the mount point highlighted in hdfs.yaml to verify that the files exist, which indeed they do in this case.
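Something like the following (again assuming the Pod is named namenode-0):

```shell
# Open a shell in the Namenode Pod and list the configMap mount point
kubectl exec -it namenode-0 -- bash
ls /opt/hadoop/etc/hadoop
```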

Scale out the Datanode statefulSet

We are going to start with the current configuration of 1 Datanode and 1 Namenode, and scale the Datanode StatefulSet to 3 replicas. This should create additional Pods as well as additional persistent volumes and persistent volume claims. Let’s see.
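The scale operation is a one-liner (assuming the StatefulSet is named datanode; check kubectl get statefulsets for the actual name in your deployment):

```shell
# Scale the Datanode StatefulSet from 1 replica to 3
kubectl scale statefulset datanode --replicas=3
```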

Now we can see how the Datanode has scaled out with additional Pods and storage.
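These commands show the result of the scale out (Pod and claim names will vary with your StatefulSet names):

```shell
# Each new Datanode Pod should have its own PVC and bound PV
kubectl get pods
kubectl get pvc
kubectl get pv
```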

Post scale-out application check

And now for the final step – let’s check to see if the HDFS has indeed scaled out with those new Pods and PVs. We will run the same command as before and get an updated report from the application. Note the updated capacity figure and the additional Datanodes.

Checking the Replication Factor of HDFS

The last thing we want to check is that the Datanodes are indeed replicating. There are a few ways to do this. The following commands create a simple file, and then validate its replication factor. In both cases, the commands return 3, which is the default replication factor for HDFS.
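Run from a shell inside the Namenode Pod, the check might look like this (the Pod name and test file path are assumptions for illustration):

```shell
# Copy a small file into HDFS, then query its replication factor
kubectl exec -it namenode-0 -- bash
hdfs dfs -put /etc/hosts /tmp/testfile
hdfs dfs -stat %r /tmp/testfile
# fsck also reports block replication for the file
hdfs fsck /tmp/testfile
```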

Conclusion

That looks to have scaled out just fine. There are a few things to keep in mind when dealing with StatefulSets, as per the guidance found here. Deleting and/or scaling down a StatefulSet will not delete the volumes associated with it. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.

With that in mind, I hope this has added some clarity to how we can provide different vSAN policies to different parts of a cloud native application, providing additional protection when the application needs it, but not consuming any additional HCI (hyperconverged infrastructure) resources when the application is able to protect itself through built-in mechanisms.