A virtual Hadoop cluster on OpenStack leveraging local disks will look something like this:

As you can see from the diagram above, a compute node performs the additional role of a storage node, serving local disks to the instances (virtual machines).

The split of OpenStack services across the nodes is as follows:

Compute + Storage Node

Nova-compute

Cinder-volume

Controller Node

Nova-api

Cinder-scheduler

Cinder-api

Glance

Neutron

In summary, we need to set up the compute node as a Cinder volume node as well, so that it can expose its local disks as Cinder volumes. There are two Cinder drivers that can be used for this: LVMVolumeDriver and BlockDeviceDriver.

I’ll be showing the usage of BlockDeviceDriver, which allows plain block devices to be used with OpenStack instances.
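As a rough sketch of what this looks like, the cinder.conf on the compute + storage node might contain something like the following. The backend name and the device paths here are placeholders; use the disks you actually want to dedicate to Cinder on that node:

```ini
[DEFAULT]
# hypothetical backend name; must match a section below
enabled_backends = blockdev

[blockdev]
# BlockDeviceDriver hands whole local block devices to instances
volume_driver = cinder.volume.drivers.block_device.BlockDeviceDriver
# placeholder paths: the local disks reserved for Cinder on this node
available_devices = /dev/sdb,/dev/sdc
volume_backend_name = blockdev
```

After editing the file, the cinder-volume service on that node needs to be restarted for the backend to be picked up.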

Additionally, the following should be kept in mind:

The default Cinder quota might not be sufficient for real usage, so you may need to raise it accordingly.
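For example, assuming the stock per-tenant defaults (10 volumes, 1000 GB total) are too small for a Hadoop cluster, the cinder client can inspect and raise them. The tenant ID and the new limits below are placeholders:

```shell
# inspect the current quotas for a tenant (ID is a placeholder)
cinder quota-show <tenant-id>

# raise the volume count and total capacity limits (example values)
cinder quota-update --volumes 100 --gigabytes 20000 <tenant-id>
```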

As of this writing, OpenStack doesn’t have a way to automatically ensure that an instance and its Cinder volume are placed on the same node. In other words, the Cinder volume being attached to an instance could come from a remote node, i.e., a node not running the instance. This is depicted in Scenario-1. For applications like Hadoop, we would ideally like to avoid Scenario-1, where a disk is served to an instance over iSCSI. Network bandwidth has improved significantly, and in most cases iSCSI should be fine, but for Hadoop workloads this could become a scalability issue as the number of compute nodes and data volumes grows. Instead, we want Scenario-2, where the volume is local to the node running the instance. There is a way to achieve this manually, and I’ll describe the details in subsequent sections.
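As a rough way to check which scenario a given attachment is in, you can compare the host backing the volume with the host running the instance. The commands below are a sketch using admin-visible attributes; the instance and volume IDs are placeholders:

```shell
# host that nova scheduled the instance onto (admin-only attribute)
nova show <instance-id> | grep OS-EXT-SRV-ATTR:host

# host whose cinder-volume service backs the volume
cinder show <volume-id> | grep os-vol-host-attr:host
```

If the two hosts match, the disk is local (Scenario-2); if they differ, the volume is being served over iSCSI from a remote node (Scenario-1).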

In the following section, I’ll show you the configuration details by taking the example of IBM PowerKVM server as a compute node. The same configuration also applies to Intel/KVM compute node.