Running the Cloudera Training VM in VirtualBox

Update (May 1, 2013): The post below is based on an outdated VM and is deprecated. Please see the Cloudera QuickStart VM instead, which runs on VirtualBox, VMware, and KVM.

Cloudera’s Training VM is one of the most popular resources on our website. It was created with VMware Workstation, and plays nicely with the VMware Player for Windows, Linux, and Mac. But VMware isn’t for everyone. Thomas Lockney has managed to get our VM image running on VirtualBox, and has written a step-by-step guide for the community. Thanks, Thomas! – Christophe

I was quite pleased when I discovered that Cloudera had created a virtual machine image that could be used while working through their training material. It would make the process simpler, and it looked like a potentially useful environment for general Hadoop experimentation. However, their VM is built for VMware, which I stopped using a while back. As a heavy VirtualBox user, though, I knew that it would not be hard to get it running in my preferred desktop virtualization environment.

Here’s a step-by-step guide for getting Cloudera’s virtual machine image up and running. I’ll include screenshots for most of the steps to make it as clear as possible. I’ll assume you already have at least some familiarity with running VirtualBox (if not, there are plenty of good tutorials and references available online) and some experience with Ubuntu or some other fairly modern Linux desktop system.

1. The first step is to download the virtual machine from the Hadoop Training Virtual Machine page on Cloudera’s site. The version at the time of this writing is 0.3.1, and the filename you’ll end up with is cloudera-training-0.3.1.tar.bz2. Once you have downloaded the file (this may take a little while, as it’s quite large), decompress it somewhere useful. On a Unix-based machine (e.g., Linux or OS X), you can do this by running the following command:

tar xjf cloudera-training-0.3.1.tar.bz2

2. Next, start up VirtualBox. Once it loads, go to the File menu and select Virtual Media Manager.

3. The Virtual Media Manager is where you set up new drive images. An image needs to be created before you can use it with a virtual machine. In this case, you’re creating a new image by pointing to the existing image, which was supplied with the Cloudera VM download. It’s a VMware image (a .vmdk file), which VirtualBox can read.

In the Virtual Media Manager window, click New to create a new image.

4. In the file dialog box that appears, browse to the directory where you extracted the download and select the file cloudera-training-0.2-cl3.vmdk. Please note that this name will likely change with later releases, so you might need to experiment to find the right file. If that is the case, you’ll be looking for files ending in .vmdk. Note that files with the s00# names are generally either snapshots or extensions to the base drive image (you can choose to have the image split up into multiple files).

5. After closing the Virtual Media Manager window, click the New button in the main VirtualBox window to create a new virtual machine.

6. From the Create New Virtual Machine dialog box, give your new machine a name. Select Linux as the operating system and Ubuntu as the version.

7. On the next screen, set the memory size. The VMware image that Cloudera created has 1024 MB assigned, but I’ve found I can get away with less for basic needs. If you plan to do full development in this VM, set it higher (if you have the memory to spare).

8. Next, you’ll select the hard disk image, which we added earlier.

9. Double check the summary before clicking Finish.

10. After closing the Virtual Machine Wizard, you can select the Cloudera machine that you just created and click Start.

11. Assuming you’ve done everything correctly up to this point and your VirtualBox installation is working properly, you should see a window pop up with the boot-up messages for the new virtual machine. Watch this to make sure everything is booting fine. If you see error messages here or if your machine doesn’t boot up correctly, you may have missed a step earlier or selected the wrong file for the hard disk image.

12. After a few moments, you should see the desktop of your new image. If you’ve gotten this far, you can stop here if you want, but you’ll be missing out on the enhanced functionality that VirtualBox offers, such as better integration with your existing desktop, sharing of files, etc.

13. If you want full integration, open a terminal and run the following command:

sudo apt-get install build-essential linux-headers-$(uname -r)

This will install the basics that you need before loading the VirtualBox additions.

14. Select Install Guest Additions from the Devices menu.

15. You should now see a pop-up window prompting you to run the installer for the guest additions. Click the Run button to continue.

16. If the dependencies installed correctly earlier, you’ll see a terminal window, which will show you the progress as the add-ons are installed.

17. At this point, you can select Shutdown from the system menu in the top menu bar, and then choose Restart to reboot your virtual machine. When the VM restarts and the desktop is fully loaded, you should be able to resize the window, use your mouse seamlessly between the virtual machine window and your desktop, and add a shared folder (see the VirtualBox documentation for instructions on this).
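For readers who prefer the command line, the disk selection in step 4 and the VM creation in steps 5 through 10 can be approximated with VBoxManage. This is only a sketch under some assumptions: it uses the VBoxManage syntax of VirtualBox 4.x and later (older releases differ), the function names find_base_vmdk and create_training_vm are made up for illustration, and attaching the disk to an IDE controller is a guess that may need adjusting for your image.

```shell
# Hypothetical helper: print the .vmdk files in a directory, skipping
# split extents such as name-s001.vmdk, which are not the base image.
find_base_vmdk() {
    for f in "$1"/*.vmdk; do
        [ -e "$f" ] || continue          # no .vmdk files at all
        case "$f" in
            *-s[0-9][0-9][0-9].vmdk) ;;  # extent file, skip it
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# Hypothetical helper: create, configure, and start the VM.
# $1 = VM name, $2 = path to the base .vmdk file.
# The && chain stops at the first command that fails.
create_training_vm() {
    vm_name=$1
    disk=$2
    VBoxManage createvm --name "$vm_name" --ostype Ubuntu --register &&
    VBoxManage modifyvm "$vm_name" --memory 1024 &&
    VBoxManage storagectl "$vm_name" --name "IDE" --add ide &&
    VBoxManage storageattach "$vm_name" --storagectl "IDE" \
        --port 0 --device 0 --type hdd --medium "$disk" &&
    VBoxManage startvm "$vm_name"
}
```

After loading these functions, something like create_training_vm "Cloudera Training" "$(find_base_vmdk /path/to/extracted/dir)" would stand in for the wizard, where both the VM name and the directory are placeholders.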

One last thing: there is a call at the very end of /etc/init.d/rc.local to /usr/bin/vmware-user that you might want to remove. It won’t hurt anything if you leave it there, but you will occasionally see error messages at startup or shutdown due to its presence. I finally hunted it down just now after running this VM for a while, so it’s really not a big deal.
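If you would rather not edit the file by hand, the vmware-user call can be commented out with sed. This is a minimal sketch: the function name disable_vmware_user is hypothetical, and it assumes the call sits on a line of its own. A backup copy with a .bak suffix is kept next to the file.

```shell
# Hypothetical helper: comment out any active line that calls
# /usr/bin/vmware-user in the given file. Lines that already contain
# a '#' before the path are left alone, so running it twice is safe.
disable_vmware_user() {
    sed -i.bak 's|^\([^#]*/usr/bin/vmware-user.*\)$|# \1|' "$1"
}
```

Inside the guest, this would be run with root privileges against /etc/init.d/rc.local.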

16 responses on “Running the Cloudera Training VM in VirtualBox”

Thanks for this posting. This helps overcome some of the problems I had before with running the Cloudera image in VirtualBox. When I first ran the virtual machine on a laptop with a Windows XP host, I got an error “this kernel requires the following features not present on the cpu pae”.

A quick check of the PAE box in the VirtualBox settings fixed the problem and the virtual machine started up.
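The same PAE setting can also be switched on from the command line while the VM is powered off. A small sketch, assuming a recent VBoxManage; the function name enable_pae is made up, and the VM name you pass must match whatever you chose in step 6.

```shell
# Hypothetical helper: expose PAE/NX to the guest CPU of the named VM.
# $1 = VM name as shown in the VirtualBox window.
enable_pae() {
    VBoxManage modifyvm "$1" --pae on
}
```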

This does not seem to work anymore, even with the PAE box checked. The boot process times out waiting for the root partition to become available. Even increasing the rootdelay to 5 minutes does not load the system.

I suspect this has something to do with the training image being 0.2 but other files indicated a 0.3.3 version. Any tips?

To get networking to work, you must set the MAC address of the virtual machine’s network adapter (Settings -> Network -> Advanced). Take the value from the “ethernet0.generatedaddress” parameter in the file “cloudera-training-0.3.4.vmx”.
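Pulling the MAC address out of the .vmx file can be scripted. The sketch below assumes the usual key = "value" layout of .vmx files; the function name vmx_mac is hypothetical. It prints the address as twelve uppercase hex digits without colons, which is the form VBoxManage expects for its --macaddress1 option.

```shell
# Hypothetical helper: print the generated MAC address of the first
# adapter in a .vmx file, without colons and in uppercase.
# The pattern requires '=' right after the key (ignoring spaces), so the
# similarly named generatedAddressOffset line is not matched.
vmx_mac() {
    grep -i '^ethernet0\.generatedAddress[[:space:]]*=' "$1" |
        head -n 1 |
        sed 's/^[^=]*=//' |
        tr -d '": \r' |
        tr '[:lower:]' '[:upper:]'
}
```

The result could then be applied with VBoxManage modifyvm "Your VM name" --macaddress1 "$(vmx_mac cloudera-training-0.3.4.vmx)", where the VM name is a placeholder.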

If you have trouble enabling NAT or other network adapters on a Windows host, edit ‘/etc/network/interfaces’ to add eth1 as an interface: copy the eth0 section, replace eth0 with eth1 in the copy, and then run ‘ifup eth1’.
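The copy-and-replace edit described above might look like the following. It is a rough sketch: the function name add_eth1 is hypothetical, and it only copies lines that mention eth0 by name, so any indented option lines in the eth0 section that do not name the interface would have to be copied by hand.

```shell
# Hypothetical helper: append a copy of the eth0 lines to an
# interfaces file with eth0 renamed to eth1.
# Does nothing if eth1 is already mentioned in the file.
add_eth1() {
    file=$1
    if grep -q 'eth1' "$file"; then
        return 0
    fi
    stanza=$(grep 'eth0' "$file" | sed 's/eth0/eth1/g')
    [ -n "$stanza" ] || return 1     # no eth0 lines to copy
    printf '\n%s\n' "$stanza" >> "$file"
}
```

In the guest, this would be run as root against /etc/network/interfaces, followed by ifup eth1.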