SCM Express: Now Anyone Can Experience the Power of Apache Hadoop

Phil Langdale is a software engineer at Cloudera and the technical lead for Cloudera’s SCM Express product.

What is SCM Express?

As powerful and useful as Apache Hadoop is, anyone who has set up a cluster from scratch is well aware of how challenging it can be: every machine has to have the right packages installed and correctly configured so that they can all work together, and if something goes wrong in that process, it can be even harder to nail down the problem.

Understandably, this can be a serious barrier to adoption. After all, how can you appreciate how great Apache Hadoop is if you never managed to get it set up in the first place? At Cloudera, we didn’t want to see anyone stopped before they could even start.

The Service and Configuration Manager (SCM) is a new part of Cloudera Management Suite in Cloudera Enterprise 3.5 that allows administrators to manage their Hadoop installations from a central console with just a few clicks of a mouse. It makes it easy to create and modify service installations and ensures that all the machines in a cluster are correctly and consistently configured.

This already provides a significant benefit to Hadoop administrators, but we wanted to take things even further and provide a tool that allows someone with no Hadoop experience to start up a fully functional cluster in a matter of minutes. We also wanted it to be freely available so that anyone can experience why Hadoop is so great.

So, SCM Express was born: from a single 500K download, you can bring up a Hadoop cluster of up to 50 nodes without editing a single configuration file, or even having to know what a Hadoop configuration file looks like. Additionally, the cluster can be tuned to reflect the hardware on which it is running, so that it’s not just functional, but useful. We’ve codified many of the best practices and recommendations from our Solutions Architects so that the services you deploy can immediately benefit from their experience and insights.

And it’s not just Hadoop; SCM Express can also install and manage Apache ZooKeeper, Apache HBase and Hue. Hadoop is an ecosystem and not just a single product, so SCM Express lets you experience some of the breadth of that ecosystem.

How it works

When you download SCM Express for the first time, you get a small self-executing installer that will go through the process of installing the SCM Server. It sets up a package repository that’s appropriate for your Linux distribution and then installs the SCM Server from there. This will also allow you to download updates, just as you would for anything else installed on the machine.

Once the server is up and running, it provides a web-based user interface that walks through the process of identifying the hosts that you want in the cluster and then installing the necessary Apache Hadoop packages on them. In this way, you don’t need to do any manual work on those machines. As with the SCM Server, we install CDH (Cloudera’s Distribution Including Apache Hadoop, which is a packaged and tested distribution of open source Apache Hadoop and Apache ecosystem components) from our package repository, so that it too can be easily updated. The whole installation process is package based, so it’s easy to maintain in the long term.

After the cluster hosts have been identified and CDH is installed on them, SCM will create the services you select. At this time, it evaluates the physical characteristics of the hosts to decide which ones are best suited for which roles (for example, which one should run the HDFS NameNode or the MapReduce JobTracker). It also factors the size of the cluster into these calculations: for a small cluster, it makes sense to run the NameNode and JobTracker on the same machine, but for a large one, they should be separated. It will also use these physical characteristics to inform the configuration of the created services: the Java heap size should reflect the amount of physical RAM in the machine, and the number of mappers and reducers should reflect the number of CPU cores.
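To make the idea of hardware-aware role assignment concrete, here is a minimal Python sketch. It is not SCM's actual logic: the `Host` type, the function names, the 10-host threshold, the quarter-of-RAM heap rule and the slot ratios are all hypothetical values chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical description of a cluster host."""
    name: str
    ram_gb: int
    cores: int


# Assumed threshold: at or below this many hosts, co-locate the master roles.
SMALL_CLUSTER_MAX_HOSTS = 10


def assign_masters(hosts):
    """Pick hosts for the NameNode and JobTracker, preferring more RAM."""
    if not hosts:
        raise ValueError("at least one host is required")
    # Rank hosts by RAM first, then by core count, largest first.
    ranked = sorted(hosts, key=lambda h: (h.ram_gb, h.cores), reverse=True)
    namenode = ranked[0]
    if len(hosts) <= SMALL_CLUSTER_MAX_HOSTS:
        jobtracker = namenode   # small cluster: share one machine
    else:
        jobtracker = ranked[1]  # large cluster: separate the masters
    return {"namenode": namenode.name, "jobtracker": jobtracker.name}


def suggest_worker_settings(host):
    """Derive illustrative worker settings from RAM and core count."""
    return {
        # Assumed rule: a quarter of physical RAM, with a 256 MB floor.
        "heap_mb": max(256, host.ram_gb * 1024 // 4),
        # Assumed rule: one map slot per core, half as many reduce slots.
        "map_slots": max(1, host.cores),
        "reduce_slots": max(1, host.cores // 2),
    }
```

With three hosts, `assign_masters` places both master roles on the host with the most RAM; with more than ten, it picks the two largest hosts. A real tool would weigh more factors (disks, network, other services) than this sketch does.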

Once the services are created, it will go through the process of bringing the services up for the first time. This isn’t always a simple matter of just starting processes; you have to format an HDFS filesystem before you can use it, for example.
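The point about first-run ordering can be sketched as a small planning function. This is only an illustration under stated assumptions: the `first_run_plan` name, the service keys and the step labels are hypothetical and do not reflect how SCM represents its workflow.

```python
def first_run_plan(services, hdfs_formatted=False):
    """Return ordered step names for bringing services up the first time.

    HDFS must be formatted once before its first start, and MapReduce
    is started after HDFS because it relies on the filesystem.
    """
    plan = []
    if "hdfs" in services:
        if not hdfs_formatted:
            plan.append("format HDFS")  # one-time initialization
        plan.append("start HDFS")
    if "mapreduce" in services:
        plan.append("start MapReduce")
    return plan
```

Calling `first_run_plan(["hdfs", "mapreduce"])` yields the format step followed by the two start steps; passing `hdfs_formatted=True` skips the format step, which is what any later restart should do.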

When all that is done, your services are running and ready to go, and you’re also ready to appreciate the benefits that SCM Express provides in helping you maintain your newly deployed Hadoop cluster.

If you think you’re ready to take the plunge and upgrade to Cloudera Enterprise, it’s easy to switch over from SCM Express to full SCM; all your data and configuration carry over in place.

We’re really proud that we’re able to offer SCM Express to the world. Apache Hadoop is an incredibly powerful tool for solving all sorts of problems and answering all kinds of questions from all your data, and now anyone can install it and experience it for themselves.

This looks like a really useful addition to the Hadoop tool stack, thanks! In fact, it seems like it would be useful for more than just Hadoop installations. As one who’s had to write dodgy shell scripts that SSH in to a bunch of hosts and then pray that everything works successfully, I’d love it if there were some way to implement support for custom services in SCM. Are there any plans to offer an API for doing so?

There are currently no plans to do so; obviously you never say never, but it’s unlikely to happen. Still, I can certainly sympathise with your plight. At least you won’t have to do that for Hadoop anymore!

Great tool, saved us a lot of time deploying a training cluster.
However, so far I haven’t found a way to add a new node/server to an existing cluster managed by SCM (at least it is not documented anywhere). So it looks like it is not ready for production use yet.

I set up Hadoop on two machines (old laptops) and connected them manually in a cluster. It was a long, never-ending process, and I must say I am afraid that if I want to add one more machine, it would be a pretty painful process.

I came across SCM. Can you please clarify two things:
1. Can I run it on 64-bit Ubuntu? I saw that it is not supported on 32-bit Ubuntu, but nowhere was it mentioned that it cannot run on 64-bit Ubuntu. Please advise, as I have very limited experience with Ubuntu and it would be a problem for me to migrate to another OS and learn its basic intricacies all over again.

2. Is SCM free for 50 nodes? I want to set up a cluster of 4-5 old laptops for parallel computing and I am finding it hard to believe that such a cool tool can be free of cost! Am I missing something here? (Maybe some bundled offer in which I would need to pay a tonne for support, etc.?)