Letting Go Is Hard

Last month, I talked about what goes into cluster nodes. In particular, I tried to make it clear that diskless booting of cluster nodes doesn’t exclude hard drives from those nodes. You can still provide local storage and use the Network File System (NFS) to share files among all nodes. If you grasp this idea, you may mutter the word “Aha!” as you read the next few pages. If a bulb fails to light, well, read the article again.

This month, in classic “Cluster Rant” style, I want to push diskless booting even further and see where it goes. As you’ll see, there are important advantages gained by removing the distribution from your nodes. Moreover, there are some things you cannot accomplish by installing a classic Linux distribution on every node. I won’t mention any diskless booting package in particular, but I’m mostly thinking about the Warewulf Toolkit (http://www.warewulf-cluster.org/) and the Scyld Beowulf Distribution (http://www.scyld.com/). Both have the nuts and bolts to accomplish what I describe below.

The Diskless Way

In most cases, when the operating system comes from a central source, nodes can boot faster because the local disk drive is not part of the boot process. Diskless booting avoids disk reads, mounts, and file system checks that slow things down. As I mentioned, delivering just what the node needs to boot and some support files is often enough for most users to run a code on the nodes. Everything else can live in a shared file system. Pesky and stupidly long BIOS cycles aside, let’s put fast reboots on our list of advantages.

Another advantage of diskless booting is the ability to manage change from a central point in the cluster. Because disk imaging isn’t required, the changes can be implemented and distributed quickly. Let’s add rapid reconfiguration to the growing list of perks. Finally, consider open cluster software, or as I like to call it, the “open cluster plumbing.” Many users in the Linux high-performance computing (HPC) community, including myself, now take this for granted, but it is still an essential element in my grand vision so it should be mentioned.

In contrast to the traditional distribution-on-every-node cluster, a diskless booted system can be rapidly rebooted and reconfigured with a totally open infrastructure.

Time to think about the possibilities.

Solving Some Nagging Issues

Have you ever walked into a server room to see racks of machines, fans whirring, lights blinking, and wondered how many of the machines are actually being used? I’m sure the person who pays for the electricity has asked that question.

What if you turned on the nodes only when you needed them? Of course, you could do this with a traditional cluster too, but fast booting makes this much more useful. In particular, if your nodes don’t have hard drives, booting could be very quick as the nodes could be put into standby mode and restarted with a Wake-on-LAN packet. Remember, when nodes have no hard drives, booting is very fast.
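For readers who have not seen one, here is a minimal sketch of what sending a Wake-on-LAN packet involves. It assumes the node's network card and BIOS have Wake-on-LAN enabled and that the sender sits on the same broadcast domain. The function names, the broadcast address, and UDP port 9 are illustrative choices, not something a particular toolkit mandates.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return a Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 6 octets in MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # The payload is six 0xFF bytes followed by the target MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def wake_node(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling wake_node("00:11:22:33:44:55") from the head node would be the whole "power on" step; everything after that is the normal network boot.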

The best way to implement this feature would be through a scheduler like Sun Grid Engine (http://gridengine.sunsource.net/) or Torque (http://www.clusterresources.com/). The scheduler could keep a handful of nodes “at the ready,” so smallish jobs could be started right away. For clusters that exhibit burst-like usage, booting on demand could save some real cash. And, like any good hack, the user would never know about it. When a job needs to start, there would be a small delay, but it’s inconsequential compared to the overall runtime of the program.
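The "at the ready" policy can be sketched in a few lines. This is a hypothetical helper, not part of Sun Grid Engine or Torque: it assumes the scheduler can tell you how many nodes are booted and idle and how many are powered off, and the warm pool size of four is an arbitrary illustration.

```python
def nodes_to_wake(requested: int, idle_ready: int, powered_off: int,
                  warm_pool: int = 4) -> int:
    """Number of powered-off nodes to boot for a job, refilling the warm pool."""
    # Nodes the job needs beyond those already booted and idle.
    shortfall = max(0, requested - idle_ready)
    # Idle nodes that remain once the job has taken its share.
    idle_after = max(0, idle_ready - requested)
    # Extra nodes needed to bring the ready pool back up to its target size.
    refill = max(0, warm_pool - idle_after)
    # Never ask for more nodes than are actually powered off.
    return min(powered_off, shortfall + refill)
```

A two-node job arriving when four nodes are idle starts immediately, and two replacements are booted in the background so the next small job does not wait either.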

But better power utilization is just the beginning. If the scheduler is booting nodes based on a request, then the scheduler can also configure the nodes on the fly. Here’s where it gets interesting.

Suppose you want a node that has a specific kernel, driver, or library that is unique to your application. No problem! Just configure the node to boot with exactly what you need. For instance, many modern Gigabit Ethernet drivers have parameters that can affect application performance. For some codes, I’ve seen performance increase by 30-40 percent just by tuning these parameters. There are also user space Ethernet packages (such as GAMMA, http://www.disi.unige.it/project/gamma) that help some codes as well. Because the optimal settings vary between applications and no one set works well with every application, it’s highly desirable to set the best options and reboot the nodes before running your program. (If you’re lucky enough to have a separate management network that uses a different Ethernet chip set, you can reload the module without rebooting. Switching to GAMMA, though, requires a reboot.)

The use of centralized booting and configuration makes this easy to manage. Of course, this could be accomplished with a full traditional distribution on each node, but then you are changing configuration files on specific nodes and must track the state of each node image. Tracking node state can be done, but it often leads to messy situations where the easiest solution is just a clean re-imaging of the node. Centralized booting makes node configuration easy because every boot is a “clean slate.”
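To make the "clean slate" idea concrete, here is a small sketch of generating per-application driver settings at boot time instead of editing files on each node. The profile names, the module name example_gige, and the parameter example_coalesce_usecs are all placeholders invented for illustration; substitute whatever your actual driver documents. The only real convention used is the "options <module> <key>=<value>" line format read by modprobe configuration files.

```python
from typing import Dict

# Hypothetical tuning profiles; the parameter name is a placeholder,
# not a setting of any real Ethernet driver.
PROFILES: Dict[str, Dict[str, str]] = {
    "latency_sensitive": {"example_coalesce_usecs": "0"},
    "bulk_transfer": {"example_coalesce_usecs": "125"},
}


def render_module_options(module: str, params: Dict[str, str]) -> str:
    """Render one 'options' line in modprobe configuration format."""
    settings = " ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"options {module} {settings}".rstrip()


def boot_overlay_for(profile: str, module: str = "example_gige") -> str:
    """Return the driver options to drop into the node image for this profile."""
    if profile not in PROFILES:
        raise KeyError(f"no tuning profile named {profile!r}")
    return render_module_options(module, PROFILES[profile])
```

Because the line is regenerated from the central profile table on every boot, there is no per-node state to track: change the table once and the next boot picks it up.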

Pushing this idea further is where the independent software vendors (ISVs) tend to get interested. But first, our new cluster configuration needs another popular software tool, one that not only solves some problems but is quite trendy as well.

A Xen Thing

Linux is a challenge for ISVs. When an ISV creates and sells a software package, the vendor must also support its application. While the actual issues usually center around kernels and libraries, there’s enough variation within the Linux community that ISVs usually settle on a handful (if that many) of specific popular distributions.

If you use a cluster that only runs distribution X and your application needs distribution Y, you have two choices: get another cluster or convince everyone to use your ISV’s preferred distribution on the cluster. (Good luck with that.) If you have two ISV applications that need different distributions on the same cluster, you’re “SOL,” as they say. (And that would be “Stuck with One Linux,” for those who thought otherwise.)

Here’s a solution that is just asking to be implemented.

*First, virtualize a login/head node for any and all distributions of interest, and don’t plan to use the login/head node for computation. This step allows your users to access specific build/execution environments for specific applications. Virtualization software, such as Xen, is perfect for this.

*Next, choose one of the virtual machines to run the core of a scheduler; all of the other virtual machines submit to this scheduler.

*Finally, build software boot environments for your virtual machines, and tell the scheduler to submit these to compute nodes as executables when jobs are run from your virtual machine. The nodes you request are booted just as you need them.
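The three steps above can be sketched as a lookup from the virtual login machine a job was submitted from to the boot environment its nodes should receive. Everything here is hypothetical: the host names, kernel and image file names, and the plan_job function are stand-ins, and a real scheduler integration would hand the result to whatever provisioning tool you use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BootEnvironment:
    name: str
    kernel: str
    root_image: str


# Hypothetical mapping: one boot environment per virtualized login/head node.
ENVIRONMENTS: Dict[str, BootEnvironment] = {
    "login-distro-x": BootEnvironment("distro-x", "vmlinuz-x", "distro-x.img"),
    "login-distro-y": BootEnvironment("distro-y", "vmlinuz-y", "distro-y.img"),
}


def plan_job(submit_host: str, node_count: int,
             free_nodes: List[str]) -> Dict[str, BootEnvironment]:
    """Assign free nodes to a job and pick their boot environment by submit host."""
    if submit_host not in ENVIRONMENTS:
        raise KeyError(f"no boot environment registered for {submit_host!r}")
    if node_count > len(free_nodes):
        raise ValueError("not enough free nodes for this request")
    environment = ENVIRONMENTS[submit_host]
    # Every node in the job boots the same environment as the submitting VM.
    return {node: environment for node in free_nodes[:node_count]}
```

A job submitted from login-distro-y asking for two nodes would get two free nodes booted with the distro-y image, while a neighbor's job from login-distro-x boots its nodes differently on the same hardware.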

You’re happy, your ISV is happy, and even the guy who hates your distribution is happy.

Ever Hear of OpenAFS?

At this point the capacity cluster people are probably starting to sigh and say, “Look: We need all the nodes the same with this list of software.” (Capacity clusters usually run lots of single process jobs and have a need for a consistent software environment on all nodes.) Well, that is also possible.

First, instead of creating specific, application-focused nodes, create robust nodes that can run a large list of single process user codes. One solution is to boot the core operating system and support files and run everything else over NFS. With lots of jobs, this approach might stress NFS, so having some files on local disk might be desirable.

A more interesting approach could include OpenAFS. If you mount your robust environment as an OpenAFS share and set aside some local disk for an OpenAFS cache, then the read-only files in your distribution will eventually show up and live on the local drives and be there when needed.

Remember, these are distribution files that the user is not allowed to change. There is no real reason to copy all of those files to every node in the cluster. In a way, using something like OpenAFS and diskless booting allows your nodes to optimize themselves to the users’ needs. Oh, and when it comes time to upgrade, well, you do that once; OpenAFS makes sure new files find their way to nodes. Come to think of it, this would be a great way to run desktop nodes as well. The administrator’s dream: centralized operating system and distribution management using stateless workstations.

Why This May Not Be Stupid

At this point, the non-believers are probably saying, “Why bother? Just buy a node, install Fedora, some RPMs, and run the code. I don’t want to give my RAM or network to the wacky boot schemes.” But take a look at what is happening in the commodity hardware world.

*Clusters are getting bigger

*Gigabit networks are getting cheaper

*Memory is getting cheaper

*Hard drive capacity is continually increasing

Bigger clusters mean more power consumption. More power means more cost. Hard drives also use power and are prone to failure because they have moving parts. Most motherboards now have two Gigabit Ethernet connections and room for at least eight gigabytes of memory. Hard drives are getting larger, and the amount of space used even by the largest distribution is only a small fraction of the total space available.

Here’s how I see it: A one gigabyte hard drive is plenty of room for a cluster node to store the system and user application files. (Most well-designed systems could easily get away with less than half of this, but I want to drive the point home.) As I write this, a one gigabyte stick of memory costs about as much as a hard disk drive. So instead of putting a hard drive in your node, put in an extra stick of RAM and run your node completely from memory. Memory is faster, cooler, and it breaks far less often than a hard drive.

Remember, I am not talking about storing user files or critical data files in RAM. For these, a parallel file system, or in many cases NFS, works just fine where important files can sleep safely in the confines of a RAID bunk bed on a storage node. The memory resident files I’m referring to are all the system read-only files needed to run the node and user programs — you know, the files that have been optimally tuned for your specific application and placed on the node by the scheduler. Yes, those files can live quite comfortably in RAM with room to spare.

If you want to get really fancy, or if you have a capacity-based cluster, you could set up an OpenAFS cache in RAM and let the node load up the files it needs. In the above scenario, a 512 MB cache would work quite nicely.

A Hard Goodbye

By now you should have had the “aha moment.” Maybe even your own variation on this idea is emerging right now.

As I’ve described, getting away from full distribution nodes has some big advantages, including saving money, enhanced performance, and easier maintenance. Of course, there are issues to be resolved and these approaches may not work for all cases, but I think this represents a better solution in most cases and is a direction worth pursuing.

The really exciting thing? All of the building blocks are there waiting to be glued together. Some coding is needed, but there are no huge barriers in the way of testing a few of these ideas.

Now that I’ve thrown my hat into the ring, I’m interested in your comments and feedback. In the meantime, I’m going to tinker with some of these ideas. Stay tuned.