Architecture:
The basic idea is one head and many nodes. The head is the machine that runs all the private and public daemons needed to operate the cluster. It has two gigabit network cards: one with a public IP plugged into one of my public switches, and the other plugged into a dedicated cluster switch with no uplink to the internet or public network. Each node has one network connection, plugged directly into this ‘private’ switch. You can think of it as a NAT network with the head acting as the firewall/router.

Hardware:
This particular cluster was pieced together to fit specific needs. We wanted to maximize the number of CPU cores per rack unit, so we went with dual AMD Opteron machines with dual-core CPUs. That gave us a total of 4 cores per machine in what was supposed to be 2 rack units of space. Somewhere along the line between the university’s purchasing department and the vendor, we ended up with 3U machines.
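
As a rough check on what the 3U surprise cost us in density, the arithmetic can be sketched as below. The function name is made up for illustration; the socket, core and rack-unit figures are the ones from the paragraph above.

```python
def cores_per_rack_unit(sockets: int, cores_per_socket: int, rack_units: int) -> float:
    """Return CPU cores per rack unit for a single chassis."""
    return sockets * cores_per_socket / rack_units


# Dual-socket, dual-core machines: what we planned for (2U) vs. what arrived (3U).
planned = cores_per_rack_unit(sockets=2, cores_per_socket=2, rack_units=2)
delivered = cores_per_rack_unit(sockets=2, cores_per_socket=2, rack_units=3)

print(f"planned:   {planned:.2f} cores per U")
print(f"delivered: {delivered:.2f} cores per U")
```

In other words, the same 4 cores now take up half again as much rack space as planned.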

Each machine has a DVD-ROM drive so I don’t need to swap CDs during OS installs.
Each node uses a Tyan Thunder K8SD Pro motherboard because it supports our CPU choice, has integrated video and dual gigabit ethernet, has PCI-X slots to support RAID controllers, and can accept up to 32 GB of PC3200 RAM.

The head contains two 3ware SATA RAID controllers. One is an 8-port card with 1.5 terabytes of storage running RAID5 for the storage array. The other is a 4-port card with 3x120 GB drives running RAID1 with 1 hot spare for the OS drive.
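
For anyone wondering how much usable space those layouts give you, here is a minimal sketch of the standard capacity rules. The function names are mine, and the RAID5 drive count and size in the example are purely hypothetical, since I have not listed the actual drives in the storage array here. Only the RAID1 numbers (two mirrored 120 GB drives plus a spare) come from the description above.

```python
def raid1_usable_gb(drive_gb: float) -> float:
    """A two-drive mirror holds one drive's worth of data.

    A hot spare sits idle until a rebuild, so it adds no capacity.
    """
    return drive_gb


def raid5_usable_gb(drives: int, drive_gb: float) -> float:
    """RAID5 gives up one drive's worth of space to parity."""
    if drives < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return (drives - 1) * drive_gb


# OS array from the text: 3 x 120 GB, one of which is the hot spare.
os_array_gb = raid1_usable_gb(120)

# Hypothetical RAID5 example (NOT the real storage array): 4 x 100 GB drives.
example_raid5_gb = raid5_usable_gb(drives=4, drive_gb=100)

print(os_array_gb, example_raid5_gb)
```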

We then have Type 1 and Type 2 nodes; the only difference is that a Type 1 has 2 GB of RAM and a Type 2 has 16 GB. The quantity breakdown is as follows.

Head: 1
Type1: 3
Type2: 1
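
Adding that up for the compute side of the cluster looks roughly like this. It is only a sketch: the class and variable names are invented for the example, the head is left out because its RAM isn't listed above, and the 4 cores per machine comes from the dual-socket, dual-core hardware described earlier.

```python
from dataclasses import dataclass

CORES_PER_MACHINE = 4  # 2 sockets x 2 cores each


@dataclass(frozen=True)
class NodeType:
    name: str
    count: int
    ram_gb: int


# Compute nodes only, taken from the breakdown above.
compute_nodes = [
    NodeType("Type1", count=3, ram_gb=2),
    NodeType("Type2", count=1, ram_gb=16),
]

total_nodes = sum(t.count for t in compute_nodes)
total_cores = total_nodes * CORES_PER_MACHINE
total_ram_gb = sum(t.count * t.ram_gb for t in compute_nodes)

print(f"{total_nodes} compute nodes, {total_cores} cores, {total_ram_gb} GB RAM")
```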

Setup:
The setup should have been far easier than it was. Let’s just say that I have had a heck of a time making 3ware SATA RAID controllers work on Red Hat based x86_64 Linux distributions. It has made me decide that I will be using LSI MegaRAID cards from now on. This isn’t the first server I have had these problems on.

Head setup:
I stuck the boot DVD in the head and typed ‘frontend’ at the boot prompt to indicate that it shouldn’t try to boot a kickstart install. That feature is very useful, as you will be installing far more compute nodes than you will heads (one to many).

The OS install is fairly similar to any other text-based Red Hat install. I think it said it could do a graphical install, but apparently it doesn’t support the integrated ATI XL on the Tyan motherboard.

I won’t go through step by step, but I will spew out a few useful tidbits of wisdom that I discovered along the way.

Let it automatically set up your partitions. It was smart enough to turn my RAID5 array into storage and break up the OS-specific partitions on the smaller RAID1 array. The partitions it set up for the OS were very sane.

Watch closely as you configure your networking. One interface is public and the other is private. It will annoy you if you whiz through and assign them in the wrong order.

I installed every Roll (Rocks packages) available on the DVD, as we plan on using both MPI and SGE.

Node setup:
On the head you will need to log in as root (either via ssh or locally) and run ‘insert-ethers’. This is a sort of wait-for-call screen like we had in the BBS days, only it’s used for initiating a kickstart with a node. Those familiar with Red Hat’s kickstart may wonder why they have to do this instead of just leaving the kickstart server running. The answer (as best as I can determine) is security. This way a machine can’t be introduced into your network without you (root) initiating it.

Next we stick the boot DVD or CD into the first node. Since this is the first rack (referred to as a cabinet in the Rocks docs) and the first node, it is compute-0-0. So if you have multiple racks, the 3rd node in the second rack would be compute-1-2. That is, if you choose to stick with their naming scheme. Honestly, I see no reason not to. It’s descriptive and well thought out, and besides, you can assign the public interface’s host name to anything you like.
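
The naming scheme is easy to express in a few lines. This is just an illustration of the compute-<rack>-<node> pattern described above, with both numbers counting from zero; the function is my own and is not part of Rocks.

```python
def compute_node_name(rack: int, rank: int) -> str:
    """Build a compute-<rack>-<rank> host name; both numbers count from zero."""
    if rack < 0 or rank < 0:
        raise ValueError("rack and rank must be non-negative")
    return f"compute-{rack}-{rank}"


first_node = compute_node_name(rack=0, rank=0)            # first node, first rack
third_in_second_rack = compute_node_name(rack=1, rank=2)  # 3rd node, second rack

print(first_node)
print(third_in_second_rack)
```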

After it boots from the CD/DVD and connects to the kickstart server, you will see it show up in ‘insert-ethers’ on the head. You can identify it by the MAC address that it displays. At this point the head sends the kickstart information to the new node and registers it with all necessary services, including DHCP, its MySQL database and so forth.
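
To make the registration step a little more concrete, here is a toy model of the bookkeeping: a new MAC address shows up, gets the next free name in the rack, and a MAC that is already known keeps its name. This is only a sketch of the idea as I understand it. It is not how Rocks is actually implemented, the class is invented, and the MAC addresses are made up.

```python
class NodeRegistry:
    """Toy stand-in for the head's node database (not the real Rocks schema)."""

    def __init__(self, rack: int = 0) -> None:
        self.rack = rack
        self.by_mac = {}  # maps lower-cased MAC address -> host name

    def register(self, mac: str) -> str:
        """Return the host name for a MAC, assigning the next free rank if it is new."""
        mac = mac.lower()
        if mac not in self.by_mac:
            rank = len(self.by_mac)  # ranks are handed out in arrival order
            self.by_mac[mac] = f"compute-{self.rack}-{rank}"
        return self.by_mac[mac]


registry = NodeRegistry(rack=0)
first = registry.register("00:E0:81:00:00:01")   # made-up MAC, first node booted
second = registry.register("00:e0:81:00:00:02")  # made-up MAC, second node booted
again = registry.register("00:e0:81:00:00:01")   # same machine seen again

print(first, second, again)
```

The point of the sketch is just that the order in which you boot the nodes decides their names, so boot them in the order they sit in the rack.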

Do nodes and/or the head fail gracefully? What happens if a box crashes?

That, and in a “typical” data center environment, is there any benefit to going this route as opposed to having several netbooted boxes with identical config and using a firewall to load balance connections to them for network and cpu loads?

Good article. You might also want to check out Warewulf at http://www.warewulf-cluster.org. It is another Linux cluster building tool and is quickly becoming popular in some of the labs. It lets you build diskless HPC clusters (very scalable, very powerful and really simple).
You should also check out http://www.platform.com/Rocks. Platform has some additional rolls, such as the PVFS2 Roll and the Clumon Roll for cluster monitoring, that may be useful to you as well. (Note: I work for Platform. I am not trying to sell it; just have a look, and if it is useful then cool.)

Did you have any trouble at all getting Rocks to be stable with the SMP kernel on your dual-core, dual proc Opterons? We’re trying to set up a viz cluster here and the only stable kernel is the non-SMP one.

We have Tyan K8WE boards and Opteron 270s, so fairly similar to what you have. Rocks 4.1. Had to use i386 for now since viz roll not available for x86_64. However, I also witnessed the instability on x86_64 SMP kernel.

Thanks for taking the time to write this up. I have been collecting hardware for a small grid in my home office. This may just be the catalyst to start me running. I do a lot of Monte Carlo dose calculations at home and at work. I just feel that I should know this stuff.

Next time ask Penguin for a cluster quote; then you know it will be nice 1U dual-Opteron, dual-core nodes (up to 32 compute nodes = 128 cores in one rack), preracked and preconfigured, and all you have to do is turn it on. Also, using Scyld means there is only one installation, on the head node; the compute nodes just PXE boot directly into RAM within seconds. They can be diskless, or if your application wants local scratch space, the disks belong completely to the application. No distribution pieces need to be installed on a machine for it to function as a compute node. All the MAC address and cluster naming happens automatically, with no manual steps besides turning on the nodes. And since it is a commercial distribution, you have a phone number to call for support if you ever need it.

Numbski : If one node dies, it generally does so gracefully. It’s easier to simply re-kickstart a node than it is to troubleshoot it. I built the head to be as redundant as possible, but if it went down the whole thing would be useless, as it contains the storage arrays. What you were describing with the load balancing is more about high availability, and in my situation high performance is the goal.

L’HommeDeJava: Rack + switch + hardware + nodes =~ $25,000 US

Bill Bryce : Thank you for the info, I will have to check those sites out. By the sounds of it I may have another bioinformatics professor coming in so I may need to build another cluster in the near future.

Fran Fabrizio: For some reason the head was the only machine I had a problem getting the x86_64 SMP kernel to work on. I eventually traced the problem to the 3ware SATA RAID controller, and once that was dealt with it installed perfectly.

Well, I was looking into some options for my MySQL database. I am beating the crap out of the ONE MySQL server I am using, and I do way too many UPDATE, INSERT and DELETE statements for replication.

I was reading about this and was considering building a Beowulf cluster and running MySQL on it. Any downsides to this? It seems stable. How do I protect against a node going out? For instance, if the cluster distributes the file storage for MySQL across more than one node and one of those nodes goes out completely, is there a mirror on the other nodes to recover the data from?