Why not buy throwaway cheap-ass boards instead, and either RMA them to the
vendor or throw them out in the garbage? How often do you expect even
the worst boards to fail? (Barring a Tyan 2466 BIOS-style problem that
extends across the whole cluster, which is easily avoided by testing, testing,
testing before purchase. The Tyan 2466 being a cheap-ass board, of course,
which is why we all expected it to have such nasty BIOS problems. DON'T
BUY CHEAP EQUIPMENT! Expensive gear guarantees NO BIOS PROBLEMS!)
If you can split your nodes up into subclusters, then the failure of any one
particular node won't matter as much. With 512 nodes split into 8
subclusters, a node failure takes down only the 64 nodes of its own
subcluster. The other 448 stay up and running. Yes, each job will run slower
on a subcluster, but overall the throughput will be higher (assuming there's
no superlinear speedup from running on more nodes (hyper-scaling?!) and that
you aren't limited by a memory footprint that requires all 512 nodes to even
run the job).
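
A rough sketch of that arithmetic, assuming a job spans its whole
subcluster (so one dead node idles every node in it) and that the cluster
divides evenly. The function names are made up for illustration.

```python
def nodes_lost_on_failure(total_nodes: int, subclusters: int) -> int:
    """Nodes idled by a single node failure, assuming a job spans its
    entire subcluster and stalls when any one of its nodes dies."""
    if subclusters <= 0 or total_nodes % subclusters != 0:
        raise ValueError("total_nodes must divide evenly into subclusters")
    return total_nodes // subclusters


def surviving_fraction(total_nodes: int, subclusters: int) -> float:
    """Fraction of the cluster still doing work after one node failure."""
    lost = nodes_lost_on_failure(total_nodes, subclusters)
    return (total_nodes - lost) / total_nodes


# Compare one monolithic 512-node cluster against 8 subclusters of 64.
for k in (1, 8):
    lost = nodes_lost_on_failure(512, k)
    print(f"{k} subcluster(s): lose {lost}, keep {512 - lost} "
          f"({surviving_fraction(512, k):.1%} still running)")
```

With one big cluster a single failure stalls everything; with 8 subclusters
the same failure leaves 448 of 512 nodes working.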
I am not quite sure why everyone is so bent on the fastest single-job
time on a cluster instead of optimizing for throughput, especially
when failure rates are nonzero. Spending money on higher-availability
equipment (which can carry a 30 to 500% markup) just to run jobs faster
serially seems odd, especially when I keep hearing stories suggesting
that there are at least a half dozen, if not dozens, of users on
everyone's clusters around here and, barring that, that everyone has
many jobs to run non-sequentially, so throughput DOES matter.
If you end up with 100% MORE equipment because you didn't buy HA gear, you
now have 2 clusters for the price of one. All you have to do is design your
cluster so that it can survive downtime. Checkpoint your jobs, and write a
script to make it easy for users to restart jobs on the surviving nodes in
that subcluster. If yer 31337, automate it. (I have scripts to do this for
GROMACS right now. Ya, I tear down the LAM mesh, restart it on the surviving
nodes, restart the job. Whee.)
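
For anyone who wants a starting point, here is a minimal dry-run sketch of
the tear-down/restart idea. It is NOT the actual GROMACS script mentioned
above. It assumes LAM/MPI's `lamhalt` and `lamboot <hostfile>` commands are
what you use to stop and start the mesh (check your LAM version), and the
liveness check, the hostfile path and the job restart command are all
hypothetical placeholders you would supply yourself. It only builds the
plan; it does not run anything.

```python
from typing import Callable, List, Sequence, Tuple


def surviving_hosts(hosts: Sequence[str],
                    is_alive: Callable[[str], bool]) -> List[str]:
    """Return the hosts that pass the caller's liveness check, in order."""
    return [h for h in hosts if is_alive(h)]


def restart_plan(hosts: Sequence[str],
                 is_alive: Callable[[str], bool],
                 hostfile_path: str,
                 job_cmd: Sequence[str]) -> Tuple[str, List[List[str]]]:
    """Build (hostfile contents, commands to run) for a restart.

    job_cmd is whatever restarts your job from its last checkpoint;
    it is application-specific and assumed to be provided by the caller.
    """
    alive = surviving_hosts(hosts, is_alive)
    if not alive:
        raise RuntimeError("no surviving nodes in this subcluster")
    hostfile_text = "\n".join(alive) + "\n"
    commands = [
        ["lamhalt"],                 # assumed: tear down the old LAM mesh
        ["lamboot", hostfile_path],  # assumed: boot LAM on survivors only
        list(job_cmd),               # restart the checkpointed job
    ]
    return hostfile_text, commands


# Example with a fake liveness check: pretend node03 has died.
dead = {"node03"}
hosts = ["node01", "node02", "node03", "node04"]
hostfile, cmds = restart_plan(hosts, lambda h: h not in dead,
                              "survivors.hosts",          # hypothetical path
                              ["./restart_from_checkpoint.sh"])  # hypothetical
print(hostfile, end="")
for c in cmds:
    print(" ".join(c))
```

In real use you would write the hostfile to disk and run the commands in
order (e.g. with `subprocess.run`), and the liveness check would be a ping
or an ssh connect attempt.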
Design your cluster to survive downtime and you don't have a problem. I don't
see how anyone in the 256-512+ node range can do anything but. Failures will
always happen. Spending exponentially more for a linear decrease in failure
rate makes no sense. Buying an assload more gear and exploiting
redundancies/higher throughput through smaller subclusters makes a lot more.
Guess I'm just a fan of Redundant Arrays of Inexpensive Beowulf Clusters
(RAIBC)!
We've had no end of success with the 'PC Chimps' (as some snide
beowulfers here called them) boards we've had running nonstop for a year and
a half. 2 nodes out of 30 dead in that time, and about 15-20 random crashes
as well (reset time 6-12 hrs, but checkpointed jobs continue on all
surviving nodes whenever the user notices and runs the restart script).
Even if we had a total downtime of *6 MONTHS* we'd be incredibly far ahead of
the competing design, which had a total of 10 nodes. 30 nodes = 3 * 10, so
we'd need 66% downtime to lose. We're not even close to 5% here. Even if we
had to throw out those 2 boards (instead of just handing them to our supplier,
who RMAd them in a few days), even if we had to throw out 10 boards over that
time, even if we had to throw out 10 boards on *DAY 1* because we're idiots
and got no RMA, we'd STILL be ahead. (Hard to find a supplier without at least
a 1 year automatic RMA.) Total throughput per dollar has been STELLAR.
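
The break-even figure can be checked with a few lines. This is a
back-of-envelope sketch with assumptions of my own choosing: the 10-node
design never goes down, nodes are equally fast, and (pessimistically) every
crash stalls the ENTIRE 30-node cluster for the full 12 hours. The dead
boards' RMA time is left out because "a few days" isn't a number.

```python
def breakeven_downtime(our_nodes: int, rival_nodes: int) -> float:
    """Downtime fraction at which our_nodes deliver the same node-hours
    as a rival cluster of rival_nodes that is assumed never to go down."""
    if our_nodes <= 0 or rival_nodes < 0:
        raise ValueError("node counts must be positive")
    return 1.0 - rival_nodes / our_nodes


def whole_cluster_downtime(crashes: int, hours_per_crash: float,
                           total_hours: float) -> float:
    """Downtime fraction if every crash idles the whole cluster
    (a deliberately pessimistic assumption)."""
    return crashes * hours_per_crash / total_hours


HOURS_IN_18_MONTHS = 1.5 * 365 * 24  # about a year and a half of wall clock

needed = breakeven_downtime(30, 10)
worst = whole_cluster_downtime(20, 12, HOURS_IN_18_MONTHS)
print(f"downtime needed to lose to 10 nodes: {needed:.1%}")
print(f"pessimistic downtime from 20 crashes: {worst:.1%}")
```

Even with every crash counted against all 30 nodes, the downtime stays
well under 5%, nowhere near the two thirds needed to fall behind.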
The only concern with this is that now you have more nodes, it's going to
cost more in power and cooling... what a horrible predicament to be in, huh?
To have so much computing power that you need to worry about that! I guess HA
designs solve this problem quite well. :P
-----
Anyone exploring high FLOPS/watt clusters yet? I got these Via Eden
boards here and they run without even a CPU fan (yay no moving parts! wait,
is this thing on? :). I have a 533 MHz version on my desk (pretty slow)
but I'll post some results soon. They're claiming some ~5W total power
usage for a running node at 800 MHz, and there's a GHz version out now too.
Anyone played with them? They do seem to run pretty cool, but now I need to
hunt for a 100W power supply to really achieve savings. (The fan on
the 300W Enermax it's on now is barely turning.)
Suddenly always buying the fastest and hottest (thermodynamically speaking)
CPU might not make sense anymore.
/kc
--
Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto, CANADA