The Beowulf mailing list provides detailed discussions about
issues concerning Linux HPC clusters.
In this article I review some postings to the Beowulf list about parallel memory and about building clusters from bare motherboards.
I think the discussion threads presented below provide some very
useful information despite the age
of the postings. And another good use for cookie sheets!

Parallel Memory

On Oct. 18, 2005, Todd Henderson posted a question asking whether there were any tools, drivers, etc., that would allow distributed nodes to have their collective memory appear as shared memory. In essence, a "PVFS" for memory. Todd also mentioned that he wasn't worried about speed, just memory capacity (his application's memory usage scaled with the cube of the problem size). He wanted to know about any approaches to distributed shared memory before he embarked on a large MPI porting effort.


The first one to reply was Mark Hahn (doesn't he ever sleep?). He
said
that there were some student projects around to do this kind of thing.
But he didn't think it was too worthwhile, "... unless you have some
pretty much completely sequential, high-locality access patterns."
Mark also pointed out that a memory access on a node is on the order of 60 ns (nanoseconds), while fetching a page of memory over a network would be on the order of 80 microseconds, so the difference is about a factor of 1,000. One suggestion that Mark made was to look at Global Arrays.

Paulo Afonso Lopes then suggested that Todd take a look at SSI (Single System Image) and DSM (Distributed Shared Memory) projects. One that he mentioned was Kerrighed.

Robert Brown then posted, mentioning a project at Duke called "Trapeze," though he wasn't sure if the project was still around. He then went on with an idea: let the node where the code is running start swapping, but put the swap space over NFS, and on the NFS server, back it with a ramdisk rather than disks. Robert thought this would be an interesting
experiment to try. Of course your swap space would be
limited to the largest ramdisk on the NFS server node (about 64GB
for current commodity hardware). So if you combined this with the
memory of the node where the code is running, you could get about
128 GB of usable memory+swap. If you need to go larger you could
create swap files on various nodes using the same approach so that
the node where the code is running could swap to a number of swap
files.

Bogdan Costescu
responded
to Robert's post about swapping over NFS to
a ramdisk by saying that there had been a discussion on some of
the heavy duty Linux Kernel mailing lists about swapping over NBD (Network
Block Device) or iSCSI. He described a situation where you could get a deadlock (you need memory to perform the transfer, so what happens if you have to swap in order to swap?). So he suggested that swapping over the network wasn't a good idea at the time (remember, this is the end of 2005).

Robert
replied
to Bogdan's post with some discussion about the details of creating
a ramdisk. He then worked out a heuristic for the largest safe ramdisk size (a corner case, really), which showed that you shouldn't devote more than about 50% of a system's memory to a ramdisk if you want to make sure that corner case can never happen. His finishing comment was that it's probably better to just write a simple parallel application with routines that do the data management for you (Global Arrays, mentioned earlier, does this).

Ashley Pittman
wrote
that in the 2.2 kernel there was the ability to swap over the network. It used sockets to communicate with a remote server, and the whole thing ran in user space, so it was probably simpler than using NFS. Michael Will chimed in that he used to swap over a 10/100 network to a remote ramdisk via NBD. He was using the swap space to work with what were, at the time, large GIMP images. He said, "Qualitative statement: It seemed faster
than using the old IDE drive for swapping,
maybe because the image data came from the IDE drive as well and
so the extra 10MB/s channel via NBD was worth it."

Randy Wright wrote with a link to a paper he had heard presented at Cluster 2005. He said the authors had a large quicksort running only 1.7 times slower than doing it entirely in local memory, but up to 21 times faster than using a local disk. He said that on a good day it worked, but it was fairly flaky.

Richard Walsh also wrote with a suggestion to look at the UPC project for C codes or the Co-Array Fortran project for Fortran codes. These are languages that allow you to use memory from other nodes and/or to thread the application. He said that there were some libraries for common interconnects that allow you to use memory on other nodes. He then went on to talk about some details of both UPC and Co-Array Fortran.
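
To give a flavor of what Richard was describing, here is a tiny UPC sketch. This is my own illustration, not code from the list, and the file name is made up; with the Berkeley UPC toolchain you would typically build it with upcc and launch it with upcrun. Each UPC thread owns one element of a shared array, and any thread can read the others' elements with ordinary array syntax even though they may physically live on other nodes.

/* upc_sum.upc - a toy UPC illustration (not from the list postings) */
#include <upc.h>
#include <stdio.h>

shared int data[THREADS];          /* one element per UPC thread, spread across nodes */

int main(void)
{
    data[MYTHREAD] = MYTHREAD + 1; /* each thread writes the element it owns */
    upc_barrier;                   /* wait until every thread has written */

    if (MYTHREAD == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS; i++)
            sum += data[i];        /* remote reads look like plain array accesses */
        printf("sum over %d threads = %d\n", THREADS, sum);
    }
    return 0;
}

The appeal is exactly what Richard pointed out: the language hides whether data[i] is local or remote, though the latency gap Mark Hahn described is still there underneath.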

I like this discussion because once you get into clusters you eventually
ask the same question Todd asked - Isn't there a good way to do
distributed shared memory on clusters? While there were some good
suggestions, I recommend using Global Arrays. It allows you to
grab memory from distributed nodes to use locally and it handles
everything very simply.
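
For the curious, here is a minimal sketch of what "grabbing memory from distributed nodes" looks like at a lower level, using MPI one-sided communication (RMA). To be clear, this is my own illustration and not the Global Arrays API (Global Arrays wraps this kind of machinery in a much friendlier global-index interface); the chunk size and file name are made up for the example.

/* dsm_sketch.c - illustration only, not the Global Arrays API.
 * Each MPI rank exposes a chunk of its RAM; together the chunks form one
 * large array spread across the cluster, and any rank can read a piece
 * that physically lives on another node.
 * Typical build/run: mpicc dsm_sketch.c -o dsm_sketch && mpirun -np 4 ./dsm_sketch
 */
#include <mpi.h>
#include <stdio.h>

#define CHUNK 1024                    /* doubles contributed by each rank */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double *local;                    /* this rank's slice of the "global" array */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Expose CHUNK doubles of this rank's memory for remote access. */
    MPI_Win_allocate(CHUNK * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);

    for (int i = 0; i < CHUNK; i++)
        local[i] = rank;              /* fill our own slice */

    MPI_Win_fence(0, win);            /* everyone's slice is now initialized */

    double remote_val = -1.0;
    int target = nprocs - 1;
    if (rank == 0)                    /* rank 0 reads memory that lives on the last rank */
        MPI_Get(&remote_val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

    MPI_Win_fence(0, win);            /* complete the outstanding Get */

    if (rank == 0)
        printf("first element of rank %d's slice = %g\n", target, remote_val);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Global Arrays essentially does this bookkeeping for you: you create one logically global array and then Put/Get arbitrary patches of it without tracking which node owns which piece, which is why it's my recommendation above.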

Cluster of Motherboards

Once you get into building your own clusters you also ask the question,
what about building a cluster just using motherboards and no cases?
Well, Fernando
asked
this question on Nov. 4, 2005.

Of course, the universal answer was - yes. Glen Gardner
wrote
with a
link to a cluster
he built using mini-ITX boards. He said that the latest version of the cluster packed 18 mini-ITX motherboards into about 12 inches of rack height; in other words, up to 18 boards in a space 19" wide by 12" high by 26" deep.

It looks like Robert Brown wasn't immune to
posting
about gadgetry (he's a DIY kind of guy!). He said that using just
motherboards had been done a number of times before. He did mention a couple of ideas people have tried (e.g., directly mounting motherboards to shelving), and he also told Fernando to be careful when dealing with electricity. Robert then suggested that the list's EE guru, Jim Lux, should comment on the subject.

Being the good cluster-er that he is, Jim promptly
wrote
to the list with some good comments. His opening comments were,

Sure it's possible.
Your problems are going to be power, cooling, and structures (assuming
you're not in an environment where people care about electrical codes, RF
interference, etc.)

He then went on to explain each of these in a little detail, with some warnings (e.g., watch out for ground loops).

Jim's comments led Robert to kick into "Watch Out There" mode and
offer
some warnings about creating dangerous conditions (he didn't mention
anything about running with scissors, though). But he did make one comment that I've seen several times in the past,

Yeah, I think that Jim's observation that you should think carefully
about the diminishing returns of building a free-form caseless cluster is
very apropos -- you'll save a bit of money on space and cases -- maybe
-- at the expense of more hands on work building the cluster and at the
risk of having to resolve problems with shielding and so on.

He did offer some very good suggestions though. He ended with this
comment.

If you go anywhere beyond this, I'd
REALLY recommend that you only proceed if you completely understand
electricity and electrical wiring and know what a ground loop IS and so
on.

A fitting comment if I've ever read one. This should be posted in
every electrical section in Home Depot and Lowe's. But then again
the
Darwin Awards just
wouldn't be the same.

On a more serious note, Josip Loncaric
wrote
that it's possible to find cheap cases for about $20, and these can save you some work, though not necessarily shelf space.

Mark Hahn echoed these comments by explaining the lure of inexpensive hardware for building your cluster. He gave an interesting example of pricing out a 1,500-CPU cluster (allowing some money for the CPUs, motherboards, chassis, power supplies, and memory) and finding that the sum came to about 20% of the cost of the real thing. (Tempting, isn't it?) He did mention that Google doesn't use cases; rather, they have bare motherboards on trays, perhaps much like Fernando wants to do. Mark finished his comments with the following.

In summary, subtracting the chassis sounds smart, but really only makes
sense if you follow through with the rest - cheap motherboard, cheap cpu,
minimal cpu, minimal network, cheap labor, workload that is embarrassingly
parallel, and not long-running...

In short, you get what you pay for. (I've been burned on cheap memory
several times in the past - never again).

What's remarkable to consider is that one of the very largest
(if not the largest?) data cluster systems in the world is a bare
motherboard system, strapped together with lots of simple
screws and Velcro.

That's Google, in case you did not know. I was shocked to see
this when I saw a presentation recently by one of the Google
guns here in NYC (actually, the inventor of Froogle).
He showed us pix of a bunch of nodes essentially
sitting on some insulating material, screwed to a simple
frame-style chassis with careful consideration of grounding
and power. His point was to emphasize that google considers
lots of very cheap, very simple nodes key to their growth, and cases
are 'right out' when you go to this scale (he would not share the
exact N of nodes with us, but alluded to something on the order of
100K, at that time, and this is *always* growing).

I had heard about this in 2005. I think it's fairly common
knowledge now. But it's still very interesting.

After a brief discussion about Google, Jim Lux
came
back with some interesting back-of-the-envelope calculations. He was interested in the amount of time it takes to drill holes in a piece of sheet metal or aluminum to serve as a base plate for a motherboard. Assuming that you could do about 12 plates at a time, he ended up with an estimate of about 30 minutes to drill and screw in a single motherboard. At a guess of about $10/hour for labor, that works out to roughly $5 of labor per board before you even count the price of materials, and suddenly that cheap $20 case looks pretty attractive. Jim then finished with a true "Tool Time" suggestion.

There IS a faster way, for a bare system approach. Use double sided sticky
foam tape. Plenty strong, it will last 2 or 3 years.

Then Doug Eadline weighed in and strongly recommended using a regular case. He mentioned our Kronos Project to build the fastest system we could for $2,500. At that time, we found a well-engineered small case for about $40; now you can find them for $30 or less. Doug followed up with a slightly philosophical comment.

One of things I have learned when building clusters is to take advantage
of mass produced anything (mostly hardware). Looking inside a diskless
node, I often get the urge to build a better enclosure, but then realize
that the cost and time to fuss around with everything is not worth it. As
a hobby, sure, it might be fun, but my interest is software, a "good
enough" solution that costs much less in both time and money always seems
to win the day for me. YMMV

I think this is well said (although I can think of situations
where a custom case is warranted). But, the siren song of commodity
pricing is very hard to resist.


Right after Doug, Andrew Piskorski
wrote
that his favorite custom packaging scheme du jour was cookie sheets!
He mounts the systems to basic cookie sheets, and he said there are ready-made racks for the sheets. It's a long posting with lots of details, but he talked about how many micro-ATX motherboards he could get in a single rack (up to about 78) and how the density was better than with standard cases. Since my wife's family is in the doughnut
business, I think this is a great idea! This is really taking
advantage of commodity components.

By the way, Andrew
posted
some links to bare-bones motherboard systems. All of the links
are still active. My favorite is the
zBox.

These types of discussions are always fun. You get to see the creative side of various people come out, and the contrasts are interesting as well (I still love the cookie sheet idea). Some of the ideas are worthwhile, I think, but in many cases it may be more effective to just use micro-ATX cases with micro-ATX boards.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home
computer (donations gladly accepted) so he can store all of the postings to
all of the mailing lists he monitors. He can sometimes be found lounging
at a nearby Fry's, dreaming of
hardware and drinking coffee (but never during working hours).
