Hmmm. If you are reading this footnote, it's possible
that you didn't know what I meant by ``100BaseT Fast Ethernet
switch,'' or what one is. And I don't have the time or space to tell
you right here. Check out the Beowulf Hardware appendix.

I'm trying to write an entertaining book as much
as a useful one (if I can't please you one way, maybe I can please you
the other, eh). I'm therefore going to write in a ``folksy'' first
person instead of an ``academic'' (and dry) third person. I'm also
going to insert all sorts of parenthetical comments (like this one) as
parenthetical comments or footnotes. I can only hope that this doesn't
make you run screaming from the room after reading for five minutes.

The list of names of ``registered'' beowulfs on the
beowulf website contains entries like ``Grendel'' (Clemson), ``Loki''
(Los Alamos), ``Brahma'' (Duke), ``Medusa'' (New Mexico State), and
``Valhalla'' (University of Missouri) as well as more whimsical names
such as ``Wonderland'' (University of Texas at Austin).

See for example the Parallel
Virtual File System (PVFS) being developed at Clemson. This effort
promises to parallelize disk access in a beowulf as an integrated
part of the beowulf's parallel operations. Would I call a beowulf-like
collection of compute nodes mixed with disk nodes (and possibly other
kinds of specialized nodes) a beowulf? I would.

Moore's Law will be
discussed later. It is an empirical observation that at any given price
point computer performance has doubled approximately every 9-12 months
for the last forty or more years. No kidding. This means that a 16
node beowulf that you're very proud of initially can be replaced by a
single node within three or so years. Moore's Law makes any
node or beowulf or computer design a thing of purely transient beauty.
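If you want to check that claim on the back of an envelope, the arithmetic is just repeated doubling. Here's a sketch in Python (the doubling times are my illustrative assumptions; adjust to taste):

```python
# Back-of-the-envelope Moore's Law arithmetic: price-performance
# doubling every 9-12 months, per the claim in the text.

def performance_factor(years, doubling_months):
    """Factor by which per-dollar performance grows over `years`."""
    return 2 ** (12.0 * years / doubling_months)

# With a 9-month doubling time, three years buys a factor of 16:
print(performance_factor(3, 9))   # 16.0
# With a 12-month doubling time, a factor of 8:
print(performance_factor(3, 12))  # 8.0
```

A 9-month doubling time gives exactly the factor of 16 that lets a single node replace your 16 node beowulf in three years.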

If, in fact, you bought this book at all. The truly
cheap will be reading it a painful page at a time off the website
instead of investing the truly trivial amount of money at their favorite
computer bookstore necessary to ensure that I actually make a royalty
and that they can read in bed. Once I actually get it published so you
CAN buy it, of course.

Except to pay, of
course. Over and over again. Repeatedly. Big Iron supercomputing is
expensive. Best of all, after you've owned your million-plus
dollar supercomputer for as few as four or five years, you can typically
sell it for as much as $3500 on the open market - they get bought to
recycle the gold from their contacts, for example. I almost bought
Duke's five-year-old CM5 some years ago for just about this much money
but I couldn't figure out how I was going to plug it in in my garage.
True story, no kidding, by the way.

The exact probability, of course, is subject to
discussion, and therefore has been discussed from time to time on the
beowulf list (where we love to discuss things that are subject to
discussion). Donning my flameproof asbestos suit to repel the
``flames'' of those that disrespectfully disagree I'd estimate that more
than 70% of the computationally intensive work that could be done with
a parallel supercomputer of any kind can be done on a relatively simple
beowulf design. Actually, I'd say 90% but my suit wouldn't withstand a
nuclear blast so I won't.

Parallel Virtual Machine. This is a set of
library routines designed to facilitate the construction of parallel
software for a ``virtual parallel supercomputer'' made up of similar or
dissimilar machines on a network. It is open source software that
significantly precedes the beowulf effort and in some fundamental sense
is its lineal ancestor. For more details see the appendix on beowulf
software.

Message Passing Interface. Where PVM was
from the beginning an open source effort, MPI was originally developed
by a consortium of parallel supercomputer vendors in response to demands
from their clients for a uniform API (Application Programming
Interface). Before MPI, all parallel supercomputers had to be
programmed more or less by hand and the code was totally non-portable.
The life cycle of a parallel supercomputer was: a) Buy the beast. Cost,
several million dollars. b) Learn to program the beast. Convert all
your code. Cost one or two years of your life. c) Run your code in
production for a year or so. d) Realize that a desktop computer that
now costs only a few thousand dollars can run your unparallelized code
just as fast. e) Sell your parallel supercomputer as junk metal for a
few thousand dollars, hoping to break even. I'm not kidding, by the
way. Been there, done that.

I have no idea what MOSIX
stands for. Perhaps nothing. Perhaps it was developed by a guy named
Moe, the way Linux was developed by a guy named Linus. MOSIX software
makes your networked cluster transparently into a virtual multiprocessor
machine for running single threaded code. Very cool.

I don't think Condor means anything either, but the
project has a cool logo. Where MOSIX is transparent (attached to the
kernel scheduler, if you like) Condor is a big-brother-like batch job
distribution engine with a scheduler and policy engine. By the way,
anticipating that you're getting bored with pages that are
half-footnote, in the future I'll generally refer to software packages
without an explanation. As I said before, look in the Beowulf Software
Appendix for URLs and more detailed explanatory text.

Linus Torvalds has fairly religiously rejected any
redesign of the core kernel scheduler that doesn't preserve ``perfect''
interactive response within the limitations imposed by the hardware.
Philosophically, the Linux kernel is willing to make long-running
background tasks wait a bit and perhaps lose 1% of the capacity of the
CPU to ensure that it responds to keyboard typing or mouse clicks now. As a consequence, a graphical user interface (GUI) user is
generally unable to tell whether one, two, or even three simultaneous
background jobs are ``competing'' with their GUI tasks for cycles. You
can read the appendix on my own early cluster experiences with a less
friendly operating system to see why this is a Good Idea.

This is my own personal favorite approach to parallel
supercomputing, and was a major design factor in Brahma, Duke's
original beowulfish cluster effort which I called a ``Distributed
Parallel Supercomputer'' as it had both dedicated nodes and desktop
nodes, mixing characteristics of a beowulf and a NOW. In the physics
department we kept Brahma running at over 90% of its capacity for
years and managed to cover a dozen or so desktops with very nice
systems indeed.

Which is why they buy computers
instead of ``thin'' client interfaces, which I personally have
repeatedly thought were a dumb idea, every time they've been reinvented
over the years. If people (other than corporate bean counters) wanted
``thin'' we'd still be timesharing with evolved VT100 terminals, for
God's sake. It is especially stupid when the marginal cost of ``thick''
is at most a couple hundred dollars. IMHO, anyway.

It's a kind of high-performance massively parallel computer built
primarily out of commodity hardware components, running a free-software
operating system like Linux or FreeBSD, interconnected by a private
high-speed network. It consists of a cluster of PCs or workstations
dedicated to high-performance computing tasks. The nodes don't sit on
people's desks; they run cluster jobs and nothing else. The cluster is
usually connected to the outside world through only a single node.

Some Linux clusters are built for reliability instead of speed. These
are not Beowulfs.

This is not really a fairy tale. Occurrences
frighteningly close to this have been reported from time to time on the
beowulf list. Remember, for a beowulf to be useful, its nodes have to
have a low probability of crashing during the time of a
calculation, which can easily be days or even weeks. Does this match
your experience of Win-whatever running anything you like? Enough
said. Linux nodes (to my own direct and extensive experience) don't
crash. Well, they do, but most often only when the hardware breaks or
one does something silly like exhaust virtual memory. Uptimes of months
are common - most Linux nodes get rebooted for maintenance or upgrade
before they crash.

By ``tiny fraction'' I mean as
little as a few percent. If the ``top500'' supercomputing list ranked
computers in aggregate calculations per second per dollar, instead of
just in aggregate calculations per second (hang the cost), big-iron
solutions likely wouldn't even make the list. Beowulf-style clusters
would own it.

We'll refer to ``the task''
although of course you may well want to build a beowulf to do more than
one task. In that case you will need to do most of the work associated
with this protocol for all the tasks and, if necessary, make
trade-off decisions. Beowulfs designed for one task won't always do
well on another. Be warned.

This ``currently available'' hardware, some of it
fairly state-of-the-art as of fall 2000, is probably obsolete by the
time you read this. Your pocket calculator is likely faster if you're
reading this in 2005. Sigh.

Noting carefully that scalability of
the prototype and your code is one of the things you should be measuring
on the prototype, so that the term ``often'' should properly be
``always''. Except when it doesn't work, of course.

Just kidding, again, jeeze, can't you take a
joke? Seriously, real computer scientists are likely to look just
like your average, run of the mill pocket protector adorned geek. You
can hardly ever pick one out of a crowd of geeks just by looking.

They were really big and really slow. Your
pocket calculator or your kid's Nintendo today could probably
out-calculate them, and might even have a bigger memory and better
programming language. However, Moore's Law (named for Gordon Moore of
Intel, not an IBM invention) has made computers smaller and faster ever
since, almost like a law of nature. Moore's Law is discussed a bit
later in the text.

Literally, ``to drink''. See Robert A. Heinlein's
``Stranger in a Strange Land''. Properly speaking, it would be more
correct of you to grok some beer while working on understanding Amdahl's
Law. If you have a fridge handy, go on, get a cold one. I'll wait.

Physics purists beware: Yes, this really should be
called the ``power'' of a computer program, not the speed, and power
would indeed be a more precise term for what it describes, even though
this ``work'' has nothing to do with force through distance. Work is
used in the sense of accomplishing some set of tasks with no proper
underlying metric, but ultimately it is related to a sort of free
energy.

Think of it as an assignment for the conceptually
challenged. I know you're busy, but you're going to have to do your
homework if you expect to learn from this course. So put on the stereo,
make sure the dinner is in the oven, start a load of laundry, pop open a
beer, and while all that is going on write down three or four ways you
parallelize tasks in your home or office.

Except in those very rare
cases and small ranges where it is better; see the previous note.
Actually, a bigger danger is that a parallel version of your code will
be so different from the original serial version that Amdahl's Law is no
longer relevant as there is little left of the original ``serial
fraction'' of code. Unless you bother to write a useless ``serial''
version of the parallel code, you're comparing apples to orangutans.
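For reference, Amdahl's Law itself fits in one line: if a fraction s of the work is irreducibly serial, then N processors can speed you up by at most 1/(s + (1-s)/N). A quick Python sketch (the 5% serial fraction is an illustrative assumption, not a measurement):

```python
def amdahl_speedup(serial_fraction, nprocs):
    """Best-case speedup when a fixed fraction of the work is serial."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / nprocs)

# Even a 5% serial fraction caps a 16-node beowulf well below 16x:
print(amdahl_speedup(0.05, 16))     # about 9.14x
# ...and caps *any* number of nodes at 1/s:
print(amdahl_speedup(0.05, 10**6))  # just under 20x
```

Note the ceiling: no matter how many nodes you buy, you never beat 1/s.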

Did you ever glue down the cockpit canopy and then learn
that first you should have glued in the little bitty
pilot-in-a-chair assembly? Oops. The cockpit-canopy gluedown step has
to wait until the pilot-in-a-chair is done even if it means that
you or your friend remains idle.

For
example, the one by Almasi and Gottlieb, from which I cribbed a lot of
this stuff. Or there are some great resources on the web, for example
the excellent online book on designing parallel programs by Ian Foster
at Argonne National Labs, http://www-unix.mcs.anl.gov/dbpp/. This
latter resource is particularly awesome (and free) and will be ignored
only by the terminally ignorant.

Unless, of course, you are really very smart or a real
computer scientist (in which case you're probably sneering at this
miserable excuse for a real book on computer science) or both. I mean,
somebody invented them, why shouldn't you reinvent them?
Seriously, check out Ian Foster's collection to see why.

You must, of course,
get over your embarrassment if your task turns out to be embarrassingly
parallel. It's like being embarrassed at being a billionaire or being
endowed with a perfect life and great health while others in the world
lead flawed lives. A moment or two is all right (to show that you're
compassionate), but then it becomes maudlin. Look, the work I do in my
physics research is embarrassingly parallel. There, I admitted it. You
can too. Let it all out. Maybe we'll start a support group.

If you have 1000 friends, of course, you'd never be
reading this. You'd either be partying constantly or in politics. Only
in the latter case would you be tempted to distribute 1000 model
airplane kits, but you wouldn't pay for them.

I'm assuming that you are getting out a
used envelope, or buying a new one if you have to, and actually working
out the relative rates to determine that it actually took 11/10 times as
long as it would have taken you to build them working alone, for a
speedup - uh, slowdown - of 10/11. This is known as doing ``back of
the envelope calculations'' and unfortunately you're going to be forced
to do this if you hang out with physicists, especially theorists. No
self-respecting theorist ever goes anywhere without a pocket full of
envelopes. Used is best, but there are few pleasures to compare with
filling a crisp new envelope (or the back of your check or the
tablecloth itself, in a pinch) with mind-boggling equations and then
leaving them on the table for the waitperson to ogle after a four-beer
physics lunch. But I digress.

This often leads to a
certain amount of cursing, some crawl-by shootings, and some good
natured road rage. I find that the same is true when you discover a
critical bottleneck in your parallel cluster, especially after
writing a many-shekel grant proposal to buy it and build it.

You can always tell an Old Guy in computing
because we still use quaint terms like ``core'' to describe something
that hasn't been an actual core since before most of the folks reading
this were born. I've actually seen and held in my hands antique memory
cores, which looked like a funny 3-dimensional grid of wires and beads
in a tube. Chip-based memory is boring by comparison. You can see a
picture of a memory core (complete with a nifty ``magnifier'' that scans
the beads) at
http://www.physics.gla.ac.uk/fdoherty/IDRG/lense.html at the time
of this writing.

If one is inclined to argue, imagine the speedup if the
four nodes in question don't have remote swap or any swap at all.
In that case the job takes an infinite amount of time on a single node
as it just won't run, and will take a finite amount of time on four
nodes. How's an infinite speedup for a violation, eh?

As I discuss later below, running the job on a 512
MB system with 5 GB of virtual memory in the form of swap could
conceivably transform a job that would take a mere day or two on a
system with 5 GB of memory into one that would probably finish just in
time for the next millennial celebration. No kidding. Six orders of
magnitude will do that to you.

One which
it is always nice to see shared by others. For example, a quote from
the aforementioned web page on memory hierarchy:

All other things equal, more cache is better than less. Clearly no
desktop user is going to blow $3800 to get a Xeon processor that, for
ordinary applications, will be almost indistinguishable from a $150
Celeron. Just because a Pentium III system is within reach, is it really
worth the money?

Vendors have an incentive to sell more expensive units. Some customers
will opt for the more expensive machine because they think they're worth
it. However, most casual users will get along quite nicely with a
Celeron. The Pentium III is engineered best for a workstation or server
with two CPUs. If it ever makes sense, the Xeon is designed for
corporate servers with 4 or 8 processors.

This is part of the Yale ``PC Lube and Tune'' website, which is actually
rather nice and well worth bookmarking as an extended resource:
http://pclt.cis.yale.edu/pclt/default.htm

It is worth mentioning that linear operations like
this can be very significantly sped up (by a factor of 2 to 3 or more)
by precisely organizing them to match the actual sizes of the caching
subsystems. ATLAS (Automatically Tuned Linear Algebra Software) is a
project that has written some very clever code-building code that
creates linear algebra libraries (BLAS and LAPACK) with loop and block
sizes precisely tuned to cache size to yield empirically the best
possible execution times. Very, very cool. See
http://www.netlib.org/atlas/index.html to get the package.
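To get the flavor of what ATLAS tunes, here's a toy blocked matrix multiply in pure Python (purely illustrative: ATLAS generates tuned C, and the block size below is a made-up knob, not one of its numbers):

```python
def matmul_blocked(A, B, n, bs=32):
    """Multiply two n x n matrices (lists of lists) in bs x bs blocks.

    Same arithmetic as the naive triple loop, but each block of A and B
    gets reused many times while it is still hot in cache.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Choosing bs so that the working blocks fit in L1 or L2 cache is exactly the parameter ATLAS searches for empirically at build time.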

Before you even start to think about this in
detail, I'd strongly recommend a trip to
http://www.faqs.org/faqs/electrical-wiring/part1/ to learn a little bit
about electricity and how it is distributed. I cover some of it in
summary below, but you should take this subject fairly seriously.

In the United States,
anyway. Sorry, I know, you probably live in Europe or India or Korea or
South America, but I don't know anything about electricity and
electrical codes there and hence couldn't help you anyway. If a
volunteer from the beowulf list sends me details about other countries,
I'll certainly break this section up into subsections, one per
contribution.

By stopping your heart; not,
as a general rule, by cooking your brain like a hot dog unless you are
messing with really high voltage and current. 60 Hz turns out to be a
bad frequency because it interferes with the biological
frequencies that keep your heart cranking along. Oops. 1000 Hz would
have been a much safer choice.

Aha!
You thought that I'd present the calculation here, didn't you? Admit
it. You're just being lazy, and I'm tempted to tell you to get out an
envelope, but you probably don't remember the latent heat of fusion for
water (333.5 kJ/kg) or the number of kilograms in a pound (0.4535) or
the number of seconds in a day (86400) and all that. One ton of air
conditioning melts one ton (2000 pounds) of ice per day, so it can
remove about 3.5 kilowatts from a room continuously.
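If you'd rather let the computer do the envelope work, the whole calculation (using exactly the numbers above) fits in a few lines:

```python
# One ``ton'' of air conditioning is the heat flow needed to melt one
# ton (2000 lb) of ice in 24 hours. Plugging in the numbers from the text:
LATENT_HEAT_FUSION = 333.5e3   # J/kg, to melt ice into water
KG_PER_LB = 0.4535             # kilograms in a pound
SECONDS_PER_DAY = 86400

watts = 2000 * KG_PER_LB * LATENT_HEAT_FUSION / SECONDS_PER_DAY
print(round(watts))            # about 3500 W, i.e. roughly 3.5 kW
```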

Scyld, at http://www.scyld.com, is commercial,
Clustermatic, at http://www.clustermatic.org, is non-commercial. Both
are open source; both are based on Erik Hendriks' ``bproc'' program. The
Scyld solution comes with hot and cold running commercial support, but
costs money. The Clustermatic solution comes with the usual hot and
cold running support-by-other-users, and is ``free''. Ya pays your
money (or not) and takes your choice...

Let's see, they are slow, they are broken and don't work,
they are clumsy, they require that you have a graphical interface
(the X console) working, they are broken and don't work, they are
security risks, they prevent you from ever learning proper scalable
systems management techniques for a beowulf, they are broken and they
don't work right where you damn well need them to, they are utterly
different on different distributions and hence what you learn isn't
portable at all...Did I mention that they are often broken and don't
always work?

Many beowulfers would disagree that
ssh is necessary on the nodes, as they are typically on a private
network behind a system that functions as a de facto firewall (all of
which is true). Tough, they're wrong and I'm right. They can write
their own book. Even on strictly functional grounds, rsh sucks (in
addition to being, as Mr. Protocol poetically put it in Sun Expert
11.3, a ``rampaging security hole masquerading as a convenient remote
command execution facility''). It needs to die, die, die and it won't
as long as there are still pitiful fools who still use it.

Oooo, fighting words on any linux list. So let
me be specific to try to avoid stimulating a supernova in my
neighborhood. This is an introductory book, right? So I mean the best
way for relatively unskilled people to achieve a soundly designed,
clearly scalable, simply manageable, and robust beowulfish cluster. If
you know enough to argue with the word ``best'', relax. It isn't
intended for you anyway. Best is relative.

Amazing Facts: Yes Virginia, humans have on
multiple occasions been struck by meteors. On many of those occasions
they have lived. Humans have (to the best of my knowledge) never
succeeded in writing a single script that would install Red Hat, SuSE,
Caldera, Debian, Slackware, TurboLinux (and the list goes on) in a
beowulfish configuration. I rest my case.

Meaning that one has to boot a host at least one time with
an SVGA card and monitor and keyboard attached, which is a pain that
adds thirty minutes to each node install. It's your time, but I'd just
buy the damn $30 card if I were you.

Actually, I'm joking, here, sort of, in case you
were taking the word ``conspiracy'' too seriously. Diskless operation
requires a sophisticated operating system and seamless
remote disk services, and, well, who would ``conspire'' to keep their
operating systems hopelessly unsophisticated and incapable of real
networking? It must be a cruel accident; nobody could be that stupid.

From this
point of view, 101 is a foolish choice - I should start the beowulf at
192.168.1.128, 192.168.1.129, ... (for example) so that I can identify
all ``node'' addresses by a mask on the highest bit. True enough, but I
don't need to subnet and it's certainly easier to guess the IP number of
b57 if one starts on 101.
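That ``mask on the highest bit'' trick is just ordinary subnetting: with nodes starting at 192.168.1.128, everything in 192.168.1.128/25 is a node. A sketch using Python's ipaddress module (the module is my convenience here; any netmask arithmetic does the same thing):

```python
import ipaddress

# Nodes occupy the upper half of the class C network: last octet >= 128.
NODES = ipaddress.ip_network("192.168.1.128/25")

def is_node(addr):
    """True if addr falls in the node half of 192.168.1.0/24."""
    return ipaddress.ip_address(addr) in NODES

print(is_node("192.168.1.129"))  # True: a node address
print(is_node("192.168.1.57"))   # False: a desktop or server address
```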

Note that if you don't know what a Markov process is or
how importance sampling Monte Carlo works, don't worry about it.
Imagine me rolling lots and lots and lots of electronic dice and
playing snakes and ladders. It's close enough.

Narrow minded of them, don't you think? Do your computer
science homework on time or advance the cause of science? Decisions,
decisions. Actually, they rapidly learned to just reboot the systems
when they discovered my job running, temporarily foiling my automated
job spawner...

This was long
ago when if you'd said ``linux'' to an Adaptec rep they'd have said
``Huh?''. Adaptec at the time had this nasty habit of changing the card
BIOS without changing their revision number. I believe they still do.