One
of the more interesting projects I had the pleasure to tackle over the
last 12 months has been the commissioning, testing and operational use
of a large Pentium cluster. In this month's feature I will endeavour to
provide a practical perspective on many of the issues discussed in the
1998 "Supercomputing on a Shoestring" feature.

The subject of this discussion is the parametric processing
cluster. Sadly the term cluster has been used and abused by the vendor
community every which way, as a result of which the average sysadmin's
first question will always be "what kind of cluster ?" A parametric
processing cluster is commonly referred to in the literature as a COW
(Cluster of Workstations), NOW (Network of Workstations) or PoPC (Pile
of PCs), and is a processing arrangement in which a large number of
like or different machines are connected via a high speed network, and
via the use of appropriate software tools, harnessed to work on a
single problem.

The essential idea in clustering is to achieve via the use of
a large number of "small" machines the same aggregate number of
computing cycles, and aggregate amount of memory, as would otherwise
only be found in a "classical" supercomputer such as a Cray, or a very
heavily configured Unix multiprocessor.

All supercomputing applications are by their very nature
sensitive to the type of computational problem to be solved. The
"classical" supercomputer, typified by the Cray family, the Convex
"Crayettes" or the i860 "microCray", achieves its high performance via
the use of vector processing techniques, and in larger machines, the
generous use of Gigabytes of extremely fast and expensive RAM.

In a vector processing machine the conventional machine
instruction set is expanded to include vector instructions, which are
designed to operate on arrays of operands, or vectors, rather than
single operands. An example of a vector instruction would be a dot
product or cross product operation, where the operands (arguments) are
pointers to arrays (vectors) of data, and a vector size. When such an
instruction is encountered, the processor will repetitively crunch
through each and every individual operand in the vector, using
dedicated integer or floating point vector processing hardware. Since
such a vector operation involves a large number of like computations,
it can be very efficiently pipelined for maximum speed. Moreover, since
there is no need to execute an opcode fetch cycle for every individual
operation, processor time and bus bandwidth is not wasted. As a result,
vector processors implemented in any given hardware technology can
usually perform vector intensive computations many times faster than a
conventional processor would.

The snag with vector processing is that its superb performance
potential is confined to problems which lend themsleves to being
vectorised. Large scale matrix intensive work, such as finite element
analysis techniques used in engineering or science are a typical
example. Problems of this ilk can be easily remapped into a form where
the vector processor has an unbeatable advantage over any other player
in the market.

Difficulties will arise if the problem cannot be easily
remapped into a series of large scale matrix bashing operations. Many
"real world" computing problems fall into this category, and until now
users needing to do such work were severely hampered by the limitations
of vector processors, many of which delivered very pedestrian
non-vector compute performance.

The answer for many such problems lies in the general area of
parallel processing, where a large number of general purpose processors
are harnessed in parallel. The "gotcha" of course lies in the issue of
how to parallelise the application in the tidiest manner to get this to
work.

One approach which was fashionable for many years was the
parallelising compiler, which would take the source code for the
program, analyse it for those components which have no mutual
dependencies in data (ie the results of these computations do not
depend upon each other's results), and execute these on separate CPUs
in the parallel multiprocessor.

The sad reality is that there are few such compilers out in
the market (I have yet to see one of a commercial standard). There are
good reasons why, insofar as such a compiler has to be capable of
splitting the program into multiple threads in various places, running
these on individual CPUs, and then returning to a single thread of
execution as appropriate. This is not an easy thing to do, if it is to
be completely hidden from the user. The compiler must incorporate a
complete run time environment for forking off multiple processes in the
"parallel" portions of the program, and then merging the results of
these, all transparently. If the program has many portions of code
easily parallelised, and bits which cannot be parallelised in between,
significant overheads can be incurred at runtime. The "granularity" of
the parallelisable portions of the program can create problems of its
own.

For many computational problems this is not really necessary,
and a particular class of such problems are the category of "parametric
problems". A parametric problem is one in which a specific computation,
ie program, must be repeatedly executed using different starting
conditions or other parameters. A good example is any simulation which
must be run for a large number of scenarios.

Consider a computation where you are trying to analyse the
behaviour of a storm water drainage system in a large city. Rain can
fall with varying intensities across various suburbs, at varying times.
Therefore a parametric modelling approach would see us writing a
simulation for the flow rates through the drains, and running it a huge
number of times for different rainfall maps, and then comparing the
results of each of these many compute runs to identify the conditions
under which specific drains overflow and flood their neighbourhoods.

A great many practical problems fall into this category. My
problem for instance involved analysing the behaviour of large mobile
ad hoc networks, under varying conditions of weather, rainfall, cloud
and operating frequency. Many of the computations were very non-linear,
indeed so much so that a Cray would have been almost utterly useless
for this purpose. However, a cluster of 60 Pentiums did the job very
nicely indeed.

A great many environmental, economics, engineering and
operations research problems are prime candidates for parametric
computing, and we can expect to see this market progressively expand in
coming years as users in these specialities come to appreciate what
they missing.

Implementing a COW, NOW
or PoPC

At a first glance implementing a COW/NOW/PoPC or cluster is a
trivial chore, we simply hook up a large number of machines and get
down to work. In practice there are many considerations involved, and
much careful thought is required to get the desired end result.

The starting point, and arguably the central issue, is the
choice of the software runtime environment, since this imposes
constraints upon other areas. A number of clustering toolsets exist in
the marketplace at this time, most of which are oriented toward the
"message passing" model. This model is based upon the idea of splitting
a large problem into many small chunks, and using a message passing
mechanism to exchange results across the boundaries in the problem,
these boundaries being the result of partition into smaller chunks. The
other alternative is to employ a parametric processing toolset, if your
problem falls into this category.

The choice of the runtime environment and style of parallel
processing will impose constraints in other areas.

The first is the choice of the machine architecture and
operating system you can use. While most modern clustering toolsets are
available for a range of architectures, some thought will need to be
given to what is the best choice for the application.

Applications which will generate large amounts of I/O traffic
to local or network disks will be better served by Unix workstations
since these have traditionally been designed for server and power user
applications, and thus have the I/O bandwidth to cope well.
Applications which are compute and memory bound, rather than I/O bound,
are less demanding of the basic hardware, and can be run on commodity
PC hardware. Where I/O bandwidth is not critical, a typical Pentium or
top end Intel clone will be highly competitive against a Unix
workstation chip, especially in a parametric computing environment,
where more cheaper CPUs can be harnessed to do the same volume of work.

The choice of machine architecture and operating system is of
course constrained by the supported runtime toolset ports, but within
that range of choices we still have some play. If we expect to be
running compute bound problems, each of which takes relatively little
time to execute, then a Linux solution would be very attractive. On the
other hand, if our individual problems take many hours if not days to
execute on our CPU of choice or involve high I/O rates, a more stable
and commercially tuned operating system like a BSD or SVR4 variant
would be preferable choice.

This aspect of the exercise should not be underestimated,
insofar as any problems you encounter running your code on a single CPU
will be magnified N-fold with an N processor cluster. If the operating
system you employ, certain proprietary Intel hosted desktop products
coming to mind here, likes to go belly up about once a day, on a 32 CPU
cluster this is 32 crashes to deal with per day. Poor choices in this
respect can be worth much more trouble than may be incurred by
rehosting the basic application code to a better, even if more
expensive platform.

The choice of processor type and operating system is but one
portion of the hardware problem to be solved. The next issues are
hardware packaging and networking.

Hardware packaging can significantly impact both the life
cycle cost and the uptime of the planned installation. We assume
"headless" hosts in the cluster.

If you are planning a centralised cluster installation,
serious consideration should be given to the use of rackmounted against
desktop machine packaging. Chassis designed for installation in RETMA
19" racks have two important advantages over conventional desktop
chassis in rackmount trays.

The first is that they are designed for the thermal
environment of a rack, and usually have inlets and exhausts located on
the front and the back. A typical arrangement sucks air in through the
front panel and exhausts out the back, thus precluding reingestion of
heated exhaust air, or ingestion of heat dumped from machines lower
down in the racking system. Since hardware failure rates increase with
operating temperature, higher initial expenditure will be offset by
lower through life expenditure and downtime.

The other factor favouring rack mounted chassis is access for
repair or upgrades, since typical rackmount chassis are designed to
slide out on rails for maintenance. A tray mounted desktop chassis has
to be carried away for such work.

Needless to say, rackmounting does allow a higher packing
density in the cabinet which can be a issue for many smaller sites.

The networking problem can also be a source of much heartache
if not considered carefully. The issues revolve around the behaviour of
the application and its implications for performance and reliability.
Since the network is a single point of failure and a potential
performance bottleneck, it is an item which must be done right if it is
not to become a major headache in the production phase.

With a message passing runtime environment, a variant of a
hypercube topology, or a mesh topology will be the most suitable choice
since this allows localised inter-processor messaging traffic to be
isolated from the rest of the cluster. Judicious choice of topology may
allow the use of very cheap networking hardware, or even 10 Base 2
thinwire connections, with no serious performance penalties.

In a parametric processing runtime environment, where the
basic model is essentially a star topology, the issue is one of whether
to use a switch or a dumb hub, and what channel bit rate and adaptor
performance is necessary. The implicit topological model creates in
turn an implicit performance bottleneck.

Given the available hardware in the marketplace, the choices
boil down to 100 Base T, ATM and FDDI. The latter is arguably a legacy
item, but is worth mentioning since some sites may choose to build up a
cluster from retired servers and desktop workstations for which the
former interfaces may not be cheaply available, whereas pre-loved FDDI
hardware may be.

In assessing how much to spend on the networking side, the
central issue is whether the application demands a high I/O rate to a
central storage device. If this is the case, an example being any
application which uses a large common dataset of some kind, then it is
advisable to aim for the best possible network performance. Therefore
each host in the cluster should use a network I/O card with an embedded
processor (if the card can do the front end TCP/IP stack processing,
this is even better), and a fast switch should be employed.

Where the application can get by with a copy of the central
dataset on local machine disk, or doesn't use large datasets, then
embedded network interfaces on the motherboard and a hub may well
suffice.

The final issue is one of sizing the hosts in the cluster and
tuning their performance.

This boils down to the classical host sizing and tuning
problem, and no rocket science is involved. The conventional strategy
of running the application on a single testbed system is employed, to
determine how much memory and how much local disk to install. Swap
space sizing is also conventional. The ground rules are to ensure that
swapping is avoided where possible, and that the I/O performance of the
disk used is compatible with the application.

A compute bound application on an Intel architecture will
mostly get by with commodity IDE disks, with an I/O intensive
application wide ultra-SCSI and high performance adaptor boards will be
required.

As a final point it is worth reiterating that careful planning
pays off in the building of clusters, since any problems seen in
one-off installations are by the nature of a cluster magnified N-fold.
If it bites you on the desktop, it will devour you on a cluster.

Activetools Clustor

As an example of a parametric processing toolset I will
explore the Activetools Clustor product (http://www.activetools.com/),
since I have used it extensively, much of this discussion was the
subject of my AUUG 99 paper. Many of the general attributes and issues
will be applicable to other parametric toolsets.

Activetools Clustor is the commercial development of the
Nimrod toolset, devised and implemented by Prof David Abramson and Rok
Sosic. It has been widely marketed in the US and is used by a number of
academic, government and commercial clients to support a range of
applications. Clustor has been ported to Linux on Intel, Solaris on
SPARC, Irix on MIPS, AIX on PowerPC, OSF/1 (or whatever it is being
called these days) on Alpha and HP/UX on HP-PA, as well as NT on Intel
for those enamoured of proprietary operating systems. Activetools will
port to any reasonable architecture at a client's request. Freebie
demos for all supported architectures, with timebombed licences, are
available on their website for users interested in exploring the
product.

The basic model used in Clustor is that of running a gaggle of
"robot users" on a central "root node" machine, each of which will
execute the application using telnet or rsh on one of the cluster's
many "client nodes". The toolset provides an environment in which the
individual parametric "scenarios" can be centrally programmed prior to
runtime, and the activity of the "robot users" managed from a single
point. At runtime, each "robot user" logs into its pre-programmed
client, sets up its local runtime environment, fires up the
application, and upon its completion gathers the results of the
computation and cleans up after itself.

The simplicity of this basic model, analogous in concept to
the remote terminal emulation model used in performance analysis,
belies much of the complexity required to implement it.

At runtime Clustor will create on the target client node a
directory tree, and using a configuration script called a "plan file",
suffixed .pln, it will copy across as required data files, parameter
files and if required the executable binary of the application itself.
When the application has completed, it will use the same script to save
the results back to a specified place on the "root node". Should the
application fail and return with an error code, Clustor will save every
single file into an error directory on the "root node" to facilitate
debugging.

The Clustor plan file uses a powerful scripting language. The
file can be manipulated using a text editor, a technique preferred by
most users with experience in Unix scripting or languages like PERL, or
using a windowed front end "preparator" tool (implemented in X11 or
Windoze depending on the port). A plan file is typically divided into
three portions, a generic setup phase, a runtime phase, and a cleanup
phase.

For Clustor execution the plan file must be preprocessed by
the atcgenerator tool which creates a runtime command file, termed a
"run file", suffixed .run, which is human readable and is directly
interpreted at runtime by Clustor.

To execute the application on the cluster, the "atcdespatcher"
tool is run, using the run file. It will fork off a gaggle of
"atcjobmanager" processes, one for each client CPU. These processes
each run a specific instantiation of the run file, with the programmed
parameters for that specific computational run. The atcdespatcher can
be run in the background, or using a GUI activity monitor display.

Clustor is capable of substituting both command line
arguments, and data fields in configuration files, to parametrize the
job. In the simulations I performed, I used both, to manipulate
arguments and .rc file contents.

Porting the application to run on a cluster, using Clustor,
requires several steps.

Port the application, if necessary, to the client node
operating system. This will typically require disabling of any GUI
features and incorporation of an .rc file or command line argument
interface, as well as accommodating system idiosyncrasies.

Validate the port by suitable regression testing.

Produce the Clustor plan file.

Test the Clustor plan file for a small number of CPUs.

Validate the results from the cluster test runs against
the same results run on individual machines.

The time consumed for this will vary with applications and the
complexity of running the application, which is reflected in the
complexity of the plan script. I expended several days to move my
FreeBSD/Irix port to Linux, and generate the scripts, upon which I
expended 2-3 weeks testing and debugging on the cluster.

It is worth noting that Clustor makes for an excellent
regression testing environment, indeed my experience was that nothing
beats large scale cluster compute runs for finding bugs buried in your
code for months !

Clustor will allow inhomogeneous OS and architecture
environments, such as the use of Unix on the root node and NT on the
client nodes, or varying Unix variants across the cluster. The only
caveat is to ensure that the application binary is available on each
client, and that endian issues are resolved in binary data files.
Inhomogeneous clusters can be advantageous insofar as the root node, a
single point of failure, can be implemented using a more reliable,
higher performance (and expensive) machine and operating system to the
client nodes. Soaking up spare cycles across a site by using
underworked desktop machines to supplement a cluster is entirely
feasible.

Space limitations preclude a more detailed discussion of
issues such as load management in multi-user environments, and specific
experience with particular operating system variants. Interested
readers are invited to read my AUUG99 presentation slides, posted at http://www.csse.monash.edu.au/~carlo/auug99-clustor.ppt
, details of the Monash cluster installation are posted at http://hathor.csse.monash.edu.au/
.

From my perspective as a user I was very pleased with the end
result. The use of a cluster allowed me to process around 1000
simulations, each of which could take up to days on a single Pentium
CPU, over a period of several months, with a very modest management
overhead. The use of the scripts to sort the resulting data proved to
be a major time saver. Using the Monash cluster, which ultimately
incorporated 60 Pentiums across two campuses in Melbourne, allowed me
to complete a simulation project on a scale which was simply not
implementable using the conventional "single box" approach.

Cheap commodity processors and tools such as Clustor have the
long term potential to revolutionise many areas of engineering,
scientific, and operations research computing. It is fair to say that
at this time we are standing on the threshold of another new computing
paradigm.