Finding the truly optimal design can be difficult. In some cases the
only way to determine a program's performance on a given hardware
and software platform (or beowulf design) is to do a lot of prototyping
and benchmarking of the program itself. From this one can generally
determine the best design empirically (where hopefully one has enough
funding to support the prototyping and then scale the successful design
up into the production beowulf). This is almost always the best thing
to do, if one can afford it.

However, even if you are able to prototype and benchmark your
actual application, the design process is significantly easier if you
possess a detailed and quantitative knowledge of the various microscopic
rates, latencies, and bandwidths involved, and of how they depend
nonlinearly on certain system and program parameters and features.
Let's begin by understanding just what these things are.

A rate is a given number of operations per unit time, for
example, the number of double precision multiplications a CPU can
execute per second. We might like to know the ``maximum'' rate at which
a CPU can execute floating point instructions under ideal circumstances. We
might be even more interested in how the ``real world'' floating point
rate depends on (for example) the size and locality of the memory
references being operated upon.
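To make the notion concrete, a rate can be estimated by timing a batch of operations and dividing the count by the elapsed time. The following sketch (a toy in Python, where interpreter overhead swamps the raw hardware rate that a tool written in C would measure) illustrates the arithmetic; the batch size is an arbitrary illustrative choice:

```python
import time

def measure_mult_rate(n=1_000_000):
    """Estimate a floating point multiplication rate in operations/sec.

    n is an arbitrary batch size chosen for illustration; a real
    microbenchmark would repeat the trial and take the best result.
    """
    x = 1.0000001
    t0 = time.perf_counter()
    for _ in range(n):
        x = x * 1.0000001      # one double precision multiply per pass
    elapsed = time.perf_counter() - t0
    return n / elapsed         # operations per unit time = a rate

if __name__ == "__main__":
    print(f"approximate multiply rate: {measure_mult_rate():.3e} ops/sec")
```

A serious measurement would of course be done in compiled code, pinned to circumstances (operand size, cache locality) that it controls explicitly.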

A latency is the time the CPU (or other subsystem) has to
wait for a resource or service to become available after it is
requested, and it has the units of an inverse rate - milliseconds per
disk seek, for example. A latency isn't necessarily the inverse of a rate,
however, because the latency often is very different for an isolated
request and a streaming series of identical requests.
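The isolated-versus-streaming distinction can be seen even with a toy ``resource'' -- here a hypothetical serve() function standing in for a disk or network service (it is an assumption of this sketch, not a real API). The latency of one cold request need not match the amortized per-request time of a stream of back-to-back requests:

```python
import time

def serve():
    """Hypothetical stand-in for one request to some resource."""
    return sum(range(100))     # a little work per request

def isolated_latency():
    # Latency of a single isolated request, in seconds per request.
    t0 = time.perf_counter()
    serve()
    return time.perf_counter() - t0

def streaming_latency(n=10_000):
    # Amortized latency over a stream of identical back-to-back requests.
    t0 = time.perf_counter()
    for _ in range(n):
        serve()
    return (time.perf_counter() - t0) / n

if __name__ == "__main__":
    print(f"isolated:  {isolated_latency():.3e} s/request")
    print(f"streaming: {streaming_latency():.3e} s/request")
```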

A bandwidth is a special case of a rate. It measures
``information per unit time'' being delivered between subsystems (for
example between memory and the CPU). Information in the context of
computers is typically data or code organized as a byte stream, so a
typical unit of bandwidth might be megabytes per second.
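A bandwidth measurement, then, is just byte count over elapsed time. The sketch below estimates memory-to-memory copy bandwidth in megabytes per second; the 16 MB buffer size is an arbitrary illustrative choice, and lmbench's bw_mem tool does the same job far more carefully, varying the size to expose the cache hierarchy:

```python
import time

def copy_bandwidth_mb_per_s(size=16 * 1024 * 1024):
    """Estimate memory-to-memory copy bandwidth in megabytes/second."""
    src = bytearray(size)
    t0 = time.perf_counter()
    dst = bytes(src)           # one full copy of the byte stream
    elapsed = time.perf_counter() - t0
    assert len(dst) == size
    return (size / (1024 * 1024)) / elapsed

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_mb_per_s():.1f} MB/sec")
```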

Latency is very important to understand and quantify as in many
cases our nodes will be literally sitting there and twiddling their
thumbs waiting for a resource. Latencies may be the dominant
contribution to the communications times in our performance equations
above. Also (as noted) rates are often the inverse of some latency.
One can equally well talk about the rate that a CPU executes floating
point instructions or the latency (the time) between successive
instructions which is its inverse. In other cases such as the network,
memory, or disk, latency is just one factor that contributes to overall
rates of streaming data transfer. In general a large latency translates
into a low rate (for the same resource) for a small or isolated request.
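The inverse relationship for isolated requests is simple arithmetic, worth making explicit. The 10 ms figure below is illustrative, not a measurement: a disk with a 10 ms seek latency can sustain at most about 100 isolated seeks per second, however high its streaming bandwidth may be.

```python
def rate_from_latency(latency_s):
    """Rate (requests/sec) implied by an isolated-request latency."""
    return 1.0 / latency_s

def latency_from_rate(rate_per_s):
    """Latency (sec/request) implied by a sustained request rate."""
    return 1.0 / rate_per_s

# Illustrative numbers, not measurements: a 10 ms disk seek
# supports at most about 100 isolated seeks per second.
print(rate_from_latency(0.010))
```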

Clearly these rates, latencies and bandwidths are important determinants
of program performance even for single threaded programs running on a
single computer. Taking advantage of the nonlinearities (or avoiding
their disadvantages) can result in dramatic improvements in
performance, as the ATLAS (Automatically Tuned Linear Algebra Software)
[ATLAS] project has recently made clear. By adjusting both
algorithm and blocksize to maximally exploit the empirical speed
characteristics of the CPU in interaction with the various memory
subsystems, ATLAS achieves a factor of two or more improvement in the
execution speed of a number of common linear algebra operations. Intelligent and
integrated beowulf design can similarly produce startling improvements
in both cost-benefit and raw performance for certain tasks.
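The flavor of what ATLAS does can be suggested (very loosely) by the classic cache-blocking transformation of matrix multiplication: the arithmetic is identical, but the blocked loop order reuses each small tile of the operands many times while it is still resident in cache. The sketch below is a toy in pure Python, where interpreter overhead swamps any cache effect; in compiled code on real hardware this transformation, with the block size tuned empirically, is where much of ATLAS's improvement comes from.

```python
def matmul_naive(A, B):
    """Straightforward triple loop: C = A * B for square lists-of-lists."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_blocked(A, B, bs=4):
    """Same arithmetic, iterated over bs x bs tiles so that (on real
    hardware) each tile is reused while it sits in cache.  The block
    size bs is the parameter ATLAS would tune empirically."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

Both routines compute the same product; only the order of memory traversal differs.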

It would be very useful to have all of the basic rates relevant to
tuning program and beowulf design automatically available. At this time
there is no daemon or kernel module that can
provide this empirically determined and standardized information to a
compiled library. As a consequence, the ATLAS library build (which must
measure the key parameters in place) is so complex that it can take
hours to build on a fast system.

There do exist various standalone (open source) microbenchmarking tools
that measure a large number of the things one might need to measure to
guide thoughtful design. Unfortunately, many of these tools measure
only isolated performance characteristics, and as we will see below,
isolated numbers are not always useful. However, one toolset has
emerged that by design contains (or will soon contain) a full
suite of the elementary tools for measuring precisely the rates,
latencies, and bandwidths that we are most interested in, using a common
and thoroughly tested timing harness. This tool is not
complete, but it has the
promise of becoming the fundamental toolset to support systems
engineering and cluster design. It is Larry McVoy and Carl Staelin's
``lmbench'' toolset[lmbench].

In two areas the alpha version 2 of this toolset used in this paper
was still missing tools: measuring network throughput and measuring raw
``numerical'' CPU performance (although many of the missing features and
more have recently been added to lmbench by Carl Staelin after some
gentle pestering). In the meantime, the well-known netperf (version 2.1,
patch level 3) [netperf] and a privately written tool [cpu-rate] were
used for these measurements.

All of the tools that will be discussed are open source in the sense
that their source can be readily obtained on the network and that no
royalties are charged for its use. The lmbench suite, however, has a
general use license that is slightly more restrictive than the usual GNU
General Public License (GPL), as described below.

In the next subsections the results of applying these tools to measure
system performance in my small personal beowulf cluster[Eden] will
be presented. This cluster is moderately heterogeneous and functions in
part as a laboratory for beowulf development. A startlingly complete
and clear profile of system performance and its dependence on things
like code size and structure will emerge.