To achieve the best scaling behavior, we can see from the previous chapter that we want to maximize the parallel fraction of a program (the part that can be split up) and minimize the serial fraction (which cannot). We also want to maximize the time spent doing work in parallel on each node and minimize the time required to communicate between nodes. Finally, we want to avoid wasting time by having some nodes sit idle waiting for other nodes to finish.
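
To keep this quantitative, recall Amdahl's law from the previous chapter, restated here in our own notation with $s$ the serial fraction and $p = 1 - s$ the parallel fraction. The best speedup attainable on $P$ nodes is
\[
  S(P) = \frac{1}{s + p/P} \le \frac{1}{s} ,
\]
so no matter how many nodes we buy, the serial fraction alone caps the speedup.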

However, we must be cautious and define our real goals clearly, since in general they aren't to ``achieve the best scaling behavior'' (unless one is a computer scientist studying the abstract problem, of course). More commonly, in applications, the goal is to ``get the most work done in the least amount of time given a fixed budget''. Once economic constraints enter the picture, one has to carefully consider the trade-offs between the computational speed of the nodes, the speed and latency of the network, the size of memory and its speed and latency, and the size, speed and latency of any hard storage subsystem that might be required. Virtually any conceivable combination of system and network speed can turn out to be cost-benefit optimal, getting the most work done for a given budget and parallel task.
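
A toy model shows the shape of the trade-off (the symbols are ours, invented purely for illustration). A budget $B$ at a cost $c$ per node buys $N = B/c$ nodes, and the completion time for a job is roughly
\[
  T(N) \approx t_s + \frac{t_p}{N} + t_c(N) ,
\]
where $t_s$ is the serial time, $t_p$ the total parallel work, and $t_c(N)$ a communication overhead that typically grows with $N$. Cheaper, slower nodes increase $t_p$ but buy a larger $N$; a faster network shrinks $t_c(N)$ but leaves less of $B$ for nodes. Either choice can minimize $T$, depending on the problem.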

As the discussion proceeds, it will become clear why successful beowulf design focuses on the problem as much as it does on the hardware. One perfectly reasonable conclusion a reader can draw from this chapter is that understanding the nuances of computer hardware and their effect on program speed is ludicrously difficult, and that only individuals with some sort of obsessive-compulsive personality disorder would ever try it. It is so much simpler to just measure the performance of any given hardware platform on your program.
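
If ``just measure it'' is the road you take, the measurement itself is easy. The following C sketch times a simple kernel (a vector multiply, standing in for the inner loop of your own application) with the POSIX \verb|clock_gettime()| call; the result is printed afterward so an optimizing compiler cannot discard the loop.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    1000000   /* vector length */
#define REPS 100       /* repetitions to time */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)      /* the kernel being measured */
        for (size_t i = 0; i < N; i++)
            c[i] = a[i] * b[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* consume the result so the loop is not optimized away */
    printf("c[0] = %g\n", c[0]);
    printf("%.1f million multiplies/second (%.3f s total)\n",
           (double)REPS * N / secs / 1e6, secs);

    free(a); free(b); free(c);
    return 0;
}
\end{verbatim}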

When you achieve this Satori, bravo! However, be warned that the wisest
course is to both measure performance and understand at
least a bit about how that measured performance is likely to vary when
you vary things like the clock speed of your CPU, the CPU's
manufacturer, the kind and speed of the memory subsystem, and the
network.

The variations can be profound. As we'll see, when we double the size of a program's working set (say, the length of the vectors being multiplied within it), the program can take twelve or more times as long to complete over certain ranges of sizes. You could naively make your measurement where performance is great, expecting it to remain great in production for much larger vectors or matrices. You could then expend large sums of money buying nodes, fail miserably to get the work accomplished that you expected to accomplish for that sum, and (fill in your own favorite personal disaster) get fired, not get tenure, lose your grant, go broke, lose the respect of your children and pets. Alternatively, you could make your measurement for large matrices, assume that fast memory systems are essential, and spend a great deal for relatively few of them, only to find that by the time the problem is split up it would run just as fast on nodes with slower memory that cost 1/10 as much (so you could have bought 10x as many).
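
You can watch this happen on your own hardware with a variant of the sketch above that sweeps the vector length from well inside L1 cache to well beyond L2/L3. The size range and repetition count below are assumptions, and the location of the performance cliff depends entirely on your CPU's cache hierarchy; on most machines, though, the rate drops sharply, sometimes severalfold, once the three vectors no longer fit in cache.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)     /* wall-clock seconds */
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void)
{
    /* sweep from 1K to 4M doubles per vector */
    for (size_t n = 1024; n <= 4 * 1024 * 1024; n *= 2) {
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* hold the total number of multiplies constant so each
           size runs for a comparable wall-clock time */
        size_t reps = (256 * 1024 * 1024) / n;
        double t0 = now();
        for (size_t r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                c[i] = a[i] * b[i];
        double dt = now() - t0;

        printf("n = %8zu : %7.1f million multiplies/second (c[0] = %g)\n",
               n, (double)reps * n / dt / 1e6, c[0]);
        free(a); free(b); free(c);
    }
    return 0;
}
\end{verbatim}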

This isn't as hard to understand as it may now seem. I'll try to explain how and why this can occur, and illustrate it with enough pictures, graphs, and examples that it becomes clear. Once you understand what this chapter has to offer, you'll understand how to study your problem in a way that is unlikely to produce embarrassing and expensive mistakes in your final beowulf design.