The FEATFLOW benchmark is a set of test calculations which is very
similar to the 1995-DFG benchmark `Flow around a cylinder'.
In contrast to the original one, however, all input parameters and even
the triangulation are fixed. The remaining point of interest is the total
CPU cost and how it is distributed among the separate tasks of a CFD code,
such as mesh generation, assembly of matrices and right-hand sides,
solution of linear subproblems, and postprocessing steps.

The original idea
behind this benchmark was to have a tool at our institute which helps us
in selecting the "optimal" workstations for our classes of problems. Being
a university institute, we have no money available for supercomputers.
These might even be of little use, however, since most of our working time
is spent on developing and testing methods and codes. Modern workstations
are much more appropriate for this kind of task since, for even less money
than is needed for one single supercomputer, every member of the institute
may get his own workstation. Furthermore, the new generations of processors
in workstations are very fast, even for complex data structures. In
contrast, the data structures which are necessary for complex CFD
applications often lead to performance problems on vector computers.
Moreover, parallel machines require additional work in the design of
algorithms and implementations, which is in general a very hard job. Hence,
modern powerful workstations with 1 to 2 GByte RAM are the typical
workhorses for numerical analysts, and this at relatively small cost.

The major question which typically arises is:
Which kind of workstation/processor is adequate for our work?
There are IBMs, DECs, SUNs, HPs, SGIs and more, and all promise to be the
fastest, or at least as fast as the other competitors, if quantities like
SPEC ratings, MIPS or MFlops are compared.
Next, one takes into account the results of the LINPACK tests, which
measure the time for solving a dense linear system of size 100 x 100 or
1000 x 1000. Again, the results do not differ much, but one may gain first
hints that there are significant hidden problems. Looking at the "official"
LINPACK tests, there is one interesting fact which is not well known to
everybody: LINPACK 100 is performed by applying a FORTRAN77 compiler
directly to the source code, while LINPACK 1000 may even be implemented
"highly optimized" in assembler language. The result is often the following:

If one uses
the FORTRAN compiler with the (given) compiler options, in most cases the
efficiency rates quoted by the vendors can be reproduced. Doing the same
for the LINPACK 1000 test, only 5 % to 50 % of the "optimal" performance is
usually obtained. Hence, not only the "peak" performance of the processor
is decisive, but also the shipped compiler and, even more important, the
knowledge of the "optimal" compiler options. Furthermore, besides processor
performance and compiler options there is a third major point which is
often the bottleneck: the memory system and the time needed to move data
from RAM into the processor's cache. In fact, there are so many factors
which can influence the performance of an actual application that nobody
can predict, by theoretical considerations alone, how fast it will run on a
certain workstation configuration.

This knowledge, and bad experiences with certain workstations, was the
starting point for developing our own benchmark configurations, namely the
FEATFLOW benchmark indicated above. The results should answer the question:

`Which workstation configuration, that is, in practice, which processor
and which compiler options, gives us the "best possible performance for
our money" without explicitly changing the code?'

Besides the fact that we obtained some valuable knowledge about the
practical behaviour of hardware and software components, we additionally
found that this performance analysis also influences the design of the
numerical and algorithmic components. One of the most surprising results
was that our preferred nonstationary solution method, the discrete
projection scheme, does not always run optimally on modern workstation
platforms. This was a further reason to think about improved methods,
which are developed and described in the next chapter. This is an
important aspect of scientific research which lies outside of purely
mathematical considerations and which has to be mentioned explicitly in
this context.

The following tests are performed with about 60,000 mesh cells in 2D,
aimed at computers with at least 64 MByte RAM, and with 75,000 cells in 3D
for machines with 128 MByte RAM or more.

We apply a coupled multigrid
approach with the mentioned Vanka smoother in a direct steady solver
(CC2D or CC3D), and the discrete projection scheme as nonstationary
scheme iterating towards the stationary limit (PP2D/PP3D).
The Reynolds number for these tests is about Re=20, and all calculations
are performed with the nonconforming rotated multilinear finite elements.
Additionally, we apply stabilization techniques of upwind (UPW) or
streamline diffusion (SD) type for the convective term, and, when applying
the projection scheme, we use SOR or ILU schemes as smoothers in the
multigrid solver for the scalar subproblems. The following abbreviations
for the elapsed time in seconds are used: