"Linux Gazette...making Linux just a little more fun!"

Linux Benchmarking - Concepts

This is the first article in a series of 4 articles on Linux Benchmarking,
to be published by the Linux Gazette. This article deals with the fundamental
concepts in computer benchmarking, as they apply to the Linux OS. An example
of a classic benchmark, "Whetstone", is analyzed in more detail.

A benchmark is a documented procedure that will measure the time needed by a computer system to execute a well-defined computing task. It is assumed that this time is related to the performance of the computer system and that somehow the same procedure can be applied to other systems, so that comparisons can be made between different hardware/software configurations.

From the definition of a benchmark, one can easily deduce that there are two basic procedures for benchmarking:

1. Measuring the time it takes for the system being examined to loop through a fixed number of iterations of a specific piece of code.

2. Measuring the number of iterations of a specific piece of code executed by the system under examination in a fixed amount of time.

If a single iteration of our test code takes a long time to execute, procedure 1 will be preferred. On the other hand, if the system being tested is able to execute thousands of iterations of our test code per second, procedure 2 should be chosen.
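
The two procedures can be sketched in a few lines of C. This is only an illustration: the function names, the trivial test_code workload and the choice of the POSIX clock_gettime call with CLOCK_MONOTONIC are my assumptions, not part of any standard benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical test code: one multiply and one add per call. */
static volatile double sink = 1.0;
void test_code(void)
{
    sink = sink * 1.000001 + 0.000001;
}

/* Procedure 1: time a fixed number of iterations.
 * Returns seconds/iteration. */
double seconds_per_iteration(void (*code)(void), long iterations)
{
    double start = now_seconds();
    for (long i = 0; i < iterations; i++)
        code();
    return (now_seconds() - start) / (double)iterations;
}

/* Procedure 2: count the iterations completed in a fixed time.
 * Returns iterations/second. Note that reading the clock on every
 * iteration adds overhead to the figure. */
double iterations_per_second(void (*code)(void), double duration)
{
    long count = 0;
    double start = now_seconds();
    double elapsed;
    do {
        code();
        count++;
        elapsed = now_seconds() - start;
    } while (elapsed < duration);
    return (double)count / elapsed;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("procedure 1: %g s/iteration\n",
           seconds_per_iteration(test_code, 1000000L));
    printf("procedure 2: %g iterations/s\n",
           iterations_per_second(test_code, 0.5));
    return 0;
}
#endif
```

Compile with -DDEMO_MAIN to get a small program that prints both figures. A real benchmark would check the clock less often in procedure 2, so that the timing calls do not dominate very short test code.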

Both procedures 1 and 2 will yield final results in the form "seconds/iteration" or "iterations/second" (these two forms are interchangeable). One could imagine other algorithms, e.g. self-modifying code or measuring the time needed to reach a steady state of some sort, but this would increase the complexity of the code and produce results that would probably be next to impossible to analyze and compare.

Sometimes, figures obtained from standard benchmarks on a system being tested are compared with the results obtained on a reference machine. The reference machine's results are called the baseline results. If we divide the results of the system under examination by the baseline results, we obtain a performance index. Obviously, the performance index for the reference machine is 1.0. An index has no units; it is just a relative measurement.
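
As a small worked example, the index is a single division. The function name and the figures below are hypothetical. I assume both results are rates in iterations/second, so an index above 1.0 means faster than the reference machine; with results expressed as times, the reading is reversed.

```c
#include <assert.h>
#include <stdio.h>

/* Performance index: result of the system under test divided by the
 * baseline result. Both must be in the same units, so the units cancel. */
double performance_index(double result, double baseline)
{
    assert(baseline > 0.0);  /* a zero baseline makes the index meaningless */
    return result / baseline;
}

#ifdef DEMO_MAIN
int main(void)
{
    /* Hypothetical figures, in iterations/second. */
    double baseline = 1500.0;
    double tested = 2250.0;

    printf("system under test: %.2f\n", performance_index(tested, baseline));
    printf("reference machine: %.2f\n", performance_index(baseline, baseline));
    return 0;
}
#endif
```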

The final result of any benchmarking procedure is always a set of numerical results which we can call speed or performance (for that particular aspect of our system effectively tested by the piece of code).

Under certain conditions we can combine results from similar tests or various indices into a single figure, and the term metric will be used to describe the "units" of performance for this benchmarking mix.

Time measurements for benchmarking purposes are usually taken by defining a starting time and an ending time, the difference between the two being the elapsed wall-clock time. Wall-clock means we are not considering just CPU time, but the "real" time usually provided by an internal asynchronous real-time clock source in the computer or an external clock source (your wrist-watch for example). Some tests, however, make use of CPU time: the time effectively spent by the CPU of the system being tested in running the specific benchmark, and not other OS routines.
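
The difference between the two kinds of time is easy to see with a program that sleeps and then computes. The sketch below is one possible way to do it, assuming a POSIX system: wall-clock time comes from clock_gettime with CLOCK_MONOTONIC, CPU time from the standard C clock function, and the helper names are mine.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Elapsed wall-clock time in seconds (monotonic clock). */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* CPU time consumed so far by this process, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Sleep for the given number of milliseconds without using the CPU. */
void sleep_ms(long ms)
{
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&req, NULL);
}

#ifdef DEMO_MAIN
int main(void)
{
    double w0 = wall_seconds();
    double c0 = cpu_seconds();

    sleep_ms(200);                        /* wall-clock time passes, CPU time barely moves */

    volatile double x = 0.0;
    for (long i = 0; i < 50000000L; i++)  /* both clocks advance */
        x += 1e-9;

    printf("wall: %.3f s, cpu: %.3f s\n",
           wall_seconds() - w0, cpu_seconds() - c0);
    return 0;
}
#endif
```

On a lightly loaded machine the wall-clock figure should exceed the CPU figure by roughly the time spent sleeping.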

Resolution and precision both measure the information provided by a data point, but should not be confused.

Resolution is the minimum time interval that can be (easily) measured on a given system. In Linux running on i386 architectures I believe this is 1/100 of a second, provided by the GNU C system library function times (see /usr/include/time.h - not very clear, BTW). Another term used with the same meaning is "granularity". David C. Niemi has developed an interesting technique to lower granularity to very low (sub-millisecond) levels on Linux systems; I hope he will contribute an explanation of his algorithm in the next article.
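
Rather than guessing, one can ask the system for its timer granularity. The sketch below is not David C. Niemi's technique; it only queries two standard POSIX interfaces, sysconf(_SC_CLK_TCK) for the tick used by times and clock_getres for a POSIX clock. The function names are mine.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Granularity of times(): one clock tick, in seconds.
 * Returns -1.0 if the tick rate is not available. */
double times_tick_seconds(void)
{
    long ticks_per_second = sysconf(_SC_CLK_TCK);
    return ticks_per_second > 0 ? 1.0 / (double)ticks_per_second : -1.0;
}

/* Resolution the system reports for a POSIX clock, in seconds.
 * Returns -1.0 on error. */
double clock_resolution_seconds(clockid_t id)
{
    struct timespec res;
    if (clock_getres(id, &res) != 0)
        return -1.0;
    return (double)res.tv_sec + (double)res.tv_nsec / 1e9;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("times() tick:    %g s\n", times_tick_seconds());
    printf("CLOCK_MONOTONIC: %g s\n",
           clock_resolution_seconds(CLOCK_MONOTONIC));
    return 0;
}
#endif
```

The reported resolution is what the system claims; the interval you can actually measure also depends on the cost of the timing call itself.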

Precision is a measure of the total variability in the results for any given benchmark. Computers are deterministic systems and should always provide the same, identical benchmark results if running under identical conditions. However, since Linux is a
multi-tasking, multi-user system, some tasks will be running in the background and will eventually influence the benchmark results.

This "random" error can be expressed as a time measurement (e.g. 20 seconds + or - 0.2 s) or as a percentage of the figure obtained by the benchmark considered (e.g. 20 seconds + or - 1%). Other terms sometimes used to describe variations in results are "variance", "noise", or "jitter".
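
Both ways of writing the error can be computed from a handful of repeated runs. The sketch below uses the simplest possible measure, half the spread between the slowest and the fastest run; this choice, the function names and the three sample timings are my assumptions. A standard deviation would be the more usual statistic for a larger number of runs.

```c
#include <assert.h>
#include <stdio.h>

/* Mean of n timing results. */
double mean(const double *t, int n)
{
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    return sum / n;
}

/* Half the spread between the slowest and fastest run:
 * the "+ or -" figure, in the same units as the results. */
double half_range(const double *t, int n)
{
    assert(n > 0);
    double lo = t[0], hi = t[0];
    for (int i = 1; i < n; i++) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return (hi - lo) / 2.0;
}

/* The same error expressed as a percentage of the mean. */
double half_range_percent(const double *t, int n)
{
    return 100.0 * half_range(t, n) / mean(t, n);
}

#ifdef DEMO_MAIN
int main(void)
{
    /* Hypothetical results of three runs, in seconds. */
    double runs[] = { 19.8, 20.0, 20.2 };

    printf("%.1f s + or - %.1f s (%.1f%%)\n",
           mean(runs, 3), half_range(runs, 3), half_range_percent(runs, 3));
    return 0;
}
#endif
```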

Note that whereas resolution is system dependent, precision is a characteristic of each benchmark. Ideally, a well-designed benchmark will have a precision smaller than or equal to the resolution of the system being tested. It is very important to identify the sources of noise for any particular benchmark, since this provides an indication of possibly erroneous results.

A commonly executed application is chosen and the time to execute a given task with this application is used as a benchmark. Application benchmarks try to measure the performance of computer systems for some category of real-world computing task. Measuring the time your Linux box takes to compile the kernel can be considered as a sort of application benchmark.

A benchmark or its results are said to be irrelevant when they fail to effectively measure the performance characteristic the benchmark was designed for. Conversely, benchmark results are said to be relevant when they allow an accurate prediction of real-life performance or meaningful comparisons between different systems.

The performance of a Linux system may be measured by all sorts of different benchmarks:

Kernel compilation performance.

FPU performance.

Integer math performance.

Memory access performance.

Disk I/O performance.

Ethernet I/O performance.

File I/O performance.

Web server performance.

Doom performance.

Quake performance.

X graphics performance.

3D rendering performance.

SQL server performance.

Real-time performance.

Matrix performance.

Vector performance.

File server (NFS) performance.

Etc...

Conclusion I: it's obvious that no single benchmark can provide results for all the above items.

Conclusion II: you must first decide what you are trying to measure, then choose an appropriate benchmark (or write your own).

Conclusion III: it's impossible to come up with a single figure (called Single Figure of Merit in benchmarking terminology) that will summarize the performance of a Linux system. Hence, no "Lhinuxstone" metric exists.

Conclusion IV: benchmarking always takes more time than you thought it would.

3. FPU tests: Whetstone and Sons, Ltd.

Floating-point (FP) instructions are among the least used while running
Linux. They probably represent < 0.001% of the instructions executed
on an average Linux box, unless one deals with scientific computations.
Besides, if you really want to know how well designed the FPU in your processor
is, it's easier to have a look at its data sheet and check how many clock
cycles it takes to execute a given FPU instruction. But there are more
benchmarks that measure FPU performance than anything else. Why?

RISC, pipelining, simultaneous issuing of instructions, speculative
execution and various other CPU design tricks make CPU performance,
especially FPU performance, difficult to measure directly and simply. The
execution time of an FPU instruction varies depending on the data, and
a continuous stream of FPU instructions will execute under special circumstances
that make direct predictions of performance impossible in most cases. Simulations
(synthetic benchmarks) are needed.

FPU tests are easier to write than other benchmarks. Just put a bunch
of FP instructions together and make a loop: voilà!

The Whetstone benchmark is widely (and freely) available in Basic,
C and Fortran versions, in case you don't want to write your own FPU test.

FPU figures look good for marketing purposes. Here is what Dave Sill,
the author of the comp.benchmarks FAQ, has to say about MFLOPS: "Millions
of Floating Point Operations Per Second. Supposedly the rate at which the
system can execute floating point instructions. Varies widely between different
benchmarks and different configurations of the same benchmarks. Popular
with marketing types because it's sounds like a "hard" value
like miles per hour, and represents a simple concept."

If you are going to buy a Cray, you'd better have an excuse for it.

You can't get a data sheet for the Cray (or don't believe the numbers),
but still want to know its FP performance.

You want to keep your CPU busy doing all sorts of useless FP calculations,
and want to check that the chip gets very hot.

You want to discover the next big bug in the FPU of your processor,
and get rich speculating with the manufacturer's shares.

Etc...

3.1 Whetstone history and general features

The original Whetstone benchmark was designed in the 60's by Brian Wichmann
at the National Physical Laboratory, in England, as a test for an ALGOL
60 compiler for a hypothetical machine. The compilation system was named
after the small town of Whetstone, where it was designed, and the name
seems to have stuck to the benchmark itself.

The first practical implementation of the Whetstone benchmark was written
by Harold Curnow in FORTRAN in 1972 (Curnow and Wichmann together published
a paper on the Whetstone benchmark in 1976 for The Computer Journal).
Historically it is the first major synthetic benchmark. It is designed
to measure the execution speed of a variety of FP instructions (+, *, sin,
cos, atan, sqrt, log, exp) on scalar and vector data, but also contains
some integer code. Results are provided in MWIPS (Millions of Whetstone
Instructions Per Second). The meaning of the expression "Whetstone
Instructions" is not clear, though, at least after close examination
of the C source code.

During the late 80's and early 90's it was recognized that Whetstone
would not adequately measure the FP performance of parallel multiprocessor
supercomputers (e.g. Cray and other mainframes dedicated to scientific
computations). This spawned the development of various modern benchmarks,
many of them with names like Fhoostone, as a humorous reference to Whetstone.
Whetstone however is still widely used, because it provides a very reasonable
metric as a measure of uniprocessor FP performance.

Whetstone has other interesting qualities for Linux users:

Its source code is short and relatively easy to understand, with a
clean, self-explanatory structure.

The C version compiles cleanly on Linux boxes with gcc.

Execution time is short: 100 seconds (by design).

It is very precise (small variations in the results).

CPU architecture digression: for the Whetstone benchmark, the object
code that gets looped through is very small, fitting entirely in the L1
cache of most modern processors, hence keeping the FPU pipeline filled
and the FPU permanently busy. This is desirable because Whetstone is doing
exactly what we want it to do: measuring FPU performance, not CPU/L2 cache/main
memory coupling, integer performance or any other feature of the system
under test. Note however that David
C. Niemi has provided some conclusive evidence that at least some interaction
with the L2 cache or main memory is taking place on Pentium (R) systems
(Pentium CPUs have a sophisticated FPU instruction pipeline and can dispatch
two FPU instructions on a single clock cycle. One pipe can execute all
integer and FP instructions, while the other pipe can execute simple integer
instructions and the FXCH FP instructions. This is quoted from Intel's
datasheet on the Pentium processor, available at Intel's
developers site). I wish somebody with Pentium ICE equipment could
investigate this a little further...

3.2 Getting the source and compiling it

Getting the standard C version by Roy Longbottom.

The version of the Whetstone benchmark that we are going to use for
this example was slightly modified by Al Aburto and can be downloaded from
his excellent FTP site dedicated
to benchmarks. After downloading the file whets.c, you will have to
edit the source slightly: a) Uncomment the "#define POSIX1"
directive (this enables the Linux compatible timer routine). b) Uncomment
the "#define DP" directive (since we are only interested
in the Double Precision results).

Compiling

This benchmark is extremely sensitive to compiler optimization options.
Here is the line I used to compile it:

cc whets.c -o whets -O2 -fomit-frame-pointer -ffast-math -fforce-addr -fforce-mem -m486 -lm

Note that some compiler options are buggy in some versions of gcc: most
notably, any of -O, -O2, -O3, ... combined with -funroll-loops can cause
gcc to emit incorrect code on a Linux box. You can test your gcc with a
short test program available at Uwe
Mayer's site. Of course, if your compiler is buggy, then any test results
are not written in stone, to say the least (pun intended). In short, don't
use -funroll-loops to compile this benchmark, and try to stick to the optimization
options listed above.

3.3 Running Whetstone and gathering results

First runs

Just execute whets. Whetstone will display its results on standard output
and also write a whets.res file if you give it the information it requests.
Run it a few times to confirm that variations in the results are very small.

With L1, L2 or both L1 and L2 caches disabled

Some motherboards allow you to disable the L1 (internal) or L2 (external)
caches through the BIOS configuration menus (take a look at the motherboard's
manual; the ASUS P55T2P4 motherboard, for example, allows disabling both
caches separately or together). You may want to experiment with these settings
and/or main memory (DRAM) timing settings.

Without optimization

You can try to compile whets.c without any special optimization options,
just to verify that compiler quality and compiler optimization options
do influence benchmark results.

3.4 Examining the source code, the object code
and interpreting the results

General program structure

The Whetstone benchmark main loop executes in a few milliseconds on
an average modern machine, so its designers decided to provide a calibration
procedure that will first execute 1 pass, then 5, then 25 passes, etc...
until the calibration takes more than 2 seconds, and then guess a number
of passes xtra that will result in an approximate running time
of 100 seconds. It will then execute xtra passes of each one of
the 8 sections of the main loop, measure the running time for each (for
a total running time very near to 100 seconds) and calculate a rating in
MWIPS, the Whetstone metric. This is an interesting variation on the two
basic procedures described in Section 1.
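
The calibration idea can be sketched as follows. This is not the code from whets.c: the function names and the one_pass stand-in workload are hypothetical, and the two thresholds are parameters so that the sketch can be tried with short times. Calling calibrate(one_pass, 2.0, 100.0) corresponds to the 2 second and 100 second figures described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical stand-in for one pass of the benchmark's main loop. */
static volatile double sink = 0.75;
void one_pass(void)
{
    for (int i = 0; i < 1000; i++)
        sink = sink * 0.999999 + 0.000001;
}

/* Time the given number of passes, in seconds. */
double time_passes(void (*pass)(void), long passes)
{
    double start = now_seconds();
    for (long i = 0; i < passes; i++)
        pass();
    return now_seconds() - start;
}

/* Run 1, 5, 25, ... passes until one run takes more than min_seconds,
 * then scale the pass count so that the real run should last about
 * target_seconds. */
long calibrate(void (*pass)(void), double min_seconds, double target_seconds)
{
    long passes = 1;
    double elapsed = time_passes(pass, passes);

    while (elapsed <= min_seconds) {
        passes *= 5;
        elapsed = time_passes(pass, passes);
    }

    long xtra = (long)(target_seconds * (double)passes / elapsed);
    return xtra < 1 ? 1 : xtra;
}

#ifdef DEMO_MAIN
int main(void)
{
    long xtra = calibrate(one_pass, 2.0, 100.0);
    printf("passes needed for about 100 seconds: %ld\n", xtra);
    return 0;
}
#endif
```

The estimate is only as good as the assumption that running time grows linearly with the number of passes, which is reasonable for a small loop that stays in cache.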

Main loop

The main loop consists of 8 sections each containing a mix of various
instructions representative of some type of computational task. Each section
is itself a very short, very small loop, and has its own timing calculation.
The code that gets looped through for section 8 for example is a single
line of C code:

x = sqrt(exp(log(x)/t1)); where x = 0.75 and t1 = 0.50000025,
both defined as doubles.

Executable code size, library calls

Compiling as specified above with gcc 2.7.2.1, the resulting ELF executable
whets is 13 096 bytes long on my system. It calls libc and of
course libm for the trigonometric and transcendental math functions, but
these should get compiled to very short executable code sequences since
all modern CPUs have FPUs with these functions wired-in.

General comments

Now that we have an FPU performance figure for our machine, the next
step is comparing it to other CPUs. Have you noticed all the data that
whets.c asked you for after you had run it for the first time? Well, Al Aburto
has collected Whetstone results for your convenience at his site; you may
want to download the data file
and have a look at it. This kind of benchmarking data repository is very
important, because it allows comparisons between various different machines.
More on this topic in one of my next articles.

Whetstone is not a Linux specific test, it's not even an OS specific
test, but it certainly is a good test for the FPU in your Linux box, and
also gives an indication of compiler efficiency for specific kinds of applications
that involve FP calculations.