"Linux Gazette...making Linux just a little more fun!"

GNU/Linux Benchmarking - Practical Aspects

This is the second in a series of four articles on GNU/Linux benchmarking, to be published by the Linux Gazette. The first article presented some basic benchmarking concepts and analyzed the Whetstone benchmark in more detail. The present article deals with practical issues in GNU/Linux benchmarking: what benchmarks already exist, where to find them, what they effectively measure and how to run them. If you are not happy with the available benchmarks, it also offers some guidelines for writing your own. Finally, an application benchmark (Linux kernel 2.0.0 compilation) is analyzed in detail.

GNU/Linux is a great OS in terms of performance, and we can hope it will only get better over time. But that is a very vague statement: we need figures to prove it. What information can benchmarks effectively provide us with? What aspects of microcomputer performance can we measure under GNU/Linux?

Let's list some general benchmarking rules (not necessarily in order of decreasing priority) that should be followed to obtain accurate and meaningful benchmarking data, resulting in real GNU/Linux performance gains:

Use GPLed source code for the benchmarks, preferably easily available on the Net.

Use standard tools. Avoid benchmarking tools that have been optimized for a specific system/equipment/architecture.

Use Linux/Unix/Posix benchmarks. Mac, DOS and Windows benchmarks will not help much.

Don't quote your results to three decimal figures. A resolution of 0.1% is more than adequate. Precision of 1% is more than enough.

Report your results in standard format/metric/units/report forms.

Completely describe the configuration being tested.

Don't include irrelevant data.

If the variance in your results is significant, report it alongside the results and try to explain why this is so.

Comparative benchmarking is more informative. When doing comparative benchmarking, modify a single test variable at a time. Report results for each combination.

Decide beforehand what characteristic of a system you want to benchmark. Use the right tools to measure this characteristic.

Don't set out to benchmark trying to prove that equipment A is better than equipment B; you may be in for a surprise...

Avoid benchmarking one-of-a-kind or proprietary equipment. This may be very interesting for experimental purposes, but the information resulting from such benchmarks is absolutely useless to other Linux users.

Share any meaningful information you may have come up with. If there is a lesson to be learned from the Linux style of development, it's that sharing information is paramount.
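Two of the rules above (limited precision, and reporting variance alongside results) are easy to apply mechanically. The following is a minimal Python sketch, not part of any existing benchmark; the function names round_sig and summarize are made up for this illustration, and the timings in the example are invented numbers, not measurements.

```python
import math
import statistics

def round_sig(value, sig_figs=3):
    """Round value to a limited number of significant figures."""
    if value == 0:
        return 0.0
    digits = sig_figs - 1 - math.floor(math.log10(abs(value)))
    return round(value, digits)

def summarize(samples):
    """Return (mean, relative spread in percent) for repeated measurements."""
    mean = statistics.mean(samples)
    # The sample standard deviation needs at least two measurements.
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    spread = 100.0 * stdev / mean if mean else 0.0
    return mean, spread

# Invented timings (in seconds) for five runs of the same test.
runs = [412.3, 409.8, 415.1, 411.0, 410.6]
mean, spread = summarize(runs)
print(f"mean = {round_sig(mean)} s (+/- {spread:.1f} %)")
```

If the spread turns out to be larger than the differences you are trying to measure, that is a sign the test conditions need a closer look before any figure is published.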

These are some benchmarks I have collected over the Net. A few are Linux-specific, others are portable across a wide range of Unix-compatible systems, and some are even more generic.

UnixBench. A fundamental high-level Linux benchmark suite, Unixbench integrates CPU and file I/O tests, as well as system behaviour under various user loads. Originally written by staff members at BYTE magazine, it has been heavily modified by David C. Niemi.

BYTEmark as modified by Uwe Mayer. A CPU benchmark suite, reporting CPU/cache/memory, integer and floating-point performance. Again, this test originated at BYTE magazine. Uwe did the port to Linux, and recently improved the reporting part of the test.

Xengine by Kazuhiko Shutoh. This is a cute little X window tool/toy that basically reports on the speed with which a system will redraw a coloured bitmap on screen (a simulation of a four cycle engine). I like it because it is unpretentious while at the same time providing a useful measure of X server performance. It will also run at any resolution and pixel depth.

XMark93. Like xbench, this is a script that uses X11's x11perf and computes an index (in Xmarks). It was written a few years later than xbench and IMHO provides a better metric for X server performance.

Stream by John D. McCalpin. This program is based on the concept of "machine balance" (sustainable memory bandwidth vs. FPU performance). This has been found to be a central bottleneck for computer architectures in scientific applications.

Cachebench by Philip J. Mucci. By plotting memory access bandwidth vs. data size, this program will provide a wealth of benchmarking data on the memory subsystem (L1, L2 and main memory).

Netperf is copyright Hewlett-Packard. This is a sophisticated tool for network performance analysis. Compared to ttcp and ping, it verges on overkill. Source code is freely available.

Ttcp. A "classic" tool for network performance measurements, ttcp will measure the point-to-point bandwidth over a network connection.

Ping. Another ubiquitous tool for network performance measurements, ping will measure the latency of a network connection.

Perlbench by David Niemi. A small, portable benchmark written entirely in Perl.

Hdparm by Mark Lord. Hdparm's -t and -T options can be used to measure disk-to-memory (disk reads) transfer rates. Hdparm allows setting various EIDE disk parameters and is very useful for EIDE driver tuning. Some commands can also be used with SCSI disks.

Dga with b option. This is a small demo program for XFree's DGA extension, and I would never have looked at it were it not for Koen Gadeyne, who added the b command to dga. This command runs a small test of CPU/video memory bandwidth.

MDBNCH. This is a large ANSI-standard FORTRAN 77 program used as an application benchmark, written by Furio Ercolessi. It accesses a large data set in a very irregular pattern, generating misses in both the L1 and L2 caches.
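To make the idea behind Cachebench a little more concrete (bandwidth measured as a function of data size), here is a rough Python sketch. It is not Cachebench's code and will not reproduce its figures: the function name copy_bandwidth and the buffer sizes are assumptions chosen for illustration, and interpreter overhead dominates for the smallest buffers, so only the general trend is meaningful.

```python
import time

def copy_bandwidth(size_bytes, min_seconds=0.05):
    """Estimate memory copy bandwidth in MB/s for a buffer of size_bytes."""
    src = bytearray(size_bytes)
    copies = 0
    start = time.perf_counter()
    while True:
        dst = bytes(src)  # one full copy of the buffer, performed in C
        copies += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    return size_bytes * copies / elapsed / 1e6

# Buffer sizes chosen arbitrarily to span typical cache and main memory sizes.
for kib in (4, 64, 1024, 16384):
    print(f"{kib:6d} KiB: {copy_bandwidth(kib * 1024):10.0f} MB/s")
```

On most machines the reported bandwidth should drop once the buffer no longer fits in cache, which is the kind of curve Cachebench plots in far greater detail.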

We saw last month that (nearly) all benchmarks are based on one of two simple algorithms, or on combinations/variations of these:

1. Measuring the number of iterations of a given task executed over a fixed, predetermined time interval.

2. Measuring the time needed for the execution of a fixed, predetermined number of iterations of a given task.

We also saw that the Whetstone benchmark would use a combination of these two procedures to "calibrate" itself for optimum resolution, effectively providing a workaround for the low resolution timer available on PC type machines.
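The self-calibration idea can be sketched in a few lines of Python. This is only an illustration of the principle (keep increasing the loop count until a run lasts long enough for the timer to resolve it), not Whetstone's actual calibration code; the names calibrate and sample_kernel and the 0.2 second threshold are assumptions.

```python
import time

def calibrate(kernel, min_seconds=0.2):
    """Double the loop count until one timed run lasts at least min_seconds."""
    loops = 1
    while True:
        start = time.perf_counter()
        for _ in range(loops):
            kernel()
        duration = time.perf_counter() - start
        if duration >= min_seconds:
            return loops, duration
        loops *= 2

def sample_kernel():
    # Placeholder workload: a little floating-point arithmetic.
    x = 0.0
    for i in range(1000):
        x += i * 0.5
    return x

loops, duration = calibrate(sample_kernel)
print(f"{loops} loops in {duration:.2f} s -> {loops / duration:.0f} loops/s")
```

The final run has a fixed iteration count and a duration well above the timer resolution, so the reported rate does not depend on how coarse the clock is.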

Note that some newer benchmarks use new, exotic algorithms to estimate system performance, e.g. the Hint benchmark. I'll get back to Hint in a future article.

Right now, let's see what algorithm 2 would look like:

initialize loop_count
start_time = time()
repeat
    benchmark_kernel()
    decrement loop_count
until loop_count = 0
duration = time() - start_time
report_results()

Here, time() is a system library call which returns, for example, the elapsed wall-clock time since the last system boot. The benchmark_kernel() routine exercises the system feature or characteristic we are trying to measure.
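For readers who prefer something they can run, here is one possible Python rendering of this second algorithm. It is a sketch only: time.monotonic() stands in for the generic time() call, the kernel shown is an arbitrary placeholder, and a while loop replaces repeat/until (so a loop count of zero runs nothing).

```python
import time

def run_benchmark(kernel, loop_count):
    """Time a fixed, predetermined number of iterations of kernel."""
    start_time = time.monotonic()
    remaining = loop_count
    while remaining > 0:
        kernel()
        remaining -= 1
    return time.monotonic() - start_time

def benchmark_kernel():
    # Placeholder task; substitute whatever you actually want to measure.
    sum(i * i for i in range(10_000))

duration = run_benchmark(benchmark_kernel, 200)
print(f"200 iterations in {duration:.3f} s")
```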

Even this trivial benchmarking algorithm makes some basic assumptions about the system being tested and will report totally erroneous results if some precautions are not taken:

If the benchmark kernel executes so quickly that the looping instructions account for a significant percentage of the processor clock cycles spent in each iteration, results will be skewed. Preferably, benchmark_kernel() should take more than 100 times as long as the looping instructions.

We mentioned above that we used a straightforward wall-clock time() function. If the system load is high and our benchmark gets only 3% of the CPU time, we will get completely erroneous results! And of course, on a multi-user, pre-emptive, multi-tasking OS like GNU/Linux, it's impossible to guarantee exclusive use of the CPU by our benchmark.
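Both precautions can be checked with a small experiment. The Python sketch below, written under the assumption that a pure-Python loop is an acceptable stand-in for a real benchmark, measures wall-clock time and the CPU time actually used by the process side by side, and estimates the loop overhead by timing an empty kernel. The names timed, empty_kernel and busy_kernel are made up for this example.

```python
import time

def timed(kernel, loops):
    """Return (wall_seconds, cpu_seconds) for loops iterations of kernel."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    for _ in range(loops):
        kernel()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def empty_kernel():
    pass

def busy_kernel():
    sum(i * i for i in range(2_000))

loops = 2_000
overhead_wall, overhead_cpu = timed(empty_kernel, loops)
wall, cpu = timed(busy_kernel, loops)

print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
if cpu > 0:
    print(f"loop overhead: {100 * overhead_cpu / cpu:.1f} % of CPU time")
```

On an idle machine the two times should be close; on a loaded one the wall-clock figure grows while the CPU figure stays roughly constant, which is why CPU time is usually the safer number to report for CPU-bound tests.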

You can substitute the benchmark "kernel" with whatever computing task interests you more or comes closer to your specific benchmarking needs.

Examples of such kernels would be:

For FPU performance measurements: a sampling of FPU operations.

Various calculations using matrices and/or vectors.

Any test accessing a peripheral, e.g. disk or serial I/O.

For good examples of actual C source code, see the UnixBench and Whetstone benchmark sources.
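As a rough illustration of what such kernels might look like, here are three toy examples in Python, one for each category listed above. They are assumptions made up for this sketch (names, sizes and operations included) and are far simpler than the kernels found in UnixBench or Whetstone.

```python
import math
import os
import tempfile
import time

def fpu_kernel(n=10_000):
    """A small sampling of floating-point operations."""
    total = 0.0
    for i in range(1, n + 1):
        total += math.sqrt(i) * math.sin(i) / (i + 1.0)
    return total

def matrix_kernel(n=30):
    """Multiply two n x n matrices stored as lists of lists."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def disk_kernel(size=1 << 20):
    """Write size bytes to a temporary file and force them to disk."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())
    return size

for kernel in (fpu_kernel, matrix_kernel, disk_kernel):
    start = time.perf_counter()
    kernel()
    print(f"{kernel.__name__}: {time.perf_counter() - start:.4f} s")
```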

The more one gets to use and know GNU/Linux, the more often one compiles the Linux kernel. Very quickly it becomes a habit: as soon as a new kernel version comes out, we download the tar.gz source file and recompile it a few times, fine-tuning the new features.

This is the main reason for proposing kernel compilation as an application benchmark: it is a very common task for all GNU/Linux users. Note that the application being directly tested is not the Linux kernel itself, it's gcc. I guess most GNU/Linux users use gcc every day.

The Linux kernel is being used here as a (large) standard data set. Since this is a large program (gcc) with a wide variety of instructions, processing a large data set (the Linux kernel) with a wide variety of data structures, we assume it will exercise a good subset of OS functions like file I/O, swapping, etc and a good subset of the hardware too: CPU, memory, caches, hard disk, hard disk controller/driver combination, PCI or ISA I/O bus. Obviously this is not a test for X server performance, even if you launch the compilation from an xterm window! And the FPU is not exercised either (but we already tested our FPU with Whetstone, didn't we?). Now, I have noticed that test results are almost independent of hard disk performance, at least on the various systems I had available. The real bottleneck for this test is CPU/cache performance.

Why specify the Linux kernel version 2.0.0 as our standard data set? Because it is widely available, as most GNU/Linux users have an old CD-ROM distribution with the Linux kernel 2.0.0 source, and also because it is quite close in size and structure to present-day kernels. So it's not exactly an out-of-anybody's-hat data set: it's a typical real-world data set.

Why not let users compile any Linux 2.x kernel and report results? Because then we wouldn't be able to compare results anymore. Aha you say, but what about the different gcc and libc versions in the various systems being tested? Answer: they are part of your GNU/Linux system and so also get their performance measured by this benchmark, and this is exactly the behaviour we want from an application benchmark. Of course, gcc and libc versions must be reported, just like CPU type, hard disk, total RAM, etc (see the Linux Benchmarking Toolkit Report Form).
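Collecting this configuration information by hand is error-prone, so it can help to script it. The Python sketch below gathers a few of the items mentioned above. The dictionary keys are my own choice and do not correspond to the fields of the Linux Benchmarking Toolkit Report Form; the helper names tool_version and describe_system are likewise made up, and the script assumes gcc, if installed, responds to --version.

```python
import platform
import subprocess

def tool_version(command):
    """Return the first line printed by command, or 'not found'."""
    try:
        out = subprocess.run(command, capture_output=True, text=True, check=False)
    except OSError:
        return "not found"
    lines = (out.stdout or out.stderr).splitlines()
    return lines[0] if lines else "unknown"

def describe_system():
    """Collect a few configuration details worth reporting with results."""
    uname = platform.uname()
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": f"{uname.system} {uname.release}",
        "machine": uname.machine,
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
        "gcc": tool_version(["gcc", "--version"]),
    }

for key, value in describe_system().items():
    print(f"{key:8s} {value}")
```

CPU type, hard disk and total RAM are not covered here and would still need to be added to a complete report.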

Basically what goes on during a gcc kernel compilation (make zImage) is that:

1. Gcc is loaded in memory,

2. Gcc is fed sequentially the various pieces that make up the Linux kernel, and finally

3. The linker is called to create the zImage file (a compressed image file of the Linux kernel).

Step 2 is where most of the time is spent.

This test is quite stable between different runs. It is also relatively insensitive to small loads (e.g. it can be run in an xterm window) and completes in less than 15 minutes on most recent machines.

Getting the source

Do I really have to tell you where to get the kernel 2.0.0 source? OK, then: ftp://sunsite.unc.edu/pub/Linux/kernel/source/2.0.x or any of its mirrors, or any recent GNU/Linux CD-ROM set with a copy of sunsite.unc.edu. Download the 2.0.0 kernel, gunzip and untar under a test directory (tar zxvf linux-2.0.tar.gz will do the trick).

Compiling and running

Cd to the linux directory you just created and type make config. Press <Enter> to answer all questions with their default value. Now type make dep ; make clean ; sync ; time make zImage. Depending on your machine, you can go and have lunch or just an espresso. You can't (yet) blink and be done with it, even on a 600 MHz Alpha. By the way, if you are going to run this test on an Alpha, you will have to cross-compile the kernel targeting the i386 architecture so that your results are comparable to the more ubiquitous x86 machines.
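If you want to record the timing from a script rather than reading it off the terminal, something along the following lines could be used. It is a sketch that mimics what the shell's time command reports (real, user and system time); the function name time_command is an assumption, and the example runs a trivial Python one-liner as a stand-in for the actual build.

```python
import os
import subprocess
import sys
import time

def time_command(command, cwd=None):
    """Run command; return (exit code, real, user, sys) with times in seconds."""
    before = os.times()
    start = time.perf_counter()
    result = subprocess.run(command, cwd=cwd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    real = time.perf_counter() - start
    after = os.times()
    # CPU time consumed by child processes that have been waited for.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return result.returncode, real, user, system

# Stand-in workload. For the kernel benchmark one would instead call
# something like time_command(["make", "zImage"], cwd="linux").
code, real, user, system = time_command(
    [sys.executable, "-c", "sum(range(10**6))"])
print(f"exit={code} real={real:.2f}s user={user:.2f}s sys={system:.2f}s")
```

A large gap between real time and the sum of user and system time suggests the build spent its time waiting (on disk or on other processes) rather than computing.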