The Linpack Benchmark is a measure of a
computer’s floating-point rate of execution. It is determined by running a
computer program that solves a dense system of linear equations. Over the years
the characteristics of the benchmark have changed a bit. In fact, there are
three benchmarks included in the Linpack Benchmark report.

The Linpack Benchmark is something that grew out
of the Linpack software project. It was originally intended to give users of
the package a feeling for how long it would take to solve certain matrix
problems. The benchmark started as an appendix to the Linpack Users' Guide and
has grown since the guide was published in 1979.

The Linpack Benchmark report is entitled
“Performance of Various Computers Using Standard Linear Equations Software”.
The report lists the performance in Mflop/s of a number of computer systems. A
copy of the report is available at http://www.netlib.org/benchmark/performance.ps.

The paper “The LINPACK Benchmark: Past, Present,
and Future” by Jack Dongarra, Piotr Luszczek, and Antoine
Petitet provides a look at the details of the benchmark and
provides performance data in graphics form for a number of machines on basic
operations. A copy of the paper is available at http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf.

Mflop/s is a rate of execution, millions of
floating point operations per second. Whenever this term is used it will refer
to 64 bit floating point operations and the operations will be either addition
or multiplication. Gflop/s refers to billions of floating point operations per
second and Tflop/s refers to trillions of floating
point operations per second.

What is the theoretical
peak performance?

The theoretical peak is based not on an actual performance from a benchmark
run, but on a paper computation to determine the theoretical peak rate of
execution of floating point operations for the machine. This is the number
manufacturers often cite; it represents an upper bound on performance. That is,
the manufacturer guarantees that programs will not exceed this rate, a sort of
"speed of light" for a given computer. The theoretical peak performance is
determined by counting the number of floating-point additions and multiplications
(in full precision) that can be completed during a period of time, usually the
cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can
complete 4 floating point operations per cycle, for a theoretical peak
performance of 6 GFlop/s.
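The calculation above can be sketched in a few lines. The Itanium 2 figures are the ones given in the text; flops-per-cycle counts for other processors would have to come from the manufacturer's documentation.

```python
# Theoretical peak = clock rate x floating-point operations completed per cycle.

def theoretical_peak_gflops(clock_ghz, flops_per_cycle):
    """Peak rate in GFlop/s from clock speed and per-cycle flop count."""
    return clock_ghz * flops_per_cycle

# The Itanium 2 example from the text: 1.5 GHz x 4 flops/cycle = 6 GFlop/s.
print(theoretical_peak_gflops(1.5, 4))  # 6.0
```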

The first benchmark is for a matrix of order 100
using the Linpack software in Fortran. The results can
be found in Table 1 of the benchmark report. To run this benchmark,
download the Fortran program from http://www.netlib.org/benchmark/Linpackd.
To run the program
you will need to supply a timing function called SECOND, which should report the
CPU time that has elapsed. The ground rules for running this benchmark are that
you can make no changes to the Fortran code, not even
to the comments. Only compiler optimization can be used to enhance performance.

The Linpack benchmark measures the performance
of two routines from the Linpack collection of software. These routines are
DGEFA and DGESL (these are double-precision versions; SGEFA and SGESL are their
single-precision counterparts). DGEFA performs the LU decomposition with
partial pivoting, and DGESL uses that decomposition to solve the given system
of linear equations.

Most of the time is spent in DGEFA. Once the
matrix has been decomposed, DGESL is used to find the solution; this process
requires O(n^2) floating-point operations,
as opposed to the O(n^3)
floating-point operations of DGEFA. The
results for this benchmark can be found in the second column of Table 1, under
“LINPACK Benchmark n = 100”, of the Linpack Benchmark Report.
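As an illustration of what DGEFA and DGESL do, here is a minimal pure-Python sketch of LU factorization with partial pivoting followed by the triangular solves. This is illustrative only, not the Fortran benchmark code.

```python
# DGEFA's job: LU factorization with partial pivoting (the O(n^3) step).
# DGESL's job: use the factors to solve Ax = b (the O(n^2) step).

def lu_factor(A):
    """In-place LU factorization with partial pivoting; returns the pivot list."""
    n = len(A)
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if p != k:
            A[k], A[p] = A[p], A[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]          # multiplier stored in place of L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return piv

def lu_solve(A, piv, b):
    """Solve Ax = b using the factors produced by lu_factor."""
    n = len(A)
    x = [b[p] for p in piv]             # apply the row permutation to b
    for i in range(n):                  # forward substitution with unit L
        for j in range(i):
            x[i] -= A[i][j] * x[j]
    for i in reversed(range(n)):        # back substitution with U
        for j in range(i + 1, n):
            x[i] -= A[i][j] * x[j]
        x[i] /= A[i][i]
    return x

A = [[4.0, 3.0], [6.0, 3.0]]
piv = lu_factor(A)
x = lu_solve(A, piv, [10.0, 12.0])      # solves 4a+3b=10, 6a+3b=12 -> a=1, b=2
```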

The second benchmark is for a matrix of size
1000 and can be found in Table 1 of the benchmark report. To run this
benchmark, download the Fortran driver from http://www.netlib.org/benchmark/1000d.
The ground rules for running
this benchmark are a bit more relaxed in that you can use any linear
equation solver you wish, implemented in any language. The one requirement is that
your method must compute a solution that meets
the prescribed accuracy. TPP stands for Toward Peak Performance; this is the
title of the column in the benchmark report that lists the results.

Why are my performance
results below the theoretical peak?

The performance of a computer
is a complicated issue, a function of many interrelated quantities. These
quantities include the application, the algorithm, the size of the problem, the
high-level language, the implementation, the human level of effort used to
optimize the program, the compiler's ability to optimize, the age of the
compiler, the operating system, the architecture of the computer, and the
hardware characteristics. The results
presented for this benchmark suite should not be
extolled as measures of total system performance (unless enough analysis has
been performed to indicate a reliable correlation of the benchmarks to the
workload of interest) but, rather, as reference points for further evaluations.

Why are the performance
results for my computer different from the same machine’s results in the
Linpack Report?

There are many reasons why
your results may vary from results recorded in the Linpack Benchmark Report.
Issues such as load on the system, accuracy of the clock, compiler options,
version of the compiler, size of cache, bandwidth from memory, amount of
memory, etc., can affect the performance even when the processors are the same.

The third benchmark is called the Highly
Parallel Computing Benchmark and can be found in Table 3 of the Benchmark
Report. (This is the benchmark used for the Top500 report.) This benchmark
attempts to measure the best performance of a machine in solving a system of
equations. The problem size and software can be chosen to produce the best
performance.

The “ground rules” for running the first
benchmark in the report, n=100 case, are that the program is run as is with no
changes to the source code; not even changes to the comments are allowed. The
compiler may perform optimization at compile time through compiler switches.
The user must supply a timing function called SECOND, which returns the
running CPU time for the process. The matrix generated by the benchmark program
must be used to run this case.

The “ground rules” for running the second
benchmark in the report, n=1000 case, allows for a complete user replacement of
the LU factorization and solver steps. The calling sequence should be the same
as the original routines. The problem
size should be of order 1000. The accuracy of the solution must satisfy the
following bound:

    ||Ax - b|| / (||A|| ||x|| n eps) = O(1)

where eps is the machine precision (on IEEE machines this is 2^-53) and n is the size of the
problem. The matrix used must be the same matrix used in the driver program
available from netlib.

The “ground rules” for running the third
benchmark in the report, Highly Parallel case, allows for a complete user
replacement of the LU factorization and solver steps. The accuracy of the
solution must satisfy the following bound:

    ||Ax - b|| / (||A|| ||x|| n eps) = O(1)

where eps is the machine precision (on IEEE machines this is 2^-53) and n is the size of the
problem. The matrix used must be the same matrix used in the driver program
available from netlib. There is no restriction on the
problem size.

In order to have an entry included in the
Linpack Benchmark report the results must be computed using full precision. By
full precision we generally mean 64 bit floating point arithmetic or higher.
Note that this is not an issue of single or double precision as some systems
have 64-bit floating point arithmetic as single precision. It is a function of
the arithmetic used.

For the 100x100 based Fortran
version, you need to supply a timing function called SECOND. SECOND is an
elapsed-time function that will be called from Fortran
and is expected to return the running CPU time in seconds. In the program two
calls to SECOND are made and the difference is taken to obtain the time.
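The timing pattern can be sketched as follows, with Python's `time.process_time` standing in for SECOND (an assumption for illustration; the real SECOND is a Fortran function you supply yourself):

```python
# Two calls bracket the work; the difference is the measured CPU time.
import time

def second():
    """Accumulated CPU time of this process, in seconds (SECOND's contract)."""
    return time.process_time()

t0 = second()
total = sum(i * i for i in range(100_000))   # stand-in for the factorization
elapsed = second() - t0
assert elapsed >= 0.0
```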

The performance of the Linpack benchmark is
typical for applications where the basic operation is based on vector
primitives such as adding a scalar multiple of one vector to another vector. Many
applications exhibit the same performance as the Linpack Benchmark. However,
the results should not be taken too seriously. To measure the performance
of any computer it is critical to probe for the performance of your
applications. The Linpack Benchmark can only give one point of reference. In addition, in multiprogramming environments
it is often difficult to reliably measure the execution time of a single
program. We trust that anyone actually evaluating machines and operating
systems will gather more reliable and more representative data.
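The vector primitive referred to above is the Level 1 BLAS AXPY operation, y = a*x + y. A minimal sketch of the loop:

```python
# AXPY: add a scalar multiple of one vector to another (the Linpack kernel).

def daxpy(a, x, y):
    """Return a*x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]))  # [12.0, 14.0, 16.0]
```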

While we make every attempt to verify the
results obtained from users and vendors, errors are bound to exist and should
be brought to our attention. We encourage users to obtain the programs and run
the routines on their machines, reporting any discrepancies with the numbers
listed here.

The Linpack package is a collection of Fortran subroutines for solving various systems of linear
equations. (http://www.netlib.org/Linpack/) The software in Linpack is based on
a decompositional approach to numerical linear
algebra. The general idea is the following. Given a problem involving a matrix,
one factors or decomposes the matrix into a product of simple, well-structured
matrices which can be easily manipulated to solve the original problem. The
package has the capability of handling many different matrix types and
different data types, and provides a range of options. Linpack itself is built
on another package called the BLAS. Linpack was designed in the late 70's and
has been superseded by a package called LAPACK.

The
ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing
research effort focusing on applying empirical techniques in order to provide
portable performance for the BLAS routines. At present, it provides C and Fortran77
interfaces to a portably efficient BLAS implementation, as well as a few
routines from LAPACK. For additional information see: http://www.netlib.org/atlas/

Linpack is not the most efficient software for
solving matrix problems. This is mainly due to the way the algorithm and
resulting software accesses memory. The
memory access patterns of the algorithm disregard the multi-layered
memory hierarchies of RISC architectures and vector computers, thereby spending
too much time moving data instead of doing useful floating-point operations.
LAPACK addresses this problem by reorganizing the algorithms to use block
matrix operations, such as matrix multiplication, in the innermost loops. For each computer architecture, block operations can be
optimized to account for memory hierarchies, providing a transportable way to
achieve high efficiency on diverse modern machines. We use the term
“Transportable” instead of “portable” because, for fastest possible
performance, LAPACK requires that highly optimized block matrix operations be
already implemented on each machine. These operations are performed by the
Level 3 BLAS in most cases.
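The block-operation idea can be sketched as a blocked matrix multiplication, in which each small block is reused many times while it is resident in cache. This is a pure-Python illustration of the access pattern, not an optimized BLAS.

```python
# Blocked matrix multiply: work proceeds block by block so that each nb x nb
# tile of A and B is reused before it is evicted from cache.

def blocked_matmul(A, B, nb):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                # Multiply the (ii,kk) block of A by the (kk,jj) block of B
                # and accumulate into the (ii,jj) block of C.
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

In a real Level 3 BLAS the innermost block product is replaced by highly tuned machine-specific code; the blocking structure is what makes that tuning pay off.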

LAPACK is a software collection for solving various
matrix problems in linear algebra: in particular, systems of linear equations, least squares problems,
eigenvalue problems, and singular value decompositions. The software is based on
the use of block partitioned matrix techniques that aid in achieving high
performance on RISC based systems, vector computers, and shared memory parallel
processors.

The Linpack Benchmark is, in some sense, an
accident. It was originally designed to assist users of the Linpack package by
providing information on execution times required to solve a system of linear
equations. The first ``Linpack Benchmark'' report appeared as an appendix in
the Linpack Users' Guide in 1979. The appendix comprised data for one commonly
used path in Linpack for a matrix problem of size 100, on a collection of
widely used computers (23 in all), so users could estimate the time required to
solve their matrix problem.

Over the years other data was added, more as a
hobby than anything else, and today the collection includes hundreds of
different computer systems.

You can contact Jack Dongarra and send him the
output from the benchmark program. When sending results please include the
specific information on the computer on which the test was run, the compiler,
the optimization that was used, and the site it was run on. You can contact
Dongarra by sending email to dongarra@cs.utk.edu.

In order to run the benchmark program you will
have to supply a function to gather the execution time on your computer. The
execution time is requested by a call to the Fortran
function SECOND. It is expected that the routine returns the accumulated
execution time of your program. Two calls to SECOND are
made and the difference is taken to compute the execution time.

The Performance API (PAPI)
project specifies a standard application programming interface (API) for
accessing hardware performance counters available on most modern microprocessors.
These counters exist as a small set of registers that count Events, occurrences
of specific signals related to the processor's function. Monitoring these
events facilitates correlation between the structure of source/object code and
the efficiency of the mapping of that code to the underlying architecture.

Should I run the single or double precision
versions of the benchmark?

The results reported in the benchmark report
reflect performance for 64 bit floating point arithmetic. On some machines this
may be DOUBLE PRECISION, such as on computers that have IEEE floating point
arithmetic, and on other computers this may be single precision (declared REAL
in Fortran), such as Cray’s vector computers.

The Top500 lists the 500 fastest computer systems being used today. The collection was started in June 1993
and has been updated every 6 months since then. The report lists the sites that
have the 500 most powerful computer systems installed. The best Linpack
benchmark performance achieved is used as the performance measure in ranking the
computers.

To
be listed on the Top500 list you have to run the software that can be found at http://www.netlib.org/benchmark/hpl/
and the performance of the benchmark run must be within the range of the 500
fastest computers for that period of time.

What is HPL?

HPL is a software package
that solves a (random) dense linear system in double precision (64 bits)
arithmetic on distributed-memory computers. It can thus be regarded as a
portable as well as freely available implementation of the High Performance
Computing Linpack Benchmark.

In
order to find out the best performance of your system, the largest problem size
fitting in memory is what you should aim for. The amount of memory used by HPL
is essentially the size of the coefficient matrix. So for example, if you have
4 nodes with 256 MB of memory each, this corresponds to 1 GB total (2^30 bytes), i.e., about 134 M double
precision (8 byte) elements. The square root of that number is about 11585. One
definitely needs to leave some memory for the OS as well as for other things,
so a problem size of 10000 is likely to fit. As a rule of thumb, 80 % of the
total amount of memory is a good guess. If the problem size you pick is too
large, swapping will occur, and the performance will drop. If multiple
processes are spawned on each node (say you have 2 processors per node), what
counts is the available amount of memory to each process.
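The rule of thumb above can be sketched as a small calculation. The 80% fill fraction is the guess from the text, and 1 GB is taken as 2^30 bytes.

```python
# The HPL coefficient matrix takes 8*N*N bytes, so pick N such that the
# matrix fills roughly 80% of total memory, leaving room for the OS.
import math

def hpl_problem_size(total_mem_bytes, fill_fraction=0.80):
    """Largest N such that an N x N matrix of 8-byte elements fills
    roughly fill_fraction of the given memory."""
    return int(math.sqrt(fill_fraction * total_mem_bytes / 8))

# 4 nodes x 256 MB each = 1 GB total: using all of memory would give
# sqrt(2**30 / 8) ~ 11585; the 80% rule suggests a bit above 10000.
n = hpl_problem_size(4 * 256 * 2**20)
```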

For HPL, what block size NB should I use?

HPL
uses the block size NB for the data distribution as well as for the
computational granularity. From a data distribution point of view, the smaller
NB is, the better the load balance. You definitely want to stay away from very
large values of NB. From a computation point of view, too small a value of NB
may limit the computational performance by a large factor because almost no
data reuse will occur in the highest level of the memory hierarchy. The number
of messages will also increase. Efficient matrix-multiply routines are often
internally blocked. Small multiples of this blocking factor
are likely to be good block sizes for HPL. The bottom line is that
"good" block sizes are almost always in the [32 ..
256] interval. The best values depend on the computation / communication
performance ratio of your system. To a much lesser extent, the problem size matters
as well. Say for example, you empirically found that 44 was a good block size
with respect to performance. 88 or 132 are likely to give slightly better
results for large problem sizes because of a slightly higher flop rate.

For HPL, what process grid ratio P x Q should I use?

This
depends on the physical interconnection network you have. Assuming a mesh or a
switch HPL "likes" a 1:k ratio with k in
[1..3]. In other words, P and Q should be approximately equal, with Q slightly
larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8
... If you are running on a simple Ethernet network, there is only one wire
through which all the messages are exchanged. On such a network, the
performance and scalability of HPL is strongly limited and very flat process
grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4
...

For HPL, what about the one processor case?

HPL
has been designed to perform well for large problem sizes on hundreds of nodes
and more. The software works on one node and for large problem sizes, one can
usually achieve pretty good performance on a single processor as well. For
small problem sizes however, the overhead due to message-passing, local
indexing and so on can be significant.

For HPL, why so many options in HPL.dat?

There
are quite a few reasons. First off, these options are useful to determine what
matters and what does not on your system. Second, HPL is often used in the
context of early evaluation of new systems. In such a case, everything is
usually not quite working right, and it is convenient to be able to vary these
parameters without recompiling. Finally, every system has its own peculiarities
and one is likely to be willing to empirically determine the best set of
parameters. In any case, one can always follow the advice provided in the tuning section of the
HPL document and not worry about the complexity of the input file.

Can HPL be outperformed?

Certainly. There is always room for performance
improvements. Specific knowledge about a particular system is always a source
of performance gains. Even from a generic point of view, better algorithms or
more efficient formulation of the classic ones are potential winners.

Can I use Strassen’s Method when doing the matrix
multiplies in the HPL benchmark or for the Top500 run?

The
normal matrix multiplication algorithm requires n^3 + O(n^2)
multiplications and about the same number of additions. Strassen's algorithm reduces the total number
of operations to about O(n^2.81) by recursively
multiplying 2n x 2n matrices using seven n x n matrix multiplications. Thus
using Strassen’s Algorithm will distort the true execution rate, and as a result we
do not allow Strassen’s Algorithm to be used for TOP500 reporting. As a
side note, in the "usual" matrix multiplication the error term is of order n^2.
In Strassen's method, the error exponent p for n^p
ranges from 2 to 3.85, and the numerical error can be 10 to 100 times greater than
that for standard multiplication.
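For illustration only (recall that it is not permitted for TOP500 runs), here is a sketch of Strassen's recursion for matrices whose order is a power of two:

```python
# Strassen's method: seven half-size multiplications instead of eight,
# giving the O(n^log2(7)) ~ O(n^2.81) count mentioned above.

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split both operands into quadrants.
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # The seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Reassemble the result quadrants.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return [C11[i] + C12[i] for i in range(h)] + \
           [C21[i] + C22[i] for i in range(h)]
```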

There is optimized software available that
many people use to generate Top500 performance results. This benchmark attempts to measure the best
performance of a machine in solving a system of equations. The problem size and
software can be chosen to produce the best performance. A copy of that software
can be downloaded from the HPL address given above.

Why would a machine
appear in the Linpack Benchmark report but not in the Top500 list?

There could be two reasons.
First the Linpack Benchmark report contains historic information. Even if a
computer is no longer in existence it can appear in the Linpack benchmark
report. This is unlike the Top500, which reports the 500 fastest computers in
existence at a given point in time. The second reason is that the Top500 list comes out twice a year while the Linpack Benchmark report
is updated continuously.

Why would a machine
appear in the Top500 list and not in the Linpack Benchmark report?

If a machine is in the Top500
list it should appear in the Linpack Benchmark report. If you see an instance
where this is not the case, it is probably a mistake;
please send email to Jack Dongarra at dongarra@cs.utk.edu
about the situation.

The norm.resid is a measure of the
accuracy of the computation. The value should be O(1).
If the value is much greater than O(100) it suggests
that the results are not correct.
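A sketch of the scaling behind norm.resid, assuming the common Linpack-style normalization resid / (||A|| ||x|| n machep); the exact formula used by a given driver should be checked against its source.

```python
# The raw residual ||Ax - b|| is divided by the matrix norm, solution norm,
# problem size, and machine precision. eps defaults to the IEEE double
# precision machep quoted in the text (2^-52 = 2.22044605e-16).

def norm_resid(resid, norm_a, norm_x, n, eps=2.2204460492503131e-16):
    return resid / (norm_a * norm_x * n * eps)

# A residual of 1e-14 for a well-conditioned n = 100 problem normalizes to O(1).
v = norm_resid(1.0e-14, 1.0, 1.0, 100)
```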

The resid is the unnormalized quantity.

The term machep
measures the precision used to carry out the computation. On an IEEE floating
point computer the value should be 2.22044605e-16.

The values of x(1) and
x(n) are the first and last components of the solution. The problem is
constructed so that the values of solution should be all ones.

There are two sets of timings performed both on
matrices of size 100. The first one is where the 2-dimensional array that
contained the matrix has a leading dimension of 201, and a second set where the
leading dimension is 200. This is done to see what effect, if any, the placement
of the arrays in memory has on the performance.

Times for dgefa and dgesl are reported. dgefa
factors the matrix using Gaussian elimination with partial pivoting and dgesl
solves a system based on the factorization. dgefa requires 2/3 n^3
operations and dgesl requires n^2
operations. The value of total is the sum of the times and mflops
is the execution rate, or millions of floating point operations per second.
Here floating point operations are taken to be
floating point additions and multiplications. Unit and ratio are obsolete and
should be ignored.
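Using the operation counts quoted above (2/3 n^3 for dgefa plus n^2 for dgesl, taken as given; the benchmark source is authoritative for the exact constant), the mflops figure can be sketched as:

```python
# mflops = total operation count / measured time / 10^6.

def mflops(n, seconds):
    ops = (2.0 / 3.0) * n**3 + float(n)**2
    return ops / seconds / 1.0e6

rate = mflops(100, 0.01)   # roughly 67.7 Mflop/s for a 0.01 s run at n = 100
```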

If the time reported is negative or zero then
the clock resolution is not accurate enough for the granularity of the work. In
this case a different timing routine should be used that has better resolution.

No archive is maintained of previous results.
However, here is some information to provide a historical perspective. The numbers in the following tables have been
extracted from old Linpack Benchmark Reports. It took a bit of ``file archaeology'' to put the list together since I
don't have the complete set of reports.

(Full precision; the manufacturer is allowed to
solve as large a problem as desired, maximum optimization permitted.)

Measured Gflop/s is the measured peak rate of
execution for running the benchmark in billions of floating point operations
per second.

Size of Problem is the matrix size at which the
measured performance was observed.

Size of ½ Perf is the
size of problem needed to achieve ½ the measured peak performance.

Theoretical Peak
Gflop/s is the theoretical peak performance for the computer.

What is the HPC
Challenge benchmark?

The HPC Challenge
benchmark consists at this time of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff
Latency/Bandwidth. HPL is the Linpack TPP benchmark. The test stresses the
floating point performance of a system. STREAM is a benchmark that measures
sustainable memory bandwidth (in GB/s), and RandomAccess measures the rate of random updates of memory.
PTRANS measures the rate of transfer for large arrays of data between the
memories of a multiprocessor. Latency/Bandwidth measures (as the name suggests)
latency and bandwidth of communication patterns of increasing complexity
between as many nodes as is time-wise feasible.

Is there a benchmark for sparse
matrix problems?

The Linpack Benchmark suite is built around
software for dense matrix problems. In May 2000 we started to put together a
benchmark for sparse iterative matrix problems. For additional information see:
http://www.netlib.org/benchmark/sparsebench/