HPL is a software package that solves a random, dense linear system of equations with N unknowns in double-precision arithmetic on distributed-memory computers. It then reports the estimated performance of the machine as the number of effective floating-point operations divided by the time in seconds it took to perform them.

It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing LINPACK Benchmark.
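As a rough illustration of that ratio, the sketch below turns a problem size and a wall-clock time into a GFLOP/s figure. It assumes the conventional operation count for an LU-based dense solve, 2/3 N^3 + 3/2 N^2; the function name and the sample numbers are hypothetical and are not taken from HPL itself.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Estimate GFLOP/s for a dense solve of order n that took `seconds`."""
    if n <= 0 or seconds <= 0:
        raise ValueError("n and seconds must be positive")
    # Assumed operation count: LU factorization plus the triangular solves.
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1.0e9


if __name__ == "__main__":
    # Hypothetical run: N = 10,000 solved in 20 seconds.
    print(f"{hpl_gflops(10_000, 20.0):.2f} GFLOP/s")
```

Because the cubic term dominates, the rate for a fixed run time grows by nearly a factor of eight when N doubles.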

The parameters in the red cells are the most important tuning parameters, while the ones in the green cells are of somewhat lesser importance. The parameter space is huge, and exploring all of it is impractical because there are more than a billion different combinations. If one considers only combinations of the most important parameters, the parameter space reduces to 50,000 combinations.

There are several algorithms that aim at cutting the amount of time required for tuning, but they are beyond the scope of this article. However, it is worth mentioning that one does not have to use the maximal problem size in tuning. Early termination can be used to cut down the time, and the fact that tuning algorithms perform well and seem to be robust suggests that, for a given architecture, the number of important parameters might be smaller than suggested above.

The tuning parameters need to be stored in a file named HPL.dat.
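To give an idea of what such a file contains, here is a minimal sketch that renders an HPL.dat for a single configuration. The line order and the legends follow the sample input file distributed with the reference HPL sources as recalled here, so compare the result against the HPL.dat shipped with your own build. The function name and every default value (threshold, NBMIN, NDIV, lookahead depth, and so on) are placeholders chosen for illustration, not recommendations.

```python
def render_hpl_dat(n: int, nb: int, p: int, q: int,
                   pfact: int = 2, rfact: int = 2, bcast: int = 1) -> str:
    """Return HPL.dat text for one (N, NB, P x Q) configuration."""
    # Each row is (value, legend). HPL reads values by line position,
    # so the order of the rows matters; the legends are only comments.
    rows = [
        ("HPL.out", "output file name (if any)"),
        (6, "device out (6=stdout,7=stderr,file)"),
        (1, "# of problems sizes (N)"),
        (n, "Ns"),
        (1, "# of NBs"),
        (nb, "NBs"),
        (0, "PMAP process mapping (0=Row-,1=Column-major)"),
        (1, "# of process grids (P x Q)"),
        (p, "Ps"),
        (q, "Qs"),
        (16.0, "threshold"),
        (1, "# of panel fact"),
        (pfact, "PFACTs (0=left, 1=Crout, 2=Right)"),
        (1, "# of recursive stopping criterium"),
        (4, "NBMINs (>= 1)"),
        (1, "# of panels in recursion"),
        (2, "NDIVs"),
        (1, "# of recursive panel fact."),
        (rfact, "RFACTs (0=left, 1=Crout, 2=Right)"),
        (1, "# of broadcast"),
        (bcast, "BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)"),
        (1, "# of lookahead depth"),
        (1, "DEPTHs (>=0)"),
        (2, "SWAP (0=bin-exch,1=long,2=mix)"),
        (64, "swapping threshold"),
        (0, "L1 in (0=transposed,1=no-transposed) form"),
        (0, "U  in (0=transposed,1=no-transposed) form"),
        (1, "Equilibration (0=no,1=yes)"),
        (8, "memory alignment in double (> 0)"),
    ]
    # The first two lines of the file are free-form header text.
    header = ["HPLinpack benchmark input file", "generated by a sketch script"]
    body = [f"{str(value):<12} {label}" for value, label in rows]
    return "\n".join(header + body) + "\n"


if __name__ == "__main__":
    # Hypothetical configuration: N = 10,000, NB = 128, 2 x 4 process grid.
    print(render_hpl_dat(10_000, 128, 2, 4), end="")
```

Saving the returned string as HPL.dat gives a starting point that can then be edited by hand while experimenting with the parameters discussed in the rest of this article.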

Setting the values of HPL.dat parameters

The problem size N should be large, but not so large that the matrix no longer fits in physical memory. The amount of computation scales as the cube of N, which means that doubling N requires eight times as many operations. Values of the order of 1000-10,000 are reasonable for modern systems. Make sure that no paging occurs while the problem is running by monitoring the system with the vmstat command. A useful formula to estimate the problem size is:

N = sqrt ( 0.75 * Number of Nodes * Minimum memory of any node / 8 )
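The formula can be evaluated with a few lines of Python. This is a sketch that assumes the memory figure is given in bytes, so that the division by 8 converts bytes into double-precision matrix entries; the function names and the example cluster are hypothetical.

```python
import math


def estimate_problem_size(nodes: int, min_mem_bytes_per_node: int,
                          fraction: float = 0.75) -> int:
    """Evaluate N = sqrt(fraction * nodes * min_mem / 8), rounded down."""
    if nodes <= 0 or min_mem_bytes_per_node <= 0:
        raise ValueError("nodes and memory must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Number of 8-byte doubles that fit in the chosen share of memory.
    usable_doubles = fraction * nodes * min_mem_bytes_per_node / 8
    return math.isqrt(int(usable_doubles))


def matrix_memory_fraction(n: int, nodes: int,
                           min_mem_bytes_per_node: int) -> float:
    """Share of the aggregate memory taken by an n x n matrix of doubles."""
    return 8 * n * n / (nodes * min_mem_bytes_per_node)


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes, the smallest of which has 64 GiB.
    nodes, mem = 4, 64 * 2**30
    n = estimate_problem_size(nodes, mem)
    print(n, f"{matrix_memory_fraction(n, nodes, mem):.3f}")
```

The second helper is a quick way to check how much of the aggregate memory a candidate N would occupy before launching a run.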

HPL uses the block size NB for the data distribution as well as for the computational granularity. Small values are better from a load-balancing point of view, but values that are too small might limit performance. A value in the range [32,...,256] is good for modern systems, with larger values being more efficient for larger problem sizes.
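When sweeping over block sizes, it is common practice (an assumption here, not something stated above) to round N down to a multiple of NB so that the matrix divides into whole blocks. The sketch below lists candidate NB values from the range given above, in steps of 32, together with the adjusted N; the helper name, the step and the starting N are hypothetical.

```python
def align_to_block(n: int, nb: int) -> int:
    """Round n down to the nearest multiple of the block size nb."""
    if nb <= 0 or n < nb:
        raise ValueError("need nb > 0 and n >= nb")
    return (n // nb) * nb


if __name__ == "__main__":
    target_n = 160_529  # hypothetical problem size before alignment
    for nb in range(32, 257, 32):  # candidate block sizes 32, 64, ..., 256
        n = align_to_block(target_n, nb)
        print(f"NB={nb:<4} N={n:<7} blocks per dimension={n // nb}")
```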

The process decomposition should be roughly square, with P less than or equal to Q. The product PxQ must match the number of processors/cores that are going to be used for HPL. The most important factor that affects the selection of this parameter is the type of physical network that interconnects the nodes of the cluster. Flat grids like 1x4, 1x8, 2x4, etc. are good for Ethernet-based networks.
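The candidate grids for a given process count are simply its factor pairs. The sketch below enumerates them with the most nearly square shape first and the flattest last; it deliberately returns each pair as (smaller, larger) and leaves the assignment of the two dimensions to P and Q to the guidance in the text. The function name is hypothetical.

```python
import math


def grid_shapes(processes: int) -> list[tuple[int, int]]:
    """Return all (smaller, larger) factor pairs, most square first."""
    if processes <= 0:
        raise ValueError("processes must be positive")
    pairs = [
        (d, processes // d)
        for d in range(1, math.isqrt(processes) + 1)
        if processes % d == 0
    ]
    # A smaller difference between the two dimensions means a squarer grid.
    return sorted(pairs, key=lambda pair: pair[1] - pair[0])


if __name__ == "__main__":
    for count in (8, 16, 24):
        print(count, grid_shapes(count))
```

On a switched, high-bandwidth network the entries at the front of the list are the natural starting point, while the flat shapes at the end are the ones worth trying on Ethernet.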

The Rfact and Pfact values need experimentation, but for very large clusters both parameters can be set to the right-looking variant of the LU factorization.

Once the panel factorization has been computed, this panel of columns is broadcast to the other process columns. There are many possible broadcast algorithms, and the software currently offers 6 variants to choose from. A brief note on some of these:

Algorithm 2: Increasing-ring (modified) is usually the best choice.

Algorithm 4: Increasing-2-ring (modified) is usually a second choice.

Algorithm 5: Long (bandwidth reducing) - this can be a good choice for systems with very fast nodes, but a very slow network.
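In HPL.dat the broadcast variant is selected by an integer on the BCASTs line, and the legend in the sample input file counts from zero (0=1rg, 1=1rM, 2=2rg, 3=2rM, 4=Lng, 5=LnM). Assuming the algorithm numbers in the list above count from one, which is an interpretation rather than something the list states, the following sketch converts between the two; the names are hypothetical.

```python
# Abbreviations as printed in the legend of the BCASTs line of the
# sample HPL.dat, indexed by the integer value the file expects.
BCAST_LEGEND = ("1rg", "1rM", "2rg", "2rM", "Lng", "LnM")


def bcast_value(algorithm_number: int) -> int:
    """Convert a 1-based algorithm number to the 0-based HPL.dat value."""
    if not 1 <= algorithm_number <= len(BCAST_LEGEND):
        raise ValueError("algorithm number must be between 1 and 6")
    return algorithm_number - 1


if __name__ == "__main__":
    for number in (2, 4, 5):  # the three algorithms noted above
        value = bcast_value(number)
        print(f"Algorithm {number}: BCASTs value {value} ({BCAST_LEGEND[value]})")
```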

A final recommendation is to ensure that HPL is not going to use more than 80% of the memory available to the system.

There are also a couple of web-based HPL.dat generators.

HPL requires a distributed-memory system with a tuned implementation of the BLAS library and a working implementation of the MPI standard. The Bright Cluster Manager provides several modules of the OpenBLAS library tuned for different architectures as well as a general-purpose module (openblas/dynamic) that attempts to auto-detect the relevant architecture and use routines optimized for that particular system.