
Hyper-Threading

Edison includes Intel processors with Hyper-Threading Technology. When Hyper-Threading (HT) is enabled, the operating system recognizes each physical core as two logical cores. Each of the two logical cores has resources to store a program state, but they share most of their execution resources. Thus, two independent streams (i.e., processes or threads) can run simultaneously on the same physical core, but at roughly half the speed of a single stream. If a stream running on one of the logical cores stalls, the other stream can recoup the lost cycles by using the execution resources that would otherwise be idle. While HT improves some performance benchmarks by up to 30% [1], its benefits are strongly application dependent. For HPC codes, the consequences of HT also depend on inter-node communication topology and load balance.

For MPI applications, HT can be exploited in several ways:

"Nodes ÷2" A fixed number of MPI processes is packed onto half as many nodes.

"MPI x2" The node count is constant and the number of MPI processes doubles.

"Threads x2" The MPI and node counts are constant, and the number of threads per MPI task is doubled.

Compared to single-stream runs, the "MPI x2" and "Threads x2" runs have the potential to decrease runtimes for a given physical problem. Jobs using "Nodes ÷2" will be significantly slower, but may decrease use charges and increase throughput if the runtime is less than twice the single-stream result. For details about how to run with HT enabled on Edison, please refer to our Running jobs page.
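As a rough illustration of the three use cases and of the break-even condition for "Nodes ÷2", the following Python sketch derives each layout from a single-stream baseline. The names `Layout`, `ht_layouts`, and `charge_ratio` are hypothetical helpers, not part of any NERSC tool. The value of `CORES_PER_NODE` and the model of charges as proportional to nodes times runtime are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Assumption for illustration: physical cores per node. Adjust for the real system.
CORES_PER_NODE = 24


@dataclass
class Layout:
    name: str
    nodes: int
    mpi_tasks: int
    threads_per_task: int

    def streams_per_node(self) -> int:
        """Number of processes/threads placed on each node."""
        return self.mpi_tasks * self.threads_per_task // self.nodes


def ht_layouts(nodes: int, mpi_tasks: int, threads_per_task: int = 1) -> List[Layout]:
    """Return the single-stream baseline followed by the three HT use cases."""
    if nodes % 2:
        raise ValueError("'Nodes ÷2' needs an even node count")
    return [
        Layout("single stream", nodes, mpi_tasks, threads_per_task),
        Layout("Nodes ÷2", nodes // 2, mpi_tasks, threads_per_task),
        Layout("MPI x2", nodes, 2 * mpi_tasks, threads_per_task),
        Layout("Threads x2", nodes, mpi_tasks, 2 * threads_per_task),
    ]


def charge_ratio(runtime_single: float, runtime_half_nodes: float) -> float:
    """Charge of a 'Nodes ÷2' run relative to the single-stream run.

    Assumes the charge is proportional to nodes x runtime, so halving the
    node count pays off whenever the result is below 1.0, i.e. whenever
    the runtime is less than twice the single-stream runtime.
    """
    return 0.5 * runtime_half_nodes / runtime_single


if __name__ == "__main__":
    for layout in ht_layouts(nodes=4, mpi_tasks=4 * CORES_PER_NODE):
        print(f"{layout.name:14s} nodes={layout.nodes} "
              f"tasks={layout.mpi_tasks} threads/task={layout.threads_per_task} "
              f"streams/node={layout.streams_per_node()}")
    # Hypothetical runtimes in seconds, not measurements.
    print(charge_ratio(runtime_single=100.0, runtime_half_nodes=176.0))
```

In every HT layout the number of streams per node is twice the number of physical cores, which is what distinguishes them from the single-stream baseline.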

HT effect on NERSC benchmark codes

Figure 1 compares the HT use cases for several application benchmarks selected to represent the NERSC workload [2]. The most effective use of HT is the "Nodes ÷2" case, which reduces mpp charges by 12% on average. A more detailed analysis of these results can be found in [3].

The simple experiment represented in the preceding figure merely shows that HT can increase throughput at a given MPI concurrency. Because most users care more about time to solution than mpp charges, it is more interesting and useful to examine the influence of HT over a range of concurrencies.

HT effect on selected top application codes at NERSC

We measured the runtime of several top NERSC codes with and without HT, keeping the node count the same for each pair of runs.

Figure 2. The runtime of the VASP code with HT (dual stream) and without HT (single stream) over a range of node counts (strong scaling), using a test case containing 105 atoms. At each node count, the run with HT used twice as many MPI tasks as the run without HT. As shown in Fig. 2, HT slows down VASP at every node count rather than improving performance. The slowdown is about 8% on a single node, grows as the node count increases, and reaches about 50% on eight nodes.

Figure 3. The runtime of the NAMD code with HT (dual stream) and without HT (single stream) over a range of node counts (strong scaling), using the standard STMV benchmark case. At each node count, the run with HT used twice as many MPI tasks as the run without HT. Fig. 3 shows that NAMD runs about 13% faster with HT on one or two nodes, but slows down by more than 40% on 16 nodes.

Figure 4. The runtime of the NWChem code with HT (dual stream) and without HT (single stream) over a range of node counts (strong scaling), using the test case cytosine_ccsd.nw from the NWChem distribution. The behavior is similar to that of NAMD, but HT has less of an effect on this code: the maximum performance gain is around 6%, in the single-node run.

Figure 5. The runtime of the GTC code with HT (dual stream) and without HT (single stream) over a range of node counts, using a NERSC-6 benchmark input slightly modified to run a smaller number of iterations. Similarly, GTC runs around 12% faster with HT on 32 nodes, but the benefit decreases as the node count increases. At around 256 nodes, HT starts to slow the code down.

The results above show that the HT performance benefit is not only application dependent but also concurrency dependent, occurring at the smaller node counts. The fact that the HT benefit region and the parallel sweet spot do not overlap for the major NERSC codes may indicate that HT will have a limited effect on the NERSC workload on Edison. However, a 6-13% performance gain without any code modification is significant. Users are encouraged to explore ways to make use of this performance benefit.
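Users who want to locate the HT benefit region for their own code can time paired runs at each node count and compare them. The Python sketch below is one minimal way to do that. The function names are hypothetical, the runtimes in the example are placeholders rather than measurements, and the definition of percentage change (relative to the single-stream runtime) is an assumption, since other conventions exist.

```python
from typing import Dict, List, Tuple


def ht_runtime_change(t_single: float, t_ht: float) -> float:
    """Percent change in runtime with HT, relative to the single-stream run.

    Negative values mean the HT run finished sooner.
    """
    return 100.0 * (t_ht - t_single) / t_single


def nodes_where_ht_helps(runtimes: Dict[int, Tuple[float, float]]) -> List[int]:
    """Return the node counts at which the HT run was faster.

    `runtimes` maps node count -> (single-stream runtime, HT runtime).
    """
    return sorted(n for n, (t_single, t_ht) in runtimes.items() if t_ht < t_single)


if __name__ == "__main__":
    # Placeholder runtimes in seconds, for illustration only.
    timings = {1: (100.0, 90.0), 2: (52.0, 50.0), 4: (28.0, 30.0)}
    for nodes, (t_single, t_ht) in sorted(timings.items()):
        print(f"{nodes:3d} nodes: {ht_runtime_change(t_single, t_ht):+.1f}%")
    print("HT helps at node counts:", nodes_where_ht_helps(timings))
```

Comparing the resulting node counts with the range where the code normally runs in production shows whether the HT benefit region overlaps the parallel sweet spot.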

The overall HT performance effect is the net result of two competing factors: the higher resource utilization that HT enables and the various overheads that HT introduces into program execution. Our analysis shows that applications with higher cycles per instruction are candidates for HT benefits, although this metric alone is not sufficient to predict the HT effect, because HT cannot address every kind of stall that occurs during program execution. In addition, for HT to yield any performance benefit, low communication overhead and high parallel efficiency (a smaller sequential portion in the code) are necessary. The HT benefits are therefore likely to occur at relatively low node counts within the parallel scaling region of an application, unless the application is embarrassingly parallel. More details of our analysis can be found in [4].