Introduction to Multithreading, Superthreading and Hyperthreading

We took some time to look into simultaneous multithreading (SMT), as hyper- …


Shared resources are at the heart of hyper-threading; they're what makes the technique worthwhile. The more resources that can be shared between logical processors, the more efficient hyper-threading can be at squeezing the maximum amount of computing power out of the minimum amount of die space. One primary class of shared resources consists of the execution units: the integer units, floating-point units, and load-store unit. These units are not SMT-aware, meaning that when they execute instructions they don't know the difference between one thread and the next. An instruction is just an instruction to the execution units, regardless of which thread/logical processor it belongs to.

The same can be said for the register file, another crucial shared resource. The Xeon's 128 microarchitectural general purpose registers (GPRs) and 128 microarchitectural floating-point registers (FPRs) have no idea that the data they're holding belongs to more than one thread--it's all just data to them, and they, like the execution units, remain unchanged from previous iterations of the Xeon core.

Hyper-threading's greatest strength--shared resources--also turns out to be its greatest weakness. Problems arise when one thread monopolizes a crucial resource, like the floating-point unit, and in doing so starves the other thread and causes it to stall. The problem here is the exact same problem that we discussed with cooperative multitasking: one resource hog can ruin things for everyone else. Like a cooperative multitasking OS, the Xeon for the most part depends on each thread to play nicely and to refrain from monopolizing any of its shared resources.

For example, if two floating-point-intensive threads are trying to execute a long series of complex, multi-cycle floating-point instructions on the same physical processor, then depending on the activity of the scheduler and the composition of the scheduling queue, one of the threads could potentially tie up the floating-point unit while the other thread stalls until one of its instructions can make it out of the scheduling queue. On a non-SMT processor, each thread would get its fair share of execution time because at the end of its time-slice it would be swapped off the CPU and the other thread would be swapped onto it. Similarly, with a time-slice multithreaded CPU no one thread can tie up an execution unit for multiple consecutive pipeline stages. The SMT processor, on the other hand, would see a significant decline in performance as each thread contends for valuable but limited execution resources. In such cases, an SMP solution would be far superior, and in the worst of such cases a non-SMT solution would even give better performance.
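To make the starvation scenario concrete, here is a deliberately simplified toy model (not a description of the Xeon's actual issue logic, which is far more sophisticated): two logical processors feed one non-pipelined floating-point unit, and whichever thread wins the unit occupies it for its instruction's full latency while the other thread's work sits and waits. The function and its parameters are invented for illustration.

```python
def simulate_shared_fp_unit(thread_a, thread_b):
    """Toy model of two logical processors sharing one non-pipelined
    FP unit. thread_a and thread_b are lists of instruction latencies
    in cycles. Returns (total_cycles, stall_cycles_per_thread)."""
    queues = [list(thread_a), list(thread_b)]
    stalls = [0, 0]
    busy_until = 0   # cycle at which the FP unit next becomes free
    turn = 0         # which logical processor has issue priority
    cycle = 0
    while queues[0] or queues[1]:
        if cycle >= busy_until:
            # Unit is free: the prioritized thread (or the other one,
            # if the first has no work) holds it for the op's full latency.
            for t in (turn, 1 - turn):
                if queues[t]:
                    busy_until = cycle + queues[t].pop(0)
                    turn = 1 - t
                    break
        else:
            # Unit is occupied: any thread still holding work is stalled.
            for t in (0, 1):
                if queues[t]:
                    stalls[t] += 1
        cycle += 1
    return busy_until, stalls

# A thread of long 10-cycle FP ops running against a thread of
# cheap 1-cycle ops: the light thread spends most of its time stalled.
total, stalls = simulate_shared_fp_unit([10, 10, 10], [1, 1, 1])
# total == 33; stalls == [18, 27] -- the 3-cycle thread waits 27 cycles

# Two well-behaved threads of cheap ops interleave with no stalls at all.
total2, stalls2 = simulate_shared_fp_unit([1, 1, 1], [1, 1, 1])
# total2 == 6; stalls2 == [0, 0]
```

Even in this crude model the asymmetry is stark: the resource hog barely notices the sharing, while the cooperative thread pays almost its entire runtime in stalls.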

The shared resource for which these kinds of contention problems can have the most serious impact on performance is the caching subsystem.

Caching and SMT

For a simultaneously multithreaded processor, the cache coherency problems associated with SMP all but disappear. Both logical processors on an SMT system share the same caches as well as the data in those caches. So if a thread from logical processor 0 wants to read some data that's cached by logical processor 1, it can grab that data directly from the cache without having to snoop another cache located some distance away in order to ensure that it has the most current copy.

However, since both logical processors share the same cache, the prospect of cache conflicts increases, and this increase can seriously degrade performance.

Cache conflicts

You might think that since the Xeon's two logical processors share a single cache, the cache size is effectively halved for each logical processor. If you thought this, though, you'd be wrong: it's both much better and much worse. Let me explain.

Each of the Xeon's caches--the trace cache, L1, L2, and L3--is SMT-unaware, and each treats all loads and stores the same regardless of which logical processor issued the request. So none of the caches know the difference between one logical processor and another, or between code from one thread or another. This means that one executing thread can monopolize virtually the entire cache if it wants to, and the cache, unlike the processor's scheduling queue, has no way of forcing that thread to cooperate intelligently with the other executing thread. The processor itself will continue trying to run both threads, though, issuing fetches from each one. This means that, in a worst-case scenario where the two running threads have two completely different memory reference patterns (i.e. they're accessing two completely different areas of memory and sharing no data at all) the cache will begin thrashing as data for each thread is alternately swapped in and out and bus and cache bandwidth are maxed out.
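The difference between the best case (shared data) and the worst case (disjoint working sets that alias to the same cache lines) can be seen in a toy direct-mapped cache model. This is a sketch with invented names and parameters--real caches are set-associative and far larger--but the aliasing effect it demonstrates is the same one behind SMT cache thrashing.

```python
def count_misses(access_stream, num_sets=64, line_size=64):
    """Count misses in a toy direct-mapped cache for a stream of
    (thread_id, address) accesses. The cache is shared: it does not
    know or care which logical processor issued each access."""
    tags = {}           # set index -> tag of the line currently cached
    misses = 0
    for _tid, addr in access_stream:
        line = addr // line_size
        index = line % num_sets      # which cache set the line maps to
        tag = line // num_sets       # which line currently owns that set
        if tags.get(index) != tag:
            misses += 1
            tags[index] = tag        # evict whatever was there before
    return misses

# Worst case: the two threads walk disjoint 4 KB regions whose lines
# alias to the same sets. Every access evicts the other thread's line.
thrash = [(t, t * 4096 + i * 64)
          for _ in range(2)          # two passes over the working sets
          for i in range(64)
          for t in (0, 1)]
# count_misses(thrash) == 256: all 256 accesses miss

# Best case: both threads read the same shared 4 KB region.
shared = [(t, i * 64)
          for _ in range(2)
          for i in range(64)
          for t in (0, 1)]
# count_misses(shared) == 64: only the cold misses on the first pass
```

With shared data the second thread rides along for free, which is the "much better" half of the story; with disjoint, aliasing working sets the threads evict each other on every single access, which is the "much worse" half.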

It's my suspicion that this kind of cache contention is behind the recent round of benchmarks which show that for some applications SMT performs significantly worse than either SMP or non-SMT implementations within the same processor family. For instance, these benchmarks show the SMT Xeon at a significant disadvantage in the memory-intensive portion of the reviewer's benchmarking suite, which according to our discussion above is to be expected if the benchmarks weren't written explicitly with SMT in mind.

In sum, resource contention is definitely one of the major pitfalls of SMT, and it's the reason why only certain types of applications and certain mixes of applications truly benefit from the technique. With the wrong mix of code, hyper-threading decreases performance, just like it can increase performance with the right mix of code.

Conclusions

Now that you understand the basic theory behind hyper-threading, in a future article on Prescott we'll be able to delve deeper into the specific modifications that Intel made to the Pentium 4's architecture in order to accommodate this new technique. In the meantime, I'll be watching the launch and the subsequent round of benchmarking very closely to see just how much real-world performance hyper-threading is able to bring to the PC. As with SMP, this will ultimately depend on the applications themselves, since multithreaded apps will benefit more from hyper-threading than single-threaded ones. Of course, unlike with SMP there will be an added twist in that real-world performance won't just depend on the applications but on the specific mix of applications being used. This makes it especially hard to predict performance from just looking at the microarchitecture.

The fact that Intel until now has made use of hyper-threading only in its SMP Xeon line is telling. With hyper-threading's pitfalls, it's perhaps better seen as a complement to SMP than as a replacement for it. An SMT-aware OS running on an SMP system knows how to schedule processes at least semi-intelligently between both processors so that resource contention is minimized. In such a system SMT functions to alleviate some of the waste of a single-threaded SMP solution by improving the overall execution efficiency of both processors. In the end, I expect SMT to shine mostly in SMP configurations, while those who use it in a single-CPU system will see very mixed, very application-specific results.