what is the relation between "hardware thread" and "hyperthread"?

One of the Intel TBB webpages states that "a typical Xeon Phi coprocessor has 60 cores, and 4 hyperthreads/core". But this blog from Intel emphasizes that "The Xeon Phi co-processor utilizes multi-threading on each core as a key to masking the latencies inherent in an in-order micro-architecture. This should not be confused with hyper-threading on Xeon processors that exists primarily to more fully feed a dynamic execution engine."

I'm confused with these two conflicting statements. Could anyone explain the difference/similarity between hyperthread and hardware thread?

Besides, the software developer's guide says the MIC has hardware multithreading achieved by replicating the complete architectural state four times (is this the same technique used in Xeon's hyper-threading, where one physical core is seen as two logical cores?), and further, that the MIC implements a "smart" round-robin multithreading. Could you explain the relation between these two multithreading techniques?

When we split hairs over words, it can be confusing, can't it? The TBB documentation is wrong; thank you for pointing that out.

We choose NOT to call the hardware threads on the current Intel Xeon Phi coprocessor (previously known by the code name Knights Corner) "hyper-threads." The most important thing to know is that you'll usually need more Knights Corner threads per core to hit your best performance than you would with hyperthreading. That's consistent with the "highly parallel" optimized nature of an Intel Xeon Phi coprocessor. The difference between these hardware threading techniques is instructive, so I'll try to give an explanation of it that makes sense.

Regardless of what Intel device we talk about, a processing core will have one or more "hardware threads." We use "hardware threads" as a very generic term that refers to multithreading achieved mostly by duplicating thread state and sharing most everything else in a processing core. Multithreading achieved by duplicating most everything, the whole "core," is what multicore and many-core designs are all about. Processors and coprocessors can have both "hardware threads" and lots of cores. "Hyperthreading" is a very specific form of implementing a "hardware thread" that is found only on dynamic (a.k.a. out-of-order) execution engines.

This highlights a difference between the Knights Corner microarchitecture and an Intel Xeon processor microarchitecture. The Knights Corner microarchitecture uses "in order" execution, so the hardware threads do a relatively simple round robin to feed the dual execution pipes of the microarchitecture. In this design, you can execute two vector (SIMD) instructions in parallel, but they need to come from different threads. This is why we advise programmers to use at least two threads per core on Intel Xeon Phi coprocessors. If you do not, the floating point (FP) performance will peak at about half of what is possible. For most programmers, this is simply a matter of making sure OpenMP or TBB uses at least 122 threads on a 61-core device.

Many of us are in the habit of limiting FP-intensive code to threads=cores on hyperthreaded machines. This is because on a hyperthreaded machine we find a microarchitecture with an out-of-order execution engine. In those designs, the full FP potential may be realized with a single thread. Additional threads on any device will put more pressure on caches and ask for more memory bandwidth. If your algorithm is already hitting peak FP usage, additional threads are not helpful unless they help with latency hiding. For the most part, out-of-order execution engines take care of latency hiding in a way an in-order design cannot.

Therefore, with hyperthreading on an Intel Xeon processor you may hit peak performance with threads=cores. With the in-order execution design in the Knights Corner microarchitecture, at least two threads per core are needed to hit peak, and latency hiding is often enhanced with even more threads. Many algorithms find three threads per core is their sweet spot, while others prefer two or four.

In teaching programming for the Intel Xeon Phi coprocessor, we found that it was helpful to speak of this distinction mostly to encourage us all to experiment with how many threads per core serve our applications best. Using OpenMP or TBB, this is as simple as setting a different parameter or environment variable and running several times to compare. No changes to the program itself are needed.

If we are used to always running threads=cores on a hyperthreaded machine, then it is useful to know that Knights Corner is not using hyperthreads and we should (almost always) use at least two threads per core to get the best performance.

That said, today's hyperthreading is much more advanced than it was a decade ago. If we've not ventured to test the performance of our applications with hyperthreads recently, we should consider running some performance tests. If you are surprised how much better it is with hyperthreading than it used to be, please don't tell our marketing people... or they'll want to call them "hyper-thread PRO" or something else I'll have to explain in a future blog. ;)

I hope this clears everything up.

Thank you for pointing out our error in the TBB documents - I'll look to correct them.