Threads or Cores: Which Do You Need?

Anyone contemplating a new computer purchase (for personal use or business) is confronted with new (and confusing) hardware choices. Intel and AMD have done their best to differentiate the x86 architecture as much as possible while retaining compatibility between the two CPUs, but the differences between the two are growing. One key differentiator is hyperthreading; Intel does it, AMD does not. This article explains what that really means, with particular attention to the way different server OSes take advantage (or don’t). Plenty of meaty tech stuff.

It is syntactically a word. And in any case, when an acronym is inflected, the inflections apply to the form of the acronym, not to the underlying words that make it up. It is not a macro that ought to expand; it is its own lexical item.

Then there’s application hyperthreading, wherein an application is written to perform tasks in parallel.

In my youth, we called this just “multithreading”.

“Multi-threading” or “threads” in software is a completely different concept that doesn’t really have much to do with hyper-threading (in the same way that “process” isn’t the same as “processor”).

“Hyper-threading” is a marketing term Intel use. It’s usually called SMT or Simultaneous Multi-Threading by hardware people (in the same way as SMP or Symmetric Multi-Processing is used).

To avoid confusion it’s best to use “logical CPU” instead of “thread” if you’re talking about SMT/hyper-threading. For example, “a physical CPU contains one or more cores, which contain one or more logical CPUs”.
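The suggested naming can be sketched as a toy data model (the class names and counts here are made up for illustration, not any real machine’s topology):

```python
# A physical CPU (package) contains cores; each core exposes one or
# more logical CPUs. With 2-way SMT a core exposes 2 logical CPUs.
from dataclasses import dataclass, field

@dataclass
class Core:
    smt_ways: int = 2          # logical CPUs per core (2 with hyper-threading)

@dataclass
class Package:
    cores: list = field(default_factory=list)

def logical_cpus(packages):
    """Total logical CPUs = sum over all cores of their SMT width."""
    return sum(core.smt_ways for pkg in packages for core in pkg.cores)

# One physical CPU with 4 cores, 2-way SMT each -> 8 logical CPUs.
machine = [Package(cores=[Core(smt_ways=2) for _ in range(4)])]
print(logical_cpus(machine))   # -> 8
```

This is just the counting rule in the quoted sentence: logical CPUs multiply up through cores and packages, which is exactly what confuses schedulers that only see the final number.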

Just realised “Then there’s application hyperthreading, wherein an application is written to perform tasks in parallel.” (and the confusing use of “threads”) isn’t the only mistake in the article.

There’s “Intel dropped the technology when it introduced the Core architecture in 2004 and brought it back with the Nehalem generation in 2008”. This isn’t quite right – Atom (released in June 2008) had hyper-threading before Nehalem (which didn’t launch until November 2008).

Then “The CPU executes one instruction at a time, according to its clock speed.” which is just plain wrong. A modern CPU’s pipeline actually has multiple execution units (e.g. Nehalem’s pipeline has 9 execution units – some for integer instructions, some for SSE, etc), and it’s possible to complete several instructions per cycle.

The article also says “Multithreading does not add genuine parallelism to the processing structure because it’s not executing two threads at once.”. This is wrong because hyper-threading can give genuine parallelism (e.g. one logical CPU executing integer instructions while another logical CPU executes SSE instructions; where they both use different execution units in the same core at the same time).

I should probably add that while AMD’s marketing department probably doesn’t like “hyper-threading” (Intel’s marketing), AMD’s “Bulldozer” CPUs are planned to have something where 2 cores share some execution units to improve utilisation of those execution units. This will confuse things even more – the end result would be that a pair of AMD cores will behave a little bit like separate cores and a little bit like a single core with hyper-threading.

-Brendan

From what I understand of Bulldozer, it only shares the floating point units. This would mean that the cores will perform like full cores until they are FPU-constrained.

It all depends on the programming of the app. Most apps have only recently started to really take advantage of multi-threading and multi-core (even though they have been out for a long time now).

We need to find ways to more easily program more efficient applications. Most programs are written the way they are because it was easiest to write them that way, not because it was the most resource-efficient or optimized. Though, in the last few years, there have been some pretty big releases in the compilers I use (including Intel’s) that have made it much easier to write code that takes advantage of the hardware the way it should.

“At some point, the pipeline may stall. It has to wait for data, or for another hardware component in the computer, whatever. We’re not talking about a hung application; this is a delay of a few milliseconds while data is fetched from RAM. Still, other threads have to wait in a non-hyperthreaded pipeline, so it looks like:

I don’t get a clear picture of how this fits with what they told me about OS scheduling at college. I remember it was something like “when waiting for resources, the process is kicked off the CPU and put in the ‘Blocked’ list”. See this beauty:

“At some point, the pipeline may stall. It has to wait for data, or for another hardware component in the computer, whatever. We’re not talking about a hung application; this is a delay of a few milliseconds while data is fetched from RAM. Still, other threads have to wait in a non-hyperthreaded pipeline, so it looks like:

I don’t get a clear picture of this added to what they told me about OS scheduling at college.

That’s because the article uses confusing terminology (I’m seriously wondering if the article’s author knows the difference between hyper-threading in hardware and multi-threading in software). It’s also a little misleading (“a few milliseconds while data is fetched from RAM” only makes sense if your RAM is *insanely* slow).

A (hopefully) better explanation might be:

If a core is pretending to be 2 logical CPUs, then when the core can’t execute instructions from one logical CPU (e.g. because data needs to be fetched from RAM), it can still execute instructions from the other logical CPU, and the core doesn’t have to just sit there doing nothing…

But from what I read, and I am probably very wrong, it sounds like the virtualization going on uses two parts of the physical CPU for differing instruction sets, but wouldn’t they both need to be utilized in a clock cycle?

But from what I read, and I am probably very wrong, it sounds like the virtualization going on uses two parts of the physical CPU for differing instruction sets, but wouldn’t they both need to be utilized in a clock cycle?

I’m not sure I exactly understand what you’re saying, but you are probably right in that you are probably very wrong.

Here is how I would explain hyper-threading:

A logical CPU (which could be an entire CPU, a core, or a “hyper-thread”, depending on the setup) executes instructions from RAM. Mostly all these instructions can do is access RAM and manipulate CPU registers. So the only state that a logical CPU needs to run is the registers (since RAM is shared). So, a hyper-threading core (which is a physical section of the chip that has one set of the things that execute instructions) has two copies of all the registers, one for each “hyper-thread”. When it is executing a “hyper-thread”, and it has to wait for something slow (such as access to RAM), it can switch to the other set of registers. This, in effect, causes the other “hyper-thread” to run. It constantly switches back and forth so that the core is rarely waiting for something, and thus, performance is improved without adding an entire new core.
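The register-set switching described above can be put in rough numbers with a toy simulation (the cycle counts are made up for illustration; a real core’s behaviour is far more intricate):

```python
# One core, two register sets (two "hyper-threads"). Each thread is a
# list of steps: ("work", n) burns n cycles in the core, ("mem", n)
# waits n cycles for RAM.

def run_single(thread):
    """One thread alone: the core sits idle during every memory wait."""
    return sum(n for _, n in thread)

def run_switching(a, b):
    """Two register sets: on a memory stall, switch to the other thread.
    Best case, every stall in one thread is hidden by work in the other,
    so finish time is bounded by the larger of: total work, or either
    thread run alone."""
    busy = sum(n for kind, n in a + b if kind == "work")
    return max(busy, run_single(a), run_single(b))

t = [("work", 10), ("mem", 100), ("work", 10)]
print(run_single(t) * 2)    # two threads run back to back: 240 cycles
print(run_switching(t, t))  # overlapped on one core: 120 cycles
```

The point of the sketch: the second register set doesn’t make anything faster in isolation, it just lets the core do useful work during cycles that would otherwise be wasted waiting on RAM.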

But from what I read, and I am probably very wrong, it sounds like the virtualization going on uses two parts of the physical CPU for differing instruction sets, but wouldn’t they both need to be utilized in a clock cycle?

:/ This is why I didn’t do EE and did CS instead.

I don’t know about the other virtualization technologies, but with VMware you can have two cores, each with two logical processors (hyperthreading). This scenario often looks like 4 CPUs to an OS, even though there’s only 2 cores. VMware is smart enough to pick two logical processors that are on separate cores when giving CPU time to a virtual machine granted two processors (vCPUs).

They are talking about the processor pipeline. At a very high level, in RISC machines let’s say each instruction takes five cycles to complete. If one stage needs something from memory, the instruction cannot continue because it has no data, and execution is delayed. This is a level below the operating system’s thread scheduler, as it’s the actual CPU(s) making choices about which operations get done.
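The five-cycle pipeline with a memory stall can be sketched as a toy in-order model (the 100-cycle miss penalty is an illustrative number, not a measurement):

```python
# A toy in-order 5-stage pipeline (IF ID EX MEM WB). Without hazards,
# one instruction completes per cycle once the pipeline fills. A load
# whose data misses in cache stalls, and every instruction behind it
# waits too.
STAGES = 5
MISS_PENALTY = 100  # extra cycles to fetch from RAM (made up)

def completion_cycles(instrs):
    """instrs: list of "op" or "load_miss". Returns each finish cycle."""
    finish = []
    cycle = STAGES  # first instruction completes when the pipeline fills
    for op in instrs:
        if op == "load_miss":
            cycle += MISS_PENALTY  # pipeline stalls while RAM responds
        finish.append(cycle)
        cycle += 1
    return finish

print(completion_cycles(["op", "op", "op"]))          # [5, 6, 7]
print(completion_cycles(["op", "load_miss", "op"]))   # [5, 106, 107]
```

Note how the instruction *after* the missing load also finishes 100 cycles late: in a single-threaded in-order pipeline, the bubble propagates to everything behind it, which is exactly the gap hyper-threading tries to fill.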

is that chip manufacturers are selling us products that are over-engineered for most of our needs. Most OSes and applications aren’t written to “really” take advantage of multiple processors, let alone identify and adjust performance for the number of cores present. I chuckle when I hear some guy at a client site talking about buying the newest Core i7 chip running at 3GHz blah blah, so I ask, what are you doing with it, and he says, “gaming man, gaming.” All the power in the world, and we game. Gotta love it.

Every CPU core has what’s called a pipeline. Think of pipelines as the stages in an assembly line, except here the process is the assembly of an application task. At some point, the pipeline may stall. It has to wait for data, or for another hardware component in the computer, whatever. We’re not talking about a hung application; this is a delay of a few milliseconds while data is fetched from RAM. Still, other threads have to wait in a non-hyperthreaded pipeline, so it looks like:

With hyperthreading, when the core’s execution pipeline stalls, the core begins to execute another program that’s waiting to run. Mind you, the first thread is not stopped. If it gets the data it wants, it resumes execution as well.

This is not HT, this is describing SoE (Switch on Event) Multi-Threading where you switch thread of execution when you hit a stall (like a cache miss).

HT, or SMT (Simultaneous Multi-Threading), is about having more than one active thread in the pipeline doing work: in some implementations you do issue instructions from more than one thread (in the Pentium 4’s SMT, instructions in the Trace Cache [already decoded] were tagged with the Thread ID).

SMT, as far as I know, is about this (at least the one Intel uses with HyperThreading):

say you have a CPU which can execute 4 instructions at a time and your main thread has only 1 instruction for you to issue at that time… is it better to just run 3 NOPs (nothing is executed) or to seek work from another thread and issue instructions from it?
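The issue-slot question above can be put in numbers with a toy model (the 4-wide issue width comes from the comment; the instruction counts are made up for illustration):

```python
# Each cycle the core has 4 issue slots. Without SMT, slots the lone
# thread can't fill are wasted as NOPs; with SMT they are filled from
# a second thread.
ISSUE_WIDTH = 4

def cycles_needed(total_instrs, per_cycle):
    """Cycles to drain total_instrs at per_cycle instructions/cycle."""
    return -(-total_instrs // per_cycle)  # ceiling division

# 400 instructions, but the lone thread only ever has 1 ready per cycle:
print(cycles_needed(400, 1))                  # 400 cycles, 3 slots wasted each
# A second thread (another 400 instructions) fills the other 3 slots,
# so the pair drains at the full issue width:
print(cycles_needed(400 + 400, ISSUE_WIDTH))  # 200 cycles
```

That is the whole argument for SMT in one line: when one thread’s instruction-level parallelism can’t saturate the issue width, another thread’s instructions are better than NOPs.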

The kicker comes when the OS scheduler doesn’t differentiate between “logical CPU” and “physical CPU” and treats them both as “physical CPUs” with the same number of integer/fpu/execution units. Then it tries to schedule two threads that use the same physical resources, thus causing them to interfere with each other and actually slow things down.

FreeBSD had this issue in 5.x and 6.x. SCHED_BSD (the default scheduler) had no concept of “logical CPUs” and treated each HT “core” as a complete physical CPU. Enabling HT in the BIOS would actually slow things down when more than one app was running. The recommendation at the time was to disable HT in the BIOS and to run a non-SMP kernel (unless there were multiple physical CPUs in the system, of course).

SCHED_ULE in 7.x gained support for scheduling HT cores.

And SCHED_ULE in 8.x has gained more knowledge about NUMA, SMP, SMT, and all the other fun stuff, allowing it to better schedule threads according to what’s actually available, where it’s running, where the RAM is connected, etc.

Not sure about Linux scheduling, other than it went through the same process, but a lot quicker.

Similarly for Windows XP, which has (I believe) no concept of SMT, and treats an HT core as a full physical CPU. I believe Vista gained support for SMT scheduling, and 7 improved upon it. But don’t know for sure.

If all you want is speed, wallet be damned, then sure, go with the top-end 6-core i7 or Xeon setup, 24 GB of RAM, a GTX 480 and a Fusion I/O Octal PCIe SSD. But your box will cost more than most cars and will heat your house as if you had the exhaust pipe from a Chevy big block dumping into your room.

But if cost is a factor then DAAMIT is your best bet: an 870 or an 890GX paired with an Athlon II X4 620, or a Phenom II X6 1055T if you need the extra grunt, gets you plenty of speed and cores for pretty much any typical day-to-day usage with room to spare.

On the workstation/server side the same applies: start with something like a Supermicro MBD-H8SGL series board as your base and use either a cheap 8-core Opteron 6128 or, if you need the cores, the 12-core Opteron 6168. If I remember correctly you can use standard DDR3 with these single-socket boards to further increase your cost savings, though I would still use ECC RAM if the machine is running something mission-critical for your business, just to be on the cautious side.

Though you could go with any dual-socket G34 board and just get 2x 6128s for a 16-core monster.

As for overclocking, AMD machines OC nicely, but 99% of people don’t even care about it, which means likely half of the readership here won’t bother with an OC as this isn’t a hardware-centric site.

x86 is CISC (complex instruction set computer), so the CPU breaks the CISC instructions down into a series of RISC (reduced instruction set computer) instructions and then reorders them for execution. On each clock one or more RISC instructions may be issued to an available execution path.

The problems arise when data needs to be fetched before the next set of instructions can issue (this causes a bubble/hole in the execution path). The other place where problems happen is when doing a conditional branch (the compare can take up to 7 clocks to get a result before the processor knows what instruction is going to be run next). The data fetch holes can mostly be fixed by intelligent pre-fetching and are not an issue with modern CPUs. The conditional branch is normally handled by making a best guess and being ready to roll back to the guess point once it’s determined that the guess was wrong.
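The guess-and-roll-back arithmetic can be made concrete. Taking the 7-cycle compare latency from the comment as the roll-back penalty (an illustrative figure, not a measured one), the expected cost per branch depends only on prediction accuracy:

```python
# Expected extra cycles per branch when the CPU guesses the direction
# and rolls back on a wrong guess.
PENALTY = 7  # cycles lost on a wrong guess (figure from the comment)

def avg_branch_cost(accuracy):
    """Expected extra cycles per branch at the given prediction accuracy."""
    return (1.0 - accuracy) * PENALTY

print(avg_branch_cost(0.50))  # coin flip: 3.5 cycles per branch
print(avg_branch_cost(0.97))  # modern predictor: ~0.21 cycles per branch
```

This is why modern CPUs speculate past branches rather than switching threads at every one: with prediction accuracy in the high 90s, the average penalty is a fraction of a cycle.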

Intel created HT as a way to solve the conditional branch and bubble problems. Basically all HT does is add a 2nd set of registers to the set of execution units. When a bubble is found, the core issues instructions from the 2nd thread. When it finds a conditional branch, it just switches to the 2nd thread until it knows what the correct path is.

The main problem is that good compiler optimization will reduce the number of bubbles. Thus, the only time the threads switch state is when doing conditional branching; however, this is also something that good compilers try to avoid. This is why HT only gives a 20-30% speed boost; a 2nd core would give you something closer to a 95% boost (you lose some speed to memory bandwidth).

HT is cheaper, but its main drawback is that you get primary threads and secondary threads. The primary will run at full speed and the secondary will run at about 25% of the primary’s speed. Any application that has been optimized for multi-threading will have issues, since it internally breaks the task at hand up into small units and expects each unit to run at the same speed.

Say I need to process 400 transactions and I have 2 HT cores (4 threads), so I create 4 threads and have each thread process 100 transactions. I can’t report the task as finished until all 4 threads end; thus, once the threads are started I have to wait for them all to end before continuing. My actual run time would be the time to process 100 records (the 2 main threads) + the time to process 75 records (25% of the records were processed during the 1st time period, and I’m assuming that the secondary threads become primary threads once the 2 primary threads end). On a 4-core machine the actual run time would be the time to process 100 records.
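The arithmetic in that example can be reproduced directly, under the comment’s own assumptions (25% secondary-thread speed, secondaries promoted when the primaries finish):

```python
# 400 transactions split across 4 software threads on 2 HT cores.
# Time unit = time to process one record on a full-speed thread.
RECORDS_PER_THREAD = 100
SECONDARY_SPEED = 0.25  # the comment's assumed secondary-thread speed

# Phase 1: primaries process all 100 records; secondaries manage 25.
phase1 = RECORDS_PER_THREAD                      # 100 time units
done_by_secondary = phase1 * SECONDARY_SPEED     # 25 records
# Phase 2: secondaries, now at full speed, finish the remaining 75.
phase2 = RECORDS_PER_THREAD - done_by_secondary  # 75 time units

print(phase1 + phase2)      # 175.0 time units on 2 HT cores (4 logical CPUs)
print(RECORDS_PER_THREAD)   # 100 time units on 4 real cores
```

So under these assumptions the evenly-split job takes 1.75x as long on 2 HT cores as on 4 real cores, which is the comment’s point about naive work-splitting on asymmetric threads.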

If the software assumes that each virtual thread is a core and schedules tasks accordingly, then HT can slow things down instead of speeding them up.

If the software is single threaded, then HT will speed things up, because a program getting some time is better than getting no time.

AMD’s Bulldozer is HT, but designed more like IBM’s style instead of Intel’s. IBM’s design is: create 2 instruction decode and issue engines and attach them to the same set of execution paths. If the decode engine can only issue 4 RISC instructions per clock and you have 6 execution paths (2 complex integer, 1 logical integer, 2 simple floating point, 1 complex floating point), why not just have the 2nd decoder issue as many instructions into the unused paths as possible (it can issue up to 4, but in most cases it should be able to issue at least 2)? IBM did this with some of the PPC chips and the secondary thread ran at about 60% of the primary’s speed.

One more thing about your comment on bubbles and using SMT to solve them…

What Intel mentions in that PDF at page 22 is basically a software version of the Scout Threads feature mentioned for Niagara and Rock processors (which they meant to implement in HW and trigger dynamically).

The idea is to help single threaded programs, or programs with low thread level parallelism, by spawning additional threads whose only purpose is to trigger cache misses and pre-fetch data as well as run down both paths of a branch to pre-calculate the target address of both and do work in advance for the main thread.

This does not change the fact that the SMT hardware in Intel’s CPU’s is able to issue instructions to its execution units from more than one thread at once.

x86 is CISC (complex instruction set computer), so the CPU breaks the CISC instructions down into a series of RISC (reduced instruction set computer) instructions and then reorders them for execution. On each clock one or more RISC instructions may be issued to an available execution path.

The problems arise when data needs to be fetched before the next set of instructions can issue (this causes a bubble/hole in the execution path). The other place where problems happen is when doing a conditional branch (the compare can take up to 7 clocks to get a result before the processor knows what instruction is going to be run next). The data fetch holes can mostly be fixed by intelligent pre-fetching and are not an issue with modern CPUs. The conditional branch is normally handled by making a best guess and being ready to roll back to the guess point once it’s determined that the guess was wrong.

Intel created HT as a way to solve the conditional branch and bubble problems. Basically all HT does is add a 2nd set of registers to the set of execution units. When a bubble is found, the core issues instructions from the 2nd thread. When it finds a conditional branch, it just switches to the 2nd thread until it knows what the correct path is.

Again, this seems more like SoE MT and not SMT, which are two different kinds of MT implementation altogether. Switching when a branch is encountered seems odd to me unless you employ something similar to HW scout threads like Niagara does (more on this later), or you are doing predication like Itanium/EPIC does and work on both paths of the conditional branch and resolve it at the end.

What you described kind of sounds like the good ol’ branch delay slot technique: you defer the computation of the branch condition until you are sure the operands are ready (so as not to stall when the branch needs to be processed), but that is mostly used in in-order CPUs that lack the high-90s-percent-success-rate branch predictors Intel CPUs have now.

Still, switching on a bubble (say a data dependency bubble, or a cache miss) is a way to do MT, but it is not the SMT way… SMT seems designed to allow the CPU to exploit parallelism in threads where the current thread does not have enough work to do.

The main problem is that good compiler optimization will reduce the number of bubbles. Thus, the only time the threads switch state is when doing conditional branching; however, this is also something that good compilers try to avoid. This is why HT only gives a 20-30% speed boost; a 2nd core would give you something closer to a 95% boost (you lose some speed to memory bandwidth).

HT is cheaper, but its main drawback is that you get primary threads and secondary threads. The primary will run at full speed and the secondary will run at about 25% of the primary’s speed. Any application that has been optimized for multi-threading will have issues, since it internally breaks the task at hand up into small units and expects each unit to run at the same speed.

Say I need to process 400 transactions and I have 2 HT cores (4 threads), so I create 4 threads and have each thread process 100 transactions. I can’t report the task as finished until all 4 threads end; thus, once the threads are started I have to wait for them all to end before continuing. My actual run time would be the time to process 100 records (the 2 main threads) + the time to process 75 records (25% of the records were processed during the 1st time period, and I’m assuming that the secondary threads become primary threads once the 2 primary threads end). On a 4-core machine the actual run time would be the time to process 100 records.

If the software assumes that each virtual thread is a core and schedules tasks accordingly, then HT can slow things down instead of speeding them up.

If the software is single threaded, then HT will speed things up, because a program getting some time is better than getting no time.

AMD’s Bulldozer is HT, but designed more like IBM’s style instead of Intel’s. IBM’s design is: create 2 instruction decode and issue engines and attach them to the same set of execution paths. If the decode engine can only issue 4 RISC instructions per clock and you have 6 execution paths (2 complex integer, 1 logical integer, 2 simple floating point, 1 complex floating point), why not just have the 2nd decoder issue as many instructions into the unused paths as possible (it can issue up to 4, but in most cases it should be able to issue at least 2)? IBM did this with some of the PPC chips and the secondary thread ran at about 60% of the primary’s speed.

I do not think its primary purpose was HW support for scouting threads (see Niagara), but primarily a way not to leave execution units unfed if the current thread of execution does not contain a high degree of parallelism at the instruction level (ILP) but there are other threads with work that can start at that moment and run in parallel to what the CPU is doing at the moment (TLP or Thread Level Parallelism).

It’s not that simple anymore, because AMD will release their ‘Bulldozer’ CPU next year with two threads per core, but it should be a lot MORE efficient than Intel’s HT, because AMD will put TWO integer units instead of just the ONE in Intel’s solution, which means not a 5-20% increase but 100% for integer operations.

As for the general discussion, I need at least TWO processing CORES; how many THREADS/LOGICAL CORES they run does not matter to me.

Intel implements SMT with its support of HyperThreading while the implementation mentioned in the article, as well as mentioned when we talk about switching threads on bubbles, is SoE MT (Switch on Event MultiThreading).

Which CPUs? How much heat do they generate? Which caches are shared between cores? Would “two dual core CPUs” be ccNUMA (e.g. Opteron)? If it is ccNUMA does that mean you get 2 memory controllers and twice as much RAM bandwidth? What software would it run – is the software affected by caching and/or memory bandwidth (or just raw processing speed)? Does the software optimise memory allocation for ccNUMA?

What about prices (dual-socket motherboards are typically more expensive)?

Latency: with 2 CPU sockets, data shared from socket to socket must be transmitted over the FSB. The same thing happens with the Intel Core 2 Quads, as they are 2 dual-core dies set in the same package, giving you a dual-socket solution in a single socket.

Now all AMD quads and the Intel Core i series quads are real quads: the cores are able to talk to one another directly without having to transmit the data over the FSB.

Sun’s SPARC Niagara can switch to another thread in one clock cycle, so in effect Niagara does not idle when waiting for data from RAM. It works on another thread.

When a common CPU switches threads it takes hundreds of clock cycles. That is one of the reasons a normal x86 server idles about 50% of the time under full load, waiting for data from RAM – according to studies from Intel.

So, a Niagara at 1.4GHz (with 64 threads) can easily beat a POWER6 at 5GHz on multithreaded workloads, because Niagara does not idle – it just works on another thread.

Today’s CPUs are very fast, but RAM is slow. If you have a 10GHz CPU, it will wait 90% of the time for data from RAM. The higher the clock, the more cycles are wasted when the pipeline stalls.
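The fine-grained (“barrel”) interleaving attributed to Niagara above can be sketched with a toy round-robin model; the workload numbers are made up for illustration:

```python
# Each cycle the core issues from the next thread that isn't waiting on
# memory, so a stalled thread costs nothing as long as others are ready.

def barrel_run(threads, cycles):
    """threads: list of dicts with remaining 'work' and 'stalled_until'.
    Returns how many of `cycles` actually did useful work."""
    useful = 0
    for cycle in range(cycles):
        for i in range(len(threads)):           # round-robin, 1-cycle switch
            t = threads[(cycle + i) % len(threads)]
            if t["work"] > 0 and cycle >= t["stalled_until"]:
                t["work"] -= 1
                useful += 1
                break                            # one issue slot per cycle
    return useful

threads = [{"work": 50, "stalled_until": 0},
           {"work": 50, "stalled_until": 40}]   # second thread stalls early on
print(barrel_run(threads, 100))  # -> 100: every cycle found a ready thread
```

With enough threads, some thread is almost always ready, so the core’s issue slot rarely goes to waste – that is the throughput-over-latency trade Niagara makes.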

There is no such thing as “one cycle” anymore. If you wanted to switch context instantaneously, you’d have to implement something equivalent to HT. And what HT does is not really a context switch – it’s a hardware-level emulation of two CPUs. From the system perspective, it looks like two CPUs concurrently executing two threads. Except that it is really a single core, executing a single thread, and maybe another thread will have a chance to proceed if the first one happens not to be optimized enough. True in the case of “regular” processes, very likely false in the case of codecs, numerical procedures etc.

This mechanism interferes with the system scheduler by pretending to do something else than it really does. In certain scenarios (high load with number crunching apps) this may compromise stability of the whole system.

Just to put it in context, there were several stages of CPU development, each marked with different performance characteristics:

Stage 1. CPU. Ages ago, code execution speed was directly related to the CPU clock frequency and to the number of clock cycles per instruction. Memory access was either not a concern at all, or it could be easily alleviated by adding a simple cache. At the same time Moore’s law enabled us to scale clock frequencies and transistor densities exponentially. So far so good. Speeding up the CPU was the solution.

Stage 2. Going super-scalar. So we scaled the operating frequencies up, added some system level techniques like pipelining and additional caches to boost the CPU performance even more, and then we hit a problem – a huge difference between a clock cycle duration and memory latency. When executing typical application code the performance is no longer defined by CPU performance but by memory access delay. The CPU works in bursts – 10 clock cycles of program execution, 100 cycles of waiting for data. Many more or less speculative techniques (super-scalar architecture) were developed to predict the data so that the CPU can continue to run, with the hope that it will not have to discard these data all too often. This mechanism, although quite effective, is of course far from perfect. This is the place where HT fits in, by allowing another thread to execute when the original is waiting for memory (at the hardware level).
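Putting numbers on the “10 cycles of execution, 100 cycles of waiting” burst pattern, and on how a second hardware thread fills the gap (both figures from the paragraph above; the two-thread overlap is an idealized best case):

```python
# Utilization of a core that alternates short compute bursts with long
# memory waits, with and without a second hardware thread.
BURST = 10    # cycles of useful work per memory access (from the text)
WAIT = 100    # cycles stalled waiting on memory (from the text)

single = BURST / (BURST + WAIT)
# With two threads, one thread's 100-cycle wait can hide the other's
# 10-cycle burst, so up to 2 bursts fit in each 110-cycle period.
dual = min(1.0, 2 * BURST / (BURST + WAIT))

print(f"{single:.0%}")  # ~9% of cycles doing work
print(f"{dual:.0%}")    # ~18% with a second hardware thread
```

Even idealized, two threads only double a small number – which is why HT helps memory-bound bursty code a lot in relative terms, yet the core still spends most of its cycles waiting.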

Stage 3. Copy-paste (multi-cores). All these techniques, combined with pipelining, made it increasingly difficult to increase the clock frequency of the CPU. Design complexity, size, power consumption – all scale much worse than linearly with frequency. This is the primary reason why multi-cores were proposed (and why Pentium 4 was a dead end). Except for I/Os, memory access circuits, etc., it’s just like taking a CPU and copy-pasting it a few times. If supported by the OS and applications, they give a close to linear improvement of performance vs. circuit complexity and power consumption increase.

The cost of multi-threading and HT is the software interface. Most applications are still written as sequential code, and many algorithms can’t be implemented in parallel. However, where multi-threading actually gives measurable and predictable speedup, HT complicates the already overcomplicated bottleneck part of the CPU, and its overall speedup can be as low as 0 if the code executed by the CPU happens to be well optimized (so that it doesn’t stall between memory accesses).

Also, elaborate caching schemes generally don’t play well with multiple CPUs. It’s less of a problem with multi-cores than it used to be with multiple standalone CPUs, but nevertheless, every time we write something to memory we must make sure that all shadow copies of this memory cell are updated before they make their way into the CPU pipeline.

Stage 4. Performance/watt. System performance is now defined not by frequency, not by memory latency, but by power consumption (and sadly enough that’s a physical, not a design, limitation). As the fabrication process scales down, single transistors become less power hungry (~linearly) but work at slightly higher frequency, and there are many more of them (~quadratically). So power consumption is bound to deteriorate unless we start using most of the chip area for cache (which beyond some point doesn’t improve performance significantly).

Although single-threaded performance of the super-scalar architecture is still the best, it’s slowly becoming a niche application. Many highly parallel applications simply don’t care about it at all. They just want to get maximum data throughput without frying the CPU. Similarly, in the embedded space, battery life is everything, plus these platforms usually come with some specialized hardware anyway. In these applications, the best way to build a CPU is to use a basic (and thus small and low power) core without any super-scalar extensions (or with just basic ones like static branch prediction etc.) and put many of them on the same chip with some extensive power and clock gating. That’s more in line with what modern GPUs are nowadays, and this pattern is likely to spread even further as performance/watt becomes important.