Cores is Cores

CPU cores are all made the same, right? Hyper-Threading is just a fancy way of saying “Push the turbo button harder!”

Actually, Wikipedia informs me that I’m wrong and Hyper-Threading is a fancy (and trademarked) way of saying “You can do more than one thing on a core at the same time because computers are a pack of lies.”

I can assure you, these are not the same core dog.

The General Idea

Many people seem to operate under the assumption that any one core is as good as any other core and since two cores are better than one, why doesn’t my CPU with 4 physical cores run twice as fast when I turn on Hyper-Threading?

Because physics.

Hyper-Threading works by taking advantage of the idea that computers are usually off doing other things – waiting for storage, waiting for the network, waiting for RAM, waiting for you to click “Buy it now” on 10 pounds of socks. So while the CPU is waiting, it goes off and does something else, like sending your credit card numbers to hackers.

Hyper-Threading exists because computers spend most of their time doing nothing, so they might as well try to be productive.

What’s That Mean For Us?

For most workloads, Hyper-Threading is great. You’re usually waiting on storage, so you might as well go ahead and send those credit card numbers off elsewhere. For CPU intensive workloads, you have to use your brain a little bit and say “Wait a minute, if I can scale linearly to the number of physical cores, what happens when I’m pretending I have more cores than I really have?”

Since this is computers and not the global financial sector, circa 2007, you hit a performance cliff. When CPU is your bottleneck, faking it won’t make anything faster.

How can I say all of this so jovially? Because I broke my computer, that’s why.

Oh Crap, He Wrote Code!

That’s right, I wrote code. I wrote a program that I call The HyperThreader. It’s dumb as a brick – it counts from 1 to 10E8 and then computes the square root of that number. This is a CPU intensive workload, no disks were harmed. The program then does the same thing but across 6 workers (the number of physical cores I have) and then again across 12 workers (the number of logical cores I have).
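For anyone who wants to try something similar, here is a minimal sketch of that kind of test. It is not the original HyperThreader, which ran on .NET. It is a Python approximation that uses processes rather than threads so the workers genuinely run in parallel. The names `count_and_sqrt` and `average_ms`, the core counts, and the reduced loop size are all assumptions made for illustration.

```python
import math
import time
from multiprocessing import Pool

# Assumed for illustration: 6 physical and 12 logical cores, as in the post.
PHYSICAL_CORES = 6
LOGICAL_CORES = 12
# The post counts to 10E8. Pure Python is much slower per iteration, so this
# sketch uses a far smaller count to keep each run short.
COUNT_TO = 2_000_000


def count_and_sqrt(n):
    """Count up to n one step at a time, then take the square root."""
    total = 0
    for _ in range(n):
        total += 1
    return math.sqrt(total)


def average_ms(workers, n=COUNT_TO, repeats=3):
    """Run `workers` copies of the task at once; return mean wall time in ms."""
    timings = []
    with Pool(processes=workers) as pool:
        for _ in range(repeats):
            start = time.perf_counter()
            # One full task per worker, so ideal scaling keeps wall time flat.
            pool.map(count_and_sqrt, [n] * workers)
            timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    for workers in (1, PHYSICAL_CORES, LOGICAL_CORES):
        print(f"{workers} workers: average time of {average_ms(workers):.3f}ms")
```

Each worker runs the whole count, so if every worker had a core to itself the wall time would stay roughly flat as workers are added. Absolute numbers from this sketch will not match the .NET figures.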

1 thread – average time of 1151.857ms

6 threads – average time of 1194.262ms

So far so good. Execution time isn’t really changing, each task is off wandering around on its own processor core. We can account for the 40ms difference between these two because I was playing Paula Abdul’s greatest hits in the background.

12 threads – average time of 1831.81ms

Since this isn’t twice as slow, I’m going to assume that I’m not using all of my CPUs on each task (something could probably be more efficient), but this leads me to my conclusion…

IT’S ALL FILTHY DIRTY LIES!

This is where people usually get tripped up. Execution gets around 53% slower than the 6 thread run when I start pretending I have resources available. Windows, and the .NET Framework, do their best to pretend that I have resources available. But the fact is that I don’t. I only have 6 physical cores, so each pair of threads has to share one. If resources were still available, the average execution time would be closer to what we saw with only 1 thread.
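The slowdown figure can be checked directly from the three averages reported earlier. This is just the arithmetic as a small sketch; the helper name `percent_slower` is made up for illustration.

```python
def percent_slower(baseline_ms, observed_ms):
    """How much longer observed_ms is than baseline_ms, as a percentage."""
    return (observed_ms / baseline_ms - 1) * 100


# Averages reported in the post.
ONE_THREAD_MS = 1151.857
SIX_THREADS_MS = 1194.262
TWELVE_THREADS_MS = 1831.81

vs_six = percent_slower(SIX_THREADS_MS, TWELVE_THREADS_MS)
vs_one = percent_slower(ONE_THREAD_MS, TWELVE_THREADS_MS)

print(f"12 threads vs 6 threads: {vs_six:.1f}% slower")
print(f"12 threads vs 1 thread: {vs_one:.1f}% slower")
```

The 53% figure is measured against the 6 thread run. Against the single thread run the gap is a little wider.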

If you’re wondering why your SQL Server In-Memory OLTP demo doesn’t scale beyond the number of physical cores, now you know – because you can’t imagine performance out of nothing. That’s like saying “This 4 cylinder car can haul a family of 4, so to take the extended family out and about, I need a V12” and then rushing out to buy a supercar with only 2 seats.

Hat tip to Josh Bush and Dave Liebers for eyeballing the code to make sure it did what it claimed.

You get some performance out of the CPU stalls that take place when there is a last-level cache miss, which costs roughly 200 CPU cycles, potentially more if you have TLB misses on top, so you will get ‘something’ from OLTP workloads. In tests conducted by HP (according to Joe Chang), the first hyper-thread to hit the core gets roughly 70% of the core’s capacity, and the second thread may get the rest. And whilst you are accessing main memory you are still burning CPU cycles, but these are dead cycles. If you could write the perfect code that accesses a data structure that is laid out sequentially in a . . . sequential manner, and nothing else comes along to pollute the CPU cache, you would get no dead cycles; however, this is practically unheard of. You can write code that is efficient in terms of minimizing CPU stalls, but not code that is entirely CPU stall free.

I can’t see your code; however, by the sounds of things it’s likely (based on the superficial explanation of what it’s doing) that it’s very efficient in terms of avoiding CPU stalls, which is why you are seeing little advantage from hyper-threading. Now if it was probing hash tables or performing sorts or pointer chasing, you would be seeing CPU stalls aplenty and you would get some benefit from hyper-threading.

And you’re right – the code is intended to be efficient. As most of the code in a database is going to be. The purpose is to provide a simple example as to why Hyper-Threading doesn’t provide the type of improvement that many people think it will.

As far as HT “sharing” a core’s power, I don’t think it’s quite as simple as “Billy got here first, so Billy gets most of the power.” And I don’t think it’s fair to use that as an explanation since that’s close to what goes on. From messing around with other tooling, you can see an increase in throughput but a corresponding increase in latency as CPU stalls and cache misses increase with an increased workload. I’ve specifically seen this called out in the Oracle world – HT increases the number of active connections available, but latency to serve queries also increases.
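The numbers from the post hint at the same throughput versus latency trade. The sketch below assumes each worker ran the full count on its own, which the flat 1 thread and 6 thread timings suggest but the post does not state outright. The helper `tasks_per_second` is a made-up name for illustration.

```python
def tasks_per_second(tasks, wall_ms):
    """Completed tasks per second, given the wall time for the whole batch."""
    return tasks / (wall_ms / 1000)


# Averages reported in the post. Assumption: one full task per worker.
SIX_THREADS_MS = 1194.262
TWELVE_THREADS_MS = 1831.81

six = tasks_per_second(6, SIX_THREADS_MS)
twelve = tasks_per_second(12, TWELVE_THREADS_MS)

throughput_gain_pct = (twelve / six - 1) * 100
latency_increase_pct = (TWELVE_THREADS_MS / SIX_THREADS_MS - 1) * 100

print(f"Throughput change going from 6 to 12 workers: {throughput_gain_pct:+.1f}%")
print(f"Per-task latency change: {latency_increase_pct:+.1f}%")
```

Under that assumption, each task takes longer with 12 workers, yet more tasks finish per second overall, which is consistent with the pattern described for Oracle.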

Doing parallel thread timing is tricky in .NET. You are so many layers above the actual CPU calls that framework overhead or quirks of optimizations with any particular test workload often end up making a bigger difference than you expect. (Examples: Thread switching to bundle WhenAll result, IL/compiler + JIT optimizations, running on RyuJIT or legacy 4.5 JIT, memory locality, any GCs during the runs, was the assembly NGEN’d, size of registers used in the IL bytecode, etc.)

Besides, I’m not sure what the point of the article was. The article lays out a very specific, dubious performance test, then your conclusions seem based on a large gap of extrapolation to make some fairly broad performance generalizations from a very narrow example. And there isn’t any action or advice revealed; just that multiple cores don’t mean faster performance in every case. Multiple cores help many workloads, but not every workload. So what is the user supposed to do after they read this article? Buy a single core machine or bring all their cloud instances down to as few cores as possible to save cost? But then their performance most likely will suffer if they never test. Wouldn’t a better conclusion be to advise the user to test their performance in multiple server configurations to discover where the sweet spot is for their workload? 2, 4, 8+ cores? How much memory? How much IO availability? Network performance. Etc.