Hyper-Threading Performance

The Hyper Threading technology returned to the Intel Xeon (and Core-ix) lines with the processor codenamed Nehalem in 2009.
It was first introduced with Willamette in late 2000 (or thereabouts), and with the first Pentium 4 based Xeon in 2001.
The Core 2 architecture of 2006, used in the Xeon 5300 and 5400 series, did not have this feature.
As with many (potentially) performance enhancing technologies and features,
there is a glaring absence of meaningful detailed performance data that might be of help to developers and information technology professionals.
And it is not as if Intel and Microsoft do not have the budget for this.
(Note: I am deliberately using both Hyper-Threading and Hyper Threading because it impacts search.)

The complex interaction between software and advanced hardware inherently implies that
there is no magic that always improves performance without risk of negative consequences.
My impression is that marketing people squash anything that could be interpreted as negative,
because they want victims to believe that everything will be wonderful.

On our side, we should expect that it takes talent and years of experience to build
up the skills necessary to command professional salaries.
So we need to see both the good and the bad in order to determine
the best strategy to realize the benefits of the positives while
containing the impact of any deficiencies.
If that were not so, why staff our positions with someone more expensive than necessary?

Hyper-Threading from Nehalem to Sandy Bridge

Nehalem and later Xeon Core-ix based processors implement temporal multi-threading,
while the older Pentium 4 based processors had simultaneous multi-threading.
The details of both types are covered in the references at the end and will not be repeated here.
I have a brief discussion on HT in
Processor Architectures and History.

When the Nehalem-EX (Xeon 7500 series) was near introduction,
the HP benchmark team suggested that the impact of HT was in the range of 30% improvement
for the OLTP oriented benchmarks and 10% for the DW/DSS benchmarks.
I presumed that this was on the 8-socket ProLiant DL980 as benchmarks for the ProLiant 300 and 500 systems
were handled by a different group.
My expectation is that HT should have greater impact on the 1 to 4 socket systems,
because it is much more difficult to achieve performance gains on the more complex systems.

A few years ago, I did an assessment of a proprietary database engine
that only did parallel b-tree searches, without locking.
Scaling was nearly linear with the number of threads over both physical cores and logical processors.
This was not unexpected: pointer-chasing code means the core
executes serialized memory accesses, so there should be plenty of dead cycles for
two logical processors to share.
In fact, 4 threads (logical processors) per core might be even better.
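
The dependency chain in pointer chasing can be sketched as follows. This is an illustrative Python sketch (not the proprietary engine's code): each node access depends on the result of the previous load, so the memory latencies of one search cannot overlap, and a second logical processor can use those stall cycles for its own search.

```python
# Illustrative sketch of pointer-chasing search: every step depends on
# the previous load (node = node.next), so the accesses are serialized.
class Node:
    __slots__ = ("key", "next")
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt

def build_list(keys):
    # Build a singly linked list from the keys.
    head = None
    for k in reversed(list(keys)):
        head = Node(k, head)
    return head

def search(head, key):
    # Serialized dependent loads: the fetch of node.next must complete
    # before the following node can be examined.
    node = head
    while node is not None:
        if node.key == key:
            return True
        node = node.next
    return False

head = build_list(range(100))
print(search(head, 42))   # True
print(search(head, 999))  # False
```

On real hardware each `node.next` dereference is a cache or memory access the core must wait on; a second thread on the same core has independent work to issue during those waits.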

HT in Small and Large Queries

Ideally we should investigate HT for both high-volume small queries (non-parallel execution plans)
and low-volume large queries with parallel execution plans.
Unfortunately my multi-threaded test harness is old and I do not have time to bring it up to date.

Recently I did work on a clustered system with two nodes.
For whatever reason, one had HT enabled and the other had HT disabled.
Periodically, the active node would be switched to allow for patching.
This provided invaluable data on HT performance.

The average SQL statement on this system is 1-2 CPU-ms, comparable to TPC-E.
HT was off until mid-September, on until October, off until early November,
on until mid-January and off until beginning of March.
Certain SQL statements are not impacted by HT,
and some SQL could be more than 50% faster with HT on,
even queries without a parallel execution plan.
The overall steady CPU load on the system is typically 25% during the day.

For large queries, I used the TPC-H data generator at scale factor 100 and the set of 22 TPC-H queries.
The LineItem table is 80GB because of the use of the 4-byte date data type instead of the 8-byte datetime.
The total database is 140GB.

Test System

The test server is a Dell PowerEdge T110 II with an Intel Xeon E3-1240 processor (quad-core, 3.3GHz),
32GB memory, and an LSI MegaRAID SAS-9260 controller connected to 8 SATA 6Gbps SSDs.
There are 2 Crucial m4, 2 OCZ Vertex 3, 3 OCZ Vertex 3 Max IOPS and 1 OCZ Vector.
The T110 only has PCI-E gen 2 slots, so I will need to get a new system for gen 3 testing.
Below is a representation of the test system.

Hyper-Threading can be enabled or disabled in the UEFI (Unified Extensible Firmware Interface, the successor to the BIOS).
Dell also allows setting the number of cores to enable.
I think this is most useful in light of SQL Server per-core licensing:
we can buy a single model for many functions, then enable the correct number of cores to balance performance and licensing cost.

The SQL Server DATE_CORRELATION_OPTIMIZATION setting was turned on.
It so happens that correlated dates within the TPC-H tables are within 120-days
of each other. Ship date is within 120 days of order date and so on.
The TPC-H queries are written citing a greater than condition on one date column
and a less than condition on a correlated column, so this setting helps.
If the correlation is not tight, then the setting may not help.
In practice, it might be better to just explicitly specify both upper and lower bounds on
the cluster key column whenever possible.

Several settings were tried on the LSI SAS-9260. No read-ahead was better than read-ahead,
and 128K stripe size was better than 64K, but the results below are with no read-ahead and 64K
stripe size.

While the TPC-H scale factor 100 data set was used, the indexes were built differently from
TPC-H requirements, as I am not presenting official TPC-H results.
The significant difference is that the clustered index key on LineItem
is Shipdate+OrderKey+LineNumber, specified as unique.
This follows my practice of building unique clustered indexes;
it was not done to eliminate key lookups in the execution plan.

Hyper-Threading On versus Off

In all tests, only the execution time and CPU are reported.
Statistics generation and compile times were excluded.
Each test of the 22 TPC-H queries runs from clean buffers and a clean procedure cache.
From a freshly populated database, the statistics generation time is about 40 sec.
The compile time with statistics previously generated is about 4-5 sec for all 22 queries.
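
The measurement procedure above can be sketched as a small harness. This is a hedged illustration, not my actual test script: the query runner is injected as a callable, so in a real run it would execute T-SQL (including the `DBCC DROPCLEANBUFFERS` and `DBCC FREEPROCCACHE` cache-clearing commands) through a client library such as pyodbc.

```python
import time

def timed(fn, *args):
    # Wall-clock timing of a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def measure_queries(run_sql, queries):
    # Clear the buffer pool and procedure cache, then time each query.
    # run_sql is any callable that executes one SQL batch.
    run_sql("DBCC DROPCLEANBUFFERS")   # clean buffer cache
    run_sql("DBCC FREEPROCCACHE")      # clean procedure cache
    elapsed = {}
    for name, sql in queries.items():
        _, seconds = timed(run_sql, sql)
        elapsed[name] = seconds
    return elapsed

# Example with a stub runner that just records the batches it was given:
log = []
times = measure_queries(log.append, {"Q1": "SELECT ...", "Q6": "SELECT ..."})
print(sorted(times))  # ['Q1', 'Q6']
```

Note that this captures elapsed time only; separating out statistics generation and compile time, as done in the results below, requires the server-side counters.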

CPU and elapsed time were collected with SET STATISTICS TIME ON.
The DMV dm_exec_query_stats reports exactly the same elapsed time, but the worker time
it reports is higher, perhaps because it includes the compile time?

Below are the elapsed times in milliseconds to execute each of the 22 TPC-H queries with Hyper-Threading off and on.
There was no restriction on Max degree of parallelism so HT off is DOP 4 and HT on is DOP 8
(4 physical cores).

Below are the CPU (worker) times in ms for the same queries with HT off and on.

The sum total execution elapsed time for 22 queries is 692 sec with HT off and 559 sec with HT on.
Overall, the 22 queries run 24% faster with HT.
On the individual queries, all but one range from 0 to 65% faster.
The exception, query 15, runs 4 times faster with HT on instead of off.
Below is the elapsed time for HT off relative to HT on.

The sum total execution CPU time is 2253 sec with HT off and 3374 sec with HT on.
Overall, CPU is 50% higher with HT on.
Of course, there are twice as many available "CPU-sec" with HT on than off,
so this is not a negative.
With a perfect parallel execution plan and infinitely fast IO, all (logical) cores
should be at 100% during the query execution.

On the individual queries, CPU on all but Q2 and Q15 ranges from 22 to 80% higher with HT on.
On Q2, CPU is 137% higher.
On Q15, CPU was 45% lower with HT on than off, reflecting the 4X speedup with double the logical processors.
Below is the CPU time for HT on relative to HT off.
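
The headline numbers can be recomputed directly from the totals stated above (elapsed 692 sec HT off vs 559 sec HT on; CPU 2253 sec vs 3374 sec):

```python
# Figures are the totals reported in the text above.
elapsed_off, elapsed_on = 692.0, 559.0   # seconds, sum of 22 queries
cpu_off, cpu_on = 2253.0, 3374.0         # CPU (worker) seconds

speedup = elapsed_off / elapsed_on - 1   # fractional speedup with HT on
cpu_overhead = cpu_on / cpu_off - 1      # fractional extra CPU with HT on

print(f"HT on is {speedup:.0%} faster overall")        # 24%
print(f"CPU time is {cpu_overhead:.0%} higher with HT on")  # 50%
```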

There is just a slight difference in the execution plans for Q15, shown below.
The first is with HT off and the second with HT on.

I do expect HT to improve the performance of this query,
but it is probably reasonable to believe that most of the 4X gain in elapsed
time is attributable to the difference in execution plans.
I have no idea why the Query Optimizer would not use the same plan in both cases.
Perhaps someone from the Microsoft QO team can help?

For anyone interested, I have provided the actual execution plans for the above
as HT off and HT on sqlplan files.
(Too bad the web browser does not render sqlplan files!)

Hyper-Threading Off and On at DOP 1

Below is elapsed time with HT off and on at degree of parallelism 1.
Amazingly, performance is better with HT on at DOP 1, 20% faster!

How can this be?
When this is the only activity, only one of the two logical processors
on one of the four cores is in use.
The performance should be very similar to single-threaded execution with HT off?
The range is 0 to 63% faster, except Query 17, which is 14% slower with HT on.

Below is CPU time with HT off and on at DOP 1.

Below is the relative elapsed time with HT off to HT on.

Query 1 is a simple table access aggregating 591M rows over nearly 80GB,
so there can be no difference in execution plan,
and yet it is 48% faster with HT on?
If someone knows how this can be, please advise?

I could speculate that with HT on, there are twice as many logical processors,
so it is less likely that an OS task would disrupt the SQL thread handling the running query.
Perhaps it so happened that the thread handling the SQL query and the thread handling IO were
on the same physical core, so data was already in L2?
These are just wild guesses.

In the comments below, Chris A mentions that SQL Server 2012 tries to fit batches
into L2 cache.
What if it estimates the amount of available L2 too high?
Then with HT on, it would presume only half of that to be available.
But with nothing running on the other LP, single-threaded execution with HT on
now runs better than with HT off?

I like this hypothesis, but the HT-on benefit occurs on SQL Server 2008 R2 too. It's never the theories that we like!

Hyper-Threading Off and On at DOP 2

Below is elapsed time for HT off and on at DOP 2.
Most of the individual queries are within 12% of each other, faster or slower,
except for Query 1, which is 28% faster with HT on.
Overall HT on is 2% faster, basically noise.

Below is CPU time for HT off and on at DOP 2.

Hyper-Threading Off and On at DOP 4

Below is elapsed time for HT off and on at DOP 4.
The range with HT on is 2.5X slower (Query 6) to 90% faster (Query 15).
Overall, it is 13% slower with HT on.
It is possible that both logical processors are used on certain cores
instead of a single logical processor on each physical core.
The same may have happened at DOP 2.

Below is CPU time for HT off and on at DOP 4.

Hyper-Threading Off, DOP 1, 2 and 4

Below is elapsed time with HT off at DOP 1, 2 and 4.
Notice that scaling is sometimes better than 2X from DOP 1 to 2.
I discussed this super-scaling effect in
Tpch Studies at SF10.

Below is CPU time for HT off at DOP 1, 2 and 4.

Below is the speedup from DOP 1 to 2 and from DOP 2 to 4,
that is, the ratio of elapsed time at DOP 1 over DOP 2, and at DOP 2 over DOP 4.
The conventional expectation is that scaling should be close to linear from DOP 1 to 2,
for a value of just under 2 on the graph.
From 2 to 4, scaling should still be good, but less than from 1 to 2.
If the test system had more cores, there should be reduced gains in scaling to the higher DOP values.
Query 10 is an example.
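
The scaling metric can be sketched as follows. The timings here are made-up illustrations, not measured values; the point is the shape of the ratio:

```python
def scaling(elapsed_by_dop):
    # Ratio of elapsed time at the lower DOP over the next higher DOP:
    # just under 2 means near-linear scaling, above 2 is super-scaling.
    dops = sorted(elapsed_by_dop)
    return {f"DOP {a}->{b}": elapsed_by_dop[a] / elapsed_by_dop[b]
            for a, b in zip(dops, dops[1:])}

# Near-linear scaling: each doubling of DOP roughly halves elapsed time.
print(scaling({1: 100.0, 2: 52.0, 4: 28.0}))
# Super-scaling: the parallel plan is more efficient than the
# non-parallel plan, so the DOP 1->2 ratio exceeds 2.
print(scaling({1: 100.0, 2: 45.0, 4: 24.0}))
```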

As it turns out, the behavior is all over the map of possibilities.
One effect is super-scaling where the parallel execution plan is actually more efficient than the non-parallel plan.
Four queries (2,11,13 and 18) actually exhibit super-scaling at both DOP 1-2 and DOP 2-4.

Another effect is that the parallel plan is much less efficient than the non-parallel plan, but otherwise scales with parallelism.
In this case, we would expect scaling from 1 to 2 to be poor, but good from 2 to 4.
Examples are queries 6 and 15.

In addition, because of the SQL Server cost formulas, we can expect completely different
execution plans and characteristics at different degrees of parallelism.

Below is the ratio of worker time at DOP 2 over 1, and DOP 4 over 2.
The conventional expectation is progressively higher CPU with increasing DOP to account for the overhead of coordinating the threads.

But as with elapsed time, there are radical variations in behavior.
Per the super-scaling effect, CPU at DOP 2 is actually lower than at DOP 1 in many cases.

Hyper-Threading On, DOP 1, 2, 4 and 8

Below is elapsed time with HT on at DOP 1, 2, 4 and 8.
Notice the more predictable behavior of queries running faster at progressively higher DOP,
even when the number of threads exceeds the number of physical cores.
The exception is query 15.

Below is CPU time with HT on at DOP 1, 2, 4 and 8.
Again, the expected behavior of progressively higher CPU with increasing DOP,
except from DOP 1 to 2, where the parallel execution plan is more efficient than the non-parallel plan.

Below is the speedup in elapsed time: the ratio at DOP 1 over DOP 2, DOP 2 over DOP 4, and DOP 4 over DOP 8.

Below is the ratio of CPU time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4.

HT Summary

Overall, Hyper-Threading is a significant positive for performance.
There are unexpected characteristics, so it is important to collect sufficient performance data
for HT off and on, so that anomalies can be investigated.
It is also important to have data on query performance at a range of DOP values.
In a production system, one should also consider the strategy for max degree of parallelism
(and cost threshold for parallelism),
factoring in the number of concurrently running queries (with and without parallel execution plans).

The premise is that HT performance gains on SQL Server are heavily impacted by the locking mechanism.
If so, the next version of SQL Server, which is expected to make extensive use of lock-free data structures,
should see more substantial performance gains with HT.
This should also prompt Intel to support more logical processors on each physical core,
similar to the 4 threads per core on IBM POWER.

See also
Queries barely over the Cost Threshold for Parallelism
and
Parallelism Strategy and Comments.
In the first, I mentioned that a parallel execution plan with an intermediate DOP may allocate
both logical processors on one core, instead of distributing threads one per physical core.
Even if using both logical processors on one core produces a 30% throughput gain,
that implies each logical processor runs at 65% of the speed of one LP active with the other idle.
The threads on the doubled-up core would then take roughly 50% longer than the threads on cores with only
one active LP.
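
The arithmetic behind that claim, assuming (as stated above) that two LPs sharing one core yield 1.3 times the throughput of one active LP:

```python
# Two LPs on one core give 1.3x the throughput of one LP alone.
shared_throughput_gain = 1.30
per_lp_speed = shared_throughput_gain / 2   # each LP at 0.65x speed
slowdown = 1.0 / per_lp_speed               # elapsed-time multiplier

print(f"each LP at {per_lp_speed:.0%} of single-LP speed")  # 65%
print(f"thread takes {slowdown - 1:.0%} longer")  # ~54%, i.e. roughly 50%
```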

Another question is how the threads should be distributed.
Consider one thread per core, all on a single socket first, before going to the next socket,
so that the threads share a common L3 cache.
Potentially, operations such as the Parallelism Repartition Streams might run better.
The other option is one thread on each socket first,
before coming back to the first socket.
If SQL Server could then distribute the work so that threads are assigned to access pages stored in local memory
as much as possible,
this would improve memory performance.
Of course, this would be impossible to accomplish perfectly, but perhaps good enough is possible?
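
The two placement strategies can be sketched as simple (socket, core) mappings. The topology here is hypothetical (2 sockets of 4 cores, one thread per physical core), purely to show how the two orderings differ:

```python
# Hypothetical topology: 2 sockets x 4 cores, one thread per core.
SOCKETS, CORES_PER_SOCKET = 2, 4

def fill_first(n_threads):
    # Fill one socket before moving on: threads share that socket's L3.
    return [(t // CORES_PER_SOCKET, t % CORES_PER_SOCKET)
            for t in range(n_threads)]

def round_robin(n_threads):
    # Alternate sockets: threads spread over all NUMA memory nodes.
    return [(t % SOCKETS, t // SOCKETS) for t in range(n_threads)]

print(fill_first(4))   # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(round_robin(4))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

At DOP 4 on this topology, fill-first keeps all threads behind one L3 cache, while round-robin gives each pair of threads its own socket's memory bandwidth; which wins depends on whether the plan is dominated by repartitioning or by page access.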


Comments

Nice article. In my testing I've found that the sweet spot for MAXDOP with HT, especially in a concurrent workload, is around 150% of the physical core count per node. Beyond that, at least on the microarchitectures I've tested (mostly Nehalem and Westmere), additional threads very sharply begin decreasing in utility. (Which is to say that I don't see worse elapsed time, but rather no additional benefit; and I'd rather keep the worker thread count as lean as possible at any given time.)

My tests haven't been nearly as formalized as these, so it would be interesting to see your numbers specifically comparing MAXDOP 4/6/8. In such a test, correcting for parallel plan choice would be ideal. If you were to enable trace flag 8649 at the server level, it would force parallel plan choice regardless of the DOP-based cost changes, so it would force some plan stability and eliminate that as a variable.

If I had to guess, I would say that there are OS and SQL Server background threads that may need to intermittently do something, so having a few cores or LPs free would help? I wonder what the parallel plan parent thread needs? And I wonder what the SQL Server strategy is for division of work between threads. I think SQL Server occasionally has problems with asymmetric loop joins. On the one hand, we would like each thread to tackle a big chunk before checking with the parent for the next chunk. But we would also like to keep all assigned threads busy according to their capability, i.e., if both LPs are active on some cores, but only one LP on other cores, there should be different amounts of work sent to each? I will try to take another look at DOP 4, 5, 6, 7, 8 later.

Whenever I read your blog postings I'm always blown away by the sheer amount of effort you go to and the level of detail you go into. Something that's quite interesting regarding hyper threading, if my understanding is correct, is that turning it on splits the on-CPU caches between hyper threads. SQL 2012 batch mode is designed such that batches fit into the level 2 cache, so it would be really interesting to see what turning hyper threading on/off would do to batch mode, and to monitor execution time against CPU utilisation and level 2 cache misses.

I recall seeing something on this. If anyone could provide a link to the L2 cache size / batch size details, I would appreciate it. I thought this aspect applied to parallel execution plans, but perhaps it could be used for non-parallel plans too?

I would assume that on startup, SQL Server queries the L2 cache size. It must also determine the number of logical processors per core, to establish whether 2 (or more) threads will share the same L2. Then it must estimate the batch size to be some amount less than the L2 size, because L2 is a unified cache, shared between data and instructions, and there are other data structures too.

The theory is that in processing a query, the same data may be accessed repeatedly, so if the chunk fits in L2, the access time will be very short. Now suppose that the chunk size was over-estimated, i.e., set to a higher percentage of L2 than is actually available for the working data.

However, a system with HT enabled should estimate only half the L2 as available. So the single-threaded execution plan with HT on would have a more CPU-efficient strategy than the single-threaded plan with HT off?

If MS would let me have source-code access, I could look into this. I am relatively good at keeping secrets, honest. OK it could be done with just the pdb files, but my debugging skills have faded and I am no longer inclined to decode assembly.

Actually there are many reasons for your threads to improve in performance even when DOP is 1.

When there are fewer threads running than available physical CPUs (cores), you may guess that the system will schedule a thread on each core; that's why you would guess performance should be similar.

I don't believe it works that way.

As far as I understand the way HT works, the system tries to fill physical CPUs with threads, so you should see some cores with two threads running and other cores totally idle.

In this way the system can speed up the threads by activating Turbo Boost, i.e. with some cores (of the same CPU socket) idle, it can dedicate more clock cycles to the active cores.

Furthermore, threads on the same core may use the L1 cache for intercommunication rather than the L2.

Finally, with this scheduling logic, all cores on the same socket will be "populated" with threads first, rather than scaling onto other sockets, thus avoiding process communication between cores on different sockets (on NUMA systems).

As far as I understood the processor architecture, dual threading on the same core (physical processor) doesn't reduce the clock rate of each thread, which remains the same.

Summarizing, HT can only bring benefits to your thread scheduling, because it can speed up the threads (Turbo Boost) and avoid process communication via deeper levels of cache.