This is the official web log for the SQL Server engineering team – we focus on learning about customer workloads running on SQL Server, integrating that feedback to enhance the product, and providing guidance on using SQL Server to solve customers' business challenges.

Making parallelism waits actionable

During PASS we announced changes to how users will be able to make parallelism waits more actionable – specifically the “infamous” CXPACKET. This change effectively splits CXPACKET waits into an actionable wait (CXPACKET) and a negligible wait (CXCONSUMER). This change will be effective starting with SQL Server 2017 CU3 and upcoming SQL Server 2016 SP2 (this wait type is already present in Azure SQL Database).

Waits are a normal part of the waits and queues model, allowing SQL Server to concurrently execute many more requests than there are schedulers available. See more about the waits and queues model here.
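As a quick illustration (an ad-hoc check, not part of the original walkthrough), the cumulative parallelism waits can be inspected at the server level with sys.dm_os_wait_stats:

```sql
-- Cumulative server-level waits for the two parallelism wait types
-- (CXCONSUMER requires SQL Server 2017 CU3, SQL Server 2016 SP2,
--  or Azure SQL Database)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'CXPACKET', N'CXCONSUMER')
ORDER BY wait_time_ms DESC;
```

Note these counters accumulate since the last server restart (or a manual clear), so compare snapshots over time rather than reading absolute values.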

It’s also important to understand that parallelism is implemented as if it were two operators: producers that push data to consumers, and consumers that may have to wait for data from producers.

With this in mind, it becomes easier to understand why producer waits are the ones that may require attention, while consumer waits are an inevitable, passive consequence of longer-running producers.

What is what going forward?

With this split, CXPACKET occurs when trying to synchronize the query processor exchange iterator or getting a required buffer.

This is the actionable wait. If contention on this wait type becomes a problem for most of your workload, consider, for example, adjusting the degree of parallelism (DOP) so queries use just enough parallelism.

Or better yet, improve cardinality estimations if actual rows are very different from estimates. See this previous blog post on how to use the new SSMS Plan Analysis feature to get insights into cardinality estimation differences. Improving estimations can include actions such as updating or adding statistics, revising the underlying index design (consider leveraging the Database Engine Tuning Advisor for this), or even hinting the Query Optimizer in edge cases.
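A minimal sketch of the statistics actions mentioned above (the table and column names are hypothetical):

```sql
-- Refresh existing statistics with a full scan on a hypothetical table
UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN;

-- Add a statistics object on a hypothetical column that lacks one
CREATE STATISTICS stat_OrderDate
ON dbo.SalesOrderDetail (OrderDate) WITH FULLSCAN;
```

FULLSCAN trades longer statistics maintenance for better estimates; sampled updates may be sufficient for large tables.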

And CXCONSUMER occurs when a consumer thread waits for a producer thread to send rows.

This wait type is a normal part of parallel query execution, and cannot be directly influenced by changing the configurations mentioned above. This is why I called it "negligible" earlier.

Since we have wait stats and query time stats in the root node of showplan, let’s see how they look:

A lot of CXPACKET waits, totaling over 31s, for 1.2s of CPU time and 2.8s of overall elapsed execution time. Also notice from showplan that this query is executing with DOP 12:

As mentioned above, both producer and consumer waits are tracked at the server and session level, so let’s also look at sys.dm_exec_session_wait_stats. Note that any change made to improve on the high-waits scenario above is aimed at producer waits (CXPACKET), given that consumer waits (CXCONSUMER) may inevitably exist as a normal part of parallelism:
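A sketch of that session-level check (the session id is hypothetical):

```sql
-- Producer vs. consumer waits accumulated by one session
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = 54  -- hypothetical session id
  AND wait_type IN (N'CXPACKET', N'CXCONSUMER');
```

Unlike sys.dm_os_wait_stats, this DMV resets when the session opens, so it isolates waits for the work done on that session only.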

Ok, so actually most of the parallelism waits are producer waits. We may expect consumer waits to go up in the initial moments of parallel query execution, but then stabilize.

CXPACKET waits down to just over 10s, for 0.8s CPU and 2.7s overall elapsed execution time. So while the elapsed time remained very similar (although lower), there were gains in reducing CPU usage and waits (3x lower) by adjusting DOP just enough.
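For reference, capping DOP for a single query can be sketched with a query hint (the query and value here are illustrative, not the exact statement used in the walkthrough above):

```sql
-- Hypothetical aggregate query, capped at DOP 4 via a query hint
SELECT c.CustomerID, SUM(s.LineTotal) AS Total
FROM dbo.SalesOrderDetail AS s          -- hypothetical tables
JOIN dbo.Customer AS c
  ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID
OPTION (MAXDOP 4);                      -- overrides the server-wide setting
```

A query-level hint lets you tune the offending statements without changing the server-wide max degree of parallelism for everything else.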

Your mileage may vary depending on your setup, but considering that this query may run at scale in a given environment, the gains are relevant. And they were made possible by all the insights unlocked in showplan.

Hi Robert, you’re referring to KB 2806535 and you are correct.
At the time I captured query data I had set maxdop 0, and actual DOP at runtime was 12. It is unfortunately common enough to find customer servers experiencing parallelism wait issues precisely because of high (or default) DOP, which is what I repro’ed here.
In this case, even if set at maxdop 8 (because I have 12 schedulers per NUMA node I’d set it to 8 according to the KB above) I’d see some improvements by lowering DOP while still leveraging parallelism.
And like any best practice, it stems from lessons learned from many running systems/workloads, but your mileage may vary (up or down) in a specific server/workload. This is what the exercise above illustrates – how to leverage showplan information to improve this type of scenario.
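For completeness, setting server-wide MAXDOP per the KB guidance discussed above can be sketched as follows (the value 8 follows this reply; adjust for your own scheduler/NUMA layout):

```sql
-- Server-wide max degree of parallelism; requires advanced options
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;
```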

A related tough subject: I think the cost threshold for parallelism deserves more MS blog attention :). Many in the community are grappling with how to identify “the” appropriate value, while still suggesting a non-default value. I realize a proper analysis starts with the ability to capture a replayable load. Assuming such a load can be obtained, what metrics (other than duration) should be collected? No need to answer me here – this is just a suggestion :).