Introduction

What is CXPacket wait?

When the SQL Server Engine chooses to break down a query into smaller pieces and process the individual pieces in parallel there is a lag between when the first results is received and when the results are compiled.

Caveat

One should keep in mind that CXPacket wait stats are not wholly distinct from other wait stats. That is, because waits are marked as CXPacket, one should not over-read and think that for a specific request waits related to I/O, CPU, and Memory, Locks and Latches are not contributing to the query’s CXPacket wait stat.

In fact, it seems that once a query is parallelized, a timer is started and only stopped once all parallelized threads complete and their results are assembled.

Trouble spot

Based on Microsoft’s published documentation if the percentile of Wait Stats attributed to CXPacket exceeds 5% one should take a good and clear eye view of the settings that are relevant to Parallelism.

SQL Server Instance – Exposes – NUMA Characteristics

Non-Uniform Memory Access is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory.

…

…

Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses. But the dramatic increase in size of the operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements. Multi-processor systems without NUMA make the problem considerably worse. Now a system can starve several processors at the same time, notably because only one processor can access the computer’s memory at a time.

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).

Blind Sides

There are so many blind sides to this parallelism story.

Variable Argument vs Concrete Argument

Though estimated Query Plan might indicate that parallelism will occur, when your run the query and supply actual values for the parameters the optimized might see that the data set is narrow enough and the arguments for Parallelism get decimated.

Here is a sampler:

In the top pane, variables were submitted (@retirementAge) in the where clause and in the bottom actual values were submitted (‘7/1/2013’)

Hardening Parallelism for specific user through Resource Group

To constrain this posting I will only briefly touch on this point. My thoughts were excited via a blog posting from Cindy Gross.

declare @PlanCacheForMaxDop TABLE
(
StatementSubTreeCost FLOAT
, UseCounts INT
, PlanSizeInBytes INT
);
-- Collect parallel plan information
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH XMLNAMESPACES
(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
INSERT INTO @PlanCacheForMaxDop
(
StatementSubTreeCost
, UseCounts
, PlanSizeInBytes
)
SELECT
n.value('(@StatementSubTreeCost)[1]', 'VARCHAR(128)')
AS StatementSubTreeCost,
ecp.usecounts,
ecp.size_in_bytes
FROM sys.dm_exec_cached_plans AS ecp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS eqp
CROSS APPLY
query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS qn(n)
WHERE n.query('.').exist('//RelOp[@PhysicalOp="Parallelism"]') = 1;
-- Return grouped parallel plan information
SELECT MAX(CASE WHEN StatementSubTreeCost BETWEEN 1 AND 5 THEN '1-5'
WHEN StatementSubTreeCost BETWEEN 5 AND 6 THEN '5-6'
WHEN StatementSubTreeCost BETWEEN 6 AND 7 THEN '6-7'
WHEN StatementSubTreeCost BETWEEN 7 AND 8 THEN '7-8'
WHEN StatementSubTreeCost BETWEEN 8 AND 9 THEN '8-9'
WHEN StatementSubTreeCost BETWEEN 9 AND 10 THEN '9-10'
WHEN StatementSubTreeCost BETWEEN 10 AND 11 THEN '10-11'
WHEN StatementSubTreeCost BETWEEN 11 AND 12 THEN '11-12'
WHEN StatementSubTreeCost BETWEEN 12 AND 13 THEN '12-13'
WHEN StatementSubTreeCost BETWEEN 13 AND 14 THEN '13-14'
WHEN StatementSubTreeCost BETWEEN 14 AND 15 THEN '14-15'
WHEN StatementSubTreeCost BETWEEN 15 AND 16 THEN '15-16'
WHEN StatementSubTreeCost BETWEEN 16 AND 17 THEN '16-17'
WHEN StatementSubTreeCost BETWEEN 17 AND 18 THEN '17-18'
WHEN StatementSubTreeCost BETWEEN 18 AND 19 THEN '18-19'
WHEN StatementSubTreeCost BETWEEN 19 AND 20 THEN '19-20'
WHEN StatementSubTreeCost BETWEEN 20 AND 25 THEN '20-25'
WHEN StatementSubTreeCost BETWEEN 25 AND 30 THEN '25-30'
WHEN StatementSubTreeCost BETWEEN 30 AND 35 THEN '30-35'
WHEN StatementSubTreeCost BETWEEN 35 AND 40 THEN '35-40'
WHEN StatementSubTreeCost BETWEEN 40 AND 45 THEN '40-45'
WHEN StatementSubTreeCost BETWEEN 45 AND 50 THEN '45-50'
WHEN StatementSubTreeCost &gt; 50 THEN '&gt;50'
ELSE CAST(StatementSubTreeCost AS VARCHAR(100))
END) AS StatementSubTreeCost
, COUNT(*) AS countInstance
, avg(PlanSizeInBytes) /1000 avgPlanSizeInKB
FROM @PlanCacheForMaxDop
GROUP BY CASE WHEN StatementSubTreeCost BETWEEN 1 AND 5 THEN 2.5
WHEN StatementSubTreeCost BETWEEN 5 AND 6 THEN 5.5
WHEN StatementSubTreeCost BETWEEN 6 AND 7 THEN 6.5
WHEN StatementSubTreeCost BETWEEN 7 AND 8 THEN 7.5
WHEN StatementSubTreeCost BETWEEN 8 AND 9 THEN 8.5
WHEN StatementSubTreeCost BETWEEN 9 AND 10 THEN 9.5
WHEN StatementSubTreeCost BETWEEN 10 AND 11 THEN 10.5
WHEN StatementSubTreeCost BETWEEN 11 AND 12 THEN 11.5
WHEN StatementSubTreeCost BETWEEN 12 AND 13 THEN 12.5
WHEN StatementSubTreeCost BETWEEN 13 AND 14 THEN 13.5
WHEN StatementSubTreeCost BETWEEN 14 AND 15 THEN 14.5
WHEN StatementSubTreeCost BETWEEN 15 AND 16 THEN 15.5
WHEN StatementSubTreeCost BETWEEN 16 AND 17 THEN 16.5
WHEN StatementSubTreeCost BETWEEN 17 AND 18 THEN 17.5
WHEN StatementSubTreeCost BETWEEN 18 AND 19 THEN 18.5
WHEN StatementSubTreeCost BETWEEN 19 AND 20 THEN 19.5
WHEN StatementSubTreeCost BETWEEN 10 AND 15 THEN 12.5
WHEN StatementSubTreeCost BETWEEN 15 AND 20 THEN 17.5
WHEN StatementSubTreeCost BETWEEN 20 AND 25 THEN 22.5
WHEN StatementSubTreeCost BETWEEN 25 AND 30 THEN 27.5
WHEN StatementSubTreeCost BETWEEN 30 AND 35 THEN 32.5
WHEN StatementSubTreeCost BETWEEN 35 AND 40 THEN 37.5
WHEN StatementSubTreeCost BETWEEN 40 AND 45 THEN 42.5
WHEN StatementSubTreeCost BETWEEN 45 AND 50 THEN 47.5
WHEN StatementSubTreeCost &gt; 50 THEN 100
ELSE StatementSubTreeCost
END;
GO

Interestingly enough it shows that as a group Statements costed below our default threshold of 5 are actually the highest on our list:

And, like they say that is not for now; but for another trek up the learning Tree…

Thanks

My interest in this area was piqued by a couple of articles publicly published by Jimmy May. I have listed the postings in the References section.

One of the commenters, Mike B, posted:
There is a misunderstanding here. You can reduce CPU utilization even more by turning off the computer. It should be understood that parallel processing is inherently “inefficient”, but you use it because it executes faster.
With passion like this, who needs Court TV. Just like CSI you almost always learn something, once you have the requisite interest.