Under The Hood With SQL Server Batch Mode Part 7:Stack Walking The Batch Runtime

In this post I am going to “Stack walk” the batch run time and point out one or two interesting things along the way. However, I’ll take a short step back by providing some back ground information around my preferred method of obtaining SQL Server stack traces, this is to use xperf which comes with the Windows Performance Analysis toolkit. The other popular way of doing this is to use the debugger, some good information on how to do this, courtesy of the CSS SQL Engineer team is to be found here. One of the big advantages that WPA provides over the debugger is that is gives you profiling information. Once WPA is installed and the public symbols downloaded, using xperf is quite straight forward, in a DOS command prompt window with administrator privileges issue the following command:

xperf -on base -stackwalk profile

then run your SQL, to recap, the excerpt below illustrates how my test data is created, the final query in the excerpt is the one I will obtain the stack trace for:

Finally after the query has executed we can get xperf to produce a file containing the profiling information with xperf -d batch.etl ( or whatever you wish the file to be called ). Without further ado, lets walk down a file, starting at the very top:

A the top of the trace on lines three and four we can see that the only part of the operating system used is that to run the scheduler, there is a function call to start the thread and one to perform some sort of initialization, that is the lot. In part 5 of this series, I included this graphic:

it happens that there is one particular type of activity that dwarfs anything on this in terms of being a performance killer. It is the context switch I’m alluding to, this can consume anything from between ten and fifty thousand CPU cycles. SQL OS solves several problems:

It allows threads scheduled by taking into account resources that the operating system has no concept of, such as latches.

If we delve further down into the trace, we begin to see the batch execution run time:

This is how queries are executed conceptually:

Data travels between itertators by each iterator in the plan invoking a method call on the next iterator down, we can see the APIs being invoked to obtain batches on lines 18, 25 and 28. This concept becomes a bit more nuanced with parallel exchange iterators, in that these have a consumer and producer side, which results in data being pushed through the iterator. In the stack excerpts I have provided so far, the weightings are omitted ( a weighting being an estimate of CPU time in ms sampled across all CPU cores ). Lets now have a look at a stack trace that includes this information ( double-click on the image to enlarge it ):

The deepest parts of the stack relate to iterators furthest from the root iterator in the execution plan. If we take the hash aggregate that appears immediately after the column store index scan at line 31 on-wards in the stack, the weight of this is 88,180 to the nearest integer, the weight for the entire query is 91,408 ( to the nearest integer ). Some simple maths reveals that the CPU cost of the hash aggregate is 97.5% of the query’s total execution cost.

By analyzing the stack trace further we can determine what element of the hash aggregate which follows the clustered column store scan is the most expensive. In the screen shot below I’ve highlighted the two main areas of interest:We can see that on line 31 ( double click on the image to see a larger version of it ) a call to sqlmin.dll!CBpQScanHashAggNew::PbsGetForOriginalInputs is made with a weight of 88,179.787677, expression evaluation for the hash aggregate begins on line 36 with the call to sqlmin.dll!CBpagAggregateExpression::Eval with a weight of 53,504.141506, taking the weighting to the nearest whole unit we can see that expression evaluation accounts for 61% of the total CPU time expended in processing the hash aggregate.

As we delve down further into the stack trace we can see activity which relates to the column store scan. I have mentioned the deletion bit map in a previous post, the bit where the batch engine checks for deleted tuples is highlighted below:

One question that remained unanswered from part 5 of this series was why CPU utilisation reached a ceiling of 59% when using a column store index created on sorted data and a warm column store object cache. In attempting to answer this question, I trawled through the stack trace looking for likely causes, below are some excerpts of what I managed to find. I’ve had to split the part of the stack trace of interest into two because I was unable to capture the whole stack in one screen print. The top part shows the call to obtain the next batch from the column store index scan, if you notice the tapered vertical line to the left of sqlmin.dll!CBpChildIteratorScanner::BpGetNextBatch and follow it down you arrive at the method call at line 88, follow this down to line 91 and there is a call to sqlmin.dll!SOS_Task::Sleep. My theory is that the hash aggregate is throttling back the column store index scan because it cannot keep up with it, leading me to suspect that the batch engine has a throttling mechanism similar to that used by the Integration Services data flow engine.

I will wrap this series up summarising the key takeaway points of the series:

For the types of query it is designed for, the batch engine is an order of magnitude faster than the row mode engine, however it suffers from the same scalability issues.

The pressure on the CPU is at the backend, Microsoft could alleviate this by leveraging SIMD instructions, other database vendors have already done this.

Level 2 cache reuse could be further enhanced by the increasing of the scheduler quantum above 4 ms, for OLAP style workloads I see no practical reason why the quantum size cannot be increased.

The creation of column store indexes on data which is pre-sorted can result in improvements in efficiency when performing hash aggregates and joins that outweigh any increased overheads in scanning the column store.

Memory access is highly nuanced and goes way beyond any simplistic statements along the lines of “If it fits in main memory its fast”.

Anecdotally the batch engine appears to have a form of throttling mechanism to slow down column store scanning when other iterators cannot keep up with this.

So where does Microsoft now go with the batch engine, there are one or two obvious things such as the leveraging of SIMD instructions and an increase in size of the scheduler quantum. There is also some very interesting stuff in the research community which might filter into commercial products, “Database cracking” is but one example, this refers to the ad hoc and adaptive creation of indexes for column store databases. Despite the fact I am not privy to Microsoft’s internal road map and future plans, I strongly suspect that the next Pass event in the states will herald the next version of the database engine in CTP form, hopefully this will provide some new fertile ground for pushing the database engine to its limits and seeing where the its bottlenecks are.