A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

Compute Scalars, Expressions and Execution Plan Performance

The humble Compute Scalar is one of the least well-understood of the execution plan operators, and usually the last place people look for query performance problems. It often appears in execution plans with a very low (or even zero) cost, which goes some way to explaining why people ignore it.

Some readers will already know that a Compute Scalar can contain a call to a user-defined function, and that any T-SQL function with a BEGIN…END block in its definition can have truly disastrous consequences for performance (see this post by Rob Farley for details). This post is not about those sorts of concerns.

Compute Scalars

The operator's name, and its typically tiny cost, lead people to think that Compute Scalar behaves like the majority of other operators: as rows flow through it, the results of whatever computations the Compute Scalar contains are added to the stream. This is not generally true. Despite the name, Compute Scalar does not always compute anything, and does not always contain a single scalar value (it can be a vector, an alias, or even a Boolean predicate, for example). More often than not, a Compute Scalar simply defines an expression; the actual computation is deferred until something later in the execution plan needs the result.
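To make the 'definition only' behaviour concrete, here is a minimal sketch (the table variable and the expression label are illustrative assumptions; label numbers vary from plan to plan):

```sql
-- The plan for this query contains a Compute Scalar that merely defines
-- something like [Expr1002] = LOWER([n]); no work happens at that operator.
-- The expression is evaluated later, when an operator actually needs the value.
DECLARE @T TABLE (n varchar(30) NOT NULL);
INSERT @T (n) VALUES ('Compute'), ('Scalar');

SELECT LOWER(n) AS lower_n
FROM @T;
```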

Compute Scalars are not the only operators that can define expressions. You can find expression definitions with labels like [Expr1008] in many query plan operators, including Constant Scans, Stream Aggregates and even Inserts.

Deferred expression evaluation was introduced in SQL Server 2005 as a performance optimization. The idea is that the number of rows tends to decrease in later parts of the plan due to the effects of filtering and joining, reducing the number of expression evaluations at runtime. To be fair to the Books Online entry, it does mention this behaviour in a note.

The takeaway from this first section is that a Compute Scalar is most often just a placeholder in the execution plan. It defines an expression and assigns a label to it (e.g. [Expr1000]) but no calculations are performed at that point. The defined expression may be referenced by multiple operators later in the execution plan. This makes counting the number of ‘executions’ more complex than for a regular plan operator, and goes some way to explaining why Compute Scalar operators only ever show estimated row counts (even in an ‘actual’ execution plan).

Execution Plans and the Expression Service

The execution plans you see in Management Studio (or Plan Explorer) are a highly simplified representation that is nevertheless useful for many kinds of plan analysis. The XML form of showplan has more detail than the graphical representation, but even that is a shallow depiction of the code that actually runs inside SQL Server when a query is executed. It is easy to be seduced into thinking that what we see as an ‘execution plan’ is more complete than it actually is. As a tool for high-level analysis, it is excellent, but do not make the mistake of taking it too literally. More detail on this later.

One more thing to appreciate before we get into a concrete example: the SQL Server engine contains a component known as the Expression Service. This is used by components such as the query optimizer, query execution, and the storage engine to perform computations, conversions and comparisons. It is the call to the expression service in the execution plan that can be deferred and performed by a plan operator other than the source Compute Scalar (or other expression-defining operator).

An Example

I came across a forum thread recently which demonstrated the concepts I want to explore in this post really well. The topic of discussion was delimited string-splitting (an old favourite topic, for sure) but I want to emphasise that the particular code is not at all important – it is the query plan execution I want to focus on. The particular string-splitting method employed was one I first saw written about by Brad Shultz; it replaces the delimiters with XML tags, and then uses XML shredding to return the results. I should say that I am not a fan of this technique because it is (a) a strange use of XML; and (b) it does not work for all possible inputs. Anyway, I said the example was not that important in itself, so let’s get on to the first example:
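The test one listing is reconstructed below from the description that follows and from the test two listing later in this post (which differs by a single line). The variable names and the REPLICATE construction are assumptions:

```sql
-- 26 letters separated by commas (51 characters), repeated 32 times:
-- a 1632-character string that splits into 801 elements.
DECLARE @Input varchar(8000) =
    REPLICATE(CONVERT(varchar(8000),
        'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z'), 32);

-- Variables to 'eat' the output
DECLARE @ItemID bigint, @Item varchar(8000);

SELECT
    @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
    @Item = x.i.value('./text()[1]', 'varchar(8000)')
FROM
(
    VALUES
    (
        CONVERT(xml,
            '<r>' + REPLACE(@Input, ',', '</r><r>') + '</r>', 0)
    )
) AS ToXML (result)
CROSS APPLY ToXML.result.nodes('./r') AS x(i);
```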

The code above creates a single comma-delimited string containing the letters of the alphabet, repeated 32 times (a total string length of 1632 characters). The string-splitting SELECT statement replaces the commas with XML tags, fixes up the start and end of the string, and then uses the nodes() XML method to perform the actual splitting. There is also a ROW_NUMBER call to number the elements returned in whatever order the nodes() method spits them out. Finally, two variables are used to ‘eat’ the output so returning results to the client does not affect execution time. The execution plan is as follows:

The Constant Scan operator defines an expression labelled as [Expr1000] which performs the string replacement and conversion to the XML type. There are four subsequent references to this expression in the query plan:

The Filter operator (a start-up predicate checks that the expression is not NULL)

The Table-valued Function on the upper branch of the plan

The Table-valued Function on the lower branch of the plan

The Stream Aggregate (as part of a CASE expression that determines what the value() XML method should finally return)

The query takes 3753ms to execute on my laptop. This seems a rather long time to shred a single string containing just 801 elements.

Test Two – The Empty String

The following code makes one small modification to the test one script:

SELECT
    @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
    @Item = x.i.value('./text()[1]', 'varchar(8000)')
FROM
(
    VALUES
    (
        CONVERT
        (
            xml,
            '<r>' +
            REPLACE(@Input, ',', '</r><r>') +
            '</r>' +
            LEFT(@@DBTS, 0) -- NEW
            , 0
        )
    )
) AS ToXML (result)
CROSS APPLY ToXML.result.nodes('./r') AS x(i);

The only difference in the execution plan is in the definition of [Expr1000]:

This plan returns the same correct results as the first test, but it completes in 8ms (quite an improvement over 3753ms).

Beyond the Execution Plan – Test One

There is nothing in the query plan to explain why the second form of the query runs 500 times faster than the first. To see what is going on under the covers of the execution engine when executing the slow query, we can attach a debugger and look closer at the code SQL Server is really running. Using SQL Server 2012 and setting a breakpoint on the CXMLTypeProvider::ConvertStringToXMLForES method (as the name suggests, it converts a string to XML for the Expression Service) the first debugger break occurs with this stack trace:

I know that not everyone reading this post will routinely analyse stack traces, so for this first example I will describe the interesting calls from the bottom up (the currently executing code is on top, each line below is the method that called that routine):

There is nothing very interesting below CQueryScan::GetRow (it is just start-up code)

CQueryScan::GetRow drives the plan as we are used to seeing it. You might think of it as approximately the code behind the green SELECT query plan icon in a graphical plan. It reads a row at a time from the first plan operator to its right

The first visible plan operator to execute is the leftmost Nested Loops join (CQScanNLJoinTrivialNew::GetRow). (The Compute Scalar to its left in the graphical plan simply defines an alias for an expression computed by the Stream Aggregate – it does not even exist in the code executed by the execution engine).

The Sequence Project operator asks the Segment operator for a row (CQScanSegmentNew::GetRow)

The Segment operator asks the second Nested Loops Join for a row (CQScanNLJoinTrivialNew::GetRow)

The nested loops join has obtained a row from the Constant Scan and is now preparing to fetch a row from the inner side of the join

The start-up Filter operator is executing its Open() method (CQScanStartupFilterNew::Open)

The start-up Filter calls into the general entry point for the Expression Service to evaluate expression [Expr1000] defined by the Constant Scan

CEsExec::GeneralEval4 calls an ES method to convert a string to XML (ConvertFromStringTypesAndXmlToXml)

The XML type provider executes the code that actually converts the string to XML (ConvertStringToXMLForES)

The stack trace illustrates two things nicely:

The iterative nature of plan operators (one row at a time, Open, GetRow, and Close methods)

An expression defined by one operator (the Constant Scan) being deferred for evaluation until a later operator (the start-up filter) requires the result

Continuing execution with the debugger, the next break occurs when the Table-valued Function to the right of the filter is initialised:

Another break when the Table-valued Function on the lowest branch of the plan initialises:

And finally, a break from the Stream Aggregate, when it needs to evaluate [Expr1016] (which contains a reference to [Expr1000]):

The last two breaks (lowest table-valued function and stream aggregate) each occur 801 times. The reason for the slow performance of this query is now clear: [Expr1000] is being computed 1604 times. Yes, that means the REPLACE and CONVERT to XML really do run 1604 times on the same value – no wonder performance is so poor!

If you are inclined to test it, the expression service method that performs the REPLACE is sqlTsEs!BhReplaceBhStrStr. It turns out that the string REPLACE is very much faster than conversion to XML (remember XML is a LOB type and also has complex validation rules) so the dominant performance effect is the conversion to XML.

Beyond the Execution Plan – Test Two

Running test two with the debugger attached reveals that only one call is made to CXMLTypeProvider::ConvertStringToXMLForES:

Reading down the stack trace we see that execution of the plan (as we know it) has not started yet. The call to convert the string to XML is called from InitForExecute() and SetupQueryScanAndExpression. The conversion to XML is evaluated once and cached in the execution context (CQueryExecContext::FillExprCache). Once the iterative part of plan execution begins, the cached result can be retrieved directly without calling ConvertStringToXMLForES() 1604 times.

Test Three – Column Reference

Before you go off adding LEFT(@@DBTS, 0) to all your expressions (!) look what happens when we use a table to hold the input data rather than a variable:
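The test three listing is sketched here (the table variable and column names are assumptions); the essential change is that the variable @Input becomes a column reference:

```sql
-- Hold the input string in a table so the expression references a column.
DECLARE @T TABLE (Input varchar(8000) NOT NULL);
INSERT @T (Input)
VALUES (REPLICATE(CONVERT(varchar(8000),
    'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z'), 32));

SELECT
    @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
    @Item = x.i.value('./text()[1]', 'varchar(8000)')
FROM @T AS T
CROSS APPLY
(
    VALUES
    (
        CONVERT(xml,
            '<r>' + REPLACE(T.Input, ',', '</r><r>') + '</r>' +
            LEFT(@@DBTS, 0), 0)
    )
) AS ToXML (result)
CROSS APPLY ToXML.result.nodes('./r') AS x(i);
```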

The other plan details are the same as before; the expression label [Expr1003] is still referenced by the Filter, two Table-valued functions and the Stream Aggregate. The test three query takes 3758ms to complete. Debugger analysis shows that CXMLTypeProvider::ConvertStringToXMLForES is again being called directly by the Filter, Table-valued Functions, and Stream Aggregate.
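Test Four – CRYPT_GEN_RANDOM

Test four, as described in the Explanation section below, replaces the @@DBTS call with the side-effecting CRYPT_GEN_RANDOM function. A sketch of the changed expression only (the rest of the test three statement is assumed unchanged):

```sql
CONVERT(xml,
    '<r>' + REPLACE(T.Input, ',', '</r><r>') + '</r>' +
    LEFT(CRYPT_GEN_RANDOM(1), 0) -- NEW: side-effecting function
, 0)
```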

For test four, the @@DBTS expression is replaced with a call to the side-effecting CRYPT_GEN_RANDOM function. The remainder of the query plan is unaffected; the same operators reference [Expr1003] as before. As you might have been expecting, this query executes in 8ms again.

Explanation

In test one, there is nothing to stop the query processor deferring execution of the expression, which results in the same expensive computation being performed 1604 times (once each for the filter and first table-valued function, and 801 times each for the second function and the stream aggregate). As an interesting aside, notice that even with OPTION (RECOMPILE) specified, constant-folding only applies to the REPLACE and string concatenation; constant-folding cannot produce a LOB type (including XML).

In test two, we introduced the non-deterministic function @@DBTS. This is one of the functions (like RAND and GETDATE) that are implemented to always return the same value per query (despite being non-deterministic, see the links in the Further Reading section below for more details). These runtime constants are evaluated before the query starts processing rows and the result is cached. There are some hints (ConstExpr labels) in the execution plan that may indicate a runtime constant but I have not found this to be reliable.
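The per-query runtime-constant behaviour is easy to see with GETDATE, mentioned above; a small illustrative query:

```sql
-- GETDATE() is evaluated once, before row processing begins, and the
-- cached result is reused for every row of this statement.
SELECT TOP (5)
    o.name,
    GETDATE() AS same_value_for_every_row
FROM sys.objects AS o;
```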

In test three, we changed the game again by replacing the reference to a variable with a column reference from a table. The value of this column reference could change for each table row so the engine can no longer treat the expression as a runtime constant. As a consequence, the result is no longer cached and we see the same behaviour (and slow performance) as in test one.

In test four, we replaced the @@DBTS function with the side-effecting function CRYPT_GEN_RANDOM (the other such function is NEWID). One of the effects of this is to disable the deferred evaluation of the expression defined by the Compute Scalar. The result is an XML LOB, and a second optimization occurs where handles to the same LOB (evaluated by the defining operator) can be shared by other operators that reference the LOB. This combination of effects means the expensive computation is only performed once in test four, even though the runtime constant caching mechanism is not used. By the way, there was a bug with CRYPT_GEN_RANDOM which meant it was incorrectly assessed by the optimizer in early versions of SQL Server 2008. If you encounter this issue when running the scripts above, apply a later service pack (or use the NEWID function for SQL Server 2005).

Final Thoughts

There is a lot more going on in query execution than we can see in execution plans

In general, there are no guarantees about the order of execution of expressions, or how many times they will be evaluated

Please do not start using @@DBTS or CRYPT_GEN_RANDOM for their undocumented effects

It might be nice if show plan output did indicate if a computation was deferred or cached, but it is not there today

In many cases we can work around the potential for poor performance caused by repeated evaluation by explicitly materializing the result of the expression using a variable, table, non-inline function…though not always

Please do not suggest improvements to the XML string splitter unless it handles all input characters and can accept a semicolon as a delimiter ;c)

I always use SQLCLR for string splitting
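The materialization workaround in the list above can be sketched like this for the string splitter used in the tests (@Input as declared earlier; other names are assumptions):

```sql
-- Perform the expensive REPLACE + CONVERT to xml exactly once, storing
-- the result in a variable; the splitting query then references the
-- variable instead of re-evaluating the expression per operator.
DECLARE @x xml =
    CONVERT(xml, '<r>' + REPLACE(@Input, ',', '</r><r>') + '</r>', 0);

SELECT
    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
    Item   = x.i.value('./text()[1]', 'varchar(8000)')
FROM @x.nodes('./r') AS x(i);
```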

Acknowledgements

I would like to particularly thank Usman Butt and Chris Morris from SQL Server Central for their contributions and thoughts which helped me write this entry.


Comments

You always have very fascinating and informative blog posts. Thanks for teaching us all. I can't imagine how I would acquire knowledge like this on my own, and I appreciate it. I can sound all awesome and knowledgeable just from a bit of reading! :)

I am very glad to have you around. Your knowledge about SQL Server and your willingness to share it, is really appreciated.

Now, talking about the QO: I always admire how the QO evolves from one CU/SP/version to the next, but in this case I wonder whether it can foresee that the evaluation will take place more than once if the expression is pushed down the stream, as in this case?

@AMB: Conor talks about some of the QO limitations in this area in the 'Scalar Operator Costing' link in the further reading section above. The QO primarily reasons about alternative relational constructions - it has very little to say about expressions (which aren't relational). The lack of any sort of reasonable costing information for expressions means it would probably do a poor job anyway. I suppose this might improve in future, but I have heard nothing to indicate that.

I thought that the first part of the post was very compelling, and I was looking forward to seeing how to look under the hood regarding a stack trace.

Unfortunately the examples started with code that had syntax I was unfamiliar with, and then the stack trace code which I would love to learn, but have never seen before. At that point I was completely lost and frustrated.

There doesn't seem to be a way to bridge the gap between basic blogs on the subject, and the advanced stuff which I really want to know.