Proactive Database Administration

T-SQL

In a previous post, I introduced how to use startup expression predicates in T-SQL queries to improve performance. Based on the feedback I got, there was some confusion about what this operator actually does, and why it appears in the query plan as a Filter operator, which is usually seen in other contexts. In this post, I’ll explain the differences and similarities of the Row Filter operator (which is seen more typically) and the Startup Expression filter operator.

Comparison By Example

Let’s set up a test scenario that can be used to demonstrate and compare the two types of operators (note: the test data is <1 MB):

Now we can try running a couple queries to see these operators in action. Here’s the first one, which contains a Row Filter predicate (like the previous post, I’m using hints so you can reproduce the same plans more easily if you try this yourself):

As we can see, the query joined the two tables together, and then filtered that set of rows to give the final result.

The Row Filter operator evaluated the predicate against each returned row (the big arrow to the right of the operator), and output only the rows where the predicate evaluated to true (no rows in this case; the small arrow to the left of the operator).

Here’s the next query, which uses a Startup Expression predicate (this query isn’t logically equivalent to the first one):

This time, table T1 was scanned (20480 rows), and the Startup Expression filter operator was executed for each of those rows. However, the index seek to table T2 was only executed 10 times. How did that happen?

The Startup Expression filter evaluated the predicate against each request row coming in from the upper input (in this case the T1 table scan), and only propagated the request where the predicate evaluated to true. This is how a Startup Expression operator “protects” or “guards” operators to its right, so they aren’t executed for every request row. While this particular example is contrived, it’s this “guarding” that improves performance by only executing the subsequent operator branch the minimum number of times necessary.

Summary

Both the Row Filter operator and Startup Expression filter operator evaluate a predicate against rows.

The Row Filter operator applies the predicate to returned rows, returning only the rows that match the predicate, while the Startup Expression filter operator applies the predicate to requested rows, only making further requests when the row matches the predicate.

While both operators perform essentially the same work (hence they both appear as a Filter operator), they do so logically reversed of each other, and therefore perform very different functions within a query plan.

When we write T-SQL statements, what we’re really doing is describing what data to return. It’s then up to the internals of SQL Server to best decide how to most efficiently return the data we asked for.

Sometimes, there’s extra information we know about, but that SQL Server doesn’t (automatically). Letting SQL Server in on this seemingly redundant information can change how efficiently the data is accessed and returned.

In this post, we’ll walk through a simple parent/child example that exploits a partially denormalized table schema to improve join performance to the child tables. The performance improvement comes through SQL Server producing query plans that contain Startup Expression Predicates, which effectively prevents certain parts of the query plan from executing in some cases.

Test Setup

The first thing we need to do is set up the tables. We’ll need a ProductTypes table, a parent table (Products) and two child tables (ItemProducts and ServiceProducts).

In this type of design, there will only ever be a single row in one of the child tables for each row in the parent table. This is typically handled by some form of business logic (stored procedures or views) and enforced by constraints, but I want to keep this example simple, so I’m only mentioning this for the sake of completeness, and what the data is going to “look” like.

(Note: the hints are not standard, but are needed for demonstration purposes; I got a nested loops/table scan plan by default. See the final section of this post for some extra discussion.)

Since each product row will only exist in one of the child tables, we have to use LEFT joins to get any results. The query plan looks like this (click for full size):

We can see that for each row in the Products table, SQL Server must join to both child tables in case there are rows there. Legitimately there could be, as the only thing preventing that is our business logic. SQL Server doesn’t understand that, so it has no choice but to ensure correctness and do the extra work.

Here’s where the magic comes in. We know that for a given ProductTypeId, rows will only exist in one of the child tables. If SQL Server knew that, then it would only have to join to one child table for each row in Products.

Now we’re telling SQL Server something about our business logic. Let’s see if this improves the execution plan:

That’s better. SQL Server has added two Filter operators — one for each child table — that reject rows that don’t satisfy the Startup Expression Predicate (in other words, the extra business logic we told SQL Server). This results in only a single seek against the proper child table for each row in the Products table. This could provide a big performance boost: for the number of child tables (m) and the number of parent rows (n), this approach will always execute only n seeks (thus making the number of seeks independent of the number of child tables), instead of m*n as the first approach does. This does of course come at the penalty of storage to denormalize enough information (ProductTypeId in this case) to drive the process, but usually that’s not going to be a huge hit (most likely 1 byte per row in the parent table).

As a bonus, here’s a different approach to writing the same query. This form may be more appropriate for some things, depending on what you’re trying to do:

And here is the resulting query plan that contains the Startup Expression Predicate Filter operators:

Conclusion

Sometimes giving SQL Server more information than you might think is necessary can help to improve the query plans that are generated. Certainly in cases like this parent/child example, we were able to exploit a denormalized ProductTypeId column to drive the index seeks to the child tables, and make the query scale much better. The result in this case was that the total number of seeks against the child tables became independent of the number of child tables, while still retaining the original query logic. Look for opportunities like this in your queries to give SQL Server extra hints about your table schema — you can be rewarded with more scalable queries.

More?

As I was playing around with these examples, in particular the second query, I found it interesting that for some reason if the plan used a scan operator as the lower input of the nested loops join (such as I got by not using the FORCESEEK hints), there were no startup expression predicates to be found (nor Filter operators). Instead, the predicate end up on the nested loops operator itself, with each child table scanned for every upper input row. This is somewhat puzzling, as I can’t think of a reason why the lower input couldn’t be protected by a startup expression in that scenario as well. (Note: I only tested on a 2008 R2 RTM instance.)

It’s a common practice to use scalar variables, table variables, and temporary tables when developing a database application. If these transient objects contain (or are) character fields, which collation do they have if a collation isn’t explicitly specified?

Test Setup

The idea of the test is to demonstrate how collation conflicts are possible when using temporary objects.

First, we’ll create a new database with a collation that differs from the instance collation. I’ve chosen one randomly to make it easy to pick out for the purposes of demonstration. The collation of the instance I’m testing on is SQL_Latin1_General_CP1_CI_AS, which is probably the most common.

The reason why is because of collation precedence: scalar variables (and string literals, by the way) are always considered to be coercible-default. In other words, the comparison is done using the collation of the field on the other side of the expression:

Now we get the error, “Cannot resolve the collation conflict between “SQL_Latin1_General_CP1_CI_AS” and “Thai_CS_AI” in the equal to operation.” In this case, the two collations are different and at the same precedence level. SQL Server can’t simply give one collation precedence over the other, and it throws up its hands in disgust.

Temporary Tables

As mentioned in the link in the previous section, there are many similarities and differences between table variables and temporary tables. Let’s find out where character field collation fits in. We’ll run the exact same tests as in the table variable section, using a temporary table instead.

Now we get a collation conflict error. What? It turns out that temporary tables are different than scalar variables in this respect — the field collations now inherit from tempdb instead of from the current database. (Edit 2013-02-10: Thanks to Aaron Bertrand for pointing out that temporary tables created in the context of a contained database inherit the collation from the current database, not from tempdb.)

Running the second test now succeeds because both fields have an implicit collation of SQL_Latin1_General_CP1_CI_AS (since that is my instance, and hence tempdb, collation):

DATABASE_DEFAULT

Because of this crazy behaviour with temporary tables, a virtual collation DATABASE_DEFAULT was created to say that the field should use the collation of the current database, instead of inheriting from the database where the field was created.

Here’s the first query from the previous section again, with the special collation added:

Conclusion

When working with table variables and temporary tables, it’s really important to know the similarities and differences between the two. In the case of character field collations, simply switching between the two types of tables (which is often done when doing performance tuning) can potentially have bigger side-effects than were intended!

Even though character fields in table variables always inherit the collation of the current database, I would strongly consider explicitly specifying the DATABASE_DEFAULT collation on those fields anyway (assuming this is appropriate), so that the data/business logic will still be the same if someone decides to go and switch things up when performance tuning.

In software development, code review is a simple and effective way to reduce the number of bugs that creep in. Merely having a second pair of eyes look at a piece of code can often reveal logic that was implemented incorrectly or weakly, or misses use cases never considered. (Test-driven development is a separate discussion entirely; T-SQL unit testing is not the easiest thing in the world to do, and compromises sometimes have to be made, usually due to time constraints or politics.)

There are many good reasons why code review is a good practice, and I’m not going to get into that here. You can use Google as well as I can, or you can start by reading some of these questions on Software Engineering Stack Exchange. The purpose of this post is to consolidate many of the different things I look at when doing code review, or even when I’m looking over my own code. While this won’t be a complete list, and the items will vary somewhat depending on your company’s environment, it should be a good start.

In no particular order:

Syntax, including version-specific language use. This seems really obvious, but it can be easy to slip up. For example, if you normally do your development on a 2008+ server and the application still has to support 2005, it’s easy to write something like DECLARE @i int = 0 and only have this get caught during testing. (Testing is done against all versions of the database engine that are supported by the application, right? …right?) In terms of outright syntax, pay particular attention to T-SQL that’s contained in string variables, meaning dynamic SQL, or SQL statements embedded in client applications, as these aren’t validated until they are run through the containing code. Do all INSERT statements have a column list, and are all required columns specified?

Data/business logic protection. This could apply to either designing tables (which I do recommend code-reviewing, too), or when writing code to manipulate data (in the tables or otherwise). Are there enough constraints on the tables to prevent invalid data from entering the system? Are there any logical gaps in a piece of code? Does the piece of code accomplish the prescribed task? If in doubt, ask the developer what was intended. Sometimes logical gaps can be okay if a condition is rare, and the code is appropriately guarded. These same ideas apply to things like procedure parameters, particularly if any of the parameter values come from application user input.

Dynamic SQL. Follow through all logic paths to make sure the statements are being concatenated correctly for both syntax and potential injection attack vectors. Make sure the statements are parameterized appropriately.

Code Cleanliness. This is a bit tough to define. I personally don’t expect strict adherence to a style guide, for example, but I do expect code to be legible, and make appropriate use of indentation, line breaks and comments. Everyone has their own natural style — and even that can change over time — so I think giving people more or less free reign over style is fine. Again, as long as the code is legible to others, because that is the most important thing.

Version- or edition-specific feature use. This shouldn’t come up, as it should have been decided during the design phase, but at the end of the day, a developer can check anything in to the version control system.

Use of unexpected methods to accomplish a task. Does the code use a Rube-Goldberg-like approach to accomplish the task? Are cursors being used when all that’s necessary is a slightly more difficult to write set-based operation? In these types of cases, I will always go back to the developer and find out why this kind of approach was taken, and what was intended to be accomplished. Most of the time, this kind of situation provides a great learning opportunity.

Query plan reuse. Does the code reuse (or not reuse) query plans as expected? Pay particular attention to dynamic SQL: for example, there can be subtle string concatenation bugs where two generated statements (via different code paths) only differ by whitespace, but this is enough to produce a separate query plan. Sometimes dynamic SQL is used to force the creation of an alternative plan for the expected data size — does this work as expected?

Performance. Does the code meet the performance requirements, if any, and/or does it execute in a “reasonable” amount of time with a “reasonable” amount of resource usage? Does the code use table variables or temporary tables, and how does this interact with the query plan(s)? Does the code produce efficient query plans that will scale? It’s important to test code on a real data set to find out these things. Some of this is technically a “testing” activity; however, I would say it’s unlikely a testing department will have the technical skill required to determine if the code is doing the right thing. Reading and understanding query plans is a huge topic in itself; in short, look for inefficient operators that are unexpected, such as table scans (particularly as the bottom operator in a nested loops join), huge row counts (thick arrows), missing indexes, etc. Speaking of indexes, determine if the code would benefit from index changes, even if query performance is currently acceptable. Even if you don’t make those changes now, record your findings, as it could be valuable information to have later if performance does become a problem. This could potentially save an incredible amount of time and effort doing a workload analysis.

Deprecated features. This is fairly broad, as deprecation can happen for a wide range of things, from major features (i.e., database mirroring) to minor T-SQL syntax (i.e., using HOLDLOCK without surrounding parentheses). That said, for every new release of SQL Server, it’s a good idea to thoroughly read the official list of deprecated features, and make sure no new code is using any of it unless absolutely necessary (and in that case, there needs to be a plan to modify the system to remove the deprecated feature usage). I have a script on the Scripts & Code resource page that dumps the internal performance counters that track usage of deprecated features, which I think may be more broad than the official reference list. This method is great if you want to find out what your application is really doing in production, so you can proactively fix it long before it becomes a problem.

That’s a pretty good start! If you think I’ve left out something important, let me know in the comments and I’ll add it to the post, as I want this to be a good reference guide.

When designing new database structures, I feel it’s really important to make sure that the data goes into the new tables as cleanly as possible. Doing this early on helps prevent bugs from creeping into the system as it’s developed. And preventing bugs as early as possible in the development cycle is almost certainly the most cost-efficient strategy — cleaning up bad data later on is extremely expensive, time consuming, and frustrating.

One of the tools that can be used to reject bad data values going into tables are CHECK constraints. This type of constraint is meant to be used to validate a single row based on a predicate that’s a function of any combination of the table columns in that row. (CHECK constraints can be used for things beyond this definition… but doesn’t mean they should be.)

In this post, I’m not going to address any more of the advantages of constraining data values. What I do want to investigate is performance.

The extra logic to do this validation necessarily has additional processing cost associated with it. The question is: how much extra? Does it really make a difference? What kind of general guidelines can be established for performance purposes?

The Setup

I decided to run two series of tests: one with a very narrow table (3 columns), and one with a wide table (21 columns). We’ll INSERT a bunch of test data into each of the tables. A baseline would be established for each test, where the tables have no constraints. Then, different constraints would be added, and hopefully we’ll be able to see a difference in the load times.

Here is an abridged version of the table creation script (note: all scripts are available to download in full at the bottom of this post):

So that’s pretty standard. I ran the narrow table test with @i = 1250, and the wide table test with @i = 750, which resulted in 2,560,000 and 1,536,000 rows inserted, respectively. For each table, I ran the test harness 11 times; the first result was discarded, and the rest got recorded and averaged together. Note that the table is TRUNCATEd and the identity seed reset before each run. Also, the database was in SIMPLE recovery.

The Results

My hypothesis was that the constraints (at least the simple ones) would add a little bit of overhead, but I wasn’t sure if it would be measurable, and it was unclear how much (if anything) SQL Server would do to simplify the expressions I used (turns out it didn’t simplify them at all).

Average (ms)

ms/1000 Rows

% Increase

Narrow (no constraints)

12,401

0.484

–

Narrow (simple constraints)

13,034

0.509

5.1%

Narrow (complex constraints)

14,505

0.567

17.0%

Narrow (many complex constraints)

25,101

0.981

102.4%

There are a couple of very interesting things here. First of all, the difference between no constraints and a small number of simple constraints is measurable. Second — this is what really surprised me — a complex constraint is significantly more expensive than the simple constraint. This makes sense because functions are always going to be more expensive than something like a basic comparison. It seems that perhaps this difference is magnified in this case because the amount of work to do the actual INSERT is small relative to the total amount of work needed to be done. That said, it appears that many complex constraints could at the very least hobble an INSERT-only workload.

Average (ms)

ms/1000 Rows

% Increase

Wide (no constraints)

12,477

0.812

–

Wide (simple constraints)

14,647

0.954

17.4%

Wide (complex constraints)

20,238

1.318

62.2%

In this test, we would expect that the differences are larger because of the number of constraints used. As a sanity check, we can see that the ratio of the increases (62.2% / 17.4% = 3.6) vs (17.0% / 5.1% = 3.3) is approximately the same.

If we compare the narrow table with many complex constraints with the wide table with complex constraints, they both have the same number and type of constraints, and the ms/1000 rows numbers have the same sort of increases (0.981 – 0.484 = 0.497) vs (1.318 – 0.812 = 0.506), respectively. This says that evaluating constraints has a fixed cost, independent of the table schema.

Conclusion

I certainly learned a lot from doing this experiment. There are a few takeaways from the results:

Disabling CHECK constraints will reduce the load time, perhaps significantly.

A CHECK constraint’s overhead is determined by it’s complexity of evaluation (comparisons good; function calls not so much).

Evaluating CHECK constraints has a fixed cost, and thus a smaller relative cost when the other parts of the process are more expensive.

While these aren’t Earth-shattering conclusions, we did get to validate the best-practice of disabling constraints for loading, and see just how much overhead CHECK constraints can add.

What I want to stress here is that this test was for a relatively narrow situation involving only INSERTs. If your workload is 100% INSERTs, then you definitely need to pay attention to these kinds of things, but if your system isn’t INSERTing 200k rows/sec — or even anywhere near that — it’s likely that the benefits of the constraints will outweigh the performance penalty, as the overhead will be a small portion of the overall workload.