Proactive Database Administration

Experiments

Sometimes we face the requirement to search a table for a specific string value in a column, where the values are very long. The default answer is to create an index, but we must remember that index keys cannot exceed a 900-byte size limit, and LOB columns (which include the MAX type variants) cannot be in the index key at all. In this post, we’ll explore a method to efficiently index long character strings where the search operation is moderately-to-highly selective.

The Methodology

Since we’re faced with a size restriction on the index key (or the string column cannot be included in the key at all), we need to apply some kind of transformation to the string data to make its representation — and hence the index key size required — smaller.

Of course, we can’t simply compress an infinite amount of data into a comparatively very small field, so the index we create won’t be able to support all the same operations as if we had indexed the original field. But most of the time, that isn’t a problem — generally, we need to search for an exact match, or a string starting (or maybe even ending) with a given few characters.

We’re going to do this by performing a computation on the original values, storing the result in a new column in the table, and creating an index on that column (there are two ways to do this, which we’ll get into shortly). In terms of the computation itself, there are a few different options that can be used, depending on what needs to be accomplished. I’ll introduce two basic approaches here, and then we’ll run through some code that demonstrates one of them.

Use the LEFT function to grab the first several characters and discard the rest. This method is appropriate for both exact matching and wildcard searching with a few given leading characters (i.e., LIKE ‘abc%’), and because the original text stays intact, it supports case/accent-insensitive matching without doing anything special. The drawback of this method, though, is that it’s really important to know the data. Why? To improve the efficiency of queries, the number of characters kept should be enough that the computed values are more or less unique (we’ll see why in the example). It may in fact be the case that the original field always starts with a consistent set of characters (maybe the values are RTF documents or something), in which case most of the computed values would end up being duplicates; if so, this approach won’t work and you’ll want to use the other approach instead.

Use a traditional hash function (such as HASHBYTES) to compute a numeric representation of the original field. This method supports exact matching (case-insensitivity with a small tweak, and accent-insensitivity with a bit of code), but not partial matching, because the original text is lost and there’s no correlation between the hash value and the original text. It’s also critical to watch the data types involved here, because hash functions operate on bytes, not characters. This means hashing an nvarchar(MAX) value N’abc’ will not give the same result as hashing a varchar value ‘abc’, because the former is always 2 bytes per character. Those are the downsides. The upside is that traditional hashing works well even when the values differ by only a few characters, or when they all start with the same sequence of characters.
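As a quick illustration of the data type point, hashing the same three characters as varchar and as nvarchar produces two different values:

-- varchar 'abc' is 3 bytes; nvarchar N'abc' is 6 bytes, so the hashes differ.
SELECT
    HASHBYTES('SHA1', 'abc')  AS VarcharHash,
    HASHBYTES('SHA1', N'abc') AS NvarcharHash;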

Both of these methods are hash functions, as they convert a variable length input to a fixed length output. I suppose the first approach isn’t technically a hash function because the input could be shorter than the number of characters desired in the output, but I think you see my point that both approaches “sample” the original data in a way that allows us to represent the original value more or less uniquely without needing to carry around the original data itself. Maybe it’s more correct to speak of them as lossy compression algorithms. In any event, I will refer to the results of these functions as hash values for the remainder of the post.

Because this method discards some (or probably most) of the original data, some extra complexity must be introduced into our code to handle the possibility of duplicate hash values (called hash collisions). This is more likely to happen with the first approach, since only part of each original value is represented. It’s still possible, however rare, for the second approach to produce duplicates, too, even if the input text is unique, so we must accommodate that as well.

Speaking specifically about using a traditional hash function, such as HASHBYTES, there are a couple of major things to consider. First, there are several different algorithms to choose from. Because we aren’t going to be using the hash value for any type of cryptographic functionality (i.e., we don’t care about how difficult it will be to “unhash” the values), it’s largely irrelevant which algorithm we choose, except to pick one that produces a hash value of an acceptable size. A larger hash value size will reduce the possibility of collisions as there are more possible combinations, but as I already mentioned, we have to handle collisions anyway because no hash function is perfect; therefore, it’s okay to use a smaller hash value size to save storage space, and pick a hash algorithm that is least expensive to compute initially.1

Second, we need to make sure the hash values represent a consistent portion of the input strings. HASHBYTES is great because it’s built in, but its input parameter is limited to 8000 bytes. This means that if the string values to be hashed can be longer than 8000 bytes (8000 char/4000 nchar), a slightly different approach should be taken. I would recommend creating a CLR scalar function, as the .NET hashing classes have no input size limitation.

The Test

As I mentioned previously, there are two ways to store the hash values so they can be indexed:

Computed column. This method has two advantages: first, the SQL Server engine automatically keeps the hash value up-to-date if the source column changes. Second, if we use a non-PERSISTED column, we save the storage space for the hash values in the base table. The main disadvantage is that we may have to poke and prod our SQL statements to get the plan shape we want; this is a limitation of computed columns in general, not of what we’re doing here specifically. Note: if your goal is to create hash values in two tables and JOIN them together on that column, it may not be possible to do it efficiently using computed columns because of bizarre optimizer behaviour. Any time computed columns are involved, always inspect the execution plan to make sure SQL Server is doing what’s expected — sometimes common sense doesn’t make it into the execution plan that gets run.

Base table column, updated by triggers (if required). This method works best to get execution plan shapes that we want, at the expense of a bit of complexity to keep the values updated, and the additional storage space required in the base table.
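For reference, the computed column route looks something like the following sketch. The table and column names are placeholders, and as noted above, always verify the plan you actually get:

-- Hypothetical names: dbo.SomeTable with a varchar(MAX) column named LongCol.
ALTER TABLE dbo.SomeTable
    ADD LongColHash AS CONVERT(binary(16), HASHBYTES('MD5', LongCol));

CREATE NONCLUSTERED INDEX IX_SomeTable_LongColHash
    ON dbo.SomeTable (LongColHash);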

In the example below, we’ll use the base table column method, as it’s the most flexible, and I won’t have to go into detail about hinting in the SQL statements.

So we’ve got about 130,000 randomly generated strings, and then 3 rows whose values we can predict. Our goal is to efficiently find those 3 rows in the test queries.
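Here’s a sketch of the kind of setup I’m describing; the table name, the data generation, and the “predictable” value are all illustrative, not the original test script:

CREATE TABLE dbo.Test
(
    Id int IDENTITY(1, 1) NOT NULL,
    StringCol varchar(MAX) NOT NULL,

    CONSTRAINT PK_Test PRIMARY KEY CLUSTERED (Id)
);
GO

-- A pile of long, random-ish strings (about 1,440 characters each)...
INSERT INTO dbo.Test (StringCol)
    SELECT TOP (130000)
        REPLICATE(CONVERT(varchar(MAX), NEWID()), 40)
        FROM sys.all_columns a
        CROSS JOIN sys.all_columns b;

-- ...plus 3 rows we can predict and search for later.
INSERT INTO dbo.Test (StringCol)
    VALUES ('Needle in a haystack'), ('Needle in a haystack'), ('Needle in a haystack');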

Since the string field was defined as varchar(MAX), we can’t create an index on that column at all, and the only index we have in the base table is the clustered index on the Id column. If we try a naive search at this point, we get a table scan, which is the worst case scenario:
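Something like this, with the predicate applied directly to the MAX column (again, a sketch):

DECLARE @search varchar(MAX) = 'Needle in a haystack';

SELECT Id, StringCol
    FROM dbo.Test
    WHERE StringCol = @search;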

We get our 3 rows back, but this is far from ideal. I’m not sure why the optimizer chooses a scan and filter instead of pushing down the predicate into the scan operator itself, but that’s beside the point because this is an awful plan, with an estimated cost of over 3.5.

Let’s move on and create our hash value column and populate it. Note that if you want to update the hash values using triggers and keep the column NOT NULL, you’ll need to add a default constraint so INSERTs don’t fail outright before the trigger code gets a chance to run.
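Here’s a sketch using a 16-byte MD5 hash; the column, constraint, and index names are mine:

-- Add the hash column, with a default so future INSERTs don't fail outright
-- if a trigger is used to maintain the value.
ALTER TABLE dbo.Test
    ADD StringColHash binary(16) NOT NULL
        CONSTRAINT DF_Test_StringColHash DEFAULT(0x00);

-- Populate the hash for the existing rows (fine here because the test strings
-- are well under the 8000-byte HASHBYTES input limit).
UPDATE dbo.Test
    SET StringColHash = HASHBYTES('MD5', StringCol);

-- Index the hash column (non-unique; see below).
CREATE NONCLUSTERED INDEX IX_Test_StringColHash
    ON dbo.Test (StringColHash);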

While this looks unremarkable, the index has been created non-unique on purpose — first and foremost to accommodate duplicate source (and hence hash) values, but also to allow for the possibility of hash collisions. Even if your source values are guaranteed unique, this index should be non-unique. Other than that, this should be enough to let us construct a query to make the search more efficient:
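Again, a sketch:

DECLARE @search varchar(MAX) = 'Needle in a haystack';

SELECT Id, StringCol
    FROM dbo.Test
    WHERE
        StringColHash = HASHBYTES('MD5', @search) AND
        StringCol = @search;  -- recheck the original value in case of a hash collision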

You’ll notice I’ve still included the StringCol = @search predicate in the WHERE clause — this is to ensure correct results in the face of hash collisions. If all we did was compare the hash values, we could end up with extra rows in the results. Here’s the execution plan of the query above:

We got an index seek, which was the main thing we were looking for. The Key Lookup is expected here, because we have to compare the original values as well, and those can only come from the base table. You can now see why I said this method only works for moderately-to-highly selective queries: the key lookup is required, and if too many rows are being selected, those random I/O operations can kill performance (or the optimizer may revert to a table scan). In any event, we now have an optimal query, and even with the key lookups happening, the estimated cost was 0.0066, an improvement of over 500x on this smallish table.

Takeaways

I’m sure at this point you can think of several other uses for this methodology; it can be applied to large binary fields, too. You could even implement it for cases where the column could be indexed directly (say, varchar(500), with most of the values taking up nearly all the allowed space), but maybe you don’t want to — hashing the data might save a significant amount of storage space, of course at the expense of code complexity and a bit of query execution time.

While hashing data is nothing new, this technique exploits hashing to greatly improve the efficiency of data comparisons. I’ll leave it as an exercise for the reader to try the JOINing scenario I mentioned above — if you do, try it using computed columns first for maximum learning — you’ll find that this technique also improves the queries, as long as the join cardinality remains low (like in the 1-table scenario, to control the number of key lookups).

1 While I haven’t tested this in SQL Server either using HASHBYTES or a CLR function, there are implementations of certain hash algorithms in processor hardware itself. This may be of benefit to reduce CPU usage, possibly at the expense of storage depending on the algorithms in question.

It’s a common practice to use scalar variables, table variables, and temporary tables when developing a database application. If these transient objects contain (or are) character fields, which collation do they have if a collation isn’t explicitly specified?

Test Setup

The idea of the test is to demonstrate how collation conflicts are possible when using temporary objects.

First, we’ll create a new database with a collation that differs from the instance collation. I’ve chosen one randomly to make it easy to pick out for the purposes of demonstration. The collation of the instance I’m testing on is SQL_Latin1_General_CP1_CI_AS, which is probably the most common.
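Here’s a sketch of the kind of setup involved; the database, table, and column names are mine, not necessarily the originals. The first tests compare the columns against a scalar variable and a string literal:

-- Hypothetical setup sketch; object names are illustrative.
CREATE DATABASE CollationTest COLLATE Thai_CS_AI;
GO
USE CollationTest;
GO
CREATE TABLE dbo.TestTable
(
    Col1 varchar(50) NOT NULL,                                      -- inherits the database collation (Thai_CS_AI)
    Col2 varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL  -- explicitly matches the instance collation
);

INSERT INTO dbo.TestTable (Col1, Col2) VALUES ('a', 'a');
GO

-- Comparing either column to a scalar variable (or a string literal) succeeds,
-- even though no collation was specified for the variable.
DECLARE @search varchar(50) = 'a';

SELECT * FROM dbo.TestTable WHERE Col1 = @search;
SELECT * FROM dbo.TestTable WHERE Col2 = 'a';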

The reason is collation precedence: scalar variables (and string literals, by the way) are always considered to be coercible-default. In other words, the comparison is done using the collation of the field on the other side of the expression.
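Column references, on the other hand, carry an implicit collation label. Here’s a sketch of the table variable tests (not the original scripts; the object names come from the hypothetical setup above). The first comparison targets the Thai-collated Col1; the second targets the explicitly collated Col2:

-- Test 1: a table variable column declared without a collation picks up the
-- current database collation (Thai_CS_AI), the same as Col1, so this succeeds.
DECLARE @t table (Col1 varchar(50) NOT NULL);
INSERT INTO @t (Col1) VALUES ('a');

SELECT *
    FROM @t t
    INNER JOIN dbo.TestTable tt ON tt.Col1 = t.Col1;
GO

-- Test 2: the same table variable column compared to Col2, which carries
-- SQL_Latin1_General_CP1_CI_AS. Two column references, two different collations.
DECLARE @t table (Col1 varchar(50) NOT NULL);
INSERT INTO @t (Col1) VALUES ('a');

SELECT *
    FROM @t t
    INNER JOIN dbo.TestTable tt ON tt.Col2 = t.Col1;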

Now we get the error, “Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "Thai_CS_AI" in the equal to operation.” In this case, the two collations are different and at the same precedence level. SQL Server can’t simply give one collation precedence over the other, and it throws up its hands in disgust.

Temporary Tables

As mentioned in the link in the previous section, there are many similarities and differences between table variables and temporary tables. Let’s find out where character field collation fits in. We’ll run the exact same tests as in the table variable section, using a temporary table instead.
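A sketch of the same two tests, swapping in a temp table (again, the object names are mine):

USE CollationTest;
GO
CREATE TABLE #t (Col1 varchar(50) NOT NULL);  -- no collation specified
INSERT INTO #t (Col1) VALUES ('a');
GO

-- Test 1: temp table column vs. Col1 (Thai_CS_AI).
SELECT *
    FROM #t t
    INNER JOIN dbo.TestTable tt ON tt.Col1 = t.Col1;
GO

-- Test 2: temp table column vs. Col2 (SQL_Latin1_General_CP1_CI_AS).
SELECT *
    FROM #t t
    INNER JOIN dbo.TestTable tt ON tt.Col2 = t.Col1;
GO

DROP TABLE #t;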

Now we get a collation conflict error. What? It turns out that temporary tables are different from table variables in this respect — the field collations now inherit from tempdb instead of from the current database. (Edit 2013-02-10: Thanks to Aaron Bertrand for pointing out that temporary tables created in the context of a contained database inherit the collation from the current database, not from tempdb.)

Running the second test now succeeds because both fields have an implicit collation of SQL_Latin1_General_CP1_CI_AS (since that is my instance, and hence tempdb, collation):

DATABASE_DEFAULT

Because of this crazy behaviour with temporary tables, a virtual collation DATABASE_DEFAULT was created to say that the field should use the collation of the current database, instead of inheriting from the database where the field was created.

Here’s the first query from the previous section again, with the special collation added:
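A sketch; the only change from the temp table above is the DATABASE_DEFAULT collation on the column definition:

USE CollationTest;
GO
CREATE TABLE #t (Col1 varchar(50) COLLATE DATABASE_DEFAULT NOT NULL);
INSERT INTO #t (Col1) VALUES ('a');
GO

-- The first test now succeeds: #t.Col1 picks up Thai_CS_AI from the current
-- database instead of inheriting the tempdb collation.
SELECT *
    FROM #t t
    INNER JOIN dbo.TestTable tt ON tt.Col1 = t.Col1;
GO

DROP TABLE #t;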

Conclusion

When working with table variables and temporary tables, it’s really important to know the similarities and differences between the two. In the case of character field collations, simply switching between the two types of tables (which is often done when doing performance tuning) can potentially have bigger side-effects than were intended!

Even though character fields in table variables always inherit the collation of the current database, I would strongly consider explicitly specifying the DATABASE_DEFAULT collation on those fields anyway (assuming this is appropriate), so that the data/business logic will still be the same if someone decides to go and switch things up when performance tuning.

When designing new database structures, I feel it’s really important to make sure that the data goes into the new tables as cleanly as possible. Doing this early on helps prevent bugs from creeping into the system as it’s developed. And preventing bugs as early as possible in the development cycle is almost certainly the most cost-efficient strategy — cleaning up bad data later on is extremely expensive, time consuming, and frustrating.

One of the tools that can be used to reject bad data values going into tables is the CHECK constraint. This type of constraint is meant to validate a single row based on a predicate that’s a function of any combination of the table columns in that row. (CHECK constraints can be used for things beyond this definition… but that doesn’t mean they should be.)

In this post, I’m not going to address any more of the advantages of constraining data values. What I do want to investigate is performance.

The extra logic to do this validation necessarily has additional processing cost associated with it. The question is: how much extra? Does it really make a difference? What kind of general guidelines can be established for performance purposes?

The Setup

I decided to run two series of tests: one with a very narrow table (3 columns), and one with a wide table (21 columns), INSERTing a bunch of test data into each. A baseline was established for each test, where the tables have no constraints; then different constraints were added, with the hope that the difference would show up in the load times.

Here is an abridged version of the table creation script (note: all scripts are available to download in full at the bottom of this post):
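(The following is a hedged reconstruction rather than the original script; table, column, and constraint names are illustrative. The wide-table script follows the same pattern with 21 columns, and the different test variants swap in different sets of constraints.)

CREATE TABLE dbo.NarrowTable
(
    Id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    IntCol int NOT NULL,
    StringCol varchar(20) NOT NULL
);
GO

-- "Simple" constraints: cheap comparisons only.
ALTER TABLE dbo.NarrowTable
    ADD CONSTRAINT CK_NarrowTable_IntCol CHECK (IntCol >= 0);

-- "Complex" constraints: expressions involving function calls.
ALTER TABLE dbo.NarrowTable
    ADD CONSTRAINT CK_NarrowTable_StringCol CHECK (LEN(LTRIM(RTRIM(StringCol))) > 0);
GO

-- Test harness sketch: each iteration INSERTs a 2,048-row block, so @i = 1250
-- loads 2,560,000 rows (and @i = 750 loads 1,536,000 into the wide table).
DECLARE @i int = 1250;

WHILE @i > 0
BEGIN
    INSERT INTO dbo.NarrowTable (IntCol, StringCol)
        SELECT TOP (2048)
            ABS(CHECKSUM(NEWID()) % 1000),
            LEFT(CONVERT(varchar(36), NEWID()), 20)
            FROM sys.all_columns a
            CROSS JOIN sys.all_columns b;

    SET @i -= 1;
END;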

So that’s pretty standard. I ran the narrow table test with @i = 1250, and the wide table test with @i = 750, which resulted in 2,560,000 and 1,536,000 rows inserted, respectively. For each table, I ran the test harness 11 times; the first result was discarded, and the rest got recorded and averaged together. Note that the table is TRUNCATEd and the identity seed reset before each run. Also, the database was in SIMPLE recovery.

The Results

My hypothesis was that the constraints (at least the simple ones) would add a little bit of overhead, but I wasn’t sure if it would be measurable, and it was unclear how much (if anything) SQL Server would do to simplify the expressions I used (turns out it didn’t simplify them at all).

Test                                 Average (ms)    ms/1000 Rows    % Increase
Narrow (no constraints)                    12,401           0.484             –
Narrow (simple constraints)                13,034           0.509          5.1%
Narrow (complex constraints)               14,505           0.567         17.0%
Narrow (many complex constraints)          25,101           0.981        102.4%

There are a couple of very interesting things here. First of all, the difference between no constraints and a small number of simple constraints is measurable. Second — this is what really surprised me — a complex constraint is significantly more expensive than the simple constraint. This makes sense because functions are always going to be more expensive than something like a basic comparison. It seems that perhaps this difference is magnified in this case because the amount of work to do the actual INSERT is small relative to the total amount of work needed to be done. That said, it appears that many complex constraints could at the very least hobble an INSERT-only workload.

Test                            Average (ms)    ms/1000 Rows    % Increase
Wide (no constraints)                 12,477           0.812             –
Wide (simple constraints)             14,647           0.954         17.4%
Wide (complex constraints)            20,238           1.318         62.2%

In this test, we would expect that the differences are larger because of the number of constraints used. As a sanity check, we can see that the ratio of the increases (62.2% / 17.4% = 3.6) vs (17.0% / 5.1% = 3.3) is approximately the same.

If we compare the narrow table with many complex constraints to the wide table with complex constraints, they both have the same number and type of constraints, and the ms/1000 rows numbers show the same sort of increase (0.981 – 0.484 = 0.497) vs (1.318 – 0.812 = 0.506), respectively. This suggests that evaluating constraints has a roughly fixed per-row cost, independent of the table schema.

Conclusion

I certainly learned a lot from doing this experiment. There are a few takeaways from the results:

Disabling CHECK constraints will reduce the load time, perhaps significantly (a sketch of disabling and re-enabling a constraint follows this list).

A CHECK constraint’s overhead is determined by its complexity of evaluation (comparisons good; function calls not so much).

Evaluating CHECK constraints has a fixed cost, and thus a smaller relative cost when the other parts of the process are more expensive.
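Here’s a sketch of the disable/re-enable pattern, using the hypothetical constraint names from the setup script above:

-- Disable checking before the bulk load...
ALTER TABLE dbo.NarrowTable NOCHECK CONSTRAINT CK_NarrowTable_IntCol;

-- ...perform the load...

-- ...then re-enable. WITH CHECK validates the existing rows so the constraint
-- becomes trusted again; without it, the constraint remains untrusted.
ALTER TABLE dbo.NarrowTable WITH CHECK CHECK CONSTRAINT CK_NarrowTable_IntCol;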

While these aren’t Earth-shattering conclusions, we did get to validate the best-practice of disabling constraints for loading, and see just how much overhead CHECK constraints can add.

What I want to stress here is that this test was for a relatively narrow situation involving only INSERTs. If your workload is 100% INSERTs, then you definitely need to pay attention to these kinds of things, but if your system isn’t INSERTing 200k rows/sec — or even anywhere near that — it’s likely that the benefits of the constraints will outweigh the performance penalty, as the overhead will be a small portion of the overall workload.

While some of us may disagree, I think XML is a great method of moving small-to-medium sized data sets into SQL Server.

Because of XML’s generality, it can be used by pretty much any client application and technology you can think of, and this can be a big advantage to getting the job done quickly and flexibly. I personally like it most for passing in sets of parameters (i.e., multiple ids) into stored procedures, and sending in full sets of data values where it’s much easier to consolidate them on the SQL side rather than using complicated application logic.

The downside is that the XML has to be shredded (i.e. parsed) on the SQL Server side, which uses CPU. Out of all the tiers in our application stack, SQL Server is the most expensive to scale up, so it’s important to be mindful when putting operations with large processing overhead into production. Ideally, we’d like to use XML with as little overhead as possible.

When we tell SQL Server to shred an XML document into a rowset, it can’t make assumptions about what the incoming XML is going to look like. In fact, by default, SQL Server assumes there could be multiple root elements in the document (i.e., it’s an XML fragment)!

By using XML Schema Collections, we tell SQL Server what to expect, and the optimizer is able to tailor the query plan to only include the necessary physical operations. A great side-effect of this is that the incoming XML is automatically validated against the schema, essentially making the input “strongly typed” like we would with regular variables.

Let’s walk through an example to demonstrate just how much of a difference this can make. (Note: XML Schema Collections are available in 2005+, but this example code is written for 2008+.)
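First, a target table for the shredded rows. This is a hypothetical stand-in, since the original definition isn’t shown, and the column layout is an assumption:

CREATE TABLE dbo.Table1
(
    Id int NOT NULL,
    Value varchar(50) NOT NULL
);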

Next, I’ll create the XML Schema Collection for comparison testing. You can see all the XML Schema Collections in a database through Management Studio in the Programmability | Types | XML Schema Collections folder.
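Here’s a sketch of such a collection; the element and attribute names are assumptions, but the collection name matches the (Table1Rows) reference in the test script below:

CREATE XML SCHEMA COLLECTION dbo.Table1Rows AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="rows">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="row" maxOccurs="unbounded">
                    <xsd:complexType>
                        <xsd:attribute name="id" type="xsd:int" use="required" />
                        <xsd:attribute name="value" type="xsd:string" use="required" />
                    </xsd:complexType>
                </xsd:element>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>';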

Finally, here is the meat of the test script, which builds up an XML document and then shreds/INSERTs it into the table. Run the comparison test by toggling the (Table1Rows) part of the XML variable declaration.
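This is a sketch of such a harness rather than the original script; the document size and generated values are arbitrary, so the timings will differ from mine:

SET NOCOUNT ON;

-- Toggle the (dbo.Table1Rows) part of this declaration to compare typed vs. untyped:
DECLARE @x xml /*(dbo.Table1Rows)*/;

-- Build up an XML document of rows to shred.
SET @x =
    (
        SELECT TOP (100000)
            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [@id],
            LEFT(c.name, 50) AS [@value]
            FROM sys.all_columns c
            CROSS JOIN sys.all_columns c2
            FOR XML PATH('row'), ROOT('rows'), TYPE
    );

DECLARE @start datetime2(7) = SYSDATETIME();

-- Shred the document into the table.
INSERT INTO dbo.Table1 (Id, Value)
    SELECT
        r.value('@id', 'int'),
        r.value('@value', 'varchar(50)')
        FROM @x.nodes('/rows/row') AS t(r);

SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) AS ElapsedMs;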

While the results on your machine will no doubt vary versus mine, my test on a 2008 R2 instance took about 3,100 milliseconds without using the XML Schema Collection, and about 280 ms with. This is a huge improvement! And we can see the difference in the query plans.

Before:

After:

When I started using XML Schema Collections, I was only using them to strongly-type my XML, but it turns out that shredding the XML can see a significant performance improvement as a side-effect. I’ve read that strongly-typing XML slows down writes (modifications) to XML variables, but I have yet to encounter a situation where modifying an XML variable was necessary, or indeed, a good idea.

If you do start using XML Schema Collections in your own applications, I have a slight caution: you may want to avoid strongly typing parameters that are exposed in the database API (i.e., stored procedure parameters), even though it’s technically correct to construct the parameters that way. The reason why is that XML Schema Collections, once created, cannot be altered in-place (there is an ALTER XML SCHEMA COLLECTION, but it doesn’t do what you might expect). To change the definition, a collection must be dropped and recreated, which means that all objects with strong dependencies must be dropped and recreated as well. I think this is a royal pain — feel free to vote up the Connect feature request to add the expected functionality.

A workaround is to weakly type the API parameters, but immediately cast the parameter to a strongly-typed local variable in the procedure body, like so:
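A sketch, reusing the hypothetical dbo.Table1Rows collection from above; the procedure name and shredding query are illustrative:

CREATE PROCEDURE dbo.Table1_InsertRows
    @rows xml  -- weakly typed at the API boundary
AS
BEGIN
    SET NOCOUNT ON;

    -- Cast to a strongly-typed local variable; the XML is validated here,
    -- and the shredding query gets the benefit of the typed plan.
    DECLARE @typedRows xml(dbo.Table1Rows) = @rows;

    INSERT INTO dbo.Table1 (Id, Value)
        SELECT
            r.value('@id', 'int'),
            r.value('@value', 'varchar(50)')
            FROM @typedRows.nodes('/rows/row') AS t(r);
END;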

From an API point of view, that’s not the best solution as the strong type isn’t exposed to the outside, but IMO, it’s a good enough tradeoff from a maintenance point of view, particularly if the Schema Collection is reused in many places.

This is more of a reminder/warning/spread-the-word post to be really, really careful when dealing with multi-column foreign key relationships when one or more of the foreign columns allows NULL values.
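Here’s a hedged reconstruction of the kind of schema and statements involved (names are illustrative, not the original script):

CREATE TABLE dbo.PrimaryTable
(
    Col1 int NOT NULL,
    Col2 int NOT NULL,
    CONSTRAINT PK_PrimaryTable PRIMARY KEY (Col1, Col2)
);

CREATE TABLE dbo.ForeignTable
(
    Id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Col1 int NULL,
    Col2 int NULL,
    CONSTRAINT FK_ForeignTable_PrimaryTable
        FOREIGN KEY (Col1, Col2) REFERENCES dbo.PrimaryTable (Col1, Col2)
);

INSERT INTO dbo.PrimaryTable (Col1, Col2) VALUES (0, 0);

INSERT INTO dbo.ForeignTable (Col1, Col2) VALUES (0, 0);     -- Statement 1
INSERT INTO dbo.ForeignTable (Col1, Col2) VALUES (1, 1);     -- Statement 2
INSERT INTO dbo.ForeignTable (Col1, Col2) VALUES (0, NULL);  -- Statement 3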

Statement 1 is standard, and succeeds as we’d expect. Similarly as expected, Statement 2 fails with a constraint violation as there’s no row in PrimaryTable that has that combination of values.

What about Statement 3? If there was no foreign key, (0, NULL) would be an allowed combination of values, but what happens when there is a foreign key?

The answer is that the insert still succeeds.

I had expected it would fail just like Statement 2, because there isn’t a (0, NULL) combination in PrimaryTable. In fact, it’s not even an allowed combination to put in that table at all!

What happens is that NULL values are simply excluded from the check, and the constraint is satisfied as long as the non-NULL values match up to at least one row in the primary table. Edit: this is incorrect. If there are any NULLs in the foreign table columns, the constraint is not checked at all, which is even more perilous.

Multi-column foreign keys aren’t exactly the best practice in the world as far as schema design goes, but you do encounter them in the wild occasionally. If a database is suspect from an integrity point-of-view (which usually means it has many NULLable columns), be aware of this behaviour when doing data cleanup or querying. The corollary is that if this kind of schema is found in a database, poor integrity may not be very far away either.