September 18, 2014

Takeaway: According to SQL Server, an identifier with trailing spaces is considered equivalent to the same identifier with those spaces removed. That was unexpected to me because that’s not how other programming languages work. My investigation was interesting and I describe that here.

The First Symptom

Here’s the setting, I work with a tool developed internally that reads metadata from a database (table names, column names, column types and that sort of thing). Recently the tool told me that a table had an unexpected definition. In this case, a column name had an extra trailing space. I expected the column name "Id" (2 characters), but my tool was reporting an actual value of "Id " (notice the blank at the end, 3 characters). That’s what started my investigation.

But that’s really weird. What would lead to a space accidentally getting tacked on to a column name? I couldn’t think of any reason. I also noticed a couple other things. Redgate SQL Compare was reporting no discrepancies and the database users weren’t complaining at all, they seemed just fine. A bug in the in-house tool seemed most likely. My hunch was that there was a problem with the way we collecting or storing these column names (how did a space sneak in there?).

Where Are Column Names Stored?

I wanted to look at the real name of the column – straight from the source – so I ran:

SELECT*FROM INFORMATION_SCHEMA.COLUMNSWHERE COLUMN_NAME LIKE'Id%'

SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME LIKE 'Id%'

It told me that my tool wasn’t wrong. That the column was actually named "Id " with the space. So maybe Red Gate is getting its metadata from somewhere else? I know of a few places to get column information. Maybe Red Gate is getting it from one of those? Specifically I wanted to look closer at these views:

sys.columns

sys.syscolumns

INFORMATION_SCHEMA.COLUMNS

Because these objects are views, I used sp_helptext to learn that all the column names ultimately come from a system table called sys.syscolpars. But sys.syscolpars is a system table and you can’t look at its contents unless you connect to the database using the dedicated administrator connection. And that’s exactly what I did.

I learned that there is only one version of column names, only one place on disk that sql server persists the name of the column. It’s interesting because this implies that Red Gate’s SQL Compare trims trailing spaces from identifier names.

But Doesn’t SQL Server Care?

Well, there’s one way to check:

CREATETABLE[MyTest]([id ]INT);
INSERTINTO[MyTest]VALUES(1);
SELECT[id ], [id]-- one column name with a space, one column name withoutFROM[MyTest];
-- returns a dataset with column names as specified in the query.
go

CREATE TABLE [MyTest] ( [id ] INT );
INSERT INTO [MyTest] VALUES (1);
SELECT [id ], [id] -- one column name with a space, one column name without
FROM [MyTest];
-- returns a dataset with column names as specified in the query.
go

Just like Red Gate’s SQL Compare, it seems like SQL Server doesn’t care about trailing spaces in identifiers either.

Google? Stackoverflow? Want to Weigh In?

And that question has answers which link to the Books Online page documenting Delimited Identifiers. That page claims that “SQL Server stores the name without the trailing spaces.” Hmmm, they either mean in memory, or the page is inaccurate. I just looked at the system tables a moment ago and the trailing spaces are definitely retained.

The stackoverflow question also led me to a reported defect, the Connect item Trailing space in column names. This item was closed as “by design”. So this behavior is deliberate.

What do other SQL Vendors do?

I want to do experiments on SQL databases from other vendors but my computer doesn’t have a large number of virtual machines or playground environments. But do you know who does? SQL Fiddle
It’s very easy to use this site to see what different database vendors do. I just pick a vendor and I can try out any SQL I want. It took very little effort to be able to compile this table:

RDBMS

CREATE TABLE...

SELECT "id ", "id"...

MySQL

Incorrect column name 'id '

Oracle

Success

"id": invalid identifier

PostgreSQL

Success

Column "id" does not exist

SQLite

Success

could not prepare statement (1 no such column: id)

SQL Server

Success

ID

ID

1

1

And What Does the ANSI standard say?

Look at the variety of behaviors from each vendor. I wonder what the “standard” implementation should be.

Microsoft pointed to the ANSI standard again. I mean they explained exactly where to look, they pointed straight to (Section 8.2, , General rules #3) which is the section where ANSI explains how the comparison of two character strings is determined. The ANSI standard says that for string comparisons, the shorter string is effectively padded with trailing spaces so that comparisons can always performed on strings with an equal number of characters. Why? I don’t know.

And that’s where identifier comparisons come in. I found another part of the standard (Syntax rule #11) which tells me that Identifiers are equivalent if they compare as equivalent according to regular string comparison rules. So that’s the link between string comparisons and identifier comparisons.

Summary

There’s a number of things I learned about string comparisons. But does any of this matter? Hardly. No one deliberately chooses to name identifiers using trailing spaces. And I could have decided to sum this whole article up in a single tweet (see the title).

But did you figure out the head fake? This blog post is actually about investigation. The investigation is the interesting thing. This post describes the tools I like to use and how I use them to find things out for myself including:

Made use of the Dedicated Adminstrator Connection to look at system tables

Microsoft’s Books Online (used this twice!)

Microsoft Connect

Google

Stackoverflow

SQLFiddle

Wikipedia

the ANSI Standard

Maybe none of these resources are new or exciting. You’ve likely used many of these in the past. But that’s the point, you can find out about any topic in-depth by being a little curious and a little resourceful. I love to hear about investigation stories. Often how people find things can be at least as interesting as the actual lesson.

Takeaway: When performing long-running modifications, I’m sure many of you enjoy using batches to increase concurrency. But I want to talk about a pitfall to be aware of. If you’re not careful, the method you use to implement batching can actually worsen concurrency.

Why Use Batches?

Even without an explicit transaction, all SQL statements are atomic – changes are all or nothing. So when you have long-running modifications to make, locks on data can be held for the duration of your query and that can be too long. Especially if your changes are intended for live databases.

But you can make your modifications in several smaller chunks or batches. The hope is that each individual batch executes quickly and holds locks on resources for a short period of time.

But care is needed. I’m going to give an example to show what I mean. The example uses the FactOnlineSales table in the ContosoRetailDW database (available as a download here). The FactOnlineSales table has

one clustered index on OnlineSalesKey and no other indexes,

12 million rows,

and 46 thousand database pages

Metrics to Use
In this example, I want to know how long each query takes because this should let me know roughly how long locks are held.
But instead of duration, I’m going to measure logical reads. It’s a little more consistent and in the examples below it’s nicely correlated with duration.

Unfortunately, this method does poorly. It scans the clustered index in order to find 1,000 rows to delete. The first few batches complete quickly, but later batches gradually get slower as it takes longer and longer to scan the index to find rows to delete. By the time the script gets to the last batch, SQL Server has to delete rows near the very end of the clustered index and to find them, SQL Server has to scan the entire table.

In fact, this last batch performs 46,521 logical reads (just 100 fewer reads than the straight delete). And the entire script performed 1,486,285 logical reads in total. If concurrency is what I’m after, this script is actually worse than the simple DELETE statement.

Careful Batching

But I know something about the indexes on this table. I can make use of this knowledge by keeping track of my progress through the clustered index so that I can continue where I left off:

The delete statements in this script performed 46,796 logical reads in total but no individual delete statement performed more than 6,363.

Graphically that looks like:

Logical Reads Per Delete

The careful batching method runs in roughly the same time as the straight delete statement but ensures that locks are not held for long.
The naive batching method runs with an order of n² complexity (compared to the expected complexity of n) and can hold locks just as long as the straight delete statement.
This underlines the importance of testing for performance.