Every time I see behavior from SQL Server that I don’t understand, it’s very disconcerting. But it often turns out to be a learning experience. And if I’m lucky to get to the bottom of it, the knowledge becomes one more tool in my mental utility belt. I had that experience a couple weeks ago and I want to write about it.

The thing I learned most recently is that searching inside Unicode strings (NVARCHAR strings) is a lot more cpu-intensive than searching inside single-byte strings (VARCHAR strings). By searching I mean looking for substrings. This has nothing to do with index lookups and full-text indexing is a different topic.

So here’s how I learned it: In a database I work on, we recently changed many of our database strings into unicode strings (VARCHAR to NVARCHAR) in order to support multiple languages. And we discovered that the CPU time taken by of a couple procedures shot through the roof! It was an 800% increase in CPU and this was without any significant I/O increase.

But Why?

The extra CPU cannot be explained away by the wider characters. My gut feeling says that strings twice as long should not take eight times the CPU to search. My first thought was that there was an implicit conversion somewhere but that wasn’t the case.

After some stackoverflow.com help it turns out that this has something to do with the different collations. Many different strings can be compared as equal even though they have different binary representations. VARCHAR strings with different binary representations can compare as equal (e.g. ‘MICHAEL’ is equal to ‘michael’). And Unicode string comparisons have even more complicated rules than these.

So if SQL Server collations have something to do with it, then we can hope that by using a binary collation, we’ll save the extra CPU. And in fact, we see something like that:

Binary collations are the exception. Searching inside strings using binary collations are much faster

Not Just SQL Server

It turns out that this is most likely not SQL Server’s fault. SQL Server relies on the operating system for its string methods. In particular, it probably relies on any one of these methods found in Kernel32.dll:

lstrcmpi or lstrcmp

FindNLSStringEx, FindNLSString and FindStringOrdinal

CompareStringEx, CompareString and CompareStringOrdinal

The docs for the ~Ordinal functions indicate that these functions are meant for binary (non-linguistic) string comparisons. I’d bet a lot of money that this explains the behavior we see in SQL Server. It accounts for why comparisons using binary collations are faster while comparisons using other collations are not.

What’s the difference between the CHAR datatype and the VARCHAR datatype? At the time of this writing Google gives thousands of web pages that are willing to explain the difference. Or if not explaining the difference at least explaining which is better and when. Here are two of the more interesting or important links:

What this Blogger Thinks
I’m going to put my vote with VARCHAR over CHAR in almost all circumstances. I’m going to list what I think are the stated benefits of CHAR and then hopefully show that the benefits are outweighed by consistency you get from applying VARCHAR as a rule of thumb always. (n.b. Throughout this post, the same arguments apply to NCHAR vs. NVARCHAR.)

Data Integrity:
Does SQL Server Complain if you give CHAR (10) less than ten characters?
At first glance, you might assume (the way I did) that the following benefit: That

CREATETABLE example1 ( col CHAR(10))

is somehow equivalent to

CREATETABLE example2
(
col VARCHAR(10),
CHECK(LEN(col)=10))

But it turns out not to be the case. Inserting strings shorter than ten characters into the column in example 1 does not produce an error. Instead, SQL Server pads the value with enough spaces to fill out the value of the column.

Size
And so we get to it. The main consideration when deciding between these types always seems to be size. VARCHAR as you know takes only the characters it needs, but it uses an extra two bytes to store the length of the string. That’s why Microsoft recommends using CHAR for strings that are going to be a consistent length.

So, I’ve decided to find out how much exactly. Using the sample database Adventureworks, I changed the tables that used NCHAR columns to identical tables that use NVARCHAR columns. And I found that the number of pages consumed was exactly the same (even after rebuilding indexes). I expected that though. The small savings you get in that case was not enough to be able to fit extra rows into a database page.

But here’s something that surprised me regarding null-able columns. If you are counting every byte, you’ll see that VARCHAR is the clear space saver when it comes to NULLS. You’ll find that CHAR (10) columns that are NULL still take ten bytes while VARCHAR (10) columns take zero bytes. This is another reason to pick VARCHAR over CHAR.

Semantics
What about semantically? When application developers sees a CHAR (10) column, it’s clear to them that data in the column is expected to be exactly 10 characters. This is a benefit to the CHAR data type.
It is not something that the VARCHAR data type conveys nicely. Even with a CHECK constraint or with documentation.

So I’ll concede that. Even though it’s possible to create user defined types that do something similar:

So are CHARS Completely Useless?
No, but I don’t see any real significant advantage either. And so I think it’s easier to apply VARCHAR as a rule of thumb and as a best practice without worrying about whether it’s better than CHAR. So pick VARCHAR and be done with it.