Every time I see behavior from SQL Server that I don’t understand, it’s disconcerting. But it often turns out to be a learning experience, and if I’m lucky enough to get to the bottom of it, the knowledge becomes one more tool in my mental utility belt. I had that experience a couple of weeks ago and I want to write about it.

The thing I learned most recently is that searching inside Unicode strings (NVARCHAR strings) is a lot more CPU-intensive than searching inside single-byte strings (VARCHAR strings). By searching I mean looking for substrings; this has nothing to do with index lookups, and full-text indexing is a different topic.
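To make “searching” concrete, here’s the kind of query I mean (a sketch; the table and column names are hypothetical):

```sql
-- Hypothetical table holding the same text in both string types
CREATE TABLE dbo.Test
(
    v  VARCHAR(100),
    nv NVARCHAR(100)
);

-- Substring searches: a leading wildcard means no index can help,
-- so SQL Server must examine the string in every row
SELECT COUNT(*) FROM dbo.Test WHERE v  LIKE '%ABCD%';
SELECT COUNT(*) FROM dbo.Test WHERE nv LIKE N'%ABCD%';
```

It’s the second query, the NVARCHAR one, whose CPU cost is so surprising.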

So here’s how I learned it: in a database I work on, we recently changed many of our database strings into Unicode strings (VARCHAR to NVARCHAR) in order to support multiple languages. And we discovered that the CPU time taken by a couple of procedures shot through the roof: an 800% increase in CPU, with no significant increase in I/O.

But Why?

The extra CPU cannot be explained away by the wider characters; my gut feeling says that strings twice as long should not take eight times the CPU to search. My first thought was that there was an implicit conversion somewhere, but that wasn’t the case.

After some stackoverflow.com help, it turns out that this has to do with collations. Under most collations, strings with different binary representations can compare as equal; for example, VARCHAR values ‘MICHAEL’ and ‘michael’ are equal under a case-insensitive collation. And Unicode string comparisons follow rules that are even more complicated than these.
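You can see this directly. Under a typical case-insensitive collation the two strings compare as equal, while under a binary collation they don’t (a sketch; Latin1_General_CI_AS is just one common case-insensitive collation):

```sql
-- Case-insensitive collation: equal despite different bytes
SELECT CASE WHEN 'MICHAEL' COLLATE Latin1_General_CI_AS = 'michael'
            THEN 'equal' ELSE 'not equal' END;

-- Binary collation: only identical bytes compare as equal
SELECT CASE WHEN 'MICHAEL' COLLATE Latin1_General_BIN = 'michael'
            THEN 'equal' ELSE 'not equal' END;
```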

So if SQL Server collations have something to do with it, then we can hope that by using a binary collation, we’ll save the extra CPU. And in fact, we see something like that:

Binary collations are the exception: searching inside strings using a binary collation is much faster.
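In query form, the difference looks something like this (a sketch; the table and column names are hypothetical):

```sql
-- Linguistic collation: CPU-heavy substring search
SELECT COUNT(*) FROM dbo.Test
WHERE nv LIKE N'%ABCD%';

-- Cast to a binary collation: far less CPU,
-- but the match is now case- and accent-sensitive
SELECT COUNT(*) FROM dbo.Test
WHERE nv COLLATE Latin1_General_BIN LIKE N'%ABCD%';
```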

Not Just SQL Server

It turns out that this is most likely not SQL Server’s fault. SQL Server relies on the operating system for its string methods. In particular, it probably relies on any one of these methods found in Kernel32.dll:

lstrcmpi or lstrcmp

FindNLSStringEx, FindNLSString and FindStringOrdinal

CompareStringEx, CompareString and CompareStringOrdinal

The docs for the ~Ordinal functions indicate that these functions are meant for binary (non-linguistic) string comparisons. I’d bet a lot of money that this explains the behavior we see in SQL Server. It accounts for why comparisons using binary collations are faster while comparisons using other collations are not.

Whew, turns out that R Meyyappan covers this quite well in a SQL Bits session he gave almost a year ago: http://bit.ly/hsjfJs

I wish I had seen it then; it could have saved me some time this year. It was also very encouraging because he describes a solution we ended up using, and because his demo performance numbers are in line with mine (8x worse with Unicode).

Michael this is an excellent post, thanks very much for raising it and it certainly gives me a few things to think about. I have to confess I’d missed the SQL Bits session too so I’ll definitely be taking a look at that.

Then I just had to tweak your test code:
select COUNT(*) from test where charindex('ABCD', nv COLLATE Latin1_General_Bin) > 0 option (maxdop 1)

No surprise that lower case had no matches. The results did surprise me, though, because I wasn’t comparing a Unicode string to the binary Unicode column. (I threw in CHARINDEX to see if it had any performance benefit. It didn’t, beyond random variation in execution stats, but this binary comparison was nearly twice as fast as the byte comparison.)

Nice article. (In the first tests using your original test script, the first query was 0.2 seconds slower on my machine and the second was nearly a second faster.)

Unicode string searches are definitely more expensive — the comparison rules are much more complex than the relatively simple ones for non-Unicode SQL sort orders. But it’s worth pointing out that this overhead is generally only noticeable if you are doing full table or index scans of a fairly large table. In queries with search predicates that can be satisfied via a selective index seek, this will generally be a non-issue.
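To illustrate that distinction (a sketch, assuming an index on the nv column of a hypothetical table):

```sql
-- Seekable predicates: the leading characters are known,
-- so an index seek touches only the matching rows
SELECT COUNT(*) FROM dbo.Test WHERE nv = N'ABCD';
SELECT COUNT(*) FROM dbo.Test WHERE nv LIKE N'ABCD%';

-- Not seekable: a leading wildcard forces a scan,
-- and every row pays the Unicode comparison cost
SELECT COUNT(*) FROM dbo.Test WHERE nv LIKE N'%ABCD%';
```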

In other words, do not assume that moving to Unicode will make an arbitrary application 8X slower. That’s close to the worst case scenario… that kind of additional cost will only be seen by a subset of applications (and even with the affected apps, only in a small subset of queries).

Of course it’s worth noting that “Your mileage may vary”. But when coming across a huge jump in CPU, as I and others have done, it’s good to have an explanation, and that’s what I’m offering here.

I did point out that “This has nothing to do with index lookups”. SARG-able queries are in fact best, of course. I never meant to imply that Unicode is always 8x worse, only when searching for substrings (hence the title).

Sorry, I didn’t mean to imply that you had stated anything that contradicted the point I was making.

I’ve run into this problem myself. But I’ve also worked with many, many apps that used Unicode types without issue. I just want to make sure that people don’t walk away with an incorrect conclusion that they should avoid using Unicode types. If you want your app to be multilingual, you want Unicode; I would caution people not to try storing multilingual data in varchar fields. With most apps the Unicode overhead will be so small that it would be difficult to measure.

That said, this potential perf issue is important to be aware of. And in cases where you’re storing strings that are simple programmatic identifiers that don’t need to store multilingual data (in other words, the field’s content is interpreted by an app but is not intended for direct consumption by end users), the additional cost of Unicode string comparisons is a good reason to use varchar instead of nvarchar.
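In table-design terms, that guidance might look like this (a sketch with hypothetical columns):

```sql
CREATE TABLE dbo.Product
(
    StatusCode  VARCHAR(20)   NOT NULL, -- programmatic identifier, never shown to end users
    DisplayName NVARCHAR(200) NOT NULL  -- user-facing text, may hold any language
);
```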

Ahh… Sorry for misunderstanding your original post. And I understand your point now.

It’s interesting how I’m a slave to my own point of view sometimes. I looked at this Unicode vs. cpu issue from an “explaining weird behavior” point of view and that’s how I wrote the post.

It never occurred to me that someone might mistake this article to mean: “When designing databases columns, avoid Unicode or face the dire CPU consequences”. Which couldn’t be farther from what I meant.

Interesting observation Michael, I read it with pleasure! Collations: once in a while we all have some fun with them.

What I’d like to point out, though: the speed improvement from casting the value to a binary collation is good, but the results may end up different. Under a binary collation, characters only compare as equal when their binary representations match. The biggest impact of that is case-related: if your regular collation is CI, you lose that case-insensitivity with the binary collation (well, unless you use the LOWER() or UPPER() function on both your column and your search term). I’m sure you’re aware of this, but I didn’t see it explicitly mentioned above, which means that some readers may be getting unexpected results.

Having said that, this method is definitely interesting to keep in mind for certain circumstances such as filtering on GUIDs (erm, yeah right :-))

In troubleshooting mode, we all look for the solution to a problem. In this article it seems like I’m describing a problem (slow string searching), but it would be a mistake to think that switching collations is a solution I’m proposing. That certainly wasn’t my intent. String comparison behaviour is greatly impacted by collations (which is the point).

But it’s interesting that in the particular circumstances that prompted me to write this article, I temporarily implemented a comparison that looked something like:
WHERE UPPER(NvarcharColumnToSearch) COLLATE Latin1_General_Bin LIKE UPPER(N'%SearchString%')
until I could replace the whole thing with something “SARG”-able.

[…] This is the fun part. Investigation time. I use google, #sqlhelp, and everything I can to help me get to the bottom of the mystery. In the case of the Unicode string mystery, I opened a StackOverflow item and got some quick quality answers. Also after the mystery got solved, it made a great topic for a blog post. […]