Forcing Collation in the WHERE Clause - 22 Jun 2011

For example, you’re trying to pull back a list of all customers who marked California ‘CA’ as their home state, but SQL Server is only returning exact matches (‘CA’ and not ‘ca’ or ‘Ca’). Or, you’re trying to pull back an email address for ‘someuser@domain.com’ and SQL Server won’t match ‘someuser@Domain.com’.

Solution: SQL Server is obeying rules that define how it should compare text. To get it to ‘behave’ properly, you just need to correctly match collations – which you can do in a number of ways.

Collations Explained

Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type. Whether you are installing a new instance of SQL Server, restoring a database backup, or connecting server to client databases, it is important that you understand the locale requirements, sorting order, and case and accent sensitivity of the data you will be working with.

When you select a collation for your server, database, column, or expression, you are assigning certain characteristics to your data that will affect the results of many operations in your database. For example, when you construct a query by using ORDER BY, the sort order of your result set might be dependent on the collation that is applied to the database or dictated in a COLLATE clause at the expression level of the query.

Using Collations Granularly

Note that the definition of collations from Books Online points out that you can set collations at the server, database, column, or even expression level.

The ability to granularly define collation settings can be a huge win when it comes to the need to localize data – because it allows you localize at the lowest level possible and avoid some of the ugliness that can come from working with more ‘advanced’ collations at a higher level.

Case in point: I’ve always maintained that setting up Case-Sensitive collation at the database level is a worst practice. And I’m not alone. In my mind, there’s nothing worse than trying to execute the following:

SELECT * FROM dbo.employees

Only to be told that dbo.employees doesn’t exist, because the true name of the table is actually Employees, and there’s a case-sensitive collation in play.

So, by taking the approach of trying to keep collations constrained to just columns, it’s possible to avoid ugly situations like this one, while still being able to properly localize data through the use of advanced collations.

Similarly, in cases where you may (or may not) care about case sensitivity, specifying that choice as a collation setting at the column level almost always provides a better benefit than specifying it at the database level. For example, with a ‘Users’ table where you’re doing lookups against an email address to pull back the user’s EmailAddress and hashed PassPhrase, some applications will want to force case-sensitivity in the email address (so that ‘SomeUser@blah.com’ can’t be matched by a user specifying ‘someuser@blah.com’ – for a tiny bit of extra precaution). Whereas, in other applications, the preference may be to match the email address no matter the case sensitivity – to avoid causing confusion with end-users.

Forcing Collation at the Expression Level (or in the WHERE clause)

And, of course, one of the great things about collations is that they can be set at the expression level – something that few developers or DBAs take advantage of. Which, in turn, means that you can force or even CHANGE collations within queries.

So, for example, if you wanted to force case-sensitive searches in a given query, you can do so by means of the COLLATE clause, as follows:

The only thing you’ll want to watch out for, though, when ‘forcing’ collation within the WHERE clause is that you end up coercing your data – meaning that while the second query shown above ‘works’ (meaning that it will force SQL Server to address case-sensitivity), it ends up degrading performance by a factor of at least 20x – because SQL Server has to drop to using an Index Scan instead of an Index Seek. And that’s on a table with only 20,000 rows – so you’ll likely see a much larger performance decrease on larger tables.

Changing Column-Level Collation

Consequently, I’d only recommend this technique or approach for ad-hoc queries – not for something you’re going to do on a regular basis. Otherwise, just make sure to specify your desired collation as part of the table (or column) definition as needed.

As such, if you find that you need to change the collation for a given column (and this is something that I recommend attempting only after careful consideration – because it can have negative consequences, and because it’s a ‘size of data’ operation (meaning that the amount of time it takes is directly proportional to how much data you have in the table/column being modified)), then you’d just modify the column as follows:

And, of course, note that such an operation also incurs additional complexity in cases where there are dependencies upon the column being modified.

But, once you make the change, case-sensitivity is ‘natively’ enforced by SQL Server as expected:

Contributors

Michael K. Campbell is a contributing editor for SQL Server Pro and Dev Pro and is an ASPInsider. Michael is the president of OverAchiever Productions, a consultancy dedicated to technical evangelism...