SQL Server and UTF-8 Encoding (1): True or False

Today I will start my series of articles about SQL Server and Unicode UTF-8 encoding. Many times I have found that when people ask me about UTF-8, they don't actually understand it. So today's talk will be quite short; I just want to clarify some common misunderstandings.

1. "SQL Server doesn't support Unicode" (do you mean UTF-8?)

Sometimes people simply say "SQL Server doesn't support Unicode". Actually, this is wrong: SQL Server has supported Unicode since SQL Server 7.0, through the nchar/nvarchar/ntext data types. What SQL Server doesn't support is the UTF-8 encoding of Unicode data; it supports the UTF-16 encoding instead. Here are several definitions of these concepts, copied from the Internet:

"Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. This unique number, called the code point of a character, is in the range 0 (hex) to 10FFFF (hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, the character A is named LATIN CAPITAL LETTER A and is assigned the code point U+0041.

In addition to assigning each character a unique code point, the Unicode Consortium also defined several Unicode transformation formats (UTFs), which are algorithmic mappings from every Unicode code point to a unique byte sequence. Note that the code point itself says nothing about how a character is stored or transmitted in a computer; a UTF does.

The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode, commonly used on Unix operating systems and in HTML and XML files. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the roughly 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, in .NET, and in Java. Transformations between the different UTFs are lossless and round-tripping. "
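The code-point-to-bytes mapping the quote describes can be seen in a few lines of Python (a sketch; U+2000B is just one example of a supplementary character, picked by me):

```python
# A BMP character ('A', U+0041) and a supplementary character (U+2000B,
# an extended Han character) encoded under both UTFs. The supplementary
# character needs 4 bytes either way: four 8-bit units in UTF-8, or a
# surrogate pair (two 16-bit units) in UTF-16.
for ch in ("A", "\U0002000B"):
    print(f"U+{ord(ch):04X}",
          "utf-8:", ch.encode("utf-8").hex(" "),
          "utf-16:", ch.encode("utf-16-be").hex(" "))
```

Note that both encodings are computed from the same code point; that is why conversion between them is lossless.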

In summary, don't confuse Unicode with UTF-8 encoding. They are different concepts: Unicode assigns the code points; UTF-8 is one way to serialize them into bytes.

2. UTF-8 Encoding is much better than UTF-16 Encoding

There are tons of articles comparing UTF-8 with UTF-16. I will compare the two encodings side by side in my next article. I can give 100 reasons why UTF-8 is better than UTF-16, and vice versa. The correct answer is that no encoding is absolutely better than the other. You should choose the suitable encoding according to your application's requirements; the operating system, programming language, and database platform all matter when choosing. UTF-8 is most common on the web; UTF-16 is used by Java and Windows. The conversions between them are algorithmic, fast, and lossless, which makes it easy to support data input or output in multiple formats while using a particular UTF for internal storage or processing.
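To make the trade-off concrete, here is a quick Python sketch (the sample strings are my own, purely illustrative):

```python
# UTF-8 is smaller for ASCII-heavy text; UTF-16 is smaller for CJK-heavy
# text. Neither wins universally, which is the point above.
ascii_text = "SELECT name FROM sys.databases"   # illustrative sample
cjk_text = "数据库支持统一码"                    # illustrative sample
for label, s in (("ascii", ascii_text), ("cjk", cjk_text)):
    print(label,
          "utf-8 bytes:", len(s.encode("utf-8")),
          "utf-16 bytes:", len(s.encode("utf-16-le")))
```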

So please don't dismiss SQL Server's Unicode support just because it supports only one of the UTFs.

3. SQL Server cannot store all Unicode Characters

You may notice that I say SQL Server supports UTF-16 encoding in the previous paragraph, even though I have also said that SQL Server's nvarchar type uses the UCS-2 encoding. I do this intentionally, because I want to discuss the difference between these two encodings here. Let me describe in detail in which areas SQL Server supports UTF-16 encoding:

SQL Server can store any Unicode character in UTF-16 encoding. The reason is that the storage format of UCS-2 and UTF-16 is the same: a supplementary character stored as a surrogate pair in UTF-16 is simply two 16-bit values to a UCS-2 engine.

SQL Server can display any Unicode character in UTF-16 encoding. The reason is that it internally calls Windows functions to display characters, and those functions and fonts handle supplementary characters (characters that take 4 bytes in UTF-16) correctly.

SQL Server can accept input of any Unicode character in UTF-16 encoding. The reason is that it internally relies on Windows IMEs (Input Method Editors) for input, and the Windows IMEs handle supplementary characters correctly.

SQL Server can sort and compare any defined Unicode character in UTF-16 encoding. Note that not all code points map to valid Unicode characters. For example, The Unicode Standard, Version 5.1 defines code points for around 100,000 characters. All of these characters can be compared and sorted in the latest version, SQL Server 2008.

In contrast, here are the UTF-16 behaviors SQL Server doesn't support:

SQL Server cannot detect invalid UTF-16 sequences. An unpaired surrogate is not valid in UTF-16 encoding, but SQL Server accepts it as valid. Note that, in practice, it is unlikely an end user will input an invalid UTF-16 sequence, since unpaired surrogates are not used by any language or produced by any IME.

SQL Server treats a UTF-16 supplementary character as two characters: the LEN function returns 2 instead of 1 for such input.

SQL Server can potentially break a UTF-16 supplementary character into unpaired surrogates, for example when calling the SUBSTRING function. Note that in real scenarios the chance of this happening is much lower, because 1) supplementary characters are rare, and 2) a string function only breaks a character when it falls exactly on the boundary. For example, substring(s, 5, 1) breaks a surrogate pair if and only if the character at index 5 is a supplementary character.
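All three limitations stem from SQL Server counting 16-bit code units without validating them. A Python sketch of the same arithmetic (Python's utf-16-le codec stands in for SQL Server's internal format; the sample strings are mine):

```python
# 1) An unpaired high surrogate is invalid UTF-16; a strict decoder
#    rejects it, which is the validation SQL Server skips.
try:
    b"\x00\xd8".decode("utf-16-le")       # lone U+D800, little-endian
except UnicodeDecodeError:
    print("strict decoder: invalid sequence")

# 2) 'a' + U+20000 (a supplementary character) + 'b': 3 code points but
#    4 UTF-16 code units, which is how LEN counts (LEN would return 4).
s = "a\U00020000b"
units = s.encode("utf-16-le")
print("code points:", len(s), "code units:", len(units) // 2)

# 3) Cutting after 2 code units, as a SUBSTRING-style slice can,
#    leaves 'a' plus an unpaired high surrogate.
try:
    units[:4].decode("utf-16-le")
except UnicodeDecodeError:
    print("slice ends in an unpaired surrogate")
```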

In summary, SQL Server DOES support storing all Unicode characters, although with the limitations above. Please refer to my previous blogs for details.

When you choose nvarchar as your data type, you automatically get UTF-16 encoding. For nvarchar types, SQL Server always uses UTF-16 encoding to store the data, no matter which collation you use. The collation only determines how two strings are compared; since different languages have different sorting rules, we have many different collations.

The most common collation is Latin1_General_CI_AS. Although the name seems to indicate that it can only sort Latin-based characters correctly, it actually sorts the majority of languages correctly.

Hi Qingsong, I'm dealing with Chinese characters. How come when I imported my UTF-8 file with Chinese characters into SQL Server, I got garbled non-Chinese characters, even though I used nvarchar? Can you kindly advise? Thank you

If you are using SQL Server 2005, you can use bcp.exe to bulk load UTF-8 data into your table: try the [-C code page specifier] option and use 65001 as the code page specifier. Note, I have never tried this option myself, but I know it works.

If you are using SQL Server 2008, the bcp.exe shipped with it does not support importing UTF-8 data. As a workaround, you can install the SQLNCLI library shipped with SQL Server 2005 (I believe it is a downloadable package) on another machine and bulk load the UTF-8 data into the SQL Server 2008 table from there.

Another workaround is converting your data from UTF-8 to UTF-16 encoding; you can either write a simple C# program or use Notepad (Save As, choosing the "Unicode" encoding).
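A minimal sketch of that conversion in Python instead of C# (the function name and file paths are my own):

```python
def utf8_to_utf16(src_path: str, dst_path: str) -> None:
    """Re-encode a text file from UTF-8 to UTF-16 (little-endian, with BOM)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-16", newline="") as dst:
        for line in src:        # stream line by line; large files are fine
            dst.write(line)
```

Python's "utf-16" codec writes a byte-order mark, which tools such as bcp.exe and Notepad use to recognize the file as Unicode.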

Note: the default encoding for Chinese characters is GB2312, which is supported as code page 936 and is natively supported by SQL Server.

So, what you're saying is that SQL Server doesn't support UTF-16 in all cases, but does support UCS-2 (equivalent to UTF-16, BMP only)? Unfortunately, that means not all Chinese, Japanese, and Korean characters can be stored in SQL Server. A partial implementation is not much use if you are doing global internationalization.

The short answer is that a collation can support many languages. A collation has a major language; for example, Latin1_General_CI_AS has the major language US English. It also supports the Korean language, since Korean sorting is compatible with US English. So this collation may be enough for your application.

However, Latin1_General_CI_AS only supports Unicode 2.0 characters, so if you want more advanced Korean language support, and Korean is the majority language of your application, you should choose Korean_90_CI_AS as your collation.

Here's a case that needs your help. I created a linked server in SQL Server 2005 connecting to MySQL. I need to query UTF-8 data from MySQL and insert it into a local table. After the insert, the English characters are OK, but the Chinese characters are not displayed correctly. Is there a way to make the linked server retrieve data from MySQL correctly?

Are you using the nvarchar type in SQL Server? What happens if you only select the data through the linked server, without inserting it into the table?

The behavior really depends on how the character data type defined in MySQL is mapped to an OLE DB type. If it maps to the varchar type, you will have data loss; if it maps to the nvarchar type, you should be fine.

I have a SQL Server 2005 database that has the default collation SQL_Latin1_General_CP1_CI_AS. I am trying to store Turkish data (ISO 8859-9 and/or code page 1254). There are actually only six characters that don't store correctly in varchar columns. For example, one of them is "Ş" (LATIN CAPITAL LETTER S WITH CEDILLA, U+015E, decimal 350). So, for a test, I created a new nvarchar(10) column, thinking that would be all that was required. However, when I run the following T-SQL command in SSMS:

INSERT INTO mytable (MynVarchar) VALUES ('Ş')

the "Ş" gets changed to a Latin1 capital "S" without the cedilla, UNLESS I prefix the Unicode string with an uppercase 'N', as below:

INSERT INTO mytable (MynVarchar) VALUES (N'Ş')

I read in the article below that this is a requirement, but I do NOT understand WHY it is this way. Why can't I just use NVARCHAR and have the data inserted correctly, as you wrote in your article above? To me this means that it's not a simple matter of changing my columns from varchar to nvarchar; in addition, I have to change ALL of my source code to include the N prefix. From your article above (and many others on the Internet, excluding the one above from Microsoft), I keep reading that I just need to use NVARCHAR, but this is not correct? What am I missing here? Very confusing…

I should add that I also tried changing the collation of the database from SQL_Latin1_General_CP1_CI_AS to Turkish_CI_AS. What I found was that the command below worked WITHOUT the N prefix:

INSERT INTO mytable (MynVarchar) VALUES ('Ş')

So at that point I said to myself, "GREAT! No problem! I will simply change the collation to Turkish_CI_AS on all Turkish servers, change all varchars to nvarchars, and then we are good to go, WITHOUT having to modify all of my source code to include the N."… But then I connected to a server in Turkey to test the same thing (created a DB with Turkish_CI_AS as the collation, created an nvarchar(50) column, and ran the same insert command), and BOOM, the Ş is screwed up again and is now an "S". I am extremely puzzled!!!

The reason is that 'Ş' (without the prefix) is a varchar literal, which has a code page associated with it, while N'Ş' is an nvarchar literal, i.e., a Unicode string. The code page of SQL_Latin1_General_CP1_CI_AS does not define the character 'Ş', so SQL Server substitutes the best-matched character for it.
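A Python sketch of the code-page difference (Python substitutes '?' where SQL Server's best-fit mapping substitutes 'S', but the root cause, a missing slot in code page 1252, is the same):

```python
# U+015E ('Ş') exists in the Turkish code page (1254) but not in the
# Western European code page (1252) used by SQL_Latin1_General_CP1_CI_AS.
print("Ş".encode("cp1254"))                    # encodes fine: b'\xde'
print("Ş".encode("cp1252", errors="replace"))  # no slot: falls back to b'?'
# SQL Server applies a "best fit" mapping rather than '?', which is why
# the varchar literal 'Ş' silently becomes a plain 'S'.
```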

For your second question, I don't know the reason. It might be a client-side issue; can you try running the query in SQL Server Management Studio and see the result?

I actually remote into their server and have been doing the tests inside SQL Server Management Studio. I just tried it again, though, and this time it worked! I must have done something wrong when I set the collation the first time. (I have been switching back and forth between the default collation and the Turkish one, and I must have goofed in my testing.)

Anyway, it appears that after setting the collation to Turkish_CI_AS, I can issue the INSERT command below without the N prefix, and the Ş character is preserved.

There are many articles that never mention collation settings, though, and seem to suggest that, at least in terms of storing data, "the collation doesn't matter as long as you use nvarchar instead of varchar as your data type." These articles suggest that the collation only has value for sorting and comparing the stored data.

In fact I just re-read this page of yours and you write:

"When you choose nvarchar as your data type, you automatically get UTF-16 encoding. For nvarchar types, SQL Server always uses UTF-16 encoding to store the data, no matter which collation you use. The collation only determines how two strings are compared; since different languages have different sorting rules, we have many different collations."

This again suggests that the collation doesn’t matter. Can you explain? Thank you.

I wanted to move the entire database to a new server as-is, so I took a backup, moved it to a new SQL 2005 SP1 server, and restored the backup. Now when I query, Chinese characters are not displaying.

First, let us confirm whether it is a display issue or a data-loss issue. From my experience, it is unlikely that backup/restore causes data loss, so I guess it is a display issue. Can you confirm that the correct Chinese fonts are installed? This happens on Windows XP/2003, which does not install the Chinese language package by default. You can go to Control Panel -> Regional and Language Options and install it.

I am hoping you did not forget about me and my last question regarding whether the database collation is essential for "storing" Unicode. From my tests it appears that it is, but as I've read from your posts above, and from many others, people usually say that the collation only matters for sorting and comparing the stored data.

Actually, I have a MySQL database linked to SQL Server 2000 using OLE DB for ODBC and MySQL ODBC 5.1. The MySQL database uses the UTF-8 charset. When I run a query, it returns a bad word; I mean, instead of returning Bogotá, it returns Bogot&aacute;.

I can't change the collation directly on the MySQL database because it's not mine. How can I read those fields the right way? I've been trying collation syntax, but it doesn't work.

I have worked with Unicode for years now and I agree with everything you said. I am designing a database for a forum on my site. I would like to store each posted message body in UTF-8 format, so I believe I'll need to resort to VARBINARY instead of NVARCHAR. Imagine 700,000 user posts in the database; what savings I could achieve with UTF-8 instead of UTF-16. 🙂

So all that is fine, my question really is, how do I then get the FTS (Full Text Search) capability enabled for each of these messages? Since it is a forum, it is natural that users would want to search for certain contents.

Another naive question: what does the "N" in front of a string literal mean? (I'm more of a C++ guy; SQL isn't quite my cup of tea.) Sorry, I could have searched, but "N" being a single character is not quite search-friendly.

When you use FOR XML AUTO, the result is the xml data type, which is not nvarchar or varchar and has no encoding. If you want to convert it to ASCII, you can use an explicit CONVERT/CAST from xml to the varchar type.

I strongly suggest you store your data in UTF-16 encoding inside the database instead of using the varbinary data type. The main reason is that SQL Server doesn't have any built-in support for UTF-8, so you would be unable to query or full-text search UTF-8 data.

If you are using ASP or ASP.NET, the client will automatically handle converting the UTF-16 data to UTF-8 for you, so you will not experience any issues. You can refer to this article for details: http://support.microsoft.com/kb/893663

It is true that you may waste some disk space, but you will have less trouble handling Unicode data in your application. If you want to save disk space, the data compression feature in SQL Server 2008 can achieve this, and the feature was enhanced in SQL Server 2008 R2 to achieve a better compression ratio than UTF-8 encoding.

The reason for having the N prefix is that we have two different kinds of string constants:

'abc' is a constant of varchar type; since the varchar type has a code page, it can only store characters defined in that code page.

N'abc' is a constant of nvarchar type, which is UTF-16 encoded.

The following statements:

create table t1(c1 nvarchar(10))

insert into t1 values('abc')

insert a varchar constant 'abc' into an nvarchar column. So even though c1 is an nvarchar column, the constant might be converted to '???' before being inserted into the table if a, b, c are not in the code page.

I'm struggling to find the best match between a collation in MSSQL and an encoding (charset) in XML for the Korean language. Initially I picked 'Korean_90_CI_AS' for testing. What is your recommendation? Thanks in advance.

You need SQL Server's nvarchar type to insert Unicode strings. Also, please use the nvarchar type for the parameters of your stored procedure. Please let me know if you still have trouble with this. You can e-mail me at qyao@microsoft.com with further questions.

I have a huge problem with the correct display of Polish characters in a PHP application.

The collation of the table is set to the default SQL_Latin1_General_CP1_CI_AS. The data type of the column is nchar.

When I insert records manually with Polish characters and then display them using a SELECT statement, everything is OK; I'm talking about running the SELECT query directly in MSSQL Server 2005. The problem appears when I want to display the rows in the PHP app. The charset of the HTML document is set to UTF-8. I tried to display the characters correctly using the utf8_encode function, but with no effect; all Polish characters like "ąęłźćś" are auto-converted to "aelzcs". Any ideas?

In fact, it was a great read. Thanks for the article. I have an issue with Japanese characters. I am using SQL Server to connect to a Progress database through ODBC. I have the fields in the table set to nvarchar, and the DB collation is French_CI_AS. The issue is that I am not able to correctly store Japanese characters in the table. When I insert data manually using the N'' prefix, it works fine. How can I use N'' while selecting? For example, if I have a query insert into tab1 select field1 from tab2, how can I prefix field1 with N''?

> SQL Server treats a UTF-16 supplementary character as two characters: the LEN function returns 2 instead of 1 for such input.

I have regular need for characters above U+FFFF (the extended Han characters, specifically).

The fact that these characters are counted as two characters is quite problematic.

For example, in a user interface, a user is allowed to enter up to 25 characters in a field.

This will be a mixture of characters below U+FFFF (1 "character") and a few above U+FFFF (2 "characters"). To the user, the total number of characters that can actually be entered is thus less than the stated 25.

There are no workarounds, so we often need to make the field 15% wider than we actually say is possible.

In the next version of SQL Server, we will have new UTF-16 collations, and the length issue will be addressed. However, when you declare a field/column as nvarchar(10), it might still hold only 5 extended Han characters, because the length semantics of the data type have not changed: the length is counted in 16-bit code units, not characters.

Oidon,

Please keep in mind that the extended Han characters are not frequently used in many apps, so if you add a small buffer to the total, it should work in many cases.

I am trying to use CDOSYS to send out mails from SQL Server 2005. A recent requirement is to send out mails in the Korean language as well. When I tried, the subject line and the body were not displayed properly ("???" instead of Korean characters). I have used the nvarchar data type wherever required, and I have preceded string constants with N.

I have a user who is pulling data from a SQL Server that we do not support anymore. The DB is on SQL 2008, and the file the user receives is encoded as UTF-16. What the user needs is for the data to come through as UTF-8 (Unicode). This tells me that the SQL Server, or the DB, is set to UTF-16. What they would like to do is take the data in the UTF-16 DB and have it converted to UTF-8. Is there a way to write SQL syntax that will specify UTF-8 and have the data converted to that format?

I remember we blocked bcp in/out between UTF-16 and UTF-8 encodings in SQL Server 2008; I guess your customer is hitting this issue. The workaround is an encoding transform on the client side. You can use Win32 APIs, .NET, Java, PowerShell, or many other tools to achieve this. Note: when you bcp out your data, you must specify that you want Unicode format.
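A sketch of that client-side transform in Python, one of the many tool options mentioned above (the function name and file paths are my own):

```python
def utf16_to_utf8(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-16 (Unicode-format bcp) export to UTF-8."""
    with open(src_path, encoding="utf-16") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:        # stream line by line; large exports are fine
            dst.write(line)
```

Python's "utf-16" codec consumes the byte-order mark on read, so the UTF-8 output contains only the data.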