Query XML with different languages or different UNICODE encodings [UTF-8 & UTF-16]

While creating XML documents we should take care of the UTF encoding if we are dealing with NON-US or international data. Non-English foreign languages fall in different ranges and takes more memory size compared to English language and thus require different encoding schemes, as follows:

>> UTF-8: represents an octet (8 bit) loosless encoding of Unicode characters. UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet. UTF-8 is the default encoding for XML.

In an XML prolog the encoding is represented as:‘<?xml version=”1.0″ encoding=”UTF-8″ ?>’

>> UTF-16: encoding is the 16-bit encoding of Unicode. It use 2 bytes per character (and sometimes combines two pairs), it makes implementation easier, but looks a bit overkill for Western languages encoding.

In an XML prolog the encoding is represented as:‘<?xml version=”1.0″ encoding=”UTF-16″ ?>’

Let’s check both the encodings and issues if they are not used correctly:

As per MS BOL: “SQL Server stores XML data using the UTF-16 encoding scheme. Because UTF-16 data is variable-width, it is processed according to a byte-oriented protocol. This means that UTF-16 data can be treated in a way that is independent of the byte ordering on different computers (little endian versus big endian). Therefore, UTF-16 is well-suited for traversing different computers that use different encodings and byte-ordering systems. Because XML data is typically shared widely across networks, it makes sense to maintain the default UTF-16 storage of XML data in your database, and when you export XML data to clients.”