Windows Programming/Unicode

Unicode is an industry standard whose goal is to provide the means by which text of all forms and languages can be encoded for use by computers. Originally, text characters were represented in computers using byte-wide data: each printable character (and many non-printing, or "control", characters) was stored in a single byte, which allowed for 256 characters in total. However, globalization has created a need for computers to accommodate many different alphabets from around the world.

The old codes were known as ASCII or EBCDIC, but it became apparent that neither of these codes was capable of handling all the different characters and alphabets from around the world. Unicode was created to solve this problem. Windows NT implements many of its core functions with a "wide" 16-bit character set, close to the Unicode standard, although it also provides a series of functions that are compatible with the standard ASCII characters.

In Windows programming, Unicode characters are frequently called "Wide Characters", "Generic Characters", or "T Characters". This book may use any of these terms interchangeably.

Before Unicode, there was an internationalization attempt that introduced character strings with variable-width characters. Some characters, such as the standard ASCII characters, were 1 byte long. Other characters, such as those in extended character sets, were 2 bytes long. These character formats fell out of favor with the advent of Unicode because they are harder to write and much harder to read. Windows does still maintain some functionality to deal with variable-width strings, but we won't discuss it here.

Unfortunately, the advantages of fixed-width wide characters were lost because the number of characters needed quickly exceeded the 65,536 possible 16-bit values. Windows actually stores characters in an encoding called UTF-16, in which a large number of characters take two 16-bit words; these are called "surrogate pairs". This development came after much of the Windows API documentation was written, and much of that documentation is now obsolete. You should never treat string data as an "array of characters"; instead, always treat it as a null-terminated block. For instance, always send the entire string to a function to draw it on the screen, and do not attempt to draw each character separately. Code that puts a square bracket after an LPSTR or LPWSTR to pick out "one character" is usually wrong.
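To make the surrogate-pair idea concrete, here is a minimal sketch in portable C. The function names utf16_encode and utf16_count_code_points are hypothetical helpers written for this illustration, not part of the Windows API. The sketch shows that one character may occupy one or two 16-bit units, so the number of array elements is not the number of characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: count code points in a null-terminated UTF-16
 * string. A high surrogate followed by a low surrogate counts once. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t n = 0;
    while (*s != 0) {
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;                     /* skip both halves of the pair */
        else
            s += 1;                     /* ordinary single-unit character */
        n++;
    }
    return n;
}
```

For example, U+1F600 is stored as the two units 0xD83D 0xDE00, so a buffer holding "H" followed by that character has three non-null elements but only two characters. Indexing such a buffer yields code units, not characters.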

At the same time, variable-width strings made a big comeback in the multi-platform standard called UTF-8, which is much the same idea as UTF-16 except with 8-bit units. Its primary advantage is that there is no need for two APIs: the 'A' and 'W' APIs could have been the same if UTF-8 had been used, and since both encodings are variable-width, UTF-8 has no disadvantage in that respect. Although most Windows programmers are unfamiliar with it, you may see increasing references to using the non-UNICODE API.
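As an illustration of the "same idea with 8-bit units", the following sketch encodes a single code point as UTF-8 in portable C. The function name utf8_encode is a hypothetical helper for this example and does not come from any Windows header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes to out and returns how many were written,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                   /* invalid input */
    if (cp < 0x80) {                                /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                               /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                             /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));    /* four bytes */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Because ASCII characters keep their single-byte values, a UTF-8 string can travel through any interface that takes an ordinary null-terminated char string.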

The Win32 API provides every function that takes text input in two versions. One has an "A" suffix (for ANSI, the 8-bit character version), and the other has a "W" suffix (for Wide characters, or Unicode). The generic name without a suffix, such as MessageBox, is a macro that expands to one version or the other depending on whether the macro "UNICODE" is defined.
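The selection mechanism can be sketched in portable C as follows. The names TextLengthA, TextLengthW, and TextLength are hypothetical stand-ins chosen so that the example compiles without windows.h; the real headers are assumed to apply the same pattern to names such as MessageBoxA, MessageBoxW, and MessageBox.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "A" variant: takes a narrow (char) string. */
size_t TextLengthA(const char *text)
{
    return strlen(text);
}

/* Hypothetical "W" variant: takes a wide (wchar_t) string. */
size_t TextLengthW(const wchar_t *text)
{
    return wcslen(text);
}

/* The generic name is only a macro. The preprocessor replaces it with
 * one of the two real functions before the compiler ever sees it. */
#ifdef UNICODE
#define TextLength TextLengthW
#else
#define TextLength TextLengthA
#endif
```

With UNICODE undefined, a call to TextLength("hello") compiles as a call to TextLengthA. Defining UNICODE before this code (for example with a compiler option such as -DUNICODE) would make the same name refer to TextLengthW, which expects a wide string literal such as L"hello".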

Because of this differentiation, a compiler error may refer to "MessageBoxW" (or "MessageBoxA") instead of simply "MessageBox". In these cases, the compiler is not broken. It is simply reporting the name that the macros expanded to.

All Windows functions that require character strings are defined in this manner. If you want to use Unicode in your program, you need to explicitly define the UNICODE macro before you include the windows.h file:

#define UNICODE
#include <windows.h>

Also, some functions in other libraries require you to define the macro _UNICODE. Unicode versions of the standard library functions are made available by including the <tchar.h> file as well. So, to use Unicode throughout your project, define both UNICODE and _UNICODE before any headers, and then include both <windows.h> and <tchar.h>.

Some header files include a mechanism like the following, so that when one of the two UNICODE macros is defined, the other is automatically defined as well:

#ifdef UNICODE
#ifndef _UNICODE
#define _UNICODE
#endif
#endif

#ifdef _UNICODE
#ifndef UNICODE
#define UNICODE
#endif
#endif

If you are writing a library that utilizes Unicode, it might be worthwhile to include this mechanism in your header files as well, so that other programmers don't need to worry about defining both macros.

The data type "TCHAR" is defined as a char type if UNICODE is not defined, and as a wide character type if UNICODE is defined (tchar.h makes the same choice based on _UNICODE). To make strings portable between Unicode and non-Unicode builds, we can use the TEXT() macro to automatically define a string literal as wide or narrow:

const TCHAR *automessage = TEXT("This message can be either ASCII or UNICODE!");

Using the TCHAR data type and the TEXT macro is an important step in making your code portable between different environments.

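Also, the TEXT macro can be written in the following equivalent forms (the _T and _TEXT spellings come from <tchar.h> and follow the _UNICODE macro):

TEXT("This is a generic string");
_T("This is also a generic string");
_TEXT("This is also a generic string");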
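To show roughly how a generic character type and a generic string macro can work underneath, here is a portable sketch. MYCHAR, MYTEXT, and my_length are hypothetical stand-ins for TCHAR, TEXT, and a generic string function; they are used so the example compiles without the Windows headers, and the real definitions are only assumed to follow the same pattern.

```c
#include <stddef.h>

#ifdef UNICODE
typedef wchar_t MYCHAR;          /* stand-in for TCHAR in a wide build */
#define MYTEXT_(s) L##s          /* paste the L prefix onto the literal */
#else
typedef char MYCHAR;             /* stand-in for TCHAR in a narrow build */
#define MYTEXT_(s) s             /* leave the literal unchanged */
#endif
#define MYTEXT(s) MYTEXT_(s)     /* extra level so macro arguments expand first */

/* Length in MYCHAR units. It works in either build because it never
 * assumes how wide one unit is. */
size_t my_length(const MYCHAR *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Code written this way, for example my_length(MYTEXT("generic")), compiles unchanged whether or not UNICODE is defined, which is the portability that the generic types and macros are meant to provide.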

Unicode characters 0 to 31 (U+0000 to U+001F) are part of the C0 Controls and Basic Latin block. They are all control characters. These characters correspond to the first 32 characters of the ASCII set.