Unicode, MBCS and Generic text mappings

A guide to using generic text functions to make the transition between character sets simple and painless

Introduction

To allow your programs to be used in international
markets it is worth making your application Unicode or MBCS
aware. On Windows, Unicode text is stored as "wide characters"
(2 bytes per character), and the Unicode character set aims to
cover the characters of every written language, including
technical symbols and special publishing characters. A multibyte
character set (MBCS) uses either 1 or 2 bytes per character and
is used for character sets that contain large numbers of
different characters (e.g. Asian language character sets).

Which character set you use depends on the language and the
operating system. Unicode requires more space than MBCS since
each character is 2 bytes. On Windows NT it is also faster than
MBCS: NT uses Unicode as standard, so non-Unicode strings passed
to and from the operating system must be translated, incurring
overhead. However, Unicode is not supported on Windows 95, so
MBCS may be a better choice in that situation. Note that if you
wish to develop applications for the Windows CE environment then
all applications must be compiled in Unicode.

Using MBCS or Unicode

The best way to use Unicode or MBCS - or indeed even ASCII -
in your programs is to use the generic text mapping macros
provided by Visual C++. That way you can simply use a single
define to swap between Unicode, MBCS and ASCII without having to
do any recoding.

To use MBCS or Unicode you need only define either _MBCS
or _UNICODE in your project. For Unicode you
will also need to specify the entry point symbol in your Project
settings as wWinMainCRTStartup. Please note that
if both _MBCS and _UNICODE are
defined then the result will be unpredictable.
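
One way to guard against defining both symbols by accident is a compile-time check. The following is only a sketch: the function name compiled_character_set is invented for illustration and is not part of Visual C++ or MFC.

```cpp
#include <cassert>
#include <string>

// Refuse to build if both symbols are defined, since the article warns
// that the result of defining both is unpredictable.
#if defined(_UNICODE) && defined(_MBCS)
#error "Define only one of _UNICODE or _MBCS"
#endif

// Hypothetical helper: reports which character set this translation
// unit was compiled for, based on the preprocessor symbols above.
std::string compiled_character_set() {
#if defined(_UNICODE)
    return "Unicode";
#elif defined(_MBCS)
    return "MBCS";
#else
    return "SBCS";
#endif
}
```

Placing the check in a header that every source file includes means a misconfigured project fails at compile time rather than misbehaving at run time.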

Generic Text mappings and portable functions

The generic text mappings replace the standard char or LPSTR
types with generic TCHAR or LPTSTR macros. These macros will map
to different types and functions depending on whether you have
compiled with Unicode or MBCS (or neither) defined. The simplest
way to use the TCHAR type is to use the CString
class - it is extremely flexible and does most of the work for
you.

In conjunction with the generic character type, there is a set
of generic string manipulation functions prefixed by _tcs.
For instance, instead of using the strrev
function in your code, you should use the _tcsrev
function which will map to the correct function depending on
which character set you have compiled for. The table below
demonstrates:

#define | Compiled version | Example
_UNICODE | Unicode (wide-character) | _tcsrev maps to _wcsrev
_MBCS | Multibyte-character | _tcsrev maps to _mbsrev
None (the default: neither _UNICODE nor _MBCS defined) | SBCS (ASCII) | _tcsrev maps to strrev

Each str* function has a corresponding _tcs*
function that should be used instead. See the TCHAR.H file for
all the mappings and macros that are available. Just look up the
online help for the string function in question in order to find
the equivalent portable function.
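
To make the mechanism concrete, here is a much simplified sketch of the idea behind such mappings. It is not the contents of TCHAR.H: the names SKETCH_UNICODE, sketch_tchar, sketch_tcslen and SKETCH_T are invented stand-ins for _UNICODE, TCHAR, _tcslen and _T, and only the standard strlen and wcslen functions are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// SKETCH_UNICODE is a hypothetical switch standing in for _UNICODE.
#ifdef SKETCH_UNICODE
typedef wchar_t sketch_tchar;          // generic character becomes wide
#define sketch_tcslen std::wcslen      // generic function maps to wcs*
#define SKETCH_T(x) L##x               // literals gain an L prefix
#else
typedef char sketch_tchar;             // generic character stays narrow
#define sketch_tcslen std::strlen      // generic function maps to str*
#define SKETCH_T(x) x                  // literals are left unchanged
#endif

// The calling code is written once, against the generic names only.
std::size_t greeting_length() {
    const sketch_tchar* greeting = SKETCH_T("Hello");
    return sketch_tcslen(greeting);
}
```

Whichever way the switch is set, greeting_length compiles unchanged and counts five characters, which is the point of writing against the generic names.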

Note: Do not use the str*
family of functions with Unicode strings, since Unicode strings
are likely to contain embedded null bytes.
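
The effect of those embedded null bytes can be seen with a small experiment. This is a sketch in standard C++ with invented function names; it inspects the storage of a wide string through a char pointer, which is what happens in effect when a wide string is handed to a str* function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Correct: count wide characters with the wide-character function.
std::size_t wide_length(const wchar_t* wide) {
    assert(wide != nullptr);
    return std::wcslen(wide);
}

// Incorrect on purpose: view the same storage as bytes and use strlen.
// For text such as L"Hello", each wide character contains at least one
// zero byte, so strlen stops inside the very first character.
std::size_t narrow_length_of_wide(const wchar_t* wide) {
    assert(wide != nullptr);
    const char* bytes = reinterpret_cast<const char*>(wide);
    return std::strlen(bytes);
}
```

For L"Hello", wide_length reports 5, while narrow_length_of_wide reports a much smaller number whose exact value depends on the byte order of the machine.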

The next important point is that each literal string should be
enclosed by the TEXT() (or _T())
macro. This macro prepends an "L" to literal
strings if the project is being compiled in Unicode, or does
nothing if MBCS or ASCII is being used. For instance, the string
_T("Hello") will be interpreted as "Hello" in
MBCS or ASCII, and L"Hello" in Unicode. If you are
working in Unicode and do not use the _T()
macro, you may get compiler errors or warnings, because a plain
"..." literal is still a char string.

Note that you can use ASCII and Unicode within the same
program, but not within the same string.

All MFC functions except for database class member functions
are Unicode aware. This is because many database drivers themselves
do not handle Unicode, and so there was no point in writing Unicode
aware MFC classes to wrap these drivers.

Converting between Generic types and ASCII

ATL provides a set of very useful macros for
converting between different character formats. The basic form of
these macros is X2Y(), where X is the source
format and Y is the destination format. Possible conversion
formats are shown in the following table.

Never use the conversion macros inside a tight loop. This
will cause a lot of memory to be allocated each time the
conversion is performed, and will result in slow code.
Better to perform the conversion outside the loop and
pass the converted value into the loop.

Never return the result of the macros directly from a
function, unless the return value implies making a copy
of the data before returning. For instance, if you have a
function that returns an LPOLESTR, then do not do the
following:
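
As a hypothetical stand-in for both rules, the sketch below uses standard C++ (std::wcstombs) rather than the ATL macros, and the helper names to_narrow and count_matches are invented for illustration. It assumes, as in the ATL versions I am aware of, that the macros place the converted string in temporary storage that does not outlive the calling function. The pattern to avoid appears only as a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Pattern to avoid (comment only): returning a pointer to a temporary
// conversion buffer, e.g. "return W2A(text);", leaves the caller with a
// pointer to storage that is gone once the function returns.

// Safer: return the converted text BY VALUE, so the caller owns a copy.
std::string to_narrow(const wchar_t* wide) {
    assert(wide != nullptr);
    // Worst case: every wide character needs MB_CUR_MAX bytes.
    std::vector<char> buffer(std::wcslen(wide) * MB_CUR_MAX + 1);
    std::size_t written = std::wcstombs(buffer.data(), wide, buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::string();  // text not representable in this locale
    }
    return std::string(buffer.data(), written);
}

// Convert once, before the loop, and reuse the result inside it.
std::size_t count_matches(const std::vector<std::string>& lines,
                          const wchar_t* wide_key) {
    const std::string key = to_narrow(wide_key);
    std::size_t matches = 0;
    for (const std::string& line : lines) {
        if (line == key) {
            ++matches;
        }
    }
    return matches;
}
```

The conversion cost is paid once per call to count_matches instead of once per line, and nothing returned to the caller points at temporary storage.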

Tips and Traps

The TRACE statement

The TRACE macro has a few cousins - namely
the TRACE0, TRACE1, TRACE2
and TRACE3 macros. These macros allow you to
specify a format string (as in the normal TRACE
macro), and either 0, 1, 2 or 3 parameters, without the need to
enclose your literal format string in the _T()
macro. For instance,

TRACE(_T("This is trace statement number %d\n"), 1);

can be written

TRACE1("This is trace statement number %d\n", 1);

Viewing Unicode strings in the debugger

If you are using Unicode in your application and wish to view Unicode strings
in the debugger, then you will need to go to Tools | Options | Debug and click
on "Display Unicode Strings".

The Length of strings

Be careful when performing operations that depend on the size
or length of a string. For instance, CString::GetLength
returns the number of characters in a string, NOT the size in
bytes. If you were to write the string to a CArchive
object, then you would need to multiply the length of the string
by the size of each character in the string to get the number of
bytes to write:
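
A minimal sketch of that calculation follows, using std::basic_string as a portable stand-in for CString; the function name byte_count is invented for illustration. With CString the equivalent expression would be GetLength() * sizeof(TCHAR).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes occupied by the characters of a string (excluding any
// terminator): character count multiplied by the size of one character.
template <typename CharT>
std::size_t byte_count(const std::basic_string<CharT>& text) {
    return text.length() * sizeof(CharT);
}
```

For a narrow string the two numbers coincide, which is exactly why code that passes the length where a byte count is expected appears to work until it is compiled for Unicode.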

Reading and Writing ASCII text files

If you are using Unicode or MBCS then you need to be careful
when writing ASCII files. The safest and easiest way to write
text files is to use the CStdioFile class
provided with MFC. Just use the CString class
and the ReadString and WriteString member
functions and nothing should go wrong. However, if you need to
use the CFile class and its associated Read
and Write functions, then if you use the following code:

then the results will be significantly different. The two lines of
text below are from a file created using the first and second code snippets
respectively:

(This text was viewed using WordPad)

Not all structures use the generic text mappings

For instance, the CHARFORMAT structure, if the RichEdit control
version is less than 2.0, uses a char[] for the szFaceName field,
instead of a TCHAR[] as would be expected. You must be careful not
to blindly change "..." to _T("...") without
first checking. In this case, you would probably need to convert
from TCHAR to char before copying any data to the szFaceName
field.
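
The conversion step might look something like the following sketch. It uses the standard std::wcstombs function rather than a Windows API, and SketchCharFormat is a hypothetical stand-in for the real structure (the size of 32 mirrors LF_FACESIZE); set_face_name is likewise an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a structure whose text field is always char,
// regardless of whether the project is compiled for Unicode.
struct SketchCharFormat {
    char szFaceName[32];
};

// Convert a wide face name into the fixed char field. Names that are too
// long are truncated; the field is always left null-terminated.
bool set_face_name(SketchCharFormat& format, const wchar_t* face) {
    assert(face != nullptr);
    const std::size_t capacity = sizeof(format.szFaceName);
    // Leave room for the terminator, which wcstombs omits when full.
    std::size_t written =
        std::wcstombs(format.szFaceName, face, capacity - 1);
    if (written == static_cast<std::size_t>(-1)) {
        format.szFaceName[0] = '\0';  // name not representable as char
        return false;
    }
    format.szFaceName[written] = '\0';
    return true;
}
```

Reserving the last byte for the terminator matters because a conversion that exactly fills the buffer would otherwise leave the field unterminated.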

Copying text to the Clipboard

This is one area where you may need to use ASCII and Unicode
in the same program, since the CF_TEXT format for the clipboard
uses ASCII only. NT systems also offer the CF_UNICODETEXT format
if you wish to use Unicode on the clipboard.

Installing the Unicode MFC libraries

The Unicode versions of the MFC libraries are
not copied to your hard drive unless you select them during a
Custom installation. They are not copied during other types of
installation. If you attempt to build or run an MFC Unicode
application without the MFC Unicode files, you may get errors.


About the Author

Chris is the Co-founder, Administrator, Architect, Chief Editor and Shameless Hack who wrote and runs The Code Project. He's been programming since 1988 while pretending to be, in various guises, an astrophysicist, mathematician, physicist, hydrologist, geomorphologist, defence intelligence researcher and then, when all that got a bit rough on the nerves, a web developer. He is a Microsoft Visual C++ MVP both globally and for Canada locally.

His programming experience includes C/C++, C#, SQL, MFC, ASP, ASP.NET, and far, far too much FORTRAN. He has worked on PocketPCs, AIX mainframes, Sun workstations, and a CRAY YMP C90 behemoth but finds notebooks take up less desk space.

He dodges, he weaves, and he never gets enough sleep. He is kind to small animals.

Chris was born and bred in Australia but splits his time between Toronto and Melbourne, depending on the weather. For relaxation he is into road cycling, snowboarding, rock climbing, and storm chasing.

I have a problem with copying Unicode to the clipboard. The data will be pasted in the same application (if I'm able to copy it!). The data source is just text in different languages like German, English, Russian, Hungarian, etc.

If you want to copy multilingual data to the clipboard, also set the locale of the clipboard, meaning you have to set both CF_TEXT and CF_LOCALE for the clipboard; otherwise you can copy the data to the clipboard in CF_RTF format.

I am trying to translate just the interface of my application into Chinese.
When I replace menu items in my rc file with Chinese it works great, even though I can't edit my resources in VC++ 6.0.
I intend to do the same thing with dialog box strings, replacing them in the rc file and showing them as before with a dialog.DoModal(), but it is not working (I just see a lot of ????????).
How can I do such a thing?

I have created an MFC application using VC++ ver 6.0 under WindowsNT 4.0 environment.

In that application I want to display Unicode strings in the view.

I expect the display to be "ψello, World" in the view, but it is displayed as "▐ello, World".

'ψ' is a Greek character.

I need support for the above problem.

I have read the article "Unicode, MBCS and Generic text mappings" By Chris Maunder(Platinum. Member No. 1) of this codeproject site.

As advised in the article, I have:
1. Defined _UNICODE in Project->Settings->C/C++ tab under preprocessor definitions.
2. Specified wWinMainCRTStartup in Project->Settings->Link under Entry-point symbol.
3. Removed the _MBCS definition in Project->Settings->C/C++ tab under preprocessor definitions.
4. Further, during installation of VC++ 6.0, selected both "Static Library for Unicode" and "Shared Library for Unicode."

Hi
I'm confused by the behavior the edit box shows. While I've not defined
_UNICODE or _MBCS (I've been forced not to use them), typing in the edit box in Arabic shows strange
characters. If I just copy and paste those characters into Notepad, the
characters look correct (Arabic).
What makes me more confused is that with the same application on another
computer you can type Arabic letters in the edit box.
1) What should I do so that my program acts normally on all computers?
2) What are the probable different settings on those 2 computers which make
my application behave differently (both run WinXP)?
3) What are these strange characters?
Please contact me on the subject at (smalsa_ae@yahoo.com)
Thanks in advance

Hi, you could try changing the font of your dialogs from "MS Sans Serif" into "Microsoft Sans Serif", as the former doesn't support unicode chars but the latter does. Just open your .RC file in Text-Editing mode and do the replacement (you can also select another font, but Sans Serif is the Default.) Your currently selected font could also be "MS Shell Dlg", which also doesn't support Unicode chars.
The strange characters you see are the default mapping for undefined characters, which normally map to a black box or a question mark.

I don't know why the program behaves differently on the other computer, maybe you have a different locale setting there or overridden the default dialog box font.

Recently I created a Unicode project with the project entry point symbol assigned as wWinMainCRTStartup. The project built successfully, but when I try to run the .exe file, an error is shown with the message:

The instruction at "0x5f8336bb" referenced memory at "0x00000000". The memory could not be "read".

I suspected that the problem is related with the wWinMainCRTStartup entry point since this option sets the starting address for an .exe file.

"wWinMainCRTStartup" works well in my .exe project, but when I use a .dll project with the Unicode setting from my exe file, I also get the error:
"x5f8336bb" referenced memory at "0x00000000". The memory could not be "read".

If I do not use that DLL, the .exe file runs well.
I don't know why. Thank you very much for any reply.

hello need help with code below(comes from CMailMan Class in codeguru)
MapiMessage Message;
Message.lpszSubject = (LPTSTR)((LPCTSTR)Subject);
Message.lpszNoteText =(LPTSTR)((LPCTSTR) Text) ;
"Subject" and "Text" are strings passed to my function as its parameters.
The error reported is "error C2440: '=' : cannot convert from 'unsigned short *' to 'char *' "
I think the problem starts from "Message.lpszSubject", since in the documentation it's an LPTSTR while here it's a char* (when my mouse hovers over it).
I've defined Unicode and expect to be able to send (and retrieve) messages in Arabic/Persian.
Meanwhile, I would appreciate it if you could supply me with a link to a class or library which can help me retrieve my e-mails (supporting Arabic/Persian text).
Thanks in advance for your attention

Very useful article... just one comment. You can use CStdioFile's ReadString and WriteString if the file you intend to read or write is NOT Unicode.
If you need to deal with Unicode text files, you should use David Pritchard's very useful class CStdioFileEx.

> The Unicode character set is a "wide character" (2 bytes
> per character) set that contains every character available
> in every language, including all technical symbols and
> special publishing characters.

This isn't really true now, although it may have been true, briefly, 10 years ago ?

The Unicode character set is a set of some 100,000+ characters. Obviously these cannot be coded into a unique set of two-byte code assignments. But, an old set of them can be, and that is the one that was used in NT, and which Windows has not really advanced beyond.

However, on modern operating systems besides Windows, wchar_t is a 4 byte type, because all the 100,000+ Unicode characters can easily be coded as unique 4 byte assignments.

I want to read a Unicode text file with ASP but I can't. After reading it, I write it into another file, and I only see many strange characters. Please help me!
For example:
Ra mắt trang Web Đại Gia Đình
Ra máº¯t trang Web Äáº¡i Gia ÄÃ¬nh

Please send to me an email through tran_nam_thanh@yahoo.com if possible. Thanks in advance

Hi,
I entered wWinMainCRTStartup as in the tutorial, and in Project/Settings/
C/C++, category General, I changed _MBCS to _UNICODE. It compiles without error, but when I want to start the app, Windows displays a message that it can't run programs with Unicode on Win 3.11 and 95 (I use Win98), and then displays a message that MFC42UD.dll can't start.
I would be grateful if you could send me an example VC++ 6 Unicode project that runs under Win98.
Thanks.

Are there any C++ localization development tools out there? I'm interested in converting my C++ source so that it supports Unicode. Unfortunately, there are a lot of instances in my code that will not play nicely with Unicode (like the examples in the article). Is there a tool that can help me with this huge task? Thanks in advance.