
Yet another Unicode v Ansi question

Working in a Win32 console app (VS 2010), I have been trying to convert several Unicode (UTF-16) C++ functions to ANSI C (UTF-8). The test app includes two tokenizer classes, CTokA and CTokW (UTF-8 and UTF-16), each of which works perfectly well in its respective environment.

A problem arises when I attempt to run the UTF-8 functions while the Character Set property is set to 'Use Unicode Character Set', in that std::string manipulations do not perform as expected, e.g.,

printf("start\n");

gets reproduced as

printf("start\n");═══════════²²²²

Attempting to null-terminate the string where it is supposed to end simply results in a space at that position, and the garbage ending persists, e.g.,

printf("sta t\n");═══════════²²²²

Code:

sline[11] = 0x0000;

If I attempt to change the Character Set property to 'Use Multibyte Character Set' or 'Not Set', the app will not compile and hundreds of errors occur. Of course, I can eliminate all of the UTF-16 code, but it strikes me that it should not be necessary. Perhaps if M$ made everything UTF-16 without all of the necessary decorations like 'L' and '_T(', life would be much simpler. Unfortunately, I have a very extensive UTF-8 app under 10 years of development that works quite well, but my UTF-16 (Unicode) conversion doesn't work as well because of the mixing of pointers (I think), so I have had to revert much of the code back to UTF-8. (All of which has nothing to do with my question but is simply psychotherapeutic for me to ventilate on.)

My question is this: Can UTF-8 and UTF-16 code coexist in a single Win32 console app?

Re: Yet another Unicode v Ansi question

>> Can UTF-8 and UTF-16 code coexist in a single Win32 console app?
Neither the ANSI Windows API nor the MS CRT supports UTF-8. That means Win32 API functions ending in 'A' do not support UTF-8, and the MS C standard library's locale implementation does not support UTF-8 either.

Re: Yet another Unicode v Ansi question

each of which work perfectly well in their respective environments

I believe this is the key point. The application domain internally uses whatever you want, until you have to interface with the Win32 API domain. Any time you need to pass a string across the domain boundary, you must comply with that domain's specifics, converting the string when needed, either explicitly or by means of a third-party wrapper.

when the Character Set properties is set to 'Use Unicode Character Set' in that std::string manipulations do not perform as expected

Sorry, I don't follow you here. std::string is always an ANSI string, and std::wstring is always a wide-character string, regardless of the Character Set setting.

Besides, the 'Use Unicode Character Set' setting literally means: use wchar_t characters (which are UTF-16LE on Windows) when expanding the T-family macros. In fact it does nothing but define the UNICODE and _UNICODE macros project-wide. By no means does it imply any magic that suddenly turns ANSI strings into UTF-8 strings.

Re: Yet another Unicode v Ansi question

std::string is always an ANSI string, and std::wstring is always a wide-character string, regardless of the Character Set setting.

That's what I thought. But take a look at the demo I've attached.

This demo shows that when the Character Set property is set to Use Unicode Character Set, the app compiles and runs OK with both the CTokW and CTokA classes included in the build. But when the Character Set property is set to Use Multibyte Character Set or Not Set, the program will not compile, with many errors entirely attributable to the wchar_t elements of CTokW. This is despite the fact that the CTokW class is never called in the program, and the appropriate _T("") macro is used throughout CTokW. When the Character Set is Multibyte and CTokW is excluded from the build, all works as it should.

Re: Yet another Unicode v Ansi question

Originally Posted by Mike Pliam

Thanks for your input.

That's what I thought. But take a look at the demo I've attached.

No need to.

The simple reason is that std::string and std::wstring are template instantiations, i.e. what they can do is fixed at compile time. It isn't a runtime issue, so there is no way std::string or std::wstring can behave differently. Take a look at what std::string is:

Code:

namespace std {
    typedef basic_string<char, char_traits<char>, allocator<char> > string;
}

The definition is something similar to this. Note that you cannot change the behaviour of std::string, since the char_traits template class is a compile-time construct that defines how std::string behaves (this is called policy-based design in C++: the behaviour of a generic class is set at compile time by giving it a policy, in this case the char_traits template class). std::wstring is the same thing, except its character traits are based on wchar_t.

So whatever you've done hasn't changed, and cannot change, the behaviour of std::string or std::wstring. It is impossible to do so unless you change the source code and rebuild the runtime library.

Re: Yet another Unicode v Ansi question

Originally Posted by Mike Pliam

when Character Set property is set to Use Multibyte Character Set or Not Set, the program will not compile with many errors entirely attributable to the wchar_t elements of CTokW. This despite the fact that the CTokW class is never called in the program, and the appropriate _T("") macro is used throughout CTokW. When the Character Set is Multibyte and CTokW is excluded from the build, all works as it should.

This is where your design leaks.

The only intended use of the T macros is this: you need your code to be compilable no matter which Character Set setting is in use, while having your string types mutate at compile time depending on the setting's value.

This is what happens with TCHAR. The same type name is actually an alias for CHAR or WCHAR, depending on the Character Set setting. The same code base can thus build to two binary representations.

But your case is sort of the opposite. You need your code to be compilable no matter which Character Set setting is in use, with your string types immutable at compile time: the code must always compile to the same binary representation. And you cannot get that immutability from mutable types.

In other words, your CTokW has to get rid of T macros of any kind and use explicit wide-character types and L"" literals.

Or you keep those T macros in your code and never try to switch to MBCS again.
