Where u8"" creates a UTF-8 encoded string. It is still unavailable in VC++2012.

But in any case the compiler has to know the encoding of the source file, and that is what Microsoft's
pragma directive is for. See this discussion for some enlightenment on the subject - How to use utf8 character arrays in c++?

> But in any case the compiler has to know the encoding of the source file,
> and that is what Microsoft's pragma directive is for.

No, that's not what #pragma execution_character_set is for.

When the compiler reads a UTF-8 encoded source file, it recognizes the encoding from the byte order mark of the file. But the compiler does not generate UTF-8 encoded strings. Instead, when the compiler sees a multi-byte encoded UTF-8 character
inside a string, it converts it according to the ANSI code page (ACP) into a 1-byte character. For a character literal that is okay, because what should the compiler do with 'ö'? It needs to generate a single char value. A multi-byte sequence would not make sense here. In contrast, generating
multi-byte encoded characters inside a string often makes a lot of sense (even if the source file itself might not be UTF-8, but ANSI or UTF-16 encoded).
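One way to check what a given compiler actually put into a string literal is to dump its bytes. The following is only a minimal sketch; hex_bytes is a hypothetical helper name, not something from the compiler or the standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes of a NUL-terminated string as
// space-separated, upper-case hex pairs, e.g. "48 C3 BC".
std::string hex_bytes(const char* s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Calling hex_bytes("Hügel") then shows what the compiler generated: 48 FC 67 65 6C would mean the ü became the single Windows-1252 byte 0xFC, while 48 C3 BC 67 65 6C would mean the literal is UTF-8 encoded. Which of the two you get depends on the compiler, its settings and the encoding of the source file.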

I too need a solution for this problem. Sadly Microsoft still does not support u8"", which would be the perfect solution. To tell you the truth, I am quite angry with Microsoft for removing the feature without providing an alternative solution.

Any file can be supplied as a source to the compiler, with or without a BOM.

Most UTF-8 encoded files I have seen had a BOM.

And yes, I think it would still be nice to have the possibility to define the encoding of the source file. Although I would prefer to define this in the file properties in Visual Studio, not in the source code itself. Because if you put the definition
into the file as a pragma, the compiler must read and analyze the file before it can determine the encoding. For this the compiler has to know if the file is ANSI or UTF-16 anyway ...

But that is not the point. My problem is that I need the compiler to generate UTF-8 strings, no matter if I have to save the source code as ANSI, UTF-8 or UTF-16 for this to work. I could live with any of these encodings, if only the compiler could
generate UTF-8 encoded strings. I have many strings with umlauts, and writing them as a sequence of octal numbers is impractical.

Regarding the source from where I get the information: There is no source. I just tried it. If I am not totally mistaken, then what I described is what the Microsoft compiler actually does.

Thank you for the link. The information it provides is interesting, but does not solve the problem. The problem is not about writing UTF-8 into a stream or onto the console. The problem is to create a string literal that is UTF-8 encoded in the first place.

> Thank you for the link. The information it provides is interesting, but does not solve the problem. The problem is not about writing UTF-8 into a stream or onto the console. The problem is to create a string literal that is UTF-8 encoded in the
> first place.

After reading and understanding that link, you'll know what it takes to create a string literal in a particular encoding - or if it's even possible with MSVC/GCC. It also straightens out other misinformation present in this thread (and in other links presented
in this thread).

Of course I know that I can use hexadecimal or octal notation to write any byte sequence into a string literal. But as I said before: This is impractical.

The project I am working on has many thousands of strings. Many of them include umlauts and sharp s. Finding and changing them would be expensive and error-prone, and it would make the strings completely unreadable.

What I do not get is why the heck Microsoft would remove the #pragma execution_character_set feature. It would have been okay if they had implemented the u8"" feature in Visual Studio 2012. Instead they put a lot of effort into ruining the user interface.
Grey icons ... who comes up with such ideas? I fear what they will do in the next version. Maybe they will come up with white letters on white background...

> Of course I know that I can use hexadecimal or octal notation to write any byte sequence into a string literal. But as I said before: This is impractical.

Well, that's the only reliable, portable way I know of.
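For readers who have not seen the escape notation, here is a small sketch of what it looks like. The identifier kFlaecheUtf8 and the word "Fläche" are made up for illustration. The ä (U+00E4) is the two bytes 0xC3 0xA4 in UTF-8.

```cpp
#include <cstring>

// "Fläche" written out as UTF-8 bytes, independent of the source file encoding.
// The literal is split right after the escape on purpose: a \x escape consumes
// every hex digit that follows it, so "\xA4che" would be parsed as \xA4c
// followed by "he". Adjacent string literals are concatenated by the compiler.
const char kFlaecheUtf8[] = "Fl\xC3\xA4" "che";

// Number of bytes (not characters) in the string, excluding the terminator.
std::size_t flaeche_byte_count() {
    return std::strlen(kFlaecheUtf8);
}
```

This also shows why the approach does not scale to thousands of strings: the text is hard to read, and every escape followed by one of the letters a to f needs the literal to be split.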

> The project I am working on has many thousands of strings.

Do they have to be represented as string literals? A string table resource, or even a plain old text file, could be a more practical alternative. As an added benefit, if you decide one day that you want to ship localized versions of your application, you'd
have all strings in need of translation handily in one place.

The best way I can think of would be to use C++11 UTF-8 string literals: const char* s = u8"Hügel";
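A sketch of how that could look on a compiler that supports u8 literals (which, as said above, VC++2012 does not). The function name huegel_utf8 is hypothetical. The ü is written as the universal character name \u00FC so that the example does not depend on the source file encoding, and the cast is only there because C++20 changed the element type of u8 literals from char to char8_t.

```cpp
#include <cstring>

// Returns "Hügel" as UTF-8 bytes. In C++11 to C++17 the literal already has
// element type char and the cast is a no-op; in C++20 it converts from
// const char8_t* to const char*.
const char* huegel_utf8() {
    return reinterpret_cast<const char*>(u8"H\u00FCgel");
}

// Number of bytes in the encoded string; the ü takes two bytes (0xC3 0xBC).
std::size_t huegel_byte_count() {
    return std::strlen(huegel_utf8());
}
```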

But the pragma did the trick as well.

Regarding the thousands of strings in our project: We are already using string resources for all strings that need to be localized (having thousands of them too). The strings I was talking about must not be localized. They are stored in large arrays. The
array element type is a structure with about 15 members, and only one of them is a string. The array definitions are spread over more than 100 source files. So, in theory we could load the strings from a file into memory, but how would we put the right string
into the right element of the right array? As an alternative we could read each array from a separate file, reading all the struct members from the file. Then again we would have to parse the file and detect all kinds of errors. Currently the compiler tells
us at compile time if we make a syntax error, use the wrong data type or mistype an enumeration value.

PS: Sorry for picking on Microsoft earlier. The development teams at Microsoft mostly do a good job. It's just that when I first read about the VS 2012 user interface, I was surprised that someone could come up with so many obviously bad ideas. Later I was
frustrated that the "designers" mostly ignored the user complaints. It is like buying a car, only to find out on delivery that the windshield is painted black. When you complain, the car salesman only says: "But it looks cool!"

> Regarding the thousands of strings in our project: We are already using string resources for all strings that need to be localized (having thousands of them too). The strings I was talking about must not be localized. They are stored in large arrays. The
> array element type is a structure with about 15 members, and only one of them is a string. The array definitions are spread over more than 100 source files. So, in theory we could load the strings from a file into memory, but how would we put the right string
> into the right element of the right array?

How about something like this. You gather all these strings into a text file, containing lines like:

kHuegel=Hügel

(doesn't have to be this exact way - whatever you find convenient). You then write a tool that consumes this file and generates two source files:

Defining the strings as wide character strings and converting them into UTF-8 either on use or during the startup of the program is a possible way to go. We are still using Visual Studio 2010 for now (mainly because we need to support Windows XP), but we
are considering a move to VS 2012 a few months from now. I am still hoping that Microsoft will provide a solution by then (for example the pragma solution).
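To make the wide string idea concrete, here is a rough sketch of such a conversion written by hand. The name to_utf8 is hypothetical, and on Windows one would more likely call the Win32 function WideCharToMultiByte with CP_UTF8 instead. The sketch assumes valid input and passes unpaired surrogates through without any error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8.
// Works with 16-bit wchar_t (UTF-16, as on Windows) and 32-bit wchar_t.
std::string to_utf8(const std::wstring& w) {
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(w[i]);
        // Combine a UTF-16 surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const unsigned long lo = static_cast<unsigned long>(w[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes, e.g. umlauts
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, an array element could keep its string as a wide literal such as L"H\u00FCgel" and be converted once at startup via to_utf8. The cost is the extra conversion step and a second copy of each string in memory.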

I disagree. UTF-8 and wchar_t strings are not the same. Sure I can convert one into the other, but why should I be forced to do that?

Imagine having the data type double but having no way to write double literals into your source code. You say that is silly? Why? You can always use strings instead and convert them into double values using the atof() function. So there
is no need for double literals, right?

In my opinion, not having UTF-8 literals is very much the same. And by the way: why would Microsoft implement the pragma in VC++2010 in the first place, if there is no need for it?

As I said before: This is not about sending UTF-8 to a stream or the console, or about converting Unicode into UTF-8 (which is quite simple). It is about making the compiler generate UTF-8 encoded strings.