For creating a C++ program that is source-code-level portable between Windows and Linux and that handles internationalization well, there are IMHO three main encodings to consider:

The encoding of the C++ source code.

The encoding of external data.

The encoding(s) of strings and literals.

For the C++ source code there is not really any alternative to UTF-8 with BOM, at least if standard input and wide string literals are to work on the Windows platform. UTF-8 without BOM causes Microsoft's Visual C++ compiler to assume Windows ANSI encoding for the source code, which happens to be nice for UTF-8 output via std::cout, to the limited degree that that works (Windows console windows have lots of bugs here). However, input via std::cin then does not work.

And for the external data UTF-8 seems to be the de facto standard.

However, what about the internal literals and strings? Here I had the impression that narrow strings encoded as UTF-8 were the common convention in Linux. But recently two different people have claimed otherwise: one claiming that the common convention for internal strings in international applications in Linux is UTF-32, and the other just claiming that there is some unspecified difference between Unix and Linux in this area.

As one who fiddles a little, on a hobby basis, with a micro-library intended to abstract away the Windows/Linux differences in this area, I … have to ask concretely

what is the common Linux convention for representing strings in a program?

I am pretty sure that there is a common convention that is so overwhelmingly common that this question has a Real Answer™.

An example showing, e.g., how to reverse a string in the Linux-conventional way (which is complex to do directly with UTF-8, but which presumably is done by functions that are de facto standard in Linux?) would also be nice. That is, as a question: what is a Linux-conventional version of this C++ program (the code as given works with Latin-1 as the C++ narrow text execution character set):
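
(The code block from the original question did not survive in this copy. Below is a minimal sketch of the kind of program meant, assuming Latin-1 as the narrow execution character set; byte-wise std::reverse is only correct here because every Latin-1 character is exactly one byte:)

    #include <algorithm>    // std::reverse
    #include <iostream>
    #include <string>

    int main()
    {
        // "blåbærsyltetøy" spelled with Latin-1 hex escapes; the literal is
        // split into pieces so that the escapes don't swallow the letters
        // that follow them.
        std::string s = "bl\xE5" "b\xE6" "rsyltet\xF8" "y";

        std::reverse( s.begin(), s.end() );     // OK for Latin-1: 1 byte per character.
        std::cout << s << "\n";
    }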

The C++ class std::string has no notion of encoding; it should be understood as a "byte sequence". Reversing a string is not the same as reversing a run of text. As for source code and literals: C++ has a notion of "source encoding", which is opaque and unspecified, and any sane compiler should have a configuration option to specify the source encoding.
– Kerrek SB, Nov 13 '11 at 13:33

I really doubt that there is such a "common convention". Different programs have different needs. Some don't care at all about i18n issues, some are built exclusively for that. Some don't care (much) about memory use by strings, some care a lot. Shouldn't you be using the encoding that makes the most sense for your application?
– Mat, Nov 13 '11 at 13:35

@KerrekSB: re the possibility of compiler options (which you added while I responded to the first 2 sentences): on Windows one does not usually build the compiler oneself. Visual C++ infers the encoding from the file contents, with Windows ANSI as the default, and MinGW g++ blindly assumes UTF-8, but, happily for novices who serve it Windows ANSI source code, it doesn't validate narrow literals... ;-)
– Alf P. Steinbach, Nov 13 '11 at 13:43

@AlfP.Steinbach: GCC has the options -finput-charset and -fexec-charset, so you can always be explicit if you find the default insufficient.
– Kerrek SB, Nov 13 '11 at 13:44

3 Answers

For external representations, UTF-8 is definitely the standard. Some 8-bit encodings are still strong (mostly in Europe) and some 16-bit encodings are still strong (mostly in East Asia), but they are clearly legacy encodings, on their slow way out. UTF-8 is standard not only on unix, but also on the web.

For internal representations, there's no such overwhelming standard. You'll find some UTF-8, some UCS-2, some UTF-16 and some UCS-4 if you look around.

UTF-8 has the advantage that it matches the common external representation, and that it's a superset of ASCII. In particular, it's the only encoding here in which a null character corresponds to a null byte, which is important if you have C APIs around (including unix system calls and standard library functions).
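
A small illustration of the null byte point (a sketch; the wide string is spelled out as raw little-endian UTF-16 bytes):

    #include <cassert>
    #include <cstring>

    int main()
    {
        // UTF-8: no byte of a multi-byte sequence is 0x00, so byte-oriented
        // C APIs see the whole string.
        const char utf8[] = "K\xC3\xB8" "benhavn";              // "København"
        assert( std::strlen( utf8 ) == sizeof( utf8 ) - 1 );    // all 10 bytes visible

        // "Kø" as raw UTF-16LE bytes: the ASCII code unit for 'K' contains
        // a zero byte, which strlen takes as the terminator.
        const char utf16le[] = { 'K', '\0', '\xF8', '\0', '\0', '\0' };
        assert( std::strlen( utf16le ) == 1 );                  // stops after 'K'
    }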

UCS-2 is a historical relic. It was attractive because it was thought to be a fixed-width encoding, but it can't represent all of Unicode, which is a show-stopper.

UTF-16's main claims to fame are Java and the Windows APIs. Unix APIs (which favor UTF-8) are more relevant than Windows APIs if you're programming for unix. Only programs geared towards interaction with APIs that like UTF-16 tend to use UTF-16.

UCS-4 is attractive because it looks like a fixed-width encoding. The thing is, it isn't, really. Because of combining characters, there is no such thing as a fixed-width Unicode encoding.
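
For instance (a C++11 sketch): even in UTF-32, the user-perceived character "é" can occupy either one or two code units, depending on normalization:

    #include <cassert>

    int main()
    {
        // Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE -- one unit.
        const char32_t composed[] = U"\u00E9";
        // Decomposed: U+0065 'e' + U+0301 COMBINING ACUTE ACCENT -- two
        // units for the same user-perceived character.
        const char32_t decomposed[] = U"e\u0301";

        assert( sizeof( composed ) / sizeof( char32_t ) - 1 == 1 );
        assert( sizeof( decomposed ) / sizeof( char32_t ) - 1 == 2 );
    }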

There's also wchar_t. The thing is, that's 2 bytes on some platforms and 4 bytes on others, and the character set that it represents is not specified. With Unicode being the de facto standard character set, newer applications tend to eschew wchar_t.
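
The size difference is trivial to observe (a sketch; typically prints 4 with GCC on Linux and 2 with Visual C++ or MinGW on Windows):

    #include <iostream>

    int main()
    {
        // Portable code cannot assume one particular Unicode encoding
        // behind wchar_t, since even its size varies by platform.
        std::cout << "sizeof(wchar_t) = " << sizeof( wchar_t ) << "\n";
    }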

In the unix world, the argument that trumps them all is usually compatibility with unix APIs, pointing to UTF-8. It's not universal, however, so there's no yes-or-no answer to whether your library needs to support other encodings.

There's no difference between unix variants in that respect. Mac OS X prefers decomposed characters so as to have a normalized representation, so you might want to do that as well: it'll save some work on OS X and won't matter on other unices.

Note that there is no such thing as a BOM in UTF-8. A byte order mark only makes sense for encodings whose code units are larger than one byte. The requirement that UTF-8-encoded files begin with the character U+FEFF is specific to a few Microsoft applications.
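
For what it's worth, the UTF-8 encoding of U+FEFF is the byte sequence EF BB BF, so stripping such a Microsoft-style signature is straightforward (a sketch; the function name is mine):

    #include <string>

    // Remove a leading UTF-8 "BOM" (the UTF-8 encoding of U+FEFF), as
    // written by some Microsoft tools, if present.
    std::string strip_utf8_bom( std::string s )
    {
        if( s.size() >= 3
            && static_cast<unsigned char>( s[0] ) == 0xEF
            && static_cast<unsigned char>( s[1] ) == 0xBB
            && static_cast<unsigned char>( s[2] ) == 0xBF )
        {
            s.erase( 0, 3 );
        }
        return s;
    }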

Thanks, it is about as I expected. I hadn't thought about the null byte/terminator issue before; interesting! In the other direction, one of the "few Microsoft applications" is the Visual C++ compiler, which, still as of version 10.0, uses the BOM to identify UTF-8 as such; and now that g++ no longer chokes on a BOM, it is possible to encode a source file that includes non-ASCII characters such that both compilers can digest it (namely, as UTF-8 with BOM). I agree that "BOM" is an unfortunate term, since it only associates to part of the function, but then, Unicode and terminology... ;-)
– Alf P. Steinbach, Nov 13 '11 at 15:07

The name "BOM" did make more sense before UTF-8 was invented and it was assumed that Unicode == UCS-2.
– user1024, Nov 15 '11 at 9:05

"there is no such thing as a fixed-width Unicode encoding." UTF-32 is a fixed-width Unicode encoding.
– Scooter, Mar 2 '17 at 17:06

@Scooter No, UTF-32 is the same as UCS-4; it's variable-width when you consider combining characters.
– Gilles, Mar 2 '17 at 22:54

@Gilles If you think it is variable-width then you should update Wikipedia, which says: "UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings."
– Scooter, Mar 3 '17 at 6:54

C++ defines an "execution character set" (in fact, two of them, a narrow and a wide one).

When your source file contains something like:

char s[] = "Hello";

Then the numeric byte values of the letters in the string literal are simply looked up according to the execution encoding. (The separate wide execution encoding applies to the numeric values assigned to wide character constants like L'a'.)

All this happens as part of the initial reading of the source code file into the compilation process. Once inside, C++ characters are nothing more than bytes, with no attached semantics. (The type name char must be one of the most grievous misnomers in C-derived languages!)

There is a partial exception in C++11, where the literals u8"", u"" and U"" determine the resulting values of the string elements (i.e. the resulting values are globally unambiguous and platform-independent), but that does not affect how the input source code is interpreted.
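
For instance (a C++11 sketch; in C++20 the element type of u8 literals changes to char8_t, but the values stay the same):

    #include <cassert>

    int main()
    {
        // u8"" literals are guaranteed UTF-8: U+00E6 encodes as 0xC3 0xA6.
        const char u8s[] = u8"\u00E6";
        assert( static_cast<unsigned char>( u8s[0] ) == 0xC3 );
        assert( static_cast<unsigned char>( u8s[1] ) == 0xA6 );

        // u"" literals are guaranteed UTF-16, and U"" literals UTF-32.
        const char16_t u16s[] = u"\u00E6";
        const char32_t u32s[] = U"\u00E6";
        assert( u16s[0] == 0x00E6 );
        assert( u32s[0] == 0x00E6 );
    }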

A good compiler should allow you to specify the source code encoding, so even if your friend on an EBCDIC machine sends you her program text, that shouldn't be a problem. GCC offers the following options:

-finput-charset: input character set, i.e. how the source code file is encoded

-fexec-charset: execution character set, i.e. how to encode string literals

GCC uses iconv() for the conversions, so any encoding supported by iconv() can be used for those options.
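
For example (a sketch; the file name and the Latin-1 choice are arbitrary), a source file saved as UTF-8 can have its narrow literals re-encoded at compile time:

    // utf8source.cpp -- saved as UTF-8. Hypothetical compile line:
    //   g++ -finput-charset=UTF-8 -fexec-charset=ISO-8859-1 utf8source.cpp
    #include <iostream>

    int main()
    {
        // U+00E5 becomes the single byte 0xE5 under a Latin-1 execution
        // charset, but the two bytes 0xC3 0xA5 under a UTF-8 one.
        const char s[] = "bl\u00E5";            // "blå"
        std::cout << sizeof( s ) - 1 << "\n";   // prints 3 (Latin-1) or 4 (UTF-8)
    }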

I wrote previously about some opaque facilities provided by the C++ standard to handle text encodings.

Example: take the above code, char s[] = "Hello";. Suppose the source file is ASCII (i.e. the input encoding is ASCII). Then the compiler reads 99 and interprets it as c, and so on. When it comes to the literal, it reads 72 and interprets it as H. Now it stores, in the array, the byte value of H, which is determined by the execution encoding (again 72 if that is ASCII or UTF-8). When you write \xFF, the compiler reads 92 120 70 70, decodes it as \xFF, and writes 255 into the array.

Again, thanks for this information. However, with both MinGW g++ 4.4.1 in Windows and Ubuntu g++ 4.6.1 in Ubuntu, the static assertion in the code in the question fires even when the execution character set is set to Latin-1 (option -exec-charset:"ISO−8859−1"). This is a bug in the g++ compiler, version 4.6 and earlier (I don't know about 4.7). In short, it's unreliable. But I'm pleased to learn that these options are now accepted and work; I didn't know that. So, thanks.
– Alf P. Steinbach, Nov 13 '11 at 14:15

@AlfP.Steinbach: I cannot reproduce this here. GCC behaves as expected both in Windows (MinGW/GCC 4.5.2) and Linux (GCC 4.6.2). Are you sure your source file has the expected encoding?
– Kerrek SB, Nov 13 '11 at 14:23

The source file is encoded as UTF-8 with BOM. It should not matter, as long as the compiler identifies it correctly. The resulting string, with execution character set Latin-1, should be just 1 character, and it is just 1 character with Visual C++. I pasted the Ubuntu test here: pastebin.com/nJQJDCWV
– Alf P. Steinbach, Nov 13 '11 at 14:39

The option should be -fexec-charset=ISO-8859-1. No colon, and with the leading f.
– Kerrek SB, Nov 13 '11 at 14:41

Ah, thanks, that works both in Ubuntu and Windows. It failed to report that it did not recognize the option (a lesser bug). I just tried various syntaxes till it seemingly accepted it.
– Alf P. Steinbach, Nov 13 '11 at 14:47

one claiming that the common convention for internal strings in international applications in Linux is UTF-32

This is probably a reference to the fact that GCC defines wchar_t as a 32-bit type holding UTF-32 code units, unlike Windows C and C++ compilers, which define wchar_t as 16-bit UTF-16 (for compatibility with the Windows WCHAR type).

You could use wchar_t internally if that's convenient for you. However, it's not as common in the *nix world as in the Windows world, because the POSIX API was never rewritten to use wide characters the way the Windows API was.

Using UTF-8 internally works well for routines that are "encoding-neutral". For example, consider a program to convert tab-separated spreadsheets to CSV. You'd need to treat the ASCII characters \t, ,, and " specially, but any bytes in the non-ASCII range (whether they represent ISO-8859-1 characters or UTF-8 code units) can simply be copied as-is.
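
A sketch of that idea (a hypothetical helper, just to make the point concrete):

    #include <string>

    // Convert one tab-separated line to a CSV line. Only the ASCII bytes
    // '\t' and '"' are treated specially; every other byte (including any
    // multi-byte UTF-8 sequence) is copied through untouched.
    std::string tsv_line_to_csv( const std::string& line )
    {
        std::string out;
        out += '"';
        for( char c : line ) {
            if( c == '\t' )      out += "\",\"";   // end field, start next
            else if( c == '"' )  out += "\"\"";    // escape embedded quote
            else                 out += c;         // encoding-neutral copy
        }
        out += '"';
        return out;
    }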

As one who fiddles a little, on a hobby basis, with a micro-library intended to abstract away the Windows/Linux differences in this area,

One of the many annoyances of writing cross-platform code is that on Windows it's easy to use UTF-16 and hard to use UTF-8, but vice versa on Linux. I've dealt with it by writing functions like this:
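
(The code block from the original answer did not survive in this copy. Below is a minimal sketch of that kind of helper, assuming the usual Win32 conversion calls; the names widen and narrow are mine, not necessarily the answer's:)

    #include <string>
    #ifdef _WIN32
    #include <windows.h>

    // Keep UTF-8 in std::string internally; convert at the Win32 API
    // boundary, which wants UTF-16 in wchar_t.
    std::wstring widen( const std::string& utf8 )
    {
        if( utf8.empty() ) { return std::wstring(); }
        const int n = MultiByteToWideChar(
            CP_UTF8, 0, utf8.data(), static_cast<int>( utf8.size() ), nullptr, 0 );
        std::wstring utf16( n, L'\0' );
        MultiByteToWideChar(
            CP_UTF8, 0, utf8.data(), static_cast<int>( utf8.size() ), &utf16[0], n );
        return utf16;
    }

    std::string narrow( const std::wstring& utf16 )
    {
        if( utf16.empty() ) { return std::string(); }
        const int n = WideCharToMultiByte(
            CP_UTF8, 0, utf16.data(), static_cast<int>( utf16.size() ),
            nullptr, 0, nullptr, nullptr );
        std::string utf8( n, '\0' );
        WideCharToMultiByte(
            CP_UTF8, 0, utf16.data(), static_cast<int>( utf16.size() ),
            &utf8[0], n, nullptr, nullptr );
        return utf8;
    }
    #endif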

Version 2 of Boost.Filesystem supported Unicode filenames also for opening C++-level ifstreams. Version 3 currently supports this only for Microsoft's Visual C++ compiler, which provides extra wide-character-based constructors and open functions. The version 2 workaround for g++, using Windows short names, was not brought over to version 3. However, there is a ticket opened by me on that, and Beman promised to fix it sooner or later (perhaps later, though). Until then, one fix is to implement the g++ workaround oneself, and another is to use the old version 2 of Boost.Filesystem.
– Alf P. Steinbach, Nov 14 '11 at 8:41