I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:
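Something along these lines (a rough sketch only; I'm using `char32_t` for the 32-bit type per the C++0x draft, and the wrapper would need to forward more of the std::string interface to be usable):

```cpp
#include <cstddef>
#include <string>

// Sketch only: a distinct type as a reminder that the bytes are
// UTF-8. A usable version would forward (or wrap) more of the
// std::string interface.
class string8 {
public:
    string8() {}
    explicit string8(const std::string& utf8_bytes) : bytes_(utf8_bytes) {}
    const std::string& bytes() const { return bytes_; }
    std::size_t byte_size() const { return bytes_.size(); }
private:
    std::string bytes_;  // assumed to hold valid UTF-8
};

// Fixed-width code points, for when O(1) indexing is needed.
typedef std::basic_string<char32_t> string32;
```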

The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.

The string32 class would be used for UTF-32 when a fixed character size is desired.

The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
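To illustrate what those conversions do, here is a minimal hand-rolled version (for illustration only: unlike the UTF-8 CPP functions, this sketch does no validation of ill-formed input):

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into 32-bit code points. No error handling.
std::basic_string<char32_t> to_utf32(const std::string& u8) {
    std::basic_string<char32_t> out;
    for (std::size_t i = 0; i < u8.size();) {
        unsigned char b = static_cast<unsigned char>(u8[i]);
        char32_t cp;
        int len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte sequence
        else               { cp = b & 0x07; len = 4; }  // 4-byte sequence
        for (int k = 1; k < len; ++k)  // fold in the continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(u8[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode 32-bit code points back into UTF-8 bytes.
std::string to_utf8(const std::basic_string<char32_t>& u32) {
    std::string out;
    for (std::size_t i = 0; i < u32.size(); ++i) {
        char32_t cp = u32[i];
        if (cp < 0x80) {
            out += char(cp);
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```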

@Steve: I should have mentioned that platform independence is often a requirement for my work, and I see that wchar_t size (and therefore wstring) depends on the implementation. Also, I would like to support the full range of Unicode characters, and UTF-32 is the best way I know of to do that in a fixed-length encoding. It does take up a lot of space, but I'm thinking that strings could be stored in UTF-8 most of the time.
–
nassar Oct 16 '10 at 21:05


It looks like C++0x will define u32string as basic_string<char32_t>, and char32_t appears to be equivalent to uint32_t (looking at the gcc/g++ header files). So I should probably call these u8string and u32string, and define the latter using char32_t.
–
nassar Oct 17 '10 at 1:59

3 Answers

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.

The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

Furthermore, encoding is easy (it's a simple mapping from code points to bytes and back), while the main difficulty is actually manipulating the data.

With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
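To make the slicing problem concrete, here is a small sketch ('é' is the two-byte UTF-8 sequence 0xC3 0xA9):

```cpp
#include <string>

// "café" in UTF-8 is five bytes: 'c' 'a' 'f' 0xC3 0xA9.
inline std::string cafe_utf8() { return std::string("caf\xC3\xA9"); }

// Byte-oriented substr() happily cuts between 0xC3 and 0xA9,
// leaving an ill-formed UTF-8 string that ends on a lone lead byte.
inline std::string cut_mid_character() { return cafe_utf8().substr(0, 4); }
```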

The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.

If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

I found your comment about diacritics a bit scary. It is in a sense most relevant to what I am trying to do, which is to handle strings "correctly" in a relatively simple way.
–
nassar Oct 19 '10 at 2:13

ICU has (among other interfaces in C++) a C++ string class which interoperates with std::string
–
Steven R. Loomis Oct 20 '10 at 5:00

@Steven: icu-project.org/apiref/icu4c/classUnicodeString.html which I consider C-ish in its interface (lots of interaction with unmanaged memory, uses of int32_t where unsigned would be better suited, ...) though as you mention, thanks to StringPiece it can be created quite smoothly from a std::string.
–
Matthieu M. Oct 20 '10 at 6:16

It's not specified what character encoding must be used for string, wstring, etc. The common way is to use Unicode in wide strings. Which types and encodings you should use depends on your requirements.

If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type; just use std::string). If you must work with strings (extract, concat, sort, ...), choose std::wstring, with UCS-2/UTF-16 (BMP only) as the encoding on Windows and UCS-4/UTF-32 on Linux.
The benefit is the fixed size: each character takes 2 bytes (or 4 for UCS-4), whereas with UTF-8 in a std::string, length() reports bytes rather than characters.
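For example (using explicit byte escapes so the result does not depend on the source file's encoding):

```cpp
#include <cstddef>
#include <string>

// "Grüße" is 5 characters, but its UTF-8 encoding is 7 bytes
// ('ü' and 'ß' are two bytes each), and length() counts bytes.
inline std::string gruesse_utf8() {
    return std::string("Gr\xC3\xBC\xC3\x9F" "e");
}

// Counting code points means skipping continuation bytes
// (those of the form 10xxxxxx).
inline std::size_t code_point_count(const std::string& u8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < u8.size(); ++i)
        if ((static_cast<unsigned char>(u8[i]) & 0xC0) != 0x80) ++n;
    return n;
}
```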

For conversion, you can check sizeof(std::wstring::value_type) == 2 or 4 to choose between UCS-2 and UCS-4. I'm using the ICU library, but there may be simpler wrapper libs.
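The check is just something like this (a sketch that only distinguishes the two common cases):

```cpp
#include <string>

// wchar_t is 2 bytes on Windows (UCS-2/UTF-16) and 4 bytes on
// most Unix systems (UCS-4/UTF-32).
inline const char* wide_encoding() {
    return sizeof(std::wstring::value_type) == 2 ? "UCS-2/UTF-16"
                                                 : "UCS-4/UTF-32";
}
```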

Deriving from std::string is not recommended, because basic_string is not designed for inheritance (it lacks virtual members, etc.). If you really, really need your own type like std::basic_string<my_char_type>, write a custom specialization for it.

The new C++0x standard defines wstring_convert<> and wbuffer_convert<>, which convert with a std::codecvt between a narrow charset and a wide charset (for example UTF-8 to UCS-2).
Visual Studio 2010 has already implemented this, AFAIK.
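For example, a round-trip between UTF-8 and UTF-32 could look like this (a sketch using the codecvt_utf8 facet from the C++0x draft; compiler support was still incomplete at the time, and these facilities were later deprecated in C++17):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes (std::string) -> UTF-32 code points (std::u32string).
inline std::u32string utf8_to_utf32(const std::string& u8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(u8);
}

// UTF-32 code points -> UTF-8 bytes.
inline std::string utf32_to_utf8(const std::u32string& u32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(u32);
}
```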

I've intentionally avoided UCS-2, because it seems to me that if one is going to the trouble of handling character encoding, one might as well do it right and support full Unicode. (At the same time, I'm looking for something less cumbersome than ICU for general purpose use.) As for UTF-16, it seems to have the disadvantages of both a variable-length encoding and high memory use. That is why I propose using UTF-8 and UTF-32 in combination.
–
nassar Oct 16 '10 at 23:00

Point taken about deriving from std::string. Thanks!
–
nassar Oct 16 '10 at 23:14


I think defining a new type is not at all essential, but a lot of people seeing std::string in code will tend to forget about multi-byte characters and incorrectly use character positions. The fact that it is UTF-8 can be conveyed in comments, but having a reminder in the type name seems helpful because methods such as std::string::insert() do suggest 8-bit characters in my opinion.
–
nassar Oct 16 '10 at 23:33

I just read that C++0x will define u32string as basic_string<char32_t>. So this should be good for UTF-32.
–
nassar Oct 17 '10 at 1:24


For completeness: if you only need to convert between the different UTFs and you already use C++0x features, there are a few new codecvts for that, for example codecvt<char16_t, char, mbstate_t> and codecvt<char32_t, char, mbstate_t>, which convert char (UTF-8) to UTF-16/32. Together with std::wstring_convert and std::wbuffer_convert, you can easily convert between UTFs without any additional library. If you need to convert other charsets, you can write your own codecvts using iconv() on Linux and MultiByteToWideChar() & Co. on Windows.
–
cytrinox Oct 17 '10 at 9:46