Generally, it's best to throw objects,
not built-ins. If possible, you should
throw instances of classes that derive
(ultimately) from the std::exception
class. By making your exception class
inherit (ultimately) from the standard
exception base-class, you are making
life easier for your users (they have
the option of catching most things via
std::exception), plus you are probably
providing them with more information
(such as the fact that your particular
exception might be a refinement of
std::runtime_error or whatever).

But in the face of Unicode, it seems to be impossible to design an exception hierarchy that achieves both of the following:

Derives ultimately from std::exception for ease of use at the catch site

Provides Unicode compatibility so that diagnostics are not sliced or gibberish

Coming up with an exception class that can be constructed with Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called (if the source string uses characters not representable by 7-bit ASCII), it might be impossible to format the message without loss of fidelity.

How do you design an exception hierarchy that combines the seamless integration of a std::exception-derived class with lossless Unicode diagnostics?

No big deal, just use an encoding which uses bytes. IMO the bigger problem with std::exception is that derived classes derive non-virtually from it. Because of that, there's no way you could derive from both your own base class (itself derived from std::exception) and, say, std::out_of_range.
–
sbi Sep 21 '10 at 14:19
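
A minimal sketch of the problem described in the comment above. The class names AppError and AppOutOfRange are hypothetical and exist only for illustration. The double derivation itself compiles, but the resulting class contains two separate std::exception subobjects, so std::exception becomes an ambiguous base and a catch (const std::exception&) handler is not expected to match it.

```cpp
#include <exception>
#include <stdexcept>
#include <type_traits>

// Hypothetical application base class, derived non-virtually from std::exception.
struct AppError : std::exception {
    const char* what() const noexcept override { return "AppError"; }
};

// Hypothetical class that tries to be both an AppError and a std::out_of_range.
// It contains two separate std::exception subobjects.
struct AppOutOfRange : AppError, std::out_of_range {
    AppOutOfRange() : std::out_of_range("index out of range") {}
    // what() would otherwise be ambiguous, so pick one base explicitly.
    const char* what() const noexcept override { return std::out_of_range::what(); }
};

// The pointer conversion is ill-formed because the base is ambiguous.
static_assert(!std::is_convertible<AppOutOfRange*, std::exception*>::value,
              "std::exception is an ambiguous base of AppOutOfRange");

// Reports whether a thrown AppOutOfRange matches a std::exception handler.
bool caught_as_std_exception() {
    try {
        throw AppOutOfRange();
    } catch (const std::exception&) {
        return true;
    } catch (...) {
        // A handler only matches an unambiguous public base class,
        // so the exception is expected to end up here.
        return false;
    }
}
```

Catching by AppError& or std::out_of_range& still works, since each of those is an unambiguous base.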

@sbi: True, but I dodge this problem by defining my hierarchy only in terms of std::exception directly. I throw my own std::exception-derived exceptions and leave the other Standard-defined exceptions to the standard library. Not an ideal solution, to be sure, but for my uses it is the best possible solution given the current state of the Standard.
–
John Dibling Sep 21 '10 at 14:30

+1: There is a common misunderstanding about encoding.
–
ereOn Sep 21 '10 at 13:40

The additional benefit with going the UTF-8 path is that STL et al exception text strings already are valid UTF-8. The problem is that it's a bit cumbersome to handle once you pass the 7-bit code points. At that point you'll either need custom output routines for UTF-8 or a conversion routine to an 8- or 16-bit code page all of which may or may not be something you want to do in your exception handler.
–
Andreas Magnusson Sep 21 '10 at 13:49

@Andreas: There are two problems when using std::string for UTF-8. One is that in UTF-8, there's a difference between the number of characters and the number of bytes in a string. The other is that it's very easy to confuse system-encoded strings (which every application will continue to need) with UTF-8-encoded ones, resulting in funny text being shown to the users. I found it better to use, say, std::basic_string<signed char> for UTF-8-encoded strings. That eliminates at least the second problem, because it makes the compiler bark at you when you confuse the encodings.
–
sbi Sep 21 '10 at 14:15

How prevalent are system-encoded strings that use characters outside the ASCII subset? If system-encoded strings can be restricted to the ASCII subset, then UTF-8 can be used without funny text. As for string length, I like using std::string because I can get a byte count from it and can calculate the number of characters in O(n). Basically, if you want the string to think in characters, you have to subclass std::basic_string<signed char>, change its iterator (and maybe demote it from being a random-access iterator), and add a byte count method.
–
Mike DeSimone Sep 21 '10 at 14:24
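
As a rough illustration of the O(n) character count mentioned above: in valid UTF-8 every code point starts with exactly one byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). The function name utf8_code_points is hypothetical, and the sketch assumes the input is already well-formed UTF-8. It counts code points, not user-perceived characters such as combined sequences.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a std::string assumed to hold well-formed UTF-8.
// s.size() still gives the byte count.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Skip continuation bytes (10xxxxxx); count every other byte.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```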

+1: I need to learn more about UTF-8
–
John Dibling Sep 21 '10 at 14:31

Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, it might have a hard time displaying the string though. (It can't know it's UTF-8, can it?)
On the other hand, for ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.) displaying a UTF-8 string will "just" display some gibberish, and you (or your user) might be fine with that if you cannot disambiguate between a char* in the locale character set and UTF-8.

Personally I think only low-level error messages should go into what() strings, and that these should be in English anyway. (Maybe combined with some error number or whatnot.)

The worst problem I see with what() is that it is not uncommon to include some contextual details in the what() message, for example a filename. Filenames are non ASCII rather often, so you are left with no choice but to use UTF-8 as the what() encoding.

Note also that your exception class (that's derived from std::exception) can obviously provide any access methods you like and so it might make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().

Edit: Regarding John's comment on how to return UTF-8:

If you have a const char* what() function, this function essentially returns a bunch of bytes. On a Western European Windows platform, these bytes would usually be encoded as Win1252, but on a Russian Windows it might as well be Win1251.

What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and who is interpreting them). A string literal's encoding is defined at compile time, but at runtime it's still up to the application how to interpret the bytes.

So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:

The input message to your exception has a well defined encoding

You have a well defined encoding for the string member you use to hold the message.

The conversion could also be placed in the (overridden) what() member function of MyExc() or you could define the exception to take an already UTF-8 encoded string or you could convert (from an expected input encoding, maybe wchar_t/UTF-16) in the ctor.
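
A minimal sketch of the "convert in the ctor" variant, under stated assumptions: the class name Utf8Error and the helper to_utf8 are hypothetical, and the input encoding is assumed to be UTF-32 (std::u32string) so that the conversion can be written without any platform-specific library. A real project might prefer an existing conversion library instead.

```cpp
#include <exception>
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-8.
// Surrogates and values above U+10FFFF are replaced with U+FFFD.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical exception whose what() is documented to return UTF-8.
class Utf8Error : public std::exception {
public:
    // The input encoding is well defined (UTF-32); convert once, up front.
    explicit Utf8Error(const std::u32string& message) : utf8_(to_utf8(message)) {}

    // The stored member has a well-defined encoding (UTF-8).
    const char* what() const noexcept override { return utf8_.c_str(); }

    // Explicit accessor for callers that want the encoding spelled out.
    const std::string& what_utf8() const noexcept { return utf8_; }

private:
    std::string utf8_;
};
```

Converting in the constructor keeps what() trivial and non-allocating, at the cost of doing the work even when the message is never read.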

"Returning UTF-8 is an obvious choice." This seems to follow the arc of current thought. Now the only question is, how do I return UTF-8? :)
–
John Dibling Sep 21 '10 at 14:33

@John Dibling: If the text of your messages is all in English and can be expressed in standard ASCII, you have already done enough, because ASCII and the first 128 characters of UTF-8 are identical. If you are using characters above 127 in some other encoding, you'll need to convert them to UTF-8. There must be a standard C++ library function to do that by now. If not, libiconv can do the trick.
–
JeremyP Sep 21 '10 at 15:51

@JeremyP: We use ICU where I work to handle Unicode. It's certainly not perfect (C interface...) but it does the work and handles the quirks of Unicode / internationalization / localization.
–
Matthieu M. Sep 21 '10 at 18:27

@Matthieu M: Thanks for that. I was looking for a C-compatible Unicode library. I could have used libiconv but its licence is more restrictive.
–
JeremyP Sep 22 '10 at 8:53

The first question is what do you intend to do with the what() string?

Do you plan to log the information somewhere?

If so, you should not be using the content of the what() string directly; you should be using that string as a reference to look up the correct locale-specific logging message. So to me the content of what() is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).

Now, it can be useful for the what() string to contain a human-readable message for developers to help in quick debugging (but for this, highly readable, polished text is not required). As a result there is no reason to support anything more than ASCII. Obey the KISS principle.

In response to your questions: I'd like to use the what() string in order to generate two levels of diagnostics. The lower level is a developer- or technician-centric diagnostic that would be displayed in log files. But at a higher level I'd like these strings to be used to construct a diagnostic that is actionable by a normal human being. As you seem to imply, the what() return could simply be a lookup value into a table of more humane messages, but some components of the string (or at least the exception) would need to be human-readable, such as "File blah.txt could not be found."
–
John Dibling Sep 21 '10 at 15:38

Another goal of mine is to keep catch blocks to a minimum. Utopia would be to have a single catch( const std::exception& ex ) block that catches everything, and that block would consume the what() string to produce both the technician- and human-level diagnostics. Following this pattern, all of the data to construct both messages must be retrievable from the what() string.
–
John Dibling Sep 21 '10 at 15:39

Most localization libraries take an input string and convert it to the local string via resources. So if you say that the first part of the string, up to a colon, is used to look up local strings, you can then do this: File could not be found: blah.txt. The part File could not be found: can then be used to look up the locale-specific translation.
–
Loki Astari Sep 21 '10 at 16:30
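
A small sketch of the colon-prefix lookup described above, assuming the text before the first colon is a fixed ASCII key and everything after it is contextual detail. The function names localize and german_messages, and the sample translation, are hypothetical. A real application would load its table from resources rather than hard-code it.

```cpp
#include <map>
#include <string>

// Hypothetical translation table for one locale, keyed by the ASCII prefix.
const std::map<std::string, std::string>& german_messages() {
    static const std::map<std::string, std::string> table = {
        {"File could not be found", "Datei wurde nicht gefunden"},
    };
    return table;
}

// Splits what() at the first colon, translates the prefix, keeps the detail.
// Falls back to the untranslated string when the prefix is unknown.
std::string localize(const std::string& what,
                     const std::map<std::string, std::string>& table) {
    const std::string::size_type colon = what.find(':');
    const std::string key = what.substr(0, colon);
    const std::string detail =
        (colon == std::string::npos) ? std::string() : what.substr(colon);
    const auto it = table.find(key);
    if (it == table.end()) {
        return what;
    }
    return it->second + detail;
}
```

With this scheme the key itself must never contain a colon, while the detail part (a file name, for example) may contain anything.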

-1: I think adding a link (a great link, btw.) without any explanation of how it relates to C++ exceptions does nothing to help answer the question. (It might help contextualize some encoding issues, but that's what comments are for, no?) This is especially true if the OP actually needs to read the link.
–
Martin Ba Sep 21 '10 at 14:19

Moreover, I've already read the link and it does not address my question.
–
John Dibling Sep 21 '10 at 14:32

To the contrary, I think this link provides great insight as to why using char const* has nothing to do with character encoding.
–
Alexandre C. Sep 21 '10 at 14:37

The Standard doesn't specify the encoding of the string returned by what(), nor is there any de facto standard. In my projects I just encode it as UTF-8 and return it from what(). Of course there may be incompatibilities with other libraries.

A const char* doesn't have to point to an ASCII string; it can be in a multi-byte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.

I usually just define my own base exception class, which uses wstring instead of string in the constructor and returns a const wstring& from what(). It's not that big of a deal. The lack of a standard one is a pretty big oversight.

Another valid opinion is that exception strings should never be presented to the user, so localizing them isn't necessary and so you don't have to worry about any of the above.

what() is generally not meant to display a message to a user. Among other things the text it returns is not localizable (even if it was Unicode). I'd just use what() to display something of value to you as the developer (like the source file and line number of the place where the exception was raised) and for that sort of text, ASCII is usually more than enough.
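
One possible way to get the source file and line number into what(), sketched with a hypothetical helper make_located_error and a hypothetical macro THROW_LOCATED built on the standard __FILE__ and __LINE__ macros. The message format is an arbitrary choice for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: builds a std::runtime_error whose what() starts
// with "file:line: " followed by the developer-facing message.
inline std::runtime_error make_located_error(const char* file, int line,
                                             const std::string& message) {
    return std::runtime_error(std::string(file) + ":" +
                              std::to_string(line) + ": " + message);
}

// Hypothetical macro: captures the throw site automatically.
#define THROW_LOCATED(message) \
    throw make_located_error(__FILE__, __LINE__, (message))
```

A call such as THROW_LOCATED("cache index corrupted") can then be caught as const std::exception& like any other standard exception.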

This is your opinion, and while I respect your opinion I don't share it. Even if the what() output is only stored to a log file it is on some level "presented to the user" and needs to not be gibberish.
–
John Dibling Sep 21 '10 at 14:35

I am not saying it should be gibberish. I am saying that what() is not suitable to hold "international" text not because it can't hold Unicode (it can) but because it is not localizable.
–
Nemanja Trifunovic Sep 21 '10 at 14:42

Certainly the exception text may not need to be "internationalized" in the same way as text that the users normally see. But I can imagine times where a piece of Unicode text would still be very relevant and one would want it included with the exception. For example, a file name or path could have Unicode characters. Leaving that out would make the exception handling or logging less useful.
–
TheUndeadFish Sep 21 '10 at 17:40

Why can't you internationalize it? Can't you access the locale within what()?
–
Matthieu M. Sep 21 '10 at 18:26