Convert Between std::string and std::wstring, UTF-8 and UTF-16

Introduction

I needed to convert between UTF-8 encoded std::string and UTF-16 encoded std::wstring. I found some conversion functions for native C strings, but these leave the memory handling to the caller. Not nice in modern times.

The best converter is probably the one from unicode.org. Here is a wrapper around it that converts the STL strings.
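The article's wrapper itself is not reproduced here, but a minimal sketch of what such a wrapper might look like follows. It assumes the ConvertUTF.h/ConvertUTF.c sources from unicode.org and a 16-bit wchar_t (as on Windows); this is an illustration, not the article's actual code:

#include <string>
#include <vector>
#include <stdexcept>
#include "ConvertUTF.h"

std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // One UTF-8 byte never yields more than one UTF-16 code unit,
    // so the input length is a safe upper bound for the output buffer.
    std::vector<UTF16> buffer(utf8.size());

    const UTF8* sourceStart = reinterpret_cast<const UTF8*>(utf8.data());
    const UTF8* sourceEnd   = sourceStart + utf8.size();
    UTF16* targetStart      = &buffer[0];
    UTF16* targetEnd        = targetStart + buffer.size();

    if (ConvertUTF8toUTF16(&sourceStart, sourceEnd,
                           &targetStart, targetEnd,
                           strictConversion) != conversionOK)
        throw std::runtime_error("invalid UTF-8 sequence");

    // targetStart now points one past the last code unit written.
    return std::wstring(buffer.begin(),
                        buffer.begin() + (targetStart - &buffer[0]));
}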

Unlike other articles, this one has no other dependencies, does not introduce yet another string class, only converts the STL strings, and that's it. And it's better than the widely found...
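The snippet elided above is not shown, but judging from the reply that follows, it is presumably the widely seen byte-by-byte widening, something like this sketch:

// The widely found (and wrong for non-ASCII input) "conversion":
// every byte becomes one wchar_t, so a multi-byte UTF-8 sequence
// is mangled into several garbage wide characters.
std::string narrow = "caf\xC3\xA9";               // "café" encoded as UTF-8
std::wstring wide(narrow.begin(), narrow.end());  // NOT a UTF-16 conversion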

This has nothing to do with Unicode! What you did is put an ASCII string with 8-bit chars into wide 16-bit chars, but they are still characters in the ASCII range! The ASCII range is a small sub-range of the Unicode range.

The point of this article is that you can convert Unicode characters formatted as a UTF-8 string into a UTF-16 string and vice versa. In such a string you can mix Latin, Greek, Russian, Hebrew, or the like with ASCII-range characters.

You can store a Unicode code point below 0xD800 in a single 16-bit wchar_t (code points above 0xFFFF need a surrogate pair), and you can store a character in the ASCII range in a single char, but a Unicode character outside the ASCII range has to be stored in more than one char.
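As a small illustration (assuming a 16-bit wchar_t, as on Windows):

#include <iostream>
#include <string>

int main()
{
    std::wstring utf16 = L"\u00E9";  // U+00E9 (é): one UTF-16 code unit
    std::string  utf8  = "\xC3\xA9"; // the same character: two UTF-8 bytes

    std::cout << utf16.size() << '\n';  // prints 1 (with 16-bit wchar_t)
    std::cout << utf8.size()  << '\n';  // prints 2
}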

The Windows API works with the UTF-16 format, but most other OSes work with UTF-8.
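On Windows you can also let the OS do the conversion itself; here is a Windows-only sketch using the Win32 call MultiByteToWideChar (an alternative to the article's portable converter, not part of it):

#include <windows.h>
#include <string>
#include <vector>

std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0,
                                  utf8.data(), (int)utf8.size(), NULL, 0);
    std::vector<wchar_t> buffer(len);
    MultiByteToWideChar(CP_UTF8, 0,
                        utf8.data(), (int)utf8.size(), &buffer[0], len);
    return std::wstring(&buffer[0], len);
}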

Also, in order to compile, place the following at the top of the utfconverter.cpp file:

#include "windows.h"
#include <string>

If you use the modified code shown later, with the two lines above, it appears to work. I've not put it through its paces (and probably won't, since I have very simple requirements here), but I just wanted to share what I had to do to get it working. I suggest updating the zip file with the latest code.

One "wide" character may require more than one UTF-8 characters.
If you are using
size_t utf8size = widesize;
this works very well with strings only containing ASCII characters, but as soon as you go beyond that, you can expect buffer overruns.
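A sketch of the fix, keeping the variable names from the comment above: size the buffer for the worst case instead.

// Each UTF-16 code unit expands to at most 3 UTF-8 bytes (a surrogate
// pair, i.e. 2 code units, yields one 4-byte sequence), so 3 bytes per
// wide character is a safe worst-case bound.
size_t utf8size = 3 * widesize;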


Wrong, it is *not* Unicode. The "L" prefix to a string literal in C++ means the subsequent character literal or string literal is a *wide* character or string, respectively (i.e., corresponding to wchar_t). A wide string has no specified encoding (UTF-16 is just one particular encoding).
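This is easy to see directly, since sizeof(wchar_t) is implementation-defined (an illustration, not from the article):

#include <iostream>

int main()
{
    const wchar_t* wide = L"hello";  // a *wide* string, not necessarily UTF-16

    // Typically prints 2 on Windows (MSVC) and 4 on most
    // Unix-like systems (GCC/Clang):
    std::cout << sizeof(wchar_t) << '\n';
}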

In general this is not OK. The string interface is crafted to allow optimizations. One such optimization is to allocate the string's buffer in chunks, to make insertion in the middle of the string more efficient and to avoid copying data when the string's size increases. If this is the case, then c_str() copies the data into one big buffer which hangs around until the string is destroyed or a non-const method is called; but there is no requirement for the data to be copied back. This hack may work on some specific STL implementations (perhaps even most) but is not portable.
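A portable alternative is to convert into a separate writable buffer and construct the string from it afterwards. A sketch of the wide-to-UTF-8 direction, again assuming the unicode.org ConvertUTF routines and a 16-bit wchar_t:

#include <string>
#include <vector>
#include "ConvertUTF.h"

// Convert into a std::vector (guaranteed contiguous storage) and build
// the std::string from it -- no writing through c_str() required.
std::string Utf16ToUtf8(const std::wstring& wide)
{
    if (wide.empty())
        return std::string();

    // Worst case: 3 UTF-8 bytes per UTF-16 code unit.
    std::vector<UTF8> buffer(3 * wide.size());

    const UTF16* sourceStart = reinterpret_cast<const UTF16*>(wide.data());
    const UTF16* sourceEnd   = sourceStart + wide.size();
    UTF8* targetStart        = &buffer[0];
    UTF8* targetEnd          = targetStart + buffer.size();

    ConvertUTF16toUTF8(&sourceStart, sourceEnd,
                       &targetStart, targetEnd, lenientConversion);

    // targetStart now points one past the last byte written.
    return std::string(reinterpret_cast<char*>(&buffer[0]),
                       targetStart - &buffer[0]);
}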