Posts filed under 'Database'

In this post, we talk about Unicode, about Strings, about how they sometimes become unexpectedly mangled when you pass them around the system and what we can do about it. We’ll see some Chinese ideograms, peek into memory and display some nonsense on screen in the process.

VBA likes Unicode

Internally, VBA handles strings in Unicode encoded as UTF-16. This means that each ‘character’ of the string uses at least a 16 bit WORD (2 bytes) of memory (not entirely true, but let’s keep it simple).

Below is the proof; it shows the memory content of a VBA string variable that contains the following sentence:

Chinese or the Sinitic language(s)
(汉语/漢語, pinyin: Hànyǔ; 中文, pinyin: Zhōngwén)
is the language spoken by the Han Chinese

If you see squares in place of the Chinese ideograms, then you are probably using an old version of Internet Explorer and do not have the Asian fonts installed: simply use the latest versions of Firefox or Chrome if you want to display them correctly (although you can still go through this post without seeing the exact glyphs).

In memory, the layout of this VBA string is:

Each character of our string occupies 2 bytes: our first letter C whose hexadecimal code 0043 is located at address 0EED7974, then, 2 bytes later at address 0EED7976, we have our second character, h whose code is 0068 and so on. Note that Windows is a Little Endian architecture, so the lowest significant byte 68 appears first in memory, followed by the most significant byte 00 at the next address.

On the right side, the map displays strange characters in place of the Chinese ones because the memory map interprets each byte as a simple ASCII character, but you can see for instance that the ideogram 中 is located at address 0EED79EA and that its code is 4E2D .

The character codes we see here are simply the ones that would be returned by the AscW() function in VBA (more or less).

VBA doesn’t always like Unicode

The thing though, is that VBA considers the outside world to be ANSI, where characters take (generally) 1 byte and strings must be interpreted according to a Code Page that translate the character code into a different visual representation depending on the encoding.

This means that if you are Greek, your system’s code page will be 737 and you will see Ω (the capital letter Omega) if a string contains the hexadecimal code 97 whereas on a system set for western languages, including English, you will see ù: different representations of the exact same character code.

This nightmare of ANSI encoding made it very hard to pass around strings if you did not know which code page was associated with them. Even worse, this makes it very hard to display strings in multiple languages within the same application.

The Office VBE IDE editor is one such old beast: it only support the input and display of characters in your current system’s Code Page and it will simply display garbage or ? placeholders for any characters it can’t understand.

Fortunately, we’re not really living in an ANSI world any more. Since the days of Windows NT and Windows XP SP2, we’ve been able to use Unicode where each possible character, glyph, symbol has its own code point (things can be quite complicated in Unicode world as well, but that’s the topic for another post).

Unfortunately, VBA still inherits some old habits that just refuse to die: there are times when it will convert your nice Unicode strings to ANSI without telling you.

To make the situation worse, most of the VBA code on the Internet was written by English speakers at a time when Unicode was just being implemented (around 2000). The result is that even Microsoft’s own examples and Knowledge Base articles still get things wrong and copying these examples blindly will probably make you, your customers and users very unhappy at some point.

To understand this unwanted legacy, and the solutions to avoid these problems, we’ll go through a concrete example.

Message in a box

Let’s study a simple case: display our previous message using the MessageBox Windows API. The result we want to achieve is this:

Note that in order to display Chinese ideograms, your system must have the proper fonts installed. If not, you’ll end up with little boxes where the glyphs should be displayed.

What we are talking about here is not limited to Asian languages though: it concerns all languages, including English if you ever include any symbols that’s outside of extended ASCII, like the Euro €, mathematical symbols, words in other alphabets like Greek, phonetics, accents, even Emoji (emoticon) symbols.

The GetUnicodeString() function simply returns a string containing the text we want to display. It uses a function UnEscStr() from a previous article to convert escaped Unicode sequences into actual characters.

What we obtain is not what we expected:

All Chinese characters have been replaced by ? placeholders.

Ok, so, if we look again at the signature of the function, we’re calling MessageBoxA. What does this A tacked at the end means? The Windows API has generally 2 version of functions that handle strings: the old ANSI version, ending with a A, and the Unicode version, ending with a W, for Wide-Characters.

Our first tip is: whenever you see a WinAPI declaration for a function ending in A, you should consider the W version instead. Avoid A WinAPI functions like the plague!

Get it right: MessageBoxW

So, now we know that we must use MessageBoxW for Unicode, the API documentation says so as well, so we must be on the right track.

The only change is the W in the function name and we’ll call the Unicode version. Let’s try to open that box again:

Dim s As String
s = GetUnicodeString()
MessageBoxW 0, s, "test", 0

And this displays:

Oh dear. Looks like Chinese, but it certainly isn’t what we expected. Here we’ve just got Mojibake, garbage due to incorrect processing of Unicode.

What could be wrong? We did pass a proper Unicode string, so what happened?

Well, VBA happened: whenever you pass a String as a parameter in a Declare statement, VBA will convert it to ANSI automatically.

Let me repeat this: when you see a Declare statement that uses As String parameters, VBA will try to convert the string to ANSI and will likely irremediably damage the content of your string.

I’ll show you what happens by going back to our memory map. This is the content of memory before we call MessageBoxW:

This is the memory of the same string after the call to MessageBoxW:

The string has been converted to ANSI, then converted back to Unicode to fit into 2-byte per character but in the process all high Unicode code points have been replaced by ? (whose code is 003F).

Note that the memory location of the string before and after the call are different. This is normal since the string was converted, so another one was allocated (sometimes they end-up occupying the same memory though it’s probably an optimisation of the interpreter).

So our string was actually returned modified to us; we can’t even rely on VBA to keep our original string intact…

Get it right, use pointers!

So, it’s clear now, whenever VBA sees that you are passing a String in a Declare statement, it will silently convert that String to ANSI and mangle any characters that doesn’t work in the current Code Page set for the system.

To avoid that unwanted conversion, instead of declaring the parameters as String, we will declare pointers instead so that our final declaration will be:

What we did here is simply define MessageBoxU as an alias for MessageBoxW which we know is the right API function. Then we replaced the declaration for ByVal lpText As String and ByVal lpCaption As String, by pointers values: ByVal lpText As LongPtr and ByVal lpCaption As LongPtr.

From the point of view of the MessageBoxW function, in both instances it will receive a pointer, except that instead of a String, itwill receive the value of a LongPtr, which is functionally identical.

Now, when using our new MessageBoxU function, we also need to pass the value for the pointers to the strings explicitly:

The StrPtr() function has been part of VBA for a while now but it is still not documented. It simply return the memory address of the first character of a string as a LongPtr (or Long versions of Office older than 2010).

There are other functions to get the address of a variable: VarPtr() returns the memory address of the given variable while ObjPtr() returns the memory address of an object instance. We’ll have to talk more about those in another post.

So, now, the result:

Hurray! We did it! No unexpected conversion! If you looked at the memory map, there would be nothing really interesting to see: the string would have stayed the same throughout the call and would be untouched: no conversion, no copy.

Not only did we manage to make call to API functions work, they are also faster because VBA doesn’t have the overhead of these silly Unicode-ANSI-Unicode conversions every time we use the API function.

Parting words

The point of all this is simply to remember the following when involving Strings in Win API calls:

Do not use the A version of API calls, always use the W version instead.

This is also relevant for some of the built-in VBA functions: use ChrW$() over Chr() and AscW() over Asc(), both W version handle Unicode characters and are about 30% faster in my tests (see KB145745 for more information).

VBA will mess-up and convert to ANSI all strings parameters in a Declare statement.

To avoid this mangling, use pointers instead of Strings in your declarations: never declare As String in an API Declare statement. Always use ByVal As LongPtr instead and then use StrPtr() in your calls to pass the pointer to the actual string data.

Similarly, when passing a User-Defined-Type structure that contains Strings to a Declare statement parameter, declare the parameter as ByVal As LongPtr and pass the UDT pointer with VarPtr().

Remember that the VBE IDE cannot process and display Unicode, so don’t expect your debug.print to display anything other than ? for high code point characters.