So, there were two major groups of comments on the last post, and I'll try to address them.
The first was a question about managed support
for ETW. I talked to the ETW team, and the current state is that
there is no official managed interface...

A lot of work in performance tuning is
organizational. There's only so much work one can do with a
profiler and a single module. A good example is the Registry --
we can attach profilers to the Registry access routines and optimize
them until they...

Hello! I haven't updated this blog in a
while; work and other events have conspired to keep me from
writing. Also, blogs.msdn.com moved internally from .Text to
Telligent Community Server, and my CSS markup was an unfortunate
casualty of the move...

Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of compatibility characters. Compatibility characters are canonically equivalent to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that's equivalent to a character in some older standard. For example, ISO8859-2 defines 0xA5 to be equivalent to a capital letter L with a caron accent (Ľ). The "simple" equival...

Imagine that you've allocated a byte array for recv()ing something in from a TCP socket. If we know that said content is UCS-4, the natural urge is to cast it to an unsigned long * to iterate over... except that you can't. Or, at least, you shouldn't. If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly). Of cour...
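One common way around the cast is to copy each code unit out of the byte array with memcpy(), which is valid at any alignment. What follows is only a minimal sketch: read_ucs4 is a hypothetical helper, and it assumes the bytes are already in the host's byte order (real network data would also need a byte-order conversion).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: decode UCS-4 code units stored in host byte order
// without ever forming a misaligned 32-bit pointer.
std::vector<std::uint32_t> read_ucs4(const unsigned char* bytes, std::size_t size)
{
    assert(size % 4 == 0);  // assumption: the caller passes whole code units only
    std::vector<std::uint32_t> out;
    out.reserve(size / 4);
    for (std::size_t i = 0; i < size; i += 4) {
        std::uint32_t code_point;
        // memcpy makes no alignment demands on its source, unlike a
        // dereference of (unsigned long *)bytes.
        std::memcpy(&code_point, bytes + i, sizeof code_point);
        out.push_back(code_point);
    }
    return out;
}
```

Compilers typically turn the fixed-size memcpy() into a single load on platforms that tolerate unaligned access, so the safe version need not cost more than the cast.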

How do you define operator[] for a string that's in a variable-width encoding such as UTF-8? One of the basic assumptions in std::string that I intend to honor is that operator[] returns a reference to the actual data, not a copy. For fixed-width encodings such as ASCII, UCS2, or UCS4, this is not a problem; I simply return an unsigned char/short/long. However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well. I could do this with covariant return...
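To make the "range of bytes plus a size" idea concrete, here is a rough sketch of what an indexed lookup into UTF-8 could hand back. It is not the design this series settles on: byte_range, utf8_sequence_length, and code_point_at are hypothetical names, and the code assumes the input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical view onto the bytes of one encoded code point.
struct byte_range {
    const unsigned char* data;  // points into the string's own storage
    std::size_t size;           // 1 to 4 bytes for UTF-8
};

// Length of a UTF-8 sequence, judged from its lead byte alone.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    assert((lead & 0xF8) == 0xF0);        // 11110xxx
    return 4;
}

// Returns the bytes of the index-th code point. Note the linear walk:
// with a variable-width encoding, indexing is no longer constant time.
byte_range code_point_at(const std::string& utf8, std::size_t index)
{
    std::size_t pos = 0;
    for (std::size_t i = 0; i < index; ++i) {
        assert(pos < utf8.size());
        pos += utf8_sequence_length(static_cast<unsigned char>(utf8[pos]));
    }
    assert(pos < utf8.size());
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(utf8.data()) + pos;
    return byte_range{p, utf8_sequence_length(*p)};
}
```

The returned range still refers to the actual data rather than a copy, which keeps the std::string assumption intact at the price of a wider return type.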

However, regardless of whether pre-composed characters are favored or not, there are some character sequences which do not have pre-composed equivalents and must be represented using combining characters. Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through! When you hit the right arrow in Microsoft Word to skip over an À, you don't go first to an A and then to the A's accent -- you move past the whole "character." (Unico...
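The right-arrow behavior can be approximated by stepping over a base code point together with any combining marks that trail it. The sketch below is deliberately simplified: next_boundary and is_combining_mark are hypothetical names, and the mark test only covers the Combining Diacritical Marks block (U+0300 to U+036F), whereas a real implementation would consult the Unicode character database.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
bool is_combining_mark(char32_t code_point)
{
    return code_point >= 0x0300 && code_point <= 0x036F;
}

// Given the index of a base code point, return the index just past that
// code point and every combining mark attached to it.
std::size_t next_boundary(const std::u32string& text, std::size_t pos)
{
    assert(pos < text.size());
    ++pos;  // step over the base code point
    while (pos < text.size() && is_combining_mark(text[pos])) {
        ++pos;  // step over each trailing accent
    }
    return pos;
}
```

For the sequence A, U+0300 (combining grave accent), B, a call starting at index 0 lands on index 2, so the A and its accent are skipped as one unit.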

In our last episode, we established that we wouldn't be able to make a true std::string replacement and still handle variable-width encodings. So, we started with the beginning lines of an rmstring class. However, this doesn't mean we are going to dispense with std::string entirely! And, as it turns out, compatibility with it is both easier and harder than actually making a std::string, depending on what you're implementing and where......

Yesterday, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for encoding and decoding code point indices on a binary computer. At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode codepoints. Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition....

At the end of the last post, we reduced the abstract concept of "string" down to an "ordered sequence of Unicode code points." (We did so by choosing to actively ignore glyph information, but we'll be coming back to it later.) Unicode code points are simply numbers; of course, numbers have to be reduced to binary to be stored in a computer. Someone who is reading a string needs to use the exact same encoding scheme. And not all encoding schemes are equal......
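As a small illustration of why reader and writer must agree on the scheme, the sketch below reduces the same code point to bytes in two different ways. The function names encode_utf8 and encode_utf32be are hypothetical, and the choice of these two encodings is mine for the example rather than something taken from the post.

```cpp
#include <cassert>
#include <string>

// Narrow a small integer to char without sign surprises.
static char to_byte(unsigned long value)
{
    return static_cast<char>(static_cast<unsigned char>(value & 0xFF));
}

// Variable-width: 1 to 4 bytes depending on the code point's magnitude.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += to_byte(cp);
    } else if (cp < 0x800) {
        out += to_byte(0xC0 | (cp >> 6));
        out += to_byte(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += to_byte(0xE0 | (cp >> 12));
        out += to_byte(0x80 | ((cp >> 6) & 0x3F));
        out += to_byte(0x80 | (cp & 0x3F));
    } else {
        out += to_byte(0xF0 | (cp >> 18));
        out += to_byte(0x80 | ((cp >> 12) & 0x3F));
        out += to_byte(0x80 | ((cp >> 6) & 0x3F));
        out += to_byte(0x80 | (cp & 0x3F));
    }
    return out;
}

// Fixed-width: always 4 bytes, most significant byte first.
std::string encode_utf32be(char32_t cp)
{
    std::string out;
    out += to_byte(cp >> 24);
    out += to_byte(cp >> 16);
    out += to_byte(cp >> 8);
    out += to_byte(cp);
    return out;
}
```

U+00C0 (À) comes out as the two bytes C3 80 under the first scheme and the four bytes 00 00 00 C0 under the second, so a reader guessing the wrong scheme sees different code points or none at all.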

What is a string?
About six months ago at the Game Developers Conference in San Jose, I sat in on a talk about performance tuning in Xbox games. The presenter had a slide that read: "Programmers love strings. Love hurts." This was shown while he described a game which was using a string identifier for every object in the game world and hashing on them, and was incurring a huge performance hit from thousands of strcmp()s each frame. I nodded -- but my mind was thinking......