Introduction

For people with PUA1 character needs, a major concern is whether the software products that they use will make particular assumptions about the semantics of PUA characters that affect how their text is processed. If a software product assumes different semantics for a code point than the user is assuming, the user may experience unexpected text-processing behaviours, such as lines breaking at undesirable points within the text. At worst, though, irreparable data loss might also occur.

Between them, major software vendors have made use of the entire PUA area in the Basic Multilingual Plane (BMP) of Unicode (U+E000..U+F8FF) for one purpose or another. That does not mean that none of these code points are usable in popular software products, though: the fact that vendors have made use of these code points does not necessarily imply that software products made by these vendors will assume certain semantics. It does raise concern for users of those software products, however, and point to the need for testing.

Within SIL International, many of our users are working with Microsoft products, Windows and Word in particular. Many of our users have PUA character needs, and Microsoft has made use of portions of the BMP PUA range.2 For this reason, I decided I should do some tests to see how these PUA code points are handled in Microsoft products.

I performed tests on Windows XP using Notepad, WordPad and Word 2002. Notepad is of interest because it does little on its own to affect rendering, and so gives an indication of what Windows GDI is doing when PUA characters are displayed. WordPad is important to consider because it is based on the Rich Edit control. Therefore, whatever behaviours we encounter in WordPad are likely to occur also in other applications built from the same version of the Rich Edit control. Of course, Word is important to consider because it is in widespread use.

In these tests, I was interested in the entire BMP PUA range, but particularly in these portions of that range:

I attempted to enter characters from these ranges and format them using a font that I knew did not support these character ranges. For the font, I used one that we have been developing, Doulos SIL. I then made a modified version of the font, Doulos SIL - GDI PUA test, in which I copied various glyphs and assigned these new copies to code points in the PUA. I then repeated the tests in each application using this new font.
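A test sample of this kind can be generated rather than typed by hand. The following is a minimal sketch, assuming Python; the function name `pua_sample` is hypothetical, and the code points chosen by the caller need not match the exact selection used in these tests.

```python
def pua_sample(first, last, step=1):
    """Return one line per code point in the form 'U+XXXX<TAB><character>'."""
    if not (0xE000 <= first <= last <= 0xF8FF):
        raise ValueError("range must lie inside the BMP PUA (U+E000..U+F8FF)")
    return "\n".join(
        f"U+{cp:04X}\t{chr(cp)}" for cp in range(first, last + 1, step)
    )
```

The resulting string could be saved as UTF-16 with a byte-order mark (the encoding Notepad labels "Unicode") and opened in each application under test.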

Notepad

I began the experiment by opening Notepad, setting the font to Doulos SIL, and then entering a selection of PUA characters from the ranges mentioned above. The following image shows the result:

Notepad displaying PUA characters not supported in the selected font

There are various things worth noting here. First of all, there are a number of code points for which Windows is using alternate fonts. This does not happen for all of the tested code points, however.

Windows uses code points in the range U+E801..U+E805 for Hebrew presentation forms, and we see that these code points are being interpreted as Hebrew and displayed using a font that supports Hebrew (these glyphs appear to come from Tahoma). In contrast, while Windows uses the range U+E816..U+E836 for Arabic presentation forms, code points in this range are not being displayed as Arabic. I was surprised, though, to see these code points being displayed as Asian “EUDC” characters.3

The range U+F001..U+F031 is used in Microsoft core fonts for certain presentation forms, and also for certain glyphs for drawing text insertion-point icons. We see that some of these code points are being displayed in this way, though again, not all. I cannot explain the superscript digits being displayed for U+F006..U+F009 — I’m not sure what font these are coming from, and thus far I haven’t even discovered a font that encodes such glyphs at these PUA code points.

It is worth noting that this range overlaps with the range used by symbol fonts, U+F020..U+F0FF,4 and that the code points from the symbol range that were tested were not affected.

Finally, the range U+F700..U+F71D has been used in Windows for Thai presentation forms, and most of the code points in this range that I tested were displayed using Thai glyphs.

In considering these results, it is important to remember that the selected font did not have glyphs for any of these code points: Windows assumed certain semantics and substituted other fonts that supported those semantics.
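The ranges discussed above can be collected into a small lookup for checking whether a given PUA code point falls in an area that Microsoft has used. This is a sketch only: the range boundaries and descriptions are transcribed from this article, the names `MS_PUA_RANGES`, `is_bmp_pua` and `microsoft_uses` are hypothetical, and the list is not a complete inventory of vendor PUA usage.

```python
# Ranges as reported in this article; not an authoritative or complete list.
MS_PUA_RANGES = [
    (0xE801, 0xE805, "Hebrew presentation forms"),
    (0xE816, 0xE836, "Arabic presentation forms"),
    (0xF001, 0xF031, "core-font presentation forms and insertion-point glyphs"),
    (0xF020, 0xF0FF, "symbol fonts"),
    (0xF700, 0xF71D, "Thai presentation forms"),
]


def is_bmp_pua(cp):
    """True if cp lies in the BMP Private Use Area, U+E000..U+F8FF."""
    return 0xE000 <= cp <= 0xF8FF


def microsoft_uses(cp):
    """Return the descriptions of every listed range that contains cp."""
    return [label for lo, hi, label in MS_PUA_RANGES if lo <= cp <= hi]
```

Because U+F001..U+F031 and U+F020..U+F0FF overlap, a code point such as U+F025 is reported under both descriptions.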

Having tested in Notepad using a font that does not support the PUA code points being tested, I then changed the font to the other one, Doulos SIL - GDI PUA test, which does. The results are shown in the next image.

Notepad displaying PUA characters supported in the selected font

This time, all of the glyphs displayed were from the selected font — except for one: the character U+F00F was still displayed using some other font. The improvement from the previous font is an encouraging result, but the one code point that failed to display as desired leaves some cause for concern: how many of the code points that weren’t tested will have the same problem?

WordPad

Next, I tested using WordPad. I opened WordPad with a new document, set the font to Doulos SIL, and then copied the text from Notepad and pasted it into the WordPad document.5 I made a point of setting the font before I pasted in the text since WordPad is based on the Rich Edit control, and Rich Edit likes to take some measure of control when applying formatting to existing text: if it thinks that characters within the selected text are not supported by the font being applied, it may not allow those characters to be formatted with that font. It is less controlling in this regard when text is first entered, however.

The results after pasting the text into WordPad are shown in the next figure. Again, keep in mind that none of the code points are supported by the selected font.

WordPad displaying PUA characters not supported in the selected font

Again, there are several things to be noted here. First of all, for the code points that were tested between U+E000 and U+E83A, the results match those in Notepad using this same font. There is something new to be noticed, though: note the text that was selected in the upper right hand corner. The entire run is formatted with the Doulos SIL font, but the rightmost glyph is being displayed from another font. We are now dealing with rich text, and we have a situation in which the formatting attributes within the document do not match what is displayed.

There is something else peculiar with that selection: the advance width for that third glyph is too wide. The image makes it look as though there is a space after the character and that the space is also selected, but there is no space; there is only one character, U+E803. Notice that the width of the selection is appropriate for the Asian glyph on the line below, however. Evidently, the software is working here with two different assumptions at the same time: for purposes of glyph selection, it assumes the character is a Hebrew presentation form, and it finds that glyph from some font that supports Hebrew, but for purposes that involve metrics, it is assuming the character is a Chinese character and is basing the advance width on some other Asian font.

Turning to consider the rest of the code points being tested, those from U+F001 to U+F71D, the results are quite different from what they were with Notepad: none of the same glyphs appear. Indeed, the only glyphs that appear are some Wingdings. The Wingdings are all for code points in the range used by symbol fonts, U+F020..U+F0FF.

Whereas earlier we saw that the text was formatted with one font but another was actually used for display, in the case of the Wingdings, that is actually the font that WordPad applied to those characters. Interestingly, when I subsequently tried to apply the Doulos SIL font to these characters, WordPad accepted the change, and then actually transformed the code points in the text, changing U+F021 to U+0021, U+F022 to U+0022, etc.

Symbol characters in WordPad are changed after reformatting

One of the mysteries of text formatted with symbol fonts (at least, in certain Microsoft applications) is that characters appear to be encoded in terms of 8-bit code points even if a document is otherwise encoded in Unicode. When U+F021 was inserted into WordPad from the clipboard, not only did WordPad (more precisely, the Rich Edit control) apply the Wingdings font, it seems that it also changed the code point to 0x21. When the character was reformatted to a non-symbol font, this became U+0021.6

This points out the need for users to be warned, therefore: when working with PUA characters in the symbol range, U+F020..U+F0FF, there is potential for data to be unexpectedly changed and for unrecoverable data loss when working in WordPad or other applications based on the Rich Edit control.
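The change observed here can be modelled in a few lines to show why the loss is unrecoverable. This is a sketch of the observed behaviour only, not of how the Rich Edit control is implemented; the function name is hypothetical, and the model is deliberately limited to U+F020..U+F07F, since values of 0x80 and above appear to depend on a codepage conversion.

```python
def observed_rich_edit_result(text):
    """Model the observed change: U+F021 -> U+0021, U+F022 -> U+0022, etc.

    Only U+F020..U+F07F is modelled; other characters pass through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xF020 <= cp <= 0xF07F:
            out.append(chr(cp - 0xF000))  # keep only the low 8 bits
        else:
            out.append(ch)
    return "".join(out)
```

After the change, a former U+F021 and a genuine U+0021 are indistinguishable, so the original PUA text cannot be reconstructed from the result.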

This is not yet the end of the story for how WordPad handles code points U+F001 to U+F71D using this font. I discovered that I could copy some of the characters for which no glyphs were displayed and paste them elsewhere into the document, and suddenly glyphs would appear — the same glyph from another font that appeared with Notepad.

Copies of PUA characters displayed with a different font

In this screen shot, the last five characters7 are U+F00A, U+F00B, U+F00C, U+F700 and U+F70F. One instance of each in the document is invisible while these instances are not, even though all of the characters are formatted with the Doulos SIL font.

There is one more oddity at work. Notice in the previous screen shot the fourth character on the line was selected. If I extended the selection to the right by one character, so that the selection rectangle bordered on one of the characters in the range of U+F001 and above, then the remaining characters on the line (those in that range) disappeared again:

Selection rectangle causes U+F001..U+F71D to display with a different font

Note that the content of the document had not changed between these two screen shots. All that had changed was the amount of text that was selected.

I’ve explained in detail what happened when I tested PUA code points in WordPad using the first font, which doesn’t support those code points. I then repeated this experiment using the other font. Again, I started with a blank WordPad document, set the font, and then pasted in the text from Notepad. The result is seen in the following image:

WordPad displaying PUA characters supported in the selected font

As was the case with Notepad, more code points were displayed using the selected font when it was a font that supported those code points. Code points in the range U+E801..U+E83A no longer display with Hebrew and Chinese glyphs. On the other hand, code points in the range U+F020..U+F0FF are still being formatted with the Wingdings font and converted to 8-bit values (0x21, 0x22, etc.).

To check whether the issue with the symbol range applies only when data is pasted from the clipboard, I tried entering characters directly into WordPad: I set the font to Doulos SIL - GDI PUA test, and then entered some characters (U+F024, U+F0A1, U+F0FF). The results were the same.

Therefore, the warning mentioned above applies at all times in WordPad or other applications based on the Rich Edit control, regardless of the font that was used when the characters were entered or how the characters are entered: if you use these applications and have PUA characters in the range U+F020..U+F0FF, there is potential for data to be unexpectedly changed and for unrecoverable data loss.
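Before moving text into an application based on the Rich Edit control, it may be worth checking whether the text contains any characters in the affected range. A minimal sketch, assuming Python and a hypothetical helper name:

```python
def symbol_range_hits(text):
    """Return (index, 'U+XXXX') for each character in U+F020..U+F0FF."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0xF020 <= ord(ch) <= 0xF0FF
    ]
```

An empty result means the text contains nothing in the symbol range; it says nothing about the other behaviours described in this article.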

Word

After testing with WordPad, I performed the same tests in Word 2002: I set the font, then pasted the text from Notepad. The following image shows the result using the first font, Doulos SIL. Again, this font does not have glyphs for any of these code points.

Word displaying PUA characters not supported in the selected font

By now, the results are not at all surprising. Word formatted all of the tested PUA characters using a different font than the one I had selected:

Font                 Characters
SimSun               U+E000..U+E004, U+E816..U+E83A
Times New Roman      U+E801..U+E805, U+F001..U+F009, U+F00F..U+F02F
Tahoma               U+F00A..U+F00E, U+F700..U+F71D

Fonts applied to tested PUA characters by Word

When I tried reformatting the text using Doulos SIL, Word did not allow the font to be applied to code points that appeared as Hebrew and Thai presentation forms, but it did allow the change to apply to all the other code points. Code points from U+E816 to U+E83A continued to display using glyphs from SimSun, however: in the font dialog, Doulos SIL was specified in the control for Latin text font, and SimSun was specified in the control for Asian text font.

Finally, I tested Word using the second font, Doulos SIL - GDI PUA test. The next image shows the results:

Word displaying PUA characters supported in the selected font

As before, Word used some other fonts than the one I had chosen, but this time it changed the font for certain code points only:

Font                         Characters
SimSun                       U+E000..U+E004, U+E816..U+E83A
Times New Roman              U+E801..U+E805, U+F00F..U+F02F
Tahoma                       none
Doulos SIL - GDI PUA test    U+F001..U+F00E, U+F700..U+F71D

Fonts applied to tested PUA characters by Word

This time, when I attempted to reformat the text using Doulos SIL - GDI PUA test, Word applied that font to all of the PUA code points, and each was displayed using glyphs from that font.
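The two tables of fonts applied by Word on paste can be compared mechanically to see which code points benefited from the font that supports them. The sketch below transcribes both tables as ranges; the names are hypothetical, and the ranges are treated as contiguous even though only a selection of code points within them was actually tested.

```python
DOULOS_TEST = "Doulos SIL - GDI PUA test"

# Font -> list of (first, last) ranges, transcribed from the two tables.
PASTED_WITH_DOULOS = {
    "SimSun": [(0xE000, 0xE004), (0xE816, 0xE83A)],
    "Times New Roman": [(0xE801, 0xE805), (0xF001, 0xF009), (0xF00F, 0xF02F)],
    "Tahoma": [(0xF00A, 0xF00E), (0xF700, 0xF71D)],
}
PASTED_WITH_DOULOS_TEST = {
    "SimSun": [(0xE000, 0xE004), (0xE816, 0xE83A)],
    "Times New Roman": [(0xE801, 0xE805), (0xF00F, 0xF02F)],
    DOULOS_TEST: [(0xF001, 0xF00E), (0xF700, 0xF71D)],
}


def font_for(table, cp):
    """Return the font the table lists for cp, or None if cp is not listed."""
    for font, ranges in table.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return font
    return None


def changed_fonts(before, after):
    """Map each code point whose font differs to a (before, after) pair."""
    cps = sorted(
        cp
        for ranges in before.values()
        for lo, hi in ranges
        for cp in range(lo, hi + 1)
    )
    return {
        cp: (font_for(before, cp), font_for(after, cp))
        for cp in cps
        if font_for(before, cp) != font_for(after, cp)
    }
```

Calling `changed_fonts(PASTED_WITH_DOULOS, PASTED_WITH_DOULOS_TEST)` lists only U+F001..U+F00E and U+F700..U+F71D, which matches the tables: the SimSun ranges, U+E801..U+E805 and U+F00F..U+F02F were assigned the same substitute fonts on paste with either font selected.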

One more point is worth noting: at no point did I encounter problems with advance widths in Word 2002 as I had in WordPad.8

Other Microsoft applications

I took a very quick look at a few other Microsoft applications to see how they handled PUA characters.

Excel 2002, FrontPage 2002 and Publisher 2002 were very well-behaved with regard to PUA characters — better than Word 2002, in fact. As expected, when I first tested using Doulos SIL, which doesn’t support the tested PUA characters, I saw other glyphs. For Excel and Publisher, the visual results were comparable to Word using that font. FrontPage was slightly different: I saw the Hebrew presentation forms and Chinese glyphs, but the other tested code points were invisible. With each of these applications, I was able to reformat the text to the other font, Doulos SIL - GDI PUA test, and have all of my PUA characters appear. Also, when I pasted text in having first selected that font, it came in the way I wanted it, displaying my PUA characters. At no time was there any problem with advance widths in any of these applications, with either font.

Access 2002 fared reasonably well, but wasn’t perfect. I imported the data into a table, and then generated a query and formatted the query with each of my two fonts in turn. In terms of display, Access fared as well as Excel and the other applications just described. What was a problem, however, was the data import process: for some reason, the code points U+F701..U+F703 were removed on import. All other characters imported without incident, however, and when I pasted in those three characters, they displayed as desired. I have not done any further testing on data import to see what may have gone wrong, or whether this problem might extend to other code points.
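A loss of this kind is easy to miss by eye, so a round-trip comparison of the text before and after import can help. The following is a minimal sketch, assuming both versions of the text are available as strings; the function name `lost_pua` is hypothetical.

```python
from collections import Counter


def lost_pua(original, imported):
    """Return BMP PUA code points that occur fewer times after import."""
    missing = Counter(original) - Counter(imported)
    return sorted(
        f"U+{ord(ch):04X}" for ch in missing if 0xE000 <= ord(ch) <= 0xF8FF
    )
```

Applied to a sample in which U+F701..U+F703 were dropped, this would report exactly those three code points.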

PowerPoint 2002 was a very different story. When using the first font, it gave results similar to Notepad. What is a serious concern is that it also gave the same results using the other font, which does support the tested PUA characters. It chose which font to use for formatting those characters, and it would not allow me to reformat the characters using my font. In a nutshell, it appears that PUA characters cannot be used in PowerPoint 2002 unless you are using Microsoft’s PUA definitions.

Summary

To summarise, I discovered that each application made assumptions about the semantics of PUA code points, though this was not always insurmountable. Users should be able to work with PUA characters in these applications, though there are some exceptions and special considerations they need to be aware of:

If you try to display PUA characters using a font that doesn’t support those characters, many of the PUA characters used by Microsoft are likely to be displayed in Microsoft software using glyphs from other fonts.

It appears that U+F00F cannot be used reliably in Notepad (at least, the version that comes with Windows XP).

The symbol range, U+F020..U+F0FF, should be avoided when working with WordPad or other applications based on the Rich Edit control.

When PUA characters are entered into Word 2002, it may at first change the font, and you may need to reformat the text with the font you use for your PUA characters.

Access 2002, Excel 2002, FrontPage 2002 and Publisher 2002 all display PUA characters without problems using an appropriate font. There may be problems with import of PUA characters into Access 2002, however. This needs further investigation.

PowerPoint 2002 enforces its own semantics for PUA characters, and so is useless for users that have defined their own PUA characters.

It appears that the 8-bit-to-Unicode conversion is done using the default system codepage. I tested pasting U+F08A into WordPad, and it became the Wingdings symbol 0x8A, but when I reformatted the text using a non-symbol font, this became U+0160. The mysteries of symbol fonts in Windows are many, though: before reformatting the character, I copied it to the clipboard. Several formats ended up on the clipboard, including CF_TEXT and CF_UNICODETEXT. (I don’t know for certain whether Rich Edit generated both of these formats or if it supplied one and the other was generated by User32.dll.) On the clipboard, the CF_UNICODETEXT contained U+008A (a control character), and the CF_TEXT contained “?”.
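The U+F08A to U+0160 result is consistent with reducing the code point to an 8-bit value and decoding that value with Windows codepage 1252. The sketch below models that reading; the choice of `cp1252` as the default system codepage is an assumption about the test machine, and the function name is hypothetical.

```python
def symbol_to_unicode(cp, codepage="cp1252"):
    """Model: PUA symbol code point -> 8-bit value -> decode with a codepage.

    cp1252 is assumed here; a system with a different default codepage
    would be expected to give different results for values 0x80..0xFF.
    """
    if not 0xF020 <= cp <= 0xF0FF:
        raise ValueError("expected a code point in U+F020..U+F0FF")
    byte = cp - 0xF000
    # A few byte values are undefined in cp1252; those decode to U+FFFD here.
    return bytes([byte]).decode(codepage, errors="replace")
```

Under this assumption, `symbol_to_unicode(0xF08A)` gives U+0160 and `symbol_to_unicode(0xF021)` gives U+0021, matching what was observed in WordPad.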