RTF is a terrible, horrible, no good, very bad file format

by Michael S. Kaplan, published on 2007/06/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/21/3431070.aspx

(Apologies to Judith Viorst!)

The other day Michael Michael asked me:

My team has been caught in a situation that is affecting our globalization plans. Basically, we have the following scenario:

We are using RichTextBoxes in several places in our GUI (we are coding in C# using Whidbey and WinForms in our GUI).

Because of some of our formatting restrictions (we need to bold, underline, etc. our content), we need to use the RTF property of the richtextbox control to add the data.

Now comes the problem. In English, all this works fine. However, in Chinese for example, we get a bunch of ???’s in our UI. We have identified the problem to be the character set and the font for each language. I am thinking, this is a problem that other teams at Microsoft must have faced and solved. Are you aware of any solutions?

Now I called this way easy, but it can get hard pretty quickly, especially when one considers that the RTF you get back when you take this approach is really bad.

I mean horrible.

Like lock up your daughters, break out the food rations, move to Australia just to get away from it all terrible.... basically putting everything in byte-based data that you cannot read (and which bloats up very quickly if you try to do that editing work on it)....

For Chinese even the old RichEdit 3.0 in RichEdit.dll on XP or Vista should handle both (CHS/CHT) writing systems okay without resorting to \uN control words. Such words are necessary for Unicode-only writing systems, but not for the principal East Asian writing systems. So if you manage to get the desired Chinese text into a RichEdit control, it should be able to generate valid RTF. But if you’re generating the RTF, then you need to worry about making it valid and that’s not so easy. Using the \uN approach is probably the easiest way. If you use \uc1, then follow each \uN entry with a ? so that you don’t have to figure out the corresponding codepage values. RichEdit uses WCTMB to convert Unicode to the CHS or CHT codepage for use in RTF and only resorts to \uN when the codepage doesn’t have the character...
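That \uc1-plus-question-mark approach can be sketched in a few lines. This is an illustrative Python sketch, not RichEdit's actual algorithm (and the thread's own code would be C#); it assumes BMP-only text, since higher-plane characters would need surrogate pairs or RichEdit's 32-bit values as discussed below:

```python
def text_to_rtf_body(text: str) -> str:
    """Escape non-ASCII BMP characters as \\uN with a '?' fallback (\\uc1)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "\\{}":
            out.append("\\" + ch)      # RTF special characters need escaping
        elif cp < 0x80:
            out.append(ch)             # plain ASCII passes through unchanged
        else:
            # \uN takes a signed 16-bit decimal, so BMP codepoints >= 0x8000
            # are written as negative numbers (cp - 0x10000).
            n = cp if cp < 0x8000 else cp - 0x10000
            out.append("\\u%d?" % n)   # '?' is the single fallback char \uc1 promises
    return "".join(out)

body = text_to_rtf_body("如果")
# "如果" is U+5982 U+679C, so body is "\u22914?\u26524?" -- the very
# control words that come up later in this thread.
rtf = "{\\rtf1\\ansi\\uc1 " + body + "}"
```

With this scheme you never have to know which codepage the reader will assume; the cost is that every non-ASCII character takes seven or more bytes.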

...It’s true that the spec says that the N for \uN is a signed 16-bit number, but RichEdit and Word accept larger values. The parameter is treated as a LONG internally and I figured early on (back in the 1990’s) that higher-plane characters would be written using (slightly) more readable UTF-32 values rather than surrogate pairs. So RichEdit supports both signed 16 and 32-bit values. UTF-32 hex values would be best, but oh well! Word does handle things like \insrsid3623724, which clearly doesn’t have a signed 16-bit number. But I believe Word requires surrogate pairs for higher-plane characters.

And developer colleague Stephan (whom I remembered from my Office days and whose words I was, as always, glad to see) stepped in with some very good info about the format, whether signed 16-bit values were always required, and why the terminating question mark would be needed at the end of each code point:

RTF keyword numeric arguments are supposed to be 16-bit signed integers, yes. So, Unicode codepoints >32K should be written as negative numbers.
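That signed-16-bit convention is easy to get wrong, so here is a hedged sketch of the two directions of the mapping (Python, for illustration; function names are mine, not from the spec):

```python
def codepoint_to_un(cp: int) -> int:
    """Write a BMP codepoint as a signed 16-bit \\uN argument."""
    # Codepoints >= 0x8000 overflow a signed short, so they wrap negative.
    return cp - 0x10000 if cp >= 0x8000 else cp

def un_to_codepoint(n: int) -> int:
    """Map a \\uN argument back to the codepoint a reader should emit."""
    return n + 0x10000 if n < 0 else n

codepoint_to_un(0xFF01)  # U+FF01 FULLWIDTH EXCLAMATION MARK -> -255
un_to_codepoint(-255)    # -> 65281, i.e. 0xFF01 again
codepoint_to_un(22914)   # U+5982 is below 0x8000, so it stays positive
```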

Note that there are exceptions to the signed short rule – I think (it’s been a few years since I wrote anything, or for that matter read anything, in the RTF spec) that the \bin keyword for example, takes 32-bit integers. In my own parser, rather than insist on only 16-bit signed values, I simply accepted 32-bit values everywhere. I can imagine this being a common approach, and so, who knows how close typical real-world readers are to the spec in this and other respects.

Another thing that may be relevant is how parsers handle the “single character fallback” \uc1 refers to. This is the ? Murray mentions. Is it “the next single byte”? What if that byte is a \ -- is the fallback the control word started by that \ ? What about a space (recall that RTF keywords may be optionally followed by a space, so it’s not obvious whether a space should be considered the fallback character instead of the optional terminator)? What about { or } ? I think the spec is now clear on this (I remember getting such answers clarified in the Unicode RTF spec before too many people saw it), but that doesn’t mean all readers agree on this subtlety.

From just the most recent reply in this thread, I can imagine that the reader in question is interpreting \u22914\u26524 as Unicode codepoint 22914, followed by one ‘character’ of fallback; where the character is the whole RTF keyword and argument \u26524 (even though the character represented isn’t ASCII – the kind of thing you’d expect in the fallback representation). A little later, there’s \u20316\b0 which seems to bear out the theory: the fallback for \u20316 is the \b0 sequence – not a reasonable fallback character at all, but consumed by the reader nonetheless. Near the end is \u12290\par\par\par} in which the wrong \par is highlighted. It’s the first \par that is the inappropriately consumed fallback for \u12290, and the second and third one are faithfully preserved.

Because the reader is capable of handling the Unicode, it ignores the fallback (which is how it gets lost, and why the meaninglessness of \b0 “bold off” as a fallback isn’t a problem). Meanwhile, when the control returns the text, it emits its own, much more reasonable, fallback for each of \u22914, etc. – a simple ?.
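Under Stephan's analysis, a reader that understands \uN should consume exactly one fallback character after it and nothing more. A toy decoder for the easy case (Python sketch; it assumes \uc1 with a literal ? fallback and is nothing like a full RTF parser):

```python
import re

# Matches \uN plus its single '?' fallback (the \uc1 case). A control word
# that happens to follow \uN is NOT part of the fallback and is left alone --
# which is exactly what the buggy reader in the thread got wrong.
UN_WITH_FALLBACK = re.compile(r"\\u(-?\d+)\?")

def decode_un(body: str) -> str:
    def repl(m):
        n = int(m.group(1))
        # Negative arguments are signed-16-bit overflow; add 0x10000 back.
        return chr(n + 0x10000 if n < 0 else n)
    return UN_WITH_FALLBACK.sub(repl, body)

decode_un("\\u22914?\\u26524?")  # -> "如果", both '?' fallbacks consumed
```

Note that this sketch only consumes a literal ?; handling \'hh fallbacks, \ucN values other than 1, and brace scoping is where real parsers start to disagree, per Stephan's point above.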

stephan();

and Greg (test guru who I sometimes think has probably investigated more RTF bugs than most other testers combined!) reminded me of something he had pointed out when the issue of these \u escapes came up previously:

Speaking of bugs, just a couple of words of caution – be careful how much stuff you encode with \u####. Older versions of RichEdit and some other RTF parsers have some bugs with picking the right font for these from the RTF stream. The RTF spec suggests that if you can use the \’ encoding you should do so (and obviously put ANSI Latin text as Latin text, not as \u escapes). Also, I have seen several folks try to encode everything with \u, including font names. Don’t do that. Some RTF readers can’t handle \u#### in font tables.
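Greg's \' alternative keeps the text as raw codepage bytes rather than Unicode escapes. A minimal sketch of what that looks like (Python; the choice of CP936/GBK for Simplified Chinese, paired with an \ansicpg936 header, is illustrative):

```python
def to_hex_escapes(text: str, codepage: str = "cp936") -> str:
    """Write text as RTF \\'hh byte escapes in the given codepage."""
    return "".join("\\'%02x" % b for b in text.encode(codepage))

to_hex_escapes("A")           # even ASCII comes out escaped here: "\'41"
body = to_hex_escapes("如果")  # two GBK double-byte characters -> four \'hh escapes
rtf = "{\\rtf1\\ansi\\ansicpg936 " + body + "}"
```

The catch, of course, is the one Murray raised above: you now have to know the right codepage for the text, and fall back to \uN anyway for anything the codepage can't represent.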

I still don't mind encoding the RTF with the \u escapes, though I try to keep ASCII and actual formatting metadata out of it when I can because of this....

So if you are handling almost every character as a decimal Unicode number and putting a question mark at the end of each one, then your RTF will be about as compact as you can make it (the RTF still won't be readable, but at least you can see the code points).

It is really terrible that RTF has all of these crufty requirements that are handled so much better by HTML and by XML with specific XSL as needed. That is what I am recommending more and more.

Because in my opinion, RTF is a terrible, horrible, no good, very bad file format, and I think it should be sent to Australia. :-)

This post brought to you by ᙀ (U+1640, a.k.a. CANADIAN SYLLABICS CARRIER ZU)

>It is really terrible that RTF has all of these crufty requirements that are handled so much better by HTML and by XML with specific XSL as needed.

The rich edit control also offers UTF-8 RTF which might provide a better solution for including Unicode in RTF. It is much easier to parse than those \u... expressions with complex fallback characters. There doesn't seem to be much support for this elsewhere, though.

As someone who used to own the RTF spec back in the late '80s to early '90s, I'd like to say that RTF is a perfectly good file format, so long as you don't expect it to use Unicode. RTF was defined in the early '80s, long before Unicode came along - why would you expect it to work well?