Posts from Monday, March 3rd, 2003

After more than a year of sitting bolt upright in a chair whose back was about 20 degrees from horizontal, Kat finally got me to buy a new chair on Saturday. I assembled it this morning, which anyone who knows me will tell you is astonishing on two counts:

I put it together less than a month after I bought it. Usually I let a project like that sit for a while, to let it come to the proper sense of fullness. Or else because I’m lazy.

I put it together, period. I’m not what you would call handy with a toolbox.

I did put the armrests on backwards, but I did that on purpose. They look cooler this way.

Font and text handling seem to occupy more and more of my attention of late. Here’s another good example of the problems we face: character encoding. This morning I dropped by the O’Reilly Network and spotted some badly mangled text. Apparently that’s supposed to be a “ü” in there, since that’s what the referenced article shows. How did this happen? No doubt somebody copy-and-pasted the text from a word processor into a CMS interface, and it looked fine on their machine when they previewed the text. Unfortunately, in my Web browser, no such luck. (This was in IE5.1.4/MacOS9.1, but a quick check in a recent Mozilla build showed the same problem.) It may have gone through some XSLT for extra munging, for all I know.
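For the curious, here is a minimal sketch of one common way a “ü” ends up mangled. It assumes the text was stored as UTF-8 and then decoded as ISO-8859-1 somewhere downstream; the post does not say which encodings were actually involved on the O’Reilly page, so treat the pairing as illustrative.

```python
# Illustrative only: the encodings below are assumptions, not a diagnosis
# of the page described above.
original = "ü"

# Stored as UTF-8, the single character becomes two bytes.
utf8_bytes = original.encode("utf-8")            # b'\xc3\xbc'

# A reader that assumes ISO-8859-1 turns each byte into its own character.
mangled = utf8_bytes.decode("iso-8859-1")        # 'Ã¼'

# The opposite mismatch: an ISO-8859-1 byte read as UTF-8 is simply invalid.
latin1_bytes = original.encode("iso-8859-1")     # b'\xfc'
replaced = latin1_bytes.decode("utf-8", errors="replace")  # '\ufffd'


def repair(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up (hypothetical helper)."""
    return text.encode("iso-8859-1").decode("utf-8")


print(mangled)          # Ã¼
print(repair(mangled))  # ü
```

The repair only works when you already know which wrong guess was made, which is rather the point: nothing in the bytes themselves says what they were meant to be.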

I have a little experience with the encoding problems that can arise when you’re working with XML and XSLT. If you want to use HTML-style character entities, you have to write a stylesheet that defines every last entity you might use, which is kind of weighty, although I do it for this journal’s XML files. For the new DevEdge, we wrote a separate namespaced transform based on the old entities. In our world, a “u” with an umlaut is <ent:uuml/>; an “A” with a ring is <ent:Aring/>. Of course we also have documents that are encoded for localization (e.g., DevEdge Japan) by their authors, and nobody else can touch them for fear that we’ll break the encoding. For that matter, when we had an inline JavaScript alert for our printer-friendly links, the spaces in the value were encoded as %20. Every browser showed those as spaces in the link, except Opera, which showed the raw text (“This%20page%20is%20already…”). Is it right to do this? Is it wrong? I don’t know. Do I care? Not really.
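To make the entity idea concrete, here is a rough sketch of expanding `<ent:uuml/>`-style elements into real characters. It is not the DevEdge transform, which was XSLT; this is a regex stand-in that ignores proper namespace handling, and the `expand_entities` name is made up for illustration. It leans on Python's standard table of HTML entity names, and the last line shows what a `%20`-encoded string decodes to.

```python
import re
from html.entities import name2codepoint
from urllib.parse import unquote

# Matches empty elements such as <ent:uuml/> (the prefix comes from the post;
# real XML processing would resolve the namespace rather than match a prefix).
ENT_TAG = re.compile(r"<ent:([A-Za-z][A-Za-z0-9]*)\s*/>")


def expand_entities(markup: str) -> str:
    """Replace <ent:NAME/> with the character HTML calls &NAME; (hypothetical helper)."""
    def replace(match: re.Match) -> str:
        codepoint = name2codepoint.get(match.group(1))
        # Leave unknown names untouched rather than guessing.
        return chr(codepoint) if codepoint is not None else match.group(0)

    return ENT_TAG.sub(replace, markup)


print(expand_entities("M<ent:uuml/>nchen and <ent:Aring/>lesund"))
# München and Ålesund

print(unquote("This%20page%20is%20already"))
# This page is already
```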

In a like vein, I recently found out why recent e-mail messages from a certain well-known CSS luminary look like an encoded binary to me, while his responses to other authors’ messages on listservs look just fine: he’s sending out 8-bit text in ISO-8859-1, and something between his fingers and my eyes is munging the text into 7-bit ASCII. If he sends a message as 7-bit text, there are no problems. I’m not sure if it’s my aging mail client or a server along the message’s path from him to me. Again, I don’t care. I shouldn’t have to care.
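Since the culprit here is unknown, the following is only a sketch of what can happen to 8-bit ISO-8859-1 text on a 7-bit path, not a reconstruction of this particular failure. It shows a hop that blindly clears the high bit, and then quoted-printable, one standard way of carrying 8-bit text through 7-bit channels intact. The sample word is made up.

```python
import quopri

body = "Zürich"                                  # sample text, not from the post
latin1 = body.encode("iso-8859-1")               # b'Z\xfcrich' (one 8-bit byte)

# A 7-bit hop that just drops the high bit silently changes the character:
# 0xFC becomes 0x7C, which is '|'.
stripped = bytes(b & 0x7F for b in latin1).decode("ascii")
print(stripped)                                  # Z|rich

# Quoted-printable writes the 8-bit byte as '=FC', so every byte on the
# wire is plain ASCII and the receiver can restore the original.
wire = quopri.encodestring(latin1)
print(wire)                                      # b'Z=FCrich'

restored = quopri.decodestring(wire).decode("iso-8859-1")
print(restored)                                  # Zürich
```

If the receiving end never decodes the transfer encoding, or decodes it with the wrong character set, the reader sees the raw encoded form, which would be one way a message ends up looking like an encoded binary.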

It seems that the more powerful our tools become, the more ways we have to break the flow of information. This to me is exactly the opposite of what should be happening. It’s not that hard to implement character encoding, and it’s not that hard to agree on a character format. We (as an industry) just haven’t done it to the necessary extent, and there’s really no excuse for this fact. A character should be a character. If Unicode is the answer, then great, let’s do it.

As is common for my little technology rants, I don’t have a solution, only questions. My biggest question is, “How long until we fix this basic problem?” I don’t even care about how, really. Just when.

Today is a triple-three, for those of you who care and use two-digit date formatting: 03/03/03. I wonder if any lotteries will have that number come up tonight. I still remember when the American Embassy hostages were released by Iran after 444 days in captivity, and that night one state lottery’s Pick 3 came up 444. Those kinds of coincidences are always fascinating to me.