Archive for Language

Pankaj Narula dropped a friendly comment on my previous post on Hindi and Unicode, explaining that Hindi blogs are in fact almost universally encoded as Unicode, thanks in large part to Blogger.com’s good Unicode support. And so it seems that among Hindi bloggers at least, everyone is quite up-to-date with their language technology…

Just for fun I decided to poke around in the Tamil blogosphere to see if the situation was similar, and it turns out that Blogspot is equally prominent among Tamil bloggers:

Of the 613 blogs in Tamil listed in the directory at the Tamil Bloggers List, 513 are hosted on Blogspot–so we can assume that most Tamil blogs are encoded in Unicode.

After a bit of digging I could only find one blog among the non-Blogspot crowd that seemed to have encoding troubles — “peyariliyin pinaaththalgaL.” At first I thought it was in some mysterious legacy encoding, but it turns out that blogdrive.com seems to have its servers set to send Windows-1252. This one, on the same server, specified more fonts and ended up being visible to me. So it was mainly a font thing.

(Incidentally, poking about in the occasional English comment in Tamil blogs, it seems to be the case that the language name is often transliterated as Tamizh rather than Tamil.)

Try typing the string jxauxdo in that box. And press “Trovu” if you like; that will search Google for ĵaŭdo (Esperanto for “Thursday”). Notice that jx → ĵ and ux → ŭ “on the fly,” as you type. (Come to think of it, maybe “transliteration” isn’t the right word for this process…)

So, backing up a bit, Esperanto has a few odd characters in its orthography:

Even today those characters are relatively rare in fonts–if you can’t see them I imagine this post may not make too terribly much sense. 8^)

The good doktoro even got a little flak back in the day, for choosing to include such unusual characters in a supposedly universal language. Nowadays, however, they’re all in Unicode–here’s the full info for ŝ, for example:

U+015D LATIN SMALL LETTER S WITH CIRCUMFLEX ŝ

But pragmatically speaking, there’s still a problem with input. Suppose you are a gold-star-wearing, green-flag-waving Esperanto aficionado, and you want to post something on the internet. How do you actually type these characters? The “right” answer is that you install a keyboard layout for the language in question and memorize it.

This is a pain, of course.

And it’s nothing new: in the (typographical) bad old days of all-ASCII USENET, Unicode wasn’t widely available, and what people would generally do (for many languages, not just Esperanto) was to come up with all-ASCII transliteration systems. The “x-system” added to the table above was probably the most popular. It so happens that there is no letter x in Esperanto, so it didn’t cause any massive problems with ambiguity.
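The whole x-system boils down to simple substitution. Here is a minimal Ruby sketch of the idea — the digraph table is the standard x-system mapping, but the method name is my own invention:

```ruby
# Map x-system digraphs to the Esperanto letters they stand for.
X_SYSTEM = {
  "cx" => "ĉ", "gx" => "ĝ", "hx" => "ĥ",
  "jx" => "ĵ", "sx" => "ŝ", "ux" => "ŭ",
  "Cx" => "Ĉ", "Gx" => "Ĝ", "Hx" => "Ĥ",
  "Jx" => "Ĵ", "Sx" => "Ŝ", "Ux" => "Ŭ",
}.freeze

# Replace every x-system digraph in a string with its Unicode letter.
# Unknown combinations (say, "cX") pass through untouched.
def from_x_system(text)
  text.gsub(/[cghjsu]x/i) { |digraph| X_SYSTEM.fetch(digraph, digraph) }
end

from_x_system("jxauxdo")  # => "ĵaŭdo"
```

Since no Esperanto word contains x, a plain regex replacement like this never eats a legitimate letter.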

And the function gets called with an onkeyup="xAlUtf8(this.value)" inside the input tag.

(Using an inline onkeyup is actually sort of verboten these days–it should be done unobtrusively, with event listeners attached from script, etc.)

So anyway, that’s a pretty interesting way to enter some unusual characters. It’s interesting to muse on just how far one could take this approach. Would it be possible to create a script that would handle an entire writing system? Say, a script that would convert an entire textarea from an ASCII-based transliteration to Unicode characters, on the fly? Japanese and Chinese are definitely excluded from this approach (every Chinese character in RAM? Er, no.) but people who use those languages generally already have keyboard input taken care of.

That would be neat: you could, for instance, have textareas where users without keyboard layouts could input something in Amharic or Persian or whatever without having the keyboard layout actually installed.

But as it stands, it’s just simple substitution, and no string which is to be substituted can be a substring of another such string. In order to handle a more generalized set of substitutions, you’d probably need to use a trie structure. (There’s a nice trie implementation in Python by James Tauber.)
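To see how a trie gets around the substring restriction, here is a rough Ruby sketch that always takes the longest key matching at each position. The substitution table and method names here are made up for illustration:

```ruby
# Build a trie from a substitution table; each node maps a character to a
# child node, and :out holds the replacement for a complete key, if any.
def build_trie(table)
  root = {}
  table.each do |key, replacement|
    node = key.chars.reduce(root) { |n, ch| n[ch] ||= {} }
    node[:out] = replacement
  end
  root
end

# Scan the input, taking the longest key that matches at each point;
# characters that start no key are copied through unchanged.
def substitute(text, trie)
  result = ""
  chars = text.chars
  i = 0
  while i < chars.length
    node = trie
    match_len = nil
    match_out = nil
    j = i
    while j < chars.length && (node = node[chars[j]])
      j += 1
      if node.key?(:out)
        match_len = j - i
        match_out = node[:out]
      end
    end
    if match_len
      result << match_out
      i += match_len
    else
      result << chars[i]
      i += 1
    end
  end
  result
end

# "a" is a substring of "aa", which plain substitution can't handle safely.
trie = build_trie("a" => "1", "aa" => "2", "b" => "3")
substitute("aab", trie)  # => "23"
```

With longest-match-wins, overlapping keys like "a" and "aa" coexist peacefully, which is exactly the property an on-the-fly transliterator for a whole writing system would need.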

I’m sure there are complications that would arise from what’s called “font shaping” — that is, how operating systems combine adjacent characters. In Arabic or Thai, for instance, characters vary depending on which characters they’re adjacent to. How does this process affect text in textareas, for instance, or text which is mushed around with Javascript?

This is forward-thinking on the part of the Indian government; for a long time it seemed to be the case that the only major website that encoded Hindi in UTF-8 was a foreign site, BBCHindi. Most news sites in Hindi use one of a bewildering array of proprietary encodings, each with a proprietary font to accompany it (presumably intended to lock in users).

… and replace length with jlength. You don’t have to change anything else in your source, which is rather nice. (In Python, for instance, you have to label Unicode strings.) You can just put Unicode stuff right in your source files, and pretty much think of those strings as “letters” in an intuitive way. Pretty much.

That’s the way I think of Unicode: it allows you to think of letters as letters, and not as “a letter of n bytes.” (Remember letters, o children of the computer age?)

Behind the scenes, your content will (probably) be stored in UTF-8, which is a variable-length, multi-byte encoding. This means that when you see the following:

U+062A ARABIC LETTER TEH ت

That single letter is actually two bytes behind the scenes (0xD8 and 0xAA).

However, when you see:

U+3041 HIRAGANA LETTER SMALL A ぁ

…there are three bytes behind the scenes (0xE3, 0x81, and 0x81). ASCII letters are still one byte, as ASCII is a subset of Unicode.
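In a modern Ruby (1.9 and later, where strings know their own encoding), both views of a string are easy to inspect, and the byte values match the ones above:

```ruby
teh = "ت"        # U+062A ARABIC LETTER TEH
teh.length       # => 1 (one letter)
teh.bytesize     # => 2 (two bytes in UTF-8)
teh.bytes        # => [216, 170], i.e. 0xD8 and 0xAA

small_a = "ぁ"   # U+3041 HIRAGANA LETTER SMALL A
small_a.length   # => 1
small_a.bytes    # => [227, 129, 129], i.e. 0xE3, 0x81, 0x81
```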

Unicode hides all that nonsense from you as a programmer. You can tell a program to count the number of letters in an Arabic word or a Japanese word and it will tell you what you really want to know: how many letters are in those words, not how many bytes. Who cares how many bytes happen to be used to encode a U+062A ARABIC LETTER TEH? It’s just a letter!

So yeah. End rant.

But there are still some rough patches in Ruby’s Unicode support (or in my understanding of it; a correction of either would be appreciated).

As Mike mentioned, there are about a bazillion methods in String, so there’s a lot more testing that could be done. I guess one approach to problems like these would be to write jindex, jreverse, and so on. The approach I have in mind (converting strings to arrays) would probably be slow… these are the kind of functions that I imagine would best be implemented way down in C, where linguistics geeks like myself dare not tread.
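The convert-to-an-array approach might look like this in a modern Ruby — jreverse is a hypothetical name, echoing jlength:

```ruby
# A character-wise reverse, done by splitting into an array of characters
# rather than reversing bytes (which would mangle multi-byte letters).
def jreverse(str)
  str.chars.reverse.join
end

jreverse("ĵaŭdo")  # => "odŭaĵ"
```

In Ruby 1.9 and later, String#reverse is already character-aware, so a sketch like this is mainly of historical interest — but it does show why a pure-Ruby version would be slower than doing the same thing down in C.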

Thanks to Chad Fowler for catching an error in an earlier version of this post… oops!

A few years back I taught ESL, and I had several Thai students. Thai names are quite complex; the post above goes into the details on the “official” Thai names, which are derived from Pali and generally given to people by Buddhist monks.

But that’s not the end of it. Most people also have a short, one- or two-syllable nickname. And if they’re of Chinese heritage, they often have a Chinese given name as well, and finally a family name.

I think I will start collecting names for myself in different languages, heh.