Everything to do with phonetics. Please note: comments not signed with your genuine name may be removed.

Tuesday, 20 April 2010

inputting accented letters

As I have mentioned before (blog, 26 Aug 2008 and 25 Jan 2010), the Guardian newspaper is now able to print accented (= diacritic-bearing) Latin letters not only for west European languages but also for those from eastern Europe. My picture shows part of a page from the 15 April edition of the paper. In the course of an obituary for Anna Walentynowicz, “godmother of the Solidarity trade union” they managed Gdańsk and Wybrzeża correctly (though they didn’t attempt Solidarność). But look what happened to Wałęsa! The barred l was fine, but the ogonek under ę turned into a cedilla instead, giving ȩ.

I don’t know whether the Guardian journalist tried to compose the letter from an e plus a diacritic, and chose the wrong diacritic, or whether s/he just picked the wrong precomposed letter from a dropdown list. Either way, it provokes the question: what language, if any, has U+0229 LATIN SMALL LETTER E WITH CEDILLA in its orthography? If none, who uses this symbol, and for what? Presumably it must be or have been in use, because otherwise Unicode would not have admitted it. (It’s in the Unicode block Latin Extended-B, in a section headed “Miscellaneous additions”.)

Can anyone enlighten us?

Stranger still, at U+1E1D (Latin Extended Additional) Unicode also recognizes LATIN SMALL LETTER E WITH CEDILLA AND BREVE, i.e. ḝ. You can check it out on the on-line Unicode site here.

The on-line version of the obituary has ę correctly.
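For anyone who would rather check these code points locally than on the Unicode site, here is a minimal Python sketch using the standard unicodedata module. It only reports the official name and canonical decomposition of the three letters discussed above; the helper name describe is my own, not anything from Unicode.

```python
import unicodedata

# The letter that was wanted (ę), the one that was printed (ȩ),
# and the e with cedilla and breve (ḝ).
letters = ["\u0119", "\u0229", "\u1e1d"]

def describe(ch):
    """Return (code point label, Unicode name, canonical decomposition)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.decomposition(ch),
    )

for ch in letters:
    print(*describe(ch), sep="  ")
```

The decomposition column shows that ę and ȩ differ only in the combining mark attached to the plain e (U+0328 ogonek versus U+0327 cedilla), which is consistent with either of the two slips suggested above.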

Commenting on yesterday’s blog, with reference to the character ö, one contributor apologized “Sorry, I don’t have easy access to diacritics”. I was wondering just what input device he was using that can’t input an o-umlaut. Even a mobile phone can do that. On any Windows computer you can key in
• Alt+0246 (on the numerical keypad)
• or on some computers Alt+148

I am sure you can do something similar on a Mac (option-apostrophe?). If using Word you can do
• Ctrl+(Shift+):, o
• or 00F6, Alt-X

For HTML, including blog comments, you can also type
• &#246;
• or &#xF6;
• or &ouml;.

Perhaps these don’t count as “easy access”. Or perhaps people just don’t know about them. If anything, we have a confusingly large range of possibilities.
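The numbers in these recipes are not arbitrary, and a short Python sketch can show how they relate. The first part decodes the three HTML references with the standard html module. The second part rests on an assumption of mine that is not stated above: that Alt+0246 is looked up in the Windows-1252 code page and Alt+148 in the old DOS code page 437, which is the usual arrangement on a western European or US system but may differ elsewhere.

```python
import html

# The three HTML spellings: decimal, hexadecimal, and named entity.
references = ["&#246;", "&#xF6;", "&ouml;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)

# Assumption: Alt+0246 uses Windows-1252, Alt+148 uses code page 437.
from_ansi = bytes([246]).decode("cp1252")
from_oem = bytes([148]).decode("cp437")
print(from_ansi, from_oem, f"U+{ord(from_ansi):04X}")
```

Under that assumption every route ends at the same character, U+00F6; the different numbers simply reflect different numbering schemes for it.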

The Unicode NamesList file says that E WITH CEDILLA is used by Uralicists, which means it's part of the Uralicist (aka Finno-Ugric) Phonetic Alphabet. This file is the repository of such usage information as Unicode has. Unfortunately, I can't find an online explanation of (F)UPA in English.

As for the Latin Extended Additional block, it's the result of an old compromise. Originally, Unicode and ISO 10646 were to be separate standards. After a lot of work, Unicode 1.0 and the Draft International Standard (DIS) were merged to produce Unicode 1.1, which was also the initial version of ISO 10646 and included all the characters present in either standard. This block contains the Latin letters that the DIS (which was at that stage basically a list of names without explanations) contained but Unicode 1.0 did not, and for which the provenance is basically unknown. In a few cases, NamesList has explanations added after the fact, but 1E1D is not one of those cases.

I'm a Pole and today's entry prompted me to do some basic research (that is, Wikipedia) on the ogonek (which, as English wiki aptly informs us, means "a little tail"). Interestingly, Polish wiki says that the letter ę was used also in Latin spelling from the 12th century onwards. In Polish, the letter ę corresponds to the nasal vowel /ɛ̃/. Would anybody be so kind as to answer whether this was also the case in Medieval Latin?

The way I've heard it, the Latin 'ae' that turned into a monophthong and later merged with 'e' was written with the 'e caudata', which looks like ę, in Mediaeval Latin. In fact, the standard explanation seems to be that Polish ogonek came from the 'e caudata'. I hadn't heard about the 'e cedilla' being used for this purpose.

One other thing: Unicode doesn't encode all letter-accent combinations. The theory is that all such combinations can be encoded with the base letter (like U+0065 'latin small letter e') and the combining accent (like U+0328 'combining ogonek'). In practice, of course, most such combinations are encoded in Unicode as a single character (like U+0119 'latin small letter e with ogonek'), because of compatibility issues with existing code standards which generally encoded accented letters as single characters. The vast majority of Unicode-compliant text encodes accented letters as single characters where available.
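The two encodings of ę can be compared directly. This is a minimal Python sketch, assuming only the standard unicodedata module; it converts between the base-plus-combining form and the single-character form using Unicode normalization (NFC composes, NFD decomposes). The variable names are illustrative.

```python
import unicodedata

# Base letter U+0065 followed by U+0328 COMBINING OGONEK: two code points.
decomposed = "e\u0328"

# NFC replaces the pair with the single character where one exists.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), f"U+{ord(composed):04X}")

# NFD goes the other way, splitting U+0119 back into base plus mark.
round_trip = unicodedata.normalize("NFD", composed)
print(round_trip == decomposed)
```

The two strings look identical on screen but compare unequal as raw code points, which is why text is normally normalized to one form before searching or sorting.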

Therefore, if an accented Latin letter is encoded in Unicode, it means that there is some language or transcription system out there that uses it, but the converse is not true. There are accented letters in use in some orthographies which can only be encoded as the base letter plus combining diacritic in Unicode and not as a single character, because they were obscure enough not to be included in pre-Unicode code standards.

And the umlaut has always been Alt+u, V on my PCs too. I was used to Macs from the time they were invented, and it was grievous to have to accept in due course that transferring to the death-dealing PC was inevitable. Ctrl+(Shift+):, V was the least of my annoyances, but I immediately set up the key combination Alt+u, V to mimic the Mac. In fact I considered most of the key combs native to Word to be rubbish. It's always worth setting up your own less fussy or more intuitive ones (until you have to use someone else's computer).

I'm using Linux. My default keyboard layout has ö as AltGr+[, followed by a lower-case "o". This makes äëïöüẅÿ trivially easy. Better still, AltGr+;'# followed by vowels offers acute, circumflex, and grave accents respectively. And a cedilla is on AltGr+=. There are other accents on other keys (for instance AltGr+@ gives a haček), but I don't tend to use them much.

“Therefore, if an accented Latin letter is encoded in Unicode, it means that there is some language or transcription system out there that uses it”

Unfortunately not. In earlier versions of Unicode, Turkish s-cedilla and Romanian s-comma and t-comma were unified as s-cedilla and t-cedilla (a state of affairs inherited from ISO 8859-2). When this overunification was undone by adding proper s-comma and t-comma characters, t-cedilla became useless, as no language or transcription system uses such a thing. WP says that it was once proposed for use in learned French words ending in -tion, -tial where orthographic t is pronounced /s/, but never adopted.
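The four characters involved can be listed side by side. This is a small Python sketch using the standard unicodedata module; the dictionary layout is just for illustration. It prints each character's name and canonical decomposition, which shows the cedilla forms built on U+0327 and the comma-below forms on U+0326.

```python
import unicodedata

# For each base letter: (cedilla form, comma-below form).
pairs = {
    "s": ("\u015f", "\u0219"),
    "t": ("\u0163", "\u021b"),
}

for base, (cedilla, comma) in pairs.items():
    for ch in (cedilla, comma):
        print(
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            "=",
            unicodedata.decomposition(ch),
        )
```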

I'm currently studying Navajo, and the diacritics in use for this language are causing some fun problems with various applications. Navajo has tones, with low unmarked and high marked with the acute accent. Navajo also distinguishes between oral and nasal vowels, with oral unmarked and nasal marked with the ogonek. The combinations Vowel + Acute + Ogonek do not seem to exist anywhere as distinct code points, and the combining diacritics don't always work properly -- for instance, MS Office doesn't recognize U+0328 'combining ogonek', inserting a '2' instead; meanwhile, using vowel + ogonek + combining acute does funny things with the letter 'i' due to the tittle (dot). <... sigh ...>

Doh! Found my own solution. After some more poking around, I was pointed to a font issue -- Arial Unicode MS seems to have the proper glyphs for 'combining ogonek', among others, allowing me to use i + acute + combining ogonek to get around the tittle problem. Hope this helps someone else.
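The order in which the two diacritics are typed should not matter to the stored text once it is normalized, whatever a particular font or application then does with it. Here is a minimal Python sketch of that, using the standard unicodedata module and the vowel a as an example; the labels in the dictionary are my own.

```python
import unicodedata

def nfc(text):
    """Normalize to NFC, the composed form."""
    return unicodedata.normalize("NFC", text)

# Four ways of entering a nasal high-tone a (ogonek plus acute).
variants = {
    "a + acute + ogonek": "a\u0301\u0328",
    "a + ogonek + acute": "a\u0328\u0301",
    "precomposed á + ogonek": "\u00e1\u0328",
    "precomposed ą + acute": "\u0105\u0301",
}

normalized = {label: nfc(text) for label, text in variants.items()}
for label, text in normalized.items():
    print(label, [f"U+{ord(c):04X}" for c in text])
```

All four come out as U+0105 followed by U+0301: the ogonek is absorbed into a precomposed ą, and the acute stays as a combining mark because no single code point exists for the full combination. Rendering that sequence properly is then up to the font, as found above.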

The note in the Unicode NamesList.txt file about Uralicist usage seems to apply to the letter "A WITH DOT ABOVE", not to "E WITH CEDILLA".

The letter e has two pronunciations in Latvian: "narrow" [ɛ] and "wide" [æ]. E with cedilla ȩ, while not a part of the standard Latvian orthography, is sometimes used in dictionaries and textbooks to indicate the "wide" pronunciation. The long [æ:] sound is written as e with cedilla and macron: ȩ̄. This letter doesn't seem to be available precomposed in Unicode. (The accent in ȩ is a real cedilla, not a comma as in Latvian palatalized consonants ķ, ļ, ņ, ŗ.)

The letter ȩ is also used in the Latvian phonetic alphabet, an extended version of the standard Latvian alphabet used by local linguists. There it can take some other accents to indicate syllable tone: ȩ̃, ȩ̂, ȩ̀.

In Unicode, ȩ is followed by some letters for Livonian, a near-extinct Finno-Ugric language spoken in Latvia. I don't know whether this sequence is accidental or not.

The letter ḝ is a mystery to me.
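Whether a given combination is available precomposed can be tested by normalizing it to NFC and counting the code points that remain. This is a rough Python sketch using the standard unicodedata module; it compares e with cedilla and macron against e with cedilla and breve, and the variable names are illustrative.

```python
import unicodedata

def nfc(text):
    """Normalize to NFC, composing wherever a single character exists."""
    return unicodedata.normalize("NFC", text)

wide_long = nfc("e\u0327\u0304")   # e + cedilla + macron
with_breve = nfc("e\u0327\u0306")  # e + cedilla + breve

print([f"U+{ord(c):04X}" for c in wide_long])
print([f"U+{ord(c):04X}" for c in with_breve])
```

The macron version stays as two code points (U+0229 plus a combining macron), while the breve version collapses to the single character U+1E1D, so the mysterious letter is encoded and the attested Latvian one is not.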