Friday, 8 June 2007

What's new in Unicode 5.1 ?

Back in November 2005 I asked What's new in Unicode 5.0 ? in anticipation of its release in July of the following year. Now that Unicode 5.0 has been out for nearly a year I thought it would be a good time to look ahead to what is in store for Unicode 5.1. Just to be clear, Unicode 5.1 won't be released until the spring or summer of 2008, but the character repertoire is already basically fixed, and there are unlikely to be any major changes (but if there are I will update this post). Well, in the end there was one major change -- see the addendum at the bottom of the page [2007-10-19]. See the bottom of the post for a list of fonts with Unicode 5.1 coverage.

The additions to Unicode 5.1 will correspond to Amendments 3 and 4 of ISO/IEC 10646:2003. A total of 1,102 new characters are added in Amd.3, although four (U+097B, U+097C, U+097E and U+097F) are already in Unicode 5.0, and a total of 526 new characters are expected to be added to Amd.4, so that Unicode 5.1 will have 1,624 additional characters compared with Unicode 5.0, making a grand total of 100,713 encoded characters (graphic, format and control characters) in Unicode, breaking the 100K mark for the first time (and for all those who are worried that 17 planes are just not enough, that still leaves room for another 873,817 characters).
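The arithmetic behind these totals can be checked with a short sketch. The Amd.3, Amd.4 and grand-total figures come from the paragraph above; the sizes of the surrogate, private-use and noncharacter ranges are not stated in the post and are my assumption about what has to be excluded from the 17 planes to arrive at 873,817.

```python
# Figures quoted in the post.
AMD3_NEW = 1_102            # characters added by Amendment 3
AMD3_ALREADY_IN_5_0 = 4     # U+097B, U+097C, U+097E, U+097F
AMD4_NEW = 526              # characters added by Amendment 4
TOTAL_5_1 = 100_713         # graphic, format and control characters

# Assumptions (not stated in the post): code points that can never
# hold an ordinary encoded character.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000
SURROGATES = 2_048                  # U+D800..U+DFFF
PRIVATE_USE = 6_400 + 2 * 65_534    # BMP private use area + planes 15 and 16
NONCHARACTERS = 32 + 2 * PLANES     # U+FDD0..U+FDEF + last two of each plane

added = AMD3_NEW - AMD3_ALREADY_IN_5_0 + AMD4_NEW
total_5_0 = TOTAL_5_1 - added
codespace = PLANES * CODE_POINTS_PER_PLANE
remaining = codespace - SURROGATES - PRIVATE_USE - NONCHARACTERS - TOTAL_5_1

print(f"added in 5.1:    {added:,}")
print(f"implied 5.0:     {total_5_0:,}")
print(f"room remaining:  {remaining:,}")
```

Under those assumptions the sums agree with the figures in the text: 1,624 additions and 873,817 code points still available.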

The additions for 5.1 are not as controversial as those for 5.0, and may not be as exciting as 5.2 promises to be, but they will include eleven new scripts [Lanna now postponed to Amd.5], which nearly equals the record set by 3.0 for the largest number of scripts added in a single version of Unicode. From 5.1 Unicode will cover 75 scripts (including Braille, which is classified as a script in Unicode), as shown in the table below. Regular readers of my blog will realise that there are still many more historic and less common scripts waiting to be encoded.

*Numbers of characters do not necessarily represent the total number of encoded characters used for the script (and are not necessarily the same as the number of characters in the same-named block), but are the number of characters that are uniquely assigned to that script by Unicode (i.e. excluding characters that have the Unicode script property of "common" or "inherited"). Some differences in the figures for particular scripts (e.g. Katakana and Latin) reflect changes in script assignment in Unicode 5.1.

For me, the highlights of Unicode 5.1 are the encoding of the symbols on the enigmatic Phaistos Disc (first proposed for encoding ten years ago, but delayed because of some opposition to encoding undeciphered symbols found on a unique artefact), and the encoding of a wide range of letters used in medieval manuscripts and early printed books, so that finally texts such as The Calixtus Bull can be represented exactly as they are written. The script that has had the biggest makeover for 5.1 is Myanmar, with changes to the encoding model to finally make it useable, as well as additions to support minority languages such as Mon, S'gaw Karen, Western Pwo Karen, Eastern Pwo Karen, Geba Karen, Kayah, Shan and Rumai Palaung (see Andrew Cunningham's The Myanmar script and Unicode for a useful overview of support for the Myanmar script). And then there are a handful of Tibetan (U+0FCE, U+0FD2..U+0FD4), Mongolian (U+18AA) and CJK (U+9FC3) characters that I am responsible for, which I am of course pleased to see make it into the standard.
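For anyone who wants to see what those code points ended up as, here is a minimal sketch using Python's standard unicodedata module. It assumes a Python build whose bundled character database is at version 5.1 or later; the helper name describe and the placeholder string are mine, purely for illustration. Note that the module does not say in which version a character was added, only whether the database knows a name for it.

```python
import unicodedata

# The Tibetan, Mongolian and CJK code points mentioned in the post.
CODE_POINTS = [0x0FCE, 0x0FD2, 0x0FD3, 0x0FD4, 0x18AA, 0x9FC3]


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME', or a placeholder if the database has no name."""
    name = unicodedata.name(chr(cp), "<no name in this Unicode database>")
    return f"U+{cp:04X} {name}"


lines = [describe(cp) for cp in CODE_POINTS]

print(f"Unicode database version: {unicodedata.unidata_version}")
for line in lines:
    print(line)
```

If any line shows the placeholder, the interpreter's character database predates the character, which should not happen with any current Python 3 release.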

Amendment 3

Amendment 3 is now at the FDAM stage of the ISO ballot process, and its repertoire is fixed, so the code points given below can be relied on. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.

Amendment 4

Amendment 4 is now at the FPDAM stage of the ISO ballot process, and its repertoire is unlikely to change significantly, but there may be changes, and the code point allocations could possibly change. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.

What's Not in Unicode 5.1

Egyptian Hieroglyphs (an initial set of 1,063 characters corresponding to Gardiner's Sign List) are not in 5.1, but are in Amd.5 which is currently undergoing its first ballot, and should correspond to Unicode 5.2 (there will probably be several minor versions before Unicode 6.0 is published). Other scripts that are in Amd.5 are Meitei Mayek, Bamum (removed for further study), Tai Viet and Avestan. Amd.5 also includes two new blocks for a set of controversial Old Hangul Jamo.

Not yet ready for inclusion in Unicode 5.2 is Tangut. A first proposal has now been submitted to the UTC, but has not yet reached WG2. Because of the complexity of the Tangut repertoire and probable issues about "ownership" of the script, it may take some time to reach an agreement on encoding Tangut, and so may not be in Unicode for a few more versions yet. [Well, I was wrong about that—it has made it into Amd.6 which means that it is scheduled for inclusion in Unicode 5.2]

However, the big and unexpected hole in 5.1 (Amd.4) is CJK-C, which is the first installment of the tens of thousands of additional Han characters submitted for encoding by members of the Ideographic Rapporteur Group (IRG). This set of 4,219 CJKV ideographs was included in PDAM4, but was moved from Amd.4 to Amd.5 at the last WG2 meeting (in Frankfurt at the end of April). I will look at CJK-C in more detail in my next post.

Addendum [2007-10-19]

At the WG2 meeting in Hangzhou last month (which I had hoped to attend if it was in Ürümqi as originally planned) two important changes to the Amd.4 repertoire were made.

Firstly, 17 additional Myanmar characters (including 10 Shan digits) were added in order to complete the extensions to the Myanmar script required to support the Shan language.

Secondly, the agreement on encoding the Lanna script achieved at the Frankfurt WG2 meeting in the Spring fell apart, with China demanding significant changes to the proposal. The end result was that Lanna was removed from Amd.4, and put back to Amd.5 (this will mean that it will miss the train for Unicode 5.1 next year). In addition, the script name is to be changed to TAI THAM due to objections to the name "Lanna" by China. (There have been a lot of disputes over script names recently, with user communities objecting to traditional English script names such as Pollard and Fraser.)

So now the repertoires of Amds. 3 and 4 have been finalised, and consequently the contents of Unicode 5.1 are now fixed, and will be going beta in the Spring. However, I think that Amd.5 is going to be the interesting one, as it includes both CJK-C and Egyptian hieroglyphs (but with Bamum removed by request of the user community, and Meitei Mayek removed due to fierce differences of opinion on danda disunification within WG2).

Unicode 5.1 Fonts [2008-04-28]

Now that Unicode 5.1 has been released (April 2008) a lot of people want to be able to make use of all the new scripts and characters, but obviously can't if they don't have any fonts that support the new Unicode 5.1 characters. So here is a list of some freeware and shareware fonts that do have Unicode 5.1 coverage (Unicode 5.1 coverage in brackets):

Yeah, I agree it could well be a board game, but in my opinion the symbols on it need encoding in order that people are able to discuss it -- whether in futile attempts to decipher it or in order to describe the rules of the game. If it does turn out to be a game board then encoding the signs on it is little different from encoding Mahjong and Domino tiles. In any case Unicode will not define Phaistos Disc symbols as a script -- the characters are just there for people to use for whatever reason they want. So I don't think it makes the standard look stoopid.

Well, whatever; I guess once we go down the road of encoding game symbols, we might as well do the Disk too.

I have this board game called "Ur" somewhere -- the rules and pieces are modern inventions, but the board game is authentically Sumerian. Googling for "Ur" "board game" will find a sufficiency of information and pictures.