Sunday, 27 April 2008

What's new in Unicode 5.2?

As most of us are still trying to get to grips with Unicode 5.1, which was only released three weeks ago, it may seem a little premature to start talking about Unicode 5.2, but I'm blogging about it early this time because 5.2 promises to be a very important release of Unicode, with 6,648 new characters (revised down from 12,799) and a record 15 new scripts (revised down from 16), including the long-awaited CJK Extension-C (4,149 characters) and major historical scripts such as Egyptian Hieroglyphs (1,071 characters). Tangut (5,910 characters) and Nüshu, the famous women's writing of southern China, were originally in Amd.6, but have since been removed for further study, and will not now be encoded until Unicode 6.0 at the earliest.

[This blog post has been updated several times since first published on 2008-04-27. The most recent update on 2009-08-10 reflects the final repertoires of ISO/IEC 10646:2003 Amendments 5 and 6, which will be identical to the contents of Unicode 5.2 (Unicode 5.2 Code Charts).]

Unicode 5.2 will correspond to Amendments 5 and 6 of ISO/IEC 10646:2003 (see Unicode Liaison Report for WG 2 meeting 52). Both these amendments have now completed their two rounds of technical balloting, and so no more changes will be made to their character repertoire. It is anticipated that Unicode 5.2 will be released at the end of September 2009 (which incidentally will be the first autumnal release of a new Unicode version since 3.0 in September 1999).

Glyph Changes

Amendment 5 will also introduce changes to the representative glyph shape used in the code charts for the following characters (the new glyphs are given in N3465):

04A8 CYRILLIC CAPITAL LETTER ABKHASIAN HA

04A9 CYRILLIC SMALL LETTER ABKHASIAN HA

04BE CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER

04BF CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER

11EC HANGUL JONGSEONG IEUNG-KIYEOK

11ED HANGUL JONGSEONG IEUNG-SSANGKIYEOK

11EE HANGUL JONGSEONG SSANGIEUNG

11EF HANGUL JONGSEONG IEUNG-KHIEUKH

1680 OGHAM SPACE MARK

19D1 NEW TAI LUE DIGIT ONE
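Only the representative glyphs change for these characters; their code points and names stay exactly as listed. As a minimal sketch, assuming Python 3 and its standard unicodedata module (which reflects whichever Unicode version that Python build ships, not necessarily 5.2), the names in the list above can be looked up from the code points. The helper name describe and the list name are my own, chosen for illustration.

```python
import unicodedata

# Code points copied from the list above (hexadecimal).
GLYPH_CHANGE_CODE_POINTS = [
    0x04A8, 0x04A9, 0x04BE, 0x04BF,
    0x11EC, 0x11ED, 0x11EE, 0x11EF,
    0x1680, 0x19D1,
]

def describe(code_point):
    """Return 'XXXX NAME' for a code point, in the style of the list above."""
    # The fallback string is used if this Python's database has no name for it.
    name = unicodedata.name(chr(code_point), "<no name in this Unicode database>")
    return f"{code_point:04X} {name}"

for cp in GLYPH_CHANGE_CODE_POINTS:
    print(describe(cp))
```

Because character names are fixed once assigned, the output should match the list above regardless of which glyph a given font or code chart shows.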

Amendment 6 (1,037 characters)

Amendment 6 has now completed its two rounds of technical balloting (PDAM and FPDAM ballots), and after it has completed its final FDAM ballot it will be published. No more technical changes can now be made to the character repertoire, and so the character names and code points in the Amd.6 Code Charts can be relied on.

New Scripts

Bamum @ A6A0..A6FF (88 characters) [originally in Amd.5, but removed for further study, and now added back to Amd.6]

Aegyptus (includes the 1,071 characters in the new Egyptian Hieroglyphs block [13000..1342F], as well as many as yet unencoded hieroglyphs and other characters in the Supplementary Private Use Area-A) [NB Under Windows 7, Egyptian hieroglyphs and all the other Unicode 5.2 characters in the Supplementary Multilingual Plane render as two .notdef glyphs in Notepad and most other Windows applications. This is due to a problem with the version of Uniscribe that ships with Windows 7, which supports Unicode 5.1 but is not forward compatible with Unicode 5.2.]

HanaMin (includes the eight new characters in the main CJK Unified Ideographs block [9FC4..9FCB], all 4,149 characters in the CJK-C block, the three new characters in the CJK Compatibility Ideographs block [FA6B..FA6D], most of the characters in the Enclosed Ideographic Supplement block, and the four new characters in the Enclosed CJK Letters and Months block)
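A note on why the Windows 7 problem shows up as two missing glyphs rather than one: Windows stores text as UTF-16, where a character in the Supplementary Multilingual Plane takes two code units (a surrogate pair), while a BMP character takes one. Reading the two .notdef glyphs as one per code unit is my interpretation, not something stated above. The sketch below, assuming Python 3 and a helper name of my own (utf16_code_units), shows the difference for code points taken from the block ranges mentioned above.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) that encode one code point."""
    data = chr(code_point).encode("utf-16-be")
    # Each UTF-16 code unit is two bytes, big-endian here.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# 9FC4 and FA6B are in the BMP; 13000 starts the Egyptian Hieroglyphs block.
for cp in (0x9FC4, 0xFA6B, 0x13000):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

The BMP code points come back as a single unit equal to the code point itself, whereas U+13000 comes back as the pair D80C DC00.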

Why doesn't N3465 show the three CJK Unified Ideographs additions from ARIB-B24? It only shows the three CJK Compatibility Ideographs ARIB additions. What code points were assigned to those three new ideographs? Source document N3318 doesn't mention exact code points (only U+XXXX instead).

Someone on the Unicore mailing list recently asked the same question about the five new HKSCS characters recently added to Amd.6, but which are not shown in the PDAM6.2 repertoire (N3546). Ken Whistler's reply was:

Font limitations. Generally, Michael [Everson] doesn't do CJK additions in Unibook, particularly under meeting deadlines. That is a known issue and has been the case for several of these repertoire documents from WG2 meetings.

Michel [Suignard] knows about this and separately tracks any CJK additions. These onesey-twosey CJK additions for the URO do get correctly into the ballot documents.

NB My blog post was out of date when it stated that three CJK unified ideographs and three CJK compatibility ideographs are being added to Amd.5; in fact one of the proposed unified ideographs has been encoded as a compatibility ideograph, so there are actually two new unified ideographs (9FC4 and 9FC5) and four new compatibility ideographs (FA6B..FA6E). I have now corrected the post.

Remember that Unicode 5.2 won't be released until about October of this year (and it won't even be going beta until later this month), so you can't officially use any of the new Unicode 5.2 characters yet.

Note also that Unicode does not provide a font for the characters it encodes, and it may take months or years for vendors and font developers to provide support for the new scripts. Fonts with extensive Unicode coverage, such as Code2000/Code2001 and Everson Mono, will probably be updated to include some of the new Unicode 5.2 characters soon after its release, but some new scripts may remain fontless for several years if no-one is interested in creating a working Unicode font for a particular script.

I will append a list of fonts with Unicode 5.2 coverage to this post (as I did with my Unicode 5.1 post) when any such fonts come to my attention.

I will be discussing Unicode 6.0 at the end of October (after the end of the next WG2 meeting in Tokyo).

I am also keen to see Tangut encoded as soon as possible, but because of technical disagreements on how Tangut should be encoded, it will not be in Unicode until version 6.1 at the earliest.

There are many people who want Unicode to be frozen and stable, with no more additions, but I do not think that will happen for a few more years yet. In particular, China is currently undertaking to get all minority and historical scripts that are and have been used in China encoded (including major historical scripts such as Tangut, as already mentioned). Whilst China is involved in this work, and also in the work to encode more Han characters, I believe that Unicode will continue to grow. But when China decides that it has everything it needs, then I think that the end will have come.

* UnFonts include Hangul Jamo Ext-A and Ext-B

* HanaZono font includes all of CJK-C, plus some of Enclosed Ideographic Supplement

* New Athena Unicode includes the Unicode 5.2 additions to Coptic...

* Padauk from SIL includes Myanmar Ext-A

* Tai Heritage Pro from SIL supports Tai Viet script.