Saturday, 26 May 2012

What's new in Unicode 6.2 ?

The answer to the question "What's new in Unicode 6.2 ?" is rather short :

U+20BA

TURKISH LIRA SIGN

Yep, that's it, just a single new character. The Unicode Technical Committee (UTC) decided earlier this month to fast track the encoding of the recently announced currency symbol, as it had previously done with the newly invented Indian Rupee Sign ₹ (U+20B9, added to Unicode 6.0 in 2010) and the Euro Sign € (U+20AC, one of only two characters added to Unicode 2.1 in 1998 [kudos to anyone who knows what the other character was, and a special prize to anyone who has ever had cause to use it]). However, whereas the Indian Rupee Sign was fast tracked into an already scheduled release, the Turkish Lira Sign has the dubious honour of being the first ever character to be given an entirely new version of Unicode all to itself, Unicode 6.2, which will probably be released in late September or early October 2012. This also means that 2012 will be the first ever year during which more than one major or minor version of Unicode has been released.

Unicode releases are normally coordinated with publications of new editions or amendments to the corresponding international standard, ISO/IEC 10646 (see Unicode and ISO/IEC 10646 for details of the relationship between these two standards), but the next amendment to ISO/IEC 10646:2012 (i.e. Amendment 1, covering Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and additional Wingdings symbols) isn't scheduled to start its final ballot until the end of this year, so a version of Unicode corresponding to Amendment 1 could not be released until spring 2013. In order to meet expected demand to use the newly devised currency sign as soon as possible, the UTC therefore decided not to wait until the next anticipated version of Unicode next year, but instead to release a new version especially for the Turkish Lira Sign, on the assumption that the character is uncontroversial and will be accepted into ISO/IEC 10646 anyway. Of course this puts the ISO committee (WG2) in a slightly awkward position, as the ISO/IEC 10646 and Unicode repertoires need to be identical (and preferably synchronised), but Unicode 6.2 will probably be published before the committee even has a chance to discuss the proposal for the first time at its next meeting in October, and so faced with a fait accompli by the UTC it will have to accept the Turkish Lira Sign into ISO/IEC 10646 at the earliest opportunity regardless of what individual national body members of the committee may think of the new currency symbol.
And as the UTC is looking into ways of making quicker releases of Unicode in response to industry demand to encode urgent-use characters, perhaps we will see more intercalary releases of Unicode with only one or two character additions in the future (there are probably some people who are looking forward to an accelerated release of Unicode 6.3 to meet the demand for the New Greek Drachma Sign, but that might be more controversial given the existence of the unused and unloved Drachma Sign ₯ at U+20AF [not to be confused with the ancient Greek Drachma Sign 𐅻 at U+1017B]).

The broader Unicode community did not all agree with the assessment that this was an uncontroversial addition, and a tsunami of emails has engulfed the Unicode mailing list since the initial announcement on 15 May. I don't want to be drawn into this futile argument, but if you want to start using the Turkish Lira Sign today, you can, as it is already included in Michael Everson's free Rupakara font. And if you are eager to take a closer look at Unicode 6.2, then I have just released beta versions of BabelPad and BabelMap that support Unicode 6.2 (NB the Unicode 6.2 data incorporated into BabelMap and BabelPad is provisional and subject to change before Unicode 6.2 is officially released, and so should not yet be relied on).

What Else ?

What else can we say about Unicode 6.2 ? Well, U+0709 ܉ SYRIAC SUBLINEAR COLON SKEWED RIGHT is getting a new formal alias: SYRIAC SUBLINEAR COLON SKEWED LEFT; U+1240F 𒐏 CUNEIFORM NUMERIC SIGN FOUR U through U+12414 𒐔 CUNEIFORM NUMERIC SIGN NINE U are having their numeric values changed from '4' through '9' to '40' through '90'; and U+065F ARABIC WAVY HAMZA BELOW is moving from inherited script to Arabic script. On a more practical point, the Unicode 6.2 code charts will for the first time show variation sequences, which are now growing in number at a startling rate.
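If you want to check which of these property values your own software actually carries, a minimal sketch along the following lines may help. It uses Python's standard unicodedata module, whose answers depend entirely on the Unicode version bundled with your Python build, so the printed values are not guaranteed to match anything described above. The helper names describe and resolve_name are my own invention for illustration, and unicodedata does not expose the Script property, so the U+065F change cannot be checked this way.

```python
import unicodedata


def describe(code_point):
    """Return the name and numeric value that the bundled UCD holds for a code point."""
    ch = chr(code_point)
    return {
        "code_point": f"U+{code_point:04X}",
        "name": unicodedata.name(ch, "<unassigned>"),
        "numeric": unicodedata.numeric(ch, None),  # None if the character has no numeric value
    }


def resolve_name(name):
    """Look up a character by name or formal alias; None if the bundled UCD does not know it."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


if __name__ == "__main__":
    # The version of the Unicode Character Database compiled into this Python.
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in (0x20BA, 0x0709, 0x1240F, 0x12414):
        print(describe(cp))
    # Formal aliases are resolved by lookup() where the bundled data includes them.
    print(repr(resolve_name("SYRIAC SUBLINEAR COLON SKEWED LEFT")))
```

Running this against different Python versions is a quick way to see whether a given property change actually made it into a published version of the data files.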

On Beyond 6.2

The main side effect of this special release of Unicode 6.2 will be to push back the release date of the version of Unicode synchronised with ISO/IEC 10646:2012 Amendment 1, which was originally anticipated for release next spring. It is now probable that the next version of Unicode (shall we call it Unicode 7.0?) will be synchronised with ISO/IEC 10646:2012 Amendments 1 and 2, and will not be released until early 2014. I will blog about the contents of Unicode 7.0 in October this year.

In the meantime, it is probable that an "update version" of Unicode (i.e. Unicode 6.2.1), which includes any required changes to character properties and updates to the standard annexes, but which does not include any changes to character repertoire, will be released in spring 2013. Unicode 6.2.1 will include the addition of 1,002 standardized variants for CJK Unified Ideographs, corresponding to CJK Compatibility Ideographs, as an alternative, roundtripable mechanism for representing compatibility ideographs. I suspect that this will confuse the hell out of implementations that assumed that variation sequences for CJK Unified Ideographs only ever used Variation Selectors 17 through 256, and that VS1 through VS16 were only used for variation sequences that did not feature Han ideographs.
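To make the potential confusion concrete, here is a rough sketch of how an implementation might tell the two groups of variation selectors apart: VS1 through VS16 live at U+FE00..U+FE0F, and VS17 through VS256 at U+E0100..U+E01EF. The function names are hypothetical, and the Han check is a deliberately crude range test (main block plus Extensions A and B only), which is an assumption for illustration rather than a proper Unified_Ideograph property lookup.

```python
def variation_selector_number(ch):
    """Return n for VSn (1..256), or None if ch is not a variation selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:        # VS1..VS16
        return cp - 0xFE00 + 1
    if 0xE0100 <= cp <= 0xE01EF:      # VS17..VS256
        return cp - 0xE0100 + 17
    return None


def is_han_ideograph(ch):
    """Crude range check (assumption): main CJK block, Extension A, Extension B."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF
            or 0x3400 <= cp <= 0x4DBF
            or 0x20000 <= cp <= 0x2A6DF)


def han_variation_sequences(text):
    """Yield (base, n) for each Han ideograph directly followed by VSn."""
    for base, follower in zip(text, text[1:]):
        n = variation_selector_number(follower)
        if n is not None and is_han_ideograph(base):
            yield base, n


if __name__ == "__main__":
    # Arbitrary ideographs paired with selectors; not claimed to be registered sequences.
    sample = "\u4e00\ufe00x\u4e01\U000E0101"
    for base, n in han_variation_sequences(sample):
        kind = "VS1-VS16" if n <= 16 else "VS17-VS256"
        print(f"U+{ord(base):04X} + VS{n} ({kind})")
```

Code that only looks for selectors in the second range after a Han ideograph would silently miss sequences of the first kind.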

[Update (2013-10-01): In fact Unicode 6.2.1 turned into Unicode 6.3, which was released on the last day of September 2013; and Unicode 7.0 is probably delayed until the second half of 2014.]

Help! I wrote to Unicode Error Reporting months ago about character U+A980 JAVANESE SIGN PANYANGGA. The informative alias of this character should be "= candrabindu", not "= ardhacandra" as in the current code chart. In n3319, Michael Everson says that this character is analogous to Devanagari Candrabindu. Chapter 11 of the Unicode Standard says the same thing. The Balinese counterpart of this character, U+1B01 BALINESE SIGN ULU CANDRA, also has the informative alias "= candrabindu" (the Javanese and Balinese scripts are closely related). I have still had no response to my error report, and nothing has changed in the Javanese code chart in Unicode 6.0 and 6.1. What should I do? Will they fix it in 6.2?

U+A980 JAVANESE SIGN PANYANGGA still has the informative alias "ardhacandra" in the pre-beta 6.2 code charts, but there is still time for this to be changed before 6.2 is released. I don't know anything about Javanese, so cannot judge whether "ardhacandra" or "candrabindu" is correct, but can only suggest that you report the issue again on the Unicode reporting form, giving appropriate evidence for the correction.

A couple questions… We're still waiting for October for the middle dot, like you said in a comment here, right? http://babelstone.blogspot.com/2011/06/whats-new-in-unicode-61.html On a related note, has there been any talk about encoding a Latin letter middle line (I'm not sure what it would be called) — a hyphen look-alike — used in Passamaquoddy transliteration? It recently was a minor challenge typesetting it in the Passamaquoddy-Maliseet dictionary (2008 — http://books.google.com/books?id=4zCDOgAACAAJ&dq=dictionary+passamaquoddy&source=bl&ots=XP4d3TBSU5&sig=NC_ezYaTS1nKCPNLtDnRJSo3G_0&hl=en&sa=X&ei=nII5UP3iM8jG6AHx5YC4Bw&ved=0CC8Q6AEwAA — sorry for the ugly URL) because the authors used a hyphen for it, causing line-wrapping issues. Or is the non-breaking hyphen the correct character for that? I don't know the authors personally, but I heard about the challenge and was curious. Thanks!

The US is still opposing the encoding of the middle dot letter, but as they don't have any reasonable arguments against it I am cautiously optimistic that it will get into the next version of Unicode after 6.2 (I don't yet know whether this will be called 6.3 or 7.1).

I haven't heard anything about the hyphen-like character you mention. I will mention it to Michael Everson, who I am sure would be interested in proposing it for encoding if necessary (BTW, for Google Books urls, you only need the id parameter, so http://books.google.co.uk/books?id=4zCDOgAACAAJ&dq).

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4306.pdf The middle dot is not in Amendment 1, but Amendment 2 due to the US pushing it back. Also see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4314.pdf for proof. I think Amendment 2 is in the next version after 6.2, but I'm not sure. Will the US push it back again or not? I hope not!

The numeric value of CUNEIFORM NUMERIC SIGN FOUR U is still 4 in http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt and not the more sensible 40 you give above.

Yes, I need to update this post to reflect the actual publication of 6.2!

Do you know what has happened ?

Yes, Ken Whistler argued that the benefits of making the corrections would be outweighed by other problems such changes would introduce ("The change which the UTC made in consensus 131-C30 at first looks innocuous, but it turns out to have hidden consequences which in hindsight make it an undesirable change, in my opinion."). See 12210-cuneiform.txt for details if you have access to the UTC document registry.

By the way, I don't see these plane 1 characters on your block. They seem to be replaced by a surrogate pair :-(

I made a typo : your "block" should have been your blog. Each character outside the BMP is replaced with two instances of U+FFFD REPLACEMENT CHARACTER in my browser (Firefox 16.0.1)

And I don't have access to the UTC document registry, but I can guess that there are many other changes/updates to be done on cuneiform numbers, and correcting only part of the problems can be premature.

Thanks for the clarification, I've fixed the display problem in the blog now. That is an issue with Blogger that I had forgotten about: supra-BMP characters are converted to a pair of escaped surrogates, which IE interprets OK, but most other browsers do not. The solution is to escape the characters before posting.
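For anyone hitting the same problem, the following is a minimal sketch of one way to do the escaping, on the assumption that numeric character references (such as &#x1017B;) are an acceptable form for the platform in question; it is not necessarily the exact method used for this blog. The function names are made up for illustration. The first function shows how a supplementary-plane code point maps to its UTF-16 surrogate pair, and the second replaces each supra-BMP character with a single reference to the real code point so that no surrogates ever reach the browser.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a supplementary-plane code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def escape_non_bmp(text):
    """Replace each character above U+FFFF with a hexadecimal character reference."""
    return "".join(
        f"&#x{ord(ch):X};" if ord(ch) > 0xFFFF else ch
        for ch in text
    )


if __name__ == "__main__":
    high, low = surrogate_pair(0x1017B)
    print(f"U+1017B -> {high:04X} {low:04X}")
    print(escape_non_bmp("\u20af vs \U0001017B"))
```

A pair of escaped surrogates (for example &#xD800;&#xDD7B;) is not a valid way to reference a single character in HTML, which would explain browsers showing two replacement characters.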

The alias of U+A980 JAVANESE SIGN PANYANGGA is still "ardhacandra" in the Unicode 6.2 Javanese code chart. Chapter 11 of the Unicode Standard (http://www.unicode.org/versions/Unicode6.2.0/ch11.pdf), page 400, confirms that U+A980 is analogous to U+0901 DEVANAGARI SIGN CANDRABINDU. People interested in the Javanese script may see this as an inconsistency between the script description in http://www.unicode.org/versions/Unicode6.2.0/ch11.pdf and the code chart (http://www.unicode.org/charts/PDF/UA980.pdf). "Ardhacandra" (half moon) is not the same as "candrabindu" (moon dot).

I did report it again as you suggested, and received an e-mail from Unicode saying that they'll look at the problem. But nothing was changed in the Unicode 6.2 Javanese code chart. Perhaps in 6.3 or 7.0? :)