Sunday, 1 November 2009

What's new in Unicode 6.0 ?

[2010-08-30 : The Indian Rupee Sign (see N3862) has now been accepted for fast-tracking into Unicode 6 at U+20B9 by the Unicode Technical Committee, although it is not in either of the corresponding amendments of ISO/IEC 10646, which will cause a temporary desynchronization between the two standards until Unicode 6.1.]

[2010-06-02 : Unicode 6.0 is now in Beta, and is scheduled for release on or about the 11th October 2010.]

[2010-04-24 : The character repertoire, code points and character names for Unicode 6.0 are now fixed.]

Now that Unicode 5.2 has been out for a month, I think that it would be a good idea to look forward to Unicode 6.0, which is scheduled for release in late 2010. Unicode 6.0 will correspond to a new (2nd) edition of ISO/IEC 10646 (ISO/IEC 10646:2010), which itself corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 8, of which Amendments 7 and 8 include 2,087 new characters that are not in Unicode 5.2 (if this is confusing, it might be helpful to try reading my post on the relationship between Unicode and ISO/IEC 10646) plus the Indian Rupee Sign (U+20B9) that is not yet included in ISO/IEC 10646. In summary, Unicode 6.0 will have a total of 109,449 characters in 206 blocks covering 93 scripts.

Because of problems with the fonts for the CJK-B block, the 2nd edition of ISO/IEC 10646 will have a multi-column format for the CJK, CJK-A, CJK-C and CJK-D blocks, but the large CJK-B block (42,711 characters) will be presented in a single column format with a single font. In order to rectify this failing at the earliest opportunity, it has been decided to immediately start work on yet another new edition of the standard (the 3rd edition) instead of publishing a series of amendments as is normally the case. A summary of the additions which will be made to the 3rd edition (which will correspond to the version of Unicode after 6.0) is available here.

Whereas Unicode 5.2 saw the encoding of fifteen new scripts and a total of 6,648 new characters, Unicode 6.0 only has three new scripts (Mandaic, Batak and Brahmi) and a total of 2,087 new characters from the two amendments. Nevertheless, Unicode 6.0 includes some of the most controversial additions to the standard for a long time. In particular, the addition of a large set of characters corresponding to Japanese Emoji 絵文字 used on mobile phones has been the cause of much heated debate (original proposal documents N3582 and N3583). Google and Apple have pushed hard for the encoding of emoji in Unicode in order to solve interoperability issues between the various vendors, who currently use different variants of emoji at different private-use code points. Two groups of emoji in particular have caused a lot of contention.

Firstly, a group of five characters representing specific cultural icons (Mount Fuji, Tokyo Tower, Statue of Liberty, Silhouette of Japan and Statue of Moyai) have been vigorously opposed because they give the appearance of setting a precedent for encoding hundreds of other characters representing cultural or nationalistic icons, such as the Great Wall of China, the Pyramids of Giza, the Eiffel Tower, Tower Bridge, Mount Kilimanjaro, etc. Some of us would have preferred to encode generic versions of these characters (e.g. Snow-Capped Mountain instead of Mount Fuji), but Google insisted that these characters had specific semantics that generic versions of the characters would not be able to represent, so in the end they were accepted as is. Note, however, that they are not precedents for encoding other characters representing cultural icons, as they were not encoded because of the importance of the objects these characters represent, but for interoperability reasons (cross-mapping to existing emoji codes). Of course, if mobile phone vendors start adding emoji for the Great Wall of China, etc. then ....

Secondly, a group of ten characters representing the flags of ten specific countries (People's Republic of China, Germany, Spain, France, the UK, Italy, Japan, Korea, Russia and the US) caused a great deal of consternation, as it seemed unreasonable to encode flag symbols for a few select countries and not for others. Two solutions were put forward to solve the problem. The US proposed encoding them as ten characters named EMOJI COMPATIBILITY SYMBOL-n with a glyph shape comprising EC-n in a dashed box (i.e. completely hide the fact that these characters map to emoji map symbols). On the other hand, Ireland and Germany proposed encoding 256 characters representing all currently assigned ISO 3166 two-letter country codes (see N3680). Neither of these proposals was acceptable to the other parties, and in the end a compromise solution to encode twenty-six "regional indicator symbols" (see N3727) was accepted. These characters may be combined into two-character sequences corresponding to ISO 3166 two-letter country codes, and applications may then render such sequences with the corresponding country flag. Of course, this does not provide a solution for the representation of flags for countries and regions that do not have an ISO 3166 two-letter code. For example, mobile phone vendors may want to display the Welsh flag in order to indicate Welsh language (GB-WLS) options, but could not do so using the currently defined "regional indicator symbols" mechanism.
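As a rough illustration of the "regional indicator symbols" mechanism, here is a minimal Python sketch that turns a two-letter code into the corresponding two-character sequence. It assumes the twenty-six symbols occupy the contiguous range starting at U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A); the function name flag_sequence is made up for this example, and whether a given sequence is actually drawn as a flag depends entirely on the rendering application and font.

```python
# Assumption: the 26 regional indicator symbols are contiguous,
# starting at U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A).
REGIONAL_INDICATOR_A = 0x1F1E6


def flag_sequence(country_code: str) -> str:
    """Map a two-letter code such as 'JP' to its regional indicator pair.

    Hypothetical helper for illustration only. It checks the shape of the
    code (two letters A-Z), not whether ISO 3166 actually assigns it.
    """
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError(f"not a two-letter code: {country_code!r}")
    # Offset each letter from 'A' into the regional indicator range.
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code
    )


if __name__ == "__main__":
    seq = flag_sequence("JP")
    print([f"U+{ord(ch):04X}" for ch in seq])
    try:
        flag_sequence("GB-WLS")
    except ValueError as exc:
        # A subdivision code has no two-letter form, so it is rejected.
        print("cannot encode:", exc)
```

The rejected GB-WLS call mirrors the limitation described above: only two-letter codes fit the scheme, so a subdivision such as Wales has no sequence of its own.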

The encoding of emoji has opened up the standard to the encoding of other related symbols that were traditionally considered outside the scope of character encoding (e.g. transport and map symbols, and symbols for playing cards), so in addition to characters deriving from emoji usage you will find in Unicode 6.0 many other symbols that have been proposed for encoding (see the expanded emoji proposal by Ireland and Germany).

Amendment 7 [225 characters]

Amendment 7 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 7 are available here.

Amendment 8 [1,862 characters]

Amendment 8 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 8 are available here.
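The character counts quoted in this post can be cross-checked with a little arithmetic. The sketch below uses the two amendment sizes and the single fast-tracked Indian Rupee Sign; the Unicode 5.2 total of 107,361 characters is an assumption brought in from outside this post, and the variable names are purely illustrative.

```python
# Sizes of the two amendments as given in the section headings.
AMENDMENT_7 = 225
AMENDMENT_8 = 1_862

# The Indian Rupee Sign (U+20B9) is in Unicode 6.0 but in neither amendment.
FAST_TRACKED = 1

# Assumption: total character count of Unicode 5.2 (not stated in this post).
UNICODE_5_2_TOTAL = 107_361

new_in_amendments = AMENDMENT_7 + AMENDMENT_8
new_in_unicode_6_0 = new_in_amendments + FAST_TRACKED
unicode_6_0_total = UNICODE_5_2_TOTAL + new_in_unicode_6_0

print(new_in_amendments)   # 2087
print(new_in_unicode_6_0)  # 2088
print(unicode_6_0_total)   # 109449
```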

Please note that the original emoji proposal (N3582/N3583) does not show the final distribution of the proposed characters amongst various existing and new blocks, and underwent extensive changes. If you wish to follow the paper trail from original proposal to final allocation then you should peruse the following documents:

Latin Extended-D {A720-A7FF} : one letter for the Uralic Phonetic Alphabet [N3571], two letters for the Janalif alphabet [N3581], ten old Latvian letters [N3587], and one middle dot letter [N3567] (removed to the next edition)

15 comments:

I believe some of your links are broken:
- 'CJK Unified Ideographs Extension D' does say N3560 but links to N3584,
- 'Enclosed Ideographic Supplement', I couldn't find any info for this search term with the linked documents

'Enclosed Ideographic Supplement', I couldn't find any info for this search term with the linked documents

The original emoji proposal (N3582/N3583) does not show the final distribution of the proposed characters amongst various existing and new blocks by the committees involved. So, although N3583 includes nine squared ideographs and two circled ideographs, it does not include the term "Enclosed Ideographic Supplement".

For anyone who is morbidly interested in the details of the emoji proposal, I have now added a list of all relevant documents.

I think you are forgetting one more source: N3565 regarding the two heavy low quotes for German. Those quotes are not mentioned at all in N3583, and in N3607 they appear as already-encoded characters (rather than new yellow-colored proposed characters).

Yes, thanks for pointing that out. The Indian Rupee Sign (as proposed in document N3862) has been accepted for fast-tracking into Unicode 6.0, which means that Unicode and ISO/IEC 10646 will be out of sync until the next (3rd) edition of ISO/IEC 10646 is published as it is too late to add it to the current edition of ISO/IEC 10646. I will update this post accordingly.

Some other fonts that partially support Unicode 6.0:
- Symbola, by George Douros, available from Unicode Fonts for Ancient Scripts, which supports additions to the Superscripts and Subscripts, Miscellaneous Technical, Miscellaneous Symbols, Dingbats, Miscellaneous Mathematical Symbols-A blocks, and the newly added Playing Cards block
- DejaVu 2.32 supports the new Indian Rupee sign, as well as U+A78D Latin capital letter turned H, and the latest snapshot contains the Playing Cards block and U+26E2 Astronomical symbol for Uranus
