Friday, 1 January 2016

Unicode version 9.0 is scheduled for release in June 2016. The final repertoire is now fixed, and 7,500 characters (including 72 emoji) will be added to Unicode 9.0. This will bring the total number of graphic and format characters in the Unicode Standard to 128,172 (in case you are concerned that Unicode is running out of space, that still leaves room for another 846,293 characters to be encoded). In summary, Unicode 9.0 will include 11 new blocks (named ranges of characters) and cover 6 new scripts (Osage, Newa, Bhaiksuki, Marchen, Tangut, and Adlam), making a total of 270 blocks and 135 scripts.
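The post does not spell out how the "room for another 846,293 characters" figure is reached. A minimal sketch of one breakdown that reproduces it is shown here, assuming the remaining space is the full code space minus surrogates, private-use code points, noncharacters, control codes, and the 128,172 graphic and format characters. The variable and function names are illustrative only.

```python
# Sketch: how many code points remain available for future encoding
# once Unicode 9.0 is published. The breakdown is an assumption about
# how the figure quoted in the post was derived.

CODE_SPACE = 17 * 0x10000                  # 17 planes of 65,536 code points

SURROGATES = 0xDFFF - 0xD800 + 1           # U+D800..U+DFFF
# Private use: U+E000..U+F8FF, plus planes 15 and 16 except the
# two noncharacters at the end of each of those planes.
PRIVATE_USE = (0xF8FF - 0xE000 + 1) + 2 * (0x10000 - 2)
# Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
NONCHARACTERS = 32 + 2 * 17
# Control codes: C0 (U+0000..U+001F), DEL (U+007F), C1 (U+0080..U+009F).
CONTROLS = 32 + 1 + 32

GRAPHIC_AND_FORMAT_9_0 = 128_172           # total quoted in the post


def remaining_code_points() -> int:
    """Code points still unassigned and available for new characters."""
    reserved = SURROGATES + PRIVATE_USE + NONCHARACTERS + CONTROLS
    return CODE_SPACE - reserved - GRAPHIC_AND_FORMAT_9_0


print(remaining_code_points())  # 846293
```

Under these assumptions the result matches the figure quoted above, which suggests the 65 control codes were counted as used space in addition to the graphic and format characters.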

Emoji

74 emoji characters have been accepted for encoding in Unicode 9.0. However, two of these characters have been de-emojified at the request of Apple: U+1F946 RIFLE (representing Shooting or Hunting) and U+1F93B MODERN PENTATHLON (which includes Pistol Shooting as one of its disciplines) will have no Unicode properties to suggest that they are emoji. So the two characters will still be encoded in Unicode 9.0, but as plain symbols rather than as emoji characters, and it is unlikely that any major vendors will implement them as emoji.

These characters are currently under ISO ballot for inclusion in ISO/IEC 10646:2016 (5th ed.) (see WG2 N4705 pages 130, 131, 135, and 137–138). Most of the 8,514 characters in this document will feed into Unicode version 10.0 in June 2017, but due to the urgent need of netizens to be able to use new emoji at the earliest possible date, the Unicode Technical Committee (UTC) has a habit (policy?) of fast-tracking emoji characters into the Unicode standard out of synchronization with the corresponding ISO standard (ISO/IEC 10646). On January 26 these 74 emoji characters were authorized for inclusion in the Unicode 9.0 beta, and unless any national bodies have strong and compelling objections to any of these emoji characters in the current CD ballot (which closes 29 February 2016), then these 74 emoji characters will definitely be in Unicode 9.0. A final decision will be made when the UTC meets in early May 2016.

In the end, at the UTC meeting in May 2016, the UTC decided to only accept 72 emoji characters. At the request of Apple (in response to several well-publicized emoji gun incidents, and a campaign against adding more violent emoji to Unicode), U+1F946 RIFLE and U+1F93B MODERN PENTATHLON (which includes shooting as one of its disciplines) were de-emojified, and will be encoded in Unicode 9.0 as plain non-emoji symbols. Of course, people can still use U+1F946 🥆 RIFLE (or various combinations of the letters A-Z, and many other Unicode characters) to threaten other people in text messages, but the threats will not need to be taken seriously because the rifle character will not be displayed in colour (and it is quite likely that major vendors will not support this character at all in their fonts).
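To see that the two de-emojified characters are still encoded as ordinary symbols, their names and general categories can be looked up. This is a minimal sketch using Python's standard `unicodedata` module, assuming an interpreter whose bundled Unicode database is version 9.0 or later. The helper name `describe` is hypothetical. The standard library does not expose the Emoji property itself, so this only confirms that the characters exist as symbols, not their emoji status.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str]:
    """Return (character name, General_Category) for a code point.

    Falls back to a placeholder name if the interpreter's Unicode
    database predates the character.
    """
    ch = chr(code_point)
    name = unicodedata.name(ch, "<not in this Unicode database>")
    return name, unicodedata.category(ch)


print("Unicode database version:", unicodedata.unidata_version)
for cp in (0x1F946, 0x1F93B):
    name, category = describe(cp)
    print(f"U+{cp:04X} {name} (General_Category={category})")
```

Checking whether a character carries the Emoji property would require the separate emoji data files published by Unicode, which are not part of this sketch.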

It's Not All About Emoji!

Emoji make up 99% of the noise and hype surrounding Unicode 9.0, but they account for only 1% of the new characters.

7,227 of the 7,426 non-emoji characters to be added to Unicode 9.0 are included in ISO/IEC 10646:2014 (4th ed.) Amendment 2, and are highlighted in this document (along with one currency sign, nine CJK unified ideographs, 36 emoji characters, and 5 emoji modifier characters which were fast-tracked into Unicode 8.0). These characters have all been through at least two rounds of ISO technical ballots, and they are now stable (they cannot be moved, removed, or renamed). The remaining 199 characters are included in the Committee Draft for ISO/IEC 10646:2016 (5th ed.) (full draft is downloadable as N4446). This edition has not yet completed its two rounds of technical ballots by ISO national bodies, but the UTC has decided to fast-track the Adlam script, the Newa script, and Japanese TV symbols (in addition to the 74 emoji discussed above) into Unicode 9.0. It is not unusual for the UTC to fast-track urgently required characters (such as currency symbols and emoji) into a version of Unicode before they have completed their final technical ballot, but it is unprecedented to fast-track complete scripts, especially when the first technical ballot has not yet completed.

Newa in particular has been a very difficult script to get encoded because of technical and political differences of opinion about what characters to include and the encoding model to use (see the long list of documents relating to Newa in the table below). As recently as the first ballot on the Committee Draft for ISO/IEC 10646 in August 2015, the UK national body expressed concerns over the encoding of murmured resonants as atomic characters (L2/15-262 p. 16), so the encoding of Newa cannot be considered uncontroversial. By fast-tracking Adlam and Newa into Unicode 9.0, the UTC has effectively stifled any ISO national body opposition to the Newa repertoire that the UTC has agreed upon. The CD ballot for ISO/IEC 10646 closes 29 February 2016, which theoretically allows the UTC time to tweak (or even withdraw) any of the fast-tracked characters in response to ballot comments by ISO national bodies, but any requests to change the character repertoire, character positions or character names for Newa or Adlam in the final ISO technical ballot (DIS ballot) later this year will have to be rejected, as the encoding of Newa and Adlam is already a fait accompli.

Fast-tracked characters from the ISO/IEC 10646 CD are marked ** in the tables below.

7,297 of the 7,500 new characters in Unicode 9.0 belong to six new scripts:

Of the 7,500 characters added to Unicode 9.0 (including the 74 emoji), 7,357 characters are included in 11 new blocks, and 143 characters are added to existing blocks, as detailed in the two tables below. The code points and character names for all these characters are now fixed, and will not be changed. Draft official Unicode data files are available here, and I have made a plain text list of all the new characters to be added to Unicode 9.0 available here.
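The character counts quoted throughout this post can be cross-checked against each other. The snippet below is only a sanity check on the figures as stated in the text, with illustrative variable names. It treats all 74 characters originally accepted as emoji (including the two later de-emojified) as the emoji subtotal.

```python
# Figures as quoted in the post.
TOTAL_NEW = 7_500            # characters added in Unicode 9.0
EMOJI_ACCEPTED = 74          # includes RIFLE and MODERN PENTATHLON
IN_AMENDMENT_2 = 7_227       # non-emoji characters already in Amendment 2
IN_NEW_BLOCKS = 7_357        # characters in the 11 new blocks
IN_EXISTING_BLOCKS = 143     # characters added to existing blocks

non_emoji = TOTAL_NEW - EMOJI_ACCEPTED
fast_tracked_non_emoji = non_emoji - IN_AMENDMENT_2
emoji_share_percent = 100 * EMOJI_ACCEPTED / TOTAL_NEW

# The block split must add up to the total.
assert IN_NEW_BLOCKS + IN_EXISTING_BLOCKS == TOTAL_NEW

print(non_emoji)                      # 7426
print(fast_tracked_non_emoji)         # 199
print(round(emoji_share_percent, 2))  # 0.99
```

The results agree with the 7,426 non-emoji characters, the 199 fast-tracked characters, and the "only 1%" share of emoji mentioned earlier.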