Thursday, August 30, 2007

Changes in Java to support supplementary Unicode characters

Support for supplementary characters might need changes in the Java language as well as the API. A few questions come to mind.

How do we support supplementary characters at the primitive level (char is only 16 bits)?

How do we support supplementary characters in low-level APIs (such as the static methods of the Character class)?

How do we support supplementary characters in high-level APIs that deal with character sequences?

How do we support supplementary characters in Java literals?

How do we support supplementary characters in Java source files?

The expert committee that worked on JSR-204 dealt with all these questions and, I'm sure, many more. After deliberating and experimenting with how the changes would affect existing code, they came up with the following solution.

The primitive char was left unchanged. It is still 16 bits, and no other type has been added to the Java language to support the supplementary range of Unicode characters. A supplementary character is instead represented either as an int holding its code point or as a pair of char values (a surrogate pair).
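To make the surrogate pair idea concrete, here is a minimal sketch. The class name SurrogatePairDemo and the helper toUtf16 are my own, chosen for illustration; the Character methods it calls are part of the standard API since Java 5.0. The sample code point U+1D400 is just one convenient supplementary character.

```java
final class SurrogatePairDemo {
    // Hypothetical helper: returns the UTF-16 char values that encode a code point.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1D400; // MATHEMATICAL BOLD CAPITAL A, too large for a 16-bit char
        char[] units = toUtf16(codePoint);

        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        System.out.println(units.length);                                  // 2
        System.out.println(Character.isHighSurrogate(units[0]));           // true
        System.out.println(Character.isLowSurrogate(units[1]));            // true

        // The two surrogates combine back into the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint); // true
    }
}
```

A character from the 16-bit range, such as 'A', still comes back from toUtf16 as a single char.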

Low-level APIs, such as the static methods of the Character class, accepted only the char primitive type before Java supported supplementary characters. Since Java 5.0, methods such as isLetter(...) have an overloaded version that accepts an int representing the code point, alongside the earlier version that accepts a char.
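The difference between the two overloads shows up as soon as the character lies outside the 16-bit range. The sketch below is only an illustration: the class LetterCheckDemo and its helper methods are hypothetical names of mine, and U+10400 (DESERET CAPITAL LETTER LONG I) is an arbitrary supplementary letter.

```java
final class LetterCheckDemo {
    // Uses the int overload, which sees the whole code point.
    static boolean isLetterByCodePoint(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Uses the older char overload, which sees only one 16-bit value.
    static boolean isLetterByChar(char c) {
        return Character.isLetter(c);
    }

    public static void main(String[] args) {
        int deseret = 0x10400;                      // a supplementary letter
        char[] units = Character.toChars(deseret);  // its surrogate pair

        System.out.println(isLetterByCodePoint(deseret)); // true
        // A lone surrogate is not a letter, so the char overload cannot help here.
        System.out.println(isLetterByChar(units[0]));     // false
        System.out.println(isLetterByChar(units[1]));     // false
    }
}
```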

High-level APIs will continue to work "as is" for most developers. They represent character sequences as UTF-16 sequences. Some methods in String and StringBuffer now have parallel methods that work with code points, such as codePointAt(...), codePointBefore(...), and codePointCount(...). For example, codePointCount(...) returns the number of code points in a given range of a String. This may be smaller than the value returned by length(), which counts char values, if some characters are from the supplementary range and are therefore represented as surrogate pairs.
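Here is a small sketch of how these String methods behave when a string contains one supplementary character. The class CodePointDemo and its methods sample and codePoints are names I made up for the example; the String and Character calls are standard.

```java
final class CodePointDemo {
    // Builds "a", U+1D400, "b": three code points stored in four char values.
    static String sample() {
        return "a" + new String(Character.toChars(0x1D400)) + "b";
    }

    // Hypothetical helper: walks a string one code point at a time.
    static int[] codePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // advance by 1 or 2 char values
        }
        return result;
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                       // 4
        System.out.println(s.codePointCount(0, s.length()));  // 3
        System.out.println(s.codePointAt(1) == 0x1D400);      // true
        System.out.println(s.codePointBefore(3) == 0x1D400);  // true
    }
}
```

Note that the loop advances by Character.charCount(cp) rather than by 1, which is the usual way to avoid splitting a surrogate pair.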

Identifiers in Java can contain any letter or digit (although they cannot start with a digit). Many supplementary characters are letters or digits. To allow supplementary characters to be used in identifiers, the Java compiler and other tools were modified to use the code point versions of the relevant API methods: isJavaIdentifierStart(int) and isJavaIdentifierPart(int).
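As a rough illustration of how a tool might use these two methods, the sketch below checks a candidate identifier one code point at a time. The class IdentifierCheckDemo and the method isValidIdentifier are hypothetical, this is not how javac itself is implemented, and the check ignores reserved words such as "class".

```java
final class IdentifierCheckDemo {
    // Hypothetical helper: true if every code point is legal at its position.
    // Does not reject keywords or literals such as "null".
    static boolean isValidIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String letter = new String(Character.toChars(0x10400)); // DESERET CAPITAL LETTER LONG I
        String digit = new String(Character.toChars(0x1D7CE));  // MATHEMATICAL BOLD DIGIT ZERO

        System.out.println(isValidIdentifier("x" + letter));   // true
        System.out.println(isValidIdentifier(letter + digit)); // true
        System.out.println(isValidIdentifier(digit + "x"));    // false, starts with a digit
    }
}
```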

Since we need to support supplementary characters all the way, they also need to be supported in Java source files. In the next blog post, I will discuss how to include Unicode characters in Java source files and get them to compile using the Java compiler's -encoding option.

While I was reading about encoding, I came across this interesting blog post that describes a situation in which an I18N-enabled Java program ceased to work after the build machine was moved from a Windows box to a Red Hat box. The cause, of course, was an encoding-related issue.

Note: This text was originally posted on my earlier blog at http://www.adaptivelearningonline.net