Wednesday, January 5, 2011

Representing Non-Unicode Text on the JVM

JRuby is an implementation of Ruby, and in order to achieve the high level of compatibility we boast, we've had to put in some extra work. Probably the biggest area is the management of String data.

Strings in Ruby 1.8

Ruby 1.8 did not differentiate between strings of bytes and strings of characters. A String was just an array of bytes, and the interpretation of those bytes depended on how you used them. You could do regular expression matches against non-text binary sequences (used by some to parse binary formats like PNG). You could treat them as UTF-8 text by setting global encoding variables to assume UTF-8 in literal strings. And you could use the same methods for dealing with strings of bytes you would with strings of characters...split, gsub, and so on. The lack of a character abstraction was painful, but the ability to do character-string operations against byte-strings was frequently useful.
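To give a flavor of that byte-oriented style, here's an approximation in modern Ruby syntax (in 1.8 every String was already a bare byte array, so the explicit .b binary tagging shown here wasn't needed):

```ruby
# Regex matching against raw binary data, e.g. checking a PNG signature.
png_data  = "\x89PNG\r\n\x1a\n".b + "...rest of file...".b
signature = /\A\x89PNG\r\n\x1a\n/n   # /n flags the regexp as binary (ASCII-8BIT)
signature.match?(png_data)           # => true
```

The same split/gsub/match machinery applied whether the bytes happened to be text or not.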

In order to support all this, we were forced in JRuby to also represent strings as byte[]. This was not an easy decision. Java's strings are all UTF-16 internally. By moving to byte[]-based strings, we lost many benefits of being on the JVM, like built-in regular expression support, seamless passing of strings to Java methods, and easy interaction with any Java libraries that accept, return, or manipulate Java strings. We eventually had to implement our own regexp engines (or the byte[]-to-char[]-to-byte[] overhead would kill us) and calls from Ruby to Java still pay a cost to pass Strings.

But we got a lot out of it too. We would not have been able to represent binary content easily with a char[]-based string, since it would either get garbled (when Java tried to decode it) or we'd have to force the data into only the lower byte of each char, doubling the size of all strings in memory. We have some of the fastest String IO capabilities of any JVM language, since we never have to decode text. And most importantly, we've achieved an incredibly high level of compatibility with C Ruby that would have been difficult or impossible forcing String data into char[].

There's also another major benefit: we can support Ruby 1.9's "multilingualization".

Strings in Ruby 1.9

Ruby 1.9 still represents all string data as byte[] internally, but it adds a field to every string: its Encoding (there's also "code range", but that's merely an internal optimization).

The biggest problem with encoded text in MRI was that you never knew what a string's encoding was supposed to be; the String object itself only aggregated byte[], and if you ever ended up with mixed encodings in a given app, you'd be in trouble. Rails even introduced its own "Chars" type specifically to deal with the lack of encoding support.

In Ruby 1.9, however, Strings know their own encoding. Strings can be forced to a specific encoding or transcoded to another. IO streams are aware of (and configurable for) external and internal encodings, and there are also default external and internal encodings. And you can still deal with raw binary data in the same structure and with the same String-manipulating features. For a full discussion of encoding support in Ruby 1.9, see Yehuda Katz's excellent post on Ruby 1.9 Encodings: A Primer.
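A quick tour of that API (these are standard Ruby 1.9 methods, nothing JRuby-specific):

```ruby
s = "résumé"                        # a literal gets the source file's encoding
s.encoding.name                     # => "UTF-8"
s.bytesize                          # => 8 (two 2-byte é characters)

latin = s.encode("ISO-8859-1")      # transcode: bytes change, characters preserved
latin.bytesize                      # => 6

raw = latin.dup.force_encoding("BINARY")  # relabel only: bytes untouched
raw.bytes == latin.bytes            # => true
```

Note the distinction: encode rewrites the bytes into a new encoding, while force_encoding just changes the label on the same bytes.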

As part of JRuby 1.6, we've been working on getting much closer to 100% compatibility with Ruby 1.9. Of course this has meant working on encoding support. Luckily, we had a hacking machine some years ago in Marcin Mielzynski, who implemented not only our encoding-agnostic regexp engine (a port of Oniguruma from C), but also our byte[]-based String logic and almost all of our Encoding support. The remaining work has trickled in over the subsequent years, leading up to the last few months of heavy activity on 1.9 support.

How It Works

You might find it interesting to know how all this works, since JRuby is a JVM-based language where Strings are supposed to be UTF-16 always.

First off, Ruby's String is implemented by the RubyString class. RubyString aggregates an Encoding and an array of bytes, using a structure we call ByteList. A ByteList represents an arbitrary array of bytes, a view into them, and an encoding. All String operations in RubyString's code work against ByteLists.
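The shape of ByteList can be sketched in Ruby itself (a hypothetical illustration of the structure, not JRuby's actual Java code):

```ruby
# Sketch of the ByteList idea: raw bytes, a begin/length view into them,
# and an encoding that says how to interpret them as text.
ByteListSketch = Struct.new(:bytes, :start, :len, :encoding_name) do
  def to_s
    bytes[start, len].pack("C*").force_encoding(encoding_name)
  end
end

src = "héllo"
bl  = ByteListSketch.new(src.bytes, 0, src.bytesize, "UTF-8")
bl.to_s   # => "héllo"
```

The point of the view (start/length) is that substring operations can share the underlying byte array instead of copying it.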

Regexp is implemented in RubyRegexp using Joni, our Java port of the Oniguruma regular expression library. Oniguruma accepts byte arrays and uses encoding-specific information at match time to know what constitutes a character in that byte array. It is the only regular expression engine on the JVM capable of dealing with encodings other than UTF-16.
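From the Ruby side, encoding-aware matching looks like this (standard Ruby semantics, which Joni implements for JRuby): the regexp and the text agree on an encoding, and matching walks characters of that encoding rather than UTF-16 code units.

```ruby
# Match inside Shift_JIS-encoded text without ever converting to Unicode.
sjis_text = "日本語".encode("Shift_JIS")
sjis_re   = Regexp.new("本".encode("Shift_JIS"))

sjis_re.match?(sjis_text)   # => true
sjis_text =~ sjis_re        # => 1 (character index, not byte offset)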

The JRuby parser also ties into encoding, using it on a per-file basis to know how to encode each literal string it encounters. Literal strings in the AST are held in StrNode, which aggregates a ByteList and constructs new String instances from it.
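The per-file mechanism is the "magic comment" at the top of a source file; eval honors the same convention, which makes it easy to see the parser's behavior:

```ruby
# The magic comment tells the parser how to decode literal strings in
# that compilation unit. Bytes EF F0 E8 spell "при" in Windows-1251.
src = "# encoding: Windows-1251\n'\xEF\xF0\xE8'".b
eval(src).encoding.name   # => "Windows-1251"
```

Without the comment, the same bytes would be decoded using the default source encoding and the literal would carry a different Encoding.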

The compiler is an interesting case. Ideally we would like all literal strings to still go into the class file's constant pool, so that they can be loaded quickly and live as part of the class metadata. In order to do this, the byte[]-based string content is forced into a char[], which is forced into a java.lang.String that goes in the constant pool. At load time, we unpack the bytes and return them to a ByteList that knows their actual encoding. Dare I claim that JRuby is the first JVM language to allow representing literal strings in encodings other than UTF-16?
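The round trip can be sketched in Ruby (an illustration of the idea; JRuby's actual byte-to-char packing scheme in the compiler may differ in detail):

```ruby
# Pack raw bytes one-per-codepoint into a valid text string (standing in
# for the char[]-backed java.lang.String in the constant pool), then
# unpack them back to the original bytes "at load time".
raw    = "日本語".b                  # the literal's real bytes (UTF-8 here)
packed = raw.bytes.pack("U*")       # each byte becomes one codepoint/"char"
loaded = packed.unpack("U*").pack("C*").force_encoding("UTF-8")
loaded == "日本語"                  # => true
```

The packed form is plain representable text, so it survives storage in a UTF-16-oriented container, while the original bytes and their encoding are fully recoverable.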

What It Means For You

At the end of the day, how are you affected by all this? How is your life improved?

If you are a Ruby user, you can count on JRuby having solid support for Ruby 1.9's "M17N" strings. That may not be complete for JRuby 1.6, but we intend to take it as far as possible. JRuby 1.6 *will* have the lion's share of M17N in place and working.

If you are a JVM user...JRuby represents the *only* way you can deal with arbitrarily-encoded text without converting it to UTF-16 Unicode. At a minimum, this means JRuby has the potential to deal with raw wire data much more efficiently than libraries that have to up-convert to UTF-16 and down-convert back to UTF-8. It may also mean encodings without complete representation in Unicode (like Japanese "emoji" characters) can *only* be losslessly processed using JRuby, since forcing them into UTF-16 would either destroy them or mangle characters. And of course no other JVM language provides JRuby's capabilities for using String-like operations against arbitrary binary data. That's gotta be worth something!

I want to also take this opportunity to again thank Marcin for his work on JRuby in the past; Tom Enebo for his suffering through encoding-related parser work the past few weeks; Yukihiro "Matz" Matsumoto for adding encoding support to Ruby; and all JRuby committers and contributors who have helped us sort out M17N for JRuby 1.6.
