
Miscellaneous M17n Details

We've now discussed the core of Ruby 1.9's m17n (multilingualization) engine. String and IO are where you will see the big changes. The new m17n system is a big beast though, with a lot of little details. Let's talk a little about some side topics that also relate to how we work with character encodings in Ruby 1.9.

More Features of the Encoding Class

You've seen me using Encoding objects all over the place in my explanations of m17n, but we haven't talked much about them. They are very simple, each mainly just a named representation of one of the Encodings inside Ruby. The Encoding class is also a storage place for some tools you may find handy when working with them.

First, you can receive a list() of all Encoding objects Ruby has loaded in the form of an Array:
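Something like this shows the idea (the exact contents vary by Ruby version and build):

```ruby
# Encoding.list() returns an Array of every Encoding Ruby has loaded.
p Encoding.list.class                       # => Array
p Encoding.list.include?(Encoding::UTF_8)   # => true
p Encoding.list.size                        # varies by Ruby version and build
```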

The aliases() method returns a Hash keyed with the alternate names Ruby knows about. The values are the actual Encoding name that alias refers to. You can use either a name or an alias when referring to an Encoding by name, like with calls to Encoding::find() or IO::open().
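A quick sketch of both tools:

```ruby
# Encoding.aliases() maps alias names to the real Encoding names.
p Encoding.aliases.class        # => Hash
p Encoding.aliases["BINARY"]    # => "ASCII-8BIT"

# A name or an alias works anywhere Ruby expects an Encoding by name:
p Encoding.find("BINARY")       # => #<Encoding:ASCII-8BIT>
p Encoding.find("ASCII-8BIT")   # => #<Encoding:ASCII-8BIT>
```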

Finally, there's one more gotcha you should be aware of if you're going to write some code that supports a large set of Ruby's Encodings. Ruby ships with a few dummy?() Encodings that don't have character handling completely implemented. These are used for stateful Encodings. You will want to filter them out of the Encodings you try to support to avoid running into problems:
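One way to do that filtering, sketched here with ISO-2022-JP as the stateful example:

```ruby
# Stateful Encodings, like ISO-2022-JP, ship as "dummy" placeholders:
p Encoding::ISO_2022_JP.dummy?   # => true
p Encoding::UTF_8.dummy?         # => false

# Skip the dummies when deciding which Encodings to support:
supported = Encoding.list.reject(&:dummy?)
p supported.size < Encoding.list.size    # => true
```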

Ruby's String escapes also interact with Encodings. Notice that I got the requested bytes in both cases; the Strings were simply assigned the source Encoding as normal. In the first case, that built a valid UTF-8 String. The second case, however, is invalid and may later cause me fits as I try to use the String.
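A sketch of the same idea, using force_encoding() purely to pin down each literal's Encoding so the example is unambiguous: the first byte sequence forms a complete UTF-8 character, while the second is truncated and therefore invalid.

```ruby
good = "\xE3\x81\x82".force_encoding("UTF-8")   # the three bytes of "あ"
p good                   # => "あ"
p good.valid_encoding?   # => true

bad = "\xE3\x81".force_encoding("UTF-8")        # a truncated character
p bad.valid_encoding?    # => false
```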

There are a couple of exceptions though, where a String escape can actually change the Encoding of the literal. First, you'll likely remember that using a multibyte character is not allowed if you don't change the source Encoding:

Notice that the Encoding of the String was upgraded to ASCII-8BIT to accommodate the bytes. We'll talk a lot more about that special Encoding later in this post, but for now just make note of the fact that this exception gives you an easy way to work with binary data.

Octal escapes (\###), control escapes (\cx or \C-x), meta escapes (\M-x), and meta-control escapes (\M-\C-x) all follow the same rules as the hex escapes (\x##) we've just been discussing.

The other exception is the \u#### escape that can be used to enter Unicode characters by codepoint. When you use this escape, the String gets a UTF-8 Encoding regardless of the current source Encoding:

Notice how the String received a UTF-8 Encoding in all three cases, regardless of the current source Encoding. This exception gives you an easy way to work with UTF-8 data, no matter what your native Encoding is.

The Unicode escape can be followed by exactly four hex digits as I've shown above, or you can use an alternate form \u{#…} where you place between one and six hex digits between the braces. Both forms have the same effect on the String's Encoding.
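A small sketch of both forms:

```ruby
s = "\u00E9"
p s              # => "é"
p s.encoding     # => #<Encoding:UTF-8>

# The braced form takes one to six hex digits (and even several codepoints):
p "\u{E9}" == s    # => true
p "\u{48 49}"      # => "HI"
```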

Working with Binary Data

Not all data is textual data. Ruby's String class can also be used to hold raw byte sequences. For example, you may want to work with the raw bytes of a PNG image.

Ruby 1.9 has an Encoding for this which basically just means "treat my data as raw bytes." You can think of this Encoding as a way to shut off character handling and just work with bytes:
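A sketch of that switch, using encoding name Strings to keep the comparison easy to read:

```ruby
data = "resumé"
p [data.encoding.name, data.size, data.bytesize]   # => ["UTF-8", 6, 7]

data.force_encoding("ASCII-8BIT")                  # same bytes, no characters
p [data.encoding.name, data.size, data.bytesize]   # => ["ASCII-8BIT", 7, 7]
p data                                             # => "resum\xC3\xA9"
```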

See how switching the Encoding (without changing the data) shut off Ruby's concept of characters? The character count became the same as the byte count and Ruby started giving a more raw version of the inspect() String to show those are just bytes.

If you expected this Encoding to be called BINARY, you are half right. As you can see, I could use that name above because it is a valid alias. Ruby switched to the real name in the inspect() message though. Ruby actually refers to the Encoding as ASCII-8BIT, which leads us to another twist.

Obviously, there's not really such a thing as "ASCII-8BIT" outside of Ruby. Even while working with binary data though, it's not uncommon to want to make a check for some simple ASCII pieces. For example, the first few signature bytes of a PNG image do contain the simple ASCII String "PNG":
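A sketch of that trick, building the PNG signature bytes by hand:

```ruby
# The eight signature bytes that open every PNG file:
signature = "\x89PNG\r\n\x1A\n".force_encoding("ASCII-8BIT")

p /PNG/.encoding        # => #<Encoding:US-ASCII>
p signature =~ /PNG/    # => 1 (a match, starting at the second byte)
```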

Ruby makes this possible by making ASCII-8BIT compatible?() with US-ASCII. That allows tricks like the above where I validated the PNG signature with a simple US-ASCII Regexp. Thus, ASCII-8BIT means ASCII plus some other bytes and you can choose to treat parts of it as ASCII when that helps you work with the data.

It's worth noting that Ruby will now fall back to an ASCII-8BIT Encoding anytime you read() by bytes:
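For example, using a Tempfile just to keep the sketch self-contained:

```ruby
require "tempfile"

Tempfile.create("m17n") do |f|
  f.write("résumé")
  f.rewind
  chunk = f.read(3)       # a byte count, so Ruby can't promise whole characters
  p chunk.encoding        # => #<Encoding:ASCII-8BIT>
end
```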

That makes sense, because you could chop up characters when reading by bytes. If you really need to read() some bytes but keep your Encoding, you will need to set and validate it manually. Here's one way you might do something like that:
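Here's a sketch of one such approach; read_valid() is just a hypothetical helper name for the purposes of the example:

```ruby
require "tempfile"

# Hypothetical helper: read at least `bytes` bytes, then keep pulling single
# bytes until the data is valid in the requested Encoding.
def read_valid(io, bytes, encoding = "UTF-8")
  data = io.read(bytes)                                 # comes back ASCII-8BIT
  until data.dup.force_encoding(encoding).valid_encoding?
    byte = io.read(1) or break                          # stop at EOF
    data << byte                                        # both are ASCII-8BIT
  end
  data.force_encoding(encoding)
end

Tempfile.create("m17n") do |f|
  f.binmode
  f.write("résumé")
  f.rewind
  chunk = read_valid(f, 2)      # two bytes alone would split the "é"
  p chunk                       # => "ré"
  p chunk.valid_encoding?       # => true
end
```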

In that example, I just read() the fixed bytes I wanted and then push forward byte by byte until my data is valid in the desired Encoding. I had to test a dup() of the data and only force_encoding() when I was sure I was done reading, because UTF-8 and ASCII-8BIT are not compatible?() and would have raised Encoding::CompatibilityError as I was adding on bytes.

Working with binary data also requires you to know one more thing about Ruby's IO objects. Ruby has a feature where it translates some data you read on Windows. The translation is super simple: "\r\n" sequences read from an IO object are simplified to a solo "\n". This feature is meant to help Unix scripts work well on a platform that has different line endings. It does create a gotcha though: when you're going to read any non-text data, be it binary data or just a non-ASCII-compatible Encoding like UTF-16, you need to warn Ruby not to do the translation for your code to be properly cross-platform.

By the way, this isn't new. This was even true in the Ruby 1.8 era.

Telling Ruby to treat the data as binary and not perform any translation (again, only active on Windows) is simple. You can just add a "b" for binary to your mode String in a call to open(). Thus you would read with something like:

open(path, "rb") do |f|
  # ...
end

or write with code like:

open(path, "wb") do |f|
  # ...
end

If you always knew about this quirk and you did a good job of always doing this, give yourself a big pat on the back because you're all set. If you didn't, you've got a bad habit you'll need to break. Don't feel too bad about it though. I've known about this quirk since my Perl (which does the same thing) days and I've always tried to follow it. However, about ten different bugs were recently filed against one of my libraries that amounted to me missing this "b" in several places. It's easy to forget.

Ruby 1.9 is much more strict about the binary flag. It's going to complain if you don't add it when it feels it is needed. For example:

I showed the external_encoding() there to prove that it's exactly what I specified. However, as a reward for adding in these "b"s we've been bad about leaving out in the past, Ruby will now assume you want ASCII-8BIT when you supply the "b" and not an external_encoding():
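Something like this, again with a throwaway Tempfile:

```ruby
require "tempfile"

Tempfile.create("m17n") do |f|
  f.write("data")
  f.flush

  # "rb" alone now implies an ASCII-8BIT external Encoding:
  File.open(f.path, "rb") { |io| p io.external_encoding }
  # => #<Encoding:ASCII-8BIT>

  # ...but naming an Encoding after the flag still wins:
  File.open(f.path, "rb:UTF-8") { |io| p io.external_encoding }
  # => #<Encoding:UTF-8>
end
```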

It's worth noting that Ruby 1.8 accidentally helped train us to leave out the magic "b". For example, you could use IO::read() to slurp some data, but that method didn't provide a way to indicate that the data was binary. In truth, you really needed this monster for a safe cross-platform read of binary data: open(path, "rb") { |f| f.read }. It's no surprise that IO::read() was more common. IO::readlines() and IO::foreach() had the same issue. The core team has acknowledged these problems with some new additions. First, you can now pass a Hash as the final argument to all the methods that open an IO and use it to set options like :mode or, separately, :external_encoding, :internal_encoding, and :binmode (the name for the magic "b"). Here are some examples:
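A few of those forms in action (a sketch, using a Tempfile to keep it self-contained):

```ruby
require "tempfile"

Tempfile.create("m17n") do |f|
  f.write("raw bytes")
  f.flush

  # Option Hash equivalents of the mode String:
  p File.read(f.path, mode: "rb").encoding    # => #<Encoding:ASCII-8BIT>
  p File.read(f.path, binmode: true).encoding # :binmode, the magic "b" by name
  p File.open(f.path, external_encoding: "UTF-8") { |io| io.external_encoding }
  # => #<Encoding:UTF-8>
end
```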

As one last shortcut along these lines, the new IO::binread() method is the same as IO.read(…, mode: "rb:ASCII-8BIT").
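For example:

```ruby
require "tempfile"

Tempfile.create("m17n") do |f|
  f.write("raw bytes")
  f.flush

  p IO.binread(f.path, 3)         # => "raw"
  p IO.binread(f.path).encoding   # => #<Encoding:ASCII-8BIT>
end
```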

Regexp Encodings

Now that all our data has an Encoding, it only makes sense that our Regexp objects would need to be tagged as well. That is the case, but the rules for how an Encoding is selected differ for Regexp objects. Let's talk a little about how and why.
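A condensed sketch of the kind of example meant here, with a trivial pattern:

```ruby
pattern = /simple/
p pattern.encoding    # => #<Encoding:US-ASCII>

# One seven-bit Regexp works across ASCII compatible Encodings:
p pattern =~ "so simple".encode("UTF-8")        # => 3
p pattern =~ "so simple".encode("ISO-8859-1")   # => 3

# ...but a match can't even be attempted against incompatible data:
begin
  pattern =~ "so simple".encode("UTF-16LE")
rescue Encoding::CompatibilityError => e
  p e.class   # => Encoding::CompatibilityError
end
```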

After all of that talk about the source Encoding, Ruby goes and ignores it on us. You can see that the Regexp was set to US-ASCII instead of the UTF-8 that was in effect at the time. Surprising though that may be, there is actually a pretty good reason for it.

My Regexp literal only contained seven-bit ASCII, so Ruby chose to simplify the Encoding. If it had been left at the source Encoding of UTF-8, it would only be useful for checking UTF-8 data. As it is though, it can now be used to check any ASCII compatible?() data. You can see in the output that the expression was tried against three different Strings, because they are all ASCII compatible?(). (It did fail to match one, since I changed the rules of how to interpret the data and one character became two bytes, but the attempt was still made.) The fourth match could not even be attempted, because UTF-16 is not ASCII compatible?().

Of course, if your Regexp includes eight-bit characters, if you use the special escapes that change an Encoding, or if you apply one of the old Ruby 1.8 style Encoding options, you can get a non-ASCII Encoding:
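For example:

```ruby
p /abc/u.encoding   # => #<Encoding:UTF-8> (the old 1.8 style /u option)
p /é/.encoding      # => #<Encoding:UTF-8> (eight-bit characters force it too)
```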

I used /u, which you will probably remember as the way to get a UTF-8 Regexp from the old Ruby 1.8 system. The /e (for EUC-JP) and /s (for a Shift_JIS extension called Windows-31J) options still work too. Ruby 1.9 also still supports the old /n option, but it tosses warnings in some cases for legacy reasons and I recommend just avoiding it going forward. You can build an ASCII-8BIT Regexp in another way I'll show in just a moment.

As of Ruby 1.9.2, this concept of a lenient Regexp, one that will match any ASCII compatible?() Encoding, has a new name:
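A quick sketch:

```ruby
p /simple/.fixed_encoding?    # => false (the lenient, downgraded form)
p /simple/u.fixed_encoding?   # => true  (/u pins the Encoding)
```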

A fixed_encoding?() Regexp is one that will raise an Encoding::CompatibilityError if matched against any String that has a different Encoding from the Regexp itself, as long as the String isn't ascii_only?(). If fixed_encoding?() returns false, the Regexp can be used against any ASCII compatible?() Encoding. There's also a new constant with this name that can be used to disable the ASCII downgrading:
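Something like this shows the constant at work, here building the ASCII-8BIT Regexp mentioned earlier:

```ruby
source = "PNG".force_encoding("ASCII-8BIT")
binary_re = Regexp.new(source, Regexp::FIXEDENCODING)

p binary_re.encoding          # => #<Encoding:ASCII-8BIT>
p binary_re.fixed_encoding?   # => true
```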

Note how a Regexp will take the Encoding of the String passed to Regexp::new() when Regexp::FIXEDENCODING is set. You can use this combination to build a Regexp in any Encoding you need, including the ASCII-8BIT I mentioned earlier.

Once your Regexp is at least compatible with your data's Encoding, pattern matches function as they always have. (Well, in truth, Ruby 1.9 brings us a powerful new regular expression engine called Oniguruma, but that's another topic for another time.) Under average circumstances, Ruby 1.9's Regexp Encoding selection rules mean that expressions are compatible with a lot of data and everything should just work for you. However, if you end up getting some errors at match time, you may need to abandon the simple /…/ literal and use the new features I've shown to build a Regexp that perfectly matches your data's Encoding.

Handling a BOM

Note that Ruby doesn't even support a UTF-16 Encoding. Instead, you must pick between UTF-16BE and UTF-16LE for "Big Endian" or "Little Endian" byte order. This indicates whether the most significant byte comes first or last:
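You can see the byte order directly:

```ruby
p "a".encode("UTF-16BE").bytes   # => [0, 97]  most significant byte first
p "a".encode("UTF-16LE").bytes   # => [97, 0]  least significant byte first
```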

Now, when someone goes to read your UTF-16 data back, they'll need to know which byte order you used to get things right. You could just tell them which order was used the same way you'll probably tell them that the data is UTF-16 encoded. Or you could add a BOM to the data.

A Unicode BOM is just the character U+FEFF at the beginning of your data. There's no such character for the reversed bytes U+FFFE, so this makes it easy to correctly tell the order of the bytes. Another minor advantage is that this BOM probably indicates you are reading Unicode data. A lot of software will check for this special start of the data, use it to set the proper byte order, and then pretend it didn't even exist by removing it from the data they show users.

Ruby 1.9 won't automatically add a BOM to your data, so you're going to need to take care of that if you want one. Luckily, it's not too tough. The basic idea is just to print the bytes needed at the beginning of a file. For example, we can add a BOM to a UTF-16LE file as such:
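A minimal version of that write, sketched with a Tempfile:

```ruby
require "tempfile"

Tempfile.create("utf16") do |f|
  File.open(f.path, "w:UTF-16LE") do |out|
    out.print "\uFEFF"    # the BOM, transcoded from UTF-8 on the way out
    out.print "data"
  end

  p File.binread(f.path, 2).bytes   # => [255, 254] (0xFF 0xFE: little endian)
end
```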

Notice that I just used the Unicode escape to add the BOM character to the data. Because my output String was in UTF-8, Ruby had to transcode it to UTF-16LE and that process arranged the bytes correctly for me, as you see in the sample output.

Reading a BOM is a similar process. We will need to pull the relevant bytes and see if they match a Unicode BOM. When they do, we can then start reading again with the Encoding we matched. We might code that up like this:
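Here's one possible shape for that code; bom_encoding() is just a hypothetical helper name:

```ruby
require "tempfile"

# Hypothetical helper: sniff the leading bytes for a Unicode BOM and return
# the matching Encoding name (nil when no BOM is found). The four byte
# patterns are tested first so UTF-32LE isn't mistaken for UTF-16LE.
def bom_encoding(path)
  bytes = File.binread(path, 4).to_s.bytes
  case
  when bytes[0, 4] == [0x00, 0x00, 0xFE, 0xFF] then "UTF-32BE"
  when bytes[0, 4] == [0xFF, 0xFE, 0x00, 0x00] then "UTF-32LE"
  when bytes[0, 3] == [0xEF, 0xBB, 0xBF]       then "UTF-8"
  when bytes[0, 2] == [0xFE, 0xFF]             then "UTF-16BE"
  when bytes[0, 2] == [0xFF, 0xFE]             then "UTF-16LE"
  end
end

Tempfile.create("bom") do |f|
  File.open(f.path, "w:UTF-16BE") { |out| out.print "\uFEFFdata" }
  p bom_encoding(f.path)   # => "UTF-16BE"
end
```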

These examples just deal with Unicode BOMs, but you would handle other BOMs in a similar fashion. Find out what bytes are needed for your Encoding, write those out before the data, and later check for them when reading the data back. The String escapes we discussed earlier can be handy when writing the bytes, and binread() is equally handy when checking for the BOM.

I do recommend including a BOM in Unicode Encodings like UTF-16 and UTF-32, but please don't add them to UTF-8 data. The UTF-8 byte order is part of its specification and it never varies, so you don't need a BOM to read it correctly. If you add one, you damage one of UTF-8's great advantages: that it can pass for US-ASCII (assuming the data is all seven-bit characters).

Another method got a neat upgrade in Ruby 1.9: Integer.chr(). You can use this method in Ruby 1.8 to convert simple byte values into single character Strings. However, the method is limited to single byte values. This example shows both how it works and the limit:

However, Ruby 1.9 adds a new twist. The method now takes an optional Encoding argument, or the String name of an Encoding. If you provide an Encoding, the method will convert a codepoint (which you can get with ord() or codepoints()) into a String:
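A sketch of both the old behavior and the new twist:

```ruby
p 97.chr             # => "a"
p 97.chr.encoding    # => #<Encoding:US-ASCII>

# With an Encoding, the Integer is treated as a codepoint:
p 233.chr("UTF-8")              # => "é"
p "é".ord                       # => 233
p 0x3042.chr(Encoding::UTF_8)   # => "あ"
```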

