Friday, July 17, 2009

The design and implementation of Ruby M17N - Translation

This is a translation of an article posted to Rubyist Magazine vol. 0025, published in February 2009. The original article was written in Japanese by Yui Naruse. The article is not new, so many people have probably read it through an online translation service. However, it is really long and is not easy to understand even in my first language. To learn Ruby's M17N and practice my English, I'm attempting to translate it. I hope this translation will help Ruby programmers who have not yet read the article.

Ruby 1.9.1 was finally released on January 31, 2009 (JST). Developing the new version took about a year from the release of Ruby 1.9.0-0 on December 25, 2007 (UTC).

Ruby 1.9 has many new features and changes, some of which are incompatible with Ruby 1.8. For example, YARV is now the Ruby VM. In addition, Ruby 1.9 adopted Oniguruma as its regular expression engine and brought Enumerator into its core. Among these, the introduction of Ruby multilingualization (M17N) is probably the biggest change.

Ruby 1.9's multilingualization (M17N) uses the code set independent (CSI) model, while many other languages use the Unicode normalization model. To make this distinctive system work, an encoding conversion engine called transcode was newly added to Ruby 1.9. In this article, I will explain what multilingualization is, what Ruby M17N supports, and how to write code in the Ruby 1.9 way.

First, M17N is short for multilingualization. As many of you know, a computer can handle only bits, bytes (groups of bits), and arrays of bytes. Single-byte character sets such as US-ASCII are easy for software to process; multi-byte character sets are not, and need some extra ideas to manage well. Further ideas are needed to use more than one encoding at the same time in a single piece of software. I'll start with a brief explanation of typical internationalization approaches, and then move on to M17N.

L10N is short for localization: the idea of localizing computer software (cf. national language support (NLS)). With localized software, users can read and write in their native language and see appropriate region-specific information. Historically, Japanese people have used a lot of software made in the U.S. or European countries. In general, those imported programs were poorly localized for Japanese users, so Japanese developers worked to translate the messages the software output into Japanese. Besides, the software architecture itself had to change to handle multi-byte encodings such as Shift_JIS or EUC-JP, since the original software was often designed only for single-byte character sets. Otherwise, trouble would occur when the software determined boundaries between words or characters. To resolve this sort of trouble, Japanese programmers made great efforts in the early days to make imported software handle multi-byte encodings.

I18N is short for internationalization: the ideas needed to internationalize computer software. Specifically, the ideas are:
- Multi-byte character sets/encodings should be supported in the software
- There should be an established mechanism for easily converting languages, currency symbols, and other region-specific notations to different ones

At the beginning, efforts to support multi-byte character sets focused on creating frameworks based on ISO 2022 (one of the multi-byte encodings). Later, internationalization came to mean, more or less, supporting Unicode. Unicode support has been complemented by abstractions for handling messages and region-specific notations; the gettext library is an example. Rails' internationalization has likewise developed on top of the Unicode support discussed here.

As I mentioned before, M17N is short for multilingualization, and it involves these ideas:
- Localization for more than one language should be available in a single piece of software. See http://www.jpnic.net/ja/research/200605-dom/chapter1-2.pdf
- More than one language should be usable at the same time. See http://www.m17n.org/m17n-lib-en/index.html

The UCS Normalization and CSI models are the well-known models for multilingualization. The two models differ in how software represents character encodings internally. Naturally, each has pros and cons.

The UCS normalization model has only one character set, called the Universal Character Set, for handling characters internally. The most remarkable advantage of this model is that people don't need to modify their programs to go along with it. Thus, in most cases, people can keep using old software that is merely localized (not internationalized) even after this model has been introduced. When this model works in the back end, encoding to and decoding from internal code points happen at every input/output operation. In other words, all input characters are converted to an array of internal code points before processing. Conversely, all internal code points are converted to a byte array before output. Since this approach standardizes on a single internal code set and does not require programmers to modify old software, many languages and operating systems, such as Perl, Python, Java, .NET, Windows, and Mac OS X, use the UCS normalization model internally. In fact, almost all languages and operating systems use it.

Perl uses the UCS Normalization model with Unicode. The character sets supported under this model depend on the Unicode version the system has chosen. Since UTF-8, one of the Unicode encodings, is a variable-width encoding, language implementers have trouble mapping character positions to byte positions and vice versa. Perl implementers are known to have put a lot of effort into handling multi-byte characters, for example, using caches to remember each character position. A further challenge is probably how well one can implement the idea of the Grapheme Cluster: a user-perceived character that consists of more than one Unicode code point but expresses a single character, for example, "G" + acute-accent. The definition of the Unicode Scalar Value is also important when thinking about what a character is in a language implementation that uses Unicode.

Java 1.5 and later versions use UTF-16 as the internal character code, while Java 1.4 and older versions used a fixed 16-bit width to store code points in the Basic Multilingual Plane (BMP) of Unicode or ISO/IEC 10646. Consequently, old versions could handle only the range between U+0000 and U+FFFF. Java changed its representation after a serious Unicode problem came to light: 16 bits were too few to assign code points to all the characters in the world. Unicode 2.0 resolved this flaw by inventing the Surrogate Pair, which represents code points beyond the BMP using a pair of 16-bit code units. In light of this background, all Java programmers should know that one unit of a "character" might be only half of a Surrogate Pair. The .NET Framework is implemented in the same way, so its programmers should keep this tricky point in mind, too.
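As a rough illustration (written in Ruby rather than Java, and not part of the original article), the sketch below encodes U+20BB7, a character outside the BMP, as UTF-16BE and splits it into 16-bit code units. The variable names are my own.

```ruby
# U+20BB7 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
ch    = "\u{20BB7}"
utf16 = ch.encode("UTF-16BE")

units = utf16.unpack("n*")                   # big-endian 16-bit code units
puts ch.length                               # 1 (one character)
puts utf16.bytesize                          # 4 (two 16-bit code units)
puts units.map { |u| "%04X" % u }.join(" ")  # D842 DFB7
```

A program that treats each 16-bit unit as a character would see two "characters" here, which is the pitfall described above.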

Python, too, uses UTF-16 as its internal character code by default. However, Python has a build option to change UTF-16 to UTF-32: --enable-unicode=ucs4 (version 2.x) or --with-wide-unicode (version 3.0). The UTF-32 build has been adopted by some versions of Fedora and Ubuntu.

The internal character code of Mosh, a Scheme interpreter, is UTF-32. Unlike the other Unicode encodings, UTF-32 is a fixed-length encoding, which means every Unicode code point is stored in exactly 32 bits. Because UTF-32 is not used for communication, the system has to encode and decode characters every time; however, the architecture for converting characters is simple because the encoding is fixed-length.

TRON uses the UCS Normalization model, but does not use Unicode for its internal character code. The TRON project defined TRON code, which includes Unicode 2.0, as the internal code. Another example that uses TRON code is soopy.

Unlike UCS Normalization, the Code Set Independent (CSI) model does not have a common internal character code set. Under the CSI model, all encodings are handled equally, which means Unicode is just one of the character sets. The most remarkable feature of the CSI model is that it does not require character code conversion, since the external and internal character codes are identical. Thus, the cost of conversion can be eliminated. Besides, we can avoid unexpected information loss caused by conversion, especially by the truncation of bits or bytes. Ruby uses the CSI model, and so do Solaris, Citrus, and other systems based on a C library that does not define __STDC_ISO_10646__. If the system's C library does not define __STDC_ISO_10646__, a given value stored in wchar_t does not always represent the same character. On the other hand, when __STDC_ISO_10646__ is defined, a value stored in wchar_t is always mapped to the same character; for example, 0x3042 is mapped to "あ" in Japanese Hiragana. Therefore, when the system uses the CSI model, programmers should be careful not to judge character codes just by looking at the data in memory. This is important for keeping bugs out. To avoid character-related bugs, programmers should use the functions defined for characters when they handle strings.

As I explained, Ruby uses the Code Set Independent (CSI) model, while many other languages use the UCS Normalization model. By using the CSI model, Ruby reduces the computational overhead that comes from unnecessary encoding conversions. Moreover, it can handle various character sets even if they are not based on Unicode.

Since Ruby M17N uses the CSI model, we cannot determine the encoding of a given string from its bytes alone. Besides, each string might have a different encoding. In light of this complexity, Ruby's String object is designed to carry its own encoding. Consequently, all string processing is done based on the encoding the String object has.
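A minimal sketch of this idea, assuming a reasonably recent Ruby: the same character is held in two String objects with different encodings, and each object reports its own encoding and bytes. I use a Unicode escape for "あ" (U+3042) so that the snippet does not depend on the script encoding; the variable names are mine.

```ruby
utf8 = "\u3042"                  # HIRAGANA LETTER A, always UTF-8 via \u
sjis = utf8.encode("Shift_JIS")  # same character, different byte sequence

puts utf8.encoding.name          # UTF-8
puts sjis.encoding.name          # Shift_JIS
p utf8.bytes.to_a                # [227, 129, 130]
p sjis.bytes.to_a                # [130, 160]
```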

Basically, the script encoding determines the encoding of string literals in source code. Each source file has its own script encoding, which can be obtained at runtime through the __ENCODING__ keyword. The script encoding must be an ASCII compatible encoding, and it is set in a magic comment line (I'll talk about the magic comment later). When there is no magic comment, Ruby applies the US-ASCII encoding to the source code. Thus, the magic comment explained below is necessary for writing non-ASCII strings in a Ruby script.

However, when we give a Ruby script to the runtime through standard input or on the command line with the -e option, the locale encoding is used as the script encoding if the magic comment is missing. Thus, we don't need to add a magic comment just for a one-line Ruby program.

The magic comment specifies the script encoding of a Ruby script file. It is similar to the encoding attribute of the XML declaration in an XML file. As I explained, US-ASCII is applied if a file has no magic comment. The magic comment must be on the first line unless the script file has a shebang line; when there is a shebang line, the magic comment goes on the second line. The magic comment must match the regular expression /coding[:=]\s*[\w.-]+/, which generally corresponds to the style of an Emacs or Vim modeline. And, as its name suggests, the magic comment must be a comment.

#!/bin/env ruby
# -*- coding: utf-8 -*-
puts "Emacs Style"

# vim:fileencoding=utf-8
puts "Vim Style 1"

# vim:set fileencoding=utf-8 :
puts "Vim Style 2"

#coding:utf-8
puts "Simple"
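To see why all four styles are accepted, here is a small sketch that applies the regular expression quoted above to each comment line. The constant name MAGIC_COMMENT and the sample list are my own; Ruby's parser does its own matching internally, so this only illustrates the pattern.

```ruby
# The pattern quoted in the article.
MAGIC_COMMENT = /coding[:=]\s*[\w.-]+/

samples = [
  "# -*- coding: utf-8 -*-",
  "# vim:fileencoding=utf-8",
  "# vim:set fileencoding=utf-8 :",
  "#coding:utf-8",
  "# just a comment",
]

samples.each do |line|
  # String#[] with a Regexp returns the matched part, or nil if nothing matches.
  puts "#{line.inspect} => #{line[MAGIC_COMMENT].inspect}"
end
```

The Vim styles match because "fileencoding" ends with "coding"; the last line does not match, so it yields nil.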

We get the "invalid multibyte char" error when we write non-ASCII string literals in a Ruby script with no magic comment. The error tells you that non-US-ASCII characters are written in the script. Raising the error helps keep source code platform independent. Only a script's author knows what encoding he or she used to write it; usually, people don't know the encoding of a script written by somebody else. Although NKF.guess or other utilities can figure out Japanese encodings, it is very hard to guess which European encoding was used in a script someone wrote at some other time and place. In light of this difficulty, Ruby 1.9 requires the magic comment if programmers want to write non-ASCII characters in a script file. The magic comment is valid only within its own file. There is no feature for specifying a script encoding when loading another file with require.

The IO object in Ruby 1.9 can set an appropriate encoding on the strings it reads and can convert encodings. We can also let the IO object convert encodings automatically on output. The external and internal encodings of the IO object determine this behavior.

Rather than invoking String#force_encoding on a string read from an IO object, we should first consider whether we can set the external encoding. Likewise, we should consider setting the internal encoding rather than using String#encode.
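The following sketch shows what setting the external and internal encodings looks like when opening a file. It assumes a recent Ruby (Tempfile.create is newer than 1.9.1) and that Encoding.default_internal is left at nil; the helper name read_sjis_sample is hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: writes U+3042 as Shift_JIS bytes to a temporary file,
# then reads it back twice with different encoding settings.
def read_sjis_sample
  Tempfile.create("m17n") do |f|
    f.binmode
    f.write([0x82, 0xA0].pack("C*"))  # "\u3042" encoded in Shift_JIS
    f.flush
    # External encoding only: bytes are kept and tagged as Shift_JIS.
    tagged = File.open(f.path, "r:Shift_JIS") { |io| io.read }
    # External and internal: bytes are converted to UTF-8 while reading.
    converted = File.open(f.path, "r:Shift_JIS:UTF-8") { |io| io.read }
    [tagged, converted]
  end
end

tagged, converted = read_sjis_sample
puts tagged.encoding.name     # Shift_JIS
p tagged.bytes.to_a           # [130, 160]
puts converted.encoding.name  # UTF-8
puts converted == "\u3042"    # true
```

Neither force_encoding nor encode is called on the strings themselves; the IO object does the work.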

Encoding.default_external returns the default external encoding of IO objects, while Encoding.default_internal returns their default internal encoding. These encodings are used only when an encoding is not specified explicitly, for standard I/O, command-line arguments, or a file opened in a script.

When Encoding.default_internal is set, every input string is expected to have the encoding it returns. In the same way, strings returned from libraries are expected to have the encoding given by Encoding.default_internal.

However, using Encoding.default_external as the initial encoding of strings returned from libraries is not recommended. Encoding.default_external is only the default value of the external encoding, so it tells us nothing about the internal encoding. In addition, we should be careful with the internal encoding because the default value of Encoding.default_internal is nil. Don't assume that Encoding.default_internal can always be used as the initial value.

We can set the values of Encoding.default_external and Encoding.default_internal with the -E command-line option, whose format is -Eex[:in]. The -U command-line option sets both values to UTF-8. When we use the -U option, the script encoding of a script given from standard input or on the command line with the -e option is assumed to be UTF-8, too. However, the -U option has no effect on the script encoding of a script given as a file.
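As a hedged illustration of the -Eex[:in] format, the sketch below starts a child Ruby process with -EShift_JIS:UTF-8 and prints the defaults that the child sees. It assumes RbConfig.ruby points at the running interpreter and that nothing else (such as RUBYOPT) overrides the encodings; the variable names are mine.

```ruby
require "rbconfig"

# One-line script for the child process: report both default encodings.
script = "print Encoding.default_external.name, ' ', Encoding.default_internal.name"

# -Eex[:in] sets default_external to ex and default_internal to in.
out = IO.popen([RbConfig.ruby, "-EShift_JIS:UTF-8", "-e", script]) { |io| io.read }
puts out  # Shift_JIS UTF-8
```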

Here's how the locale encoding is determined. The Ruby runtime tries to read the $LANG environment variable on both Unix and Windows to decide locale_charmap. If $LANG exists, locale_charmap is the encoding that $LANG names. Otherwise, the runtime tries to get it through GetConsoleCP *4 on Windows or cygwin. Once the value of locale_charmap has been fixed, we can get it from Encoding.locale_charmap. Remember that miniruby always returns ASCII-8BIT, and an environment without nl_langinfo returns nil.

After locale_charmap has been fixed, the locale encoding is determined from it. The locale encoding is identical to the value of Encoding.find(Encoding.locale_charmap), except that it is US-ASCII when locale_charmap is nil, and ASCII-8BIT when locale_charmap is a name unknown to Ruby. We can get the resulting locale encoding with Encoding.find("locale").
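A small sketch for inspecting these values on your own machine. The output depends entirely on the environment, so no expected values are shown; the variable names are mine.

```ruby
charmap = Encoding.locale_charmap   # a name such as "UTF-8"; environment dependent
locale  = Encoding.find("locale")   # the locale encoding derived from it

puts "locale_charmap:   #{charmap.inspect}"
puts "locale encoding:  #{locale.name}"
puts "default_external: #{Encoding.default_external.name}"
```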

The locale encoding is mainly used as the default value of default_external, as discussed here. Since default_external is the default external encoding of IO objects, it is applied to $stdin, $stdout, and $stderr, which are always ready to use. On the other hand, we should set the encoding explicitly when we open files or other resources. Ruby developers concluded that the encodings for standard I/O should be the same as the one used on the console, since standard I/O is available only on the console. Thus, they agreed to use the $LANG environment variable on Unix and Windows platforms, or GetConsoleCP on Windows, for the encodings of IO objects. There was an idea that UTF-16LE could be used instead on Windows through the Unicode-compliant API. However, the idea was not adopted since it would cause compatibility problems between Ruby 1.9 and 1.8. *5

Given this background, on the Windows platform the locale encoding may be incorrect when a programmer uses it for anything other than standard I/O or the default value of default_external. If you want to use the locale encoding for something other than setting the default value of default_external, I think you had better report it on the ruby-dev mailing list. Be aware that future versions of Ruby might use GetACP instead of GetConsoleCP.

The filesystem encoding is used to handle characters on the file system. For example, the character encoding of filenames obtained from the file system falls under the filesystem encoding. Thus, the filesystem encoding is entirely different from the locale encoding described here. Currently, Ruby provides no API for getting the filesystem encoding. *6

Windows stores filenames in the UTF-16LE *7 encoding on FAT32, NTFS, and other file systems that support long filenames. Also, on Windows NT, filenames on a FAT file system are handled as UTF-16LE once the system has read them. Consequently, I can say Windows, especially Windows NT, uses UTF-16LE to handle filenames.

Since Ruby 1.9.1 uses the ANSI API, Windows gives Ruby filenames after converting them to the ANSI or OEM code page *8. This means that filenames are always encoded in the ANSI or OEM code page when Ruby 1.9.1 gets them. Thus, Ruby's filesystem encoding on the Windows platform is invariably the ANSI or OEM code page. Ruby sets the encoding of these strings to the ANSI or OEM code page and, if necessary, converts them into the encoding specified by command-line options.

On Unix platforms, we cannot detect the encodings of the filenames that reside in the file system. Because of this difficulty, Ruby regards the locale encoding as the filesystem encoding, and sets the locale encoding on the byte array made from a given filename.

On HFS+ of Mac OS X, filenames are saved in UTF-16, in a variant of Normalization Form D modified by Apple. When we use the POSIX API, we get the filenames encoded in UTF-8. That is, the encoding of a filename will be UTF8-MAC if the filename was saved using the Carbon libraries of OS X. Thus, Ruby uses UTF8-MAC as the filesystem encoding on Mac OS X. However, when the POSIX API has been used to save the filenames, Ruby handles them exactly the same way as on Unix.

Ruby M17N implements the ideas described in the previous section, although the implementation was not simple for various reasons: for example, a lack of development resources, the pursuit of higher usability, and the preservation of backward compatibility.

As you may know, Ruby 1.9.1 returns 83 for Encoding.list.length (Encoding.list.length #=> 83), which shows that Ruby currently supports 83 encodings. However, not all 83 encodings are supported equally, mainly because of the lack of development resources. Encodings are grouped into three categories according to the level of support Ruby provides: ASCII compatible encodings, ASCII incompatible encodings, and dummy encodings. Since Ruby M17N uses the CSI model, Ruby must know how to handle strings in each encoding, rather than having conversion tables between external and internal encodings as in the UCS Normalization model.

Ruby fully supports strings in the encodings of this category. An ASCII compatible encoding is one in which every US-ASCII character is mapped to the range \x00-\x7F. This is the only category we can use for Ruby source code. The most remarkable feature of this category is that, under one condition, we can compare or concatenate a pair of strings even though their encodings differ. The condition is that the strings to be compared or concatenated consist only of ASCII characters, that is, String#ascii_only? returns true. In this way Ruby gets over the barrier between encodings.

Ruby assumes that code points 0x00-0x7F are mapped to US-ASCII in every encoding of this category. For example, even in the Shift_JIS encoding, Ruby M17N treats code points 0x00-0x7F as US-ASCII, although by definition the range is mapped to JIS X 0201 Roman. Conversion by transcode is the exception, and follows the ordinary definition of Shift_JIS.

When we say a string is ASCII ONLY, we mean that it consists of just ASCII characters and is encoded in one of the ASCII compatible encodings. ASCII ONLY strings can be compared with, concatenated with, and matched by regular expressions against strings in other ASCII compatible encodings.
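A minimal sketch of ASCII ONLY strings crossing encoding boundaries, assuming a reasonably recent Ruby. The variable names are mine, and "\u3042" stands for the Hiragana "あ".

```ruby
ascii_sjis = "abc".encode("Shift_JIS")   # ASCII characters only
kana_euc   = "\u3042".encode("EUC-JP")   # contains a non-ASCII character

puts ascii_sjis.ascii_only?                # true
puts kana_euc.ascii_only?                  # false

# Comparison, regexp matching, and concatenation all work across
# encodings as long as one side is ASCII ONLY.
puts ascii_sjis == "abc".encode("UTF-8")   # true
p(/b/ =~ ascii_sjis)                       # 1
joined = kana_euc + ascii_sjis
puts joined.encoding.name                  # EUC-JP
```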

ASCII-8BIT is one of the ASCII compatible encodings, and is applied to an ASCII compatible octet sequence. Thus, such data is ASCII compatible, but it is neither a "string" in the usual sense nor pure binary. Because this encoding is not ASCII incompatible, its strings can be compared and concatenated with strings of ASCII characters. Please understand that Ruby 1.9.1 does not have an ASCII incompatible binary encoding, since it was considered unnecessary for Ruby.

Emacs-Mule is the internal encoding that Emacs/Mule uses. It is a multilingualized form of ISO 2022 made into a stateless, variable-length encoding. A similar example is the stateless-ISO-2022-JP encoding.

An ASCII incompatible encoding is one in which the characters of US-ASCII are mapped to something other than \x00-\x7F. In Ruby 1.9.1, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE fall into this category. Ruby supports these encodings only partially. We can't use ASCII incompatible encodings for a Ruby script; besides, we can't concatenate strings in these encodings with ASCII strings.

As I said, Ruby 1.9.1 partially supports UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, none of which has a BOM (Byte Order Mark) *9. In these encodings U+FEFF is treated as ZERO WIDTH NO-BREAK SPACE, so please delete a U+FEFF that serves as a BOM yourself as needed. Be careful: Ruby 1.9.1 does not support UTF-16 and UTF-32.

The lack of development resources is the reason UTF-16 and UTF-32 are unsupported. Ruby developers once tried to support these encodings but found that it was not easy. To support them, Ruby would need to calculate byte positions while taking the BOM into account, provide endian-sensitive methods for each encoding, and deal with complicated IO-related processing. In light of these difficulties, Ruby 1.9.1 gave up on supporting UTF-16 and UTF-32. No one opposes supporting these two encodings, so please provide a patch. It will probably be accepted.

Since ASCII incompatible encodings are supported only partially, it is recommended to convert strings into UTF-8 when various string operations are expected.
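The restriction and the recommended workaround can be sketched as follows, assuming a reasonably recent Ruby. The variable name is mine.

```ruby
utf16 = "abc".encode("UTF-16LE")

puts utf16.length        # 3
puts utf16.bytesize      # 6
puts utf16.ascii_only?   # false: UTF-16LE is not ASCII compatible

begin
  utf16 + "def"          # mixing with an ASCII string fails
rescue Encoding::CompatibilityError => e
  puts "cannot concatenate: #{e.message}"
end

# Converting to UTF-8 first makes ordinary string operations available.
puts utf16.encode("UTF-8") + "def"   # abcdef
```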

Although Ruby defines UCS-2BE as an alias of UTF-16BE, Ruby 1.9.1 does not support UCS-2BE itself. The alias is provided for users' convenience, so that data encoded in UCS-2BE can be read.

Dummy encodings are those for which Ruby knows only the name. Ruby regards strings in dummy encodings as byte sequences and does not treat them as strings. Even if a string has ASCII characters only, comparison, concatenation, and other string operations are not supported for dummy encodings. ISO-2022-JP, UTF-7, and other stateful encodings are in this category. We should convert them into stateless-ISO-2022-JP or UTF-8 before we use the strings.
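As a hedged sketch of working with a dummy encoding, the snippet below produces an ISO-2022-JP string by conversion, shows that it cannot be concatenated with an ASCII string, and converts it to UTF-8 before use. It assumes the ISO-2022-JP transcoder that ships with standard Ruby builds is available; the variable names are mine.

```ruby
jis = "\u3042".encode("ISO-2022-JP")   # escape sequences plus JIS bytes

puts jis.encoding.dummy?               # true

begin
  jis + "abc"                          # even an ASCII string cannot be appended
rescue Encoding::CompatibilityError => e
  puts "cannot concatenate: #{e.message}"
end

utf8 = jis.encode("UTF-8")             # convert before doing string work
puts utf8 == "\u3042"                  # true
```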

Support for a new encoding can be added to Ruby as an extension library. One C API function, rb_enc_replicate, helps you define a new encoding by creating a replica of an already supported encoding. Alternatively, rb_define_dummy_encoding creates a dummy encoding. (The idea of the "replica" is that we can manage encodings from the C API, but cannot do anything to them from Ruby.) It is not easy to define a new encoding from scratch. Rather, I encourage you to request standard support for the encodings you want.

If, in spite of my recommendation, you need your own implementation of a new encoding, you should create a replica of one of the supported encodings. In addition, you should be careful about defining a new dummy encoding, since strings in dummy encodings cannot be concatenated with ASCII ONLY strings. Make sure a dummy encoding is truly the right choice.

Ruby defines the names "locale," "external," and "internal" to refer to three special encodings: the locale encoding, the default external encoding, and the default internal encoding. When we want to know the script encoding of a source file, we can use the __ENCODING__ keyword.

The Encoding class provides utility methods for working with encodings, such as getting a list of encodings or managing special encodings. In Ruby, the Encoding class does not hold conversion tables; instead, it keeps the byte structure of each encoding and character information, using the Oniguruma encoding module. The conversion tables are maintained by Encoding::Converter, which belongs to the transcode family.

Besides Encoding#name and Encoding#inspect, we have the method Encoding#dummy? to tell whether a given encoding is a dummy encoding.

To find out whether a given encoding is ASCII compatible, we had better use Encoding::Converter.asciicompat_encoding rather than the methods of the Encoding class. This method returns nil for ASCII compatible encodings and non-existing encodings. When the given encoding is an ASCII incompatible or dummy encoding, it returns the ASCII compatible encoding whose character set is equivalent to that of the given one.
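A short sketch of Encoding#dummy? and Encoding::Converter.asciicompat_encoding on three sample encodings. On the Ruby versions I am aware of, the latter returns nil when the argument is already ASCII compatible and returns an ASCII compatible counterpart otherwise; the loop and its labels are my own.

```ruby
%w[ISO-2022-JP UTF-16BE UTF-8].each do |name|
  enc    = Encoding.find(name)
  compat = Encoding::Converter.asciicompat_encoding(name)
  label  = compat ? compat.name : "nil"
  puts "#{name}: dummy?=#{enc.dummy?}, asciicompat_encoding=#{label}"
end
# ISO-2022-JP: dummy?=true, asciicompat_encoding=stateless-ISO-2022-JP
# UTF-16BE: dummy?=false, asciicompat_encoding=UTF-8
# UTF-8: dummy?=false, asciicompat_encoding=nil
```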

The Encoding class has more methods, for example, Encoding.compatible?(str1, str2). This method judges whether two String or Encoding objects can be compared or concatenated, and returns the encoding that would result from concatenation. Please see the documentation of the Encoding class for further details.
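For instance, here is a minimal sketch of Encoding.compatible? with one compatible pair and one incompatible pair. The strings are built so that the result does not depend on the script encoding; the variable names are mine.

```ruby
latin = [0xA1].pack("C").force_encoding("ISO-8859-1")  # one non-ASCII Latin-1 char
kana  = "\u3042"                                       # non-ASCII UTF-8 char

# Non-ASCII ISO-8859-1 string with an ASCII ONLY string: compatible.
puts Encoding.compatible?(latin, "b").name   # ISO-8859-1

# Two non-ASCII strings in different encodings: not compatible.
p Encoding.compatible?(latin, kana)          # nil
```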

In Ruby 1.8 and older versions, a String was just a byte array. This design gave us high flexibility and kept the cost of encoding conversion low. In older versions, Ruby could treat byte arrays as the appropriate encoding, whether Shift_JIS, EUC-JP, or UTF-8. However, the design also had a downside: encoding-aware string operations were limited to regular expressions with $KCODE and the methods provided by jcode.rb. Because of this limitation, some people were dissatisfied with Ruby.

With the introduction of Ruby M17N, a String object in Ruby 1.9 is still a byte array, but unlike in older versions the array has an encoding associated with it. This change made every operation encoding aware. When a String object is in one of the encodings Ruby supports, we can use it literally as a string, whatever the encoding is. In this section, I'll explain which parts of the String object have changed in Ruby 1.9.

Since strings are arrays of characters, I'm going to start with the Character class. Ruby M17N does not have a class that literally represents a character; instead, it uses a String object whose content is a single character.

In the early days of Ruby M17N development, a Character class was on the table. It was to be a class of its own, not a special case of the String class. However, in the course of the design, Ruby developers realized that they did not need a class just for a character, since the String class can cover every feature of a Character class. That is, the code point, the encoding, and the data stored in a byte array are all elements of the String class as well. In the Ruby way of thinking, a character is simply a String whose content is just one character.

This design has a couple of advantages that cover various string operations. For example, a byte array read from an external resource becomes usable as a string just by setting the appropriate encoding on it. Or, we can change the unit of a character by replacing the encoding. On the other hand, the design has a downside: performance suffers because it is hard to locate a given character position in a string with a variable-length encoding.

When we want to get a character from a string, we can use String#[]. In Ruby 1.8, the String indexer returns the byte value at the specified index, that is, a Fixnum. For example, with the UTF-8 string "あいう", Ruby 1.8 returns the value of the 0th byte, 0xE3, when the index is zero.

Ruby 1.9, on the other hand, returns a character from the string indexer. As I said, a character here means a string whose content is just a single character. For the string "あいう", the first letter "あ" is returned when index 0 is specified. This shows well that Ruby 1.9 literally sees a String object as an array of characters.
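The Ruby 1.9 behavior can be sketched as follows. The string is written with Unicode escapes for "あいう" (U+3042, U+3044, U+3046) so that the snippet does not depend on the script encoding; getbyte is used to show the value that the Ruby 1.8 indexer would have returned.

```ruby
str   = "\u3042\u3044\u3046"   # hiragana a, i, u in UTF-8
first = str[0]

puts first == "\u3042"   # true: the first character, not the first byte
puts first.class         # String
puts str.getbyte(0)      # 227 (0xE3), what Ruby 1.8 returned for str[0]
```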

Ruby 1.9 does not define a Character class, but we can write a character literal in a Ruby program. Not only the conventional style such as ?a but also a new multilingualized style using a non-ASCII character, such as ?あ, can be written. In addition, Unicode notation is now available. It resembles the existing notation in which an escape symbol precedes an ASCII code. The encoding of the character is UTF-8 when Unicode notation is used; otherwise it is the script encoding of the source file.

Ruby has the String#ord method to convert a character to a code point. What if we try it on a Hiragana character, "あ".ord? The result depends on the encoding tied to the character. For example, we get 12354 when the encoding is UTF-8. Then what if we call the inverse method, chr, on the 12354 we got from "あ".ord? We get an exception instead of the expected character, because this method needs to be told which encoding to convert into. Thus, 12354.chr("UTF-8") gives us the Hiragana character "あ" as intended.
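A hedged sketch of the ord/chr round trip, assuming Encoding.default_internal is nil (the default). The variable name is mine, and "\u3042" again stands for "あ".

```ruby
a = "\u3042"

puts a.ord                       # 12354 in UTF-8
puts a.encode("EUC-JP").ord      # 42146 (0xA4A2): depends on the encoding

begin
  12354.chr                      # no encoding given
rescue RangeError => e
  puts "chr without an encoding failed: #{e.message}"
end

puts 12354.chr("UTF-8") == a     # true
puts ?a                          # a
```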

In most cases, a String literal remains the same as in Ruby 1.8. One of the few new features is Unicode escapes. In Ruby 1.9, we can use \uXXXX and \u{XXXX} to express a character, in addition to the traditional forms \OOO and \xHH.

When a String object is created from a String literal, its encoding is usually the script encoding. The exception is a String literal with Unicode escapes; in this case, UTF-8 is applied whatever the script encoding is. If a non-ASCII String object is created using byte escapes while the script encoding is US-ASCII, the resulting encoding is ASCII-8BIT.
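The escape forms can be compared with a small sketch. Because the encoding of a byte-escape literal depends on the script encoding, I tag those two strings as UTF-8 explicitly with force_encoding; that step is my addition for portability, not something the literals require.

```ruby
by_u      = "\u3042"          # \uXXXX
by_braces = "\u{3042}"        # \u{XXXX}
by_hex    = "\xE3\x81\x82".dup.force_encoding("UTF-8")   # \xHH
by_octal  = "\343\201\202".dup.force_encoding("UTF-8")   # \OOO

puts by_u.encoding.name       # UTF-8, whatever the script encoding is
puts by_u == by_braces        # true
puts by_u == by_hex           # true
puts by_u == by_octal         # true
```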

The String#length method has become character aware. In Ruby 1.8, this method returned the length of the byte array that represents the String. In Ruby 1.9, String#length returns the number of characters in the String. When we want the length of the byte array, the newly added String#bytesize method is the one to use.
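For example, with the three-character UTF-8 string "あいう" (written here with Unicode escapes):

```ruby
str = "\u3042\u3044\u3046"

puts str.length     # 3 characters
puts str.bytesize   # 9 bytes (three bytes per character in UTF-8)
```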

The String#each method was removed in Ruby 1.9. A String object is no longer Enumerable, since it was unclear what each should iterate over.

To make this explicit, Ruby 1.9 has four methods for enumerating a String: String#each_byte iterates over bytes, String#each_codepoint over code points, String#each_char over characters, and String#each_line over lines. When a block is given to these methods, they iterate just as the old each method did; when no block is given, they return an Enumerator object. Ruby 1.9 also has plural forms of these methods: String#bytes, String#codepoints, String#chars, and String#lines. These plurals make it clear that a String object is not only an array of characters but also still an array of bytes, code points, or lines.
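A minimal sketch of the four enumeration methods on one small string. I call to_a on the block-less forms so that the output is the same whether the method returns an Enumerator or, as the plural forms do in newer Ruby versions, an Array.

```ruby
s = "a\u3042\n"   # "a", hiragana a, newline

p s.each_byte.to_a          # [97, 227, 129, 130, 10]
p s.each_codepoint.to_a     # [97, 12354, 10]
p s.each_char.to_a.length   # 3
p s.each_line.to_a.length   # 1

p s.each_char.class         # Enumerator
p s.respond_to?(:each)      # false: String#each is gone
```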

The methods explained here, except each_codepoint, have been backported to Ruby 1.8 and are available since Ruby 1.8.7.

String comparison and concatenation have changed considerably in Ruby 1.9. Even if two strings are identical as byte arrays, String#== returns false when their encodings differ. Comparison results in true only when both the byte arrays and the encodings match. There is one exception: even when the encodings differ, we get true if both encodings are ASCII compatible and both strings consist only of ASCII characters.
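
The comparison rule can be illustrated with a sketch like the one below. The two raw bytes are an arbitrary choice that happens to be valid in both EUC-JP and Shift_JIS; the variable names are hypothetical.

```ruby
raw = [0xA4, 0xA2].pack("C*")   # two bytes with no character meaning yet

as_eucjp = raw.dup.force_encoding("EUC-JP")
as_sjis  = raw.dup.force_encoding("Shift_JIS")

same_bytes  = (as_eucjp.bytes.to_a == as_sjis.bytes.to_a)  # identical byte arrays
same_string = (as_eucjp == as_sjis)                        # but different encodings

# The exception: ASCII-only strings in ASCII compatible encodings.
ascii_utf8  = "abc".dup.force_encoding("UTF-8")
ascii_eucjp = "abc".dup.force_encoding("EUC-JP")
ascii_equal = (ascii_utf8 == ascii_eucjp)

p same_bytes, same_string, ascii_equal
```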

In string concatenation, Ruby raises Encoding::CompatibilityError when the encodings of the two strings are not the same. However, when both encodings are ASCII compatible and at least one of the two strings is ASCII only, concatenation is possible. Concatenation with an empty string is also always possible, whatever the encodings are.
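
A minimal sketch of the three concatenation cases. The encodings used here (UTF-8, EUC-JP, UTF-16LE) are chosen only for illustration.

```ruby
utf8  = "\u3042"
eucjp = "\u3042".encode("EUC-JP")

# Two non-ASCII strings in different encodings cannot be joined.
mixed_error = begin
  utf8 + eucjp
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# An ASCII-only string in an ASCII compatible encoding can be joined.
ascii_only = "abc".encode("EUC-JP")
joined = utf8 + ascii_only

# An empty string can always be joined, even from an ASCII incompatible encoding.
empty_utf16 = "".encode("UTF-16LE")
with_empty = utf8 + empty_utf16

p mixed_error, joined.encoding, with_empty.encoding
```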

So far, I have talked about a String as an array of characters. Now I’m going to pick up another side of a String: a String as a byte array. Unlike Ruby 1.8, Ruby 1.9 has only limited support for Strings as byte arrays and provides just three methods: "String#getbyte(index)" to read a byte, "String#setbyte(index, value)" to write a byte, and “String#bytesize” to get the byte length. Proposals for more sophisticated features are welcome.
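
The three byte-level methods in use, as a short sketch with an arbitrary ASCII sample:

```ruby
data = "abc".dup          # dup so that the String can be modified in place

first = data.getbyte(0)   # read the byte at index 0
data.setbyte(0, 0x41)     # overwrite it with the byte for "A"
size = data.bytesize      # length of the byte array

p first, data, size
```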

"String#force_encoding" is a destructive method to force String to change its encoding. This method is useful when we create a new String of another desired encoding by combining with "Object#dup" without modifying a byte array.

"String#force_encoding" should be sparsely used since Strings have already had appropriate encodings assigned when those are created, or read from files specifying the encoding. The method might be useful when we need network library or XML library in which encodings are managed out of the Ruby world. For example, encodings set in HTTP headers or XML declarations are. Or, we can use str.force_encoding("ASCII-8BIT"), when we want to start using String as a byte array, which was treated as an array of characters before.

If you need to use "String#force_encoding" in your library, you should reconsider your library design. You should not use this method thoughtlessly; a correct design does not need it. To warn people against using it casually, the method was named force_encoding rather than set_encoding or encoding=, which stresses that it is destructive.
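
For the legitimate case of switching to a byte-array view, the combination of dup and force_encoding can be sketched like this. The sample string and variable names are only for illustration.

```ruby
original = "\u3042\u3044"   # two characters in UTF-8

# dup first so that the original keeps its encoding; only the label on the copy changes.
binary = original.dup.force_encoding("ASCII-8BIT")

p original.encoding, original.length   # still UTF-8, 2 characters
p binary.encoding, binary.length       # ASCII-8BIT, 6 one-byte "characters"
p original.bytes.to_a == binary.bytes.to_a   # the byte array itself is untouched
```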

"String#valid_encoding?" judges a String whether it has a correct byte structure in terms of the encoding assigned to it. We can use this method to know that the String has a right byte structure of the String’s encoding; however, we can’t know that every character in the String is defined in the assigned encoding. To know the character in the String is defined in the encoding, we should try conversion using Encoding::Converter.

"String#gsub(pattern, hash)" is one of new methods of Ruby 1.9.1. Before this method has been added, we need to use a block when we want to replace a character by matched one. The problem was that using a block is costly.

Internally, "str.gsub(pattern, hash)" works the same way as "str.gsub(pattern){hash[$&]}" does, but the new form is much faster because it runs mainly in the C layer. Thus, it is expected to be especially effective for escaping specified characters.
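
An escaping example along these lines, as a sketch. The escape table is a made-up minimal one for HTML-like text; it is not taken from the original article.

```ruby
# Hypothetical escape table used only for this example.
escapes = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }
html = "a < b && c > d"

with_hash  = html.gsub(/[&<>]/, escapes)          # the Ruby 1.9.1 form
with_block = html.gsub(/[&<>]/) { escapes[$&] }   # the equivalent block form

puts with_hash
puts with_hash == with_block
```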

Although "String#inspect" and "String#dump" are not changed in Ruby 1.9, I’m going to describe them here to clarify their usage. "String#inspect" is meant for seeing what a String is at a glance. When we want to escape or dump Strings, "String#dump" and "Marshal.dump" work as expected.

* String#inspect #=> an easy way to check a String using p
* String#dump #=> for dumping (str == eval(str.dump) is guaranteed)
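
The round-trip guarantee of dump can be checked with a short sketch. The exact text printed by inspect and dump varies between Ruby versions and environments, so no expected output is shown for them.

```ruby
str = "\u3042\tb\n"

puts str.inspect   # for a quick look; not meant to be parsed back
puts str.dump      # escaped form containing only ASCII characters

# dump is designed so that evaluating its result gives the original String back.
round_trip = eval(str.dump)
p round_trip == str
```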

Many people might not be aware that Ruby’s regular expressions have been encoding sensitive since older versions. For example, in Ruby 1.8, /a/e.kcode returns “euc.” However, the regular expression implementation in Ruby 1.8 lagged behind other languages because it was based on GNU regex. The old implementation could handle only SJIS, EUC, and UTF8 encodings; besides, it did not have lookbehind.

Ruby 1.9 uses a regex engine equivalent to Oniguruma 5.9.1. This new engine offers richer syntax than before, including lookbehind and named capture groups; moreover, subexpression calls make it possible to match patterns described by context-free grammars.
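
The three features can be sketched as follows. The patterns and sample strings are invented for illustration; the balanced-parentheses pattern is one common way to show a recursive subexpression call.

```ruby
# Lookbehind: match digits only when they follow "price: ".
price = "price: 100 yen"[/(?<=price: )\d+/]

# Named capture groups, read back from the MatchData.
date = /(?<year>\d{4})-(?<month>\d{2})/.match("2009-07")
year_and_month = [date[:year], date[:month]]

# Subexpression call: \g<paren> calls the group named "paren" recursively,
# which lets the pattern describe nested parentheses.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
nested_ok  = !balanced.match("(a(b)c)").nil?
nested_bad = !balanced.match("(a(b").nil?

p price, year_and_month, nested_ok, nested_bad
```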

Regular expression matching is strictly encoding sensitive. That is, “/./” matches any character except a newline if the encodings of the regular expression and the string are the same; otherwise, we get Encoding::CompatibilityError.
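
A sketch of a mismatch, assuming a regular expression that contains a non-ASCII character (which ties it to UTF-8) and the same text in two encodings:

```ruby
utf8_re    = Regexp.new("\u3042.")         # contains "あ", so it is tied to UTF-8
utf8_text  = "\u3042\u3044"
eucjp_text = utf8_text.encode("EUC-JP")

same_encoding = (utf8_re =~ utf8_text)     # both UTF-8: matches at index 0

different_encoding = begin
  utf8_re =~ eucjp_text
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

p same_encoding, different_encoding
```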

Since a Regexp object is immutable, Regexp#force_encoding does not exist; use Regexp.new(reg.source.force_encoding(enc)) instead. Likewise, when you want to convert a regular expression to another encoding, use Regexp.new(reg.source.encode(enc)) instead of the non-existent Regexp#encode. We can use Regexp.new for ASCII incompatible encodings, too. Be careful: you might get unexpected results if the regular expression uses code points. This happens when the regular expression depends on the order of the code points and is converted into a different encoding.
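
Converting a regular expression to another encoding by rebuilding it from its source, as a sketch. The pattern is built with Regexp.new from a String so that its source holds the actual character rather than an escape sequence.

```ruby
utf8_re  = Regexp.new("\u3042+")                          # UTF-8 regular expression
eucjp_re = Regexp.new(utf8_re.source.encode("EUC-JP"))    # rebuilt in EUC-JP

eucjp_text = "\u3044\u3042\u3042".encode("EUC-JP")        # "いああ" in EUC-JP
position = (eucjp_re =~ eucjp_text)                       # character index of the match

p eucjp_re.encoding, position
```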

Ruby’s regular expressions have an idea similar to ASCII ONLY. When a Regexp object has an ASCII compatible encoding and contains only ASCII characters, it can match any String object whose encoding is ASCII compatible. In this case, “Regexp#fixed_encoding?” returns false.
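
The ASCII-only case can be sketched like this, with one pattern matched against the same text in two encodings:

```ruby
ascii_re = /abc/                       # ASCII-only pattern
fixed_re = Regexp.new("\u3042")        # non-ASCII pattern, tied to UTF-8

utf8_text  = "\u3042abc"
eucjp_text = utf8_text.encode("EUC-JP")

in_utf8  = (ascii_re =~ utf8_text)     # character index 1
in_eucjp = (ascii_re =~ eucjp_text)    # also matches, at character index 1

p ascii_re.fixed_encoding?, fixed_re.fixed_encoding?, in_utf8, in_eucjp
```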

In old Ruby versions, we used $&, Regexp.last_match, $1, and $2 to work with matched strings in a program. We also assigned the value returned from String#match to a variable.

Named capture syntax makes this possible without those traditional means: a matched value can be assigned directly to a local variable. When the regular expression literal on the left side of “=~” has named capture groups and does not contain dynamic expansion such as #{}, each captured string is assigned to a local variable after matching. The local variable takes the name of the named capture group, so that name must be valid as a local variable name.
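
A sketch of the assignment, with a contrasting case where the regular expression is held in a variable and therefore is not a literal on the left side of =~. The sample strings and names are arbitrary.

```ruby
# A literal with named groups on the left side of =~ assigns local variables.
if /(?<year>\d{4})-(?<month>\d{2})/ =~ "released 2009-01"
  release = [year, month]
end

# A Regexp stored in a variable matches the same way but assigns nothing.
pattern = /(?<day>\d+)/
pattern =~ "31"
day_status = defined?(day)   # nil, because no local variable named day was created

p release, day_status
```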

It is probably not common to match strings against a regular expression that contains escaped byte sequences. This does not work between ASCII-8BIT and UTF-8 because the encodings are not compatible. The reason is that such matching is not a string operation but a byte operation: the regular expression is meant to be compared against byte arrays, and the byte array might even be illegal as a string. Thus, we should convert both sides of the expression into ASCII-8BIT before the operation.

The second example below is possible because ASCII-8BIT is an ASCII compatible encoding. We can regard it as regular expression matching of ASCII ONLY characters. If the encoding is not ASCII compatible, the result will be ArgumentError, as in the first example.
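
The listings this passage refers to are not reproduced here. As an independent sketch of the same idea, the following matches a single byte with a binary (/n) regular expression. It assumes that the string is relabeled as ASCII-8BIT with force_encoding; the exact exception class raised by the mismatched case has differed between Ruby versions, so the sketch only records it.

```ruby
text    = "\u3042"   # "あ" in UTF-8, bytes E3 81 82
byte_re = /\xE3/n    # binary regular expression for one byte

# Mismatched: a byte-oriented pattern against a UTF-8 string of characters.
char_error = begin
  byte_re =~ text
  nil
rescue StandardError => e
  e.class
end

# Both sides as ASCII-8BIT: a plain byte match.
as_bytes = text.dup.force_encoding("ASCII-8BIT")
byte_match = (byte_re =~ as_bytes)

p char_error, byte_match
```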

The IO class is encoding sensitive, so we should pay attention to whether its return values are strings or byte arrays. To use the IO class, we should know the ideas of “external encoding” and “internal encoding,” which determine how encodings are set on strings and how strings are converted.

The external and internal encodings determine which encoding the IO class sets and which conversion it performs. When the internal encoding is not given, the external encoding is applied to the String objects read. See the table below for details.

We can set the external and internal encodings in three ways: in the mode string given as the second argument of open, in an option Hash given as the third argument of open, or with IO#set_encoding after the stream is opened.
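
The three ways can be sketched with File.open on a temporary file. The file name, its EUC-JP content, and the variable names are assumptions made for this example.

```ruby
require "tmpdir"

mode_string = option_hash = after_open = encodings = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "eucjp.txt")
  # Write the EUC-JP bytes of "あ" in binary mode so that nothing is converted.
  File.open(path, "wb") { |f| f.write("\u3042".encode("EUC-JP")) }

  # 1. Mode string: "r:external:internal"
  mode_string = File.open(path, "r:EUC-JP:UTF-8") { |f| f.read }

  # 2. Option Hash as the third argument
  option_hash = File.open(path, "r", :external_encoding => "EUC-JP",
                                     :internal_encoding => "UTF-8") { |f| f.read }

  # 3. IO#set_encoding after opening
  after_open = File.open(path, "r") do |f|
    f.set_encoding("EUC-JP", "UTF-8")
    encodings = [f.external_encoding, f.internal_encoding]
    f.read
  end
end

p mode_string.encoding, option_hash.encoding, after_open.encoding, encodings
```

In all three cases the bytes on disk are EUC-JP and the String handed to the program is UTF-8.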

The methods of the IO class fall into four categories based on the kind of data they handle: bytes, characters, byte arrays, and strings.

The character methods IO#getc, IO#ungetc, and IO#readchar have been changed to handle a character as a String, unlike old versions in which it was a Fixnum. IO#each_char also handles characters. The return values of these methods follow the encoding conversion rules illustrated in the table above.

Since IO#getc now handles characters, IO#getbyte and IO#readbyte have been added to the IO class. IO#each_byte also works on bytes.

When we want to work with byte arrays, IO#binread, IO#read(size), IO#read_nonblock, IO#readpartial, and IO#sysread are the methods to use. The encoding of the returned byte array is always ASCII-8BIT.
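
The character, byte, and byte-array categories can be seen on one stream. This sketch assumes a UTF-8 file with no internal encoding in effect (Encoding.default_internal is nil); the file name is made up.

```ruby
require "tmpdir"

char = byte = chunk = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "utf8.txt")
  File.open(path, "wb") { |f| f.write("\u3042\u3044\u3046") }   # "あいう" as UTF-8 bytes

  File.open(path, "r:UTF-8") do |f|
    char  = f.getc       # one character, returned as a String
    byte  = f.getbyte    # one byte, returned as an Integer
    chunk = f.read(2)    # two bytes, returned as an ASCII-8BIT String
  end
end

p char, byte, chunk.encoding, chunk.bytes.to_a
```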

The encoding applied by the string methods depends on a combination of several factors. An example of this kind is IO#read without a size argument.

IO#external_encoding and IO#internal_encoding return the external and internal encodings, respectively. The returned encodings are the ones used to decide conversions. Be careful: these methods do not simply return the external and internal encodings stored in the IO object.

The encoding of a file path is normally determined by the filesystem encoding of the platform, so how an encoding is applied to a file path, and which one, varies from platform to platform. Note that Ruby does not provide any API to get the filesystem encoding; thus, there is no Ruby API to get the encoding of a file path either.

On Unix operating systems, we can’t determine the filesystem encoding in general. Ruby returns the byte array of a filename with the filesystem encoding, or the encoding specified by a command-line option, set on it. Ruby does not convert the byte array at this point; likewise, when Ruby hands a filename over to the system, it passes the byte array of the String object as it is.

The filesystem encoding of Mac OS X is UTF8-MAC. Consequently, Ruby returns a filename encoded in UTF8-MAC or in the encoding specified by a command-line option. When Ruby gives OS X a filename, Ruby converts it into UTF8-MAC if its encoding is other than ASCII-8BIT, and does no conversion if its encoding is ASCII-8BIT. Please keep in mind that this behavior may change in the future, since the design is not stable. (It may become like what Ruby does on Unix.)

When the ANSI API is used in the Ruby implementation, strings handed from the system to the Ruby runtime are encoded in the ANSI or OEM code pages. Thus, Ruby returns a filename as a string in the filesystem encoding by default. If a specific encoding is given to the Ruby runtime by a command-line option, Ruby returns the filename string in the specified encoding. When Ruby passes a filename to the operating system, it behaves in one of two ways. If the filename string has ASCII-8BIT as its encoding, Ruby gives the system the string without any conversion, since it should be a byte array. Otherwise, Ruby gives the filename after converting it into the filesystem encoding.

For lack of development resources, Ruby 1.9.1 behaves just as I described here. However, Ruby 1.9.2 is planned to use the Unicode API instead of the ANSI API. Once the Ruby implementation uses the Unicode API, Ruby will return filenames encoded in Unicode when a command-line option specifies one of the Unicode encodings. In this case, filenames are never converted into the ANSI or OEM code pages. In the same way, Ruby will hand a filename to the system without conversion, that is, still encoded in Unicode, if one of the Unicode encodings is specified. The advantage of this new design is that Ruby can cover a wider range of characters, which could not be handled correctly with an encoding derived from the system locale.

$KCODE has been deprecated in Ruby 1.9. If you have programs that depend on $KCODE, you need to modify them to work under the new Ruby M17N design. The substitute, Encoding.default_internal, is ready in Ruby 1.9.1; however, be aware that the default value of Encoding.default_internal is nil. Besides, once a value is set to Encoding.default_internal, IO objects convert encodings automatically.
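
The automatic conversion can be observed with a sketch like the following. It changes the process-wide Encoding.default_internal, so the previous value is saved and restored; the file name and contents are assumptions for this example.

```ruby
require "tmpdir"

saved = Encoding.default_internal
converted = nil

begin
  # From here on, IO converts what it reads into UTF-8 unless told otherwise.
  Encoding.default_internal = "UTF-8"

  Dir.mktmpdir do |dir|
    path = File.join(dir, "eucjp.txt")
    File.open(path, "wb") { |f| f.write("\u3042".encode("EUC-JP")) }

    # Only the external encoding is given; the internal one comes from the default.
    converted = File.open(path, "r:EUC-JP") { |f| f.read }
  end
ensure
  Encoding.default_internal = saved   # restore the global setting
end

p converted.encoding, converted == "\u3042"
```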

The ideas of “replica encoding” and “base encoding” in Ruby 1.9.0 have been deprecated in Ruby 1.9.1. These ideas were originally invented to share implementations between encodings that have the same byte structure. Ruby developers thought the ideas could also be used to define supersets and subsets of character sets.

However, a flaw in the two ideas came to light. Defining supersets and subsets of encodings based on similarity of byte structure worked well only for the EUC-JP and Shift_JIS encodings; many other encodings needed to be newly defined. Finally, Ruby developers admitted that the ideas were insufficient and decided to remove them from Ruby 1.9.1.

Since the C API still has the ideas of replica and base encodings, we can see there what those ideas were.

The -K command-line option is no longer recommended, although it is still available. With the -K option in Ruby 1.9, we can set the default values of the script encoding and Encoding.default_external. If no encoding is given by the -K option, Ruby uses US-ASCII as the default script encoding. Encoding.default_internal is not affected by the -K option, so it remains nil.

The -K command-line option has survived for backward compatibility. It is useful when we want to run Ruby 1.8 code on Ruby 1.9 without any modification. Since -K is not recommended, we should write the magic comment in each script file instead. The magic comment is the best way to make scripts run on both 1.8 and 1.9, especially for new scripts.

Every Ruby version has been released with the character code conversion libraries Kconv, NKF (the implementation of Kconv), and Iconv. However, these libraries have a flaw: the encodings they support are limited. In addition, they are platform dependent. To fix these issues, the transcode library written by Martin Dürst is bundled in Ruby 1.9. The String#encode method and the Encoding::Converter class are newly defined on top of the transcode library.

Ruby 1.9’s encoding conversion library rewrites both the byte array of a String and the encoding assigned to it. When we want to change only the encoding of the String, String#force_encoding is the method to use.

String#encode and String#encode! are the most basic encoding conversion tools. These methods accept options as a Hash. With them we can specify what to do when an illegal byte sequence is found in the original String (:invalid => nil | :replace) or when a character undefined in the destination encoding is found (:undef => nil | :replace). We can also use them to escape converted strings for embedding in XML documents (:xml => :text | :attr) and to replace non-LF newline characters with a line feed (LF).
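
The options can be tried out with a sketch like this one. US-ASCII is used as the destination only because it makes undefined characters easy to produce; the :replace => "?" option, which names the replacement string, and :universal_newline => true are real String#encode options that the paragraph above does not spell out.

```ruby
hiragana = "\u3042"

# Without options, an undefined character raises an exception.
no_option = begin
  hiragana.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  e.class
end

# :undef => :replace substitutes a replacement character ("?" for US-ASCII).
undef_replaced = hiragana.encode("US-ASCII", :undef => :replace)

# :invalid => :replace handles broken byte sequences in the source (0xFF is never valid UTF-8).
broken = [0x61, 0xFF, 0x62].pack("C*").force_encoding("UTF-8")
invalid_replaced = broken.encode("US-ASCII", :invalid => :replace, :replace => "?")

# :xml => :text escapes markup characters and writes undefined characters as references.
xml_text = "<\u3042>".encode("US-ASCII", :xml => :text)

# :universal_newline => true turns CRLF and CR into LF.
unix_newlines = "a\r\nb\rc".encode("US-ASCII", :universal_newline => true)

p no_option, undef_replaced, invalid_replaced, xml_text, unix_newlines
```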

When we create an instance of the Encoding::Converter class, we can give its constructor either a source and a destination encoding or an array that specifies the conversion path. In addition, the following constants can be set in the option field of the constructor, as can the conversion options used by String#encode.

Encoding::Converter#convert has the distinctive ability to convert a String piece by piece, treating each given String as one part of a longer input. Thus, when we convert strings read from a stream, we don’t need to worry that a chunk boundary may cut a character into what looks like an invalid byte sequence.
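
A sketch of chunked conversion, assuming a stream that happens to split the UTF-8 bytes of “あ” across two reads. The chunk boundaries are contrived for illustration.

```ruby
ec = Encoding::Converter.new("UTF-8", "EUC-JP")

# "あa" in UTF-8 is E3 81 82 61; the first chunk ends in the middle of "あ".
chunks = [[0xE3, 0x81], [0x82, 0x61]].map do |bytes|
  bytes.pack("C*").force_encoding("UTF-8")
end

# The incomplete character is held inside the converter until the rest arrives.
pieces = chunks.map { |chunk| ec.convert(chunk) }
pieces << ec.finish   # flush whatever remains at the end of the stream

result = pieces.join
p pieces.first, result.bytes.to_a
```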

If an invalid byte sequence is found while Encoding::Converter#convert is running, Ruby raises Encoding::InvalidByteSequenceError. For an undefined character, Ruby raises Encoding::UndefinedConversionError. Encoding::Converter#convert cannot resume the conversion once such an exception is raised. Instead, we can use the Encoding::Converter#primitive_convert method described next to escape invalid and undefined characters, or to specify detailed behaviors.

Encoding::Converter#primitive_convert is the one and only method for specifying fine-grained behaviors of a converter. Using this method, we can keep portability while specifying how invalid and undefined characters are handled in the conversion.

Above is an example that converts characters while escaping invalid characters and characters undefined in the destination encoding. As in the code, we branch on the return value and can move forward using the information stored in Encoding::Converter#primitive_errinfo.
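
The original listing is not reproduced here. One way such a loop could look is sketched below; the helper name convert_with_escapes and the escape formats ("<U+XXXX>" and "?") are hypothetical choices for this example, and the error character's code point is only meaningful when the source encoding is a Unicode one.

```ruby
# Hypothetical helper: converts text and escapes whatever cannot be converted.
def convert_with_escapes(text, from, to)
  ec  = Encoding::Converter.new(from, to)
  src = text.dup   # primitive_convert consumes the source buffer
  dst = "".dup     # and appends the converted bytes to the destination buffer

  loop do
    case ec.primitive_convert(src, dst)
    when :finished
      break
    when :undefined_conversion
      # primitive_errinfo returns
      # [result, source encoding name, destination encoding name, error bytes, readagain bytes]
      _result, enc_name, _dst_name, error_bytes, _readagain = ec.primitive_errinfo
      code = error_bytes.dup.force_encoding(enc_name).ord
      ec.insert_output("<U+%04X>" % code)
    when :invalid_byte_sequence, :incomplete_input
      ec.insert_output("?")
    else
      raise "unexpected result from primitive_convert"
    end
  end

  dst
end

# "a", "あ", "b", an invalid byte 0xFF, and "c", labeled as UTF-8
input = [0x61, 0xE3, 0x81, 0x82, 0x62, 0xFF, 0x63].pack("C*").force_encoding("UTF-8")
escaped = convert_with_escapes(input, "UTF-8", "US-ASCII")
puts escaped
```

Unlike Encoding::Converter#convert, the loop keeps going after each error, because the converter stays usable once a replacement has been inserted with insert_output.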

Iconv is a wrapper of the character conversion library bundled with Unix platforms. The encodings it supports and its behaviors depend on the Unix distribution, so we should use the transcode based methods in Ruby 1.9.1.

If you are planning to create a library that works on Ruby 1.9, you should write its code in US-ASCII only. You should keep your library from relying on a specific encoding or character set, although non-ASCII String literals can be used with the magic comment. When you need to use a specific encoding in your library, you should look at the encoding of the given method argument. For example, when you write a Hiragana to/from Katakana conversion library, you should avoid writing a Hiragana/Katakana map in the source. Instead, you should convert using the corresponding encoding, based on the encoding of the method argument. I recommend not modifying the original strings. Your library should choose an output encoding according to the priority below:

* The encoding of a user-supplied argument, if one is given
* Encoding.default_internal

String#encode will do this by default like so:
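
A sketch of this priority, assuming a hypothetical library method named to_output_encoding. With no argument, String#encode itself converts to Encoding.default_internal when that is set and otherwise leaves the String's encoding as it is.

```ruby
# Hypothetical library helper: returns a new String in the output encoding,
# leaving the original untouched.
def to_output_encoding(str, requested = nil)
  if requested
    str.encode(requested)   # 1. the encoding supplied by the user
  else
    str.encode              # 2. Encoding.default_internal, if it is not nil
  end
end

source = "\u3042"
explicit = to_output_encoding(source, "EUC-JP")
implicit = to_output_encoding(source)

p explicit.encoding, implicit.encoding, source.encoding
```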

* Choose one encoding for the UCS model (UTF-8, EUC-JP, or whatever you like).
* Set the chosen encoding to Encoding.default_internal.
* Convert characters into UCS only if they are not already UCS.
* Handle characters and bytes separately, as in the CSI model.

* Ruby 1.8 could have only one encoding for its internal processing. Thus, the choice should be the UCS model only.
* Since automatically converted results would often cause trouble in later processing, Encoding.default_internal should be nil.
* You should use Ruby 1.9’s string processing methods. If you want to use them on Ruby 1.8.7 or older, you need to define them yourself.
* Also, you should use Ruby 1.9’s byte processing methods. If you want to use them on Ruby 1.8.7 or older, you need to define them yourself.

3 comments:

and Can I link to this article from the original article for English readers?

> YARV is, now, a bundled interpreter.
"YARV is, now, the Ruby VM."
YARV is not interpreter, it is a VM.

> Enumerator in its standard API
"Enumerator in its core"
Enumerator is moved from bundled library to core.

> U.S. or Eastern Europe
"U.S. or Eastern Europe" or "U.S. or Western Europe"

> was o often
typo

> for example, by using Ruby’s gettext library
"for example, gettext library"
This is not Ruby's case.

> Derivation of I18N, L10N, and M17N
http://www.i18nguy.com/origini18n.html add this link

> The UCS normalization model unifies all internal character codes on earth and defines the only one character set, which is called Universal Character Set.
'unify character codes on earth' is not in this model.
This model only says 'the system uses only one code'.
(although in many cases the code is internationalized)
And the term 'Universal Character Set' doesn't mean Unicode, it is a common noun in this context.
For example, TRON doesn't use Unicode.