Ruby 1.9 Encodings

Posted by Mobomo on January 03, 2011

When I came to Ruby 1.9, the first problem I ran into was encodings. Gregory Brown said, in a training session at the Lone Star RubyConf, “Ruby 1.8 works in bytes. Ruby 1.9 works in characters.” In Ruby 1.8 you deal with the raw bytes yourself, and the language gives you little help with encodings. In Ruby 1.9, I think you must understand how encodings work to make your life easier. Let us walk through the encodings in Ruby 1.9 by example.

The Source File Encoding

The source file encoding is the character encoding of a given source file. In Ruby 1.9 it is US-ASCII by default. When you create a String literal in your code, it is assigned the encoding of your source file. So you have to change the source encoding when you want to place any non-ASCII content in a String literal.

$ cat no_encoding.rb
p "中文".encoding

$ ruby no_encoding.rb
no_encoding.rb:1: invalid multibyte char (US-ASCII)

$ cat encoding.rb
#!ruby19
# encoding: utf-8
p "中文".encoding

$ ruby encoding.rb
#<Encoding:UTF-8>

As you can see with no_encoding.rb, Ruby raises “invalid multibyte char (US-ASCII)” when the source file contains a Chinese string. That is because Ruby 1.9 falls back to US-ASCII when no encoding is specified. Once the encoding is declared with the magic encoding comment, it works.
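
As a minimal sketch of the same idea, the snippet below compares the encoding of a non-ASCII literal with `__ENCODING__`, which reports the source encoding of the current file. The variable names are only illustrative, and the magic comment must be the first line of the file (or the second, after a shebang) to take effect. Note that Ruby 2.0 and later default to UTF-8 for source files, so the comment only changes behavior on 1.9.

```ruby
# encoding: utf-8

# A non-ASCII literal is tagged with the source encoding of this file.
literal = "中文"

# __ENCODING__ returns the source encoding of the file being parsed.
source_encoding = __ENCODING__

puts source_encoding
puts literal.encoding == source_encoding
```

If the magic comment named a different encoding, both values would change together, because the literal simply inherits whatever the file declares.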

The String Encoding

Each string has its own encoding, which you can access with the String#encoding method:

ruby-1.9.2-head > string = "中文"
=> "中文"

ruby-1.9.2-head > string.encoding
=> #<Encoding:UTF-8>
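
To make the “bytes versus characters” distinction concrete, here is a small sketch (variable names are illustrative) that looks at the same string once as UTF-8 text and once as raw bytes. It assumes the source file is UTF-8, as declared by the magic comment.

```ruby
# encoding: utf-8

string = "中文"

# With its UTF-8 tag, the string is counted in characters.
char_count = string.length      # 2 characters
byte_count = string.bytesize    # 6 bytes, 3 per character in UTF-8

# force_encoding changes only the tag, not the bytes. Tagged as BINARY,
# every byte counts as one "character".
raw = string.dup.force_encoding("BINARY")
raw_count = raw.length          # 6

puts char_count
puts byte_count
puts raw_count
```

The bytes never change here; only the encoding attached to the String object decides how Ruby groups them into characters.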

You can transcode the string into a different encoding by using String#encode:

ruby-1.9.2-head > string_in_gb2312 = string.encode("GB2312")
=> "\x{D6D0}\x{CEC4}"
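
A short sketch of what transcoding does under the hood, assuming a UTF-8 source file: the characters stay the same while the underlying bytes change, and encoding back to UTF-8 restores the original string. The variable names are illustrative.

```ruby
# encoding: utf-8

string = "中文"
string_in_gb2312 = string.encode("GB2312")

# Same two characters, different byte sequences.
utf8_bytes   = string.bytes.to_a            # [228, 184, 173, 230, 150, 135]
gb2312_bytes = string_in_gb2312.bytes.to_a  # [214, 208, 206, 196]

# Transcoding back gives a string equal to the original.
round_trip = string_in_gb2312.encode("UTF-8")

puts utf8_bytes.inspect
puts gb2312_bytes.inspect
puts string_in_gb2312.length
puts round_trip == string
```

Unlike force_encoding, String#encode rewrites the bytes so that they mean the same characters in the target encoding.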

But the transcoding will fail if the encoding does not support all characters in your string:

ruby-1.9.2-head > string_in_ascii = string.encode("us-ascii")
Encoding::UndefinedConversionError: U+4E2D from UTF-8 to US-ASCII
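
If a failed conversion should not abort your program, one possible approach is to rescue the error, or to ask String#encode to substitute a placeholder for characters the target encoding cannot represent. The sketch below shows both, using the `:undef` and `:replace` options of String#encode. The choice of "?" as the placeholder is just an assumption for illustration.

```ruby
# encoding: utf-8

string = "中文"

# Option 1: rescue the conversion error and inspect it.
error_message = nil
begin
  string.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  error_message = e.message
end

# Option 2: replace every character that US-ASCII cannot represent.
replaced = string.encode("US-ASCII", :undef => :replace, :replace => "?")

puts error_message
puts replaced
```

Replacement loses information, so it is best kept for display or logging rather than for data you need to round-trip.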

The External Encoding

The encoding of the data in an IO stream is known to Ruby as the object’s external encoding. The default external encoding is pulled from your environment.

ruby-1.9.2-head > Encoding.default_external
=> #<Encoding:UTF-8>

Here is how the external encoding works:

ruby-1.9.2-head > f = File.open("example.txt")
=> #<File:example.txt>

ruby-1.9.2-head > f.external_encoding
=> #<Encoding:UTF-8>

ruby-1.9.2-head > content = f.read
=> "这是一些示范文本"

ruby-1.9.2-head > content.encoding
=> #<Encoding:UTF-8>
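
The session above can be reproduced as a script. This is only a sketch: it writes the sample text to a temporary file (standing in for example.txt) and assumes that Encoding.default_internal is not set, so the data is tagged with the external encoding and not transcoded. On a machine whose environment is not UTF-8, the reported encoding will differ from the session above.

```ruby
# encoding: utf-8
require "tempfile"

sample = "这是一些示范文本"

# Write the raw UTF-8 bytes to a temporary file.
tmp = Tempfile.new("example")
tmp.binmode
tmp.write(sample)
tmp.close

# Open the file without naming an encoding, so the default applies.
file = File.open(tmp.path)
external = file.external_encoding
content = file.read
file.close
tmp.unlink

puts external
puts content.encoding
puts external == Encoding.default_external
```

The string returned by read carries the stream's external encoding as its tag, whether or not the bytes in the file really are in that encoding.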

If the file does not use the default external encoding, you can override it: