That's pretty much correct. Ruby's Unicode support is somewhat weak
compared to Python's or Perl's.
Only UTF-8 is supported. As far as I know, there is no support for UTF-16.
Basically, here's everything you wanted to know about Ruby's Unicode
support but were afraid to ask:
* $KCODE can be set to support an encoding directly, but this is *NOT*
needed to have a script work with Unicode.
It is just a simple shortcut so that any regex like /./ will do the right
thing.
* Even without $KCODE, regexes with Unicode support are available. This is
done using the /u language option, like
t =~ //u
or
Regexp.new(regex, options, 'u')
(The per-regex language options are /n, /e, /s and /u, for no encoding, EUC,
SJIS and UTF-8 respectively. Note that //m is the multiline option, which
makes . match newlines too; it does not select a multi-byte encoding. I
believe none of these flags are needed once $KCODE is set, as that will
already adjust all regexes.)
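A minimal sketch of the /u option in use. The variable names are only for illustration, and the string is written with byte escapes so the script itself stays plain ASCII:

```ruby
# "\xc2\xa9" is the two-byte UTF-8 encoding of U+00A9 (copyright sign).
text = "\xc2\xa9 2006"

# With /u, "." consumes the whole two-byte character, not just its first byte.
first = text[/\A./u]

# scan with /u splits the string into characters rather than bytes.
chars = text.scan(/./u)
```

Here first holds the two bytes of the copyright sign and chars has 6 elements, even though the string is 7 bytes long.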
* Support for u"" literals, like Python's, can be added to some extent very easily. See:
http://redhanded.hobix.com/inspect/closingInOnUnicodeWithJcode.html
This allows you to then do:
c = u'U+00a9' # same as \xc2\xa9
* You can also use:
[].pack('U*')
"".unpack('U*')
to pack/unpack UTF-8 strings (pack turns an array of code points into a
UTF-8 string, unpack does the reverse). This allows you to easily count
characters and iterate through them, without the need for jcode (which
really is only needed for getting succ to work).
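As a rough sketch of counting and iterating with pack/unpack (variable names are mine, and the string is written with byte escapes so no -Ku is needed to read the script):

```ruby
# UTF-8 bytes for "¿Habla español?".
question = "\xc2\xbfHabla espa\xc3\xb1ol?"

# unpack('U*') decodes the UTF-8 bytes into an array of code points.
codepoints = question.unpack('U*')

char_count = codepoints.size               # characters
byte_count = question.unpack('C*').size    # raw bytes

# Iterate character by character by re-packing each code point on its own.
chars = codepoints.map { |cp| [cp].pack('U') }

# Packing all code points again gives back the original UTF-8 bytes.
round_trip = codepoints.pack('U*')
```

For this string char_count is 15 while byte_count is 17, since the two accented characters take two bytes each.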
* jcode.rb is kind of a Ruby hack, and it is incomplete. Methods such as
reverse, capitalize, casecmp, swapcase, all the strip functions and probably
others are not redefined by it and will return incorrect results, depending
on the language.
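For methods jcode does not cover, the pack/unpack trick can fill the gap. A small sketch for reverse follows; utf8_reverse is a hypothetical helper name, not something jcode provides:

```ruby
# Reverse a UTF-8 string by code point instead of by byte.
# Note: this reverses code points, so combining sequences are not kept together.
def utf8_reverse(str)
  str.unpack('U*').reverse.pack('U*')
end

reversed = utf8_reverse("espa\xc3\xb1ol")   # UTF-8 bytes for "español"
```

The two bytes of the n-tilde stay together in the result, whereas a byte-wise reverse would swap them and leave an invalid sequence.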
* Ruby's $KCODE does not add a UTF-8 <-> Latin1 encoding conversion, unlike
Python's unicode strings. So, although with the above you can do:
question = u'U+00bfHabla espaU+00f1ol?' # ¿Habla español?
puts question
similar to Python's:
question = u'\u00bfHabla espa\u00f1ol?' # ¿Habla español?
print question
you will not get the corresponding Latin1 string when you print it (unlike
with Python's unicode strings).
* To properly do the above, and convert Latin1 <-> UTF-8 for printing, you
should use iconv. The arguments are the target encoding first, then the
source encoding:
ruby -riconv -e 'puts Iconv.iconv("UTF-8", "ISO-8859-1", "\xf1")'
Iconv, by default, does *NOT* get installed by the One-Click Windows
installer, even though it is supposed to be a
standard part of Ruby.
Adding something like:
require 'iconv'
class UString
  # Return the Latin1 form of the UTF-8 string (target encoding comes first).
  def to_s
    Iconv.iconv("ISO-8859-1", "UTF-8", self).join
  end
end
will do the trick for Why's UString class.
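Where iconv is not installed, the Latin1 case can also be handled with pack/unpack alone, because every Latin1 byte value is the Unicode code point of the same number. This is a swapped-in technique rather than what iconv does internally, and the two helper names are hypothetical:

```ruby
# Latin1 -> UTF-8: read each byte as a code point, then encode as UTF-8.
def latin1_to_utf8(str)
  str.unpack('C*').pack('U*')
end

# UTF-8 -> Latin1: only possible for code points up to U+00FF.
def utf8_to_latin1(str)
  codepoints = str.unpack('U*')
  if codepoints.any? { |cp| cp > 0xff }
    raise ArgumentError, 'string contains characters outside Latin1'
  end
  codepoints.pack('C*')
end

utf8   = latin1_to_utf8("\xf1")                             # n-tilde in Latin1
latin1 = utf8_to_latin1("\xc2\xbfHabla espa\xc3\xb1ol?")    # 15 Latin1 bytes
```

This only covers ISO-8859-1; for any other encoding iconv is still the way to go.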
* The Ruby interpreter should have no problem reading a UTF-8 .rb script
file, but you have to run it with the -Ku switch:
> ruby -Ku file.rb (or set RUBYOPT to -Ku, so ruby always runs with that)
Note, however, that Windows Notepad, when saving UTF-8 files, adds a valid
albeit meaningless 3-byte BOM (byte order mark, EF BB BF) at the start, which
does not work with Ruby 1.8 (and will also break Unix shebang lines on
most, if not all, Unixes). This sequence is valid UTF-8 and is allowed by the
Unicode standard, although it is neither required nor recommended. Ruby, just
like Unix shebang handling, does not deal with it appropriately.
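One workaround is to strip the BOM from the script before Ruby sees it. A minimal sketch, working on raw bytes; strip_utf8_bom and UTF8_BOM are hypothetical names of mine:

```ruby
# The three bytes Notepad writes at the start of a UTF-8 file.
UTF8_BOM = [0xEF, 0xBB, 0xBF]

# Return the data without a leading UTF-8 BOM, or unchanged if there is none.
def strip_utf8_bom(data)
  bytes = data.unpack('C*')
  return data unless bytes[0, 3] == UTF8_BOM
  stripped = bytes[3..-1].pack('C*')
  # Ruby 1.9+ tags strings with an encoding; keep the original tag there.
  stripped.force_encoding(data.encoding) if stripped.respond_to?(:force_encoding)
  stripped
end

cleaned = strip_utf8_bom("\xEF\xBB\xBFputs 1")
```

The same function could be applied to a file read in binary mode and the result written back before running it.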