On Tue, 18 Jan 2005, Yukihiro Matsumoto wrote:
| In message "Re: The face of Unicode support in the future"
| on Tue, 18 Jan 2005 12:08:34 +0900, Wes Nakamura <wknaka / pobox.com> writes:
|
| |Is there opposition to a separate unicode string class, that would
| |coexist with the current byte-based string class? I find a fixed-width
| |unicode-based string type to be much easier to deal with rather
| |than individual encodings. With the byte-based system you would have to
| |worry about the language of the text in each string, and check
| |encodings before doing something like a string compare.
|
| That's true in C strings (char* or wchar_t*), which you have to
| allocate by yourself, and handle them character-wise, but not for
| strings in Ruby with much higher abstraction in API. The lower level
| processing like allocation and resizing internal buffer, etc. are
| handled automagically.
|
Will this be efficient enough? When using a non-fixed-width encoding,
String#[] won't run in constant time.
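To make the concern concrete, here is a minimal sketch of what a character-indexed lookup has to do over UTF-8 bytes. The helper name utf8_byte_offset is hypothetical (it is not part of Ruby or of any proposal in this thread), and it assumes well-formed UTF-8 input. The point is only that the byte offset of character N cannot be computed without walking the N characters before it:

```ruby
# Hypothetical helper, not an existing Ruby method.
# Returns the byte offset of the char_index-th character in an array of
# UTF-8 bytes, or nil if the index is out of range.
# Assumes the input is well-formed UTF-8 (no validation is done).
def utf8_byte_offset(bytes, char_index)
  offset = 0
  char_index.times do
    return nil if offset >= bytes.size
    lead = bytes[offset]
    # The lead byte tells us how many bytes this character occupies.
    offset += if    lead < 0x80 then 1   # 0xxxxxxx: ASCII
              elsif lead < 0xE0 then 2   # 110xxxxx: 2-byte sequence
              elsif lead < 0xF0 then 3   # 1110xxxx: 3-byte sequence
              else                   4   # 11110xxx: 4-byte sequence
              end
  end
  offset < bytes.size ? offset : nil
end

bytes = [0x61, 0xE3, 0x82, 0xB9, 0x62]  # "a", U+30B9, "b" in UTF-8
p utf8_byte_offset(bytes, 2)            # => 4 (the "b" is the third character)
```

The loop runs once per preceding character, so indexing is O(n) in the index unless the implementation keeps some extra lookup structure alongside the bytes.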
Since how to support Unicode (and other character sets) is a problem
already facing the JRuby developers, I have a few questions
that go into more detail:
1. This method is mentioned:
String#encoding, returns a string specifying the encoding
But I haven't seen a setter mentioned. Is there also:
String#encoding=
I assume that setting the encoding would do nothing to the internal
representation of the string (based on char *), it would just affect
how methods that work on strings deal with characters, etc.
2. What is the default encoding for strings? What encoding would
String.new("") have #encoding set to?
3. Are literal strings assumed to be in a certain encoding (the encoding of
the script?), or can you specify an encoding at the time of creation?
"string in encoding \x{xxxx} of the script file" (#encoding automatically
set to script's encoding, xxxx taken as bytes in the same
encoding)
Specifying an encoding for a literal may not work since the
literal's encoding could conflict with the script's encoding.
This wouldn't work (if there were an encoding argument) since the
bytes of the literal do not correspond with the desired utf-16 characters:
String.new("\x{30b9} in script that's not utf-16", "utf-16")
This would work:
String.new("\x{e382b9} in script that's ascii", "utf-8")
([e3 82 b9] being the utf-8 equivalent of [30 b9] in utf-16).
(also see 3b)
Maybe it's just easier to assume that all string literals are in
the encoding of the script and any \x{} sequences represent bytes
in the same encoding.
3a. If there is a way of creating literal strings in other encodings,
is there also a way of creating literal regexes in other encodings?
(You could always create them as Regexp.new(string_in_some_encoding).)
3b. In \x{xxxx}, does the number have to be a 4-digit (hex) number?
How would you specify a utf-8 character, which can be more than 2 bytes?
Is the \x{} syntax basically \x{byte byte byte..}?
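The byte sequences in question 3 can be checked with Array#pack, which already exists today. This is only a sketch of the byte-level relationship, not of the proposed \x{} syntax or of an encoding-aware String.new. The "U" directive packs a code point as UTF-8, and "n" packs a 16-bit big-endian integer, which matches UTF-16BE only for code points in the Basic Multilingual Plane (an assumption that holds for U+30B9):

```ruby
codepoint = 0x30b9  # KATAKANA LETTER SU

# "U" packs a code point as UTF-8; "C*" unpacks the result as raw bytes.
utf8_bytes = [codepoint].pack("U").unpack("C*")

# "n" packs a 16-bit big-endian integer. This equals UTF-16BE only for
# code points below 0x10000 that are not surrogates.
utf16be_bytes = [codepoint].pack("n").unpack("C*")

to_hex = lambda { |bytes| bytes.map { |b| "%02x" % b }.join(" ") }

puts to_hex.call(utf8_bytes)     # prints "e3 82 b9"
puts to_hex.call(utf16be_bytes)  # prints "30 b9"
```

The same code point takes three bytes in one encoding and two in the other, which is why a fixed 4-digit \x{xxxx} cannot describe raw bytes in every encoding. A UTF-8 character can need up to four bytes.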
4. Will String#explode return an array of Fixnums, basically a byte array,
of the raw char * values?
This would mean that s.explode.size is not necessarily == s.size
5. When using String#[idx]= to set a single character, it must take as
an argument a string which has a size of 1 (i.e. one codepoint) but
internally (i.e. #explode) doesn't necessarily have a size of 1?
6. Right now there is Fixnum#chr. Will there be Array#chr(encoding) or
something similar? So you could do something like:
[ 0x30, 0xb9 ].chr("utf-16")
7. Will strings that, when converted to the same encoding, are identical,
give different results for #intern when left in different encodings?
What happens to an interned string with a binary encoding? Is it interned
based on the internal bytes of the string rather than the characters?
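To illustrate questions 4 and 5, here is a small sketch of the mismatch between byte count and character count, using String#unpack as a stand-in. It assumes that the proposed #explode would return raw bytes and that #size would count characters; neither is confirmed anywhere in this thread. "C*" unpacks bytes and "U*" unpacks UTF-8 code points:

```ruby
# Build a UTF-8 string holding "a" followed by U+30B9.
s = [0x61, 0x30b9].pack("U*")

raw_bytes   = s.unpack("C*")  # what a byte-level #explode might return (assumption)
code_points = s.unpack("U*")  # what a character-level view would see

p raw_bytes.size    # => 4
p code_points.size  # => 2
```

Under those assumptions, replacing the second character through String#[idx]= would swap in a one-character string that occupies three bytes internally.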
Wes