Community

YAUST - Yet Another Unified String Theory :)
Well, here's my proposal for cleaning up strings. I tried to
- be as practical as possible
- leave full control over encoding when one wants to have it
- remove any possible confusion as to what each type is
- allow efficiency where possible, without excessive effort
First, the proposed changes are listed, followed by rationale.
==============================
I. drop char and wchar
--
II.
create cchar (1-byte unsigned character of platform-specific encoding,
C-equivalent)
create utf8 (1 byte of UTF8)
create utf16 (2 bytes of UTF16)
leave dchar as is
--
III.
version(Windows) {
alias utf16[] string;
} else
version(Unix/Linux) {
alias utf8[] string;
}
add suffix ""s for explicitly specifying platform-specific encoding
(i.e. the string type), and make auto type inference default to that
same type (this applies to the auto keyword, not undecorated strings).
Add docs explaining that string is just a platform-dependent alias.
--
IV.
add the following implicit casts for interoperability
from: cchar[], utf8[], utf16[], dchar[]
to : cchar*, utf8*, utf16*, dchar*
all of them ensure 0-termination. If cchar is converted to any other
form, it becomes the appropriate Unicode character. In the reverse direction,
all unrepresentable characters become '?'. When runtime transcoding
and/or reallocation is required, make them produce a warning.
--
V.
add the following implicit (transcoding) casts
from: cchar[], utf8[], utf16[], dchar[]
to : cchar[], utf8[], utf16[], dchar[]
when runtime transcoding is required, make them produce a warning (i.e.
always, except when casting from T to T).
--
VI.
modify explicit casts between all the array and pointer types to
- transcode rather than paint
- use '?' for unrepresentable characters (applies to encoding into
cchar*/cchar[] only)
- not produce the warnings from above
--
VII.
create compatibility kit:
module std.srccompatibility.oldchartypes;
// yes, it should be big and ugly
alias utf8 char;
alias utf16 wchar;
--
VIII.
add the following methods to all 4 array types
utf8[] .asUTF8
utf16[] .asUTF16
dchar[] .asUTF32
cchar[] .asCchars
ubyte[] .asUTF8 (bool dummy) // I think there's no UTF-8 BOM
ubyte[] .asUTF16LE(bool includeBOM)
ubyte[] .asUTF16BE(bool includeBOM)
ubyte[] .asUTF32LE(bool includeBOM)
ubyte[] .asUTF32BE(bool includeBOM)
--
IX.
modify the ~ operator between the 4 types to work as follows:
a) infer the result type from context, as with undecorated strings
b) if calling a function and there are multiple overloads
b.1) if both operand types are known, use that type
b.2) if one is known and the other is an undecorated literal, use the known type
b.3) if neither is known or both are known, but different, bork
--
X.
Disallow utf8 and utf16 as a stand-alone var type, only arrays and
pointers allowed
========================
Point I. removes the confusion of "char" and "wchar" not actually
representing characters.
Point II. explicitly states that the strings are either UTF-encoded,
complete characters* or C-compatible characters.
Point III. makes the code
string abc="abc";
someOSFunc(abc);
someOtherOSFunc("qwe"s); // s only necessary if there is more than one
option
least likely to produce any transcoding.
Point IV. makes it nearly impossible to do the wrong thing and doesn't
require explicit casts when interfacing to C code, assuming the C
functions are declared properly (i.e. the correct one of the two 1-byte
types is declared). When used with literals, the 0 can be appended
compile-time, like it is now.
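The '?'-substitution and 0-termination rules of Point IV can be sketched in Python (illustrative only, not D; Python's errors="replace" happens to substitute '?' on encoding, matching the proposed behavior):

```python
# Encoding into a legacy 1-byte charset: unrepresentable characters become '?'.
data = "naïve ä".encode("ascii", errors="replace")
print(data)  # b'na?ve ?'

# Handing the buffer to C additionally requires 0-termination,
# as the implicit array-to-pointer casts would ensure:
c_string = data + b"\x00"
print(c_string[-1])  # 0
```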
Point V. makes it easier to use different types without explicit
casting, but will still produce warnings when transcoding happens. In
most cases it will be obvious anyway.
Point VI. breaks behavior of other array casts (which only paint), but
strings are getting special behavior anyway, and you can still paint via
void[], and even more importantly, if you need to paint between
UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong
in the first place.
Point VII. will make it somewhat easier to make the transition.
Point VIII. provides an alternative to casting and allows specifying
endianness when writing to network and/or files. The methods should be
compile-time resolvable when possible, so this would be both valid and
evaluated in compile time:
ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
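The endianness-aware methods of Point VIII can be sketched in Python (the .asUTFxx names are the proposal's; the codec names below are Python's, used here only to show what bytes each variant would produce):

```python
s = "GET / HTTP/1.0"

# asUTF8(false): plain UTF-8 bytes, no signature
utf8 = s.encode("utf-8")

# asUTF16LE/asUTF16BE with includeBOM=true: prepend the byte-order mark
utf16le_bom = b"\xff\xfe" + s.encode("utf-16-le")
utf16be_bom = b"\xfe\xff" + s.encode("utf-16-be")

print(utf16le_bom[:4])  # b'\xff\xfeG\x00'
print(utf16be_bom[:4])  # b'\xfe\xff\x00G'
```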
Point IX. allows concatenation of strings in different encodings without
significantly increasing the complexity of overloading rules, while also
not requiring an inefficient toUTFxx followed by concatenation (which
copies the result again).
Point X. prevents some invalid code:
- treating a UTF-8 code unit as a character
- treating a UTF-16 code unit as a character
- iterating over code units instead of characters
Note that it is still possible to iterate over the string using a cchar
and dchar, which actually do represent characters. Also note that for
I/O purposes, which are the only thing one should be doing with code
units, you can still paint the string as void[] or byte[] (or even
better, call one of the methods above), but then you give up the view
that it is a string and lose language support/special treatment.
So, what do you guys/gals think? :)
xs0
* note that even dchar[] still doesn't necessarily contain complete
characters, at least as seen by the user. For example, the letter
LATIN_C_WITH_CARON can also be written as LATIN_C + COMBINING_CARON, and
they are in fact equivalent as far as Unicode is concerned (afaik).
Splitting the string in between will thus produce a "wrong" result, but I
don't think D should include any kind of full Unicode processing, as
it's actually needed quite rarely, so that problem is ignored...

xs0 wrote:
<snip>
> III.
>
> version(Windows) {
> alias utf16[] string;
> } else
> version(Unix/Linux) {
> alias utf8[] string;
> }
>
> add suffix ""s for explicitly specifying platform-specific encoding
> (i.e. the string type), and make auto type inference default to that
> same type (this applies to the auto keyword, not undecorated strings).
> Add docs explaining that string is just a platform-dependant alias.
>
The idea (platform-independence) here is correct. :) The only thing is
that you _don't_ need to know which UTF implementation the current
compiler is using. If you are using Unicode to communicate with the user
and/or native D libraries, you don't need to do any string conversions -
they all use the same string representation, for god's sake.
> IV.
>
> add the following implicit casts for interoperability
>
> from: cchar[], utf8[], utf16[], dchar[]
> to : cchar*, utf8*, utf16*, dchar*
>
> all of them ensure 0-termination. If cchar is converted to any other
> form, it becomes the appropriate Unicode char. In the reverse direction,
> all unrepresentable characters become '?'. when runtime transcoding
> and/or reallocation is required, make them produce a warning.
You mean C/C++ interoperability?
Replacing all non-ASCII characters with '?'s means that we don't
actually want to support all the legacy systems out there. So it would
be impossible to write Unicode-compliant portable programs that
supported 'ä' on the Windows 9x/NT/XP command line without version() {}
-logic?
> V.
>
> add the following implicit (transcoding) casts
>
> from: cchar[], utf8[], utf16[], dchar[]
> to : cchar[], utf8[], utf16[], dchar[]
>
> when runtime transcoding is required, make them produce a warning (i.e.
> always, except when casting from T to T).
Again, the main reason for Unicode is that you don't need to transcode
between several representations all the time.
> VI.
>
> modify explicit casts between all the array and pointer types to
> - transcode rather than paint
> - use '?' for unrepresentable characters (applies to encoding into
> cchar*/cchar[] only)
> - not produce the warnings from above
>
> --
>
> VII.
>
> create compatibility kit:
>
> module std.srccompatibility.oldchartypes;
> // yes, it should be big and ugly
>
> alias utf8 char;
> alias utf16 wchar;
>
You know, sweeping the problem under the carpet doesn't help us much.
char/wchar won't get any better by calling them with a different name.
Still, char won't be able to store more than the first 128 Unicode code points.
> VIII.
>
> add the following methods to all 4 array types
>
> utf8[] .asUTF8
> utf16[] .asUTF16
> dchar[] .asUTF32
> cchar[] .asCchars
Why? Section V. already allows you to transcode these implicitly.
> ubyte[] .asUTF8 (bool dummy) // I think there's no UTF-8 BOM
> ubyte[] .asUTF16LE(bool includeBOM)
> ubyte[] .asUTF16BE(bool includeBOM)
> ubyte[] .asUTF32LE(bool includeBOM)
> ubyte[] .asUTF32BE(bool includeBOM)
>
This looks pretty familiar. My own proposal does this on a library level
for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/...
should be allowed. It's easier to maintain the conversion table in a
separate library. This also saves Walter from a lot of unnecessary work.
UTF-8 _does_ have a BOM.
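(The UTF-8 signature is the byte sequence EF BB BF; Python's codecs module exposes it, and the "utf-8-sig" codec writes and strips it, which makes the claim easy to verify:)

```python
import codecs

print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'

# "utf-8-sig" prepends the signature on encode and removes it on decode:
signed = "x".encode("utf-8-sig")
print(signed)                      # b'\xef\xbb\xbfx'
print(signed.decode("utf-8-sig"))  # x
```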
> IX.
>
> modify the ~ operator between the 4 types to work as follows:
>
> a) infer the result type from context, as with undecorated strings
> b) if calling a function and there are multiple overloads
> b.1) if both operand types are known, use that type
> b.2) if one is known and the other is an undecorated literal, use the known type
> b.3) if neither is known or both are known, but different, bork
>
If we didn't have several types of strings, this all would be much easier.
> X.
>
> Disallow utf8 and utf16 as a stand-alone var type, only arrays and
> pointers allowed
>
Yes, this is a 'working' solution. Although I would like to be able to
slice strings and do things like:
char[] s = "Älyttömämmäksi voinee mennä?";
s[15..21] = "ei voi";
writefln(s); // outputs: Älyttömämmäksi ei voi mennä?
Of course you can do this all using library functions, but tell me one
thing: why should I do simple string slicing using library calls and
much more complex Unicode conversion using language structures?
> Point I. removes the confusion of "char" and "wchar" not actually
> representing characters.
>
True.
> Point II. explicitly states that the strings are either UTF-encoded,
> complete characters* or C-compatible characters.
True.
> Point III. makes the code
>
> string abc="abc";
> someOSFunc(abc);
> someOtherOSFunc("qwe"s); // s only necessary if there is more than one
> option
>
> least likely to produce any transcoding.
Of course you need to do transcoding if the OS function expects
ISO-8859-x and your string is UTF-8/16.
> Point IV. makes it nearly impossible to do the wrong thing and doesn't
> require explicit casts when interfacing to C code, assuming the C
> functions are declared properly (i.e. the correct of the two 1-byte
> types is declared). When used with literals, the 0 can be appended
> compile-time, like it is now.
Why do you have to output Unicode strings using legacy non-Unicode
C-APIs? AFAIK DUI / standard I/O and other libraries use standard
Unicode, right? At least QT / GTK+ / Win32API / Linux console do support
Unicode.
> Point V. makes it easier to use different types without explicit
> casting, but will still produce warnings when transcoding happens. In
> most cases it will be obvious anyway.
It would be easier with only a single Unicode-compliant string type. Ask
the Java guys.
> Point VI. breaks behavior of other array casts (which only paint), but
> strings are getting special behavior anyway, and you can still paint via
> void[], and even more importantly, if you need to paint between
> UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong
> in the first place.
?
> Point VII. will make it somewhat easier to make the transition.
?
> Point VIII. provides an alternative to casting and allows specifying
> endianness when writing to network and/or files.
Partly true. Still, I think it would be much better if we had these as a
std.stream.UnicodeStream class. Again, Java does this well.
> The methods should be
> compile-time resolvable when possible, so this would be both valid and
> evaluated in compile time:
>
> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
Why? Converting a 14 character string doesn't take much time. Besides,
if all our strings and i/o were utf-8, there wouldn't be any
conversions, right?
> Point IX. allows concatenation of strings in different encodings without
> significantly increasing the complexity of overloading rules, while also
> not requiring an inefficient toUTFxx followed by concatenation (which
> copies the result again).
True, but as I previously said, I don't believe we need to do great
amount of conversions in the runtime-level. All conversions should be
near network/file-interfaces, thus using Stream-classes, right?
> Point X. prevents some invalid code:
> Note that it is still possible to iterate over the string using a cchar
> and dchar, which actually do represent characters. Also note that for
> I/O purposes, which are the only thing one should be doing with code
> units, you can still paint the string as void[] or byte[] (or even
> better, call one of the methods above), but then you give up the view
> that it is a string and lose language support/special treatment.
True.
> Splitting the string inbetween will thus produce a "wrong" result, but I
> don't think D should include any kind of full Unicode processing, as
> it's actually needed quite rarely, so that problem is ignored...
Sigh. Maybe you're not doing full Unicode processing every day. What
about the Chinese? And what is full Unicode processing?

Before anything else: while I agree that a (really well-thought out)
string class would probably be a good solution, the D spec would seem to
suggest an array-based approach is preferred, and Walter isn't one to
change his mind easily :)
Besides, any kind of string class has its share of problems (one size
never fits all), and with the array based approach it's easy to add
pseudo-methods doing all kinds of funky things, while a language-defined
class makes it impossible.
Jari-Matti Mäkelä wrote:
>> version(Windows) {
>> alias utf16[] string;
>> } else
>> version(Unix/Linux) {
>> alias utf8[] string;
>> }
>>
>> add suffix ""s for explicitly specifying platform-specific encoding
>> (i.e. the string type), and make auto type inference default to that
>> same type (this applies to the auto keyword, not undecorated strings).
>> Add docs explaining that string is just a platform-dependent alias.
>>
> The idea (platform-independence) here is correct. :) The only thing is
> that you _don't_ need to know, which utf-implementation the current
> compiler is using.
Well, sometimes you do and most times you don't (and it is often the
case that at least some part of any app does need to know). I don't
think it's wise to force anything down anyone's throat, so I tried to
give options - you can use a specific UTF encoding, the native encoding
for legacy OSes, or leave it to the compiler to choose the "best" one
for you, where I believe best is what the underlying OS is using.
> If you are using Unicode to communicate with the user
> and/or native D libraries, you don't need to do any string conversions -
> they all use the same string representation, for god's sake.
Well, flexibility will definitely require some bloat in libraries, but
for communicating with the user, you definitely need conversions, if
you're not using the OS-native type (which, again, you do have the
option of using by being explicit about it).
>> add the following implicit casts for interoperability
>>
>> from: cchar[], utf8[], utf16[], dchar[]
>> to : cchar*, utf8*, utf16*, dchar*
>>
>> all of them ensure 0-termination. If cchar is converted to any other
>> form, it becomes the appropriate Unicode char. In the reverse
>> direction, all unrepresentable characters become '?'. when runtime
>> transcoding and/or reallocation is required, make them produce a warning.
>
> You mean C/C++ -interoperability?
Yup.
> Replacing all non-ASCII characters with '?'s means that we don't
> actually want to support all the legacy systems out there. So it would
> be impossible to write Unicode-compliant portable programs that
> supported 'ä' on the Windows 9x/NT/XP command line without version() {}
> -logic?
No, who mentioned ASCII? On windows, cchar would be exactly the legacy
encoding each non-unicode app uses, and conversions between app's
internal UTF-x and cchar[] would transcode into that charset. So, for
example, a word processor on a non-unicode windows version could still
use unicode internally, while automatically talking to the OS using all
the characters its charset provides.
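The behaviour described here - transcoding into whatever legacy charset the system uses, degrading only what that charset truly cannot represent - can be sketched in Python; cp1252 stands in as a typical Windows codepage (an assumption for illustration, since the actual cchar charset would be whatever the OS reports):

```python
text = "ä and “smart quotes”"

# cp1252 covers Western European letters and typographic quotes,
# so nothing is lost for this text:
legacy = text.encode("cp1252", errors="replace")
print(legacy.decode("cp1252") == text)  # True

# Only genuinely unrepresentable characters degrade to '?':
print("αβγ".encode("cp1252", errors="replace"))  # b'???'
```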
>> add the following implicit (transcoding) casts
>>
>> from: cchar[], utf8[], utf16[], dchar[]
>> to : cchar[], utf8[], utf16[], dchar[]
>>
>> when runtime transcoding is required, make them produce a warning
>> (i.e. always, except when casting from T to T).
>
> Again, the main reason for Unicode is that you don't need to transcode
> between several representations all the time.
Again, sometimes you do and most times you don't. But anyhow, painting
casts between UTF types make no sense, and I don't think explicit casts
are necessary, as there can't be any loss (ok, except to cchar[]).
>> create compatibility kit:
>>
>> module std.srccompatibility.oldchartypes;
>> // yes, it should be big and ugly
>>
>> alias utf8 char;
>> alias utf16 wchar;
>>
>
> You know, sweeping the problem under the carpet doesn't help us much.
> char/wchar won't get any better by calling them with a different name.
> Still char won't be able to store more than the first 127 Unicode symbols.
I'm not sure if you're referring to those aliases or not, but in YAUST,
there is no single char (utf8) anymore, and I think there's quite a
difference between "char[]" and "utf8[]", especially in the C-influenced
world we live in :)
>> add the following methods to all 4 array types
>>
>> utf8[] .asUTF8
>> utf16[] .asUTF16
>> dchar[] .asUTF32
>> cchar[] .asCchars
>
> Why? Section V. already allows you to transcode these implicitly.
Yup, but with warnings; using one of these shows that you've thought
about what you're doing, so the compiler is free to shut up :)
>> ubyte[] .asUTF8 (bool dummy) // I think there's no UTF-8 BOM
>> ubyte[] .asUTF16LE(bool includeBOM)
>> ubyte[] .asUTF16BE(bool includeBOM)
>> ubyte[] .asUTF32LE(bool includeBOM)
>> ubyte[] .asUTF32BE(bool includeBOM)
>>
>
> This looks pretty familiar. My own proposal does this on a library level
> for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/...
> should be allowed.
Sure they should be allowed, but D is supposed to be Unicode, so a D app
should generally only deal with that, and other charsets should
generally only exist in byte[] buffers before input or after output.
> It's easier to maintain the conversion table in a
> separate library. This also saves Walter from a lot of unnecessary work.
Well, conversions between UTFs are done already, so the only thing
remaining would be from/to cchar[], which shouldn't be too hard. Others
definitely belong in some library, as they mostly won't be needed, I guess..
> UTF-8 _does_ have a BOM.
It does? What is it? I thought that single bytes have no Byte Order, so
why would you need a Mark?
>> modify the ~ operator between the 4 types to work as follows:
>>
>> a) infer the result type from context, as with undecorated strings
>> b) if calling a function and there are multiple overloads
>> b.1) if both operand types are known, use that type
>> b.2) if one is known and the other is an undecorated literal, use the known
>> type
>> b.3) if neither is known or both are known, but different, bork
>
> If we didn't have several types of strings, this all would be much easier.
Agreed, but we do have several types of strings :)
>> Disallow utf8 and utf16 as a stand-alone var type, only arrays and
>> pointers allowed
>>
>
> Yes, this is a 'working' solution. Although I would like to be able to
> slice strings and do things like:
>
> char[] s = "Älyttömämmäksi voinee mennä?"
> s[15..21] = "ei voi"
> writefln(s) // outputs: Älyttömämmäksi ei voi mennä?
>
> Of course you can do this all using library functions, but tell me one
> thing: why should I do simple string slicing using library calls and
> much more complex Unicode conversion using language structures.
Because it's actually the opposite - Unicode conversions are simple,
while slicing is hard (at least slicing on character boundaries). Even
in the simple example you give, I have no idea whether the first Ä is
one character or two, as both cases look the same.
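The difficulty can be made concrete in Python (using a precomposed Ä; with a decomposed Ä even the character count would change, which is exactly the "one character or two" problem above):

```python
s = "Älyttömämmäksi voinee mennä?"  # Ä here is the single code point U+00C4
b = s.encode("utf-8")

# Character-level slicing picks out the intended word:
print(s[15:21])  # voinee

# The same indices applied to UTF-8 code units land somewhere else entirely,
# because Ä, ö and ä each occupy two bytes:
print(b[15:21].decode("utf-8"))  # ksi vo
```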
>> Point III. makes the code
>>
>> string abc="abc";
>> someOSFunc(abc);
>> someOtherOSFunc("qwe"s); // s only necessary if there is more than
>> one option
>>
>> least likely to produce any transcoding.
>
> Of course you need to do transcoding if the OS-function expects
> ISO-8859-x and your string is UTF-8/16.
True, I just said "least likely". But at least you can use the same
(non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.
>> Point IV. makes it nearly impossible to do the wrong thing and doesn't
>> require explicit casts when interfacing to C code, assuming the C
>> functions are declared properly (i.e. the correct of the two 1-byte
>> types is declared). When used with literals, the 0 can be appended
>> compile-time, like it is now.
>
> Why do you have to output Unicode strings using legacy non-Unicode
> C-APIs? AFAIK DUI / standard I/O and other libraries use standard
> Unicode, right? At least QT / GTK+ / Win32API / Linux console do support
> Unicode.
Well, your point is moot, because if there's no such function to call,
then there is no problem. But when there is such a function, you would
hope that the language/library does something sensible by default,
wouldn't you?
>> Point V. makes it easier to use different types without explicit
>> casting, but will still produce warnings when transcoding happens. In
>> most cases it will be obvious anyway.
>
> It would be easier with only a single Unicode-compliant string-type. Ask
> the Java guys.
Well, I am one of the Java guys, and java.lang.String leaves a lot to be
desired. Because it's language-defined the way it is, it's
1) immutable, which sucks if it's forced down your throat 100% of the time
2) UTF-16 forever and ever, which sucks if you want it to either take
less memory or not worry about surrogates; just look at all the crappy
functions they had to add in Java 5 to support the entire Unicode
charset :)
>> Point VI. breaks behavior of other array casts (which only paint), but
>> strings are getting special behavior anyway, and you can still paint
>> via void[], and even more importantly, if you need to paint between
>> UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong
>> in the first place.
>
> ?
Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or
UTF-32, but not more than one at the same time (OK, unless it's ASCII
only, which fits both the first two). So, for example, if you cast
utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16
string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8
in the first place.
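A quick Python sketch of why painting (reinterpreting) bytes across encodings produces mumbo jumbo rather than a conversion:

```python
raw = "hi".encode("utf-8")  # b'hi', i.e. the bytes 0x68 0x69

# Transcoding: decode as what the bytes actually are, re-encode as the target.
proper = raw.decode("utf-8").encode("utf-16-le")
print(proper)  # b'h\x00i\x00'

# Painting: pretend the same bytes already are UTF-16LE.
painted = raw.decode("utf-16-le")
print(painted == "hi")  # False - it's the single CJK character U+6968
```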
>> Point VII. will make it somewhat easier to make the transition.
>
> ?
?
>> Point VIII. provides an alternative to casting and allows specifying
>> endianness when writing to network and/or files.
>
> Partly true. Still, I think it would be much better if we had these as a
> std.stream.UnicodeStream class. Again, Java does this well.
Why should you be forced to use a stream for something so simple? What
if you want to use two encodings on the same stream (it's not even so
far-fetched - the first line in an HTTP request can only contain UTF-8,
but you may want to send POST contents in UTF-16, for example). Etc. etc.
>> The methods should be compile-time resolvable when possible, so this
>> would be both valid and evaluated in compile time:
>>
>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>
> Why? Converting a 14 character string doesn't take much time.
Why would it not evaluate at compile time? Do you see any benefit in
that? And while it doesn't take much time once, it does take some, and
more importantly, allocates new memory each time. If you're trying to do
more than one request (as in thousands), I'm sure it adds up..
> Besides,
> if all our strings and i/o were utf-8, there wouldn't be any
> conversions, right?
Except every time you'd call a Win32 function, which is what's on most
computers?
>> Point IX. allows concatenation of strings in different encodings
>> without significantly increasing the complexity of overloading rules,
>> while also not requiring an inefficient toUTFxx followed by
>> concatenation (which copies the result again).
>
> True, but as I previously said, I don't believe we need to do great
> amount of conversions in the runtime-level. All conversions should be
> near network/file-interfaces, thus using Stream-classes, right?
I agree decent stream classes can solve many problems, but not all of them.
>> Splitting the string inbetween will thus produce a "wrong" result, but
>> I don't think D should include any kind of full Unicode processing, as
>> it's actually needed quite rarely, so that problem is ignored...
>
> Sigh. Maybe you're not doing full Unicode processing every day. What
> about the Chinese? And what is full Unicode processing?
Unicode is much more than a really large character set. There are UTFs,
collation, bidirectionality, combining characters, locales, etc. etc., see
http://www.unicode.org/reports/index.html
So, if you want to create a decent text editor according to Unicode
specs, you'll have to implement "full Unicode processing", but a large
majority of other apps just needs to be able to interface to OS and
libraries to get and display the text, usually without even caring
what's inside, so I see no point in including all that in D, not even as a
standard library (or perhaps only after many other things are implemented first).
xs0

xs0 wrote:
> Before anything else: while I agree that a (really well-thought out)
> string class would probably be a good solution, the D spec would seem to
> suggest an array-based approach is preferred, and Walter isn't one to
> change his mind easily :)
I believe we can achieve quite much with just simple array-like syntax.
> Besides, any kind of string class has its share of problems (one size
> never fits all), and with the array based approach it's easy to add
> pseudo-methods doing all kinds of funky things, while a language-defined
> class makes it impossible.
Although D is able to support some hard-coded properties too.
>> The idea (platform-independence) here is correct. :) The only thing is
>> that you _don't_ need to know, which utf-implementation the current
>> compiler is using.
>
> Well, sometimes you do and most times you don't (and it is often the
> case that at least some part of any app does need to know). I don't
> think it's wise to force anything down anyone's throat, so I tried to
> give options - you can use a specific UTF encoding, the native encoding
> for legacy OSes, or leave it to the compiler to choose the "best" one
> for you, where I believe best is what the underlying OS is using.
I'd give my vote for the "let compiler choose" option.
>> If you are using Unicode to communicate with the user and/or native D
>> libraries, you don't need to do any string conversions - they all use
>> the same string representation, for god's sake.
>
> Well, flexibility will definitely require some bloat in libraries, but
> for communicating with the user, you definitely need conversions, if
> you're not using the OS-native type (which, again, you do have the
> option of using with being explicit about it).
But if you let the compiler vendor decide the encoding, there's a
high probability that you won't need any explicit transcoding.
>>> add the following implicit casts for interoperability
>>>
>>> from: cchar[], utf8[], utf16[], dchar[]
>>> to : cchar*, utf8*, utf16*, dchar*
>>>
>>> all of them ensure 0-termination. If cchar is converted to any other
>>> form, it becomes the appropriate Unicode char. In the reverse
>>> direction, all unrepresentable characters become '?'. when runtime
>>> transcoding and/or reallocation is required, make them produce a
>>> warning.
>>
>> You mean C/C++ -interoperability?
> Yup.
I was just thinking that once D has complete wrappers for all necessary
stuff, you don't need these anymore. Library (wrapper) writers should be
patient enough to use explicit conversion rules.
>> Replacing all non-ASCII characters with '?'s means that we don't
>> actually want to support all the legacy systems out there. So it would
>> be impossible to write Unicode-compliant portable programs that
>> supported 'ä' on the Windows 9x/NT/XP command line without version()
>> {} -logic?
>
>
> No, who mentioned ASCII? On windows, cchar would be exactly the legacy
> encoding each non-unicode app uses, and conversions between app's
> internal UTF-x and cchar[] would transcode into that charset. So, for
> example, a word processor on a non-unicode windows version could still
> use unicode internally, while automatically talking to the OS using all
> the characters its charset provides.
>
You said
"In the reverse direction, all unrepresentable characters become '?'."
The thing is that the D compiler doesn't know anything about your system
character encoding. You can even change it on the fly, if your system is
capable of doing that. Therefore this transcoding must use the lowest
common denominator, which is probably 7-bit ASCII.
>>> add the following implicit (transcoding) casts
>>>
>>> from: cchar[], utf8[], utf16[], dchar[]
>>> to : cchar[], utf8[], utf16[], dchar[]
>>>
>>> when runtime transcoding is required, make them produce a warning
>>> (i.e. always, except when casting from T to T).
>>
>> Again, the main reason for Unicode is that you don't need to transcode
>> between several representations all the time.
>
> Again, sometimes you do and most times you don't. But anyhow, painting
> casts between UTF types make no sense, and I don't think explicit casts
> are necessary, as there can't be any loss (ok, except to cchar[]).
You don't need to convert inside your own code unless you're really
creating a program that is supposed to convert stuff. I mean you need
the transcoding only when interfacing with foreign code / i/o.
>>> add the following methods to all 4 array types
>>>
>>> utf8[] .asUTF8
>>> utf16[] .asUTF16
>>> dchar[] .asUTF32
>>> cchar[] .asCchars
>>
>> Why? Section V. already allows you to transcode these implicitly.
>
> Yup, but with warnings; using one of these shows that you've thought
> about what you're doing, so the compiler is free to shut up :)
Yes, now you're right. The programmer should _always_ explicitly
declare all conversions.
>>> ubyte[] .asUTF8 (bool dummy) // I think there's no UTF-8 BOM
>>> ubyte[] .asUTF16LE(bool includeBOM)
>>> ubyte[] .asUTF16BE(bool includeBOM)
>>> ubyte[] .asUTF32LE(bool includeBOM)
>>> ubyte[] .asUTF32BE(bool includeBOM)
>>>
>> This looks pretty familiar. My own proposal does this on a library
>> level for a reason. You see, conversions from Unicode to
>> ISO-8859-x/KOI8-R/... should be allowed.
>
> Sure they should be allowed, but D is supposed to be Unicode, so a D app
> should generally only deal with that, and other charsets should
> generally only exist in byte[] buffers before input or after output.
Then tell me, how do I fill these buffers with your new functions? I
would definitely want to explicitly define the character encoding. IMHO
this is much better done using static classes (std.utf.e[n/de]code) than
variable properties.
>> It's easier to maintain the conversion table in a separate library.
>> This also saves Walter from a lot of unnecessary work.
>
> Well, conversions between UTFs are done already, so the only thing
> remaining would be from/to cchar[], which shouldn't be too hard.
Yes, conversion between UTFs is done, but conversion between legacy
charsets and UTFs is not! Those conversions aren't that hard, but as you
might know, there are maybe hundreds of possible encodings.
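Point IV's '?' rule for unrepresentable characters maps directly onto what existing codec machinery already does. A small Python sketch (standing in for the proposed cchar[] conversion, with made-up sample text):

```python
# Transcoding Unicode into legacy single-byte charsets, replacing
# unrepresentable characters with '?' - the behavior point IV proposes
# for conversions into cchar[].
text = "voinee menn\u00e4 \u0161"  # a-umlaut is in Latin-1, s-caron is not

latin1 = text.encode("iso-8859-1", errors="replace")  # s-caron -> b"?"
koi8 = text.encode("koi8-r", errors="replace")        # a-umlaut too -> b"?"

assert latin1 == b"voinee menn\xe4 ?"
assert koi8 == b"voinee menn? ?"
```

Note how what survives depends entirely on the target charset, which is exactly why a per-charset table (library, not compiler) is needed for anything beyond the OS-native encoding.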
> Others
> definitely belong in some library, as they mostly won't be needed, I
> guess..
This isn't a very consistent approach. Some functions belong in some
library, others should be implemented in the language...wtf?
>> UTF-8 _does_ have a BOM.
>
> It does? What is it? I thought that single bytes have no Byte Order, so
> why would you need a Mark?
0xEF 0xBB 0xBF
http://www.unicode.org/faq/utf_bom.html#25
See also
http://www.unicode.org/faq/utf_bom.html#29
>> If we didn't have several types of strings, this all would be much
>> easier.
>
> Agreed, but we do have several types of strings :)
I'm trying to say we don't need several types of strings :)
>>> Disallow utf8 and utf16 as a stand-alone var type, only arrays and
>>> pointers allowed
>>>
>>
>> Yes, this is a 'working' solution. Although I would like to be able to
>> slice strings and do things like:
>>
>> char[] s = "Älyttömämmäksi voinee mennä?"
>> s[15..21] = "ei voi"
>> writefln(s) // outputs: Älyttömämmäksi ei voi mennä?
>>
>> Of course you can do this all using library functions, but tell me one
>> thing: why should I do simple string slicing using library calls and
>> much more complex Unicode conversion using language structures.
>
>
> Because it's actually the opposite - Unicode conversions are simple,
> while slicing is hard (at least slicing on character boundaries). Even
> in the simple example you give, I have no idea whether the first Ä is
> one character or two, as both cases look the same.
It's not really that hard. One downside is that you have to parse
through the string (unless compiler uses UTF-16/32 as an internal string
type).
Slicing the string on the code unit level doesn't make any sense, now
does it? Because char should be treated as a special type by the
compiler, I see no other use for slicing than this. Like you said, the
alternative slicing can be achieved by casting the string to void[] (for
i/o data buffering, etc).
>>> Point III. makes the code
>>>
>>> string abc="abc";
>>> someOSFunc(abc);
>>> someOtherOSFunc("qwe"s); // s only necessary if there is more than
>>> one option
>>>
>>> least likely to produce any transcoding.
>>
>>
>> Of course you need to do transcoding, if the OS function expects
>> ISO-8859-x and your string is utf8/16.
>
>
> True, I just said "least likely". But at least you can use the same
> (non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.
Again, neither the compiler nor the compiled binary knows anything about
the OS's standard encoding. Even some Linux systems still use ISO-8859-x.
If you're running Windows programs through VMware or Wine on Linux, you
can't tell whether it's always faster to use UTF-16 instead of UTF-8.
>>> Point IV. makes it nearly impossible to do the wrong thing and
>>> doesn't require explicit casts when interfacing to C code, assuming
>>> the C functions are declared properly (i.e. the correct of the two
>>> 1-byte types is declared). When used with literals, the 0 can be
>>> appended compile-time, like it is now.
>>
>>
>> Why do you have to output Unicode strings using legacy non-Unicode
>> C-APIs? AFAIK DUI / standard I/O and other libraries use standard
>> Unicode, right? At least QT / GTK+ / Win32API / Linux console do
>> support Unicode.
>
>
> Well, your point is moot, because if there's no such function to call,
> then there is no problem. But when there is such a function, you would
> hope that the language/library does something sensible by default,
> wouldn't you?
No, this brilliant invention of yours causes problems even if we didn't
have any 'legacy'-systems/APIs. You see, Library-writer 1 might use
UTF-16 for his library because he uses Windows and thinks it's the
fastest charset. Now Library-writer 2 has done his work using UTF-8 as
an internal format. If you make a client program that links with these
both, you (may) have to create unnecessary conversions just because one
guy decided to create his own standards.
>>> Point V. makes it easier to use different types without explicit
>>> casting, but will still produce warnings when transcoding happens. In
>>> most cases it will be obvious anyway.
>>
>>
>> It would easier with only a single Unicode-compliant string-type. Ask
>> the Java guys.
>
>
> Well, I am one of the Java guys, and java.lang.String leaves a lot to be
> desired. Because it's language defined in the way it is, it's
> 1) immutable, which sucks if it's forced down your throat 100% of time
I agree.
> 2) UTF-16 for ever and ever, which sucks if you want it to either take
> less memory or don't want to worry about surrogates; just look at all
> the crappy functions they had to add in Java 5 to support the entire
> Unicode charset :)
Partly true. What I meant was that most Java programmers use only one
kind of string class (because they don't have/need other types).
>>> Point VI. breaks behavior of other array casts (which only paint),
>>> but strings are getting special behavior anyway, and you can still
>>> paint via void[], and even more importantly, if you need to paint
>>> between UTF8/UTF16/UTF32/cchar, either the source or destination type
>>> is wrong in the first place.
>>
>> ?
>
> Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or
> UTF-32, but not more than one at the same time (OK, unless it's ASCII
> only, which fits both the first two). So, for example, if you cast
> utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16
> string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8
> in the first place.
Ok. But I thought you said utf8[] is implicitly converted to utf16[].
Then it's always valid whatever-type-it-is.
>>> Point VII. will make it somewhat easier to make the transition.
How? I don't believe.
>>> Point VIII. provides an alternative to casting and allows specifying
>>> endianness when writing to network and/or files.
>>
>>
>> Partly true. Still, I think it would be much better if we had these as
>> a std.stream.UnicodeStream class. Again, Java does this well.
>
>
> Why should you be forced to use a stream for something so simple?
So simple? Ahem, std.stream.File _is_ a stream. Here's my version:
File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
f.writeLine("valid unicode text åäöü");
f.close;
File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
f.writeLine("valid unicode text åäöü");
f.close;
Advantages:
-supports BOM values
-easy to use, right?
> What
> if you want to use two encodings on the same stream (it's not even so
> far fetched - the first line in a HTTP request can only contain UTF-8,
> but you may want to send POST contents in UTF-16, for example). Etc. etc.
Simple, just implement a method for changing the stream type:
Stream s = UnicodeSocketStream(socket, mode, encoding);
s.changeEncoding(encoding2);
If you want high-performance streams, you can convert the strings in a
separate thread before you use them, right?
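xs0's HTTP case can also be sketched without any stream-level encoding state at all: encode each part explicitly and write plain bytes. A Python sketch with a hypothetical request (Python standing in for the proposed D):

```python
# Mixed encodings on one byte stream: ASCII-compatible UTF-8 headers,
# UTF-16 body - there is no single "stream encoding" to switch mid-stream.
body = "nimi=Jari-Matti Mäkelä".encode("utf-16-le")
request = (
    b"POST /submit HTTP/1.0\r\n"
    b"Content-Type: text/plain; charset=utf-16le\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body
)

assert request.startswith(b"POST /submit")
assert "Mäkelä".encode("utf-16-le") in request
```

A changeEncoding() method and explicit per-part encoding are two views of the same operation; the question is only where the state lives.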
>>> The methods should be compile-time resolvable when possible, so this
>>> would be both valid and evaluated in compile time:
>>>
>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>
>>
>> Why? Converting a 14 character string doesn't take much time.
>
>
> Why would it not evaluate at compile time? Do you see any benefit in
> that? And while it doesn't take much time once, it does take some, and
> more importantly, allocates new memory each time. If you're trying to do
> more than one request (as in thousands), I'm sure it adds up..
You only need to convert once.
>> Besides, if all our strings and i/o were utf-8, there wouldn't be any
>> conversions, right?
>
> Except every time you'd call a Win32 function, which is what's on most
> computers?
My mistake, let's forget the utf-8 for a while. Actually I meant that if
all strings were in the native OS format (let the compiler decide),
there would be no need to convert.
>>> Point IX. allows concatenation of strings in different encodings
Why do you want to do that?
>>> without significantly increasing the complexity of overloading rules,
>>> while also not requiring an inefficient toUTFxx followed by
>>> concatenation (which copies the result again).
>>
>>
>> True, but as I previously said, I don't believe we need to do great
>> amount of conversions in the runtime-level. All conversions should be
>> near network/file-interfaces, thus using Stream-classes, right?
>
>
> I agree decent stream classes can solve many problems, but not all of them.
"Many, but not all of them." That's why we should have
std.utf.encode/decode-functions.
>>> Splitting the string inbetween will thus produce a "wrong" result,
>>> but I don't think D should include any kind of full Unicode
>>> processing, as it's actually needed quite rarely, so that problem is
>>> ignored...
>
> So, if you want to create a decent text editor according to Unicode
> specs, you'll have to implement "full Unicode processing", but a large
> majority of other apps just needs to be able to interface to OS and
> libraries to get and display the text, usually without even caring
> what's inside, so I see no point to include all that in D, not even as a
> standard library (or perhaps after many other things are implemented first)
Ok, now I see your point. I thought you didn't want full Unicode
processing even as an add-on library. I agree, you don't need these
'advanced' algorithms in the core language, rather in a separate
library. Time will tell; maybe someday when we haven't got anything else
to do, Phobos will finally include some cool Unicode tricks.
Jari-Matti

>> Well, flexibility will definitely require some bloat in libraries, but
>> for communicating with the user, you definitely need conversions, if
>> you're not using the OS-native type (which, again, you do have the
>> option of using with being explicit about it).
>
> But if you let the compiler vendor to decide the encoding, there's a
> high probability that you don't need any explicit transcoding.
Sure you may need transcoding: you may use 15 different libraries, each
expecting its own thing. The one thing that can be done is to not
require transcoding at least when talking to the OS, which all apps have
to do at some point. But even then, you should have the option to choose
otherwise - if you have a UTF-8 library that you use in 99% of
string-related calls, it's still faster to use UTF-8 and transcode when
talking to the OS.
>>> You mean C/C++ -interoperability?
>>
>> Yup.
>
> I was just thinking that once D has complete wrappers for all necessary
> stuff, you don't need these anymore. Library (wrapper) writers should be
> patient enough to use explicit conversion rules.
But why should one have to create wrappers in the first place? With my
proposal, you can directly link to many libraries and the compiler will
do the conversions for you.
>> No, who mentioned ASCII? On windows, cchar would be exactly the legacy
>> encoding each non-unicode app uses, and conversions between app's
>> internal UTF-x and cchar[] would transcode into that charset. So, for
>> example, a word processor on a non-unicode windows version could still
>> use unicode internally, while automatically talking to the OS using
>> all the characters its charset provides.
>
> You said
> "In the reverse direction, all unrepresentable characters become '?'."
>
> The thing is that D compiler doesn't know anything about your system
> character encoding. You can even change it on the fly, if your system is
> capable of doing that. Therefore this transcoding must use the greatest
> common divisor which is probably 7-bit ASCII.
While the compiler may not, I'm sure it's possible to figure it out at
runtime. For example, many old apps use a different language based on
your settings, browsers send different Accept-Language headers, etc. So,
it is possible, I think.
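In CPython, for instance, that runtime discovery is a single call, which supports xs0's point that the compiler need not hard-code the platform encoding (assuming only that the host exposes a locale):

```python
import locale
import sys

# Ask the runtime what the platform's "cchar" encoding currently is;
# nothing here is fixed at compile time.
preferred = locale.getpreferredencoding(False)  # e.g. "UTF-8" or "cp1252"
fs_encoding = sys.getfilesystemencoding()

assert isinstance(preferred, str) and preferred != ""
assert isinstance(fs_encoding, str) and fs_encoding != ""
```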
>> Again, sometimes you do and most times you don't. But anyhow, painting
>> casts between UTF types make no sense, and I don't think explicit
>> casts are necessary, as there can't be any loss (ok, except to cchar[]).
>
> You don't need to convert inside your own code unless you're really
> creating a program that is supposed to convert stuff. I mean you need
> the transcoding only when interfacing with foreign code / i/o.
If you don't need to convert, fine. If you do need to convert, I see no
reason not to make it as easy/convenient as possible.
>> Yup, but with warnings; using one of these shows that you've thought
>> about what you're doing, so the compiler is free to shut up :)
>
> Yes, now you're right. The programmer should _always_ explicitly
> declare all conversions.
Why?
>>>> ubyte[] .asUTF8 (bool dummy) // I think there's no UTF-8 BOM
>>>> ubyte[] .asUTF16LE(bool includeBOM)
>>>> ubyte[] .asUTF16BE(bool includeBOM)
>>>> ubyte[] .asUTF32LE(bool includeBOM)
>>>> ubyte[] .asUTF32BE(bool includeBOM)
>>>>
>>> This looks pretty familiar. My own proposal does this on a library
>>> level for a reason. You see, conversions from Unicode to
>>> ISO-8859-x/KOI8-R/... should be allowed.
>>
>>
>> Sure they should be allowed, but D is supposed to be Unicode, so a D
>> app should generally only deal with that, and other charsets should
>> generally only exist in byte[] buffers before input or after output.
>
> Then tell me, how do I fill these buffers with your new functions?
You don't. Only UTFs and one OS-native encoding are supported in the
language, the latter for obvious convenience. Others have to be done
with a library. Note that the compiler is free to use the same library,
it's not like anything would have to be done twice.
>>> UTF-8 _does_ have a BOM.
>>
>> It does? What is it? I thought that single bytes have no Byte Order,
>> so why would you need a Mark?
>
> 0xEF 0xBB 0xBF
OK, then it's not a dummy parameter :)
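The UTF-8 BOM is simply U+FEFF encoded in UTF-8 (an encoding signature rather than a byte-order indicator), so the asUTF8 parameter really wouldn't be a dummy. A quick Python check of the bytes:

```python
import codecs

# The UTF-8 "BOM" is the UTF-8 encoding of U+FEFF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"

# Python even models "UTF-8 with signature" as its own codec:
assert "abc".encode("utf-8-sig") == b"\xef\xbb\xbfabc"
assert b"\xef\xbb\xbfabc".decode("utf-8-sig") == "abc"
```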
>>> If we didn't have several types of strings, this all would be much
>>> easier.
>>
>> Agreed, but we do have several types of strings :)
>
> I'm trying to say we don't need several types of strings :)
Why? I think if it's done properly, there are benefits from having a
choice, while not complicating matters when one doesn't care.
>> Because it's actually the opposite - Unicode conversions are simple,
>> while slicing is hard (at least slicing on character boundaries). Even
>> in the simple example you give, I have no idea whether the first Ä is
>> one character or two, as both cases look the same.
>
> It's not really that hard. One downside is that you have to parse
> through the string (unless compiler uses UTF-16/32 as an internal string
> type).
It is "hard" - if you want to get the first character, as in the first
character that the user sees, it can actually be from 1 to x characters,
where x can be at least 5 (that case is actually in the unicode
standard) and possibly more (and I don't mean code units, but characters).
> Slicing the string on the code unit level doesn't make any sense, now
> does it? Because char should be treated as a special type by the
> compiler, I see no other use for slicing than this. Like you said, the
> alternative slicing can be achieved by casting the string to void[] (for
> i/o data buffering, etc).
Well, I sure don't have anything against making slicing strings slice on
character boundaries... Although that complicates matters - which length
should .length then return? It will surely bork all kinds of templates,
so perhaps it should be done with a different operator, like {a..b}
instead of [a..b], and length-in-characters should be .strlen.
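The .length question is real: the same text has a different "length" at every level, which is exactly why one property can't serve both templates and string code. A Python sketch (counts assume the literal is NFC):

```python
import unicodedata

s = unicodedata.normalize("NFC", "mennä?")  # 6 code points

assert len(s) == 6                            # code points (UTF-32 code units)
assert len(s.encode("utf-8")) == 7            # UTF-8 code units: ä is 2 bytes
assert len(s.encode("utf-16-le")) // 2 == 6   # UTF-16 code units

# Decomposed, the same visible text has 7 code points (a + combining mark),
# so even counting code points does not count the characters the user sees.
assert len(unicodedata.normalize("NFD", s)) == 7
```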
>>> Why do you have to output Unicode strings using legacy non-Unicode
>>> C-APIs? AFAIK DUI / standard I/O and other libraries use standard
>>> Unicode, right? At least QT / GTK+ / Win32API / Linux console do
>>> support Unicode.
>>
>> Well, your point is moot, because if there's no such function to call,
>> then there is no problem. But when there is such a function, you would
>> hope that the language/library does something sensible by default,
>> wouldn't you?
>
> No, this brilliant invention of yours causes problems even if we didn't
> have any 'legacy'-systems/APIs. You see, Library-writer 1 might use
> UTF-16 for his library because he uses Windows and thinks it's the
> fastest charset. Now Library-writer 2 has done his work using UTF-8 as
> an internal format. If you make a client program that links with these
> both, you (may) have to create unnecessary conversions just because one
> guy decided to create his own standards.
Please don't get personal, as I and many others don't consider it polite.
Anyhow, even if all D libraries use the same encoding, D is still
directly linkable to C libraries and it's obvious one doesn't have
control over what encoding they're using, so I fail to see what is wrong
with supporting different ones, and I also fail to see how it will help
to decree one of them The One and ignore all others.
>> 2) UTF-16 for ever and ever, which sucks if you want it to either take
>> less memory or don't want to worry about surrogates; just look at all
>> the crappy functions they had to add in Java 5 to support the entire
>> Unicode charset :)
>
> Partly true. What I meant was that most Java programmers use only one
> kind of string class (because they don't have/need other types).
Well, writing something high-performance string-related in Java
definitely takes a lot of code, because the built-in String class is
often useless. I see no need to repeat that in D.
>> Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or
>> UTF-32, but not more than one at the same time (OK, unless it's ASCII
>> only, which fits both the first two). So, for example, if you cast
>> utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16
>> string (but some mumbo jumbo), or it's UTF-16 and was never valid
>> UTF-8 in the first place.
>
> Ok. But I thought you said utf8[] is implicitly converted to utf16[].
> Then it's always valid whatever-type-it-is.
Yes I did, and that has nothing to do with the above paragraph, as it's
referring to the current situation, where casts between char types
actually don't transcode.
>> Why should you be forced to use a stream for something so simple?
>
> So simple? Ahem, std.stream.File _is_ a stream. Here's my version:
>
> File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
> f.writeLine("valid unicode text åäöü");
> f.close;
>
> File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
> f.writeLine("valid unicode text åäöü");
> f.close;
>
> Advantages:
> -supports BOM values
> -easy to use, right?
Well, I sure don't think so :P Why do I need a special class just to be
able to output strings? Where is the BOM placed? Does every string
include a BOM or just the file at the beginning? How can I change that?
If the writeLine is 2000 lines away from the stream declaration, how can
I tell what it will do?
I'd certainly prefer
File f=new File("foo", FileMode.Out);
f.write("valid whatever".asUTF16LE);
f.close;
Less typing, too :)
> If you want high-performance streams, you can convert the strings in a
> separate thread before you use them, right?
I don't know why you need a thread, but in any case, is that the easiest
solution (to code) you can think of?
>>>> The methods should be compile-time resolvable when possible, so this
>>>> would be both valid and evaluated in compile time:
>>>>
>>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>>
>>> Why? Converting a 14 character string doesn't take much time.
>>
>> Why would it not evaluate at compile time? Do you see any benefit in
>> that? And while it doesn't take much time once, it does take some, and
>> more importantly, allocates new memory each time. If you're trying to
>> do more than one request (as in thousands), I'm sure it adds up..
>
> You only need to convert once.
Again, why would it not evaluate at compile time? Do you see any benefit
in that?
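The compile-time evaluation xs0 wants can at least be approximated by hoisting: convert once at load time and reuse the buffer, instead of encoding (and allocating) per request. A hedged Python sketch with a hypothetical write callback:

```python
# Done once, at module load - the moral equivalent of xs0's
# compile-time-resolved  "GET / HTTP/1.0".asUTF8(false).
MY_REQUEST = "GET / HTTP/1.0\r\n\r\n".encode("utf-8")

def send(write):
    # Reuse the precomputed buffer: no per-call encode, no allocation.
    write(MY_REQUEST)

sent = []
send(sent.append)
send(sent.append)
assert sent[0] is MY_REQUEST and sent[1] is MY_REQUEST  # same object each time
```

"Convert once" and "convert at compile time" differ only in when that once happens; doing it at compile time additionally removes the startup cost and lets the constant live in read-only data.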
>>>> Point IX. allows concatenation of strings in different encodings
>
> Why do you want to do that?
I don't, I want the whole world to use dchar[]s. But it doesn't, so
using multiple encodings should be as easy as possible.
xs0

xs0 wrote:
>> I was just thinking that once D has complete wrappers for all
>> necessary stuff, you don't need these anymore. Library (wrapper)
>> writers should be patient enough to use explicit conversion rules.
>
>
> But why should one have to create wrappers in the first place? With my
> proposal, you can directly link to many libraries and the compiler will
> do the conversions for you.
In case you haven't noticed, most things in Java are made of wrappers.
Even D uses wrappers because they're easier to work with. If you think
that a wrapper might be slow, the D specs allow the compiler to inline
wrapper functions.
>>> No, who mentioned ASCII? On windows, cchar would be exactly the
>>> legacy encoding each non-unicode app uses, and conversions between
>>> app's internal UTF-x and cchar[] would transcode into that charset.
>>> So, for example, a word processor on a non-unicode windows version
>>> could still use unicode internally, while automatically talking to
>>> the OS using all the characters its charset provides.
>>
>>
>> You said
>> "In the reverse direction, all unrepresentable characters become '?'."
>>
>> The thing is that D compiler doesn't know anything about your system
>> character encoding. You can even change it on the fly, if your system
>> is capable of doing that. Therefore this transcoding must use the
>> greatest common divisor which is probably 7-bit ASCII.
>
>
> While the compiler may not, I'm sure it's possible to figure it out in
> runtime. For example, many old apps use a different language based on
> your settings, browsers send different Accept-Language, etc. So, it is
> possible, I think.
You can't be serious. Of course browsers use several encodings, but
they also let the user choose them. You cannot achieve such
functionality with a statically chosen cchar type. If you're going to
change the cchar type on the fly, characters 128-255 become corrupted
sooner than you think. That's why I would use conversion libraries.
>> You don't need to convert inside your own code unless you're really
>> creating a program that is supposed to convert stuff. I mean you need
>> the transcoding only when interfacing with foreign code / i/o.
>
>
> If you don't need to convert, fine. If you do need to convert, I see no
> point in it being as easy/convenient as possible.
But you don't need to convert inside your own code:
utf8[] foo(utf16[] param) { return param.asUTF8; }
dchar[] bar(utf8[] param) { return param.asUTF32; }
utf16[] zoo(dchar[] param) { return param.asUTF16; }
void main() {
utf16[] s = "something";
writefln( zoo( bar( foo(s) ) ) );
}
Doesn't look very useful to me, at least :)
It's the same thing with implicit conversions. You don't need them in
your 'own' code.
>> Yes, now you're right. The programmer should _always_ explicitly
>> declare all conversions.
> Why?
Because it will remove all 'hidden' (string) conversions.
>>>> If we didn't have several types of strings, this all would be much
>>>> easier.
>>> Agreed, but we do have several types of strings :)
>> I'm trying to say we don't need several types of strings :)
> Why? I think if it's done properly, there are benefits from having a
> choice, while not complicating matters when one doesn't care.
Of course there's always a benefit, but it makes things more complex.
Are you really saying that having 4 string types is easier than having
just one? With only one type you don't need casting rules nor so many
encumbering keywords etc. You always have to make a tradeoff somewhere.
I'm not suggesting my own proposal just because I'm stubborn or
something; I just know that you _can_ write Unicode-aware programs with
just one string type and it doesn't cost much (in runtime
performance/memory footprint). If you don't believe me, try simulating
these proposals using custom string classes.
>>> Because it's actually the opposite - Unicode conversions are simple,
>>> while slicing is hard (at least slicing on character boundaries).
>>> Even in the simple example you give, I have no idea whether the first
>>> Ä is one character or two, as both cases look the same.
>>
>>
>> It's not really that hard. One downside is that you have to parse
>> through the string (unless compiler uses UTF-16/32 as an internal
>> string type).
>
>
> It is "hard" - if you want to get the first character, as in the first
> character that the user sees, it can actually be from 1 to x characters,
> where x can be at least 5
Oh, I thought that a UTF-16 character is always encoded using 16 bits,
UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?
> (that case is actually in the unicode
> standard) and possibly more (and I don't mean code units, but characters).
Slicing & indexing with UTF-16/32 is straightforward: just multiply the
index by 2/4. UTF-8 is only a bit harder - you need to iterate through
the string, but it's not that hard, and in practice it's usually cheaper
than a full O(n) scan.
>> Slicing the string on the code unit level doesn't make any sense, now
>> does it? Because char should be treated as a special type by the
>> compiler, I see no other use for slicing than this. Like you said, the
>> alternative slicing can be achieved by casting the string to void[]
>> (for i/o data buffering, etc).
>
>
> Well, I sure don't have anything against making slicing strings slice on
> character boundaries... Although that complicates matters - which length
> should .length then return? It will surely bork all kinds of templates,
> so perhaps it should be done with a different operator, like {a..b}
> instead of [a..b], and length-in-characters should be .strlen.
Yes, it's true. My solution is a bit inconsistent, but doesn't hurt
anyone: it uses character boundaries inside the []-syntax (also .length
might be the character version inside the brackets), but the code-unit
version elsewhere. I think D should use an internal counter for data
type length and provide an intelligent (data type specific) .length for
the programmer. {a..b} doesn't look good to me.
>>> Well, your point is moot, because if there's no such function to
>>> call, then there is no problem. But when there is such a function,
>>> you would hope that the language/library does something sensible by
>>> default, wouldn't you?
>>
>> No, this brilliant invention of yours causes problems even if we
>> didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might
>> use UTF-16 for his library because he uses Windows and thinks it's the
>> fastest charset. Now Library-writer 2 has done his work using UTF-8 as
>> an internal format. If you make a client program that links with these
>> both, you (may) have to create unnecessary conversions just because
>> one guy decided to create his own standards.
>
>
> Please don't get personal, as I and many others don't consider it polite.
Sorry, trying to calm down a bit ;) You know, this thing is important to
me as I write most of my programs using Unicode I/O.
>
> Anyhow, even if all D libraries use the same encoding, D is still
> directly linkable to C libraries and it's obvious one doesn't have
> control over what encoding they're using,
That's true.
> so I fail to see what is wrong
> with supporting different ones, and I also fail to see how it will help
> to decree one of them The One and ignore all others.
Surely you agree that all transcoding is bad for performance.
Minimizing the need to transcode inside D code (by eliminating
unnecessary string types) maximizes performance, right?
> Well, writing something high-performance string-related in Java
> definitely takes a lot of code, because the built-in String class is
> often useless. I see no need to repeat that in D.
IMHO forcing regular programmers to use high-performance strings
everywhere as the only option is bad. Not all strings need to be that
fast. It would look pretty funny if you really needed to choose a
proper encoding just to create a valid 'Hello world!' example.
>> File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
>> f.writeLine("valid unicode text åäöü");
>> f.close;
>>
>> File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
>> f.writeLine("valid unicode text åäöü");
>> f.close;
>>
>> Advantages:
>> -supports BOM values
>> -easy to use, right?
>
> Well, I sure don't think so :P Why do I need a special class just to be
> able to output strings? Where is the BOM placed? Does every string
> include a BOM or just the file at the beginning? How can I change that?
> If the writeLine is 2000 lines away from the stream declaration, how can
> I tell what it will do?
>
> I'd certainly prefer
>
> File f=new File("foo", FileMode.Out);
> f.write("valid whatever".asUTF16LE);
> f.close;
>
> Less typing, too :)
Less typing? No, you're wrong. Your approach requires the programmer to
remember the correct encoding every time (s)he writes to that file. In
case you didn't know, valid UTF-x files use a BOM only at the beginning
of the file. My UnicodeFile class knows this. Your solution writes the
BOM every time you write a string (test it, if you don't believe me). In
addition, changing the BOM in the middle of a valid UTF-x stream is
illegal. If you want to create a datafile that serializes the 'objects',
you can use regular files just like you did here.
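The BOM-duplication pitfall is easy to reproduce: a BOM-emitting codec applied per string repeats the BOM, while a single writer attached to the stream emits it exactly once. A Python sketch of both approaches:

```python
import io
import sys

bom = b"\xff\xfe" if sys.byteorder == "little" else b"\xfe\xff"

# Encoding each string separately: the "utf-16" codec prefixes a BOM
# on every call, so the resulting bytes contain it twice.
chunks = "one".encode("utf-16") + "two".encode("utf-16")
assert chunks.count(bom) == 2

# One text writer over one stream (the UnicodeFile approach): BOM once.
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding="utf-16")
writer.write("one")
writer.write("two")
writer.flush()
assert buf.getvalue().count(bom) == 1
```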
>> If you want high-performance streams, you can convert the strings in a
>> separate thread before you use them, right?
>
>
> I don't know why you need a thread, but in any case, is that the easiest
> solution (to code) you can think of?
No, not the easiest. AFAIK in real life a high-performance web server
uses separate threads for data processing. In case you're writing a
single-threaded application, you can precalculate the string in the
_same_ thread.
>>>>> The methods should be compile-time resolvable when possible, so
>>>>> this would be both valid and evaluated in compile time:
>>>>>
>>>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>>>
>>>>
>>>> Why? Converting a 14 character string doesn't take much time.
>>>
>>>
>>> Why would it not evaluate at compile time? Do you see any benefit in
>>> that? And while it doesn't take much time once, it does take some,
>>> and more importantly, allocates new memory each time. If you're
>>> trying to do more than one request (as in thousands), I'm sure it
>>> adds up..
>>
>> You only need to convert once.
>
> Again, why would it not evaluate at compile time? Do you see any benefit
> in that?
I think I already said that you really don't know what the best
encoding to use would be at compile time. You're saying (by having
several types) that the programmer should decide this. That makes
building portable multiplatform programs harder: your approach implies
defining several version {} blocks for different architectures, so it
isn't that simple anymore. You need version blocks because if you
decided to use UTF-8, it would be fast on *nixes and slow on Windows,
and if you used UTF-16, the opposite would happen.
>>>>> Point IX. allows concatenation of strings in different encodings
>>
>> Why do you want to do that?
>
> I don't, I want the whole world to use dchar[]s. But it doesn't, so
> using multiple encodings should be as easy as possible.
But I'm saying here that we don't need several string types.
Jari-Matti
P.S. I won't be reading the NG for the next couple of days. I'll try to
answer your (potential) future posts as soon as I get back.

Derek Parnell wrote:
> On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:
>
>
> [snip]
>
>
>
>>Oh, I thought that a UTF-16 character is always encoded using 16 bits,
>>UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?
>
>
> Wrong, I'm afraid. Some characters use 32 bits in UTF16.
>
> UTF8: 1, 2, 3, and 4 byte characters.
> UTF16: 2 and 4 byte characters.
> UTF32: 4 byte characters (only)
Furthermore, a single visible character can be encoded using more than
one Unicode character (for example, a C with a caron can be both a
single character and two characters, C + combining caron). Since there's
no limit to how many combining characters a single "normal" char can
have, slicing on char boundaries is not solved merely by finding UTF
boundaries, which was my initial point.
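Both points can be checked with D's typed string literals (the c/w/d
suffixes select UTF-8/UTF-16/UTF-32); the lengths below count code
units, not visible characters:

```d
// One code point outside the BMP (MUSICAL SYMBOL G CLEF, U+1D11E):
static assert("\U0001D11E"c.length == 4);  // UTF-8: four bytes
static assert("\U0001D11E"w.length == 2);  // UTF-16: surrogate pair
static assert("\U0001D11E"d.length == 1);  // UTF-32: one dchar

// One visible character, two possible encodings:
static assert("\u010C"d.length == 1);      // Č, precomposed
static assert("C\u030C"d.length == 2);     // C + combining caron
```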
xs0

xs0 wrote:
>>> Oh, I thought that a UTF-16 character is always encoded using 16 bits,
>>> UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?
>>
>> Wrong, I'm afraid. Some characters use 32 bits in UTF16.
>>
>> UTF8: 1, 2, 3, and 4 byte characters.
>> UTF16: 2 and 4 byte characters.
>> UTF32: 4 byte characters (only)
>
> Furthermore, a single visible character can be encoded using more than
> one Unicode character (for example, a C with a caron can be both a
> single character and two characters, C + combining caron). Since there's
> no limit to how many combining characters a single "normal" char can
> have, slicing on char boundaries is not solved merely by finding UTF
> boundaries, which was my initial point.
Thanks, I wasn't aware of this before.
It seems that I underestimated the performance issues (web servers,
etc.) of having only one Unicode text type. I have to admit the current
types in D are a suitable compromise. They're not always the "easiest"
way to do things, but they have no major weaknesses either.
I guess the only thing I tried to say was that it really _is_ possible
to write all programs with only a single encoding-independent Unicode
type. But this approach has a few big downsides in some
performance-critical applications and therefore shouldn't be the default
behavior for a systems programming language like D. In a scripting
language it would be a killer feature, though.
---
* IMO support for indexing & slicing on Unicode character boundaries is
not strictly required at the language-syntax level, but it would be nice
to have this functionality somewhere. :) At least there's little use for
[d,w]char slicing now.
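As a sketch of what that library-level "somewhere" might look like,
assuming Phobos' std.utf.stride (which returns the code-unit length of
the code point starting at a given index). Note this slices on code
points only; as discussed above, combining characters mean code-point
boundaries are still not visible-character boundaries:

```d
import std.utf;

// Slice a char[] by code-point index rather than byte index.
// Assumes from < to and that both are within the string.
char[] sliceByCodePoint(char[] s, size_t from, size_t to)
{
    size_t i = 0, start = 0, n = 0;
    while (i < s.length && n < to) {
        if (n == from)
            start = i;                // byte offset of code point 'from'
        i += stride(s, i);            // skip one whole code point
        n++;
    }
    return s[start .. i];
}
```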
* I wish Walter could fix this [1] bug (I know why it produces
compile-time errors, but I don't know why DMD allows you to write it in
the first place):
[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/30566
I wish it worked like this (the dash denotes a range of values):
char foo = '\u0000'            // ok (C-string compatibility)
char foo = '\u0001'-'\u007f'   // ok
char foo = '\u0080'-'\uffff'   // compile error
* A fully Unicode-aware stream system [2] would also be a nice feature
(currently there's no convenient way to create valid UTF-encoded text
files with a BOM):
[2] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5636
That would (perhaps) require Walter/us to reconsider the Phobos stream
class hierarchy.