
Which would be the best string type to use - char[], wchar[] or dchar[]? I
want to choose one of them and stick with it throughout my code for the sake
of consistency. My preference would be for wchar[] but using it is not as
smooth as I'd hoped. For example, Object.toString() returns char[], Phobos
seems not to have wchar versions for integer-to-string conversions, and
concatenating sometimes requires casts. It's not too bad, I suppose: I can
use free functions to encode/decode strings and write my own integer
conversion routines. But I am puzzled as to why I need to cast when
concatenating, e.g.:
wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~
cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
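(For what it's worth, those free functions in std.utf seem to give a cast-free way to do it. A rough, untested sketch, assuming toUTF16 takes a char[] and returns a wchar[]:)

```d
// Untested sketch: convert the char[] literals up front via std.utf,
// so the concatenation is wchar[] ~ wchar[] throughout.
import std.utf;

wchar[] describe(wchar[] quickAnimalStr, wchar[] lazyAnimalStr)
{
    wchar[] text = toUTF16("The quick brown ") ~ quickAnimalStr
                 ~ toUTF16(" jumped over the lazy ") ~ lazyAnimalStr;
    return text;
}
```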
Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16
natively, so it seemed sane to choose the equivalent string type in D. Plus
I read here http://www.digitalmars.com/techtips/windows_utf.html that char[]
is not directly compatible with the ANSI versions of the Windows API (again,
I'm using this a lot).
Given the above considerations, which do you advise I go with?
Cheers,
John.
P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 version
identifiers which we could use on the command line to tell the compiler to
expect that string type as the default (e.g., "version=UTF16"). It would
then mean that char[] becomes an alias for the specified type. When the type
is not specified, char[] goes back to being UTF8.

> Which would be the best string type to use - char[], wchar[] or dchar[]? I
> want to choose one of them and stick with it throughout my code for the
> sake of consistency. My preference would be for wchar[] but using it is
> not as smooth as I'd hoped. For example, Object.toString() returns char[],
> Phobos seems not to have wchar versions for integer-to-string conversions,
> and concatenating sometimes requires casts. It's not too bad, I suppose: I
> can use free functions to encode/decode strings and write my own integer
> conversion routines. But I am puzzled as to why I need to cast when
> concatenating, e.g.:
>
> wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~
> cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
>
> Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16
> natively, so it seemed sane to choose the equivalent string type in D.
> Plus I read here http://www.digitalmars.com/techtips/windows_utf.html that
> char[] is not directly compatible with the ANSI versions of the Windows
> API (again, I'm using this a lot).
You've pretty much summed up all the pros and cons. XP uses wchars
natively, but Phobos is not too kind to them.
I just use char[] as I'm not planning on translating my programs into
languages which use non-roman alphabets any time soon ;)
> P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32
> version identifiers which we could use on the command line to tell the
> compiler to expect that string type as the default (e.g.,
> "version=UTF16"). It would then mean that char[] becomes an alias for the
> specified type. When the type is not specified, char[] goes back to being
> UTF8.
Don't know about it being a language feature, but perhaps something that
could be added to the runtime. Something like a conditional alias that
would define a type like "nchar" to mean "native char".
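Something along these lines, perhaps - a sketch only, with hypothetical UTF16/UTF32 version identifiers set via -version on the command line:

```d
// Sketch of a conditional "nchar" alias; the version identifiers
// here are hypothetical, not existing compiler predefines.
version (UTF16)
    alias wchar nchar;   // native char is UTF-16
else version (UTF32)
    alias dchar nchar;   // native char is UTF-32
else
    alias char nchar;    // default: UTF-8
```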

> I just use char[] as I'm not planning on translating my programs into
> languages which use non-roman alphabets any time soon ;)
You'll be surprised, but even the Latin-1 set does not fit into
char.
http://www.bbsinc.com/symbol.html
For example, if you do not use wchar, you will not be able to see e.g. the
Euro sign as one char - it takes two UTF-8 bytes.
:) "Anders F Bjorklund"'s name will also be represented with one byte
more, etc.
"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message
news:d0odnf$svr$1@digitaldaemon.com...
>> Which would be the best string type to use - char[], wchar[] or dchar[]?
>> I want to choose one of them and stick with it throughout my code for the
>> sake of consistency. My preference would be for wchar[] but using it is
>> not as smooth as I'd hoped. For example, Object.toString() returns
>> char[], Phobos seems not to have wchar versions for integer-to-string
>> conversions, and concatenating sometimes requires casts. It's not too
>> bad, I suppose: I can use free functions to encode/decode strings and
>> write my own integer conversion routines. But I am puzzled as to why I
>> need to cast when concatenating, e.g.:
>>
>> wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~
>> cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
>>
>> Anyway, I'm doing a lot of text processing on Windows XP, which uses
>> UTF16 natively, so it seemed sane to choose the equivalent string type in
>> D. Plus I read here http://www.digitalmars.com/techtips/windows_utf.html
>> that char[] is not directly compatible with the ANSI versions of the
>> Windows API (again, I'm using this a lot).
>
> You've pretty much summed up all the pros and cons. XP uses wchars
> natively, but Phobos is not too kind to them.
>
> I just use char[] as I'm not planning on translating my programs into
> languages which use non-roman alphabets any time soon ;)
>
>> P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32
>> version identifiers which we could use on the command line to tell the
>> compiler to expect that string type as the default (e.g.,
>> "version=UTF16"). It would then mean that char[] becomes an alias for the
>> specified type. When the type is not specified, char[] goes back to being
>> UTF8.
>
> Don't know about it being a language feature, but perhaps something that
> could be added to the runtime. Something like a conditional alias that
> would define a type like "nchar" to mean "native char".
>

Andrew Fedoniouk wrote:
>>I just use char[] as I'm not planning on translating my programs into
>>languages which use non-roman alphabets any time soon ;)
>
> You'll be surprised, but even the Latin-1 set does not fit into
> char.
He probably meant "non-US"? (lone chars hold US-ASCII characters)
> For example, if you do not use wchar, you will not be able to see e.g.
> the Euro sign as one char - it takes two UTF-8 bytes.
Three, actually:
char[1] euro = "\u20AC";
> cannot implicitly convert expression "\u20ac" of type char[3] to char[1]
http://www.fileformat.info/info/unicode/char/20ac/index.htm
Some characters even take 4 bytes.
> :) "Anders F Bjorklund"'s name will also be represented with one byte
> more, etc.
It actually messed up GDC: my name was added in Latin-1 in a comment...
(when I contributed the patch to DMD that made it check comments too)
It's even more fun when using .length, as it returns bytes (code units).
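For instance (a quick untested sketch, assuming std.utf.toUTF32):

```d
// .length counts code units, not characters:
import std.utf;

void lengths()
{
    char[] s = "\u20AC";         // the Euro sign
    assert(s.length == 3);       // three UTF-8 code units
    dchar[] d = toUTF32(s);
    assert(d.length == 1);       // but only one character
}
```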
I use char[] and dchar, myself (and not wchar[] and wchar, like Java).
--anders

> He probably meant "non-US" ? (lone chars holds US-ASCII characters)
That's it. It seems weird, though, that such a (relatively) common letter
as umlaut-o would be represented as 2 bytes in UTF-8. Maybe I'm thinking of
extended ASCII (not the old kind, where chars 128-255 are lines and stuff).
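A quick way to check it (untested sketch):

```d
// umlaut-o is U+00F6, which UTF-8 encodes in two bytes:
void check()
{
    char[] s = "\u00F6";
    assert(s.length == 2);  // two UTF-8 code units for one letter
}
```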