"Unknown W. Brackets" <unknown@simplemachines.org> wrote in message
news:eapdsg$qeo$1@digitaldaemon.com...
> I'm trying to understand why this 0 thing is such an issue. If your
> second statement is valid, it makes the first moot - 0 or no 0. Why does
> it matter, then?
Declaring char.init == 0 pretty much means that
D has no strict requirement that char[] contain only UTF-8
encoded sequences; it may hold any other encoding suitable for
the application.

char.init == 0 would resolve the situation we see in Phobos now:
char[] is de facto used for encodings other than UTF-8.

char.init == 0 tells everybody that char can also be used
for representing Unicode *code points*, with the assumption
that the offset value (the mapping onto the full Unicode set, aka codepage)
is stored somewhere in the application or is well known to it.

char.init == 0 also highlights the fact that it is safe to
use char[] with C string processing functions and to pass it to non-D
modules and libraries.
Whether it is UTF-8 encoded or not does not matter - the type is universal enough.
Andrew.
>
> -[Unknown]
>
>
>> Another option will be to change char.init to 0 and forget about the
>> problem, leaving it as it is now. Some good string implementation will
>> contain an encoding field in the string instance if needed.
>>
>> Andrew.
>>
>>

On Tue, 01 Aug 2006 22:40:56 -0700, Unknown W. Brackets wrote:
> I'm trying to understand why this 0 thing is such an issue. If your
> second statement is valid, it makes the first moot - 0 or no 0. Why
> does it matter, then?
I think the issue is more that Andrew wants to have hex-FF as a legitimate
byte value anywhere in a char[] variable. He misses the point that the
purpose of not allowing it is so we can detect uninitialized UTF-8
strings at run-time.
Andrew, just use ubyte[] variables and you won't have a problem, apart from
conversions between code-pages and Unicode <G>.
In D, ubyte[] is the data structure designed to hold variable length arrays
of unsigned bytes, which is exactly what you need to implement the kind of
strings you have in KOI-8 encoding.
--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:24:27 PM

> But maybe that's because I never leave things at their defaults. It's
> like writing a story where you expect the reader to think everyone has
> brown eyes unless you say otherwise.
>
Consider this:
char[6] buf;
strncpy(buf, "1234567", 5);
What will be the content of your buffer?
The answer is: 12345\xff . Surprise? It is.
In modern D a reliable implementation of this would be:
char[6] buf; // memset(buf,0xFF,6); under the hood.
uint n = strncpy(buf, "1234567", 5);
buf[n] = 0;
if you are going to use this with non-D modules.
Needless to say, this is a bit redundant.
If D initializes that memory in any case, why do you need
this uint n and buf[n] = 0; ?
Please don't tell me that this is because you spent
your childhood in boy scout camps and got some high principles.
Let's put such matters aside - this is a purely technical discussion.
Andrew.

Andrew Fedoniouk wrote:
> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message
> news:eapdsg$qeo$1@digitaldaemon.com...
>> I'm trying to understand why this 0 thing is such an issue. If your
>> second statement is valid, it makes the first moot - 0 or no 0. Why does
>> it matter, then?
>
> Declaring char.init == 0 pretty much means that
> D has no strict requirement that char[] contain only UTF-8
> encoded sequences; it may hold any other encoding suitable for
> the application.
Why is this good?
> char.init == 0 would resolve the situation we see in Phobos now:
> char[] is de facto used for encodings other than UTF-8.
You mean data with other encodings that still want to use the std.string
functions? I have written template versions that replace (almost) all
std.string functions that do not rely on encoding.
> char.init == 0 tells everybody that char can also be used
> for representing Unicode *code points*, with the assumption
> that the offset value (the mapping onto the full Unicode set, aka codepage)
> is stored somewhere in the application or is well known to it.
Maybe it would tell people that. A good thing it isn't so, then. Again,
why do you want to store non-UTF-8 data in a char[]? What is wrong with
ubyte[] or a suitable typedef?
> char.init == 0 also highlights the fact that it is safe to
> use char[] with C string processing functions and to pass it to non-D
> modules and libraries.
> Whether it is UTF-8 encoded or not does not matter - the type is universal enough.
I can't see how that would make it considerably safer.
/Oskar

> I think the issue is more that Andrew wants to have hex-FF as a legitimate
> byte value anywhere in a char[] variable. He misses the point that the
> purpose of not allowing it is so we can detect uninitialized UTF-8
> strings at run-time.
>
What does uninitialized mean? They *are* initialized.
This is the main point. For any type you can declare
an initial value. I bet that for, say, enums you choose not
nonexistent values but some really meaningful default values.
Having strings filled with FFs means that you will get problems
of a different kind - partially initialized strings.
Could you tell me whether you have ever had a situation where
ffffff strings helped you to find a problem?
And if yes, how is that in principle different from
catching strings with 00000?
Can anyone here say that these fffffffs helped to find
a problem?
Andrew.

Derek Parnell wrote:
> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>
>> (Hope this long dialog will help all of us to better understand what UNICODE
>> is)
>>
>> "Walter Bright" <newshound@digitalmars.com> wrote in message
>> news:eao5st$2r1f$1@digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>> encoded using UTF-16.
>>>
>>> BMP is a subset of UTF-16.
>> Walter, with deepest respect, it is not. Two different things.
>>
>> UTF-16 is a variable-length encoding - byte stream.
>> Unicode BMP is a range of numbers strictly speaking.
>
> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4
> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
> that are all represented by 2-byte integers. Windows NT had implemented
> UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid
UTF-16?

Andrew Fedoniouk wrote:
>> But maybe that's because I never leave things at their defaults. It's
>> like writing a story where you expect the reader to think everyone has
>> brown eyes unless you say otherwise.
>>
>
> Consider this:
>
> char[6] buf;
> strncpy(buf, "1234567", 5);
>
> What will be the content of your buffer?
>
> The answer is: 12345\xff . Surprise? It is.
Not really surprising. Had you compiled this in a C program (you are
using C functions after all), you would have gotten:
12345\x?? <- some garbage. Not a zero terminated string.
My manual for strncpy explicitly states:
" if there is no null byte among the first n
bytes of src, the result will not be null-terminated."
/Oskar

On Tue, 1 Aug 2006 23:45:26 -0700, Andrew Fedoniouk wrote:
>> But maybe that's because I never leave things at their defaults. It's
>> like writing a story where you expect the reader to think everyone has
>> brown eyes unless you say otherwise.
>>
>
> Consider this:
>
> char[6] buf;
> strncpy(buf, "1234567", 5);
>
> What will be the content of your buffer?
>
> The answer is: 12345\xff . Surprise? It is.
No, not surprised, just wondering why you didn't code it correctly though.
If you insist on using C functions then it should be coded ...
extern(C) uint strncpy(ubyte *, ubyte *, uint );
ubyte[6] buf;
strncpy(buf.ptr, cast(ubyte*)"1234567", 5);
> In modern D a reliable implementation of this would be:
>
> char[6] buf; // memset(buf,0xFF,6); under the hood.
> uint n = strncpy(buf, "1234567", 5);
> buf[n] = 0;
Well that is debatable. I'd do it more like ...
char[6] buf; // An array of UTF-8 code units.
uint n = strncpy(buf, "1234567", 5); // Replace the first 5 code-units.
buf[n..$] = 0; // Set remaining code-units to zero.
> if you are going to use this with non-D modules.
>
> Needless to say, this is a bit redundant.
>
> If D initializes that memory in any case, why do you need
> this uint n and buf[n] = 0; ?
>
> Please don't tell me that this is because you spent
> your childhood in boy scout camps and got some high principles.
> Let's put such matters aside - this is a purely technical discussion.
Exactly. And technically you should be using ubyte[] and not char[].
--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:57:15 PM

I fail to understand why I want another ambiguous type in my
programming. I am glad that when I type "int", I know I have a number
and not a pointer.
I am glad that when I type char, I again know what I have. No
guesswork. Your proposals sound like shooting myself in the foot.
No fun. I'll take that helmet you offered first.
-[Unknown]
> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message
> news:eapdsg$qeo$1@digitaldaemon.com...
>> I'm trying to understand why this 0 thing is such an issue. If your
>> second statement is valid, it makes the first moot - 0 or no 0. Why does
>> it matter, then?
>
> Declaring char.init == 0 pretty much means that
> D has no strict requirement that char[] contain only UTF-8
> encoded sequences; it may hold any other encoding suitable for
> the application.
>
> char.init == 0 would resolve the situation we see in Phobos now:
> char[] is de facto used for encodings other than UTF-8.
>
> char.init == 0 tells everybody that char can also be used
> for representing Unicode *code points*, with the assumption
> that the offset value (the mapping onto the full Unicode set, aka codepage)
> is stored somewhere in the application or is well known to it.
>
> char.init == 0 also highlights the fact that it is safe to
> use char[] with C string processing functions and to pass it to non-D
> modules and libraries.
> Whether it is UTF-8 encoded or not does not matter - the type is universal enough.
>
> Andrew.
>
>> -[Unknown]
>>
>>
>>> Another option will be to change char.init to 0 and forget about the
>>> problem, leaving it as it is now. Some good string implementation will
>>> contain an encoding field in the string instance if needed.
>>>
>>> Andrew.
>>>
>>>
>