I'm writing an introduction/tutorial to using strings in D,
paying particular attention to the complexities of UTF-8 and 16.
I realised that when you want the number of characters, you
normally actually want to use walkLength, not length. Is it
reasonable for the compiler to pick this up during semantic
analysis and point out this situation?
It's just a thought because a lot of the time, using length will
get the right answer, but for the wrong reasons, resulting in
lurking bugs. You can always cast to immutable(ubyte)[] or
immutable(short)[] if you want to work with the actual bytes
anyway.
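
To make the difference concrete, here is a minimal sketch comparing length with walkLength on a string that contains one non-ASCII character. It assumes walkLength is imported from std.range; the helper names codeUnits and codePoints are hypothetical and exist only for this illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical helpers, named only for this illustration.
size_t codeUnits(string s)  { return s.length; }      // UTF-8 code units (bytes)
size_t codePoints(string s) { return s.walkLength; }  // decoded code points

void main()
{
    string ascii = "hello";
    string accented = "h\u00E9llo"; // U+00E9 occupies two UTF-8 code units

    // For pure ASCII the two counts agree, which is how the bug hides.
    writeln(codeUnits(ascii), " ", codePoints(ascii));

    // With a non-ASCII character they differ.
    writeln(codeUnits(accented), " ", codePoints(accented));

    // Viewing the raw bytes through a cast, as mentioned above.
    auto bytes = cast(immutable(ubyte)[]) accented;
    writeln(bytes.length);
}
```

For UTF-16 data, wchar is an unsigned 16-bit type, so immutable(ushort)[] is probably a closer match than immutable(short)[].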

On Monday, 23 April 2012 at 23:01:59 UTC, James Miller wrote:
> Is it reasonable for the compiler to pick this up during
> semantic analysis and point out this situation?
Maybe... but it is important that this works:
string s;
if(s.length)
    do_something(s);
since that's always right and quite common.

James Miller:
> I realised that when you want the number of characters, you
> normally actually want to use walkLength, not length.
As with strlen() in C, unfortunately the result of
walkLength(somestring) is computed every time you call it,
because it doesn't get cached.
A partial improvement for this situation is to ensure that
walkLength(somestring) is strongly pure, and to ensure that the D
compiler is able to move this invariant pure computation out of
loops.
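
Until the compiler can do that on its own, the hoisting can be done by hand. The following is only a sketch: countShorterThan is a hypothetical function made up for this example, and it assumes walkLength from std.range.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical example: count the lines that have fewer code points
// than a header string.
size_t countShorterThan(string header, string[] lines)
{
    // The O(n) walk over `header` is done once, outside the loop,
    // instead of once per iteration.
    immutable limit = header.walkLength;

    size_t count = 0;
    foreach (line; lines)
        if (line.walkLength < limit) // this walk genuinely differs per line
            ++count;
    return count;
}

void main()
{
    writeln(countShorterThan("h\u00E9llo", ["abcd", "abcde", "\u00E9\u00E9"]));
}
```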
> Is it reasonable for the compiler to pick this up during
> semantic analysis and point out this situation?
This is not easy to do, because sometimes you want to know the
number of code points, and sometimes of code units.
I remember even a proposal to rename the "length" field to
another name for narrow strings, to avoid such bugs.
-----------------------
Adam D. Ruppe:
> Maybe... but it is important that this works:
>
> string s;
>
> if(s.length)
>     do_something(s);
>
> since that's always right and quite common.
Better:
if (!s.empty)
    do_something(s);
(or even better, built-in non-nulls, usable for strings too).
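
A small sketch of the emptiness check, assuming empty is imported from std.range (for arrays it is not a built-in property). The function hasContent is a hypothetical stand-in for the condition guarding do_something.

```d
import std.range : empty;
import std.stdio : writeln;

// Hypothetical stand-in for the condition guarding do_something above.
bool hasContent(string s)
{
    // For arrays, s.empty is true exactly when s.length is 0,
    // so this is correct whatever the encoding of the contents.
    return !s.empty;
}

void main()
{
    string s; // default-initialised: null, and therefore empty
    writeln(hasContent(s));
    writeln(hasContent(""));
    writeln(hasContent("\u00E9"));
}
```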
Bye,
bearophile

On Monday, 23 April 2012 at 23:52:41 UTC, bearophile wrote:
> James Miller:
>
>> I realised that when you want the number of characters, you
>> normally actually want to use walkLength, not length.
>
> As with strlen() in C, unfortunately the result of
> walkLength(somestring) is computed every time you call it,
> because it doesn't get cached.
> A partial improvement for this situation is to ensure that
> walkLength(somestring) is strongly pure, and to ensure that the D
> compiler is able to move this invariant pure computation out of
> loops.
>
>
>> Is it reasonable for the compiler to pick this up during
>> semantic analysis and point out this situation?
>
> This is not easy to do, because sometimes you want to know the
> number of code points, and sometimes of code units.
> I remember even a proposal to rename the "length" field to
> another name for narrow strings, to avoid such bugs.
I was thinking about that. This is quite a vague suggestion, more
just throwing the idea out there and seeing what people think. I
am aware of the issue of walkLength being computed every time,
rather than being a constant lookup. One option would be to make
it only a warning in @safe code, so worst case scenario is that
you mark the function as @trusted. I feel this fits in with the
idea of @safe quite well, since you have to explicitly tell the
compiler that you know what you're doing.
Another option would be to have some sort of general lint tool
that picks up on these kinds of potential errors, though that is
a lot bigger scope...
--
James Miller

James Miller:
> Another option would be to have some sort of general lint tool
> that picks up on these kinds of potential errors, though that
> is a lot bigger scope...
A lot of people in D.learn don't even use "-wi -property", so go
figure how many will use a lint tool :-)
As a first approximation, you can rely only on what people see
when compiling with plain "dmd foo.d", the most basic way to
invoke the compiler. More serious programmers thankfully activate warnings.
Bye,
bearophile

On Tuesday, April 24, 2012 01:01:57 James Miller wrote:
> I'm writing an introduction/tutorial to using strings in D,
> paying particular attention to the complexities of UTF-8 and 16.
> I realised that when you want the number of characters, you
> normally actually want to use walkLength, not length. Is it
> reasonable for the compiler to pick this up during semantic
> analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will
> get the right answer, but for the wrong reasons, resulting in
> lurking bugs. You can always cast to immutable(ubyte)[] or
> immutable(short)[] if you want to work with the actual bytes
> anyway.
At this point, I don't think that it makes any sense to give a warning for
this. The compiler can't possibly know whether using length is a good idea or
correct in any particular set of code. If we really want to do something to
tackle the problem, then we should create a new string type which better
solves the issues. There's a _lot_ more to be worried about due to the fact
that strings are variable length encoded than just their length.
There has been talk of creating a new string type, and there has been talk of
creating the concept of a variable length encoded range which better handles
all of this stuff, but no proposal thus far has gotten anywhere.
As for walkLength being O(n) in many cases (as discussed elsewhere in this
thread), I don't think that it's that big a deal. If you know what it's doing,
you know that it's O(n), and it's simple enough to simply save the result if
you need to call it multiple times.
- Jonathan M Davis

"James Miller" <james@aatch.net> wrote in message
news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> I'm writing an introduction/tutorial to using strings in D, paying
> particular attention to the complexities of UTF-8 and 16. I realised that
> when you want the number of characters, you normally actually want to use
> walkLength, not length. Is it reasonable for the compiler to pick this up
> during semantic analysis and point out this situation?
>
> It's just a thought because a lot of the time, using length will get the
> right answer, but for the wrong reasons, resulting in lurking bugs. You
> can always cast to immutable(ubyte)[] or immutable(short)[] if you want to
> work with the actual bytes anyway.
I find that most of the time I actually *do* want to use length. Don't know
if that's common, though, or if it's just a reflection of my particular
use-cases.
Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
the number of "characters" (ie, graphemes), but merely the number of code
points - which is not the same thing (due to existence of the
[confusingly-named] "combining characters").
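
A minimal sketch of the three counts for a base letter plus a combining accent. It assumes a Phobos release recent enough to provide std.uni.byGrapheme, which was added after this thread was written; graphemeCount is a hypothetical helper.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper for illustration; relies on std.uni.byGrapheme.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
    string s = "e\u0301";

    writeln(s.length);         // UTF-8 code units: 3
    writeln(s.walkLength);     // code points: 2
    writeln(graphemeCount(s)); // graphemes: 1
}
```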

On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
> Also, keep in mind that (unless I'm mistaken) walkLength does *not* return
> the number of "characters" (ie, graphemes), but merely the number of code
> points - which is not the same thing (due to existence of the
> [confusingly-named] "combining characters").
You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's
internals) deals with graphemes. It all operates on code points, and strings
are considered to be ranges of code points, not graphemes. So, as far as
ranges go, walkLength returns the actual length of the range. That's _usually_
the number of characters/graphemes as well, but it's certainly not 100%
correct. We'll need further unicode facilities in Phobos to deal with that
though, and I doubt that strings will ever change to be treated as ranges of
graphemes, since that would be incredibly expensive computationally. We have
enough performance problems with strings as it is. What we'll probably get is
extra functions to deal with normalization (and probably something to count
the number of graphemes) and probably a wrapper type that does deal in
graphemes.
Regardless, you're right about walkLength returning the number of code points
rather than graphemes, because strings are considered to be ranges of dchar.
- Jonathan M Davis

On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> "James Miller" <james@aatch.net> wrote in message
> news:qdgacdzxkhmhojqcettj@forum.dlang.org...
> > I'm writing an introduction/tutorial to using strings in D, paying
> > particular attention to the complexities of UTF-8 and 16. I realised
> > that when you want the number of characters, you normally actually
> > want to use walkLength, not length. Is it reasonable for the
> > compiler to pick this up during semantic analysis and point out this
> > situation?
> >
> > It's just a thought because a lot of the time, using length will get
> > the right answer, but for the wrong reasons, resulting in lurking
> > bugs. You can always cast to immutable(ubyte)[] or
> > immutable(short)[] if you want to work with the actual bytes anyway.
>
> I find that most of the time I actually *do* want to use length. Don't
> know if that's common, though, or if it's just a reflection of my
> particular use-cases.
>
> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
> return the number of "characters" (ie, graphemes), but merely the
> number of code points - which is not the same thing (due to existence
> of the [confusingly-named] "combining characters").
[...]
And don't forget that some code points (notably from the CJK block) are
specified as "double-width", so if you're trying to do text layout,
you'll want yet a different length (layoutLength?).
So we really need all four lengths. Ain't unicode fun?! :-)
Array length is simple. walkLength is already implemented. Grapheme
length requires recognition of 'combining characters' (or rather,
ignoring said characters), and layout length requires recognizing
widthless, single- and double-width characters.
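
As a rough sketch of what such a layout length could look like: columnWidth and layoutLength are hypothetical names, and the code point ranges below are a small, incomplete subset chosen by hand for illustration. A real implementation would need the full width and combining-class data from the Unicode tables.

```d
import std.stdio : writeln;

// Hypothetical helper: approximate column width of one code point.
// The ranges are an incomplete, hand-picked subset, not real Unicode data.
size_t columnWidth(dchar c)
{
    if (c >= 0x0300 && c <= 0x036F)      // combining diacritical marks
        return 0;
    if ((c >= 0x1100 && c <= 0x115F)     // Hangul Jamo initial consonants
     || (c >= 0x3040 && c <= 0x30FF)     // Hiragana and Katakana
     || (c >= 0x4E00 && c <= 0x9FFF)     // CJK Unified Ideographs
     || (c >= 0xAC00 && c <= 0xD7A3)     // Hangul syllables
     || (c >= 0xFF01 && c <= 0xFF60))    // fullwidth forms
        return 2;
    return 1;
}

// Hypothetical layoutLength: sum of the column widths of all code points.
size_t layoutLength(string s)
{
    size_t total = 0;
    foreach (dchar c; s) // iterating with dchar decodes the UTF-8 on the fly
        total += columnWidth(c);
    return total;
}

void main()
{
    writeln(layoutLength("a\u4E2D")); // one narrow plus one wide code point
    writeln(layoutLength("e\u0301")); // base letter plus zero-width accent
}
```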
I've been thinking about unicode processing recently. Traditionally, we
have to decode narrow strings into UTF-32 (aka dchar) then do table
lookups and such. But unicode encoding and properties, etc., are static
information (at least within a single unicode release). So why bother
with hardcoding tables and stuff at all?
What we *really* should be doing, esp. for commonly-used functions like
computing various lengths, is to automatically process said tables and
encode the computation in finite-state machines that can then be
optimized at the FSM level (there are known algos for generating optimal
FSMs), codegen'd, and then optimized again at the assembly level by the
compiler. These FSMs will operate at the native narrow string char type
level, so that there will be no need for explicit decoding.
The generation algo can then be run just once per unicode release, and
everything will Just Work.
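
The simplest case of working on code units without decoding can be written by hand. This is not the table-generated FSM described above, only a two-state check: it counts code points by skipping UTF-8 continuation bytes. The name countCodePoints is hypothetical, and the sketch assumes the input is valid UTF-8.

```d
import std.stdio : writeln;

// Hypothetical helper: count code points directly on UTF-8 code units.
// Assumes valid UTF-8; no decoding to dchar takes place.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)           // iterate raw code units
        if ((c & 0xC0) != 0x80)   // continuation bytes look like 10xxxxxx
            ++n;                  // every other byte starts a code point
    return n;
}

void main()
{
    writeln(countCodePoints("h\u00E9llo"));
}
```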
T
--
Give me some fresh salted fish, please.

"Jonathan M Davis" <jmdavisProg@gmx.com> wrote in message
news:mailman.2166.1335463456.4860.digitalmars-d@puremagic.com...
> On Thursday, April 26, 2012 13:51:17 Nick Sabalausky wrote:
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
>
> You're not mistaken. Nothing in Phobos (save perhaps some of std.regex's
> internals) deals with graphemes. It all operates on code points, and
> strings are considered to be ranges of code points, not graphemes. So, as
> far as ranges go, walkLength returns the actual length of the range.
> That's _usually_ the number of characters/graphemes as well, but it's
> certainly not 100% correct. We'll need further unicode facilities in
> Phobos to deal with that though, and I doubt that strings will ever
> change to be treated as ranges of graphemes, since that would be
> incredibly expensive computationally. We have enough performance problems
> with strings as it is. What we'll probably get is extra functions to deal
> with normalization (and probably something to count the number of
> graphemes) and probably a wrapper type that does deal in graphemes.
>
Yea, I'm not saying that walkLength should deal with graphemes. Just that if
someone wants the number of "characters", then neither length *nor*
walkLength is guaranteed to be correct.