Reversing an array of chars/wchars is a common enough operation (mutable
arrays often come from precedent operations that have built it). Currently
std.algorithm.reverse() can't be used:
import std.algorithm;
void main() {
dchar[] s1 = "hello"d.dup;
s1.reverse(); // OK
wchar[] s2 = "hello"w.dup;
s2.reverse(); // error
char[] s3 = "hello".dup;
s3.reverse(); // error
}
I suggest to add a char[]/wchar[] specialization to std.algorithm.reverse()
(or to add a std.string.reverse()), to make it work on those types too.
Generally std.algorithms don't work on UTF8/UTF16 because of the variable
length of its items, but for this specific algorithm I think this is not a
problem because:
1) Reversing an array is an O(n) operation, and decoding UTF adds a constant
overhead, so the computational complexity of reverse doesn't change. 2) If
you reverse an char[] or wchar[] the result will fit in the input array (is
this always true? Please tell me if this isn't true). It "just" needs to
correctly swap the bytes of multi-byte chars, and swap if there are
combined codepoints too.
- - - - - - - - - - - - - - - - - -
And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32):
import std.algorithm: reverse;
void main() {
dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
txt.reverse();
assert(txt == "\U00000042\U00000308\U00000041"d);
}
txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL
LETTER B (see bug 7084 for more details).
A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL
LETTER A, COMBINING DIAERESIS):
"\U00000042\U00000041\U00000308"d
See for some code:
http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in
-place
See also:
http://d.puremagic.com/issues/show_bug.cgi?id=7085
Regarding the printing of unicode strings see also:
http://d.puremagic.com/issues/show_bug.cgi?id=7084

If you want to reverse a char[], then cast it to ubyte[] and reverse that. If
you want to reverse a wchar[], then cast it to ushort[] and reverse that. In
Phobos, strings are ranges of dchar, so reverse is going to reverse code
points. If you want it to reverse code units instead, then you just use the
appropriate cast. There's no reason to have it reverse the code units and
completely mess up unicode strings.
And as I explained in bug# 7085, reverse's behavior with regards to dchar[] is
completely correct. It's reversing the code points, _not_ the graphemes. If
you want to operate on graphemes, you need a range of graphemes, which Phobos
does not yet support. Once it does (or if you implement it yourself), you can
reverse a string based on graphemes if that's what you want to do. But as it
stands, ranges of code points are the most advanced unicode construct that
Phobos currently supports, so that's what its functions are going to operate
on.
- Jonathan M Davis

And as I explained in bug# 7085, reverse's behavior with regards to dchar[] is
completely correct. It's reversing the code points, _not_ the graphemes.

OK. Maybe I will open a differently worded enhancement request, for a
grapheme-aware std.string.

If you want to reverse a char[], then cast it to ubyte[] and reverse that. If
you want to reverse a wchar[], then cast it to ushort[] and reverse that. In
Phobos, strings are ranges of dchar, so reverse is going to reverse code
points. If you want it to reverse code units instead, then you just use the
appropriate cast. There's no reason to have it reverse the code units and
completely mess up unicode strings.

I am not interested in reversing code units. Sorry if my post has led to this
wrong idea. For this specific problem I am not going to cast to ubyte[] or
ushort[] because it gives very wrong results.
It's possible to write a "correct" (that doesn't take into account graphemes)
reverse even if you do not use casts, keeping the array as char[] or wchar[],
reversing the bytes, and then reversing the bytes of each variable-length
codepoint. This is what I was asking to an in-place reverse().
Bye,
bearophile

And as I explained in bug# 7085, reverse's behavior with regards to
dchar[] is completely correct. It's reversing the code points, _not_
the graphemes.

grapheme-aware std.string.

If you want to reverse a char[], then cast it to ubyte[] and reverse
that. If you want to reverse a wchar[], then cast it to ushort[] and
reverse that. In Phobos, strings are ranges of dchar, so reverse is
going to reverse code points. If you want it to reverse code units
instead, then you just use the appropriate cast. There's no reason to
have it reverse the code units and completely mess up unicode strings.

I am not interested in reversing code units. Sorry if my post has led to
this wrong idea. For this specific problem I am not going to cast to
ubyte[] or ushort[] because it gives very wrong results.
It's possible to write a "correct" (that doesn't take into account
graphemes) reverse even if you do not use casts, keeping the array as
char[] or wchar[], reversing the bytes, and then reversing the bytes of
each variable-length codepoint. This is what I was asking to an in-place
reverse().

I don't expect that std.string will _ever_ be grapheme-aware or be processed
by default as a range of graphemes. That's far too expensive as far as
performance goes. Rather, we're likely to have a wrapper and/or separate
range-type which handles graphemes. Then if you want the extra correctness and
are willing to pay the cost, you use that. As I understand it, std.regex does
have the beginnings of such, but we do still need to have a range type of some
variety (probably in std.utf) which fully supports graphemes.
- Jonathan M Davis

Right, I was :-) But you have changed my mind when you have explained me
that nothing in std.algorithm is grapheme-aware. So I have reduced the
amount of what I am asking for.

So, now you're asking that char and wchar arrays be reversible with reverse
such that their code points are reversed (i.e. the result is the same as if
you reversed an array of dchar). Well, I'm not sure that you can actually do
that with the same efficiency. I'd have to think about it more. Regardless,
the implementation would be _really_ complicated in comparison to how
reverse works right now. char[] and wchar[] don't work with reverse,
because their elements aren't swappable. So, you can't just swap elements
as you iterate in from both ends. You'd have to be moving stuff off into
temporaries as you swapped them, because the code point on one side
wouldn't necessarily fit where the code point on the other side was, and in
the worst case (i.e. all of the code points on one half of the string are
multiple code units and all of those on the other side are single code
units), you'd pretty much end up having to copy half the array while you
waited for enough space to open up on one side to fit the characters from
the other side. So, regardless of whether it has the same computational
complexity as the current reverse, its memory requirements would be far
more.
I don't think that the request is completely unreasonable, but also I'm not
sure it's acceptable for reverse to change its performance characteristics
as much as would be required for it to work with arrays of char or wchar -
particularly with regards to how much memory would be required. In general,
the performance characteristics of the algorithms in Phobos don't vary much
with regards to the type that that's used. I'm pretty sure that in terms of
big-o notation, the memory complexity wouldn't match (though I don't recall
exactly how big-o notation works with memory rather than computational
complexity).

Right, I was :-) But you have changed my mind when you have explained me
that nothing in std.algorithm is grapheme-aware. So I have reduced the
amount of what I am asking for.

So, now you're asking that char and wchar arrays be reversible with reverse
such that their code points are reversed (i.e. the result is the same as if
you reversed an array of dchar). Well, I'm not sure that you can actually do
that with the same efficiency. I'd have to think about it more. Regardless, the
implementation would be _really_ complicated in comparison to how reverse
works right now. char[] and wchar[] don't work with reverse, because their
elements aren't swappable. So, you can't just swap elements as you iterate in
from both ends. You'd have to be moving stuff off into temporaries as you
swapped them, because the code point on one side wouldn't necessarily fit where
the code point on the other side was, and in the worst case (i.e. all of the
code points on one half of the string are multiple code units and all of those
on the other side are single code units), you'd pretty much end up having to
copy half the array while you waited for enough space to open up on one side
to fit the characters from the other side. So, regardless of whether it has the
same computational complexity as the current reverse, its memory requirements
would be far more.
I don't think that the request is completely unreasonable, but also I'm not
sure it's acceptable for reverse to change its performance characteristics as
much as would be required for it to work with arrays of char or wchar -
particularly with regards to how much memory would be required. In general,
the performance characteristics of the algorithms in Phobos don't vary much
with regards to the type that that's used. I'm pretty sure that in terms of
big-o notation, the memory complexity wouldn't match (though I don't recall
exactly how big-o notation works with memory rather than computational
complexity).
- Jonathan M Davis