I made a commit that was meant to better document which functions
throw in std.utf.
In doing so, I noticed that some of our functions are unsafe. For
example:
string s = [0b1100_0000]; // 1st byte of a 2-byte sequence
s.popFront(); // Assertion error because of the invalid
              // slice s[2 .. $]
popFront is nothrow, so throwing an exception is out of the
question, and the implementation seems to imply that "invalid
unicode sequences are removed".
This is a bug, right?
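For reference, here is roughly what that "pop the announced
length" behaviour amounts to (a simplified sketch; announcedLength
is a made-up helper, not the actual Phobos code):

// Length of the sequence announced by a UTF-8 lead byte
// (simplified; invalid lead bytes are treated as length 1).
size_t announcedLength(char lead)
{
    if (lead < 0b1000_0000) return 1; // ASCII
    if (lead < 0b1100_0000) return 1; // stray continuation byte: invalid lead
    if (lead < 0b1110_0000) return 2; // 0b110X_XXXX
    if (lead < 0b1111_0000) return 3; // 0b1110_XXXX
    if (lead < 0b1111_1000) return 4; // 0b1111_0XXX
    return 1;                         // invalid lead byte
}

For s = [0b1100_0000], the announced length is 2 but s.length is
only 1, so the slice s[2 .. $] is out of bounds and trips the
assertion.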
--------
Things get more complicated if you take into account "partial
invalidity". For example:
string s = [0b1100_0000, 'a', 'b'];
Here, the first byte starts an invalid sequence, since the
second byte is not of the form 0b10XX_XXXX. What's more, that
second byte is itself a valid one-byte sequence (the ASCII 'a').
We do not detect this though, and produce this output:
s.popFront(); => s == "b";
*arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
Where only the single invalid first byte is removed.
The problem is that doing this would actually be much more
expensive, especially for a rare case. Worse yet, chances are you
end up validating the same character again, and again (and again).
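For concreteness, here is a sketch of what that behaviour could
look like (popFrontValidating is a hypothetical helper built on
std.utf.decode, not something Phobos provides):

import std.utf : decode, UTFException;

// Pop a whole code point if the leading sequence is valid,
// otherwise pop only the single offending byte.
// Assumes s is not empty.
void popFrontValidating(ref string s)
{
    size_t i = 0;
    try
    {
        decode(s, i); // advances i past one valid code point
    }
    catch (UTFException)
    {
        i = 1;        // skip only the invalid first byte
    }
    s = s[i .. $];
}

With this, popping the front of [0b1100_0000, 'a', 'b'] leaves
"ab", but it pays for a full validation on every call.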
--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to
follow when decoding UTF with invalid code units"?
2. Do we even really support invalid UTF after we "leave" the
std.utf.decode layer? E.g. do we simply assume that the string is
valid?

> So here are my 2 questions:
> 1. Is there, or does anyone know of, a standardized "behavior to
> follow when decoding UTF with invalid code units"?
> 2. Do we even really support invalid UTF after we "leave" the
> std.utf.decode layer? E.g. do we simply assume that the string is
> valid?

We don't support invalid unicode beyond providing ways to check for it and in
some cases throwing if it's encountered. If you create a string with invalid
unicode, then you're shooting yourself in the foot, and you could get weird
results. Some code checks for validity and will throw when it's given invalid
unicode (decode in particular does this), whereas some code will simply ignore
the fact that it's invalid and move on (generally because it doesn't bother
to go to the effort of validating it). I believe that at the moment, the idea
is that when the full decoding of a character occurs, a UTFException will be
thrown if an invalid code point is encountered, whereas anything which only
partially decodes characters (e.g. just figures out how large a code point is)
may or may not throw. popFront used to throw but doesn't any longer, in an
effort to make it faster, letting decode be the one to throw (so front would
still throw, but popFront wouldn't).
I'm not aware of there being any standard way to deal with invalid Unicode,
but I believe that popFront currently just treats invalid code points as being
of length 1.
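To make that split concrete, here is a minimal sketch of the
behaviour described above (using std.utf.decode and the string
popFront from std.range; the exact behaviour may differ between
Phobos versions):

import std.range;
import std.stdio : writeln;
import std.utf : decode, UTFException;

void main()
{
    string s = [0b1100_0000, 'a', 'b']; // invalid lead byte, then ASCII

    // Full decoding validates and throws.
    size_t i = 0;
    try
    {
        decode(s, i);
    }
    catch (UTFException)
    {
        writeln("decode threw on the invalid sequence");
    }

    // popFront does not validate: it takes the length announced by
    // the lead byte and slices past it, so the invalid byte and the
    // 'a' are both skipped.
    s.popFront();
    writeln(s); // prints "b" with the behaviour described above
}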
- Jonathan M Davis

>> So here are my 2 questions:
>> 1. Is there, or does anyone know of, a standardized "behavior to
>> follow when decoding UTF with invalid code units"?
>> 2. Do we even really support invalid UTF after we "leave" the
>> std.utf.decode layer? E.g. do we simply assume that the string is
>> valid?
>
> We don't support invalid unicode beyond providing ways to check
> for it and in some cases throwing if it's encountered. If you
> create a string with invalid unicode, then you're shooting
> yourself in the foot, and you could get weird results. Some code
> checks for validity and will throw when it's given invalid
> unicode (decode in particular does this), whereas some code will
> simply ignore the fact that it's invalid and move on (generally
> because it doesn't bother to go to the effort of validating it).
> I believe that at the moment, the idea is that when the full
> decoding of a character occurs, a UTFException will be thrown if
> an invalid code point is encountered, whereas anything which only
> partially decodes characters (e.g. just figures out how large a
> code point is) may or may not throw. popFront used to throw but
> doesn't any longer, in an effort to make it faster, letting
> decode be the one to throw (so front would still throw, but
> popFront wouldn't).

OK: I guess that makes sense. I kind of wish there'd be more of
a documented "two-level" scheme, but that should be fine.

> I'm not aware of there being any standard way to deal with
> invalid Unicode, but I believe that popFront currently just
> treats invalid code points as being of length 1.
>
> - Jonathan M Davis

Well, popFront only pops 1 element if the very first element of
the string is an invalid code point, but it will not "see" whether
the following bytes of a multi-byte sequence are valid.
This kind of gives it a double-standard behavior, but I guess we
have to draw a line somewhere.
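For example (going by the behaviour described above, and
constructing the strings the same way as in the earlier examples):

import std.range : popFront;

void main()
{
    string a = [0b1000_0000];           // a stray continuation byte, invalid as a lead:
    a.popFront();                       // pops just that one byte

    string b = [0b1100_0000, 'a', 'b']; // the lead byte announces a 2-byte sequence:
    b.popFront();                       // pops 2 bytes without ever checking that 'a'
                                        // is a valid continuation byte, so b == "b"
}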

> OK: I guess that makes sense. I kind of wish there'd be more of
> a documented "two-level" scheme, but that should be fine.

It's pretty much grown over time and isn't necessarily applied consistently.

> Well, popFront only pops 1 element if the very first element of
> the string is an invalid code point, but it will not "see" whether
> the following bytes of a multi-byte sequence are valid.
> This kind of gives it a double-standard behavior, but I guess we
> have to draw a line somewhere.

We care about making popFront as fast as possible, and in general, front is
called on the character as well (making the whole way that front and popFront
work for strings naturally inefficient, unfortunately), so it makes sense to
skip the checking as much as possible in popFront. It's basically doing the
best that it can to be as fast as it can, so any checking that it doesn't need
to do is best skipped. Speed wins over correctness here, and anything that we
can do to make it faster is desirable. It's not perfect that way, but since in
most cases the Unicode will be correct, and the correctness is generally
checked by front (or decode), it was deemed to be the best approach.
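To illustrate the pattern being referred to, here is a minimal
sketch of the usual front/popFront loop over a string: validation
happens in front, which decodes the code point, while popFront
merely advances:

import std.range : empty, front, popFront;
import std.stdio : writeln;

void main()
{
    string s = "héllo";
    while (!s.empty)
    {
        dchar c = s.front; // decodes; invalid UTF would be caught here
        writeln(c);
        s.popFront();      // just advances past the code point, no validation
    }
}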
- Jonathan M Davis