Re: Even faster string equality (see #16863)

Are you sure? I'm not sure what lexcmp does (or should do..), but if, say, one string has precomposed Unicode letters and the other doesn't, should it be possible for them to compare equal? Maybe that's not preferable by default? Or is it..?

These things can all be tricky; e.g. minor recent upgrades of PostgreSQL are generally considered safe (and recommended over not upgrading), with a rare "exception" (you just need to be careful and REINDEX afterwards, as indexes were broken by a former minor update..):

https://www.postgresql.org/docs/9.5/static/release-9-5-2.html

"PostgreSQL 9.5 introduced logic for speeding up comparisons of string data types by using the standard C library function strxfrm() as a substitute for strcoll(). It now emerges that most versions of glibc (Linux's implementation of the C library) have buggy implementations of strxfrm() that, in some locales, can produce string comparison results that do not match strcoll(). Until this problem can be better characterized, disable the optimization in all non-C locales. (C locale is safe since it uses neither strcoll() nor strxfrm().)

Unfortunately, this problem affects not only sorting but also entry ordering in B-tree indexes, which means that B-tree indexes on text, varchar, or char columns may now be corrupt if they sort according to an affected locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.

It is not possible at this time to give an exhaustive list of known-affected locales. C locale is known safe, and there is no evidence of trouble in English-based locales such as en_US, but some other popular locales such as de_DE are affected in most glibc versions."

On Wednesday, June 8, 2016 at 11:55:11 PM UTC-4, Scott Jones wrote:

I have made a branch with an even faster version of == for the String type (it took 28%-40% less time in my testing).

Instead of calling lexcmp, which has to deal with strings of different length, it calls memcmp directly if the string lengths are equal, and checks if the result is 0.
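A sketch of that described fast path (hypothetical code, not the actual branch's implementation; `bytes_equal` is an illustrative name, and this uses current `GC.@preserve` syntax rather than the 0.5-era internals):

```julia
# Hypothetical sketch of the described optimization: if the byte
# lengths differ the strings cannot be equal; otherwise a single
# memcmp over the raw bytes decides equality.
function bytes_equal(a::String, b::String)
    sizeof(a) == sizeof(b) || return false
    GC.@preserve a b begin
        ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
              pointer(a), pointer(b), sizeof(a)) == 0
    end
end
```

The early length check is what avoids the extra bookkeeping that a general lexcmp has to do for strings of different lengths.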


I'm partial to "natural sort" being the default.. Then we could have non-default ASCIIbetical ordering, or Unicode ordering disregarding precomposed forms etc. - whatever is fastest.

--
Palli.

> On Wednesday, July 13, 2016 at 9:48:01 AM UTC-4, Páll Haraldsson wrote:
>
> On Thursday, June 9, 2016 at 4:46:46 AM UTC, Jeffrey Sarnoff wrote:
>
> looks sane
>
>
> Are you sure? I'm not sure what lexcmp does (or should do..), but if
> say one string has precomposed Unicode letters, and the other
> doesn't, should it be possible for them to compare the same? Maybe
> it's not preferable by default? Or is it..?
>
> These things can all be tricky; e.g. minor recent upgrades of
> PostgreSQL are generally considered safe (and recommended over not
> upgrading), with a rare "exception" (you just need to be careful
> and REINDEX afterwards, as indexes were broken by a former minor update..):
>
> https://www.postgresql.org/docs/9.5/static/release-9-5-2.html
>
> "PostgreSQL 9.5 introduced logic for speeding up comparisons of
> string data types by using the standard C library function
> strxfrm() as a substitute for strcoll(). It now emerges that
> most versions of glibc (Linux's implementation of the C library)
> have buggy implementations of strxfrm() that, in some locales, can
> produce string comparison results that do not match strcoll().
> Until this problem can be better characterized, disable the
> optimization in all non-C locales. (C locale is safe since it uses
> neither strcoll() nor strxfrm().)
>
> Unfortunately, this problem affects not only sorting but also entry
> ordering in B-tree indexes, which means that B-tree indexes on text,
> varchar, or char columns may now be corrupt if they sort according
> to an affected locale and were built or modified under PostgreSQL
> 9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.
>
> It is not possible at this time to give an exhaustive list of
> known-affected locales. C locale is known safe, and there is no
> evidence of trouble in English-based locales such as en_US, but some
> other popular locales such as de_DE are affected in most glibc
> versions."
>
>
>
> On Wednesday, June 8, 2016 at 11:55:11 PM UTC-4, Scott Jones wrote:
>
> I have made a branch with an even faster version of == for
> the String type (it took 28%-40% less time in my testing).
> Instead of calling lexcmp, which has to deal with strings of
> different length, it calls memcmp directly if the string
> lengths are equal, and checks if the result is 0.
>
> https://github.com/ScottPJones/julia/tree/spj/fasteqstr
>
> If somebody could make a PR out of this branch, it would be
> appreciated.
>
> Thanks in advance,
> Scott
>

Re: Even faster string equality (see #16863)

We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.


Re: Even faster string equality (see #16863)

We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior

OK, we can define it that way; then the optimization that is already merged is fine.

This still does not imply anything about ordering. That must be defined. ASCIIbetical ordering is sometimes used (but is bad), and the same scheme for all of Unicode would also be a disaster. Since I want my alphabet ordered, e.g. a á b c.., it's not a leap to correctly order á whether it's precomposed or not.

like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.

Is it at least easy to include a package with new behavior that substitutes for the default string type? If not under the same name, can at least all string literals get the new type?

If you must define some ordering anyway, maybe it's better that equality is not dumb..

Re: Even faster string equality (see #16863)

Rather than defining a whole new string type, you can also just normalize your strings (with the normalize_string function) before comparing them, in the rare cases where you want equality that ignores e.g. differences in the encoding of combining characters.

You could even overload a new infix operator (e.g. ≃) that performs normalized equality tests of whatever flavor you like.

There's no single correct answer here, because it depends on what differences you want to ignore. Do you want to ignore differences in codepoint composition that lead to logically identical strings? Use NFC normalization. Do you also want to ignore differences in "confusable" characters like µ (micro) vs μ (greek mu)? Use NFKC normalization. Do you want to ignore all diacritical marks, e.g. treat a and á the same? Use the stripmark=true option. Do you want to ignore case? Use the casefold=true option.

Julia's default of == meaning "same codepoints" is reasonable and fast (whereas anything requiring normalization will be much slower).
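A sketch of that suggestion in current Julia, where normalize_string has since become Unicode.normalize (the operator ≃ and the name loose_eq are illustrative choices, not anything from Base):

```julia
using Unicode  # normalize_string from this era is now Unicode.normalize

# ≃ compares strings after NFC normalization, so precomposed "é"
# (U+00E9) and "e" plus combining acute accent (U+0301) test equal.
≃(a::AbstractString, b::AbstractString) =
    Unicode.normalize(a, :NFC) == Unicode.normalize(b, :NFC)

# A looser flavor that also strips diacritical marks and folds case,
# using the keyword options mentioned above.
loose_eq(a::AbstractString, b::AbstractString) =
    Unicode.normalize(a; stripmark=true, casefold=true) ==
    Unicode.normalize(b; stripmark=true, casefold=true)
```

With these definitions, `"é" ≃ "e\u0301"` holds even though `==` on the same pair is false, since the two spellings have different code units.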

Re: Even faster string equality (see #16863)

We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior

OK, we can define it that way; then the optimization that is already merged is fine.

Julia, AFAIK, has never compared UTF8String (or now String) in any way except for just checking if the encoded bytes are the same.

Before my fixes to handle different technically invalid (but common) UTF-8 encodings, you could easily get two strings like “\xc0\x80” and “\0” which contained the same Unicode code point, but with a different byte encoding (as well as have many completely invalid sequences that were simply not detected).

I think there is a problem now, in that the String constructor does not seem to be making sure that all strings produced are actually valid UTF-8, and methods that deal with String assume that a String is valid UTF-8.

In v0.4.x, utf8 (and convert) ensured that each character in the resulting UTF8String was encoded correctly, but now utf8 has been deprecated, and convert(String, ...) doesn’t create a canonically encoded UTF-8 string.

This still does not imply anything about ordering. That must be defined. ASCIIbetical ordering is sometimes used (but is bad), and the same scheme for all of Unicode would also be a disaster. Since I want my alphabet ordered, e.g. a á b c.., it's not a leap to correctly order á whether it's precomposed or not.

Ordering like that is language dependent, and really needs something very complex like the ICU library to attempt to do it right (I wrote code to handle those issues before ICU was available - it’s not at all a trivial task).

like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.

Is it at least easy to include a package with new behavior that substitutes for the default string type? If not under the same name, can at least all string literals get the new type?

That is not currently possible.

If you must define some ordering anyway, maybe it's better that equality is not dumb..


Julia's default of == meaning "same codepoints" is reasonable and fast (whereas anything requiring normalization will be much slower).

Unfortunately, that is simply not true (and as far as I've seen has never been) for UTF8String (and now String).

Since Julia never requires that strings be valid, there are many different ways of encoding a single code point: either overly long encodings (such as 0xc0 0x80, an invalid, overly long encoding of 0x00), or encoding non-BMP characters with 6 bytes (i.e. as two 3-byte surrogate characters, which also causes strings not to sort correctly), and those will give bad results. Fixing those issues would make the performance far worse than it already is.

At least, with convert and utf8 in v0.4.x (because of my PRs), you could eliminate those problems and ensure that you had valid UTF-8 encoded strings before comparisons, but now, in v0.5.x, you'd need to add the LegacyStrings package to gain back that functionality.

It would be possible, although rather inefficient, to do something like ==(x::String,y::String) = convert(Vector{Char},x) == convert(Vector{Char},y), and then you also have to deal with all sorts of hashing problems as well.

Just to be very clear here - I am *not* talking about composed Unicode characters, which is another matter entirely; I'm talking about simply trying to compare two Unicode strings accepted by String.

This is one of the reasons I felt that the string changes that have gone into v0.5 were a mistake. (I'm not saying that the way strings were in v0.3 was good either, with 4 different inconsistent basic string types, and my fixes in v0.4 didn't change that mess.)
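To illustrate the overlong-encoding point in current Julia (where such bytes are accepted but flagged invalid rather than decoded):

```julia
# "Modified UTF-8" encodes U+0000 as the overlong pair 0xc0 0x80.
# Julia's String accepts these bytes but treats them as invalid
# UTF-8, so the byte-wise == does not consider the result equal to
# the canonical one-byte encoding "\0".
s1 = String(UInt8[0xc0, 0x80])
s2 = "\0"
println(s1 == s2)    # false: different code units
println(isvalid(s1)) # false: overlong encoding is not valid UTF-8
```

So two strings containing "the same code point" in this loose sense compare unequal under byte-wise equality, which is exactly the distinction between "same code units" and "same code points" being argued here.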

Re: Even faster string equality (see #16863)

Julia's default of == meaning "same codepoints" is reasonable and fast (whereas anything requiring normalization will be much slower).

Unfortunately, that is simply not true (and as far as I've seen has never been) for UTF8String (and now String).

It's true for valid strings, which is what most applications are concerned with. For invalid encodings, it's true that we compare encodings. I don't see a problem with that. If the user just stuffed arbitrary binary data into a string, comparing encodings (valid or not) seems to be the most reasonable choice of == without knowing application-specific information.

(For example, if your application knows that you are importing "modified UTF-8" from Java code that has (invalid) surrogate pairs, you can do an application-specific preprocessing step to convert these into proper UTF-8 if you want to use the String type. But it would be inappropriate for Base.String to do this preprocessing for you, because Base.String doesn't know the origin of your invalid encoding.)
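A sketch of such an application-specific preprocessing step (fix_modified_utf8 is a hypothetical name; this handles only the overlong NUL case, not the surrogate-pair CESU-8 repair, which needs more work):

```julia
# Rewrite Java-style "modified UTF-8": the overlong NUL encoding
# 0xc0 0x80 becomes a plain 0x00 byte before building a String.
function fix_modified_utf8(bytes::Vector{UInt8})
    out = UInt8[]
    i = 1
    while i <= length(bytes)
        if i < length(bytes) && bytes[i] == 0xc0 && bytes[i+1] == 0x80
            push!(out, 0x00)  # replace the overlong pair with NUL
            i += 2
        else
            push!(out, bytes[i])
            i += 1
        end
    end
    String(out)
end
```

The point of the argument above is that this kind of rewriting belongs in application code that knows where its bytes came from, not in Base.String itself.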

Re: Even faster string equality (see #16863)

Then what you are saying is that == for the String type means simply "same code units", which is not at all the same thing as "same codepoints".

This isn't a case of just arbitrary binary data; these are UTF-8 encodings that are going to be frequently found, whether coming from Java (or other languages using the JVM) or from many databases.

I don't think it is inappropriate for `convert(String, x::Vector{UInt8})` to make sure that the string is canonically encoded; it can save a lot of grief when people get UTF-8 data from a variety of sources.
