Monday, March 05, 2012

The Perils of Collation-Aware Comparisons

If you use psql's \l command to list all the databases in your PostgreSQL cluster, you'll notice a column in the output labelled "Collate"; on my system, that column has the value "en_US.UTF-8". This means that when, for example, you sort strings, you'll use the "English" sort order rather than the traditional "C" sort order. You might ask: what's the difference?

In the "C" sort order, all capital letters come before all lower-case letters, whereas in en_US.UTF8, a comes before A which comes before b which comes before B, and so on. In other words, every collation can have its own rules for sorting strings, consistent with the way that the people who speak that language like to alphabetize things; or at least with the way that the people who wrote the locale definitions for that language think that they like to alphabetize things. Collations are OS-dependent: some operating systems don't support them at all, while Windows has a completely different naming convention from every other operating system, and probably different behaviors as well.
Now, being able sort strings according to the rules of some particular language is definitely cool, and useful, and PostgreSQL 9.1 has some improvements in this area that allow you to change the collation for individual columns or even on the fly for one particular query, using the COLLATE keyword. What's less cool is that collation-aware comparisons are considerably slower than the traditional "C" sort order. I recently did some testing of sorting 100,000 randomly-generated strings using PostgreSQL 9.2devel and found that it took about 80% longer using the en_US than it did using "C"; and I've heard of cases where the penalty was substantially more than that (this particular test was run on a glibc-derived system, but things don't seem to be much better on BSD-derived systems, even though strcoll() is implemented quite differently there). These speed penalties not only affect the cost of sorting result-sets, but also the costs of building and accessing B-tree indexes, which can be, and by default are, collation-aware.

If you happen to need the particular sorting behavior that collation-aware sorting and comparisons provide, then you may find this price worth paying, but I suspect there are a lot of people out there who are paying it more or less accidentally and don't really care very much about the underlying sorting behavior. If, for example, all of your queries are based on equality, and you don't use greater-than or less-than tests, then it doesn't matter what collation is in use. You might as well use "C" instead of whatever your local default may be, because it's faster. On PostgreSQL 9.1+, you can do this by including COLLATE "C" in your CREATE INDEX statement, and on any recent version you can also do it by writing text_pattern_ops after the column name. Such indexes have another benefit, as well: they can be used to optimize LIKE searches against constant patterns that do not begin with a wildcard.

It would be possible to reduce the speed penalty associated with collation-aware comparisons given better operating system support. We use the C function strcoll() to compare strings, but this function requires the strings being compared to be NUL-terminated. PostgreSQL stores strings internally with no trailing NUL-byte, so we're forced to copy the input strings and add a NUL to each one before calling strcoll(). It would be much nicer for PostgreSQL if the operating system exposed a function that allowed a length for each string to be specified in bytes, and guaranteed that only that number of bytes would be read. It appears that, currently, no operating system provides such a function. Also, PostgreSQL needs any two strings that are not byte-for-byte identical to compare as unequal, regardless of the collation in use. Since strcoll() does not guarantee this on all collations on all platforms, we compare the strings using strcmp() whenever strcoll() says they are equal. Fortunately, this branch isn't taken often, but it would be better if we didn't need it at all. An OS-level API allowing us to ask "can this collation ever return equality for two not-byte-for-byte identical strings?" might be useful; or just some kind of iron-clad guarantee that such a thing will never happen. What would help even more is a faster version of strcoll(); the testing I've done seems to indicate that copying the input strings is annoying, but the biggest problem is that strcoll() itself is nowhere near as fast as strcmp(). That may be hard to fix, though - whereas providing a better API should be a simple matter of programming.

There's been some discussion of using the ICU library as an alternative to or in addition to the operating system's collation support. Some (perhaps somewhat out-of-date?) notes on that can be found at http://wiki.postgresql.org/wiki/Todo:ICU. While it appears to address at least the first of the problems mentioned in the previous paragraph, it seems that it doesn't have full support for UTF-8, which PostgreSQL uses heavily. We'd basically be replacing one kind of impedance mismatch with another, which might or might not work out to an improvement, but certainly isn't ideal.

9 comments:

Just a note about ICU, FreeBSD port of Postgres actually uses ICU for collation handling with UTF8 charset since the system itself has bad support for UTF8, maybe the UTF8 support in ICU is not that bad?

ICU have a UTF8 as default mode, which work quite nice. Some function use UTF16 internally, but they are still faster then glibc even if you count the conversion time.http://userguide.icu-project.org/strings/utf-8

Performance could probably be improved significantly if we could cache the result of strxfrm() somewhere.

The reason that there is no "strncoll" function is that string collation is not "prefix transparent" (my term), meaning that if strcoll(a, b) < 0, that does not imply strncoll(a, b, n) < 0, for n < strlen(a) and n < strlen(b), which is would you would otherwise expect from strncmp(). Of course, such a function could still exist, but the way to use it would be quite inconsistent with other string.h functions.

Eventually, we will want to allow collations where different strings compare equal (think case insensitive). So that OS-level API you are asking for is not only virtually impossible to compute, it would also be wrongheaded for our purposes.

I don't think that caching the strxfrm() output would be very helpful, because it's ridiculously space-inefficient. For example, on my MacBook Pro, strxfrm() of "abcd" under en_US produces "001S001T001U001V0000001S001T001U001V"; in general, it seems that for a string of length n, strxfrm() produces one of length 9n+3, which is over the top.

I am aware that strncoll() couldn't reasonably behave like strncmp(), but I don't think that's an argument against it. glibc's strcoll() starts by doing strlen() on each string, which is pretty silly given that we just copied the input strings (whose length we knew) solely for the purpose of adding a \0 terminator.