The exact and quick answer is: because characters with unknown location in alphabet are ignored when sorting.

But let me show some details.

Most of you are familiar with alphabet like this:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z

others might know and understand alphabet like this:

a,ą,b,c,ć,d,e,ę,f,g,h,i,j,k,l,ł,m,n,ń,o,ó,p,r,s,ś,t,u,w,x,y,z,ź,ż

There are also alphabets with different characters. Like ù, æ or ë.

But that's not all. There are some languages which don't use any of these letters, they have their own. Like Russian:

а,б,в,г,д,е,ё,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ъ,ы,ь,э,ю,я

To make it even more interesting – some languages use latin letters, but order of them in alphabet is different.

For example – in Czech language “CH" pair, is considered (as far as sorting is considered) as single character, and it sorted between H and I. Or in French, when in some cases, you order not from left to right, but from right to left.

So. To make sorting work at all we have to have tables, which say which character is before/after which. These are called “collations", and are part of system locales. PostgreSQL uses it to avoid bundling the same tables in its sources.

What's most important – virtually all (at the very least: all I know) locales, don't specify order of most of the characters. Mostly because (as I understand) that (for example) space could be very well be before first letter, or after last – and it's just a matter of personal preference. While order of letters in alphabet is well established, and usually supported by local law or norms.

So, the point is – every character that has not strict position defined in collation – is ignored, because PostgreSQL, nor system, will decide for you if you want spaces before a, or after z.

Are there any ways to get sorting which takes into consideration spaces?

Sure.

If you'll use pseudo-locale “C", ordering is done using ASCII value of every character. Which ignores all special cases for languages, but uses spaces. Since I don't have (and am not really happy to setup) database with C locale, I can show it using system sort:

Which is pretty obvious, because order of ó, ą, and so on is defined in Polish language (and possibly some others, but not necessarily the same), while ucs_basic (as I understand it) orders using numerical value of unicode codepoints.

So, is it possible at all to get proper ordering, with some additional changes, that would cause spaces (or other characters) to be sorted “together"?

Well, there is solution, and quite simple (although not elegant).

We can write simple function, which will modify input string, and return another, that is sortable the way we want.

For example. Let's assume we want proper ordering, and spaces should be before letters. In my case, it means it should be “before a".

So, the easy change is to make letter “a" into “az", and space into “aa". Like this:

As you can see both polish accents, and spaces are ordered the way we'd want.

Please note that the substitutions I used (aa and az) are safe in my locale, but you have to check if they are safe in your locale too, if you'd want to use such hack.

Using the same method we could also add to ordering dots, commas, or any other character that's important to you, but hasn't been included in collation tables.

The important factor in here is that the order_with_spaces() function is immutable, which means that you can use it as base for index. But this leads to one small issue – if you'll use it as base for index, and then you'll change the definition of the function (because, for example, you want to add special place for – character) – you will have to reindex all indexes that use the function.

One thought on “Why is “depesz” between “de luca” and “de vil”?”

The technical details of this are not quite correct. The collation definitions on glibc do provide a sort order for all (well, maybe most) characters. The reason why spaces appear to be ignored is that collation elements are that sorted in multiple passes (usually 3 or 4), and the collation setting for space is

IGNORE;IGNORE;IGNORE;

so that it is ignored in the first three passed, unlike letters and other “more important” characters.