Waiting for 8.4 – database-level lc_collation and lc_ctype

On 23rd of September, Heikki Linnakangas committed patch that was written by Radek Strnad (actually committed patch is a stripped-down version of original).

What it does? It adds ability to have (finally!) different collation order and character categories for different databases.

Up until now, you had to set LC_COLLATE and LC_CTYPE when doing initdb, and then it was unchangeable. Database initdb'ed as LATIN2 would not work fully correctly when dealing with UTF-8.

Now, it has been changed:

Make LC_COLLATE and LC_CTYPE database-level settings. Collation and
ctype are now more like encoding, stored in new datcollate and datctype
columns in pg_database.
This is a stripped-down version of Radek Strnad's patch, with further
changes by me.

What it gives us?

For example, let's assume we have PostgreSQL instance initialized with “C" locale:

# CREATEDATABASE depesz_pl WITH encoding 'utf8'COLLATE'pl_PL.UTF-8' ctype 'pl_PL.UTF-8';
ERROR: NEW collation IS incompatible WITH the collation OF the template DATABASE(C)
HINT: USE the same collation ASIN the template DATABASE,ORUSE template0 AS template

(of course I could have created template1_pl, but then – I would have to create this template somehow 🙂

One thing – you can't change collation/ctype of existing database. Reason for this is pretty simple: indexes rely on collation (and can rely on ctype). So changing collation/ctype would require reindexation of all data. While this would be technically possible – you can do it by simply dumping database, creating new database with desired locale, and loading dump.

Of course this is still a long way to make PostgreSQL fully functional in multi-language environments, but at the very least it is a step in right direction. Awaited, and important step.

You mention “this is stil la long way to make PostgreSQL fully functional in multi-language environments”. Could you probably post where it is currently lacking multi-language support or potentially cause problems?

@Steve:
to make it fully functional I would need to supply colation/ctype on much smaller objects than databases: tables or columns.

for example – consider sorting texts in many languages. of course – you can get away with it using simply utf8 everywhere, but that introduces additional costs of text conversion if your output charset is different.