This explains why I've been having issues with MongoDB enabled. MongoDB returns character strings with the utf8 flag set. See Item10748, where we added a similar fix at a much later stage in the rendering pipeline.

Unfortunately, the "quick fix" above solves the main problem from this Task, but somewhere deeper it causes double encoding or something similar, and therefore it is not correct.

Encoding problems like this are the main reason why it is good to encode/decode to/from utf8 as close to the I/O boundary as possible (for example in Foswiki::readFile). IMHO, Foswiki should internally use only utf-8 character strings, never octets/bytes.
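To illustrate (a minimal sketch only, not the actual Foswiki API; the function name and arguments are invented for this example), decoding right at the read boundary in Perl could look like this:

<verbatim>
use strict;
use warnings;

# Hypothetical helper: decode octets into characters right at the
# I/O boundary, so the rest of the code only ever sees characters.
sub read_file_as_chars {
    my ($path, $charset) = @_;
    $charset ||= 'utf-8';
    # The :encoding() PerlIO layer decodes octets into a Perl
    # character string (utf8 flag set) as the file is read.
    open my $fh, "<:encoding($charset)", $path
        or die "Cannot read $path: $!";
    local $/;                 # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;             # characters, not octets
}
</verbatim>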

One possible solution, not ideal but usable: Foswiki::readFile (or the new Store::VC interface) should read all files from the filesystem in binary mode (as Store::VC already does). If the SiteCharset is not utf8, it should convert (recode) the text from the SiteCharset to utf8 and return a utf8-flagged string to Foswiki. From that moment on, everything is done in utf8. On write, convert from utf8 back to the SiteCharset and write in binary mode. (Characters can be lost here, e.g. when someone wants to convert a Greek omega into 8859-1, but a possible workaround is HTML entities; see the sketch below.) If the SiteCharset is utf8, no conversions are needed.
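A minimal sketch of that read/recode/write cycle (assuming a $SiteCharset setting; the helper names are invented). Encode's FB_HTMLCREF fallback replaces any character the target charset cannot represent with an HTML numeric entity, which is exactly the workaround mentioned above:

<verbatim>
use strict;
use warnings;
use Encode ();

my $SiteCharset = 'iso-8859-1';    # illustrative value

sub read_text {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "read $path: $!";
    local $/;
    my $octets = <$fh>;
    close $fh;
    # Recode SiteCharset octets into a utf8-flagged character string.
    return Encode::decode($SiteCharset, $octets);
}

sub write_text {
    my ($path, $chars) = @_;
    open my $fh, '>:raw', $path or die "write $path: $!";
    # A character with no SiteCharset representation (e.g. a Greek
    # omega under 8859-1) becomes &#937; instead of being lost.
    print {$fh} Encode::encode($SiteCharset, $chars, Encode::FB_HTMLCREF);
    close $fh;
}
</verbatim>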

The same applies at the DB level. If the DB is not utf8, convert before read/write; if the DB is utf8 (like Mongo), no conversion is needed. This would also allow having multiple DBs with different encodings, and so on.
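Something like this per-store charset table could express that (a sketch only; the store names and helper are invented):

<verbatim>
use strict;
use warnings;
use Encode ();

my %store_charset = (
    rcs   => 'iso-8859-1',   # legacy filesystem store
    mongo => 'utf-8',        # MongoDB driver already returns characters
);

sub from_store {
    my ($store, $data) = @_;
    # A utf8 store like Mongo hands back utf8-flagged character
    # strings already; only decode when we actually received octets.
    return $data if utf8::is_utf8($data);
    return Encode::decode($store_charset{$store} || 'utf-8', $data);
}
</verbatim>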

This way nothing breaks at the Topic level (so old 8859-1 topics remain untouched). Unfortunately, it will break many things internally (every regex will need revisiting, especially ones like /\177/), and it will probably break many plugins, and so on. But doing it sooner is better than later. If nothing else: it is easier to optimize debugged code than to debug (badly) optimized code. ($0.02)

Jozef, your English is just fine; we just invent stuff to go in the broken bits

Yes, using perl unicode (which is implemented using utf8) internally is the master plan. Unfortunately there are a number of problems, such as badly behaved plugins that assume 1 byte=1 char, that make this harder than you might think. We have been waiting for enough people to get together with the common goal of making this work before embarking on major changes; it's too much for one person to tackle.
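For example, here is the kind of breakage the 1 byte = 1 char assumption causes (a standalone illustration, not code from any particular plugin):

<verbatim>
use strict;
use warnings;
use Encode ();

my $octets = "\xC5\xBD";                        # "Z with caron" as utf-8 octets
my $chars  = Encode::decode('utf-8', $octets);  # one character

printf "bytes: %d, chars: %d\n",
    length($octets), length($chars);            # bytes: 2, chars: 1

# Any plugin doing substr()/index() arithmetic on the assumption
# that those two lengths are equal will corrupt multi-byte text.
</verbatim>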

Hi Jozef, I've also found that running the latest CGI.pm (the version 3.43 shipped with Ubuntu 10.04 LTS, and I think Debian Lenny, is broken; see Support.Faq25) and FCGI.pm (if you're using ModFastCGIEngineContrib) really helps a lot.

Sorry, but this problem is only going to be solved when we move to unicode in the core, and the constraint on that is needing a common Perl version that has working unicode. With RedHat still on 5.10, that's a way off. Deferred until 2.0.0.