Before Foswiki 2.0, Foswiki had to be configured with a {Site}{CharSet},
which set the encoding used for characters in topic and attachment names,
and topic content.

The default encoding used by Foswiki before 2.0 was iso-8859-1, which was a
reasonable choice for many western languages. However, there are many
other languages (for example, Arabic, Chinese, Hebrew, Hindi) that have
characters that do not appear in this character set. Even some basic characters
like the euro symbol are missing from iso-8859-1. For this reason, Foswiki
has now moved to supporting the standard UTF-8 character encoding, which
is designed to support a very wide range of characters.
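
To make this limitation concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not part of Foswiki or of this module). It checks whether a few sample strings can be represented in iso-8859-1 and in UTF-8. The helper name fits_charset is made up for this example.

```python
def fits_charset(text, charset):
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Sample strings: western European, the euro sign, Hebrew, Chinese.
samples = ["café", "€", "שלום", "中文"]

for sample in samples:
    latin1_ok = fits_charset(sample, "iso-8859-1")
    utf8_ok = fits_charset(sample, "utf-8")
    # !a prints an ASCII-safe representation, whatever the terminal encoding.
    print(f"{sample!a}: iso-8859-1={latin1_ok} utf-8={utf8_ok}")
```

Only the first sample fits into iso-8859-1, while all of them can be encoded as UTF-8.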

Unfortunately, once you chose a {Site}{CharSet} and created a bunch
of topics, it became very risky to change it, because the charset is
associated with the entire database, and not with individual topics.
It was even possible to paste content in a different encoding into the
text editor and have it stored in that encoding, resulting in what looked
like garbled topics.
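
The garbling described above can be reproduced in a few lines. The following Python sketch (an illustration only, not code from Foswiki) stores a string as UTF-8 bytes and then reads those bytes back as if they were iso-8859-1, which is roughly what happens when content in one encoding ends up in a store that assumes another. The sample text is an assumption chosen for the demonstration.

```python
original = "Zürich €5"

# The text is stored as UTF-8 bytes...
stored = original.encode("utf-8")

# ...but later interpreted using a different site charset.
misread = stored.decode("iso-8859-1")

# Undoing the wrong decode recovers the text, because iso-8859-1 maps
# every byte to exactly one character and back again.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(ascii(original))
print(ascii(misread))
print(recovered == original)
```

Each non-ASCII character turns into two or three unrelated characters when misread, which is the typical appearance of a garbled topic.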

Ideally all Foswiki sites should use UTF-8, even those still running
older versions of Foswiki, but we have a legacy of existing sites that
don't. So we need some way to convert an RCS-based wiki from any existing
character encoding to UTF-8.

And that's what this module provides. If you have a store that is:

Set up to use some {Site}{CharSet} other than UTF-8

Using a mixture of encodings in content

Using RcsWrap or RcsLite as its {Store}{Implementation}

then this module can convert it to using UTF-8, including all the topic
histories.

Even if you don't have an immediate need for non-western character sets
this is worth doing, as Foswiki 2.0 and later work exclusively with
UTF-8 content.

Note that this module converts all the histories of all your topics,
as well as the latest version of the topic. It also maps all web,
topic and attachment names. It does not, however, touch the content of
attachments.

Installation

This extension is tested with Foswiki 1.1.0 and later. If your Foswiki
installation is older than that, then upgrade your Foswiki first.

Note that the extension is not required and is not recommended on Foswiki
2.0 or later. If your requirement is part of an upgrade to Foswiki 2.0,
then either:

convert the 1.1.x Foswiki to UTF-8 using this extension first, or

use tools/bulk_copy.pl, as recommended in the release notes.

Version 1.2 of this extension has limited support for Foswiki 2.0
systems. It can be used, with caution and a backup, to detect and correct
character set issues on a 2.0 installation.

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".

Usage

The conversion process updates data in-place, and cannot be reversed. Be sure to take a backup before running this tool.
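
One possible way to take such a backup is sketched below in Python. This is only an illustration, under the assumption that the data and pub directories sit side by side under a single installation root; the function name backup_dirs and all paths are hypothetical, and any backup method you already trust is just as good.

```python
import os
import tarfile
import tempfile


def backup_dirs(install_root, archive_path, subdirs=("data", "pub")):
    """Write the given subdirectories of install_root to a .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in subdirs:
            # arcname keeps the paths inside the archive relative.
            tar.add(os.path.join(install_root, name), arcname=name)
    return archive_path


# Demonstration on a throwaway directory tree (not a real installation).
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "foswiki")
    os.makedirs(os.path.join(root, "data", "Main"))
    os.makedirs(os.path.join(root, "pub", "Main"))
    with open(os.path.join(root, "data", "Main", "WebHome.txt"), "w") as fh:
        fh.write("demo topic\n")

    archive = backup_dirs(root, os.path.join(tmp, "backup.tar.gz"))
    with tarfile.open(archive) as tar:
        archived_names = tar.getnames()

print(archived_names)
```

Keep the archive outside the directories being converted, and check that it can be read back before starting the conversion.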

The convertor is used from the command-line on your wiki server (if you do
not have access to the command line then we are sorry, but there is currently
no way for you to use the convertor).

To use the convertor,

first shut down your site. You don't want anyone modifying topics while it is running.

then cd to the tools directory in your installation and run perl convert_charset.pl -i.

If that runs cleanly without reporting any errors, you can:

perl convert_charset.pl

The script will convert the Foswiki RCS database pointed at by
{DataDir} and {PubDir} from the existing character set (as set
by {Site}{CharSet}) to UTF-8.
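
The core idea of such a conversion can be sketched in Python as follows. This is a simplified illustration and not the script's actual implementation: it re-encodes a single plain file, and ignores RCS histories and the renaming of webs, topics and attachments. The function name convert_file_to_utf8 and the file name are assumptions made for the example.

```python
import os
import tempfile


def convert_file_to_utf8(path, source_charset):
    """Re-encode one text file from source_charset to UTF-8, in place."""
    with open(path, "rb") as fh:
        raw = fh.read()
    # Strict decoding raises UnicodeDecodeError if the bytes are not
    # valid in source_charset, so bad input is not silently rewritten.
    text = raw.decode(source_charset)
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-8"))


# Demonstration on a throwaway file written in iso-8859-1.
with tempfile.TemporaryDirectory() as tmp:
    topic = os.path.join(tmp, "WebHome.txt")
    with open(topic, "wb") as fh:
        fh.write("Café".encode("iso-8859-1"))

    convert_file_to_utf8(topic, "iso-8859-1")

    with open(topic, "rb") as fh:
        converted = fh.read()

print(converted)
```

Because the file is overwritten, the operation cannot be undone without a backup, which is also true of the real conversion.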

Options:

-i

info - report what would be done only, do not convert anything

-q

quiet - work silently (unless there's an error)

-a

abort - on error (default is to report and continue)

-r

repair - detect the encoding of each string and repair inconsistencies.

Expert options

-web=webname

Restrict conversion to a single web and its subwebs.

-encoding=charset

Override the source encoding. (Required if running the conversion on Foswiki 2.0.)

Only use -r if your site may contain content which cannot be decoded
using the {Site}{CharSet} (if this is the case, -i will abort with an
error).
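
What "cannot be decoded" means can be illustrated with a strict decode check. The Python sketch below is only a conceptual example; how the script itself detects such content is not described here, and the helper name decodes_cleanly is hypothetical.

```python
def decodes_cleanly(raw, charset):
    """Return True if the bytes are valid in charset under strict decoding."""
    try:
        raw.decode(charset, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'

checks = {
    "latin1 bytes as utf-8": decodes_cleanly(latin1_bytes, "utf-8"),
    "utf-8 bytes as utf-8": decodes_cleanly(utf8_bytes, "utf-8"),
    "utf-8 bytes as iso-8859-1": decodes_cleanly(utf8_bytes, "iso-8859-1"),
}
for label, ok in checks.items():
    print(f"{label}: {ok}")
```

Note the last case: in Python's iso-8859-1 codec every byte value is a valid character, so any byte sequence decodes without error. A clean decode therefore does not prove that the content really was written in that charset.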

If the -r option is given, then any number of additional repair options
can follow. These are of two types:

detected-encoding=actual-encoding

topic-path=actual-encoding

The first allows you to override the encoding of all strings detected as
detected-encoding, while the second allows you to select an individual topic
and override the encoding of the content of just that topic. If you need to
override the encoding of a web or topic name, use :N after the topic-path,
for example Sandbox/NorthKorea:N=EUC-KR

Although this extension is intended for use on Foswiki 1.1, there may be cases
where an individual web requires conversion on a Foswiki 2.0 system, for example
a single web migrated at a later date from an older system. Use extreme caution
when converting individual webs: Foswiki does not support mixed encodings.
The following command reports (because of -i, without converting anything) what
would be done to convert the Oops web from iso-8859-1 on a system already
converted to utf-8:

perl convert_charset.pl -web=Oops -encoding=iso-8859-1 -i

Once you have run the script without -i, all:

web names

topic names

attachment names

topic content

will be converted to UTF-8. The conversion is performed in place on the data
and pub directories.

Note that no conversion is performed on

log files

working/

temporary files

password files

links to attachments that were entity encoded

Once conversion is complete you must change your {Site}{CharSet} to 'utf-8'.