I had an interesting question from a friend: how to convert the
following UTF8 text to iso-8859-15.

<br />â€œ curly quotes â€ aeiouÌˆ line â€¨ separator

This gives me a good opportunity to present the unicode tools I often use: recode, uconv and python.

MySQL issues

The text above was obtained, already garbled, by doing a dump from a MySQL database.
MySQL character encoding is a common problem, since the default seems to be latin1,
and even if you set "utf8", that doesn't cover all characters. "utf8mb4"
is in fact required for that. For a description of some of the problems with trying to
fix MySQL encoding issues in place, see this OpenStack discussion.
Presented below are external tools to untangle this mess, which can get quite tricky.
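To see why "utf8" is not enough: MySQL's "utf8" stores at most 3 bytes per character, while code points outside the Basic Multilingual Plane encode to 4 bytes in UTF8. A quick check in python (the emoji is just an illustration):

```python
# MySQL's "utf8" stores at most 3 bytes per character; code points
# above U+FFFF encode to 4 UTF-8 bytes and so need "utf8mb4".
bmp_char = '\u201d'        # right curly quote, fits in 3 bytes
emoji = '\U0001F600'       # grinning face, needs 4 bytes
print(len(bmp_char.encode('utf-8')))  # 3
print(len(emoji.encode('utf-8')))     # 4
```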

Ungarbling

With a little experimentation one can see that the above is validly encoded UTF8,
but the encoder assumed it was encoding windows cp1252 data, while the data was
already in UTF8! One can see this using the venerable
recode utility to reverse this process:

Notice that there are missing bytes in the table above,
i.e. there are certain bytes that are not valid cp1252 characters,
namely 81 8d 90 9d 9e.
So the original conversion is not a fully reversible operation,
i.e. if any of the original UTF8 text has those byte values then there will
be problems converting back (note iso-8859-15 for example defines chars for all bytes,
and so would not have had this issue).

Specifically consider the right curly quote (”) in the original
UTF8 file. This has the byte sequence: e2 80 9d
containing one of the invalid cp1252 code points.
Whatever did the conversion converted these 3 bytes to
c3 a2 e2 82 ac c2 9d.
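That garbling step can be reproduced in python to check those bytes (a small illustration; the latin-1 style fallback for the five unmapped bytes is an assumption about what the original converter did):

```python
# Mis-decode the UTF-8 bytes of the right curly quote as cp1252
# (falling back to latin-1 for the bytes cp1252 leaves undefined),
# then re-encode the result as UTF-8.
original = '\u201d'.encode('utf-8')      # e2 80 9d

chars = []
for b in original:
    try:
        chars.append(bytes([b]).decode('cp1252'))
    except UnicodeDecodeError:
        chars.append(chr(b))             # latin-1 style fallback
garbled = ''.join(chars).encode('utf-8')
print(garbled.hex(' '))                  # c3 a2 e2 82 ac c2 9d
```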

So rather than just ignoring the invalid characters, what
we can do is convert to cp1252 but fall back to iso-8859-15
conversion, which will essentially just remove the c2 byte
as required. I don't know of existing tools that allow
you to do that, but a quick python proggy fits the bill.
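A minimal sketch of such a script (the function name is mine; it encodes each code point back to cp1252, falling back to iso-8859-15 for the five code points cp1252 cannot represent):

```python
def ungarble(text):
    """Undo double-encoded UTF8: encode each code point back to
    cp1252, falling back to iso-8859-15 for the five code points
    (U+0081 U+008D U+0090 U+009D U+009E) that cp1252 lacks."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode('cp1252')
        except UnicodeEncodeError:
            out += ch.encode('iso-8859-15')
    return bytes(out)

# The garbled right curly quote from above: â € U+009D
print(ungarble('â€\u009d'))                   # b'\xe2\x80\x9d'
print(ungarble('â€\u009d').decode('utf-8'))   # ”
```

Reading stdin and writing the result with sys.stdout.buffer.write() turns this into a filter usable in a pipeline.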

Note python has traditionally had very good support for unicode
processing (the full unicode character database is built in, exposed
through the unicodedata module), and I often use it for
unicode lookup and non-standard conversion tasks like this. Here is an example to
lookup its embedded unicode database:
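For instance (a small illustration using the standard unicodedata module):

```python
import unicodedata

# Look up the name of the right curly quote, and of the combining
# diacritic that appears later in this article.
ch = '\u201d'
print('U+%04X' % ord(ch), unicodedata.name(ch))
# U+201D RIGHT DOUBLE QUOTATION MARK
print(unicodedata.name('\u0308'))
# COMBINING DIAERESIS
```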

Normalization

After running the data through our ungarbling script above, we get this valid UTF8:

<br />“ curly quotes ” aeioü line \xe2\x80\xa8 separator

Note the highlighted portion above. I replaced the U+2028 line separator character in the original
with its hex representation because it causes firefox 2.0.0.10 on linux to hang immediately, and firefox 2.0.0.13 to truncate the line.
Also recode, which we'll use for conversion later on, doesn't know how to handle it,
so we'll convert it to HTML with sed, like sed 's/\xe2\x80\xa8/<br\/>/g'.

Note also the ü character above. This umlaut is actually represented there as 2 unicode characters: u plus the combining diaeresis (U+0308).
This is a common issue in unicode processing, where there are multiple ways to represent a particular character.
We can use the uconv utility which comes from the ICU project
to normalise representations to the combined form for example.
This is needed in our particular case as recode doesn't handle combining characters at present. So to convert to single characters
do uconv -f utf8 -t utf8 -x nfc. Note uconv is available in the libicu-dev package on debian/ubuntu,
and in the icu package on fedora/redhat.
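The same NFC normalisation can be checked from python's unicodedata module (an illustration alongside uconv):

```python
import unicodedata

decomposed = 'u\u0308'   # u + combining diaeresis: two code points
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 2 1
print(composed == '\u00fc')            # True: the single ü code point
```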

Conversion

So after the above normalization we have this UTF8 string:

<br />“ curly quotes ” aeioü line <br/> separator

The last step to transliterate the UTF8 characters to the closest iso-8859-15 equivalent is
achieved quite easily with the recode utf8..iso-8859-15 command.
recode really is a nifty tool and I've previously documented other recode examples.
So to recap on the whole conversion command line:
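Putting the pieces together, the whole pipeline might look like this (a sketch: ungarble.py stands for the python fallback script described above, filenames are examples, and uconv and recode must be installed):

```shell
# garbled.txt: the doubly encoded dump; fixed.txt: iso-8859-15 output
python ungarble.py < garbled.txt |    # undo the double encoding
  uconv -f utf8 -t utf8 -x nfc |      # normalise combining chars (NFC)
  sed 's/\xe2\x80\xa8/<br\/>/g' |     # U+2028 line separator -> <br/>
  recode utf8..iso-8859-15 > fixed.txt
```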