Originally posted on August 20, 2012.

You have almost certainly seen text on a computer that looks something like this:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.

The problem is that sometimes you have to deal with text that comes out of other people’s code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with its own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.

Here’s the type of Unicode mistake we’re fixing.

1. Some text, somewhere, was encoded into bytes using UTF-8 (which is quickly becoming the standard encoding for text on the Internet).

2. The software that received this text wasn’t expecting UTF-8. It instead decoded the bytes in an encoding with only 256 characters. The simplest of these encodings is the one called “ISO-8859-1”, or “Latin-1” among friends, which maps the 256 possible bytes to the first 256 Unicode characters. This encoding can arise naturally from software that doesn’t even consider that different encodings exist.

3. The result is that every non-ASCII character turns into two or three garbage characters.
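A quick illustrative snippet (not from the original post) reproduces the garble. It decodes with Windows-1252, a close cousin of Latin-1, because Python’s strict Latin-1 codec turns some of the bytes into invisible control characters rather than the printable junk shown above:

```python
# Illustrative only: re-create the garbled Erdős quote from the top of the post.
text = "If numbers aren’t beautiful, I don’t know what is. –Paul Erdős"

# Encode correctly as UTF-8, then decode wrongly as Windows-1252.
garbled = text.encode("utf-8").decode("windows-1252")
print(garbled)
# If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
```

Each non-ASCII character became two or three characters, exactly as described above.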

The three most commonly-confused codecs are UTF-8, Latin-1, and Windows-1252. There are lots of other codecs in use in the world, but they are so obviously different from these three that everyone can tell when they’ve gone wrong. We’ll focus on fixing cases where text was encoded as one of these three codecs and decoded as another.

A first attempt

When you look at the kind of junk that’s produced by this process, the character sequences seem so ugly and meaningless that you could just replace anything that looks like it should have been UTF-8. Just find those sequences, replace them unconditionally with what they would be in UTF-8, and you’re done. In fact, that’s what my first version did. Skipping a bunch of edge cases and error handling, it looked something like this:
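The original code block hasn’t survived in this copy of the post. A hypothetical reconstruction, written here in Python 3 form, that reproduces the behavior in the examples below (the function name naive_unicode_fixer comes from those examples; the regex and the Windows-1252 byte mapping are my guesses):

```python
import re

def naive_unicode_fixer(text):
    """Hypothetical reconstruction of the naive first attempt."""
    def fix_run(match):
        # Map each character in a run of non-ASCII characters back to
        # the Windows-1252 byte it probably came from; if those bytes
        # form valid UTF-8, replace the run unconditionally.
        run = match.group(0)
        try:
            return run.encode("windows-1252").decode("utf-8")
        except UnicodeError:
            return run

    return re.sub(r"[^\x00-\x7f]+", fix_run, text)
```

Each run of non-ASCII characters is replaced whenever its bytes happen to be valid UTF-8, with no check that the replacement makes more sense than the original.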

This does a perfectly fine job at decoding UTF-8 that was read as Latin-1 with hardly any false positives. But a lot of erroneous text out there in the wild wasn’t decoded as Latin-1. It was instead decoded in a slightly different codec, Windows-1252, the default in widely-used software such as Microsoft Office.

Windows-1252 is totally non-standard, but you can see why people want it: it fills an otherwise useless area of Latin-1 with lots of word-processing-friendly characters, such as curly quotes, bullets, the Euro symbol, the trademark symbol, and the Czech letter š. When these characters show up where you didn’t expect them, they’re called “gremlins”.

But when the text we encounter might legitimately contain these characters, the problem isn’t so simple anymore. I started finding things that people might actually say that included these characters and were also valid in UTF-8. Maybe these are improbable edge cases, but I don’t want to write a Unicode fixer that actually introduces errors.

>>> print naive_unicode_fixer(u'“I\'m not such a fan of Charlotte Brontë…”')
“I'm not such a fan of Charlotte Bront녔
>>> print naive_unicode_fixer(u'AHÅ™, the new sofa from IKEA®')
AHř, the new sofa from IKEA®

An intelligent Unicode fixer

Because encoded text can actually be ambiguous, we have to figure out whether the text is better when we fix it or when we leave it alone. The venerable Mark Pilgrim has a key insight when discussing his chardet module:

The reason the word “Bront녔” is so clearly wrong is that the first five characters are Roman letters, while the last one is Hangul, and most words in most languages don’t mix two different scripts like that.

This is where Python’s standard library starts to shine. The unicodedata module can tell us lots of things we want to know about any given character:
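As a sketch of what that looks like (nothing beyond the standard library is assumed, and char_info is a made-up helper name):

```python
import unicodedata

def char_info(text):
    # The first word of a character's Unicode name identifies its
    # script: "LATIN", "HANGUL", "CYRILLIC", and so on.
    return [(ch, unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]

for ch, category, name in char_info("Bront녔"):
    print(ch, category, name)
```

The first five characters all report LATIN names with letter categories, while the last comes back as HANGUL SYLLABLE NYEOSS: exactly the script mix that real words rarely have.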

That leads us to a complete Unicode fixer that applies these rules. It does an excellent job at fixing files full of garble line by line, such as the University of Leeds Internet Spanish frequency list, which picked up that “mÃ¡s” is a really common word in Spanish text because there is so much incorrect Unicode on the Web.

The final code appears below, as well as in this recipe and in our open source (MIT license) natural language wrangling package, metanl.
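The final listing isn’t reproduced in this copy of the post. As a condensed sketch of the overall approach, not the actual metanl code: try to undo a Windows-1252 or Latin-1 mis-decoding, and keep the result only if it scores as less garbled than the input (the names badness and fix_bad_unicode and the exact scoring are my simplifications):

```python
import unicodedata

def badness(text):
    # Heuristic score: symbols and control characters that rarely occur
    # in real words, plus adjacent letters from two different scripts.
    score = 0
    prev_script = None
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("C") or category in ("So", "Sk", "Sc"):
            score += 1
        if category.startswith("L"):
            script = unicodedata.name(ch, "?").split()[0]
            if prev_script not in (None, script):
                score += 2  # e.g. Latin followed directly by Hangul
            prev_script = script
        else:
            prev_script = None
    return score

def fix_bad_unicode(text):
    # Try each suspect codec; accept a fix only if it looks better.
    for codec in ("windows-1252", "latin-1"):
        try:
            fixed = text.encode(codec).decode("utf-8")
        except UnicodeError:
            continue
        if badness(fixed) < badness(text):
            return fixed
    return text
```

On the ambiguous examples above, the round-trip either fails outright or doesn’t improve the score, so the text is left alone rather than having errors introduced.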