The goal of ftfy is to take in bad Unicode and output good Unicode, for use
in your Unicode-aware code. This is different from taking in non-Unicode and
outputting Unicode, which is not a goal of ftfy. It also isn’t designed to
protect you from having to write Unicode-aware code. ftfy helps those who help
themselves.

Of course you’re better off if your input is decoded properly and has no
glitches. But you often don’t have any control over your input; it’s someone
else’s mistake, but it’s your problem now.

ftfy will do everything it can to fix the problem.

Note

This documentation is for ftfy 5, which runs on Python 3 only, following
the plan to drop Python 2 support that was announced in ftfy 3.3.

If you’re running on Python 2, ftfy 4.x will keep working for you. In that
case, you should add ftfy<5 to your requirements.
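For example, the corresponding line in a requirements.txt file would be:

ftfy<5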

The most interesting kind of brokenness that ftfy will fix is when someone has
encoded Unicode with one standard and decoded it with a different one. This
often shows up as characters that turn into nonsense sequences (called
“mojibake”):

The word schön might appear as schÃ¶n.

An em dash (—) might appear as â€”.

Text that was meant to be enclosed in quotation marks might end up
instead enclosed in â€œ and â€<9d>, where <9d> represents an
unprintable character.

This causes your Unicode-aware code to end up with garbage text because someone
else (or maybe “someone else”) made a mistake.

This happens very often to real text. It’s often the fault of software that
makes it difficult to use UTF-8 correctly, such as Microsoft Office and some
programming languages. The ftfy.fix_encoding() function will look for
evidence of mojibake and, when possible, it will undo the process that produced
it to get back the text that was supposed to be there.

Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding
that makes it obvious when it’s being misused, and a string of mojibake usually
contains all the information we need to recover the original string.
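For instance, using the mojibake example above (a minimal sketch):

>>> from ftfy import fix_encoding
>>> fix_encoding('schÃ¶n')
'schön'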

When ftfy is tested on multilingual data from Twitter, it has a false positive
rate of less than 1 per million tweets.

Any given text string might have other irritating properties, possibly even
interacting with the erroneous decoding. The main function of ftfy,
ftfy.fix_text(), will fix other problems along the way, such as:

The text could contain HTML entities such as &amp; in place of certain
characters, when you would rather see what the characters actually are.

For that matter, it could contain instructions for a text terminal to
do something like change colors, but you are not sending the text to a
terminal, so those instructions are just going to look like ^[[30m;
or something in the middle of the text.

The text could write words in non-standard ways for display purposes,
such as using the three characters ﬂop for the word “flop”.
This can happen when you copy text out of a PDF, for example.

It might not be in NFC normalized form. You generally want your text to be
NFC-normalized, to avoid situations where unequal sequences of codepoints
can represent exactly the same text. You can also opt for ftfy to use the
more aggressive NFKC normalization.

Note

Before version 4.0, ftfy used NFKC normalization by default. This covered a
lot of helpful fixes at once, such as expanding ligatures and replacing
“fullwidth” characters with their standard form. However, it also performed
transformations that lose information, such as converting ™ to TM and
H₂O to H2O.

The default, starting in ftfy 4.0, is to use NFC normalization unless told
to use NFKC normalization (or no normalization at all). The more helpful
parts of NFKC are implemented as separate, limited fixes.
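To see the difference, here is a sketch using only the standard library’s
unicodedata module:

>>> import unicodedata
>>> unicodedata.normalize('NFC', '™')   # NFC leaves the trademark sign alone
'™'
>>> unicodedata.normalize('NFKC', '™')  # NFKC flattens it, losing information
'TM'
>>> unicodedata.normalize('NFKC', 'H₂O')
'H2O'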

There are other interesting things that ftfy can do that aren’t part of
the ftfy.fix_text() pipeline, such as decoding backslashed escape sequences
with ftfy.fixes.decode_escapes(), guessing the encoding of a byte string with
ftfy.guess_bytes(), and explaining what’s in a string with
ftfy.explain_unicode(). These are described later in this documentation.

The main function, ftfy.fix_text(), will run text through a sequence of
fixes. If the text changed, it will run the text through the fixes again, so
that you can be sure the output ends up in a standard form that will be
unchanged by ftfy.fix_text().

All the fixes are on by default, but you can pass options to turn them off.
Check that the default fixes are appropriate for your use case. For example:

You should set fix_entities to False if the output is meant to be
interpreted as HTML.

You should set fix_character_width to False if you want to preserve the
spacing of CJK text.

You should set uncurl_quotes to False if you want to preserve quotation
marks with nice typography. You could even consider doing quite the opposite
of uncurl_quotes, running smartypants on the result to make all the
punctuation nice.
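For example, with and without uncurl_quotes (a minimal sketch):

>>> from ftfy import fix_text
>>> fix_text('John said, “hello”')
'John said, "hello"'
>>> fix_text('John said, “hello”', uncurl_quotes=False)
'John said, “hello”'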

If the only fix you need is to detect and repair decoding errors (mojibake), then
you should use ftfy.fix_encoding() directly.

Changed in version 4.0: The default normalization was changed from 'NFKC' to 'NFC'. The options
fix_latin_ligatures and fix_character_width were added to implement some
of the less lossy parts of NFKC normalization on top of NFC.

>>> # This example string starts with a byte-order mark, even if
>>> # you can't see it on the Web.
>>> print(fix_text('\ufeffParty like\nit&rsquo;s 1999!'))
Party like
it's 1999!

>>> print(fix_text('ＬＯＵＤ ＮＯＩＳＥＳ'))
LOUD NOISES

>>> len(fix_text('ﬁ'*100000))
200000

>>> len(fix_text(''))
0

Based on the options you provide, ftfy applies these steps in order:

If remove_terminal_escapes is True, remove sequences of bytes that are
instructions for Unix terminals, such as the codes that make text appear
in different colors.

If fix_encoding is True, look for common mistakes that come from
encoding or decoding Unicode text incorrectly, and fix them if they are
reasonably fixable. See fixes.fix_encoding for details.

If fix_entities is True, replace HTML entities with their equivalent
characters. If it’s “auto” (the default), then consider replacing HTML
entities, but don’t do so in text where you have seen a pair of actual
angle brackets (that’s probably actually HTML and you shouldn’t mess
with the entities).

If uncurl_quotes is True, replace various curly quotation marks with
plain-ASCII straight quotes.

If fix_latin_ligatures is True, then ligatures made of Latin letters,
such as ﬁ, will be separated into individual letters. These ligatures
are usually not meaningful outside of font rendering, and often represent
copy-and-paste errors.

If fix_character_width is True, half-width and full-width characters
will be replaced by their standard-width form.

If fix_line_breaks is True, convert all line breaks to Unix style
(CRLF and CR line breaks become LF line breaks).

If fix_surrogates is True, ensure that there are no UTF-16 surrogates
in the resulting string, by converting them to the correct characters
when they’re appropriately paired, or replacing them with U+FFFD
otherwise.

If remove_control_chars is True, remove control characters that
are not suitable for use in text. This includes most of the ASCII control
characters, plus some Unicode controls such as the byte order mark
(U+FEFF). Useful control characters, such as Tab, Line Feed, and
bidirectional marks, are left as they are.

If remove_bom is True, remove the Byte-Order Mark at the start of the
string if it exists. (This is largely redundant, because it’s a special
case of remove_control_chars. This option will become deprecated
in a later version.)

If normalization is not None, apply the specified form of Unicode
normalization, which can be one of ‘NFC’, ‘NFKC’, ‘NFD’, or ‘NFKD’.

The default normalization, NFC, combines characters and diacritics that
are written using separate code points, such as converting “e” plus an
acute accent modifier into “é”, or converting “ka” (か) plus a dakuten
into the single character “ga” (が). Unicode can be converted to NFC
form without any change in its meaning.

If you ask for NFKC normalization, it will apply additional
normalizations that can change the meanings of characters. For example,
ellipsis characters will be replaced with three periods, all ligatures
will be replaced with the individual characters that make them up,
and characters that differ in font style will be converted to the same
character.

If anything was changed, repeat all the steps, so that the function is
idempotent. “&amp;amp;” will become “&”, for example, not “&amp;”.
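For example:

>>> from ftfy import fix_text
>>> fix_text('&amp;amp;')
'&'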

fix_text will work one line at a time, with the possibility that some
lines are in different encodings, allowing it to fix text that has been
concatenated together from different sources.

When it encounters lines longer than max_decode_length (1 million
codepoints by default), it will not run the fix_encoding step, to avoid
unbounded slowdowns.

If you’re certain that any decoding errors in the text would have affected
the entire text in the same way, and you don’t mind operations that scale
with the length of the text, you can use fix_text_segment directly to
fix the whole string in one batch.
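A sketch of that usage (fix_text_segment accepts the same options as
fix_text):

>>> from ftfy import fix_text_segment
>>> fix_text_segment('schÃ¶n and more schÃ¶n')
'schön and more schön'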

This function looks for evidence of mojibake, formulates a plan to fix
it, and applies the plan. It determines whether it should replace nonsense
sequences of single-byte characters that were really meant to be UTF-8
characters, and if so, turns them into the correctly-encoded Unicode
character that they were meant to represent.

The input to the function must be Unicode. If you don’t have Unicode text,
you’re not using the right tool to solve your problem.

fix_encoding decodes text that looks like it was decoded incorrectly. It
leaves alone text that doesn’t.

Because these characters often come from Microsoft products, we allow
for the possibility that we get not just Unicode characters 128-255, but
also Windows’s conflicting idea of what characters 128-160 are.

>>> print(fix_encoding('This â€” should be an em dash'))
This — should be an em dash

We might have to deal with both Windows characters and raw control
characters at the same time, especially when dealing with characters like
0x81 that have no mapping in Windows. This is a string that Python’s
standard .encode and .decode methods cannot correct.
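To see the limitation, note that Python’s strict codec simply refuses to
handle 0x81 (a sketch):

>>> '\x81'.encode('windows-1252')
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 0: character maps to <undefined>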

However, it has safeguards against fixing sequences of letters and
punctuation that can occur in valid text. In the following example,
the last three characters are not replaced with a Korean character,
even though they could be.

>>> print(fix_encoding('not such a fan of Charlotte Brontë…”'))
not such a fan of Charlotte Brontë…”

This function can now recover some complex manglings of text, such as when
UTF-8 mojibake has been normalized in a way that replaces U+A0 with a
space:

>>> print(fix_encoding('The more you know ðŸŒ '))
The more you know 🌠

Cases of genuine ambiguity can sometimes be addressed by finding other
characters that are not double-encoded, and expecting the encoding to
be consistent:

>>> print(fix_encoding('AHÅ™, the new sofa from IKEA®'))
AHÅ™, the new sofa from IKEA®

Finally, we handle the case where the text is in a single-byte encoding
that was intended as Windows-1252 all along but read as Latin-1:

>>> print(fix_encoding('This text was never UTF-8 at all\x85'))
This text was never UTF-8 at all…

If the file is being read as Unicode text, use that. If it’s being read as
bytes, then we hope an encoding was supplied. If not, unfortunately, we
have to guess what encoding it is. We’ll try a few common encodings, but we
make no promises. See the guess_bytes function for how this is done.
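A sketch of the interface: guess_bytes returns the decoded text together
with the name of the encoding it guessed:

>>> from ftfy import guess_bytes
>>> guess_bytes('schön'.encode('utf-8'))
('schön', 'utf-8')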

Now, you may know that your input is a mess of bytes in an unknown encoding,
and you might want a tool that can just statistically analyze those bytes and
predict what encoding they’re in.

ftfy is not that tool. The ftfy.guess_bytes() function it contains will
do this in very limited cases, but to support more encodings from around the
world, something more is needed.

You may have heard of chardet. Chardet is admirable, but it doesn’t
completely do the job either. Its heuristics are designed for multi-byte
encodings, such as UTF-8 and the language-specific encodings used in East Asian
languages. It works badly on single-byte encodings, to the point where it will
output wrong answers with high confidence.

ftfy.guess_bytes() doesn’t even try the East Asian encodings, so the
ideal thing would combine the simple heuristic of ftfy.guess_bytes() with
the multibyte character set detection of chardet. This ideal thing doesn’t
exist yet.

ftfy uses Twitter’s streaming API as an endless source of realistic sample
data. Twitter is massively multilingual, and although it’s supposed to be
uniformly UTF-8, in practice any encoding mistake that someone can make
will be made by someone’s Twitter client.

We check what ftfy’s fix_encoding() heuristic does to this data, and we
aim to have the rate of false positives be indistinguishable from zero.

A pre-release version of ftfy was evaluated on 30,880,000 tweets received from
Twitter’s streaming API in April 2015. There was 1 false positive, and it was
due to a bug that has now been fixed.

When looking at the changes ftfy makes, we found:

ftfy.fix_text(), with all default options, will change about 1 in 18 tweets.

With stylistic changes (fix_character_width and uncurl_quotes) turned off,
ftfy.fix_text() will change about 1 in every 300 tweets.

Replace single-character ligatures of Latin letters, such as ‘ﬁ’, with the
characters that they contain, as in ‘fi’. Latin ligatures are usually not
intended in text strings (though they’re lovely in rendered text). If
you have such a ligature in your string, it is probably a result of a
copy-and-paste glitch.

We leave ligatures in other scripts alone to be safe. They may be intended,
and removing them may lose information. If you want to take apart nearly
all ligatures, use NFKC normalization.
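For example:

>>> from ftfy.fixes import fix_latin_ligatures
>>> fix_latin_ligatures('ﬂuﬃest')
'fluffiest'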

The doctest for the surrogate fixer had to be written very carefully,
because even putting the Unicode escapes of the surrogates in the docstring
caused various tools to fail, which just goes to show why this fixer is
necessary.

Decode backslashed escape sequences, including \x, \u, and \U character
references, even in the presence of other Unicode.

This is what Python’s “string-escape” and “unicode-escape” codecs were
meant to do, but in contrast, this actually works. It will decode the
string exactly the same way that the Python interpreter decodes its string
literals.

>>> factoid = '\\u20a1 is the currency symbol for the colón.'
>>> print(factoid[1:])
u20a1 is the currency symbol for the colón.
>>> print(decode_escapes(factoid))
₡ is the currency symbol for the colón.

Even though Python itself can read string literals with a combination of
escapes and literal Unicode – you’re looking at one right now – the
“unicode-escape” codec doesn’t work on literal Unicode. (See
http://stackoverflow.com/a/24519338/773754 for more details.)

Instead, this function searches for just the parts of a string that
represent escape sequences, and decodes them, leaving the rest alone. All
valid escape sequences are made of ASCII characters, and this allows
“unicode-escape” to work correctly.

This fix cannot be automatically applied by the ftfy.fix_text function,
because escaped text is not necessarily a mistake, and there is no way
to distinguish text that’s supposed to be escaped from text that isn’t.

Some mojibake has been additionally altered by a process that said “hmm,
byte A0, that’s basically a space!” and replaced it with an ASCII space.
When the A0 is part of a sequence that we intend to decode as UTF-8,
changing byte A0 to 20 would make it fail to decode.

This process finds sequences that would convincingly decode as UTF-8 if
byte 20 were changed to A0, and puts back the A0. For the purpose of
deciding whether this is a good idea, this step gets a cost of twice
the number of bytes that are changed.

This function identifies sequences where information has been lost in
a “sloppy” codec, indicated by byte 1A, and if they would otherwise look
like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD.

A further explanation:

ftfy can now fix text in a few cases that it would previously fix
incompletely, because of the fact that it can’t successfully apply the fix
to the entire string. A very common case of this is when characters have
been erroneously decoded as windows-1252, but instead of the “sloppy”
windows-1252 that passes through unassigned bytes, the unassigned bytes get
turned into U+FFFD (�), so we can’t tell what they were.

This most commonly happens with curly quotation marks that appear
â€œlikethisâ€�.

We can do better by building on ftfy’s “sloppy codecs” to let them handle
less-sloppy but more-lossy text. When they encounter the character �,
instead of refusing to encode it, they encode it as byte 1A – an
ASCII control code called SUBSTITUTE that once was meant for about the same
purpose. We can then apply a fixer that looks for UTF-8 sequences where
some continuation bytes have been replaced by byte 1A, and decode the whole
sequence as �; if that doesn’t work, it’ll just turn the byte back into �
itself.

As a result, the above text â€œlikethisâ€� will decode as
“likethis�.

If U+1A was actually in the original string, then the sloppy codecs will
not be used, and this function will not be run, so your weird control
character will be left alone but wacky fixes like this won’t be possible.

Python does not want you to be sloppy with your text. Its encoders and decoders
(“codecs”) follow the relevant standards whenever possible, which means that
when you get text that doesn’t follow those standards, you’ll probably fail
to decode it. Or you might succeed at decoding it for implementation-specific
reasons, which is perhaps worse.

There are some encodings out there that Python wishes didn’t exist, which are
widely used outside of Python:

“utf-8-variants”, a family of not-quite-UTF-8 encodings, including the
ever-popular CESU-8 and “Java modified UTF-8”.

“Sloppy” versions of character map encodings, where bytes that don’t map to
anything will instead map to the Unicode character with the same number.

Simply importing this module, or in fact any part of the ftfy package, will
make these new “bad codecs” available to Python through the standard Codecs
API. You never have to actually call any functions inside ftfy.bad_codecs.

However, if you want to call something because your code checker insists on it,
you can call ftfy.bad_codecs.ok().
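For example, the import alone is enough to make the codec names usable:

>>> import ftfy.bad_codecs  # registering the codecs is a side effect of the import
>>> ftfy.bad_codecs.ok()
>>> b'\x85'.decode('sloppy-windows-1252')
'…'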

Decodes single-byte encodings, filling their “holes” in the same messy way that
everyone else does.

A single-byte encoding maps each byte to a Unicode character, except that some
bytes are left unmapped. In the commonly-used Windows-1252 encoding, for
example, bytes 0x81 and 0x8D, among others, have no meaning.

Python, wanting to preserve some sense of decorum, will handle these bytes
as errors. But Windows knows that 0x81 and 0x8D are possible bytes and they’re
different from each other. It just hasn’t defined what they are in terms of
Unicode.

Software that has to interoperate with Windows-1252 and Unicode – such as all
the common Web browsers – will pick some Unicode characters for them to map
to, and the characters they pick are the Unicode characters with the same
numbers: U+0081 and U+008D. This is the same as what Latin-1 does, and the
resulting characters tend to fall into a range of Unicode that’s set aside for
obsolete Latin-1 control characters anyway.

These sloppy codecs let Python do the same thing, thus interoperating with
other software that works this way. This module defines a sloppy version of
many single-byte encodings with holes. (There is no need for a sloppy version of
an encoding without holes: for example, there is no such thing as
sloppy-iso-8859-2 or sloppy-macroman.)

The following encodings will become defined:

sloppy-windows-1250 (Central European, sort of based on ISO-8859-2)

sloppy-windows-1251 (Cyrillic)

sloppy-windows-1252 (Western European, based on Latin-1)

sloppy-windows-1253 (Greek, sort of based on ISO-8859-7)

sloppy-windows-1254 (Turkish, based on ISO-8859-9)

sloppy-windows-1255 (Hebrew, based on ISO-8859-8)

sloppy-windows-1256 (Arabic)

sloppy-windows-1257 (Baltic, based on ISO-8859-13)

sloppy-windows-1258 (Vietnamese)

sloppy-cp874 (Thai, based on ISO-8859-11)

sloppy-iso-8859-3 (Maltese and Esperanto, I guess)

sloppy-iso-8859-6 (different Arabic)

sloppy-iso-8859-7 (Greek)

sloppy-iso-8859-8 (Hebrew)

sloppy-iso-8859-11 (Thai)

Aliases such as “sloppy-cp1252” for “sloppy-windows-1252” will also be
defined.

Only sloppy-windows-1251 and sloppy-windows-1252 are used by the rest of ftfy;
the rest are rather uncommon.

Here is an example of how sloppy-windows-1252 merges Windows-1252 with
Latin-1; ftfy.explain_unicode can be used to break the results down
character by character.
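A sketch (strict windows-1252 refuses the unmapped byte 0x81; the sloppy
codec falls back to its Latin-1 value):

>>> import ftfy.bad_codecs
>>> b'\x93\x81\x94'.decode('windows-1252')
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1: character maps to <undefined>
>>> b'\x93\x81\x94'.decode('sloppy-windows-1252')
'“\x81”'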

This file defines a codec called “utf-8-variants” (or “utf-8-var”), which can
decode text that’s been encoded with a popular non-standard version of UTF-8.
This includes CESU-8, the accidental encoding made by layering UTF-8 on top of
UTF-16, as well as Java’s twist on CESU-8 that contains a two-byte encoding for
codepoint 0.

This is particularly relevant in Python 3, which provides no other way of
decoding CESU-8 [1].

The codec does not at all enforce “correct” CESU-8. For example, the Unicode
Consortium’s not-quite-standard describing CESU-8 requires that there is only
one possible encoding of any character, so it does not allow mixing of valid
UTF-8 and CESU-8. This codec does allow that, just like Python 2’s UTF-8
decoder does.

Characters in the Basic Multilingual Plane still have only one encoding. This
codec still enforces the rule, within the BMP, that characters must appear in
their shortest form. There is one exception: the sequence of bytes 0xC0 0x80,
instead of just 0x00, may be used to encode the null character U+0000, like
in Java.
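A hedged sketch of decoding these variants, assuming ftfy.bad_codecs has
been imported to register the codec (the six bytes below are the CESU-8
encoding of U+1F320, followed by Java’s two-byte null):

>>> import ftfy.bad_codecs
>>> b'\xed\xa0\xbc\xed\xbc\xa0'.decode('utf-8-variants')
'🌠'
>>> b'\xc0\x80'.decode('utf-8-variants')
'\x00'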

If you encode with this codec, you get legitimate UTF-8. Decoding with this
codec and then re-encoding is not idempotent, although encoding and then
decoding is. So this module won’t produce CESU-8 for you. Look for that
functionality in the sister module, “Breaks Text For You”, coming approximately
never.

Return text centered in a Unicode string whose display width, in a
monospaced terminal, should be at least width character cells. The rest
of the string will be padded with fillchar, which must be a width-1
character.

Return text left-justified in a Unicode string whose display width,
in a monospaced terminal, should be at least width character cells.
The rest of the string will be padded with fillchar, which must be
a width-1 character.

“Left” here means toward the beginning of the string, which may actually
appear on the right in an RTL context. This is similar to the use of the
word “left” in “left parenthesis”.

The example below should come out justified correctly when viewed in a
monospaced terminal. It will probably not look correct if you’re viewing
this documentation in a Web browser.
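A sketch, assuming these functions live in ftfy.formatting (each katakana
character occupies two cells, so the ten cells of text are padded with six
fill characters to reach a display width of 16):

>>> from ftfy.formatting import display_ljust
>>> display_ljust('コンバンハ', 16, '*')
'コンバンハ******'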

Return text right-justified in a Unicode string whose display width,
in a monospaced terminal, should be at least width character cells.
The rest of the string will be padded with fillchar, which must be
a width-1 character.

“Right” here means toward the end of the string, which may actually be on
the left in an RTL context. This is similar to the use of the word “right”
in “right parenthesis”.

Return the number of character cells that this string is likely to occupy
when displayed in a monospaced, modern, Unicode-aware terminal emulator.
We refer to this as the “display width” of the string.

This can be useful for formatting text that may contain non-spacing
characters, or CJK characters that take up two character cells.

Returns -1 if the string contains a non-printable or control character.
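For example (a sketch; the katakana characters are East Asian Wide, and the
newline is a control character):

>>> from ftfy.formatting import monospaced_width
>>> monospaced_width('abc')
3
>>> monospaced_width('コンバンハ')
10
>>> monospaced_width('two\nlines')
-1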

These files load information about the character properties in Unicode 9.0.
Yes, even if your version of Python doesn’t support Unicode 9.0. This ensures
that ftfy’s behavior is consistent across versions.

This gives other modules access to the gritty details about characters and the
encodings that use them.