This method should only be used for importing legacy data from systems
or files where the encoding is not known. This method will always
succeed and will normally guess the correct encoding, but it is only
a guess and will be incorrect some of the time. Also note that
data may be lost: if we cannot determine the correct encoding,
we fall back to ISO-8859-1 and replace unrecognized characters with
U+FFFD characters (the Unicode replacement character).

NB: We currently only cope with the major Western character
sets; the algorithm needs to change to cope with Asian languages.
One way that apparently works is to convert the string into all possible
encodings, one at a time, and if successful score them based on the
number of meaningful characters (using the unicodedata module to
let us know what are control characters, letters, printable characters
etc.).
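
The scoring idea above could be sketched roughly as follows. This is not
the algorithm guess() uses; it is an illustration written for Python 3
byte strings. The names score_text and guess_by_score, the candidate list,
and the choice to score by the fraction (rather than the raw count) of
meaningful characters are all assumptions made for this example.

```python
import unicodedata

# Hypothetical candidate list, chosen only for illustration.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "big5", "iso-8859-1")


def score_text(text):
    """Return the fraction of characters that look meaningful.

    Letters (L*), numbers (N*), punctuation (P*) and plain spaces (Zs)
    count as meaningful; control characters and symbols do not.
    """
    if not text:
        return 0.0
    meaningful = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in "LNP" or category == "Zs":
            meaningful += 1
    return meaningful / len(text)


def guess_by_score(data, encodings=CANDIDATE_ENCODINGS):
    """Try each encoding in turn and keep the best-scoring decode.

    Returns an (encoding, text) pair, or None if nothing decoded.
    Ties go to the encoding listed first.
    """
    best = None
    best_score = -1.0
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not valid in this encoding; try the next one.
        score = score_text(text)
        if score > best_score:
            best, best_score = (encoding, text), score
    return best
```

A fraction is used here because decoding multi-byte data with a
single-byte codec yields more characters, which would inflate a raw count.
Many byte strings are valid in several encodings at once, so ties and
near-ties are common and the order of the candidate list still matters.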

ASCII is easy

>>> guess('hello')
u'hello'

Passing a Unicode string raises an exception to annoy lazy programmers.
It should also catch bugs, because if you have valid Unicode you
shouldn't be going anywhere near this method.

However, UTF-16 strings without a BOM will be interpreted as ISO-8859-1.
I doubt this is a problem, as we are unlikely to see this except with
Asian languages, and in those cases the encodings we don't support
at the moment, such as ISO-2022-JP, Big5 and Shift-JIS, will be a bigger
problem.

>>> guess(u'hello'.encode('UTF-16be'))
u'\x00h\x00e\x00l\x00l\x00o'
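
For contrast, a BOM check of the kind the paragraph above implies might
look like the sketch below. It is written for Python 3 byte strings, and
the helper name decode_if_bom and its table of BOMs are hypothetical; how
guess() itself detects a BOM is not shown in this text.

```python
import codecs

# UTF-32 BOMs are left out to keep the sketch short. The UTF-32-LE BOM
# begins with the UTF-16-LE BOM, so a fuller version would have to
# check for it first.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_if_bom(data):
    """Decode data using its BOM, or return None if it has no known BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not appear in the decoded text.
            return data[len(bom):].decode(encoding)
    return None
```

Without a BOM this helper returns None, which is the point at which a
guesser would have to fall back to something like ISO-8859-1.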

def escape_nonascii_uniquely(bogus_string):

Replace non-ASCII characters with a hex representation.

This is mainly for preventing emails with invalid characters from causing
oopses. The non-ASCII characters could have been removed or just converted
to "?", but this provides some insight into what the bogus data was, and
it prevents the message-ids of two unrelated emails from matching because
all the non-ASCII characters have been replaced with the same ASCII
character.
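
A minimal sketch of this kind of escaping, written for Python 3 byte
strings, is shown here. The name escape_nonascii_bytes and the exact
output format (a backslash, "x", and two lowercase hex digits) are
assumptions for illustration and may differ from what this function
produces.

```python
import re

# Any single byte outside the 7-bit ASCII range.
_NON_ASCII = re.compile(rb"[\x80-\xff]")


def escape_nonascii_bytes(data):
    """Replace each non-ASCII byte with a literal backslash-x hex escape."""
    def quote(match):
        # match.group(0) is a one-byte string; ord() gives its value.
        return b"\\x%02x" % ord(match.group(0))
    # A function is used as the replacement so that the backslash in the
    # result is not itself interpreted as a regex escape.
    return _NON_ASCII.sub(quote, data)
```

Because each distinct byte maps to a distinct escape, two inputs that
differ only in their non-ASCII bytes stay different after escaping.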

Unfortunately, all the strings below are actually part of this
function's docstring, so Python processes the backslashes once before
doctest sees them, and then processes them again when doctest runs the
test. This makes it confusing, since four backslashes will get
converted into a single backslash character.