I expected to have either some gibberish or an Error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.

It would be pretty nice if you could turn that edit into an answer instead and accept it.
– mercator, Sep 5 '13 at 13:57

Printing '\xe9' in a terminal configured for UTF-8 will not print é. It'll print a replacement character (usually a question mark) as \xe9 is not a valid UTF-8 sequence (it is missing two bytes that should have followed that leading byte). It will most certainly not be interpreted as Latin-1 instead.
– Martijn Pieters♦, Feb 19 '14 at 19:42

@MartijnPieters I suspect you might have skimmed over the part where I specified that the terminal is set to decode in ISO-8859-1 (latin1) when I output \xe9 to print é.
– mike, Feb 21 '14 at 0:10

Ah yes, I did miss that part; the terminal has a configuration that differs from the shell. Check.
– Martijn Pieters♦, Feb 21 '14 at 0:13

5 Answers

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

By trying to print a unicode string, u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python actually picks up this setting from the environment it was started from. If it can't find a proper encoding from the environment, only then does it revert to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let's for a moment exit the Python shell and set bash's environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python
>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output some unicode character outside of ascii, you should get a nice error message (a UnicodeEncodeError).
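As a rough illustration of that error, here is a minimal sketch. It is written so that it also runs on Python 3, and the explicit `.encode("ascii")` call is only a stand-in for the implicit encoding that print performs when sys.stdout.encoding is ASCII:

```python
# Stand-in for what print does implicitly when sys.stdout.encoding is ASCII:
# try to encode the code point U+00E9 with the ascii codec.
text = u"\xe9"

try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    # ascii only covers code points 0-127, so 233 cannot be encoded.
    error_message = str(exc)

print(error_message)
```

The message should complain that the ordinal is not in range(128), which is the ascii codec's way of saying the code point is out of its reach.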

We'll now observe what happens after Python outputs strings. For this we'll first start a bash shell within a graphical terminal (I use Gnome Terminal) and we'll set the terminal to decode output with ISO-8859-1, aka latin-1 (graphical terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn't change the actual shell environment's encoding; it only changes the way the terminal itself will decode the output it's given, a bit like a web browser does. You can therefore change the terminal's encoding independently from the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):

(1) python outputs the binary string as is; the terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.

(3) python encodes unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out that latin-1's code point range is 0-255 and that it maps to exactly the same characters as Unicode within that range. Therefore, Unicode code points in that range will yield the same value when encoded in latin-1. So u'\xe9' (233) encoded in latin-1 will also yield the binary string '\xe9'. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields "é" and that's what's displayed.
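The latin-1 terminal can be approximated in code. This is only a sketch in Python 3 syntax (the session in this answer is Python 2), and `terminal_display` is a hypothetical helper that models "the terminal decodes whatever bytes it receives"; a real terminal is more involved. The UTF-8 encoded bytes are included for comparison with cases (1) and (3):

```python
def terminal_display(raw_bytes, terminal_encoding):
    """Hypothetical model of a terminal: decode the bytes it receives."""
    return raw_bytes.decode(terminal_encoding, errors="replace")

raw = b"\xe9"                          # byte string sent as is, like case (1)
as_utf8 = u"\xe9".encode("utf-8")      # what a UTF-8 sys.stdout.encoding produces
as_latin1 = u"\xe9".encode("latin-1")  # explicit latin-1 encoding, like case (3)

shown_raw = terminal_display(raw, "latin-1")
shown_utf8 = terminal_display(as_utf8, "latin-1")
shown_latin1 = terminal_display(as_latin1, "latin-1")

# ascii() keeps the output printable whatever the local stdout encoding is.
print(ascii(shown_raw), ascii(shown_utf8), ascii(shown_latin1))
```

In this model the raw byte and the latin-1 encoded byte both come out as the single character U+00E9 ("é"), while the two UTF-8 bytes 0xc3 and 0xa9 are decoded one at a time and come out as two characters, "Ã" and "©".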

Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:

(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn't understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) python attempts to implicitly encode the Unicode string with whatever's in sys.stdout.encoding. Still "UTF-8". The resulting binary string is '\xc3\xa9'. Terminal receives the stream and attempts to decode 0xc3a9 also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol "é". Terminal displays "é".

(6) python encodes unicode string with latin-1, it yields a binary string with the same value '\xe9'. Again, for the terminal this is pretty much the same as case (4).
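The same kind of sketch works for a terminal set to UTF-8. Again this is Python 3 syntax and `terminal_display` is a hypothetical stand-in for the terminal, not something the terminal or Python provides. The replacement behaviour shown is that of Python's decoder with errors="replace"; an actual terminal may show a question mark, a box, or nothing at all:

```python
def terminal_display(raw_bytes, terminal_encoding):
    """Hypothetical model of a terminal: decode the bytes it receives."""
    return raw_bytes.decode(terminal_encoding, errors="replace")

# Cases (4) and (6): the lone byte 0xe9 is not a complete UTF-8 sequence.
lone_byte = terminal_display(b"\xe9", "utf-8")

# Case (5): the UTF-8 encoding of U+00E9 is two bytes and decodes cleanly.
utf8_bytes = u"\xe9".encode("utf-8")
encoded_pair = terminal_display(utf8_bytes, "utf-8")

# A strict decoder refuses the lone byte instead of substituting a character.
try:
    b"\xe9".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(lone_byte), ascii(encoded_pair), strict_failed)
```

Here the lone byte decodes to the replacement character U+FFFD, and the two-byte sequence decodes back to U+00E9.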

Conclusions:
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- the terminal displays output according to its own encoding settings.
- the terminal's encoding is independent from the shell's.

More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode, 0xdc points to 'Ü' in latin-1 and Unicode, 0xe9 points to 'é' in latin-1 and Unicode.
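That overlap can be checked mechanically. A small sketch in Python 3 syntax, using only the built-in codecs; the variable names are mine:

```python
# Every byte value 0-127 decoded as ASCII gives the Unicode code point
# with the same number.
ascii_matches = all(
    bytes([value]).decode("ascii") == chr(value) for value in range(128)
)

# Every byte value 0-255 decoded as latin-1 does the same.
latin1_matches = all(
    bytes([value]).decode("latin-1") == chr(value) for value in range(256)
)

# So encoding U+00E9 in latin-1 gives the single byte 0xe9.
e_acute_latin1 = chr(0xE9).encode("latin-1")

print(ascii_matches, latin1_matches, e_acute_latin1)
```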

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward encoding approach would be to simply use a code point's value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements. The most economical ones don't cover all unicode code points; for example, ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up also being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ascii range ('B', which is 66, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.
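To put numbers on that, here is a short sketch in Python 3 syntax that measures how many bytes a few sample characters take in each scheme. The sample set is my own choice, and the "-le" codec variants are used so that the byte order mark Python otherwise prepends for UTF-16 and UTF-32 doesn't inflate the counts:

```python
samples = {
    "B": u"B",             # U+0042, in the ascii range
    "e-acute": u"\xe9",    # U+00E9, in the latin-1 range
    "euro": u"\u20ac",     # U+20AC, outside latin-1
}
codecs_to_try = ["utf-8", "utf-16-le", "utf-32-le"]

# sizes[name][codec] is the number of bytes needed for that character.
sizes = {
    name: {codec: len(char.encode(codec)) for codec in codecs_to_try}
    for name, char in samples.items()
}

for name in sorted(sizes):
    print(name, sizes[name])
```

'B' takes 1, 2 and 4 bytes respectively, while the euro sign takes 3 bytes in UTF-8 but still only 2 in UTF-16.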

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points in a variable number of bytes. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.
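Those flag bits can be read off the first byte of a sequence. The function below is a simplified sketch of that one step, not a full decoder: the name `utf8_sequence_length` is made up for this illustration, and it doesn't validate continuation bytes or reject overlong forms:

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence starting with first_byte
    should span, or None if the byte cannot start a sequence."""
    if (first_byte & 0b10000000) == 0b00000000:
        return 1  # 0xxxxxxx: ascii range, a single byte
    if (first_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx: leading byte of a 2-byte sequence
    if (first_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx: leading byte of a 3-byte sequence
    if (first_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx: leading byte of a 4-byte sequence
    return None   # 10xxxxxx is a continuation byte; anything else is invalid

for value in (0x42, 0xC3, 0xE9, 0xA9):
    print(hex(value), utf8_sequence_length(value))
```

Note that 0xe9 (1110 1001) is flagged as the start of a 3-byte sequence, which is why a lone '\xe9' is not valid UTF-8, whereas 0xc3 announces the 2-byte sequence '\xc3\xa9'.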

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx (in binary)

The x's show the actual space reserved to "store" the code point during encoding.

The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.

Upon encoding, UTF-8 doesn't change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (as we said, it's the same in ASCII). After encoding in UTF-8 it becomes:
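One way to check the bit patterns for 'B' yourself, as a small sketch in Python 3 syntax (the variable names are mine):

```python
code_point = ord(u"B")          # 0x42
encoded = u"B".encode("utf-8")  # a single byte

# Format both values as 8 binary digits for a side-by-side comparison.
bits_before = format(code_point, "08b")
bits_after = format(bytearray(encoded)[0], "08b")

print(bits_before, bits_after, len(encoded))
```

Both patterns are 01000010: the leading 0 is the UTF-8 flag bit, and the remaining seven bits are the code point unchanged.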

so if I understand well, when I print out unicode strings (the code points), python assumes that I want an output encoded in utf-8, instead of just trying to give me what it could have been in ascii?
– mike, Apr 8 '10 at 0:30

@mike: AFAIK what you said is correct. If it did print out the Unicode characters but encoded as ASCII, everything would come out garbled and probably all the beginners would be asking, "How come I can't print out Unicode text?"
– Mark Rushakoff, Apr 8 '10 at 0:38

Thank you. I'm actually one of those beginners, but coming from the side of people who do have some understanding of unicode, which is why this behavior is throwing me off a bit.
– mike, Apr 8 '10 at 0:46

R., not correct, since '\xe9' isn't in the ascii character set. Non-Unicode strings are printed using sys.stdout.encoding, Unicode strings are encoded to sys.stdout.encoding before printing.
– Mark Tolonen, Apr 8 '10 at 2:41