Python 2.7 quick reference

10.1. The UTF-8 encoding

How, you might ask, do we pack Unicode characters, whose
code points can need up to 21 bits, into 8-bit bytes? Quite
prevalent on the Web and the Internet generally is the UTF-8
encoding, which allows any of the Unicode characters to be
represented as a sequence of one to four 8-bit bytes.

First, some definitions:

A code point is a number
representing a unique member of the Unicode character
set.

The Unicode code points are visualized as a
three-dimensional structure made of planes, each of which has a range of 65536
code points organized as 256 rows of 256 columns each.

The low-order eight bits of the code point select the
column; the next eight more significant bits select the
row; and the remaining most significant bits select the
plane.
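
As a rough illustration of the plane/row/column layout just
described, the sketch below splits a code point with shifts and
masks. The function name split_code_point is hypothetical and is
not part of Python's standard library.

```python
def split_code_point(cp):
    """Return (plane, row, column) for an integer code point.

    Hypothetical helper, for illustration only.
    """
    column = cp & 0xFF          # low-order eight bits
    row = (cp >> 8) & 0xFF      # next eight more significant bits
    plane = cp >> 16            # remaining most significant bits
    return (plane, row, column)

# The euro sign, U+20AC, lies in plane 0, row 0x20, column 0xAC.
euro = split_code_point(0x20AC)
```

Here euro is the tuple (0, 0x20, 0xAC): plane 0 is the one that
holds most characters in everyday use.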

This diagram shows how UTF-8 encoding works. The first
128 code points (hexadecimal 00 through 7F) are encoded
as in normal 7-bit ASCII, with the high-order bit always 0. For
code points above hex 7F, all bytes have the high-order
(0x80) bit set, and the bits of the code
point are distributed through two, three, or four
bytes, depending on the number of bits needed to
represent the code point value.
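
To make the bit distribution concrete, here is a minimal sketch
of an encoder for a single code point. It is not how Python
implements the codec; the name utf8_bytes is hypothetical, and
the sketch assumes the standard UTF-8 layout (lead bytes
beginning 110, 1110, or 11110, and continuation bytes beginning
10). It does not reject surrogate code points.

```python
def utf8_bytes(cp):
    """Return the UTF-8 encoding of code point cp as a list of ints.

    Hypothetical helper, for illustration only.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        # 7 bits: one byte, high-order bit 0, same as ASCII.
        return [cp]
    if cp < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    if cp <= 0x10FFFF:
        # Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return [0xF0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    raise ValueError("code point out of Unicode range")

# The euro sign, U+20AC, needs three bytes: E2 82 AC.
euro_bytes = utf8_bytes(0x20AC)
```

Every byte produced for a code point above hex 7F is 0x80 or
greater, so such bytes can never be mistaken for ASCII
characters.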

To encode a Unicode string U, use this method:

U.encode('utf_8')
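
For instance, a short sketch using a sample string chosen here
for illustration: the four-character Unicode string below
encodes to five bytes, because the accented letter (code point
hex E9) needs two bytes.

```python
# Sample value chosen for illustration: "cafe" with an acute accent.
U = u'caf\xe9'

# Encode the Unicode string as UTF-8 bytes.
encoded = U.encode('utf_8')

char_count = len(U)         # number of Unicode characters
byte_count = len(encoded)   # number of 8-bit bytes after encoding
```

Here char_count is 4 and byte_count is 5. The three ASCII
letters are stored unchanged, and the last character becomes
the two bytes C3 A9.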

To decode a regular str value S that contains a
UTF-8 encoded value, use this method: