The Python language supports two ways of representing what we
know as “strings”, i.e. series of characters. In Python 2, the
two types are str and unicode, and in Python 3 they are
bytes and str. A key aspect of the Python 2 str and
Python 3 bytes types is that they contain no information
regarding what encoding the data is stored in. For this
reason they were commonly referred to as byte strings on
Python 2, and Python 3 makes this name more explicit. This has
its origins in Python having been developed before the Unicode
standard was even available, back when strings were C-style
strings and were just that, a series of bytes. Strings that had
only values below 128 just happened to be ASCII strings and were
printable on the console, whereas strings with values of 128 and
above would produce all kinds of graphical characters and bells.

Contrast the “byte-string” type with the “unicode/string” type.
Objects of this latter type are created whenever you say something like
u"helloworld" (or in Python 3, just "helloworld"). In this
case, Python represents each character in the string internally
using multiple bytes per character (something similar to
UTF-16). What’s important is that when using the
unicode/string type to store strings, Python knows the
data’s encoding; it’s in its own internal format. Whereas when
using the string/bytes type, it does not.
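The difference is easy to observe directly (a quick sketch; the byte counts assume a UTF-8 encoding):

```python
# A unicode/str object knows it holds characters; a byte-string is just bytes.
text = u"réveillé"            # unicode in Python 2, str in Python 3
data = text.encode("utf-8")   # a byte-string; the encoding is now "forgotten"

print(len(text))   # 8 characters
print(len(data))   # 10 bytes, since each é occupies two bytes in UTF-8
```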

When Python 2 attempts to treat a byte-string as a string, which
means it’s attempting to compare/parse its characters, to coerce
it into another encoding, or to decode it to a unicode object,
it has to guess what the encoding is. In this case, it will
pretty much always guess the encoding as ascii… and if the
byte-string contains bytes above value 128, you’ll get an error.
Python 3 eliminates much of this confusion by just raising an
error unconditionally if a byte-string is used in a
character-aware context.

There is one operation that Python can do with a non-ASCII
byte-string, and it’s a great source of confusion: it can dump the
byte-string straight out to a stream or a file, with nary a care
what the encoding is. To Python, this is pretty much like
dumping any other kind of binary data (like an image) to a
stream somewhere. In Python 2, it is common to see programs that
embed all kinds of international characters and encodings into
plain byte-strings (i.e. using "helloworld"-style literals)
fly right through their run, sending reams of strings out to
wherever they are going; the programmer, seeing the same
output as was expressed in the input, is then under the illusion
that his or her program is Unicode-compliant. In fact, such a
program has no unicode awareness whatsoever, and similarly has
no ability to interact with libraries that are unicode-aware.
Python 3 makes this much less likely by defaulting to unicode as
the storage format for strings.

The “pass through encoded data” scheme is what template
languages like Cheetah and earlier versions of Myghty do by
default. Mako as of version 0.2 also supports this mode of
operation when using Python 2, using the disable_unicode=True
flag. However, when using Mako in its default mode of
unicode-aware, it requires explicitness when dealing with
non-ASCII encodings. Additionally, if you ever need to handle
unicode strings and other kinds of encoding conversions more
intelligently, the usage of raw byte-strings quickly becomes a
nightmare, since you are sending the Python interpreter
collections of bytes for which it can make no intelligent
decisions with regards to encoding. In Python 3 Mako only allows
usage of native, unicode strings.

This is the most basic encoding-related setting, and it is
equivalent to Python’s “magic encoding comment”, as described in
PEP 263. Any
template that contains non-ASCII characters requires that this
comment be present so that Mako can decode the source to unicode
(and also make use of Python’s AST parsing services). Mako’s
lexer will use this encoding to convert the template source into
a unicode object before continuing its parsing:
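For example, a template stored as UTF-8 would begin with a comment like the following (the template text beneath it is just illustrative):

```
## -*- coding: utf-8 -*-

voix m’a réveillé.
```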

The next area that encoding comes into play is in expression
constructs. By default, Mako’s treatment of an expression like
this:

${"hello world"}

looks something like this:

context.write(unicode("hello world"))

In Python 3, it’s just:

context.write(str("hello world"))

That is, the output of all expressions is run through the
``unicode`` built-in. This is the default setting, and can be
modified to expect various encodings. The unicode step serves
both the purpose of rendering non-string expressions as
strings (such as integers, or objects which provide ``__str__()``
methods), and of ensuring that the final output stream is
constructed as a unicode object. The main implication of this is
that any raw byte-string that contains an encoding other than
ASCII must first be decoded to a Python unicode object. This
means you can’t say this in Python 2:

${"voix m’a réveillé."}## error in Python 2!

You must instead say this:

${u"voix m’a réveillé."}## OK !

Similarly, if you are reading data from a file that is streaming
bytes, or returning data from some object that is returning a
Python byte-string containing a non-ASCII encoding, you have to
explicitly decode to unicode first, such as:

${call_my_object().decode('utf-8')}

Note that filehandles acquired by open() in Python 3 default
to returning “text”, that is the decoding is done for you. See
Python 3’s documentation for the open() built-in for details on
this.
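A quick sketch of that difference, using a throwaway temporary file:

```python
import tempfile

# Write raw UTF-8 bytes, then read them back both ways.
with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f:
    f.write(u"voix m’a réveillé.".encode("utf-8"))
    path = f.name

with open(path, "rb") as f:               # binary mode returns bytes
    raw = f.read()
with open(path, encoding="utf-8") as f:   # text mode decodes to str for you
    text = f.read()

print(type(raw).__name__, type(text).__name__)  # bytes str
```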

If you want a certain encoding applied to all expressions,
override the unicode builtin with the decode built-in at the
Template or TemplateLookup level:

t = Template(templatetext, default_filters=['decode.utf8'])

Note that the decode filter is slower than the unicode
function, since unlike unicode it’s not a Python built-in,
and it also checks the type of the incoming data to determine
whether string conversion is needed first.
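What such a filter must do can be sketched like this (decode_utf8 is a hypothetical stand-in for the real filter's behavior, not Mako's actual implementation):

```python
# Sketch: unlike unicode()/str(), a decode-style filter must first check
# whether the incoming value is already decoded text before converting.
def decode_utf8(value):
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)

print(decode_utf8(b"voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9."))  # voix m’a réveillé.
print(decode_utf8(42))                                              # 42
```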

The default_filters argument can be used to entirely customize
the filtering process of expressions. This argument is described
in The default_filters Argument.

Mako does play some games with the style of buffering used
internally, to maximize performance. Since the buffer is by far
the most heavily used object in a render operation, it’s
important!

When calling render() on a template that does not specify any
output encoding (i.e. it’s ascii), Python’s cStringIO module,
which cannot handle encoding of non-ASCII unicode objects
(even though it can send raw byte-strings through), is used for
buffering. Otherwise, a custom Mako class called
FastEncodingBuffer is used, which essentially is a super
dumbed-down version of StringIO that gathers all strings into
a list and uses u''.join(elements) to produce the final output
– it’s markedly faster than StringIO.
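The join-based strategy can be sketched as follows (JoinBuffer is a hypothetical name chosen for illustration; Mako's actual FastEncodingBuffer differs in its details):

```python
# Sketch of the join-based buffering strategy: append each string to a
# list, then join (and optionally encode) exactly once at the end.
class JoinBuffer:
    def __init__(self, encoding=None):
        self.data = []
        self.encoding = encoding

    def write(self, text):
        self.data.append(text)

    def getvalue(self):
        output = u"".join(self.data)
        if self.encoding:
            return output.encode(self.encoding)
        return output

buf = JoinBuffer(encoding="utf-8")
buf.write(u"voix m’a ")
buf.write(u"réveillé.")
print(buf.getvalue())
```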

Some segments of Mako’s userbase choose to make no usage of
Unicode whatsoever, and instead would prefer the “pass through”
approach; all string expressions in their templates return
encoded byte-strings, and they would like these strings to pass
right through. The only advantage to this approach is that
templates need not use u"" for literal strings; there’s an
arguable speed improvement as well since raw byte-strings
generally perform slightly faster than unicode objects in
Python. For these users, assuming they’re sticking with Python
2, they can hit the disable_unicode=True flag as so:
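A sketch of what that looks like (Python 2 only; the flag does not exist on Python 3):

```python
# Python 2 only: byte-strings pass straight through to the output.
from mako.template import Template

template = Template("hello world",
                    disable_unicode=True,
                    input_encoding='utf-8')
```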

In the generated module code above, the string literal used
within Context.write() is a regular byte-string.

When disable_unicode=True is turned on, the default_filters
argument which normally defaults to ["unicode"] now defaults
to ["str"] instead. Setting default_filters to the empty list
[] can remove the overhead of the str call. Also, in this
mode you cannot safely call render_unicode() – you’ll get
unicode/decode errors.

Don’t use this mode unless you really, really want to and you
absolutely understand what you’re doing.

Don’t use this option just because you don’t want to learn to
use Unicode properly; we aren’t supporting user issues in this
mode of operation. We will however offer generous help for the
vast majority of users who stick to the Unicode program.

Python 3 is unicode by default, and the flag is not available
when running on Python 3.