This document explains how Unicode and related issues are handled in CKAN.
For a general introduction to Unicode and Unicode handling in Python 2 please
read the Python 2 Unicode HOWTO. Since Unicode handling differs greatly
between Python 2 and Python 3 you might also be interested in the
Python 3 Unicode HOWTO.

Note

This document describes the intended future state of Unicode handling in
CKAN. For historic reasons, some existing code does not yet follow the
rules described here.

New code should always comply with the rules in this document. Exceptions
must be documented.

String literals are string values given directly in the source code (as opposed
to string variables read from a file, received as an argument, etc.). In
Python 2, string literals by default have type str. They can be changed to
unicode by adding a u prefix. In addition, the b prefix can be used
to explicitly mark a literal as str:

    x = "I'm a str literal"
    y = u"I'm a unicode literal"
    z = b"I'm also a str literal"

In CKAN, every string literal must carry either a u or a b prefix.
While the latter is redundant in Python 2, it makes the developer’s intention
explicit and eases a future migration to Python 3.

This rule also holds for raw strings, which are created using an r
prefix. Simply use ur instead, for example ur'\d+'.

For many characters, Unicode offers multiple descriptions. For example, a small
Latin e with an acute accent (é) can be specified either using its
dedicated code point (U+00E9) or by combining the code points for e
(U+0065) and the accent (U+0301). Both variants look the same but
consist of different code point sequences, so they do not compare equal.
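The difference can be made visible with a short sketch using only the standard library. It assumes that normalizing with unicodedata.normalize to the NFC form is an acceptable way to make the two spellings comparable; the variable names are illustrative and not part of CKAN.

```python
import unicodedata

# Two ways to write a small Latin e with an acute accent
precomposed = u'\u00e9'  # the dedicated code point U+00E9
combined = u'e\u0301'    # e (U+0065) followed by the combining accent (U+0301)

# Both render as the same character but are different code point sequences
equal_as_written = (precomposed == combined)  # False

# Normalizing both sides to the same form (here NFC) makes them comparable
equal_after_nfc = (
    unicodedata.normalize(u'NFC', precomposed) ==
    unicodedata.normalize(u'NFC', combined)
)  # True
```

Whichever normalization form is chosen, it has to be applied to both sides of a comparison.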

Like all other strings, filenames should be stored as Unicode strings
internally. However, some filesystem operations return or expect byte strings,
so filenames have to be encoded/decoded appropriately. Unfortunately, different
operating systems use different encodings for their filenames, and on some of
them (e.g. Linux) the file system encoding is even configurable by the user.

To make decoding and encoding of filenames easier, the ckan.lib.io module
therefore contains the functions decode_path and encode_path, which
automatically use the correct encoding:

    import io
    import json
    import os

    from ckan.lib.io import decode_path

    # __file__ is a byte string, so we decode it
    MODULE_FILE = decode_path(__file__)
    print(u'Running from ' + MODULE_FILE)

    # The functions in os.path return unicode if given unicode
    MODULE_DIR = os.path.dirname(MODULE_FILE)
    DATA_FILE = os.path.join(MODULE_DIR, u'data.json')

    # Most of Python's built-in I/O-functions accept Unicode filenames as input
    # and encode them automatically
    with io.open(DATA_FILE, encoding=u'utf-8') as f:
        data = json.load(f)

Note that almost all of Python’s built-in I/O functions accept Unicode filenames
as input and encode them automatically, so using encode_path is usually not
necessary.

The return type of some of Python’s I/O-functions (e.g. os.listdir and
os.walk) depends on the type of their input: If passed byte strings they
return byte strings and if passed Unicode they automatically decode the raw
filenames to Unicode before returning them. Other functions exist in two
variants that return byte strings (e.g. os.getcwd) and Unicode (os.getcwdu),
respectively.
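The input-dependent return type of os.listdir can be checked with a small sketch. It creates a temporary directory with one file and lists it twice, once with a byte string path and once with a Unicode path. The names byte_dir, unicode_dir and example.txt are illustrative assumptions, and the sketch assumes the temporary directory's path can be encoded with the filesystem encoding.

```python
import io
import os
import shutil
import sys
import tempfile

fs_encoding = sys.getfilesystemencoding()

# Build a byte string and a Unicode version of the same directory path
tmp_dir = tempfile.mkdtemp()
if isinstance(tmp_dir, bytes):
    byte_dir, unicode_dir = tmp_dir, tmp_dir.decode(fs_encoding)
else:
    byte_dir, unicode_dir = tmp_dir.encode(fs_encoding), tmp_dir

try:
    # Create a single empty file to list
    io.open(os.path.join(unicode_dir, u'example.txt'), u'w').close()

    # Byte string in, byte strings out
    byte_names = os.listdir(byte_dir)

    # Unicode in, Unicode out
    unicode_names = os.listdir(unicode_dir)
finally:
    shutil.rmtree(unicode_dir)
```

Passing Unicode paths throughout keeps the results in line with the rule that filenames are stored as Unicode internally.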

Warning

Some of Python’s I/O-functions may return both byte and Unicode strings
for a single call. For example, os.listdir will normally return Unicode
when passed Unicode, but filenames that cannot be decoded using the
filesystem encoding will still be returned as byte strings!

Note that if the filename of an existing file cannot be decoded using the
filesystem’s encoding then the environment Python is running in is most
probably incorrectly set up.
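One defensive way to deal with such mixed results is to separate the names by type before using them. The helper below is a hypothetical sketch, not a CKAN function, and the sample listing is hand-built to imitate what os.listdir may return on Python 2 for an undecodable name.

```python
def split_listing(names):
    """Split a directory listing into decoded and undecodable names.

    Hypothetical helper for illustration; it is not part of CKAN.
    """
    decoded = [n for n in names if not isinstance(n, bytes)]
    undecodable = [n for n in names if isinstance(n, bytes)]
    return decoded, undecodable


# Hand-built listing: one Unicode name and one raw byte string name
mixed = [u'report.csv', b'caf\xe9.txt']
decoded, undecodable = split_listing(mixed)
```

What to do with the undecodable names (log them, skip them, or fail) depends on the calling code.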

The instructions above are meant for the names of existing files that are
obtained using Python’s I/O functions. However, sometimes one also wants to
create new files whose names are generated from unknown sources (e.g. user
input). To make sure that the generated filename is safe to use and can be
represented using the filesystem’s encoding, use
ckan.lib.munge.munge_filename.
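The following is not munge_filename itself but a standard-library sketch of the kind of sanitizing such a function might perform, so that it can be run without CKAN installed. The function name sketch_safe_filename, the set of allowed characters and the u'unnamed' fallback are all assumptions made for illustration. Real code should call ckan.lib.munge.munge_filename instead.

```python
import re
import unicodedata


def sketch_safe_filename(name):
    """Reduce an arbitrary Unicode string to a conservative ASCII filename.

    Illustrative only; this is NOT CKAN's munge_filename implementation.
    """
    # Decompose accented characters, then drop everything outside ASCII
    decomposed = unicodedata.normalize(u'NFKD', name)
    ascii_only = decomposed.encode(u'ascii', u'ignore').decode(u'ascii')

    # Lower-case and replace runs of other characters with a single hyphen
    cleaned = re.sub(u'[^a-z0-9._-]+', u'-', ascii_only.lower())

    # Strip leading and trailing hyphens and dots (this also rules out '..')
    cleaned = cleaned.strip(u'-.')

    # Fall back to a fixed name if nothing usable is left
    return cleaned or u'unnamed'


safe_name = sketch_safe_filename(u'R\u00e9sum\u00e9 2016.txt')
```

Restricting the result to a small ASCII subset is one simple way to guarantee that the name can be represented in any common filesystem encoding.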