Lessons in porting to Python 3

Yesterday, I completed my port of dbus-python to Python 3, and submitted
my patch upstream. While I've yet to hear any feedback from Simon about my
patch, I'm fairly confident that it's going in the right direction. This
version should allow existing Python 2 applications to run largely unchanged,
and minimizes the changes that clients will have to make to use the Python
3 version.

Some of the changes are specific to the dbus-python project, and I included
a detailed summary of those changes and my rationale behind them. There
are lots of good lessons learned during this porting exercise that I want to
share with you, have a discussion about, and see if there aren't things we
core Python developers can do in Python 3.3 to make it even easier to migrate
to Python 3.

First, some background. D-Bus is a freedesktop.org project for same-system
interprocess communication, and it's an essential component of any Linux
desktop. The D-Bus system and C API are mature and well-defined, and there
are bindings available for many programming languages, Python included of
course. The existing dbus-python package is only compatible with Python 2,
and most recommendations are to use the Gnome version of Python bindings
should you want to use D-Bus with Python 3. For us in Ubuntu, this isn't
acceptable though because we must have a solution that supports KDE and
potentially even non-UI based D-Bus Python servers. Several ports of
dbus-python to Python 3 have been attempted in the past, but none have been
accepted upstream, so naturally I took it as a challenge to work on a new
version of the port. After some discussion with the upstream maintainer Simon
McVittie, I had a few requirements in mind:

One code base for both Python 2 and Python 3. It's simply too difficult to
support multiple development branches, so one branch must be compilable in
both versions of Python. Because dbus-python is not setuptools-based, I chose
not to rely on 2to3 to auto-convert the Python layer. This is more difficult,
but given the next requirement, entirely possible.

Minimum Python versions to support are 2.6 and 3.2 (Python 2.7 is also
supported). Python 2.6 contains almost everything you need to do a high
quality port of both the Python layer and the C extension layer with a
single code base. Python 2.7 has one or two additional helpers, but they
aren't important enough to count Python 2.6 out. For dbus-python, this
specifically means dropping support for Python 2.5, which is more than 5
years old at the time of this writing. Also, it makes no sense to support
Python 3.0 or 3.1 as neither of those is in widespread use.

Minimize any API changes seen by Python 2 code, and minimize the changes
needed to port clients to Python 3. For the former, this means everything
from keeping Python APIs unchanged to keeping the inheritance hierarchy the
same. Python 2 programs will see a few small changes after the application
of my patches; I'll describe them below but they should be inconsequential
for the vast majority of Python 2 applications. While it's unavoidable that
Python 3 applications will see a different API, these differences have been
minimized.

There are two main issues that had to be sorted out for this port, and in
general for most ports to Python 3: bytes vs. strings, and ints vs. longs.
For the latter, you probably know that where Python 2 has two integer types,
Python 3 has only one. In Python 3, all integers are longs, and there is no
L suffix for integer literals. This turned out to be trickier in the
dbus-python case because dbus supports a numeric stack of various integer
widths, and in Python 2 these are implemented as subclasses of the built-in
int and long types. Because there are only longs in Python 3, the inheritance
hierarchy a Python application will see changes between Python 2 and Python 3.
This is unavoidable.
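
Client code that checks for "any integer" can hide this difference behind a tuple of types. The following is only a sketch of that idea; the names integer_types and is_integer are hypothetical and not part of dbus-python:

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3 has a single integer type.
    integer_types = (int,)
else:
    # 'long' exists only in Python 2; this branch never runs on Python 3.
    integer_types = (int, long)


def is_integer(value):
    """Return True for any built-in integer, or a subclass of one."""
    return isinstance(value, integer_types)
```

Because the check uses isinstance(), it also accepts subclasses of the built-in integer types, whichever base class they happen to have in a given Python version.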

I also made the decision to change some object types to longs in both versions
of Python, where I thought it was highly unlikely that Python clients would
care. Specifically, many dbus objects have a variant_level attribute,
which is usually zero, but can be any positive integer. For implementation
simplicity, I changed these to longs in Python 2 also.

Ah, bytes vs. strings is always where things get interesting when porting to
Python 3. It's the single most brain hurty exercise you will have to go
through. Remember that Python 2 lets you cheat. If you're not sure whether the
entity you're dealing with is some bytes, or some (usually ASCII-encoded)
string, just use a Python 2 str type (a.k.a. 8-bit string) and let
Python's automatic conversion rules change it to a unicode when the two
types meet. You can't get away with this in Python 3 though, for very good
reasons - it's error prone, and can lead to data corruption or the annoyingly
ubiquitous and hard to predict UnicodeErrors.

In Python 3, you must be clear about what are bytes and what are strings
(i.e. unicodes), and you must be explicit when converting between the two.
Yes, this can be painful at times but in my opinion, it's crucial that you
do so. It's that important to eliminate UnicodeErrors that you can't
defend against and your users won't understand or be able to correct. Once
you're clear in your own mind as to which are strings and which are bytes,
it's usually not that hard to reflect that clearly in your code, especially if
you leave Python 2.5 and anything earlier behind, which I highly recommend.
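
As a small illustration of being explicit at the boundary, here is a sketch that runs unchanged on Python 2.6+ and Python 3. The variable names are illustrative, and UTF-8 is an assumed encoding:

```python
from __future__ import unicode_literals

# Bytes as they might arrive from a socket or a C API (assumed to be UTF-8).
wire_data = b'caf\xc3\xa9'

# Decode once, at the boundary, so the rest of the program only sees text.
text = wire_data.decode('utf-8')

# Encode explicitly when handing data back to a bytes-oriented API.
round_trip = text.encode('utf-8')
```

The five bytes become four characters of text, and nothing is ever converted implicitly.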

dbus-python presented an interesting challenge here. It has several data
types in its C API that are defined as UTF-8 encoded char*'s. At first
blush, it seemed to me that these should be reflected in Python 3 as bytes
objects to simplify the conversion in the extension module to and from
char*'s. It turns out that this was a bad idea from an implementation
standpoint, and dbus-python's upstream maintainer had already expressed his
opinion that these data types should be exposed as unicodes in Python 3.
After having failed at my initial attempts at making them bytes, I now agree
that they must be unicodes, both for implementation simplicity and for minimal
impact on porting user code.

The biggest problem I ran into with the choice of bytes is that the callback
dispatch code in dbus-python is complex, difficult to understand and debug,
driven by external data, and written with a deep assumption of operating on
strings. For example, when the dbus C API receives a signal, it must
determine whether there is a Python function registered to handle that signal,
and it does this by comparing a number of client-registered parameters, such
as the method name, the interface, and the object path. If the dbus C API was
turning these parameters into bytes, but the clients had registered strings,
then the comparisons in the callback dispatch routines would fail, either
loudly with an exception, or silently with failing comparisons. The former
were relatively easy to track down and fix, by explicitly encoding
client-registered strings to bytes. But the latter, silent failures, were
nearly impossible to debug. Add to that the fact that there were so many
roads into the registration system that it was very difficult to coerce
all incoming data early enough so that coercion wasn't necessary at comparison
time. I was left with the unappealing alternative of forcing all client code
to also change their data from using strings to using bytes, which I realized
would be much too high a burden on clients porting their applications to
Python 3. Simon was right, but it was a useful exercise to fail at anyway.

(By way of comparison, it took me the better part of a week and a half to try
to get the test suite passing when these objects were bytes, which I was
ultimately unable to do, and about a day to get them passing when everything
was unicodes. That's gotta tell you something right there, and hopefully not
that "I suck" :).

Let's look at some practical advice that may help you in your own porting
efforts.

Target nothing older than Python 2.6 or Python 3.2. I mentioned this
before, but it's really going to make your life easier. Specifically, drop
Python 2.5 and earlier and you will thank yourself [1]. If you absolutely
cannot do this, consider waiting to port to Python 3. Note that while
Python 2.7 has a few additional conveniences for supporting both Python 2
and Python 3 in a single code base, I did not find them compelling enough to
drop Python 2.6 support.

Where you have C types with reprs, make those reprs return unicodes in both
versions. Many dbus-python types have somewhat complicated reprs because
they return different strings depending on whether their variant_levels
are zero or non-zero. #ifdef'ing all of these was just too much work.
Because most code probably doesn't care about the specific type of the repr,
and because Python 2 allows unicode reprs, and because I have a very clever
hack for this [2], I decided to make all reprs return unicodes in both
versions of Python.

Include the following __future__ imports in your Python code:
print_function, absolute_import, and unicode_literals. In
Python 2.6 and 2.7, these enable features that are the default in Python 3,
and so make it easier to support both with one codebase. Specifically,
change all your print statements to print() functions, and remove all
your u'' prefixes from your unicode literals. Be sure to b'' prefix
all your byte literals [3].
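
A module header following this advice might look like the sketch below. The variable names are made up for illustration:

```python
from __future__ import absolute_import, print_function, unicode_literals

greeting = 'hello'       # text (unicode) in both Python 2 and Python 3
payload = b'\x00\x01'    # bytes in both (an 8-bit str in Python 2)

# print() is a function in both versions once print_function is imported.
print(greeting, len(payload))
```

The __future__ imports are harmless no-ops on Python 3, so the same file serves both versions.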

Wherever possible, in your extension modules, change all your PyInts to
PyLongs. In dbus-python, this means that the variant_level
attributes are longs in both Python versions, as are values that represent
such things as UNIX file descriptors. The only place where I kept
PyInts in Python 2 (and their requisite #ifdefs to use PyLongs
in Python 3) was in the numeric stack inheritance hierarchy, mostly so that
Python 2 code which cares about such things would not have to change.

Define a Python variable and a C macro for determining whether you're
running in Python 2 or Python 3. The former is used in dbus-python because
under Python 3, there is no UTF8String type any more, among other subtle
differences. The latter is used to simplify the #ifdef tests where
they're needed [4].

In your C code, #include <bytesobject.h>. This header exposes aliases
for all PyString calls so that you can use the Python 3 idiom of
PyBytes. Then globally replace all PyString_Foo() calls with
PyBytes_Foo() and the code will look clean and be compilable under both
versions of Python. You may need to add explicit PyUnicode calls where
you need to discern between bytes and strings, but again, this code will be
completely portable between Python 2 and Python 3.

Try to write your functions to accept both unicodes and bytes, but always
normalize them to one type or the other for internal use, and choose one or
the other to return. Some Python stdlib methods are polymorphic in that
they return bytes when handed bytes, and unicodes when handed unicodes.
This can be convenient in some cases, but problematic in others. Choose
carefully when porting your APIs.
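
One way to follow the "accept both, normalize once" advice is sketched below. The helpers to_text() and object_path_depth() are hypothetical, and UTF-8 is an assumed default encoding:

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def object_path_depth(path):
    """Count the components of a slash-separated path given as bytes or text."""
    path = to_text(path)  # normalize once, at the top of the function
    return len([part for part in path.split('/') if part])
```

The public function accepts either type, but everything after the first line works on text only, and the return type never depends on the argument type.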

Don't use trailing-L long literals if you can help it.

Switch to using Py_TYPE() everywhere instead of de-referencing
ob_type explicitly. The structures are laid out differently between
Python 2 and Python 3, and this Python-supplied macro hides the ugliness
from you.

Here are a few other miscellaneous issues you should be aware of:

Metaclasses are defined differently in Python 2 and Python 3, and you cannot
write any Python code snippet that is even compilable between the two. That's
because the syntax for defining a class that derives from a metaclass in
Python 3 is illegal syntax in Python 2. Your module simply won't compile. My
solution was to use exec() on a string. For this reason, I suggest
keeping metaclass subclasses as simple as possible, so that string is nice and
small.
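
A minimal sketch of that exec() approach might look like the following. The names Meta and Base and the registered attribute are hypothetical, chosen only for illustration:

```python
import sys


class Meta(type):
    """A trivial metaclass that tags every class it creates."""

    def __new__(mcs, name, bases, attrs):
        attrs.setdefault('registered', True)
        return super(Meta, mcs).__new__(mcs, name, bases, attrs)


# Only the class statement differs, so only that goes into the string.
if sys.version_info[0] >= 3:
    source = "class Base(object, metaclass=Meta):\n    pass\n"
else:
    source = "class Base(object):\n    __metaclass__ = Meta\n"

scope = {'Meta': Meta}
exec(source, scope)  # this call form is accepted by Python 2 and Python 3
Base = scope['Base']
```

Ordinary subclasses of Base can then be written with normal syntax, since they inherit the metaclass in both versions.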

Get rid of all your uses of iteritems(), iterkeys(), itervalues(),
and xrange(). You probably don't need the optimization these provide, and
they do not exist in Python 3. You can conditionalize around them, but I
think in most cases it's not worth it. If you really need the optimization,
then you'll have to figure out a way around the missing names in Python 3.
But note that Python 3 is already more efficient for the first three, since
you get back dictview objects instead of concrete lists.
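
For instance, the plain spellings below behave the same in both versions. This is an illustrative sketch with made-up data:

```python
counts = {'signals': 2, 'methods': 5}

# items() exists in both versions: a list in Python 2, a view in Python 3.
total = 0
for name, count in counts.items():
    total += count

# range() also exists in both; it only builds a full list in Python 2.
squares = [n * n for n in range(4)]
```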

PyArg_Parse() and friends lack a 'y' code in Python 2. In Python 3,
this code takes bytes objects. Where I absolutely needed bytes in Python 3 and
strs in Python 2, I just #ifdef'd around the PyArg_Parse() calls. In
Python 3, there's no equivalent of 'z' for bytes objects (a code that accepts
None and sets the output variable to NULL in that case). If this is
important to you, you might need to write an O& converter.

Watch out for next() vs. __next__() when writing iterators. Python 2
uses the former while Python 3 uses the latter. Best to define the method
once, and then support compatibility via next = __next__ in your class
definition.
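
Here is a small sketch of that pattern, using a hypothetical Countdown class:

```python
class Countdown(object):
    """Iterate from start down to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    # Python 2 looks up next(); point it at the same method.
    next = __next__
```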

operator.isSequenceType() is gone in Python 3. Here's the code I use for
compatibility:
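
A minimal sketch of one way to do this, not necessarily what dbus-python itself uses, relies on the Sequence abstract base class. The helper name is_sequence is hypothetical:

```python
try:
    from collections.abc import Sequence    # Python 3.3 and later
except ImportError:
    from collections import Sequence        # Python 2.6, 2.7, and 3.2


def is_sequence(obj):
    """Return True if obj is a list, tuple, string, or other Sequence."""
    return isinstance(obj, Sequence)
```

Note that this check is stricter than operator.isSequenceType() was: a user-defined class only counts if it subclasses or is registered with Sequence, so test it against the types your code actually receives.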

If you by chance use PyCObjects in your extension module, you'll have to
switch these to PyCapsules for Python 3. If you're lucky enough to be
able to drop Python 2.6, you can use PyCapsules everywhere, since they are
available in Python 2.7.

Let me close by saying that you shouldn't be frightened off by the prospect
either of porting your code to Python 3, or supporting both Python 2 and
Python 3 in a single code base. It's definitely doable, and we in the Python
community are gaining more experience at it every day. I strongly feel that
we are well on track toward Guido's original goal of mainstream Python 3
acceptance within 5 years of Python 3's release. I think we're soon going to
see a critical mass of Python 3 ports, after which time, you'll just seem old
and creaky if you don't port to Python 3.

There are some other excellent references for helping you port out there on
the 'net, and for the most part, I've tried not to duplicate their
information. Here are some useful places to start:

[1] It is not impossible to support both Python 3 and versions of Python 2
earlier than 2.6, just more difficult. Michael Foord has had success
doing this for libraries of his such as mock. I just think it's more
trouble than it's worth in most cases.

[2] Here's the clever hack, but first a set-up. The reprs of many of the
dbus-python objects are conditional on whether the variant_level is
zero or not. The variant_level is only included in the repr when
it is greater than zero (with zero being the typical value). This just
means there are usually two calls to PyUnicode_FromFormat() in each
C repr implementation, and #ifdef'ing them to use
PyString_FromFormat() in Python 2 would just double the pain. In
addition, the reprs all include the repr of their parent objects,
i.e. their base class repr. The problem is that these base-class reprs
will be PyBytes in Python 2 and PyUnicodes in Python 3, and
there's nothing we can do about that. As it turns out, Python 2.6 and
Python 3.2 have a %V format with some very interesting semantics.
%V consumes two arguments, a PyObject* and a char*, but it
only uses one of them. When the first argument is not NULL, it
uses that and ignores the second argument. But when the first argument
is NULL, it will use the second argument.

How can this help produce portable code? I define the following macro
and use this everywhere the %V format code is given:

In Python 2, where parent_repr is a PyBytes, REPRV() will
return NULL as the first argument, and via PyBytes_AS_STRING(), a
char* in the second argument. In Python 3, where parent_repr is
a PyUnicode, the first argument will just be the object and the
second argument will be NULL (but it is ignored by Python). As
long as parent_repr is either a PyUnicode or a PyBytes
(a.k.a. PyString), this works perfectly, and keeps the call sites
simple and sane. Beware though because if parent_repr can be any
other type, this will crash your program. Fortunately, Python doesn't
allow for arbitrary repr types - they must be bytes or unicodes, so in
practice this is pretty safe.

[3] A recent thread in python-dev points out that this recommendation
may not be practical if you're building PEP 3333-compliant WSGI
applications. My take on it is that PEP 3333's definition of "native
strings" is a mistake, but sadly one that we have to live with for now.

[4] Now I can use this in other code to switch behavior between Python 2
and Python 3. For example, in dbus-python to import the UTF8String
type in Python 2 only:

    from dbus import is_py3
    if not is_py3:
        from _dbus_bindings import UTF8String

This is much easier and less error prone than doing the
sys.version_info test everywhere. The other problem is that
sys.version_info only became a namedtuple in Python 2.7, so in
Python 2.6, it has no attribute called major.
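
Indexing sys.version_info sidesteps that problem. A flag like this could be defined as in the sketch below, which is an assumption about one reasonable definition and not necessarily the one dbus-python uses; text_type is a hypothetical extra name:

```python
import sys

# Index access works whether or not version_info is a named tuple.
is_py3 = (sys.version_info[0] == 3)

if is_py3:
    text_type = str
else:
    text_type = unicode  # only evaluated under Python 2
```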