Making Python faster

The Python core developers, and Victor Stinner in particular, have been
focusing on improving the performance of Python 3 over the last few
years. At PyCon 2017, Stinner
gave a talk on some of the optimizations that have been added recently and
the effect they have had on various benchmarks. Along the way, he took a
detour into some improvements that have been made for benchmarking
Python.

He started his talk by
noting that he has been working on porting OpenStack to Python 3 as
part of his day job at Red Hat. So far, most of the unit tests are
passing. That means that an enormous Python program (with some 3 million
lines of code) has largely made the transition to the Python 3 world.

Benchmarks

Back in March 2016, developers did not really trust the Python benchmark
suite, he said. The benchmark results were not stable, which made it
impossible to
tell if a particular optimization made CPython (the Python reference
implementation) faster or slower. So he set out to improve that situation.

He created a new module, perf, as a framework for
running benchmarks. It calibrates the number of loops to run the benchmark
based on a time budget. Each benchmark run then consists of sequentially
spawning twenty processes, each of which performs the appropriate number of
loops three times. That generates 60 time values; the average and standard
deviation are calculated from those. He noted that the standard deviation
can be used to spot problems in the benchmark or the system; if it is large,
meaning lots of variation, that could indicate a problem.
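The spawn-and-average scheme can be sketched in plain Python. This is an illustrative sketch only, not the perf module's actual API; it repeats the measurements in-process rather than spawning separate worker processes, and the timed function is a placeholder:

```python
import statistics
import time

def bench_loops(func, loops):
    """Time `loops` calls of func, returning seconds per loop."""
    t0 = time.perf_counter()
    for _ in range(loops):
        func()
    return (time.perf_counter() - t0) / loops

def run_benchmark(func, processes=20, values_per_process=3, loops=1000):
    # The real perf module spawns each of the 20 runs as a separate
    # process; here they just run sequentially in one process.
    values = []
    for _ in range(processes):
        for _ in range(values_per_process):
            values.append(bench_loops(func, loops))
    # 20 x 3 = 60 time values, summarized as mean and standard deviation.
    return statistics.mean(values), statistics.stdev(values)

mean, stdev = run_benchmark(lambda: sum(range(100)))
# A large stdev relative to the mean hints at system or benchmark noise.
print(f"{mean * 1e6:.2f} us +- {stdev * 1e6:.2f} us")
```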

The perf module has a “system
tune” command that can be used to tune a
Linux system for benchmarking. That includes using a fixed CPU
frequency, rather than allowing each core’s frequency to change all the
time, disabling the Intel Turbo Boost feature, using CPU pinning, and
running the benchmarks on an isolated CPU if that feature is enabled in the
kernel.

Having stable benchmarks makes it much easier to spot a performance
regression, Stinner said. For a real example, he pointed to a graph in his slides
[PDF] that showed the python_startup benchmark time
increasing dramatically
during the development of 3.6 (from 20ms to 27ms). The problem was a new
import in the code; the fix dropped the benchmark to 17ms.

The speed.python.org site allows developers to look at a timeline of the
performance of CPython since April 2014 on various benchmarks. Sometimes
it makes sense to focus on micro-benchmarks, he said, but the timelines of
the larger benchmarks can be even more useful for finding regressions.

Stinner put up a series of graphs showing that 3.6 is faster than 3.5 and
2.7 on multiple benchmarks. He chose the most significant changes to show
in the graphs, and there are a few benchmarks that go against these
trends. The differences between 3.6 and 2.7 are larger than those for 3.6
versus 3.5, which is probably not a huge surprise.

The SymPy benchmarks show
some of the largest performance increases. They are 22-42% faster in 3.6
than they are in 2.7. The largest increase, though, was on the telco
benchmark, which is 40x faster on 3.6 versus 2.7. That is because the decimal
module was rewritten in C for Python 3.3.

Preliminary results indicate that the in-development Python 3.7 is
faster than 3.6, as well. Some optimizations were merged
just after the 3.6 release; they had been held back over worries
about regressions, he said.

Optimizations

Stinner then turned to some of the optimizations that have made those
benchmarks faster. For 3.5, several developers rewrote the functools.lru_cache()
decorator in C. That made the SymPy benchmarks 20% faster. The cache
is “quite complex” with many corner cases, which made it hard to get
right. In fact, it took three and a half years to close the bug associated with it.
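The decorator itself is simple to use; a minimal illustration (the Fibonacci function here is just a stand-in example, not from the talk):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def fib(n):
    # Without the cache this naive recursion is exponential;
    # with it, each fib(n) is computed only once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))            # 832040
print(fib.cache_info())   # hit/miss counters maintained by the C implementation
```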

Another 3.5 optimization was for ordered
dictionaries (collections.OrderedDict). Rewriting it in C
made the html5lib benchmark 20% faster, but it was also tricky code. It
took two and a half years to close that bug, he said.
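A quick look at what the type provides beyond a plain dict (the keys and values here are arbitrary):

```python
from collections import OrderedDict

d = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
d.move_to_end("a")   # reorder in place; O(1) via the C-level linked list
print(list(d))       # ['b', 'c', 'a']

# Unlike plain dicts, OrderedDict equality is order-sensitive:
print(OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1))   # False
```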

Moving on to optimizations for 3.6, he described the change he made for
memory allocation in CPython. Instead of using PyMem_Malloc() for
smaller allocations, he switched to the Python fast
memory allocator that
is used
for Python objects. It only changed two lines of code, but resulted in
many benchmarks getting 5-22% faster—and no benchmarks ran slower.

The profile-guided optimization for CPython was improved by using the Python test
suite. Previously, CPython would be compiled twice using the pidigits
module to guide the optimization. That only exercised a few math-oriented
Python functions, so using the test suite instead covers more
of the interpreter. That
resulted in many benchmarks showing 5-27% improvement just by changing the
build process.

In 3.6, Python moved from
using a bytecode for its virtual machine to a
“wordcode”. Instead of instructions being either one or three bytes long,
all instructions are now a fixed two bytes. That removed an if
statement from the hot path in ceval.c (the main execution loop).
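The fixed-width format can be observed from pure Python via the dis module (assuming CPython 3.6 or later; the sample function is arbitrary):

```python
import dis

def add(a, b):
    return a + b

# On CPython 3.6+ every instruction is an (opcode, oparg) byte pair,
# so the raw bytecode length is always even and offsets step by two.
print(len(add.__code__.co_code) % 2 == 0)   # True
for instr in dis.get_instructions(add):
    print(instr.offset, instr.opname)
```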

Stinner added a way to
make C function calls faster using a new internal _PyObject_FastCall() routine. Creating and destroying the tuple
that is
used to call C functions would take around 20ns, which is expensive if the
call itself is only, say, 100ns. So the new function dispenses with
creating the tuple to pass the function arguments. It shows a 12-50%
speedup for many micro-benchmarks.
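The cost being avoided can be glimpsed with a rough timeit micro-benchmark. This is illustrative only: absolute numbers vary by machine, and it measures ordinary Python-level operations rather than the C API directly:

```python
import timeit

# Building and throwing away an argument tuple on every call is part of
# the classic calling convention's cost; _PyObject_FastCall skips it.
make_tuple = timeit.timeit("(a, b, c)", setup="a = b = c = 1",
                           number=1_000_000)
plain_call = timeit.timeit("f(1, 2, 3)", setup="f = lambda a, b, c: None",
                           number=1_000_000)
print(f"tuple creation: {make_tuple:.3f}s  function call: {plain_call:.3f}s")
```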

He also optimized the ASCII and UTF-8 codecs when using the “ignore”,
“replace”, “surrogateescape”, and “surrogatepass” error
handlers. Those codecs were full of
“bad code”, he said. His work made UTF-8 decoding 15x faster
and encoding 75x faster. For ASCII, decoding is now 60x faster,
while encoding is 3x faster.
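The error handlers in question are easy to demonstrate (the sample bytes are arbitrary):

```python
# "surrogateescape" round-trips arbitrary bytes through str and back:
raw = b"caf\xe9"                      # not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text.encode("utf-8", "surrogateescape") == raw

# "replace" and "ignore" degrade gracefully instead of raising:
print(raw.decode("utf-8", "replace"))   # 'caf\ufffd'
print(raw.decode("utf-8", "ignore"))    # 'caf'
```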

Python 3.5 added byte-string formatting back into the language as a result
of PEP 461, but the
code was inefficient. He used
the _PyBytesWriter() interface to handle byte-string
formatting. That resulted in 2-3x speedups for those types of
operations.
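Byte-string formatting as restored by PEP 461 looks like this (the values are arbitrary):

```python
# printf-style formatting on bytes, back in the language since 3.5:
rec = b"%s: %d bytes (%x)" % (b"payload", 512, 255)
print(rec)   # b'payload: 512 bytes (ff)'
```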

There are “lots of ideas” for optimizations for 3.7, Stinner said, but he
is not sure which will be implemented or if they will be helpful. One that
has been merged already adds new opcodes (LOAD_METHOD and CALL_METHOD)
to implement method calls as fast calls, which makes method calls 10-20% faster. It is an idea that has
come to CPython from PyPy.
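The opcode pair shows up in disassembly on CPython 3.7 through 3.10; later releases replaced these opcodes with successors that keep the same optimization, so the exact names printed are version-specific (the sample function is arbitrary):

```python
import dis

def greet(s):
    return s.upper()

# CPython 3.7-3.10 compile s.upper() to LOAD_METHOD/CALL_METHOD, which
# avoids creating a temporary bound-method object before the call.
for instr in dis.get_instructions(greet):
    print(instr.opname)

print(greet("hi"))   # HI
```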

He concluded his talk by pointing out that on some benchmarks,
Python 3.7 is still slower than 2.7. Most of those are on the order
of 10-20% slower, but the python_startup benchmarks are 2-3x slower. There
is a need to find a way to optimize interpreter startup in Python 3.
There are, of course, more opportunities to optimize the language and he
encouraged those interested to check out speed.python.org, as well as his Faster CPython site
(which he mentioned in his Python Language
Summit session earlier in the week).