After the 2020 edition of dotPy was cancelled due to the COVID-19 pandemic, we contacted two of the speakers who had been due to appear at the event, Victor Stinner and Julien Danjou, so that we could find out more about the performance of the programming language Python. Aspects that came under the spotlight were how best to measure its performance, the reasons behind its slow speeds, and the ongoing projects and concrete solutions being implemented to tackle performance issues, which can be problematic for developers of large applications. Stinner is a member of the Python Steering Council and has been a core developer since 2010, while Danjou is an engineer at the monitoring service Datadog, where he has been working on a profiler for Python programs.

How have you been involved with Python and its performance?

Victor Stinner: I’ve been working on the Python project for 10 years as a core developer for CPython and have been a member of the Python Steering Council since the most recent election. Today I am paid by Red Hat to maintain Python downstream on Red Hat products, such as Red Hat Enterprise Linux and Fedora, as well as upstream on Python.org. So I’m trying to keep Python competitive with other languages, such as Rust or JavaScript, and to make sure that it’s fast enough for existing and new use cases. One of the bottlenecks of CPython, the main implementation of Python, is how hard it is to use multiple CPUs at the same time. It’s already possible in theory, but it’s not really convenient. We need to find a better way to use the full capabilities of newer CPUs with lots of cores!

Julien Danjou: I’ve been doing Python for 10 years as well and I’ve been a staff engineer at Datadog for a year now. At Datadog, they had this idea to build a profiler for Python programs to improve performance, so I started to look at everything that existed in the Python ecosystem about profiling. I discovered that there was no existing perfect solution—basically, everything that had been built was only half-doing the job! So I spent a year or so building a profiler to ship to our customers. They run it on their production system and it sends the performance data back to our system, which we then analyze. Everything is open source on GitHub.

What are the best ways to analyze the performance of Python code?

JD: If you want to analyze your code, what it’s doing, where you can optimize it and where you can spend less money on it, there’s no better way than profiling—profiling for CPU time, but also profiling for memory allocation and I/O. And in order to be efficient, you have to profile directly on your production system! Your systems are going to experience a lot of different behavior, depending on the time of day and the hardware and operating system they run on. So if you don’t profile your software on your production system, but try to do it on your laptop instead, you’re going to get the wrong results. Another important point is that you have to do this profiling with a very low overhead, because you cannot afford for your production system to be 10 times slower.
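To make the idea of low-overhead profiling more concrete, here is a minimal sketch of a sampling profiler written with the standard library only. It is not Datadog's profiler; the names `sample_stacks`, `sample_function`, and `busy_work` are hypothetical, and `sys._current_frames()` is a CPython-specific helper. Instead of tracing every function call, a background thread wakes up every few milliseconds and records which function the profiled thread is currently executing.

```python
import collections
import sys
import threading
import time


def sample_stacks(thread_id, interval, stop_event, counts):
    """Record, every `interval` seconds, the function the target thread is running."""
    while not stop_event.is_set():
        # CPython-specific: maps each thread id to its current top stack frame.
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def sample_function(func, *args, interval=0.005):
    """Run func(*args) in the calling thread while a sampler thread observes it."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval, stop_event, counts),
        daemon=True,
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts


def busy_work(duration):
    """Hypothetical CPU-bound workload: spin for roughly `duration` seconds."""
    deadline = time.monotonic() + duration
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    samples = sample_function(busy_work, 0.3)
    for name, hits in samples.most_common(5):
        print(f"{name}: {hits} samples")
```

The cost of this approach is set by the sampling interval rather than by the number of function calls the program makes, which is why sampling is a common choice when the profiled process cannot be slowed down much. A production-grade tool would also capture full stacks, all threads, memory, and I/O.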

Why is Python so slow compared to other languages?

VS: I think it is important to keep in mind that the people impacted by the performance of Python are in the minority. For most people, Python is efficient enough for their use case. Moreover, it should be noted that CPython is getting faster every year. Now, there are different reasons behind Python’s slowness. The first is the official statement, which is that there is no such performance issue! Saying this is preventing us from working on the issue. The other reasons are mostly technical. Python is a relatively old programming language, and some design choices that were made 30-odd years ago suited the hardware that existed at the time but no longer suit today’s. For example, the first version of Python didn’t support threads, whereas today applications commonly use threads on machines with 8 to 64 CPUs. And changing the implementation of Python to make better use of the hardware requires a lot of effort, mainly because Python is very popular nowadays. Each time we push an incompatible change, even a tiny one, dozens of people complain. But in order to adapt to the new hardware, we have to push incompatible changes! Finding a trade-off between the stability of the language and the evolution of the implementation is a very difficult task.

JD: I agree with Victor. I don’t see a lot of people working upstream on Python who are trying to improve its performance, which is a shame considering the number of people who use the language and want better performance from it in general.

VS: I think there is also a financial reason that prevents developers from spending time on optimization, because it’s difficult to justify working on performance, as most of the money is usually spent on development. But there are exceptions: some companies pay developers to work on performance, such as Dropbox on a project called Pyston, Google with Unladen Swallow, and Microsoft on Pyjion. For other companies, sometimes the solution is more about identifying the slowest parts of the code and rewriting them in a different language, or changing the architecture of the application to split it into multiple processes or to better distribute the workload across multiple servers. The example of Dropbox is interesting, because two developers worked for three years on the Pyston project to optimize Python, while another team rewrote some of the code in the Go language. At the end of the three years, the second team succeeded in optimizing the workload of Dropbox, so the company stopped investing in optimization as it was no longer needed by customers.

JD: So when coding with Python, you might need hardware that is a little bigger, just because it’s a bit slower than doing C directly, but you actually spend a lot less on development time and people. With Python, it’s really easy to prototype anything! It’s really easy to do a Python script and then make that evolve into a real application. And it’s actually way cheaper than trying to build something that is pretty fast and low level, and spending a lot of money on development time and complexity. So it’s a trade-off, but I think it’s worth it.

What would your advice be for Python developers wanting to optimize their code in terms of performance?

JD: If you want to make your Python program faster, the first thing you need to do is actually profile. I do say that a lot, but developers need to understand where the bottleneck is in their program, what makes it slow, and what needs to be optimized. Then, as a developer, you also need to think ahead when you design and build your application! For example, if you’re going to build a large application that is going to be distributed, you’ll need to think about how you’re going to split your workload. That’s a problem you see in other programming languages as well, but there are some caveats in Python with the global interpreter lock, which makes threads particularly slow for CPU-bound work. So you’ll need to go with multiple processes, distribution across different systems, or async I/O (an event loop, basically) to make your program faster, and not just threads, as you would have done 10 or 20 years ago.
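As an illustration of the multi-process option mentioned above, the following sketch splits a CPU-bound job across worker processes with the standard library's `concurrent.futures.ProcessPoolExecutor`. The workload (`count_primes`) and the chunking helper (`count_primes_parallel`) are hypothetical examples, and the sketch assumes `limit` is a positive integer. Because each worker is a separate process with its own interpreter and its own GIL, the chunks can run on different cores.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(start, stop):
    """Count primes in [start, stop) by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limit, workers=4):
    """Split [0, limit) into one chunk per worker and sum the partial counts."""
    step = -(-limit // workers)  # ceiling division so the chunks cover the range
    starts = list(range(0, limit, step))
    stops = [min(start + step, limit) for start in starts]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, starts, stops))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on startup.
    print(count_primes_parallel(200_000))
```

Processes do not share memory the way threads do, so arguments and results are pickled and sent between them. The approach pays off when each chunk does enough computation to outweigh that communication cost.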

VS: The first thing to do if you care about performance is to run a benchmark. That’s the most important part! You need to have a reference point to make sure that the optimized code is really faster. The issue is that many developers are not using benchmarking tools properly and end up making bad decisions. I’ve personally spent a lot of time fixing the tooling used to run benchmarks, because we didn’t have very reliable benchmarking tools in Python, and it was very difficult to reproduce results between two computers. I also created the pyperf module to help with the writing, running, and analyzing of benchmarks, which is now used by the pyperformance benchmark suite. And then there are many existing solutions that improve performance: for scientific code, there is Numba for example, which is a JIT compiler, and for the general case, there is the PyPy implementation. But be careful: depending on your workload, PyPy can use more memory, which can be an issue. And the last solution is what Julien has already talked about, which is changing the distribution of your workload or your architecture.
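A minimal sketch of the "reference point" idea, using the standard library's `timeit` module as a stand-in for the pyperf module named above (pyperf is a separate package with more rigorous statistics). The two string-building functions and the `best_time` helper are hypothetical. The helper repeats each measurement several times and keeps the minimum, which is less sensitive to interference from other processes than a single run.

```python
import timeit


def build_with_concat(n):
    """Build a string by repeated concatenation."""
    result = ""
    for i in range(n):
        result += str(i)
    return result


def build_with_join(n):
    """Build the same string with a single str.join call."""
    return "".join(str(i) for i in range(n))


def best_time(func, n, repeat=5, number=200):
    """Return the best observed time, in seconds, for one call of func(n)."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    # Check that both candidates agree before comparing their speed.
    assert build_with_concat(1_000) == build_with_join(1_000)
    for func in (build_with_concat, build_with_join):
        print(f"{func.__name__}: {best_time(func, 1_000) * 1e6:.1f} µs per call")
```

The numbers depend on the machine, the Python version, and system load, so they are only meaningful when both candidates are measured under the same conditions.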

There was an attempt to remove the global interpreter lock (GIL) from the CPython implementation to optimize performance—what’s the current situation with that project?

VS: When you add more CPUs, more threads, the application becomes less efficient, which is counterintuitive. And when I discussed this issue with Larry Hastings and other developers working on the Gilectomy project, what I understood is that one of the bottlenecks is the C API, because we expose too many implementation details.

JD: The last time I heard about the Gilectomy project was two years ago. And I think a lot of people are actually waiting for Python to be able to run tons of threads. But I think it’s very hard because they built the language around the assumption that multithreading is safe in many places, and you can’t change that without breaking a lot of Python code. So it’s going to be a very tricky thing to do.

VS: There is another experimental project, with sub-interpreters, which has a different approach compared to multiprocessing. The idea is that, inside a single process, you have multiple instances of Python, each being independent, with its own memory and data as well as its own GIL. By using this design, we may be able to run 2 instances in parallel, for example. But implementing this project requires a lot of effort and there are not many people working on it right now. The leader is Eric Snow, who is at Microsoft but also works on this project in his free time. I’m trying to help, but I don’t have a lot of time, so unfortunately this project is not ready for production yet, but I have faith that it will succeed at some point!

And the last ongoing project to optimize Python that I heard about is HPy, which consists of writing a new C API for Python that doesn’t leak any implementation details. Because of the current C API, some C extensions are way slower on PyPy than on CPython, as PyPy is not written in the C language. So if you want to use an existing C extension such as NumPy on PyPy, PyPy has to emulate how CPython stores objects in memory, which is really inefficient. The team of PyPy developers working on this project showed that if you convert an existing C extension to their new C API, the same C extension becomes something like 4 times faster on PyPy without becoming slower on CPython. It’s important to note that the HPy project is independent of the CPython project, so it’s not limited by the backward compatibility and development slowness of CPython. In the long term, the HPy team is hoping for support from CPython and Cython but also Rust and GraalPython.

This interview has been edited for space and clarity.

This article is part of Behind the Code, the media for developers, by developers.