Posts

This is the second post on how to accelerate Python with Cython. The previous post used Cython to declare static C data types in Python, to speed up the runtime of a prime number generator. In this post we shall be modifying a more complex program, one that performs an image transform on a map. This will allow me to demonstrate some more advanced Cython techniques, such as importing C functions from the C math library, using memory views of Numpy arrays and turning off the Python global interpreter lock (GIL).

As with the previous post, we shall be making a series of Cython modifications to Python code, noting the speed improvement with each step. As we go, we’ll be using the Cython compiler’s annotation feature to see which lines are converted into C at each stage, and which lines are still using Python objects and functions. And as we tune the code to run at increasingly higher speeds, we shall be profiling it to see what’s still holding us up, and where to refocus our attention.

Although I will be using Python 3 on a Mac, the instructions I give will be mostly platform agnostic: I will assume you have installed Cython on your system (on Windows, Linux or OS/X) and have followed and understood the installation and testing steps in my previous post. This will be essential if you are to follow the steps I outline below. As stated in the previous post, Cython is not for Python beginners.

Compilation Method

Just to recap, I’ll be using a simple editor (IDLE, in my case), editing my Cython code as .py files (not .pyx files) so that IDLE can recognize them. I’ll only be converting them to .pyx when I’m ready to compile them. To compile them with Cython I’m using a combination of two files. The first is a simple executable shell file I’ve written, containing these two lines:
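The original script isn’t reproduced here, but a minimal two-line equivalent would look like this (I’m assuming the standard Cython build invocation; the exact python command name may differ on your system):

```shell
#!/bin/sh
python3 setup.py build_ext --inplace
```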

Save these lines to a file called CompileCython.sh and make it executable. Two things to note. First, you don’t have to use this file – you can just run the second line from the command line. Second, however you run it, this line will need some additional compiler flags for Windows, depending on your Python distribution.

Whichever way you choose to run it, this Python command in turn calls setup.py, which contains the compilation instructions for Cython:
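The post’s original setup.py isn’t shown here, but a minimal version looks something like the sketch below (the module name transform.pyx is a placeholder – substitute your own filename):

```python
# setup.py -- minimal Cython build configuration.
# 'transform.pyx' is a placeholder; substitute your own module name.
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("transform.pyx"))
```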

Once you have edited this file to be exactly how you want it, you will rarely have a reason to touch it. We shall only be making changes to it once, when we import the C math library later in the post.

Testing each Cython code mod will therefore be a simple five-step process:

The only other step will be to profile the code, which I’ll be doing early and often to re-check which functions are taking up most of the execution time. There are several great tools for this, but the main one I’ll be using will be IPython’s %prun profiling magic, in combination with the Cython compiler’s annotation output to check which lines are still using Python after each mod.
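If you’re not working in IPython, the standard library’s cProfile module produces the same kind of table. A minimal sketch (the function names here echo the post’s, but the bodies are stand-ins):

```python
import cProfile
import io
import pstats

# Stand-ins for the transform code: any functions whose cost we want to measure.
def inv_transform_pixel(n):
    return sum(i * i for i in range(n))

def inv_transform_map():
    return [inv_transform_pixel(2000) for _ in range(200)]

# Profile one run of the top-level function.
profiler = cProfile.Profile()
profiler.enable()
inv_transform_map()
profiler.disable()

# Print the ten most expensive functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumtime").print_stats(10)
profile_report = stream.getvalue()
print(profile_report)
```

The report shows, as in the post, that nearly all the time is spent inside the low-level per-pixel function.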

The Starting Point

The code I’ll be using is a faster version of that originally posted for performing a transverse Mercator transform. The original code did not have to be perfect, and was written to explore some interesting mathematics and cartography. It was also reworked in my parallel processing post, where it served its purpose by being computationally intensive. But for this post it needs to start lean and get leaner. So I’ve removed some unnecessary rounding and type casting, and simplified some of the equations.

This means that almost all the gains from the changes I’m about to make are down to Cython, not from restructuring earlier code. I say ‘almost all’ because once things start getting fast, any lines of Python holding things up will start to stand out, and so restructuring the code will become necessary.

Note that the total runtime takes longer when we profile the code, especially from within IPython. The point though is to find the relative execution times of each function. It’s clear that the majority of the runtime is being spent in the low-level function inv_transform_pixel(), and its sub-functions inv_merc_long() and inv_merc_lat(), with their heavy use of Numpy trigonometry. Once we have declared all our initial static C variables, these will be the areas we’ll look at first.

1. Compiling Python With Cython

But before we do that, first we need to see how much faster the code runs when it is compiled, rather than just interpreted. Compiling Python code with Cython is often the only step that many companies take, as a way of hiding their Python source code. (People often use the Python compiler Nuitka for the same reason.)

Here is the result of compiling the Python code with Cython which, apart from hiding the source code, brings the additional benefit of running slightly faster:

So again we’ve achieved only a modest runtime improvement, this time of about 7%.

Note how the HTML annotation file has barely changed for the most heavily used functions. Lines using Python functions and objects remain in yellow, but we are starting to see some white lines appear for those handling only static C variables:

There is clearly a lot more we have to do to make this code run faster.

3. Replacing Numpy Math with C Math Functions

Even though Numpy is ultimately implemented in C, at its highest level every Numpy function call is still a Python function, each of which comes with an execution overhead. The purpose of this step is to replace these Python function calls with direct C calls. Since many Numpy functions and constants are also implemented in the C math library, it’s a fairly simple exercise to import the C functions corresponding to the Numpy trigonometry functions and mathematical constants and use them directly.
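In Cython this is a one-line cimport. A sketch of the pattern (the function name follows the post; the body shown is the textbook inverse Mercator latitude formula, which may differ in detail from the original code):

```cython
# Pull the C math library's functions and constants straight into Cython.
# These compile to direct C calls -- no Python function-call overhead.
# M_PI is the C math library's name for pi.
from libc.math cimport atan, exp, M_PI

def inv_merc_lat(double y):
    # Was: 2.0 * np.arctan(np.exp(y)) - np.pi / 2.0
    return 2.0 * atan(exp(y)) - M_PI / 2.0
```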

This is an incremental time improvement of 62% for one step, and with an overall runtime now 3 times faster than Python. And here’s how Cython’s HTML annotation file now looks for the most frequently called functions:

The return lines are still in Python because the functions are still Python def functions. The next step will fix this.

4. Making Python Functions Inline C Functions

The purpose of this step is to stop making thousands of high-repetition Python function calls, which are incredibly expensive time-wise. The hard way to do this is to rewrite your code into a flat, non-hierarchical block. The easy way is to tell Cython to treat your Python functions as if you had actually done this, while leaving the code with the apparent logical structure you’ve given it. This step also requires us to replace the Python def keyword with the Cython cdef keyword. This has the primary effect of turning these Python functions into C functions, with the side effect that they can no longer be called from outside the module. As a final step, the functions’ return types have been added.
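Sticking with the same illustrative function as before (a sketch, not the post’s exact code), the change looks like this:

```cython
# 'cdef' turns this into a pure C function, no longer callable from
# Python outside the module; 'inline' invites the C compiler to expand
# it at each call site, eliminating the call overhead entirely.
# The return type (double) is now declared too.
cdef inline double inv_merc_lat(double y):
    return 2.0 * atan(exp(y)) - M_PI / 2.0
```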

It’s a small speed gain, but the Cython compiler’s annotation file reveals that the high frequency, low level functions are now entirely in C, which will give us more options later:

5. Using a C list to Store Transform Results

Moving to the higher level functions, there are two Numpy arrays used in this program. The first one [ called line_pix_mapping in inv_transform_line() and line_pixel_mapping in inv_transform_map() ] stores the transform results. This array is really just a correlation between the input and output pixel positions, without any pixel values. The second Numpy array [ called output_image_data in inv_transform_map() ] uses this transform data to create the output image from the input image data. This Cython mod replaces the first Numpy array (both as line_pix_mapping and line_pixel_mapping) with a simple C list to store the image transform correlations, as they are calculated, line by line:

So the PIL Image function getpixel() is now the slowest function in the program. The getpixel() routine was therefore dropped, and the image data read into a Numpy array when it was first opened. This shaved another 460ms off the time:

Which means the code is now running nearly 11 times faster than Python. This last gain was not directly due to Cython, of course. Cython acceleration has merely revealed a Python bottleneck which was not apparent earlier.
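The change itself is only a few lines. A sketch, assuming Pillow and Numpy are installed (the 4×4 blank image here stands in for the real map file):

```python
import numpy as np
from PIL import Image

# Read the whole image into a Numpy array once, when it is first opened,
# instead of calling img.getpixel((x, y)) thousands of times later.
img = Image.new("RGB", (4, 4))           # stand-in for Image.open("map.png")
input_image_data = np.asarray(img)       # shape: (height, width, channels)

# Per-pixel reads become plain array indexing. Note the index order:
# Numpy indexes [row, col] = [y, x], while getpixel() takes (x, y).
r, g, b = input_image_data[2, 1]
```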

7. Using Memory Views of Numpy Arrays

Memory views are a way of bypassing Numpy’s administrative overhead by reading and writing Numpy array data to and from memory directly. They are only marginally slower than C arrays and much easier to set up. The Numpy arrays are still there, and can still be accessed for what they are good for – matrix mathematics and fast array transforms, but if you just want to get your hands on the data and manipulate it, memory views are the way to go. The code to implement memory views is as follows:
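The post’s original listing isn’t reproduced here, but the declaration pattern is short. A sketch (the array name follows the post; the shape and dtype are assumptions):

```cython
import numpy as np

cdef int height = 1080, width = 1920

# The Numpy array still exists and owns the data...
output_image_data = np.zeros((height, width, 3), dtype=np.uint8)

# ...but this typed memory view reads and writes its buffer directly,
# with no per-element Numpy call overhead.
cdef unsigned char[:, :, :] out_view = output_image_data

out_view[0, 0, 0] = 255   # writes straight into the underlying array
```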

8. Turning Off the Python GIL

The fact that the three lowest level functions (the ones called thousands of times) are all now in C gives us an extra card to play in the Python acceleration game: parallel processing. As we noted earlier, the compiler’s HTML annotation file clearly shows that there are no Python objects or functions left in these functions, which means that – for these functions at least – we can turn off the Python global interpreter lock (GIL).

With Cython, this can be done at whatever granularity you like – on a function, a loop or even an individual line – by using the nogil Cython keyword. Since we now have entire functions in C, we will be using it at the function level, leaving Cython to handle all the threading issues in the background.
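As a sketch of the pattern, using the same illustrative function as before (note that prange additionally needs the OpenMP compiler flags set in setup.py):

```cython
from cython.parallel import prange
from libc.math cimport atan, exp, M_PI

# 'nogil' declares that this function touches no Python objects,
# so it can run with the global interpreter lock released.
cdef inline double inv_merc_lat(double y) nogil:
    return 2.0 * atan(exp(y)) - M_PI / 2.0

def transform_line(double[:] ys, double[:] out):
    cdef Py_ssize_t i
    # prange releases the GIL and shares the loop across threads;
    # everything inside the loop must be GIL-free C code.
    for i in prange(ys.shape[0], nogil=True):
        out[i] = inv_merc_lat(ys[i])
```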

Note: you can’t just turn off the GIL for a mid-level function, if it’s calling other functions defined in the same scope. You have to do it for all its locally-defined sub-functions as well, or else you’ll get this error message:

Which means the code is now running 39 times faster. The overall gain would obviously be greater with more cores, but with an 8% speed improvement in going from 1 to 2 cores it’s probably not worth it. Generally speaking, the gain from going parallel with Cython increases with the amount of C-only code you can get to compile without the GIL, and what percentage of the total execution time that code was taking. If you have highly computational for-loops and functions taking up most of your execution time, the result for your own project might be more dramatic.

Python With Cython Mods vs. Parallel Python

It’s worth noting that the gains achieved from parallel computing in a previous post using almost the same code were nominally much higher – over 40% faster in going from 1 to 2 cores. A couple of things to note though.

First, it’s easier to make inefficient code run faster: on two cores it was still taking nearly 4 seconds. Second, even with the best coders money can buy, the most you can theoretically expect from adding more cores is a linear division of the original runtime. So even if you managed to scale the original Python to run on 32 cores (after much time, coding effort and expense) and achieve the theoretical limit in speed, you would still only have a runtime of 220ms, half as fast as Python with Cython mods running on one core.

Finally, the notion of Cython vs. parallel Python doesn’t really make sense. Cython gives you near-C speeds, on top of which you still have the option of running on multiple cores, if you choose to do so.

The Final Profile

The final Python profiler run shows that the function inv_transform_line() still takes over half the execution time. The good news is that that same execution time has been reduced from 4 seconds to a little over 100ms:

Note how the Python profiler can no longer see inside inv_transform_line(), since the sub-functions it calls no longer use the Python global interpreter lock (GIL).

Summary of Changes

We have taken some simple Python image transform code and made it run 39 times faster than the original Python, using nothing but a laptop. Here is a summary of the changes we made:

From the table we can see that the single most dramatic change was step 3 – replacing the Numpy math functions with directly imported C math functions. We can also see that many of these changes had to happen in roughly this order. Certainly, 3, 4 & 5 could have happened in any order, but 1 and 2 had to happen first. And step 8 could have happened earlier, but not before the lower level functions were rendered entirely into C functions at steps 2, 3 & 4. Similarly, step 7’s memory views needed all the input image data to be first loaded into an array, which meant that the PIL routine getpixel first had to go in step 6.

Ultimately, you can make as many Cython mods to Python as you like, but the trick is knowing when to stop. Using three dimensional C arrays for the image data illustrates the point. They would make the two highest level functions noticeably more complex, and therefore significantly harder to maintain – even by yourself in six months’ time – in return for perhaps only a few milliseconds’ gain. Memory views of Numpy arrays might be fractionally slower than C arrays, but the code is somewhat easier to understand. The law of diminishing returns very much applies here.

There is much more to Cython, but these two posts should be enough to illustrate what is possible. The changes are always a trade-off between how much work you are willing to put in, the gains you might achieve, how complex the code might become, and so how difficult it might be to maintain when you’re finished. If you’ve followed all the steps you should be able to work your way through the Cython online documentation, and you’ll soon be ready to bring C-like speeds to your own Python applications.

This longer post will show you some of the coding skills you’ll need for turning your existing Python code into the Python-C hybrid we call Cython. In doing so, we’ll be digging into some C static data types, to see how much faster Python code will run, and restructuring some Python code along the way for maximum speed.

With Cython, all the benefits of Python are still yours – easily readable code, fast development cycles, powerful high level commands, maintainability, a suite of web development frameworks, a huge standard library for data science, machine learning, imaging, databases and security, plus easy manipulation of files, documents and strings. You should still use Python for all these things – these are what Python does best. But you should also think about combining it with Cython to speed up the computationally intensive Python functions that need to be fast.

This post continues my investigation into running standard Python on multiple cores using the concurrent.futures module. This time, I will be taking some image processing code from a previous post on the transverse Mercator transform and attempting to scale it to multiple CPUs.

With the impending demise of Moore’s Law, multiple cores are a common manufacturers’ workaround for improving hardware performance, whether or not your installed apps can use the parallel architecture.

And with each new release of Python, parallel programming gets even easier. But the degree to which your code can use your multiple cores will depend on the kind of problem you are trying to solve, on the implementation of Python you are running and, as it turns out, how truly parallel the underlying architecture of your system actually is.

The goal of this series of posts is to see how adaptable some of my existing code is to multi-core hardware, to see what changes need to be made to scale it, and to measure the performance improvements from the exercise.

With global warming melting the icecaps and opening up the poles for oil exploration and tourism, I think it’s time for a new standard wall map, one that shifts those distorted map regions away from major land masses, and places the polar regions where we can see them. That way, our cruise ship and oil tanker captains can navigate more easily through the clear, blue Arctic Ocean, unimpeded by any tiresome ice-pack.

I particularly love that oil companies want to use the new Arctic Ocean sea lanes to transport their oil to market faster. Is it irony, or some form of rare, extinction-level stupidity that only comes around one every few thousand years? Hard to tell. But I digress.

Alas, dear Windows, it was not to be. I’m afraid I’ve been seeing other platforms. Specifically, I’ve been spending time with OS/X behind your back. It was just too painful to be with you. All those arguments, the shouting, the hair-pulling, the throwing things across the room.

Sure, you’re a lot less volatile than you used to be. And you don’t do the tearful breakdown thing any more. Yes, I know I can do almost anything with you that I can with OS/X, but everything just takes longer. OK, you want me to be honest? Fine. I find you excruciatingly frustrating to be with. Why is it always ME navigating around YOUR moods? I mean, why is it that after 25 years, everything with you is STILL a workaround?

After using Spyder for a couple of years, I recently changed my Python IDE from Spyder to PyCharm Community Edition (CE). And since I’ve now used both, I thought I’d share my impressions of each with you.

This post describes the process I used to design an algorithm that allows you to implement a modified Sieve of Eratosthenes to bypass the memory limitations of your computer and, in the process, to find big primes well beyond your 64-bit computer’s supposed numerical limit of 2^63 (about 9.223e18). Beyond that, with this algorithm, the only limitation is the speed of your CPU.

The Sieve of Eratosthenes is a beautifully elegant way of finding all the prime numbers up to any limit. The goal of this post is to implement the algorithm as efficiently as possible in standard Python 3.5, without resorting to importing any modules, or to running it on faster hardware.

Eratosthenes was a Greek scholar who lived in Alexandria (276BC to 194BC) in the so-called Hellenistic period. He was working about a century after Alexander, and about a century before the Romans arrived to impose their cultural desert and call it peace. And then do nothing with the body of knowledge they discovered. Literally. For over 1,600 years, if you count Constantinople. Not a damn thing.

So much for overly religious, centralised, bureaucratic superstates, obsessed with conquest. But I digress.

OK, one for the Mac users. Continuing the theme of user interfaces, here’s a simple but powerful way of using AppleScript to create a user interface for your Python programs and shell scripts and sending the results to just about any application installed on your Mac.

This solution has the advantage over Python’s native Tkinter in that the development time is much faster, and uses the speech synthesis features of OS/X to make your code much easier to use for the non-technical, elderly or visually impaired.

With complex numbers, I always feel as if I’m getting a glimpse of something truly awesome that lies hidden within mathematics. The first time I understood how they worked, I thought it was some form of magic.

I get the same feeling with prime numbers. Like many, I’ve looked at them from all angles – prime gaps, large primes, prime densities, prime sieves – and they continue to fascinate. A few months ago I was thumbing through Henry Warren’s programmer’s cookbook Hacker’s Delight and discovered a whole chapter on the various formulas for (some of) them. Mind-bending stuff.

This is the second of two posts on how to quickly create a Tkinter dashboard for your command line Python programs. The Tkinter widgets and programming techniques it introduces are a sequel to the previous post.

So far, you have an interactive graphical way of opening a file to analyse it in some way with your own logic, entering text to use as triggers or search strings, setting your own program flags on/off using check boxes, switching between two or more mutually exclusive program flags using radio boxes, controlling access to widgets and the variables they control, calling your own logic, and saving your results in a new file.

This post will build on these skills by showing how to create a dashboard to accept numerical input, perform different kinds of type- and value-checking, and select multiple input files simultaneously using a Tkinter GUI file selector. The solution is multi-platform, and is shown running above on (from left) Windows 10, Mac OS/X and Linux Mint. This post will explain how to create the same thing for your own program. Before proceeding, make sure you’ve read and understood the previous post.

The next problem I needed to solve was to come up with a simple graphical user interface (GUI) template as a front-end for configuring and launching any Python code or module I may wish to write or run. Initial impetus: I didn’t want to have to write a user interface from scratch every time I wrote some Python code to manipulate text, data or files. Bonus reason: if I made the GUI template generic enough, others might be able to use it to create their own user interfaces.

This would solve a problem that occurs in many technical fields. A university professor may have a post-doc researcher on her team, one who has written a complex command line program performing, e.g. image processing, AI or genetic analysis. At some stage, there may be some highly repetitive tests that can be performed by someone less technical, freeing up the researcher. She wouldn’t want him running these repetitive command-line tests with code only he knows how to run or, worse, sitting around designing complex user interfaces for others to use it. It would be better to get an intern or research assistant (or even a temp) to run the tests using a GUI that the researcher can knock up in a day or two. This would free him up to concentrate on his research. And finish it faster.

The first problem I wanted to solve was to write a short program that would allow me to perform basic textual analysis of any work of literature.

I wanted to be able to study the richness of different authors’ language by looking at how they used neologisms (their own made up words), pseudo-archaisms, invented their own contractions for authentic speech, or used hyphenated compound words, etc. I also wanted to be able to list all the characters and place names (proper nouns) mentioned in a text.

After talking to a good friend who is an experienced coder, I decided on the following:

The IDE

Spyder, running Python 3. It seems to have everything I need, including a good debugger, a variable explorer, hot-linking to function definitions, auto-completion typing, Matplotlib, QT, plus a choice of either a Python or an IPython console (each with its own strengths). The bundle I went with is Spyder for WinPython-64bit (WinPython-64bit-3.4.4.3Qt5). The QT will be useful later.