As you can see, the code aligns the sequences in some way.
The output seems reasonable given the simplicity of the alignment scores, which is great news for us.
We can now move on to establishing a baseline for how fast the code runs.
The Unix command time is perfect for this task.

> time python alignment.py

Running this command in the terminal reports the baseline to be around 0.502s.

At this point it is interesting to measure the runtime under an optimising Python interpreter, PyPy.
Download and install PyPy according to the instructions on their download page.
For those lucky enough to use OSX, Homebrew provides a simple interface for installing PyPy:

> brew install pypy

Since PyPy is an optimising implementation of Python, we would expect the code to run faster on it than on CPython.
To test this assumption, we can again use time:

> time pypy alignment.py

This command reports the code to run for 0.255s – almost twice as fast as under the standard Python interpreter.
Thinking about this, it is quite impressive that we were able to nearly halve the runtime without any significant effort on our side (besides installing PyPy, of course).

Profiling the Baseline

Now let’s see if we can speed things up even further.
Let’s fire up cProfile and investigate where the program spends the majority of its time.
The standard way to run cProfile is given below:

> python -m cProfile -o profile.pstats alignment.py

The -o profile.pstats option tells the profiler to save the stats in the file profile.pstats, which we are going to view using the gprof2dot script and Graphviz.

If you do not yet have these packages, you should be able to install gprof2dot using pip directly:

> pip install gprof2dot

If your computer does not recognise the pip command, try installing it first using these instructions.

Since Graphviz is not a Python package it cannot be installed via pip; follow the instructions on its download page for guidance on how to obtain it. Alternatively, you can use your distribution’s package manager. For instance, the aforementioned Homebrew on OSX can do this (brew install graphviz).

> gprof2dot -f pstats profile.pstats -n 0 -e 0 | dot -Tpdf -o profile.pdf

Here -n 0 -e 0 just tells gprof2dot to draw all nodes (by default it cuts a few unimportant ones off for visual clarity). The piped dot command generates profile.pdf with the profile of function calls for the program.

Open this profile.pdf file with your favourite PDF viewer to see the results.

Improving the code based on profile results

From the profile above we can see that the majority of the time is spent in the align function, just as we would expect, 35% of which is spent in the function max.
Of this time, 9.75% is spent in the lambda function lambda x: x[0], which just tells max to take the first element of each tuple and use that for comparison.

Judging from these results, one must ask whether using the max function for only three elements is overkill.
Maybe we were too smart here and could actually get away with simpler code? Let’s replace the call to max with a block of nested if statements:
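The original listing is not reproduced here, but a minimal sketch of the idea looks like the following. The move constants and the (score, move) return shape are assumptions about the original code, not the author’s exact names:

```python
DIAG, UP, LEFT = 0, 1, 2  # hypothetical move codes used by the traceback


def best_of_three(diag_score, up_score, left_score):
    """Pick the best (score, move) pair with nested ifs instead of max()."""
    if diag_score >= up_score:
        if diag_score >= left_score:
            return diag_score, DIAG
        return left_score, LEFT
    if up_score >= left_score:
        return up_score, UP
    return left_score, LEFT
```

Inlining the comparisons like this avoids both the tuple construction and the per-element lambda calls that the profile highlighted.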

We can see that the CPython version of the code became much faster (remember, it was around 0.5s before), whereas the PyPy runtimes stayed roughly the same, suggesting that PyPy had already performed this optimisation on its own.

Running cProfile again, we can see that the align function is still the bottleneck. Unfortunately, the profile no longer tells us what is making it slow:

Profiling line-by-line

We will use a line profiler to get more granularity for our profiles.

First install it if you haven’t already:

> pip install line_profiler

Now, in order for the profiler to work, we need to tell it which functions we want to profile. This is done by decorating said functions with the @profile decorator.
Open align.py in your favourite text editor and add the line @profile on top of the def align ... line, i.e.:
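As a sketch (the signature of align is assumed), the decorated function looks like this. kernprof injects profile into the builtins at runtime, so a small no-op fallback is included here so the file also runs outside the profiler:

```python
try:
    profile  # defined by kernprof when run under the line profiler
except NameError:
    def profile(func):
        # No-op fallback so the script still runs without kernprof.
        return func


@profile
def align(seq_a, seq_b):
    ...  # the alignment code being profiled
```

The line-by-line timings are then printed by running the script through kernprof: kernprof -l -v align.py.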

We can see we spend a significant amount of time in lines that access the matrix, i.e. lines 26-28, 39.
Similarly, a significant amount of time is spent in the comparisons and other parts of the loop.
Optimisation of any of these lines will provide the most benefit. Let’s see if we can find a way to perform these optimisations.

Cythonising the code

Cython can be considered to be a compiler for Python.
Generally, it is able to convert the Python code into C, which should provide a significant performance boost for simple statements such as element access in a matrix.
In order to start using it, we need to install it first; pip can handle this for us:

> pip install Cython

Once the installation is complete, it is ridiculously easy to convert the code to Cython format:

> cp align.py align.pyx

You got it right: we just copied the file to a file with a different extension (.pyx is the Cython file extension).
In order to be able to run it, however, we need to compile this file. This is done by creating a setup.py.
The following code provides the minimal skeleton for such a file:
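The exact file is not shown here, but a minimal skeleton along these lines (assuming the classic distutils + Cython.Distutils layout) matches the description that follows:

```python
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[
        Extension('align_cythonised', ['align.pyx']),
    ],
)
```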

Line eight in this file can roughly be translated as: compile the file align.pyx into a library called align_cythonised.

Now let’s quickly compile our code using this setup file:

> python setup.py build_ext --inplace

The command should generate align.c and align_cythonised.so files in your local directory (do not forget the --inplace flag, as otherwise the library would be compiled into the build/ directory).

We can now use the timeit module to compare the runtimes of these functions.
It is convenient to use IPython (http://ipython.org/, pip install ipython) for timeit tests due to its magic command %timeit. I use it here:
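The session itself is not reproduced here, but the equivalent measurement with the standard timeit module looks roughly like this. The align function below is a stand-in placeholder, not the real alignment code:

```python
import timeit


def align(a, b):
    # Placeholder for the function under test; substitute the real
    # align from align.py or align_cythonised here.
    return sum(1 for x, y in zip(a, b) if x == y)


n = 100
total = timeit.timeit(lambda: align("ACGT" * 50, "ACGA" * 50), number=n)
print("%.3f ms per call" % (total / n * 1000))
```

In an IPython session the same measurement is simply %timeit align(a, b), with the statistics printed for you.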

We can see that we got a 69 ms improvement just by compiling the code, without changing a single line – not bad.
Note that PyPy runs this function in 168 ms – 21 milliseconds faster.
One could overstretch this result and say that Python runs faster than C in this scenario (hehe).

Let’s see if we can make Cythonised code beat this though.

Numpy and Cython

Cython collaborates closely with the numpy developers and is able to use numpy’s C interface natively. We can exploit this native interface to our advantage.
To do so, we will change our matrix to be a numpy n-dimensional array rather than a dictionary.

First of all, we need to install numpy; pip can be used to do this yet again:

> pip install numpy

You might need a Fortran compiler to do so (on OSX you can obtain one from Homebrew: brew install gfortran). If you are having trouble with this step, consult the downloads page on how to get the precompiled library directly.

Once this is all set, copy align.pyx into align_numpy.pyx and change it slightly to use numpy.
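The changed file is not reproduced here, but a sketch of what a typed, numpy-backed version might look like is given below. The function signature, scoring scheme, and variable names are illustrative assumptions, not the author’s exact code:

```cython
# align_numpy.pyx – sketch of a Needleman–Wunsch-style fill using a
# typed numpy buffer instead of a dictionary-based matrix.
import numpy as np
cimport numpy as np


def align(str a, str b, int match=1, int mismatch=-1, int gap=-1):
    cdef int i, j, diag, up, left
    cdef int n = len(a), m = len(b)
    cdef np.ndarray[np.int64_t, ndim=2] score = np.zeros(
        (n + 1, m + 1), dtype=np.int64)

    for i in range(1, n + 1):
        score[i, 0] = score[i - 1, 0] + gap
    for j in range(1, m + 1):
        score[0, j] = score[0, j - 1] + gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1, j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1, j] + gap
            left = score[i, j - 1] + gap
            # Nested ifs instead of max(), as in the earlier optimisation.
            if diag >= up and diag >= left:
                score[i, j] = diag
            elif up >= left:
                score[i, j] = up
            else:
                score[i, j] = left
    return score[n, m]
```

The cdef declarations let Cython generate plain C array accesses for the matrix instead of going through the Python object protocol.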

As you can see, the total runtime is 101 milliseconds. This is 157 ms faster than the original version of the code, and all we did was start using numpy and add a couple of static type declarations.

This Cythonic code can be optimised further by enabling potentially dangerous options, such as disabling bounds checking.
Furthermore, if you want to take optimisation matters into your own hands, one could run native C code from Cython.
I refer the reader to Cython documentation and various tutorial videos from SciPy conferences for more information on these advanced topics.
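As a quick sketch, disabling bounds checking can be done per function with Cython compiler directives (the function name here is assumed):

```cython
cimport cython


@cython.boundscheck(False)  # skip per-access bounds checks on buffers
@cython.wraparound(False)   # disallow negative indexing
def align(str a, str b):
    ...
```

With these directives an out-of-range index becomes undefined behaviour rather than an IndexError, so they should only be enabled once the indexing logic is known to be correct.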

Conclusions

In conclusion, this guide provides a reference for some quick optimisations you can make to your code without significant effort on your side.
We were able to significantly improve the runtime of Python code by profiling to find the bottlenecks and making simple, yet incremental, changes.

A word of warning, though: these alternative interpreters/compilers for Python should be considered only as a last resort. More often than not, there is a way to obtain sufficient performance gains by simply changing bits of the code, as we did in the beginning. However, if this proves insufficient for your needs, these optimisation tools already exist and will continue to improve over time.