Blog ~ Saulius LukauskasJekyll2015-10-13T12:09:47+01:00/Saulius Lukauskas//articles/2014/02/13/why-your-python-runs-slow-part-1-data-structures2014-02-13T00:00:00+00:002014-02-13T00:00:00+00:00Saulius Lukauskas<p>Almost a year ago, in Waza 2013, Alex Gaynor delivered an excellent talk on <a href="http://vimeo.com/61044810">“Why Python, Ruby and Javascript are slow”</a> . The key of his talk, as he emphasised, is in the present tense. In other words, even if these languages <em>are</em> slower now, it does not mean that the situation is hopeless and will necessarily stay that way forever.</p>
<p>When one asks a question among the lines of why Python is slower than C, dynamic typing is the first answer to get thrown around. While it is true that dynamically typed programming languages have some overhead associated with them, it may not be the major factor slowing down <em>your</em> Python code.</p>
<p>Dynamic typing (and other <em>magic</em> features of programming languages like Python) make the interpreters for the language harder to optimise. There is a huge difference between interpreter being harder to optimise and your code being slower, however. In fact, as Alex himself mentions in the talk, we have had years of research focusing on what is the best way to perform type checking at runtime. This research has already lead to ways of making this overhead negligible.</p>
<p>In reality, the significant runtime differences between Pythonic code and C boil down to different data structures and algorithms used in each of the cases. Sometimes even without the programmers being aware of them.</p>
<h2 id="you-code-differently-in-python">You Code Differently in Python</h2>
<p>Let’s illustrate the point in previous paragraph with an example from Alex’s talk.
A Python programmer would be likely to use a similar code snippet to code a container for 2D point in Python:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">point</span> <span class="o">=</span> <span class="p">{</span><span class="s">&#39;x&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">&#39;y&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">}</span></code></pre></div>
<p>The solution easy to read, even easier to code and elegant in general.</p>
<p>A C developer, on the other hand, would use a <code>struct</code> for 2D point container:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">struct</span> <span class="n">Point</span> <span class="p">{</span>
<span class="nb">int</span> <span class="n">x</span><span class="p">;</span>
<span class="nb">int</span> <span class="n">y</span><span class="p">;</span>
<span class="p">};</span></code></pre></div>
<p>While this solution is as elegant and as easy to work with as the Pythonic solution, it is completely different data structure. Here we tell the compiler that each point would have exactly two fields, <code>x</code> and <code>y</code>. Knowing the size of these fields, the compiler could take the hint and allocate them near each other, as a single contiguous memory block. In other words, the structure would be similar to an array. The compiler would know exactly where <code>x</code> and <code>y</code> fields of a given point are at any time. We could also access both of these fields in memory easily, just by looking some constant offset away from where the point itself is.</p>
<p>Pythonic code uses a hash table-like data structure to perform a similar task. The interpreter cannot simply preallocate a contiguous block of memory to store both <code>x</code> and <code>y</code> next to each other to make them easily accessible. This is impossible as we could also have any number of keys present in the dictionary at any time. We could also delete these keys if we wanted to. The interpreter must use a hash function to be able to map any input you could throw at him to a memory location. Needless to say, these functions add a some overhead to the calculation. While it is usually small, it may as well be significant enough to slow down your code, especially if you have to do a lot of them.</p>
<p>If someone was to directly translate the pythonic code to C++, they would probably end up with something similar to:</p>
<div class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="n">std</span><span class="o">::</span><span class="n">hash_set</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">,</span> <span class="kt">int</span><span class="o">&gt;</span> <span class="n">point</span><span class="p">;</span>
<span class="n">point</span><span class="p">[</span><span class="err">“</span><span class="n">x</span><span class="err">”</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span>
<span class="n">point</span><span class="p">[</span><span class="err">“</span><span class="n">y</span><span class="err">”</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span></code></pre></div>
<p>Looking at the code snippet, it looks as if the language designers themselves went to great lengths to make the hash tables as complicated to use as possible, just so nobody would use them by accident. And rightfully so. Because of this <em>feature</em>, a newcomer to the C world would be able to spot that there’s something ridiculously wrong with this code. This begs to question why is it acceptable in Python?</p>
<p>The answer is in the “dictionaries are lightweight objects” mindset of Python programmers.
Consider the following code, which is the closest thing in Python to the C <code>struct</code> above<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="lineno">1</span> <span class="k">class</span> <span class="nc">Point</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="lineno">2</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span>
<span class="lineno">3</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="lineno">4</span> <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span></code></pre></div>
<p>This representation provides useful hints to the compiler, just like the C <code>struct</code> does. For instance, in the second line, we are explicitly telling the interpreter that our point object will always have at least two fields <code>x</code> and <code>y</code> when it is created, hoping it would be able to deal with them.</p>
<p>Unfortunately, the standard Python interpreter, also called CPython, is not able to make much use of this representation. The following code block takes 186 microseconds to run on my machine:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sum_</span><span class="p">(</span><span class="n">points</span><span class="p">):</span>
<span class="n">sum_x</span><span class="p">,</span> <span class="n">sum_y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">point</span> <span class="ow">in</span> <span class="n">points</span><span class="p">:</span>
<span class="n">sum_x</span> <span class="o">+=</span> <span class="n">point</span><span class="p">[</span><span class="s">&#39;x&#39;</span><span class="p">]</span>
<span class="n">sum_y</span> <span class="o">+=</span> <span class="n">point</span><span class="p">[</span><span class="s">&#39;y&#39;</span><span class="p">]</span>
<span class="k">return</span> <span class="n">sum_x</span><span class="p">,</span> <span class="n">sum_y</span></code></pre></div>
<p>A similar code block that replaces <code>point['x']</code> with <code>point.x</code> takes 201 microseconds to run on the same machine. In other words, using the class <code>Point</code> is 8% slower.</p>
<p>In CPython, <code>point.x</code> is often executed in the same way as the statement <code>dict(point)['x']</code><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.
This means that the class implementation of a point still uses a dictionary lookup, as before, with some additional overhead. In this sense, it is easy to see why the dictionaries are thought to be “lightweight objects”.</p>
<p>Some Python implementations that were specifically designed to be efficient, such as <a href="http://pypy.org">PyPy</a>, are able to flip the scales for these two approaches around. For instance, the same code snippets run for 21.6 and 3.75 microseconds respectively when using PyPy instead of Python. While both of these results are significantly faster than their CPython counterparts, using a class is 5.76 <em>times</em> faster under this JIT-capable interpreter, than using a dictionary. In other words, PyPy are able to use correct data structure here.</p>
<p>I encourage you to look at the smallest number again – 3.75 microseconds. This number means that we could perform 266 000 of these calculations in a second, all from Python, with dynamic typing, monkey-patching, etc. still built into it. All of this, by just using a better data structure, both in the code, and in the implementation. Next time, you are writing a line of code in Python, have a thought of what data structure you are using, both explicitly and implicitly, and ask yourself whether there is a better one. This is what you would do in C, is it?</p>
<p>Finally, I like to believe that this article is a clear example why the future is bright for Python (and similar languages). The standard Python implementation, CPython is there only to act as reference. It has never been designed to be the fastest one. As we have seen today, alternative implementations, like PyPy, are already able to optimise your code to great lengths. These optimisations are possible, even with all the challenges offered by the nature of the programming language. We only had 23 years of Python interpreter development, how would things look like when Python is 42, like C?</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>One would be right to argue that <a href="http://docs.python.org/2/library/collections.html#collections.namedtuple"><code>collections.namedtuple()</code></a> is even closer to C struct than this. Let’s not overcomplicate things, however. The point I’m trying to make is valid regardless. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Consult python <a href="http://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy">documentation</a> for a detailed view of how attributes work in python classes. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="/articles/2014/02/13/why-your-python-runs-slow-part-1-data-structures/">Why Your Python Runs Slow. Part 1: Data Structures</a> was originally published by Saulius Lukauskas at <a href="">Blog ~ Saulius Lukauskas</a> on February 13, 2014.</p>/articles/2014/02/12/how-to-make-python-faster-without-trying-that-much2014-02-12T00:00:00+00:002014-02-12T00:00:00+00:00Saulius Lukauskas<section id="table-of-contents" class="toc">
<header>
<h3>Contents</h3>
</header>
<div id="drawer">
<ul id="markdown-toc">
<li><a href="#starting-code">Starting code</a></li>
<li><a href="#baseline">Baseline</a></li>
<li><a href="#profiling-the-baseline">Profiling the Baseline</a></li>
<li><a href="#improving-the-code-based-on-profile-results">Improving the code based on profile results</a></li>
<li><a href="#profiling-line-by-line">Profiling line-by-line</a></li>
<li><a href="#cythonising-the-code">Cythonising the code</a></li>
<li><a href="#numpy-and-cython">Numpy and Cython</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>
</div>
</section>
<!-- /#table-of-contents -->
<p>This article is based on a demo I gave in a session at the <a href="http://www.theosysbio.bio.ic.ac.uk/">Theoretical Systems Biology Group</a> on 4th of February.</p>
<p>The example focuses on optimising a small program that computes the best alignment for two protein sequences.</p>
<h2 id="starting-code">Starting code</h2>
<p>The code for this demo is available <a href="https://github.com/sauliusl/python-optimisation-example/">directly from GitHub</a>.
Please checkout the <code>starting-code</code> tag to begin with:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">git clone git@github.com:sauliusl/python-optimisation-example.git
git checkout starting-code</code></pre></div>
<p>Alternatively, you can browse the code <a href="https://github.com/sauliusl/python-optimisation-example/tree/starting-code/alignment">here</a>.</p>
<p>Take a look at the directory alignment, there are three files:</p>
<ul>
<li><code>align.py</code> – the file will be the target of our optimisation today, as it contains the routine to compute the dynamic programming matrix;</li>
<li><code>alignment.py</code> – the main executable that calls this function and prints the output.</li>
<li><code>sequences.py</code> –the file that provide a convenient way to get the sequences.</li>
</ul>
<h2 id="baseline">Baseline</h2>
<p>Before starting any optimisations, let’s verify whether the code works at all:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; python alignment.py
Score: 315
Alignment:
MEE-PQSD-PS-VE-PPLSQETFS-DLWKLL-PE--NNVL-SP--LPSQAMDD-LMLSP-
MEES-QSDI-SL-EL-PLSQETFSG-LWKLLPPEDI---LPSPHC-----MDDLL-L-PQ
&lt;... snip ...&gt;
KE-PG-GSRAHSS-HLK-SKKGQSTSRHKK-LM--FK-TEGPDSD
-ES-GD-SRAHSSY-LKT-KKGQSTSRHKKT-MVK-KV--GPDSD</code></pre></div>
<p>As you can see the code aligns the sequences in some way.
The output seems reasonable given the simplicity of the alignment scores, which is great news for us.
We can now continue to establishing a baseline on how fast the code runs.
Unix command <code>time</code> is perfect for this task.</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; <span class="nb">time </span>python alignment.py</code></pre></div>
<p>Typing this command into the terminal reports the baseline to be around 0.502s.</p>
<p>At this point it is curious to measure the runtime using an optimising python interpreter, <a href="http://pypy.org">PyPy</a>.
Download and install PyPy according to instructions in <a href="http://pypy.org/download.html">their download page</a>.
For those lucky enough to use OSX, <a href="http://brew.sh/">Homebrew</a> provides a simple interface to install PyPy:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; brew install pypy</code></pre></div>
<p>Since we know PyPy is an optimising Python compiler, we would expect the code to run faster than on it than on CPython.
To test this assumption, we can again use <code>time</code>:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; <span class="nb">time </span>pypy alingment.py</code></pre></div>
<p>This function reports the code to run for 0.255s – almost twice as fast as the under the standard python.
Thinking about this, it is quite impressive that we were able to nearly halve the runtime without any significant effort on our side (besides installing PyPy, of course).</p>
<h2 id="profiling-the-baseline">Profiling the Baseline</h2>
<p>Now let’s see if we can speed things up even further.
Let’s fire up <a href="http://docs.python.org/2/library/profile.html#module-cProfile"><code>cProfile</code></a> and investigate where the program spends majority of it’s time.
The standard way to run <code>cProfile</code> is given below:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; python -m cProfile -o profile.pstats alignment.py</code></pre></div>
<p>The <code>-o profile.pstats</code> option tells the profiler to save the stats in <code>profile.pstats</code> file, which we are going to view using <a href="https://code.google.com/p/jrfonseca/wiki/Gprof2Dot"><code>gprof2dot</code></a> script and <a href="http://www.graphviz.org/"><code>graphviz</code></a>.</p>
<p>If you do not yet have these packages, you should be able to install <code>gprof2dot</code> using <code>pip</code> directly:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; pip install gprof2dot</code></pre></div>
<p>If your computer does not recognise <code>pip</code> command, try to install it first using <a href="http://www.pip-installer.org/en/latest/installing.html">these instructions</a>.</p>
<p>Since <code>graphviz</code> is not a python package it cannot be installed via <code>pip</code>, therefore follow the instructions in the <a href="http://www.graphviz.org/Download.php">download page</a> for guidance on how to obtain it. Alternatively, you can use your distribution’s package manager. For instance, the aforementioned Homebrew on OSX can do this (<code>brew install graphviz</code>).</p>
<p>Once you have both modules, run the following two-part command:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; gprof2dot -f pstats profile.pstats -n <span class="m">0</span> -e <span class="m">0</span> <span class="p">|</span> dot -Tpdf -o profile.pdf</code></pre></div>
<p>Here<code>-n 0 -e 0</code> just tells the software to draw all nodes (by default it cuts a few unimportant ones off for visual clarity). The piped <code>dot</code> command generates <code>profile.pdf</code> with the profile of function calls for the program.</p>
<p>Open this <code>profile.pdf</code> file with your favourite PDF viewer to see the results.</p>
<p><img src="/assets/post-images/2014-02-12-how-to-make-python-faster-without-trying-that-much/profile-1.png" alt="Contents of profile.pdf" /></p>
<h2 id="improving-the-code-based-on-profile-results">Improving the code based on profile results</h2>
<p>From the profile above we can see that the majority of the time is spent in the <code>align</code> function, just as we would expect, 35% of which in function <code>max</code>.
Out of this time, we see that we spend 9.75% of time in the <code>lambda</code> function, namely the lambda function <code>lambda x: x[0]</code> that just tells the function to get the first value and use that for comparison.</p>
<p>Judging from these results, one must ask whether using <code>max</code> function for three elements only is an overkill. M
aybe we were too smart here and could actually get away with simpler code? Lets replace the call for max function with a block of nested <code>if</code> statements:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">if</span> <span class="n">match_score</span> <span class="o">&gt;=</span> <span class="n">deletion</span> <span class="ow">and</span> <span class="n">match_score</span> <span class="o">&gt;=</span> <span class="n">insertion</span><span class="p">:</span>
<span class="n">max_score</span> <span class="o">=</span> <span class="n">match_score</span>
<span class="n">traceback_pos</span> <span class="o">=</span> <span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">j</span><span class="o">-</span><span class="mi">1</span>
<span class="k">elif</span> <span class="n">deletion</span> <span class="o">&gt;</span> <span class="n">match_score</span> <span class="ow">and</span> <span class="n">deletion</span> <span class="o">&gt;=</span> <span class="n">insertion</span><span class="p">:</span>
<span class="n">max_score</span> <span class="o">=</span> <span class="n">deletion</span>
<span class="n">traceback_pos</span> <span class="o">=</span> <span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">j</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">max_score</span> <span class="o">=</span> <span class="n">insertion</span>
<span class="n">traceback_pos</span> <span class="o">=</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="o">-</span><span class="mi">1</span>
<span class="n">matrix</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_score</span>
<span class="n">traceback</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">traceback_pos</span></code></pre></div>
<p>To get this code just checkout <code>simplified-initial</code> tag:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; git checkout simplified-initial</code></pre></div>
<p>Alternatively, browse it <a href="https://github.com/sauliusl/python-optimisation-example/tree/simplified-initial">here</a>.</p>
<p>Let’s <code>time</code> this new code:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; <span class="nb">time </span>python alignment.py
0m0.327s
&gt; <span class="nb">time </span>pypy alignment.py
0m0.249s</code></pre></div>
<p>We can see that CPython version of the code became much faster (remember, it was around 0.5s before), whereas PyPy runtimes stayed roughly the same, suggesting that PyPy has already thought of this optimisation on it’s own.</p>
<p>Running cProfile again, we can see that align function is still the bottleneck. Unfortunately, it does not tell us what is making it slow any more:</p>
<p><img src="/assets/post-images/2014-02-12-how-to-make-python-faster-without-trying-that-much/profile-2.png" alt="Contents of profile.pdf" /></p>
<h2 id="profiling-line-by-line">Profiling line-by-line</h2>
<p>We will use a line profiler to get more granularity for our profiles.</p>
<p>First install it if you haven’t already:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; pip install line_profiler</code></pre></div>
<p>Now in order for the profiler to work, we need to tell it which functions we want to profile. This is done by decorating said functions with <code>@profile</code> decorator.
Open <code>align.py</code> in your favourite text editor and add the line <code>@profile</code> on top of the <code>def align ...</code> line, i.e.:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># align.py</span>
<span class="c"># &lt; snip ... &gt;</span>
<span class="nd">@profile</span>
<span class="k">def</span> <span class="nf">align</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
<span class="n">matrix</span> <span class="o">=</span> <span class="p">{}</span>
<span class="c">#&lt; snip ... &gt;</span></code></pre></div>
<p>Don’t worry about importing this line. Profiler will make sure to import it for you automagically.</p>
<p>The actual profiling is done by running</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; kernprof.py --line-by-line alignment.py</code></pre></div>
<p>This would store the results in <code>alignment.py.lprof</code> file which can then be viewed by</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; python -m line_profiler alignment.py.lprof</code></pre></div>
<p>The first command in this set generates the profile, and the second one prints it in a readable format.
This is the output:</p>
<pre><code>
Timer unit: 1e-06 s
File: align.py
Function: align at line 6
Total time: 1.38775 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
6 @profile
7 def align(left, right):
8
9 1 3 3.0 0.0 matrix = {}
10
11 # Initialise
12 1 1 1.0 0.0 matrix[0, 0] = 0
13 1 1 1.0 0.0 traceback = {}
14 394 369 0.9 0.0 for i in xrange(len(left)):
15 393 525 1.3 0.0 matrix[i+1, 0] = (i+1) * DELETION_SCORE
16 393 624 1.6 0.0 traceback[i+1, 0] = i, 0
17
18 388 405 1.0 0.0 for i in xrange(len(right)):
19 387 477 1.2 0.0 matrix[0, i+1] = (i+1) * INSERTION_SCORE
20 387 532 1.4 0.0 traceback[0, i+1] = 0, i
21
22
23
24 394 338 0.9 0.0 for i, symbol_left in enumerate(left):
25 152484 140777 0.9 10.1 for j, symbol_right in enumerate(right):
26 152091 125126 0.8 9.0 match_score = matrix[i, j] + MATCH_SCORE if symbol_left == symbol_right else MISMATCH_SCORE
27 152091 155781 1.0 11.2 deletion = matrix[i, j+1] + DELETION_SCORE
28 152091 146020 1.0 10.5 insertion = matrix[i+1, j] + INSERTION_SCORE
29 152091 111502 0.7 8.0 if match_score &gt;= deletion and match_score &gt;= insertion:
30 9563 6786 0.7 0.5 max_score = match_score
31 9563 7496 0.8 0.5 traceback_pos = i, j
32 142528 114515 0.8 8.3 elif deletion &gt; match_score and deletion &gt;= insertion:
33 87713 62775 0.7 4.5 max_score = deletion
34 87713 72069 0.8 5.2 traceback_pos = i, j+1
35 else:
36 54815 38668 0.7 2.8 max_score = insertion
37 54815 44198 0.8 3.2 traceback_pos = i+1, j
38
39 152091 171714 1.1 12.4 matrix[i+1, j+1] = max_score
40 152091 187046 1.2 13.5 traceback[i+1, j+1] = traceback_pos
41
42 1 2 2.0 0.0 return matrix[len(left), len(right)], traceback
</code></pre>
<p>We can see we spend a significant amount of time in lines that access the matrix, i.e. lines 26-28, 39.
Similarly, a significant amount of time is spent in the comparisons and other parts of the loop.
Optimisation of any of these lines will provide the most benefit. Let’s see if we can find a way to perform these optimisations.</p>
<h2 id="cythonising-the-code">Cythonising the code</h2>
<p><a href="http://cython.org/">Cython</a> can be considered to be a compiler for Python.
Generally, it is able to convert the Python code into C, which should provide a significant performance boost for simple statements such as element access in a matrix.
In order to start using it, we would need to install it first. <code>pip</code> would be able to handle this for us:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; pip install Cython</code></pre></div>
<p>Once the installation is complete, it is ridiculously easy to convert the code to Cython format:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; cp align.py align.pyx</code></pre></div>
<p>You got it right, we just copied the file to a file with different extension (<code>.pyx</code> is Cython file extension).
In order to be able to run it, however, we would need to compile this file. This is done by creating <code>setup.py</code>.
The following code provides the minimal skeleton for such file:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="lineno">1</span> <span class="c"># setup.py</span>
<span class="lineno">2</span> <span class="kn">from</span> <span class="nn">distutils.core</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="lineno">3</span> <span class="kn">from</span> <span class="nn">distutils.extension</span> <span class="kn">import</span> <span class="n">Extension</span>
<span class="lineno">4</span> <span class="kn">from</span> <span class="nn">Cython.Distutils</span> <span class="kn">import</span> <span class="n">build_ext</span>
<span class="lineno">5</span>
<span class="lineno">6</span> <span class="n">setup</span><span class="p">(</span>
<span class="lineno">7</span> <span class="n">cmdclass</span> <span class="o">=</span> <span class="p">{</span><span class="s">&#39;build_ext&#39;</span><span class="p">:</span> <span class="n">build_ext</span><span class="p">},</span>
<span class="lineno">8</span> <span class="n">ext_modules</span> <span class="o">=</span> <span class="p">[</span><span class="n">Extension</span><span class="p">(</span><span class="s">&quot;align_cythonised&quot;</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;align.pyx&quot;</span><span class="p">])]</span>
<span class="lineno">9</span> <span class="p">)</span></code></pre></div>
<p>Line eight in this file can roughly be translated to say <em>compile file <code>align.pyx</code> into a library <code>align_cythonised</code></em>.</p>
<p>Now let’s quickly compile our code using this setup file:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; python setup.py build_ext --inplace</code></pre></div>
<p>The code should generate <code>align.c</code> and <code>align_cythonised.so</code> files in your local directory (do not forget the <code>--inplace</code> tag, as otherwise it would compile it into <code>build/</code> directory).</p>
<p>We could now use <a href="http://docs.python.org/2/library/timeit.html"><code>timeit</code></a> module to compare the runtimes of these functions.
It is convenient to use IPython (<a href="http://ipython.org">http://ipython.org/</a>, <code>pip install ipython</code>) for <code>timeit</code> tests due to it’s magic command <code>%timeit</code>. I use it here:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">sequences</span> <span class="kn">import</span> <span class="n">P53_HUMAN</span><span class="p">,</span> <span class="n">P53_MOUSE</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">align</span> <span class="kn">import</span> <span class="n">align</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">align_cythonised</span> <span class="kn">import</span> <span class="n">align</span> <span class="k">as</span> <span class="n">align_cythonised</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="o">%</span><span class="n">timeit</span> <span class="n">align</span><span class="p">(</span><span class="n">P53_HUMAN</span><span class="p">,</span> <span class="n">P53_MOUSE</span><span class="p">)</span>
<span class="mi">1</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="p">:</span> <span class="mi">258</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">5</span><span class="p">]:</span> <span class="o">%</span><span class="n">timeit</span> <span class="n">align_cythonised</span><span class="p">(</span><span class="n">P53_HUMAN</span><span class="p">,</span> <span class="n">P53_MOUSE</span><span class="p">)</span>
<span class="mi">1</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="p">:</span> <span class="mi">189</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span></code></pre></div>
<p>We can see we got 69 ms improvement by just compiling the code without changing a line of code – not bad.
Note that PyPy runs this function in 168 ms – 21 milliseconds faster.
One could overstretch this result and say that Python runs faster than C in this scenario (hehe).</p>
<p>Let’s see if we can make Cythonised code beat this though.</p>
<h2 id="numpy-and-cython">Numpy and Cython</h2>
<p>Cython collaborates with <a href="http://www.numpy.org/"><code>numpy</code></a> folks a lot and is able to run <code>numpy</code>’s C interface natively. We could exploit this native interface to our advantage.
To do so, we will change our matrix to be a numpy n-dimensional array rather than a dictionary.</p>
<p>First of all, we need to install <code>numpy</code>, <code>pip</code> could be used to do this yet again:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; pip install numpy</code></pre></div>
<p>You might need a Fortran compiler to do so (you can obtain it from Homebrew on OSX <code>brew install gfortran</code>). If you are having trouble with this step consult the <a href="http://www.scipy.org/scipylib/download.html">downloads page</a> on how to get the precompiled library directly.</p>
<p>Once this is all set, copy <code>align.pyx</code> into <code>align_numpy.pyx</code> and change it slightly to use <code>numpy</code>.</p>
<p>The modified file is available from <a href="https://github.com/sauliusl/python-optimisation-example/blob/cython-final/alignment/align_numpy.pyx">GitHub</a>.</p>
<p>Git tag <code>cython-final</code> could be used to checkout this code:</p>
<div class="highlight"><pre><code class="language-bash" data-lang="bash">&gt; git checkout cython-final</code></pre></div>
<p>In the first couple of lines in the <code>align_numpy.pyx</code> we import both pythonic and c libraries for numpy:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="c"># Import python numpy bindings</span>
<span class="n">cimport</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="c"># Import cpython numpy bindings</span></code></pre></div>
<p>Then, we define the matrix to be a <code>numpy</code> array (rather than a dictionary), and initialise it:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">cdef</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">float_t</span><span class="p">,</span> <span class="n">ndim</span><span class="o">=</span><span class="mi">2</span><span class="p">]</span> <span class="n">matrix</span>
<span class="n">matrix</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">((</span><span class="n">len_left</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">len_right</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span></code></pre></div>
<p>Then, we explicitly define that our iterators, i and j, are unsigned integers:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">cdef</span> <span class="n">unsigned</span> <span class="nb">int</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span></code></pre></div>
<p>This step is crucial, as numpy array accesses are optimised to native C speeds only for typed indexes.
We add a couple of other <code>cdef</code> hints, and are ready to go.</p>
<p>In order to compile this new file, <code>setup.py</code> needs to be changed a little bit as well,
the file is available in <a href="https://github.com/sauliusl/python-optimisation-example/blob/cython-final/alignment/setup.py">GitHub</a>.</p>
<p>All we did here was to add another <code>Extension</code> line to make the <code>setup.py</code> aware of this new file.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">Extension</span><span class="p">(</span><span class="s">&quot;align_cythonised_numpy&quot;</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;align_numpy.pyx&quot;</span><span class="p">],</span>
<span class="n">include_dirs</span><span class="o">=</span><span class="p">[</span><span class="n">numpy</span><span class="o">.</span><span class="n">get_include</span><span class="p">()])]</span></code></pre></div>
<p>We also make sure to include <code>numpy</code> headers to the compilation, the <code>include_dirs</code> parameter takes care of this.</p>
<p>We can now check how fast this code is running by using <code>%timeit</code> magic function, after appropriately importing the numpy-cythonised function:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">In</span> <span class="p">[</span><span class="mi">7</span><span class="p">]:</span> <span class="o">%</span><span class="n">timeit</span> <span class="n">align_cythonised_numpy</span><span class="p">(</span><span class="n">P53_HUMAN</span><span class="p">,</span> <span class="n">P53_MOUSE</span><span class="p">)</span>
<span class="mi">10</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="p">:</span> <span class="mi">101</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span></code></pre></div>
<p>As you can see, the total runtime is 101 milliseconds. This is 157ms faster than the original version of the code and all we did was to start using numpy and add a couple of type hints.</p>
<p>This Cythonic code can be further optimised by enabling the potentially dangerous options such as disabling bound checking.
Furthermore, if you want to take optimisation matters into your own hands, one could run <em>native</em> C code from Cython.
I refer the reader to Cython documentation and various tutorial videos from SciPy conferences for more information on these advanced topics.</p>
<h2 id="conclusions">Conclusions</h2>
<p>In conclusion, this guide provides a reference on some quick optimisations you could make for your code, without much significant effort from your side.
Were able to significantly improve runtimes of Python code by profiling where the bottlenecks are and doing simple, yet incremental changes to it.</p>
<p>A word of warning, though, these alternative interpreters/compilers for python should be considered only as a last resort. More often than not, there is a way to obtain sufficient performance boosts from simply changing the bits of the code, as we did in the beginning. However, if in any case this would prove insufficient for your needs, these optimisation tools exist already and will continue to improve over time.</p>
<p><a href="/articles/2014/02/12/how-to-make-python-faster-without-trying-that-much/">How to Make Python Faster Without Trying That Much</a> was originally published by Saulius Lukauskas at <a href="">Blog ~ Saulius Lukauskas</a> on February 12, 2014.</p>