Saturday, May 23, 2009

During development he has been disappointed with the performance of IronPython, in fact he discovered that some of his code was running massively slower than the same code on either CPython (the native implementation of Python) or Jython (the implementation for the Java Virtual Machine).

He posted a series of blog entries on this where he dug deeper into the problem and eventually, with the help of the IronPython team he found a solution. I've listed the blog entries in the order they were posted; what follows is a story of performance, impressive charts and an explanation:

Significant claims have been made about the performance of IronPython, notably back at its inception in 2004 when Jim Hugunin, creator of both IronPython and its cousin Jython, presented a paper on IronPython performance at PyCon. Since then, there have been numerous claims to IronPython’s supremacy over CPython in the performance stakes. The IronPython Performance Report reiterates that IronPython can turn in a good performance. According to Hugunin the standard line we’ll see is,

“IronPython is fast – up to 1.8x faster than CPython on the standard pystone benchmark.

But do these claims stand up in the face of real-world Python code?

The claims of good performance are based on synthetic micro-benchmarks which don’t accurately reflect balance of coding techniques found in complete programs.

At this point I’d like to offer my own quote:

“IronPython can be slow – 10x to 100x slower than CPython on real-world code and it has been observed to be up to 6000x slower.

The blog entry features some of the aforementioned pretty charts illustrating the performance of the Owl Basic code on Jython, Python and IronPython. The IronPython code is running orders of magnitude slower. Jython fares remarkably well however.

Naturally this doesn't match my experience of working with IronPython, but the story continues...

In this entry Robert has found an isolated example that seems to demonstrate the performance problem; a simple test program which builds a simple binary tree of objects.

The charts show that for small depths CPython outperforms Jython, but for deeper object graphs Jython scales better than CPython. IronPython seems to perform astonishingly badly from start to finish!

In the comments Curt Hagenlocher (IronPython core developer) notes that changing the counter from updating a class variable to a global variable makes a slight difference to the performance of IronPython.

The code Robert Smallshire was using to benchmark was updating a class variable as a counter. This is perfectly valid Python code, even idiomatic code. Unfortunately it triggered a code path in the IronPython implementation that could be described at best as "unfortunate". More on that shortly.

When Robert updated his code following Curt's suggestion, the results were interesting:

This made a phenomenal difference with IronPython completing in only 12% of the time taken by CPython – over 8× faster with a binary tree depth of 20.

Dino Viehland (IronPython core developer) followed up with a blog post explaining the problem, and how they were fixing it in IronPython. The fix is now in place and will be in IronPython 2.6.

Ahh, performance… It’s so much fun! It’s easy to measure and as you work on it you can see immediate benefits (or not!). I personally enjoy getting to spend time looking at interesting performance issues and I’ll use this as a time to plug IronPython 2.6 Beta 1 where we’ve continued our focus in this area. Even more fun yet is when you compare and contrast different implementations - it can be both surprising and amusing. Recently Robert Smallshire has been doing just that. Obviously Robert has found a case in IronPython where something is performing horribly. I’d like to take a look at what’s happening here and what we can do to improve this case.

Ok, but let’s get onto the fun technical details! It all comes down to one line of code “Node.counter += 1”. That’s a pretty innocent looking piece of code, how can it cause so many problems? First you need to understand a little bit about how IronPython works. IronPython internally keeps versions of type objects and uses this information to generate optimal code paths for accessing those objects. So the next line of code in this example, “self._children = children” usually turns into something that’s like:

What that one line of code causes then is for the version of the type to change and for this “rule” to be invalidated. So we’ll figure out what to do, generate a new cached rule, only to have it not work the next time around. The same thing applies to the calls to Node – we re-generate the “optimal” code again and again only to discover we have to throw it away.

...

So you might be curious - what’s the fix? Well in this case we’re just mutating the type by changing a simple value stored in the type (versus say adding a property or a method to the type). Today we get a very slight throughput performance benefit because we can burn the value directly into the generated code instead of fetching it from the wrapper the type object keeps it in. With the fix we’ll burn the wrapper in and fetch the value from the wrapper each time. Then when we go and modify the type we just modify the wrapper instead of the type and its version is unchanged! This fix is now checked into the IronPython source tree and you can download the fixed sources right now.

In conclusion I’d like to wrap up a little bit about our performance philosophy – performance is hard and deciding which numbers matter to you is important. We have tried to tune IronPython for apps which run for more than a fraction of a second. We have also tuned IronPython for code which is not constantly mutating types at runtime. And we’re always open to feedback when users encounter problems with IronPython and we’ll listen to what scenarios are important for them. But if you look at these graphs I think you can see that we’ve done a good job at making IronPython a fast implementation of Python.

In short, if you find code where IronPython performs dramatically worse than CPython - it's a bug! Report it on the IronPython mailing list; Dino, Curt and team are very responsive.