Too simple to be wrong

I’ve been exercising my mad-programming skillz for a while on a variety of things. A while ago I got it in my head to port the benchmarks posted on julialang.org to Perl, so I’ve been working on this in the background for a few weeks. I also plan, at some point, to rewrite them in q/kdb+, as I’ve been wanting to spend more time with it.

The benchmarks aren’t hard to rewrite. No, that’s not been the challenge. The challenge has been to leverage tools I’ve not used much before, like PDL.

It boils down to this: tools like Python get quite a bit of attention for big data and analytical usage, while other tools, say some nice DSLs possibly more appropriate to the tasks, get less attention. I wanted an excuse to spend more time with the DSLs.

And I am curious about their speed, and about the core language. Perl isn’t slow as a language: the code is compiled down to an internal representation before execution, so as long as we don’t do dumb things (including using Red Hat builds of it), it should be reasonably fast. But more to the point, DSLs, when used correctly, can provide significant gains in programmer simplicity and performance, to say the least.

So I set about doing the port, and completed the basic elements of it. I ran the tests in C, Fortran, Perl, Python, Julia, and Octave on my laptop. The problems are toy-sized, though, and can’t be used for real comparisons … which is to the detriment of the presentation of the benchmarks on the Julia site. Actually, I’d argue that a set of real-world problems, showing coding/development complexity, performance, etc., would be far better (and would actually be quite useful for promoting Julia usage). FWIW, I am a fan of Julia, though I do wish for static compilation, to simplify distributing a runtime version of code (lower space footprint).

For the Perl port, I used relevant internal functions where it was wise to do so. Why should we recode quicksort when the built-in sort is already a highly tuned C implementation (a mergesort, by default, since Perl 5.8)? Where there were no internal functions, I looked at options from CPAN to provide the basis for the algorithm. Given that Python leveraged numpy, I thought PDL made sense to use in similar cases.
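For the sort kernel, for instance, the whole thing reduces to the built-in. A minimal sketch (not the actual port; the array size here is an illustrative assumption):

```perl
use strict;
use warnings;

# The sort kernel reduced to Perl's built-in sort; no hand-rolled
# quicksort needed. The 5000-element size is an assumption.
my @data   = map { rand() } 1 .. 5000;
my @sorted = sort { $a <=> $b } @data;   # numeric sort, C-speed internals
```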

But I always started out with the original in pure Perl to make sure the algorithm was correct. I used Python 3.4.0, Perl 5.18.2, gcc/gfortran 4.7.2, Octave 3.6.x, and Julia 0.3.0.x.

One simple example coded up was the Fibonacci series computation, usually used as a test of recursion. The code is relatively trivial.
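The pure-Perl version is the obvious doubly-recursive one (a sketch; the argument of 20 follows the Julia benchmark suite):

```perl
use strict;
use warnings;

# Naive doubly-recursive Fibonacci: the classic recursion stress test.
sub fib {
    my ($n) = @_;
    return $n if $n < 2;
    return fib($n - 1) + fib($n - 2);
}

print fib(20), "\n";   # 6765
```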

Execution time was measured over m sets of N iterations. Timer resolution is +/- 0.625 ms, so N has to be large enough to get a signal, i.e. so the measurement is much larger than the tick resolution.
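The harness can be as simple as a best-of-m wrapper around Time::HiRes (a sketch, not the exact harness used here; gettimeofday and tv_interval are from the core Time::HiRes module):

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run $code N times, m times over; report the best per-call time in seconds.
sub time_it {
    my ($code, $m, $n) = @_;
    my $best;
    for (1 .. $m) {
        my $t0 = [gettimeofday];
        $code->() for 1 .. $n;
        my $dt = tv_interval($t0);
        $best = $dt if !defined $best || $dt < $best;
    }
    return $best / $n;   # seconds per call
}
```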

lang      execution time (×10⁻³ s)
C         0.068
Fortran   0.074
Python    3.17
Julia     0.072
Perl      5.65

Interesting, but not surprising. What about a more computationally bound test, say sum reduction (as in the pi_sum test, which is quite similar to my Riemann zeta function test)?

lang      execution time (×10⁻³ s)
C         48.7
Fortran   48.6
Python    684.4
Julia     46.9
Perl      83.6

So how can Perl be within a factor of 2 or so of the compiled languages? What horrible things did I have to do to the code?

A simple vector sum, repeated 500 times. Nothing complex here; the DSL is embedded in the language. The += in the sum line is there to prevent the optimizer from hoisting the inner loop and computing it only once.
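With PDL, the inner loop collapses to a single expression. A sketch, assuming the kernel matches the Julia pi_sum (sum of 1/k² for k = 1 .. 10000, repeated 500 times):

```perl
use PDL;

# Vectorized pi_sum: the whole k-loop becomes one PDL expression.
my $k   = sequence(double, 10_000) + 1;   # the vector 1 .. 10000
my $sum = 0;
for (1 .. 500) {
    $sum += sum(1.0 / ($k * $k));   # += keeps the loop from being optimized away
}
# each pass of the inner sum converges toward pi**2/6 ~ 1.6449
```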

Nice. PDL has some cool powers in this case.

I also used it on the random matrix multiply bit.
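In PDL the whole kernel is the built-in matrix-product operator. A sketch (the 1000×1000 size follows the Julia rand_mat_mul benchmark; this is not the exact port):

```perl
use PDL;

# Random matrix multiply via PDL's x operator (matrix product,
# not elementwise multiplication).
my $m1   = random(1000, 1000);
my $m2   = random(1000, 1000);
my $prod = $m1 x $m2;
```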

lang      execution time (×10⁻³ s)
C         228.3
Fortran   904.6
Python    220.5
Julia     209.6
Perl      238.5

OK, what’s surprising to me is the lower performance of the Fortran code. It is quite consistent … so I am guessing that we are hitting an aliasing issue that isn’t apparent with the other codes. This has been a problem with Fortran for a long while, and can cause sudden performance loss on things that should be fast. Given that we were using the matmul intrinsic, this should have been nearly optimal in performance.

Basically, I am noting that Perl appears, in these microbenchmarks (cherry-picked for the moment), to be holding its own.

The only outlier appears to be rand_mat_stat for Perl, and I think I might have made a coding error in it. I’m still looking it over (this is mostly for PDL exploration, and I am still trying to get my head around PDL).

But here’s where things go pear-shaped: the Mandel code snippet. Basically, it computes the Mandelbrot set over the region from (-2 - 1i) to (0.5 + 1i). We know what it should look like.
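The per-point computation is the standard escape-time count. A sketch of it in pure Perl (the 80-iteration cap follows the Julia version; the complex arithmetic is unrolled into real/imaginary parts):

```perl
use strict;
use warnings;

# Escape-time count for one point c = cr + i*ci, capped at 80 iterations.
sub mandel_count {
    my ($cr, $ci) = @_;
    my ($zr, $zi) = (0.0, 0.0);
    for my $n (0 .. 79) {
        return $n if $zr * $zr + $zi * $zi > 4.0;   # |z| > 2: escaped
        # z = z*z + c, in real/imaginary parts
        ($zr, $zi) = ($zr * $zr - $zi * $zi + $cr,
                      2 * $zr * $zi + $ci);
    }
    return 80;   # never escaped: treated as in the set
}
```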

I decided to try to get cute with this. Always a mistake … learn from my errors, oh young programmer, and do not hyperoptimize before you get it correct.

So here I have this awesome DSL, PDL. Instead of computing this an element at a time, why not compute the entire region at once? So I define a complex PDL and get the ranges/domains correct. Iteration is trivial. The hard part is determining whether a particular point has escaped, as we wish to do this an image at a time. It turns out there are conditional functions that apply to the entire image and generate “masks” or indices for indexing. This allows you to do things like $n->indexND($mask), and only increment the counts that should be incremented (those points where abs($z) <= 2).
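A sketch of what that whole-image iteration can look like. This is assumed, not the code I ran: xvals, yvals, whichND, and indexND are PDL core functions, and the complex arithmetic is kept as separate real/imaginary piddles rather than a complex PDL:

```perl
use PDL;

my ($w, $h) = (80, 40);                                   # grid size (illustrative)
my $cr = xvals(double, $w, $h) * 2.5 / ($w - 1) - 2.0;    # Re(c): -2 .. 0.5
my $ci = yvals(double, $w, $h) * 2.0 / ($h - 1) - 1.0;    # Im(c): -1 .. 1
my $zr = zeroes($w, $h);
my $zi = zeroes($w, $h);
my $n  = zeroes(long, $w, $h);                            # per-pixel counts

for (1 .. 80) {
    my $mask = ($zr * $zr + $zi * $zi) <= 4;   # points not yet escaped
    last unless $mask->sum;
    my $idx = whichND($mask);                  # indices of live points
    $n->indexND($idx) .= $n->indexND($idx) + 1;   # bump only those counts
    # z = z*z + c over the whole image; escaped points just run off to inf/NaN
    ($zr, $zi) = ($zr * $zr - $zi * $zi + $cr, 2 * $zr * $zi + $ci);
}
```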

Very cool stuff.

Computation is fast this way.

But it’s also wrong.

So I go back to doing a row at a time using PDL, and a simpler version of this.

Still wrong.

Gaak.

Do element at a time. Print out the counts, so I can see the image.

Yes, I do see the set, and it looks correct, quite like the images of the other sets I’ve seen.

Ok, we are onto something.

Now look at the code provided with the benchmarks.

…

The results look … well … wrong.

So I played with the version in Octave, and got the same results as the Python version. Which look wrong.

But the code is simply too simple to be wrong. Unless something else is going on. So I am delving into what this difference could be: comparing the Octave and Perl implementations, making sure the code returns the same results for random sets of points, and then figuring out why or why not.

It’s too simple to be wrong, but … looking at the output, it definitely is. The question is which of these is wrong.