Monday, September 2, 2013

Matrix Calculations with Clatrix and Core.Matrix

Recently, I finished the port of the eleventh lesson of CFD Python to Clojure. While I was able to establish a numerically correct solution, the runtime performance turned out to be extremely poor. To track down the root causes, I opened a side project for analysing the runtime characteristics of the Clatrix and Core.Matrix libraries.

Test Setup

The problem in question is a moderately sized matrix application that involves an iteration over 16 submatrices, of which 6 are composed ones. The problem also involves preprocessing over 12 submatrices (of which 4 are composed) and an additional sub-iteration on 4 submatrices for each main iteration step. The matrix size for this classroom example is 41 x 41, where the submatrices are of size 39 x 39. The main iteration takes 700 steps with a sub-iteration of 50 steps each. For regression test purposes, I also calculated three reduced models. The runtime of the Python code for calculating these models, including graph generation and IO, is below 30 seconds on my laptop. The Python code relies on NumPy and SciPy.
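For reference, the Python baseline can obtain such submatrices cheaply by slicing. A minimal NumPy sketch of selecting the 39 x 39 interior of a 41 x 41 grid (the array name `p` is illustrative, not taken from the original code):

```python
import numpy as np

# 41 x 41 grid, as in the classroom example
p = np.zeros((41, 41))

# the 39 x 39 interior submatrix: all rows/columns except the boundary
interior = p[1:-1, 1:-1]

print(interior.shape)  # (39, 39)
```

In NumPy this slice is a view on the original data, so selecting the interior costs essentially nothing per iteration step.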

My implementation makes naive use of the Incanter 1.5.2 matrix-operation interface, which uses the Clatrix library under the hood. As the runtime of the Clojure implementation is literally measured in hours, I started to dig into the code in search of coding mistakes and library issues. It quickly turned out that I needed to isolate the building blocks of the overall CFD model. I therefore forked a side project to separate the calculation steps and measure them rigorously with Criterium. I pushed the test setup to an Amazon 64-bit off-the-shelf Ubuntu 13.04 server for a proper baseline.

The software uses the following releases:

[org.clojure/clojure "1.5.1"]

[incanter/incanter-core "1.5.2"]

[clatrix "0.3.0"]

[org.jblas/jblas "1.2.3"]

[net.mikera/core.matrix "0.7.2"]

[slingshot "0.10.3"]

[criterium "0.4.1"]

[org.clojure/tools.macro "0.1.2"]

[midje "1.5.1"]

[lein-midje "3.1.0"]

To avoid coding mistakes of my own, I set up the test routines to be as minimal as possible. I referred to the work of C. Grand and Lau Jensen for benchmarking set/get operations on matrix elements (LJ did elementwise calculations in his CFD project instead of applying a matrix calculus).
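On the Python side, the role Criterium plays here can be approximated with the standard-library timeit module. A hedged sketch of such a minimal harness (the helper `bench` is my own illustrative name, not from the original code):

```python
import timeit
import numpy as np

def bench(fn, repeats=5, number=10):
    """Return the best average time per call over several repeats."""
    times = timeit.repeat(fn, repeat=repeats, number=number)
    return min(times) / number

a = np.random.rand(100, 100)
b = np.random.rand(100, 100)

add_time = bench(lambda: a + b)      # elementwise addition
mmul_time = bench(lambda: a.dot(b))  # inner (matrix) product
print(f"add: {add_time:.2e}s  mmul: {mmul_time:.2e}s")
```

Taking the minimum over several repeats filters out warm-up and scheduling noise, which is roughly what Criterium's more elaborate statistics also aim at.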

I also double-checked the absence of logical mistakes with separate functional tests. For runtime reasons, the tests operate on a reduced matrix size of 100 x 100, where I would have preferred 1,000 x 1,000.

Test Summary

It is worth mentioning that the Clatrix library triggers 22 reflection warnings on an empty project. The Core.Matrix library, however, is clean in that respect. My own test code produces no warnings either.

Accessing Matrix Elements

The optimized access of the 10,000 elements of the test matrix, typed as a Clojure array, takes close to 1 ms for get and 0.5 ms for set. A direct comparison of a function vs. a macro implementation reveals a factor of 10 in favor of the macro version.
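To put these numbers in perspective with the NumPy baseline: per-element access in an interpreted loop pays overhead on every call, which is why both ecosystems reward bulk operations. An illustrative sketch, not taken from the original code:

```python
import numpy as np

m = np.zeros((100, 100))  # 10,000 elements, as in the test matrix

# elementwise get/set in an explicit loop (slow path)
for i in range(100):
    for j in range(100):
        m[i, j] = m[i, j] + 1.0

# the same update as one vectorized operation (fast path)
m += 1.0

print(m[0, 0])  # 2.0
```

The loop and the vectorized statement compute the same result, but the loop goes through the interpreter 10,000 times while the vectorized form dispatches once to native code.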

The Clatrix accessor functions take 3.6 ms for the same operations. These operations involve copying; a read-only get and an overwrite-set! are not provided.

The Core.Matrix implementation is prohibitively slow, at 211 ms for get and 63 sec for set. While the library also defines an overwrite-set! operation, the implementation throws an exception.

Building Submatrices

The Clatrix library offers a generic function 'from-indices' which can be used to select submatrices. The implementation is, however, suspiciously flawed [https://github.com/tel/clatrix/blob/master/src/clatrix/core.clj#L414]. It takes on average 14.6 sec to select a 98 x 98 submatrix twice from a 100 x 100 matrix.
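For comparison, NumPy distinguishes between contiguous slicing, which returns a view without copying, and generic index-based selection, which always copies — the latter being the closest analog to a from-indices-style API. A hedged sketch:

```python
import numpy as np

m = np.random.rand(100, 100)

# contiguous submatrix via slicing: a view, no data is copied
sub_view = m[1:99, 1:99]            # 98 x 98
assert sub_view.base is m           # shares m's storage

# generic index-based selection (rough analog of from-indices):
# fancy indexing always copies the selected elements
rows = np.arange(1, 99)
cols = np.arange(1, 99)
sub_copy = m[np.ix_(rows, cols)]    # 98 x 98 copy
```

Even a correct copying implementation is only O(n) in the number of selected elements, so 14.6 sec for two 98 x 98 selections points at something worse than the copy itself.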

I missed measuring the Core.Matrix submatrix function separately, but see below.

Matrix +/-/*

While addition and subtraction are elementwise operations, the matrix multiplication needed for the CFD model is based on the inner product, which is significantly more expensive.

The Clatrix library functions take around 7 sec for the +/-/* operations applied to one 98 x 98 submatrix (out of a 100 x 100 matrix) on itself. The Core.Matrix functions operate in approximately the same performance range on a Clatrix matrix. On a Core.Matrix matrix, the functions add/sub are fast at 493.2 ms/59.8 ms, but mul breaks down completely with an elapsed time of roughly one hour.

It should also be noted that Core.Matrix/mul is documented to be an elementwise multiplication, while mmul implements the inner product. In this release, however, mul actually implements the inner product and mmul throws a compiler exception.
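The distinction at issue is exactly the one NumPy draws between `*` and `dot`: the former is the elementwise (Hadamard) product, the latter the matrix (inner) product. A small sketch for readers coming from the Python side:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

elementwise = a * b      # Hadamard product: what mul is documented to do
inner = a.dot(b)         # matrix (inner) product: what mmul should do

print(elementwise)  # [[ 5. 12.]
                    #  [21. 32.]]
print(inner)        # [[19. 22.]
                    #  [43. 50.]]
```

Mixing the two up silently, as the mul/mmul swap does, yields numerically wrong results rather than an error, which makes it a particularly dangerous kind of bug.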

Final Remarks

I put reasonable effort into avoiding coding mistakes of my own but accept that some may remain. I would welcome feedback on this, as well as suggestions for extending the test suite with further regression tests on the core libraries. The code is on Github.