Real-time graphics code is one of those areas of programming where you can never get enough performance. One such case was an OpenGL application I’d been working on recently, which performed a lot of matrix multiplications in its inner loop.
I’d been using the excellent OpenTK library for the OpenGL support. It ships with some nice helper classes that take care of the common grunt work associated with developing these kinds of applications. One of these utility classes is a 4×4 matrix, which includes all the standard methods you’d expect to find in such a class, including one to multiply two matrices together. If you’re not familiar with matrix maths, don’t worry; all you need to know is that it is quite a computationally expensive operation involving a lot of floating-point operations – 64 multiplications and 48 additions, to be exact.

There are two variants of the method provided by the library: one, as you’d normally expect, taking in two Matrix4 values and returning a new Matrix4 result, and a second which takes two ref parameters and returns the result as a third out parameter:
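The two overloads can be exercised like this (a quick sketch; the matrix values here are arbitrary):

```csharp
using OpenTK;

Matrix4 left = Matrix4.CreateRotationX(0.5f);
Matrix4 right = Matrix4.CreateTranslation(1f, 2f, 3f);

// Variant 1: by-value parameters, returns a new Matrix4.
// Each Matrix4 (64 bytes) is copied on the way in and out.
Matrix4 product = Matrix4.Mult(left, right);

// Variant 2: ref/out parameters – the caller's storage is used directly,
// so no copies of the structs are made.
Matrix4 product2;
Matrix4.Mult(ref left, ref right, out product2);
```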

The latter looks somewhat unusual, but it’s quite useful since Matrix4 is a value type and is by default copied each time it is passed around. The ref and out keywords instruct the compiler that the memory for the parameters and result is already allocated by the caller, so no copies need to be made.
Internally the first method simply calls the second anyway; the former is just the more natural form, since that’s the standard way parameters and return values are passed.

In my code I was already using the latter; however, a quick test of the performance of these two over one million iterations came out at:

Call                                                  | Duration
------------------------------------------------------|---------
OpenTK.Matrix4.Mult(left, right)                      | 1,700ms
OpenTK.Matrix4.Mult(ref left, ref right, out result)  | 1,380ms

It’s a little artificial to compare these two variants since the former calls the latter and can therefore only ever be slower, but at least we get a baseline against which performance improvements can be measured.
Taking a peek inside the latter revealed what it was up to internally:
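It was implemented along these lines, reading every element through the M11–M44 properties and constructing the result in one go (a sketch of the pattern, not the exact OpenTK source; the class name is made up):

```csharp
using OpenTK;

static class MatrixMultOriginal // illustrative name, not part of OpenTK
{
    // Every left.Mxy / right.Mxy below is a property call – 128 reads in total.
    public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)
    {
        result = new Matrix4(
            left.M11 * right.M11 + left.M12 * right.M21 + left.M13 * right.M31 + left.M14 * right.M41,
            left.M11 * right.M12 + left.M12 * right.M22 + left.M13 * right.M32 + left.M14 * right.M42,
            left.M11 * right.M13 + left.M12 * right.M23 + left.M13 * right.M33 + left.M14 * right.M43,
            left.M11 * right.M14 + left.M12 * right.M24 + left.M13 * right.M34 + left.M14 * right.M44,
            left.M21 * right.M11 + left.M22 * right.M21 + left.M23 * right.M31 + left.M24 * right.M41,
            left.M21 * right.M12 + left.M22 * right.M22 + left.M23 * right.M32 + left.M24 * right.M42,
            left.M21 * right.M13 + left.M22 * right.M23 + left.M23 * right.M33 + left.M24 * right.M43,
            left.M21 * right.M14 + left.M22 * right.M24 + left.M23 * right.M34 + left.M24 * right.M44,
            left.M31 * right.M11 + left.M32 * right.M21 + left.M33 * right.M31 + left.M34 * right.M41,
            left.M31 * right.M12 + left.M32 * right.M22 + left.M33 * right.M32 + left.M34 * right.M42,
            left.M31 * right.M13 + left.M32 * right.M23 + left.M33 * right.M33 + left.M34 * right.M43,
            left.M31 * right.M14 + left.M32 * right.M24 + left.M33 * right.M34 + left.M34 * right.M44,
            left.M41 * right.M11 + left.M42 * right.M21 + left.M43 * right.M31 + left.M44 * right.M41,
            left.M41 * right.M12 + left.M42 * right.M22 + left.M43 * right.M32 + left.M44 * right.M42,
            left.M41 * right.M13 + left.M42 * right.M23 + left.M43 * right.M33 + left.M44 * right.M43,
            left.M41 * right.M14 + left.M42 * right.M24 + left.M43 * right.M34 + left.M44 * right.M44);
    }
}
```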

Granted, there’s an awful lot going on in there, but the routine itself is pretty simple: just multiply a whole lot of numbers together, sum the results and store them in an output structure. At first glance there doesn’t seem to be much to optimise here, but looks can be deceiving.

My first impression was that the values from both matrices are being queried quite often (128 times), whereas between the two matrices there are only 32 unique values, so each accessor is being invoked four times more often than it needs to be. By pulling all these values into locals first and then running the same calculations on the local values, we get the following version:
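A sketch of that version (the class name is illustrative): 32 property reads up front, then exactly the same 64 multiplies and 48 adds performed on the locals.

```csharp
using OpenTK;

static class MatrixMultLocals // illustrative name, not part of OpenTK
{
    public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)
    {
        // Read each of the 32 unique values exactly once.
        float l11 = left.M11, l12 = left.M12, l13 = left.M13, l14 = left.M14;
        float l21 = left.M21, l22 = left.M22, l23 = left.M23, l24 = left.M24;
        float l31 = left.M31, l32 = left.M32, l33 = left.M33, l34 = left.M34;
        float l41 = left.M41, l42 = left.M42, l43 = left.M43, l44 = left.M44;
        float r11 = right.M11, r12 = right.M12, r13 = right.M13, r14 = right.M14;
        float r21 = right.M21, r22 = right.M22, r23 = right.M23, r24 = right.M24;
        float r31 = right.M31, r32 = right.M32, r33 = right.M33, r34 = right.M34;
        float r41 = right.M41, r42 = right.M42, r43 = right.M43, r44 = right.M44;

        result = new Matrix4(
            l11 * r11 + l12 * r21 + l13 * r31 + l14 * r41,
            l11 * r12 + l12 * r22 + l13 * r32 + l14 * r42,
            l11 * r13 + l12 * r23 + l13 * r33 + l14 * r43,
            l11 * r14 + l12 * r24 + l13 * r34 + l14 * r44,
            l21 * r11 + l22 * r21 + l23 * r31 + l24 * r41,
            l21 * r12 + l22 * r22 + l23 * r32 + l24 * r42,
            l21 * r13 + l22 * r23 + l23 * r33 + l24 * r43,
            l21 * r14 + l22 * r24 + l23 * r34 + l24 * r44,
            l31 * r11 + l32 * r21 + l33 * r31 + l34 * r41,
            l31 * r12 + l32 * r22 + l33 * r32 + l34 * r42,
            l31 * r13 + l32 * r23 + l33 * r33 + l34 * r43,
            l31 * r14 + l32 * r24 + l33 * r34 + l34 * r44,
            l41 * r11 + l42 * r21 + l43 * r31 + l44 * r41,
            l41 * r12 + l42 * r22 + l43 * r32 + l44 * r42,
            l41 * r13 + l42 * r23 + l43 * r33 + l44 * r43,
            l41 * r14 + l42 * r24 + l43 * r34 + l44 * r44);
    }
}
```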

As you can see, its internal data is stored in four Vector4 values (Row0 through Row3), with helper properties for column access (Column0 through Column3) and individual component access (M11 through M44). This means that every time the M11 getter is called, for example, the runtime has to make a call into the structure, which goes off to the Row0 value and pulls out its X value. It may not seem like much, but let’s try going directly to the exposed data fields rather than via the properties:
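A sketch of this step (names are illustrative): the reads now go straight to the public Row0–Row3 fields instead of through the Mij property getters; the calculation itself is unchanged.

```csharp
using OpenTK;

static class MatrixMultFields // illustrative name, not part of OpenTK
{
    public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)
    {
        // Direct field access: no property-getter calls at all.
        float l11 = left.Row0.X, l12 = left.Row0.Y, l13 = left.Row0.Z, l14 = left.Row0.W;
        float l21 = left.Row1.X, l22 = left.Row1.Y, l23 = left.Row1.Z, l24 = left.Row1.W;
        float l31 = left.Row2.X, l32 = left.Row2.Y, l33 = left.Row2.Z, l34 = left.Row2.W;
        float l41 = left.Row3.X, l42 = left.Row3.Y, l43 = left.Row3.Z, l44 = left.Row3.W;
        float r11 = right.Row0.X, r12 = right.Row0.Y, r13 = right.Row0.Z, r14 = right.Row0.W;
        float r21 = right.Row1.X, r22 = right.Row1.Y, r23 = right.Row1.Z, r24 = right.Row1.W;
        float r31 = right.Row2.X, r32 = right.Row2.Y, r33 = right.Row2.Z, r34 = right.Row2.W;
        float r41 = right.Row3.X, r42 = right.Row3.Y, r43 = right.Row3.Z, r44 = right.Row3.W;

        result = new Matrix4(
            l11 * r11 + l12 * r21 + l13 * r31 + l14 * r41,
            l11 * r12 + l12 * r22 + l13 * r32 + l14 * r42,
            l11 * r13 + l12 * r23 + l13 * r33 + l14 * r43,
            l11 * r14 + l12 * r24 + l13 * r34 + l14 * r44,
            l21 * r11 + l22 * r21 + l23 * r31 + l24 * r41,
            l21 * r12 + l22 * r22 + l23 * r32 + l24 * r42,
            l21 * r13 + l22 * r23 + l23 * r33 + l24 * r43,
            l21 * r14 + l22 * r24 + l23 * r34 + l24 * r44,
            l31 * r11 + l32 * r21 + l33 * r31 + l34 * r41,
            l31 * r12 + l32 * r22 + l33 * r32 + l34 * r42,
            l31 * r13 + l32 * r23 + l33 * r33 + l34 * r43,
            l31 * r14 + l32 * r24 + l33 * r34 + l34 * r44,
            l41 * r11 + l42 * r21 + l43 * r31 + l44 * r41,
            l41 * r12 + l42 * r22 + l43 * r32 + l44 * r42,
            l41 * r13 + l42 * r23 + l43 * r33 + l44 * r43,
            l41 * r14 + l42 * r24 + l43 * r34 + l44 * r44);
    }
}
```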

Another significant boost in performance by simply using a more efficient way of getting at the data.

Up until now we’ve only been looking at reading the data; there is also the corresponding write side, where the result is stored in the out variable – perhaps there are some performance wins there too?
Since the reading code was vastly improved by going directly to the exposed data fields, let’s try the same thing for the writing. This version writes into the Row fields directly rather than creating a whole new Matrix4 structure:
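A sketch of that version (names are illustrative): the element values are computed exactly as before, but each row of the result is assigned as a Vector4 rather than constructing a whole Matrix4.

```csharp
using OpenTK;

static class MatrixMultRowWrite // illustrative name, not part of OpenTK
{
    public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)
    {
        float l11 = left.Row0.X, l12 = left.Row0.Y, l13 = left.Row0.Z, l14 = left.Row0.W;
        float l21 = left.Row1.X, l22 = left.Row1.Y, l23 = left.Row1.Z, l24 = left.Row1.W;
        float l31 = left.Row2.X, l32 = left.Row2.Y, l33 = left.Row2.Z, l34 = left.Row2.W;
        float l41 = left.Row3.X, l42 = left.Row3.Y, l43 = left.Row3.Z, l44 = left.Row3.W;
        float r11 = right.Row0.X, r12 = right.Row0.Y, r13 = right.Row0.Z, r14 = right.Row0.W;
        float r21 = right.Row1.X, r22 = right.Row1.Y, r23 = right.Row1.Z, r24 = right.Row1.W;
        float r31 = right.Row2.X, r32 = right.Row2.Y, r33 = right.Row2.Z, r34 = right.Row2.W;
        float r41 = right.Row3.X, r42 = right.Row3.Y, r43 = right.Row3.Z, r44 = right.Row3.W;

        // One Vector4 per row of the result; no intermediate Matrix4 is built.
        result.Row0 = new Vector4(
            l11 * r11 + l12 * r21 + l13 * r31 + l14 * r41,
            l11 * r12 + l12 * r22 + l13 * r32 + l14 * r42,
            l11 * r13 + l12 * r23 + l13 * r33 + l14 * r43,
            l11 * r14 + l12 * r24 + l13 * r34 + l14 * r44);
        result.Row1 = new Vector4(
            l21 * r11 + l22 * r21 + l23 * r31 + l24 * r41,
            l21 * r12 + l22 * r22 + l23 * r32 + l24 * r42,
            l21 * r13 + l22 * r23 + l23 * r33 + l24 * r43,
            l21 * r14 + l22 * r24 + l23 * r34 + l24 * r44);
        result.Row2 = new Vector4(
            l31 * r11 + l32 * r21 + l33 * r31 + l34 * r41,
            l31 * r12 + l32 * r22 + l33 * r32 + l34 * r42,
            l31 * r13 + l32 * r23 + l33 * r33 + l34 * r43,
            l31 * r14 + l32 * r24 + l33 * r34 + l34 * r44);
        result.Row3 = new Vector4(
            l41 * r11 + l42 * r21 + l43 * r31 + l44 * r41,
            l41 * r12 + l42 * r22 + l43 * r32 + l44 * r42,
            l41 * r13 + l42 * r23 + l43 * r33 + l44 * r43,
            l41 * r14 + l42 * r24 + l43 * r34 + l44 * r44);
    }
}
```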

A more modest speed improvement, but still an improvement.
The writing code is still having to construct new Vector4s to push into the result, so instead let’s try writing directly into the structures that are already there:
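A sketch of this final version (names are illustrative): each component of the out parameter’s rows is assigned in place, so no intermediate Vector4 or Matrix4 is constructed at all. (Assigning every field of an out struct piecewise like this satisfies C#’s definite-assignment rules.)

```csharp
using OpenTK;

static class MatrixMultInPlace // illustrative name, not part of OpenTK
{
    public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)
    {
        float l11 = left.Row0.X, l12 = left.Row0.Y, l13 = left.Row0.Z, l14 = left.Row0.W;
        float l21 = left.Row1.X, l22 = left.Row1.Y, l23 = left.Row1.Z, l24 = left.Row1.W;
        float l31 = left.Row2.X, l32 = left.Row2.Y, l33 = left.Row2.Z, l34 = left.Row2.W;
        float l41 = left.Row3.X, l42 = left.Row3.Y, l43 = left.Row3.Z, l44 = left.Row3.W;
        float r11 = right.Row0.X, r12 = right.Row0.Y, r13 = right.Row0.Z, r14 = right.Row0.W;
        float r21 = right.Row1.X, r22 = right.Row1.Y, r23 = right.Row1.Z, r24 = right.Row1.W;
        float r31 = right.Row2.X, r32 = right.Row2.Y, r33 = right.Row2.Z, r34 = right.Row2.W;
        float r41 = right.Row3.X, r42 = right.Row3.Y, r43 = right.Row3.Z, r44 = right.Row3.W;

        // Write each component straight into the caller's storage.
        result.Row0.X = l11 * r11 + l12 * r21 + l13 * r31 + l14 * r41;
        result.Row0.Y = l11 * r12 + l12 * r22 + l13 * r32 + l14 * r42;
        result.Row0.Z = l11 * r13 + l12 * r23 + l13 * r33 + l14 * r43;
        result.Row0.W = l11 * r14 + l12 * r24 + l13 * r34 + l14 * r44;
        result.Row1.X = l21 * r11 + l22 * r21 + l23 * r31 + l24 * r41;
        result.Row1.Y = l21 * r12 + l22 * r22 + l23 * r32 + l24 * r42;
        result.Row1.Z = l21 * r13 + l22 * r23 + l23 * r33 + l24 * r43;
        result.Row1.W = l21 * r14 + l22 * r24 + l23 * r34 + l24 * r44;
        result.Row2.X = l31 * r11 + l32 * r21 + l33 * r31 + l34 * r41;
        result.Row2.Y = l31 * r12 + l32 * r22 + l33 * r32 + l34 * r42;
        result.Row2.Z = l31 * r13 + l32 * r23 + l33 * r33 + l34 * r43;
        result.Row2.W = l31 * r14 + l32 * r24 + l33 * r34 + l34 * r44;
        result.Row3.X = l41 * r11 + l42 * r21 + l43 * r31 + l44 * r41;
        result.Row3.Y = l41 * r12 + l42 * r22 + l43 * r32 + l44 * r42;
        result.Row3.Z = l41 * r13 + l42 * r23 + l43 * r33 + l44 * r43;
        result.Row3.W = l41 * r14 + l42 * r24 + l43 * r34 + l44 * r44;
    }
}
```

Note that because the result is written in place, this version (like the library's own ref/out overload) must not be called with the same variable passed as both an input and the output.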

This time the results are even more striking, and the net result is a more than seven-fold speed improvement for doing nothing more than changing the way the data is read and written – exactly the same calculations are performed and exactly the same results are returned.

Another way of looking at this, which will probably appeal more to graphics coders: if the original code was running at 30 FPS and matrix multiplication were the only bottleneck, this new version would be running at a little over 260 FPS! In reality matrix multiplication will probably account for only a small fraction of the overall frame time, but it all helps.

I’m actually really surprised that the runtime can’t do these kinds of inlining and parameter-passing optimizations automatically. Especially the first change, where you move the matrix elements into local variables – I don’t see what prevents the JIT from recognising that the property is a simple accessor that will return the same value every time, and caching the result.

Also, I’ve never liked the necessity of passing value types by ref. Surely the runtime could have some kind of ‘pass large value types by const ref’ optimization that’s automatic and invisible? The compiler could detect that a value type parameter is read-only in a method and mark the signature appropriately. And C++ compilers have been doing return value optimizations for a long time.

A JIT has more information available to do optimization with than a static compiler – but there isn’t even any control flow or polymorphism to worry about in this example so your changes are simple static optimizations that I hope native compilers would be doing off the bat.

Well done for getting such a massive speedup. But it just shows how disappointing it is that these kinds of micro-optimizations have to be done by hand.