Wednesday, 24 August 2011

One feature of Haskell that I particularly like is its pattern-matching scheme. To route code down the correct codepath, Haskell uses pattern matching against data types and values, instead of extensive flow-control statements. This makes the code much simpler to read, as your code merely concerns itself with calling the correct functions, with the correct handler cases, and the underlying Haskell compiler deals with the flow-control minutiae.

I've increasingly found myself using a similar approach in C++ and HLSL using features such as polymorphic functions. A good example comes from optimisation: frequently, you find cases in the code where someone has written a "kitchen sink" function. This function internally handles many cases, and accordingly takes many parameters.

Frequently, many of these parameters are called with specificliteral values or constants. For example, many parameters may be zero which disable entire codepaths inside the function. For me, this is a bad code smell. But why? Many people would argue that since the code is branched away, what's the big deal? Especially if it means we can have just one definition of a function. Well, branches are very expensive on modern processors. On contemporary consoles, a floating point branch can yield a 27-cycle penalty. Not to mention all the preamble and branch set-up code.

I've therefore found myself increasingly providing multiple specialised versions of these functions, which omit some of the parameters that are permanently called with 0.0f or 1.0f and tailoring the code accordingly. Obviously this saves a lot of needless computation, branching and I$ misses.

What's interesting about this approach is that once you've performed a few of these refactorings, you start to expose patterns in the code and its structure that you previously were unaware of. You may find that your major axis of variation lies not in your chosen object hierarchy composition, but elsewhere. You may have a multitude of objects that add little value at a higher level, but now a large set of lower level functions that process the objects differently. Sometimes it makes more sense to structure and divide your code along the lines of what processing it does, not what you think the object "is", as that is the greater source of variation.

Tuesday, 23 August 2011

I've been spending a little time thinking about caching lately. Irradiance caching, optimising some caching in production code, and hardware caches. Here's a little brain-dump.

I've come to the conclusion that if you ever need anything more than a trivially small cache, your algorithm is wrong. This is particularly so for parallel programming, as caches are often a difficult to handle piece of interdependent common state.

My reasoning goes a little something like this. If you are relying on a cache for performance, it is usually to re-use the results of some expensive transformation a -> b. Crucially, there needs to be 1 a to one or more b (otherwise there is no cache reuse!). For an efficient cache, you need:

To be caching the results of a fairly complex computation,

To have an efficient means of storing the results of the computation,

To have a well-behaved cache indexing function.

Now, what often ends up happening is that you iterate over the many objects 'b', and populate the cache on-demand from a common ancestor piece of data, a. And here's the first problem. You are primarily iterating over the wrong dataset, and filtering and culling parts of the other dataset by proxy. You have an implicit data filtering or reduction operation that could be better handled more explicitly.

Ok, so, we switch our iteration round to iterate over the source objects 'a'. We iterate over each 'a' object, and produce the resulting 1 or more 'b' objects. At this point, do we really have a cache? We've turned our cache into a data transformation kernel. All of sudden our cache is no longer some intermediate state used in a computation, but it is now simply a piece of input data. Since we now have a piece of code transforming [a] -> [b], we have a piece of code that is very amenable to data parallelism. Since we've decoupled the code from a cache, it's also much easier to parallelise.

I think ultimately an extensive caching scheme is evidence that the data domain hasn't been adequately structured and managed. If you need a cache, you probably need more data conditioning, filtering, and reduction.

A good counter-example is the vertex stream processing unit of a modern GPU. These units read huge streams of data, transform them, and maintain minimal cached state. Vertex caches are typically miniscule in comparison to the volume of data they process. Yet a post-transform cache can often maintain a very high level of efficiency. The key to this cache's success is the pre-processing (ie structuring) of the data fed into it. It is not random. It is well conditioned, which yields efficiency at a number of levels, and enables a vertex cache to be used only as a relatively modest addition to the pipeline. The cache is the icing on the cake, not a fundamental component of the algorithm.

I think this is an important lesson when transitioning from writing single-threaded to parallel code. Don't dream up increasingly elaborate caching schemes. Instead think up separable, well-structured data-parallel transformations. When you feel the need for a cache, you probably need to carefully re-examine your data flow structure.