Monday, 11 April 2011

I came across an interesting design failure today. In short, a C++ object was internally storing the state specific to a piece of multi-threaded client code. The client code was leaking its state data into the C++ object. In generalised, abstracted terms:

Each of the threads would read from 'Object', process it, and store the results in one of the elements of m_state.

It's a bad pattern, but I daresay a common pattern.

What is wrong here?

The first thing to notice is the series of member variables, all arrays, sized by MaxContexts. A series of variables like this is a dead giveaway that really there is semantically some concept in the system that is not explicitly represented in the code. Given that we are trying to represent an object in a series of "contexts", with different values for each, it suggests that these variables are not properties of the object, but actually variables of some other, auxiliary state. If there is logically one of something, it's a member variable of the class in question. If an object embeds a number of them, it's probably someone else's variable.

Fundamentally, my problem with this is that we're embedding some client's data within the object itself. When the client code operates on the object, it is no longer a pure transformation of one object into another, it is now a conceptually more complex, confusing transformation of one object, to the same object with new state.

This additional state presents serious problems to multi-threading a piece of code. When foreign state is embedded unnecesarily in an object like this it makes it much more difficult to manage the ownership, lifetime and correctness of data. All the other clients of the Object are burdened with helping manage the state of some other system. Appropriate ownership, coupling, and lifetime of data is all-important in parallel programming.

Our problem is that we have inverted the "has-a" relationship. The object does not "have a" state in a number of contexts. In reality, the contexts have the state and they "have", or refer to the object.

Haskell Function Signature Analysis

I increasingly find that Haskell's function signature notation is a lucid yet lightweight means of analysing data flow. There are many other alternative, successful methods, yet I find the Haskell analysis direct and natural. These days I find myself thinking directly in these terms to understand data transformations.

In Haskell terms, we have:

f :: Object -> Object

Or possible even:

f :: Object -> (Object, State)

We're not sure! We would prefer:

f :: Object -> State

Usage of State

The next question is, how is this state really used? It is not really part of an object's update function, as the multiple contexts suggest the data is not part of the object. So, what is the client code of the object doing with the state? It doesn't really matter. Whatever it's doing, we can infer that:
a) The state is a product of the object. If it were not, it would not be in the object.
b) We have multiple contexts, implying multiple passes over the data.

When a piece of state is the product of an object, it is likely that it is intermediate data. Given that we expect multiple passes over the data, we're going to be caching a lot of pieces of state data that have nothing to do with the context in question. We could switch the state representation from SoA to AoS, but it's still data in the wrong place.

Let's say we have 10 'contexts' and 100 objects. Let's say the object is 20 bytes, and the 'state' is 6 byte. We have a total data volume of 100 * (20 + 6 * 10) bytes, of which 100 * (20 + 6) bytes is used. Out of the total data volume, we're therefore using only 33% of the total data volume.

So, if we switch from the:

[Object] -> [Object]

implementation to a

[Object] -> [State]

implementation, we now have 100 * 20 + 100 * 6 bytes total data volume and 100% of the data is used.

Where is the data going?

The previous transformation is essentially a map operation. We're applying a function a list of Objects, and returning a list of State objects. A state is not an endpoint of a data processing; it implies that further processing is going to use the data.

Here's our first big clue. All that data that originally lived inside the object, most likely on the heap, is really just intermediate data polluting the cache. It's just input to a succeeding function. We can most likely eliminate that "state" data altogether and pass the output directly through to the succeeding function. Or, we can store the state in some piece of intermediate thread-local storage. Either way, we've now got a clearer understanding of the real data flow in the program.

Eliminating the containing Object

The next question becomes, which bits of the Object really contribute to the new output State?

As the State is the product of an Object, and Object is non-trivial it is unlikely that the State is a product of the entire Object. It is probably some subset of member variables.

To use something of a straw man example, suppose the code in question is a visibility operation. Suppose that Object holds a bounding volume, some geometry, and the State holds various visibility test results, distances, LODs and the likely. Our transformation is therefore more likely to be:

BoundingVolume -> LodDescription -> State

Most of the Object's data is not used here. We've passed a subset of the Object's data to some function. This is a naive pattern in C++ code. Useful, general, common patterns of code are often hidden behind a facade of encapsulation. Unrelated pieces of data (in a given context) are brought in unnecessarily. When we develop a semi-abstract purely functional description of the operation, we can build a much clearer understanding of the data flow and dependencies.

Now imagine rewriting and structuring this operation in C++ based on this understanding of the data. Arguably an experienced programmer can arrive directly at the same conclusion, but I find the Haskell notation a useful intermediate tool for reasoning about data flows.

Thursday, 7 April 2011

Recently, I've been experimenting with different ray-triangle intersection algorithms. There are many alternatives out there, and many more optimisations and special cases, but I was looking for an version that ran quickly, and "suited" functional programming.

I settled on the half-planes solution. This algorithm defines a plane at each edge of the triangle by crossing the triangle's normal and the edge direction. To check if a triangle is inside the triangle, you simply check to see if a point (on the plane of the triangle) is on the same side of all of those planes.

This costs a little extra storage space, but it's a very fast test.

So. How does the code pan out? Well, pointInsideTriangle is simple enough:

This is best read right-to-left. The map applies a lambda function to each of the half planes. This function simply returns a bool, telling us if a point is on the correct side of a triangle. This yields a list of bools. We need to turn that into a single result; this is done using a fold to reduce many values to a single Bool. I've deliberately used a right-fold so that the fold lazily consumes only the necessary parts of the list. It can early-out.

So there we have it. One line of Haskell (with a couple of helper functions) to do the test.

The rest of the Haskell code for a ray-triangle intersection is pretty straightforward:

Here we have a simple guard condition to permit backface culling. If the triangle is not double-sided, we do a backface culling test and possibly reject it. If not, we fall through to the default case and intersect the ray against the plane, and then test that intersection point for containment in the triangle.

Intersecting against a list is an interesting case. A simple version would just map against the list, and then attempt to find the closest intersection out of the resulting list. However, if we use tail recursion, we can add a parameter to hold our current "state" and therefore maintain a current-closest intersection. This eliminates a second search-the-list step, and also permits an early-reject optimisation:

The pattern of "searching a list of x and return the closest" occurs frequently in raytracing. This is a potential candidate to be factored out into a re-usable function, or possibly even a monad.

The resulting code is also quite efficient. I'm now running at less than a minute for my test scene of the Cornell Box with 8x8 distributed raytracing. This is 30% quicker than my previously Möller-Trumbore ray-triangle intersection test.

Sunday, 3 April 2011

One of the big attractions of functional languages is that they're theoretically very easily to parallelise as they have no problematic side-effects You only describe how to perform a calculation, and never touch on the details of moving data around the machine. The run-time is then free to process data, given your functional expressions, in parallel.

I've been experimenting with Haskell's basic parallelism primitives, par, and pseq. These primitives trigger ("spark") evaluation of expressions in parallel - but only if the runtime deems it necessary.

After many failed or mildly successful experiments, I've finally managed to get a speedup. In short, I stopped trying to reinvent the wheel myself. I simply used parListChunk:

Here, we break the list of rays to be traced into data parallel chunks of 256, each free to be evaluated by a different thread. Each of these chunks is processed using map. Map applies the raytracing function to convert a ray direction into a colour, and we clamp the result.

And that's it.

Running it on two threads on my dual-core MacBook Pro halves the execution time.

What's nice is that the core 'map' expression is separated from the details of parallelisation. We've separated the concepts of raytracing and its parallel decomposition into two separate concerns. We are therefore free to vary either without intertwining their concerns.

Not a critical section, lock-free queue or atomic operation in sight!

Clearly, the nature of the functional approach surrenders a lot of low-level implementation details to your platform's runtime. However, it is more frequently the case that when you require such low-level intervention, you are working around deficiencies in your run-time platform, rather than dealing with fundamental algorithm details. Parallelising many algorithms follows established patterns.

It is also worth noting that the raytracer code is currently pretty straightforward code. I've not implemented any particular tricky optimisations, and certainly none of the ideas in work such as Ingo Wald's parallel raytracing PhD thesis. Not yet, anyway...