Optimising CHP

Recently I have been optimising the core algorithms in CHP. Barrier synchronisations and channel communications in CHP use the same “event” algorithm that supports all of CHP’s features, including choice between events and conjunctions of events. Conjunction really makes the choice algorithms difficult. Here’s a quick example to illustrate. Some terminology: each process wanting to synchronise is “offering” a set of offers — each offer-set is a choice (disjunction) of offers, only one of which should be completed. Each offer is a conjunction of a several events (if you’re not using conjunction, this will be a single event). Here is a test (written in a testing EDSL) from CHP:

So each of a–e is a two-party event, and there five processes making offers (each list is an offer-set, each list item is a different offer, each offer is a single event or conjunction of them). What should be the result? The event b can complete if p’s single offer and q’s single offer are taken, but for that to work, we’ll also need to be able to complete c, d, and e, in that case. If we look at process r, c can complete if a does (so we’ll need process t), but then d can’t complete as well, because only one of r’s offers may complete.

So the answer is that those offer-sets can’t complete yet, and all processes will have to wait. We want to be able to figure that out quickly, especially in the most common-case where each process is offering just one offer, with just one event (i.e. they want to perform a single channel communication), and the other common case where a process is offering two or three single-event offers. So we want speed — but we also need correctness. Having a bug in such a core algorithm would make CHP very unstable. So far it seems that CHP has been correct in its algorithm, which is why I have been hesistant to optimise it before now. Ideally I would like to prove the algorithm correct but, for now, tests will have to do.

I knew from a recent exchange with a user of the library that CHP was horribly slow when offering choices between appreciable numbers of offers. Ten offers by one process was slow, and twenty would virtually halt the program, which wasn’t right at all. So I set up a simple set of micro-benchmarks; a single channel communication, and others with one “stressed” process offering to communicate with 2, 5, or 10 other processes (who were all eagerly waiting to communicate to the stressed process, and nothing else). None of these things should be slow. I developed my recently-released Progression library to help me see what difference the changes would make.

The Optimisation

The old algorithm would look through all the connected offer-sets available (reading them from all the STM TVars) and then see if it could find a completion. I noticed that the offer-sets were being searched in an arbitrary order, rather than by the most likely to complete; a judicious sort at the right point provided surprising speed-up:

On the one hand, I’m pleased to have made such a gain so easily, and on the other I’m embarrassed that such a gain was so readily achievable, and hadn’t been done before. But I wasn’t done yet. The algorithm was needlessly expensive, and this resulted from the way I had decomposed the problem. I had one function that read in the value of all the offer-sets (in the STM monad), one that trimmed these offer-sets to iteratively remove offers that couldn’t complete (a pure function), one that searched for remaining possible completions (also pure) and one that wrote the results to the appropriate places (again, in the STM monad).

This approach previously seemed to break the problem down nicely (and separated the effectful parts from the non-effectful parts), but it’s horribly slow if you take a step back. All offer-sets are read in, even if the first is ready to complete. So you may touch far more TVars than you needed to, then spend time pruning the offers when the first was obviously ready to complete. So I rewrote the algorithm; I now have two (mutually recursive) functions in the STM monad that combine the reading in of offer-sets and the searching (without explicit iterative pruning). This yielded great benefits over the previous algorithm for larger choices; here is the difference between the three versions so far on a logarithmic scale:

(As a side note: what horrific time-complexity must my previous algorithm have had to display this curve on a log-scale?!) To my surprise, the details of the algorithm became shorter after I rewrote it to be faster. This was perhaps because I created a new monad to write my algorithm in (a caching, backtracking search monad) which pulled out some of the common logic. With a few extra tweaks, I was able to reduce the speed even further. Here is the final optimisation I settled on, compare to that initial rewrite (back on a linear scale):

See how the performance is now roughly the same for each simpleChoice benchmark, no matter how many offers are being chosen between (because typically in this benchmark, the first one is always ready). I remain a little uneasy about releasing these changes. I have tested the algorithm a lot, but it’s still a rewrite of the core of CHP. I’ll release this soon as CHP 2.0.1 (there is no change in the API, hence the minor version increment). It should be generally faster, but please let me know if you have any odd behaviour that may indicate a bug. (I also plan to do some more optimisation in the future, but this batch will do for now.)

Optimisation Techniques

I don’t want to go into all the gory details of what I did to optimise the library, as that would involve explaining the full algorithm — which be a much longer post than I have time for. The changes that brought the most gain were algorithmic. Inserting the sort to change the search order made a big difference, and rewriting the algorithm to exit the search as soon as possible (and touching as few TVars as possible) made most of the remaining difference. After that, I was making changes such as adding a little caching of certain calculations to make sure that events and offers were never unnecessarily processed twice (by using a StateT monad on top of STM). My monad is a backtracking monad (well, sort of — I use an Applicative Alternative instance to explicitly denote searching of several possible paths, with something a lot like a MaybeT monad) but the caching is preserved across the backtracking as you would hope. Another change I made (but may revisit) was switching from using Map and Set from the standard libraries to using ordered lists for the same purpose. In my common cases these maps and sets will be 1-3 items, which is why I was able to see some speed-up from a simpler structure. I originally put these ordered lists in a newtype, but it turned out that removing the newtype wrapper also gained me some speed-up.

It has not escaped my notice that supporting conjunction in CHP’s events algorithm is really an instance of the “other” CSP: a Constraint Satisfaction Problem. But it does have some interesting performance characteristics (e.g. reading from less events can help a lot) and some useful aspects that I can optimise on (e.g. if one process was offering event “e” anywhere in its offers, and you choose one without “e”, “e” can no longer complete; this is not as obvious as it sounds), which is why I have my own algorithm for resolving the choices.