Tag Archives: memory layout

I have been trying to debug why some MiniSat-based solvers perform better at unit propagation than CryptoMiniSat. It took me exactly 3 full days to find out and I’d like to share this tidbit of information with everyone, as I think it might be of interest and I hardly believe many understand it.

The mystery of the faster unit propagation was that I tried my best to make my solver behave exactly as MiniSat to debug the issue, but even though it was behaving almost identically, it was still slower. It made no sense and I had to dig deeper. You have to understand that both solvers use their own memory managers. In fact, CryptoMiniSat had a memory manager before MiniSat. Memory managers are used so that the clauses are put close to one another so there is a chance that they are in the same memory page, or even better, close enough for them not to waste memory in the cache. This means that a contiguous memory space is reserved where the clauses are placed.

Armin Biere gave a lecture at the Pragmatics of SAT workshop (proceeedings here) in Vienna about all the things inside lingeling which won a lot of awards[PDF] this year. If you weren’t there, you missed an amazing presentation. In this blog post I’ll reflect on a particular part of the presentation dealing with a memory trickery that has been intriguing me for a long while but I did not implement. Before I begin let me say: the presentation was awesome, and it’s not by chance that lingeling won so many awards.

The idea

The idea used by lingeling I want to talk about is easy to explain (though not easy to invent, as is usually the case). If you look at typical CNF problems, the majority of the clauses will be binary, i.e. only contain 2 literals. These clauses used to be stored exactly the same way as normal clauses: in the heap we allocate 2 literals and we put a pointer into the watchlist to these literals.

An improvement over this idea are the so-called implicit clauses. For the binary clause “x V y” we put into the watchlist of x the literal y, and in the watchlist of y the literal x. There is no other place we store these binary clauses, hence the word “implicit”. For other clauses, we still put pointers into the watchlist and allocate space, as usual. The problem with this approach is that the pointer to a clause is 32b (we use an 32b offset on 64b machines) but for each clause we also store a so-called blocking literal in the watchlist, which is also 32b. That makes the entries in the watchlist 64b long for normal clauses, and 32b long for binary clauses.

The idea is to have differing sizes of elements in a watchlist. If e.g. the first bit of the element is a 1, the next 63b relate to a long clause, and if it’s a 0, then the next 31b relate to a binary clause. In case 80% of the clauses are binary, this saves 50% of the space in 80% of the cases. Not bad at all.

The advantages

The advantages of using this idea are twofold. First, as already mentioned, memory use is lowered. This is non-trivial as memory usage of the watchlists can be enormous and although many other improvements can be switched off (such as e.g. stamping), storage of the clauses can never be switched off. Secondly, not having holes in memory leads to much better cache usage which in turn can bring real speedup. In case you think this is not important, you might enjoy knowing that the HHVM module of Facebook was made over 2x faster by making sure that important cache lines are not knocked out[PDF].

The disadvantages

In case you have an array that has varying size elements in it, some non-trivial complications arise. Let me list just a few that come to my mind.

Sorting the list is no longer trivial You cannot just swap elements: they might not fit. One way to do sorting efficiently is to move all the data into another, equally-spaced container and sort there, then move it back in. However, keep in mind that the reason why quicksort is so fast is that it can do in-place sort. Merge-sorting would be another option, but it copies elements and it’s not by chance it’s not the default sort in most cases. Also, you would have to re-write merge-sort of course, to deal with the varying-sized elements.

In case you think that sorting is not needed, maybe you forgot to consider the lightning-quick subsumption you can do between implicit binary&tertiary clauses using sort to give just one example.

Removing an element is no longer an O(1) operation In case you need to remove element X in a watchlist, you can simply swap the last element to the position of X and make the array one smaller. This trick is used quite extensively since the order of the watchlist is usually irrelevant.

Loops need to be re-written All your loops that go through the watchlist need to be re-written and have to have a switch() in them with some pointer arithmetic (do we advance by 32b or 64b?). In case you think you don’t need to go through the watchlist too often anyway, think again. Any time you need to do anything with clauses you will have to go through the watchlists, since they are the only place where binary (and tertiary, if your system is optimised enough) clauses are stored. This means your watchlist-gothrough function will be absolutely everywhere in the code unless you don’t want to implement any pre- or in-processing.

You might think you could write a function and just pass a pointer to another function to it that does the ‘real’ job, essentially hiding the complexity in a function that you only need to write once. There are three problems with this. First, this will be a tight loop and so performance is important, which you will loose as you will need to dereference the passed-in function pointer every time. This can be overcome with the use of templates but it won’t make the code pretty. Secondly, your original, hiding function will need to be written more than a single time. For example, some such executions will need to count time (operations) and some won’t. You will need to count pointer dereferences (normal clause is fetched) and binary clauses (no pointer dereferenced) in a significantly different way as a cache-miss is very expensive and a clause-access will cause such a cache-miss most of the time. For performance reasons you will need variations that don’t dereference long clauses, variations that allow for manipulation of the array, variations that don’t, etc.

Maintaining datastructure consistency becomes harder Unless you use hiding functions, which is non-trivial as explained above (and maybe impossible in e.g. plain C), the complexity to maintain consistency of the datastructure will be all over the code. Even if done very carefully, the constraints on the datastructure may end up being implicitly, rather than explicitly, represented in the code. It will make it easier to create bugs and harder to find them.

Conclusions

This idea of using less memory for binary clauses in the watchlists is very interesting and has intrigued me for a long while — Armin was kind enough to tell me about this a long time ago. It has the potential to save a lot of memory and to keep things more packed in the datastructure that is arguably the most accessed during solving and inprocessing. However, I was always daunted by the obstacles I saw in front of me — though I might simply need to understand C++11 and templates better to make it work.

Currently, I feel like there are plenty of other optimisations that I could implement from the talk of Armin, e.g. that all watchlists are stored in the same array, using offsets and a hand-rolled memory manager. That seems to have a potential of also improving the memory usage and speed while being easier to implement and easier to hide in a class.

Variable renumbering in SAT solvers keeps a mapping between the external variable numbers that is visible to the users and the internal variable numbers that is visible the to the system. The trivial mapping that most SAT solvers use is the one-to-one mapping where there is no difference between outer and internal variables. A smart mapping doesn’t keep track of all data related to variables that have been set or eliminated internally, so the internal datastructures can be smaller.

Advantages

Having smaller internal data structures help in achieving a lower memory footprint and better cache usage.

The memory savings are useful because some CNFs have tens of millions of variables. If the solver uses the typical watched literal scheme, it needs 2 arrays for each variable. If we are using 64b pointers and 32b array sizes, it’s 32B for each variable, so 32MB for every million variable only to keep the watching literal array(!). I have seen people complaining that their 100M variable problem runs out of memory — if we count that right that’s 3.2GB of memory only to hold the watching literal array pointers and sizes, not any data at all.

As for the CPU cache benefits: modern CPUs work using cache lines which are e.g. 64B long on Intel Sandy Bridge. If half of the variables are set already, the array holding the variable values — which will be accessed non-stop during propagation — will contain 50% useless data. In practice the speedup achieved can be upwards of 10%.

The simple problems

One problem with having a renumbering scheme is that you need to keep track of which datastructure is numbered in which way. The easy solution is to renumber absolutely everything. This is costly, however, as the mapping has to change every once in a while when new variables have been set. In this case, if everything is renumbered, then the eliminated variables‘ data needs to be updated according to the new mapping every time. This might be quite significant. So, it’s best not to renumber that. Similarly, if disconnected component analysis is used, then the disconnected components’ saved solutions need to be renumbered as well, along with the clauses that have been moved to the components.

An approach I have found to be satisfying is to keep every dynamic datastructure such as variable states (eliminated/decomposed/etc.), variable values (True/False/Unknown), clauses’ literals, etc. renumbered, while keeping mostly static datastructures such as eliminated clauses or equivalent literal maps non-renumbered. This works very well in practice as it allows the main system to shuffle the mapping around while not caring about all the other systems’ data.

The hard problem

The above is all fine and dandy until bounded variable addition (BVA) comes to the scene. This technique adds new variables to the problem to simplify it. These new variables will look like new outer variables, which seems good at first sight: the system could simply print the solution to all variables except the last N that were added by BVA and are not part of the original problem. However, if the caller adds new variables after the call to solve(), what can we do? The actual variables by the caller and the BVA variables will be mixed up: start with a bunch of original variables, sprinkle the end with some BVA, then some original variables, then some BVA…

The trivial solution to this is to have another mapping, one that translates variable numbers between the BVA and non-BVA system. As you might imagine, this complicates everything. Another solution is to forcibly eliminate all BVA variables after the call to solve(), let the user add the new clauses, and perform BVA again. Another even more complicated solution is to keep track of the variables being added, then re-number all variables inside all datastructures to move all BVA variables to the end of the variable array. This is expensive but only needs to be done once after the call to solve(), which may be acceptable. Currently, CryptoMiniSat uses the trivial scheme. Maybe I’ll move to the last (and most complicated) system later on.

Conclusion

Variable renumbering is not for the faint of heart. Bugs become significantly harder to track, as all debug messages need to be translated to a common variable numbering or they make no sense at all. It’s also very easy to introduce bugs through variable renumbering. A truly difficult bug I had was when the disconnected component finder’s sub-solver renumbered its internal variables and when I tried to import some values from the sub-solver back to the main solver, I used the wrong variable numbers.

This is just a short note that CryptoMiniSat 3.3 has been released. This is mainly a bugfix release. I have managed to make a fuzzer that simulates library usage of the system by interspersing the CNF with “c Solver::solve()” calls and then checking the intermediate solutions. Checking is either performed by verifying that all clauses are satisfied, or if the solution is UNSAT, by using a standard SAT solver (I use lingeling). This turned up a surprising number of bugs. I have also fixed the web-based statistics printing which should now work without a hitch.

The non-bugfix part of the release is that CryptoMiniSat now saves memory by dynamically shortening and reallocating the watchlist pointer array and some other data structures. This can save quite a bit of memory on large, sparse instances.

In this post I will try to explain how locality of reference could be exploited by grouping operations that access different memory locations. In the SAT solver program I write, there is a loop that does a something called propagation. The goal of propagation is to set a number of variables to a given value through a pretty cache-intensive process. While doing this, as the variables are set, some statistics are calculated. For example, I used to have something like this called for every variable set:

C++

1

2

3

4

5

6

7

8

9

enqueueNewFact(Literallit,PropagatedBy someData){

trail.push_back(lit.var());

value[lit.var()]=lit.sign();

agility*=conf.agilityG;

if(varPolarity[lit.var()]!=lit.sign())agility+=0.0001;

varPolarity[lit.var()]=lit.sign();

varSet[lit.var()]++;

level[lit.var()]=getLevel();

}

The problem with this function is that it is accessing a lot of data pieces: trail[], value[], level[], varPolarity[] and varSet[]. All of these must be in cache for these to be updated, and if they aren’t, they must be pulled in from lower levels of cache or the main memory. Since enqueuNewFact() is called often, these are mostly in cache all the time, taking the space away from what really needs it: the data used during the propagate() routine.

What I have just recently realised is that I don’t use the values of level[], varPolarity[], varSet[] or agility at all during the propagate() routine. Instead, they are used when the recursive calling of propagate() has finished, and a certain technical point (called a conflict) is reached. If this is indeed the case, why are we updating them all the time, immediately when enqueueNewFact() is called? If you look at the code, due to various reasons, we must save a trail (called, conveniently, trail[]) of what we have done. This trail contains all the information needed by all the statistics-update functions: var() and the sign(), both contained in Lit is all that’s necessary.

So, I have made a new function: delayedUpdate() that calculates all the statistics just when it is needed: when a conflict is reached. It has a simple state that saves at what trail position it was called last time, lastDelayedEnqueueUpdate. The default is trail level 0, and then whenever we reach a conflict, we update it, once the stats are calculated:

C++

1

2

3

4

5

6

7

8

9

10

11

12

13

14

delayedEnqueueUpdate(){

intpseudoLevel=lastDelayedEnqueueUpdateLevel;

for(inti=lastDelayedEnqueueUpdate;i<trail.size();i++){

Lit lit=trail[i];

agility*=conf.agilityG;

if(varPolarity[lit.var()]!=lit.sign())agility+=0.0001;

varPolarity[lit.var()]=lit.sign();

varSet[lit.var()]++;

if(trail_lim[pseudoLevel]==i)pseudoLevel++;

level[p.var()]=pseudoLevel;

}

lastDelayedEnqueueUpdate=trail.size();

lastDelayedEnqueueUpdateLevel=decisionLevel();

}

This code, apart from the pseudoLevel and lastDelayedEnqueueUpdateLevel (both of which are SAT solver technicalities) is rather easy to understand. Basically, we take the last position we updated last time, lastDelayedEnqueueUpdate, and we work our way from there to the end, trail.size(). Once we have worked our way through, we set lastDelayedEnqueueUpdate to the current point, trail.size().

All this seems useless if we don’t take cache usage into consideration: we are not saving any CPU instructions. If anything, we add CPU instructions to execute, as we now need to execute an extra for loop (propagate() is essentially a recursive for loop). However, from a cache usage point of view, we are saving a lot. When doing propagation, the propagate() routine can have all the cache it needs. Then, when that cache-intensive process has finished, we pull in a set of arrays, varPolarity[], varSet[], level[], update them all in one go, and then continue operation.

We have managed to group operations according to what memory pieces they are using&updating. This makes our program more cache-friendly, which almost always leads to a measurable speed increase.