I have a collection of computational models that could be described as asynchronous cellular automata. These models resemble the Ising model but are slightly more complicated. It seems as if such models would benefit from being run on a GPU rather than a CPU. Unfortunately it isn't quite straightforward to parallelise such a model, and it isn't at all clear to me how to go about it. I'm aware that there is literature on the subject, but it all seems to be aimed at hardcore computer scientists who are interested in the details of algorithmic complexity, rather than someone like me who just wants a description of something I can implement, and consequently I find it rather impenetrable.

For clarity, I'm not looking for an optimal algorithm so much as something I can rapidly implement in CUDA that's likely to give a significant speedup over my CPU implementation. Programmer time is much more of a limiting factor than computer time in this project.

I should also clarify that an asynchronous cellular automaton is a rather different thing from a synchronous one, and techniques for parallelising synchronous CAs (such as Conway's life) cannot easily be adapted to this problem. The difference is that a synchronous CA updates every cell simultaneously at every time step, whereas an asynchronous one updates a randomly chosen local region at every time step as outlined below.

The models I wish to parallelise are implemented on a lattice (usually a hexagonal one) consisting of ~100000 cells (though I'd like to use more), and the non-parallelised algorithm for running them looks like this:

Choose a neighbouring pair of cells at random

Calculate an "energy" function $\Delta E$ based on a local neighbourhood surrounding these cells

With a probability that depends on $e^{-\beta \Delta E}$ (with $\beta$ a parameter), either swap the states of the two cells or do nothing.

Repeat the above steps indefinitely.
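For concreteness, the serial loop above might look like the following minimal Python sketch. It assumes a square lattice with periodic boundaries (rather than the hexagonal one), a Metropolis-style acceptance rule min(1, exp(-beta*dE)), and a stand-in `delta_E` function; all of these are illustrative assumptions, not the model's actual details.

```python
import math
import random

def metropolis_pair_step(grid, beta, delta_E, rng=random):
    """One iteration of the serial algorithm: choose a random neighbouring
    pair, compute the energy change dE for swapping their states, and swap
    with probability min(1, exp(-beta * dE))."""
    n, m = len(grid), len(grid[0])
    i, j = rng.randrange(n), rng.randrange(m)
    di, dj = rng.choice([(0, 1), (1, 0), (0, -1), (-1, 0)])   # random neighbour
    k, l = (i + di) % n, (j + dj) % m                         # periodic boundaries
    dE = delta_E(grid, (i, j), (k, l))   # model-specific local energy difference
    if dE <= 0 or rng.random() < math.exp(-beta * dE):
        grid[i][j], grid[k][l] = grid[k][l], grid[i][j]       # accept the swap
```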

There are also some complications to do with the boundary conditions, but I imagine these won't pose much difficulty for parallelisation.

It's worth mentioning that I'm interested in the transient dynamics of these systems rather than just the equilibrium state, so I need something that has equivalent dynamics to the above, rather than just something that will approach the same equilibrium distribution. (So variations of the chequerboard algorithm are not what I'm looking for.)

The main difficulty in parallelising the above algorithm is collisions. Because all the calculations depend only on a local region of the lattice, it's possible for many lattice sites to be updated in parallel, as long as their neighbourhoods aren't overlapping. The question is how to avoid such overlaps. I can think of several ways, but I don't know which if any is the best one to implement. These are as follows:

Use the CPU to generate a list of random grid sites and check for collisions. When the number of grid sites equals the number of GPU processors, or if a collision is detected, send each set of coordinates to a GPU unit to update the corresponding grid site. This would be easy to implement but probably wouldn't give much of a speed up, since checking for collisions on the CPU would probably not be all that much cheaper than doing the whole update on the CPU.

Divide the lattice up into regions (one per GPU unit), and have one GPU unit responsible for randomly selecting and updating grid cells within its region. But there are many issues with this idea that I don't know how to solve, the most obvious being what exactly should happen when a unit chooses a neighbourhood overlapping the edge of its region.

Approximate the system as follows: let time proceed in discrete steps. Divide the lattice up into a different set of regions on every time step according to some pre-defined scheme, and have each GPU unit randomly select and update a pair of grid cells whose neighbourhood does not overlap the region's boundary. Since the boundaries change every time step this constraint might not affect the dynamics too much, as long as the regions are relatively large. This seems easy to implement and likely to be fast, but I don't know how well it will approximate the dynamics, or what is the best scheme for choosing the region boundaries on each time step. I found some references to "block-synchronous cellular automata", which may or may not be the same as this idea. (I don't know because it seems that all the descriptions of the method are either in Russian or are in sources to which I don't have access.)
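One logical time step of this third scheme might be sketched as follows, assuming a square lattice, horizontal pairs only, a 1-cell neighbourhood, and a block size of at least 4; `update_pair` is a placeholder for the pair-swap update. On a GPU each region would go to its own unit; here they are processed in a serial loop.

```python
import random

def block_synchronous_step(grid, block, update_pair, rng=random):
    """One logical time step: shift the partition into block x block regions
    by a random offset, then let every region update one random pair whose
    1-cell neighbourhood lies strictly inside the region, so regions never
    interact.  Assumes block >= 4; update_pair(grid, i, j) is expected to
    update the pair (i, j)-(i, j+1)."""
    n, m = len(grid), len(grid[0])
    off_i, off_j = rng.randrange(block), rng.randrange(block)
    for bi in range(n // block):
        for bj in range(m // block):
            i = bi * block + rng.randrange(1, block - 1)   # 1-cell top/bottom margin
            j = bj * block + rng.randrange(1, block - 2)   # room for the pair + margin
            update_pair(grid, (off_i + i) % n, (off_j + j) % m)
```

Note that pairs near region boundaries are never selected within a given step, which is exactly the bias whose effect on the dynamics is unclear; the random offset is what spreads that bias around between steps.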

My specific questions are as follows:

Are any of the above algorithms a sensible way to approach GPU parallelisation of an asynchronous CA model?

Is there a better way?

Is there existing library code for this type of problem?

Where can I find a clear English-language description of the "block-synchronous" method?

Progress

I believe I have come up with a way to parallelise an asynchronous CA that might be suitable. The algorithm outlined below is for a normal asynchronous CA that updates only one cell at a time, rather than a neighbouring pair of cells as mine does. There are some issues with generalising it to my specific case, but I think I have an idea how to resolve them. However, I'm not sure how much of a speed benefit it will give, for reasons discussed below.

The idea is to replace the asynchronous CA (henceforth ACA) with a stochastic synchronous CA (SCA) that behaves equivalently. To do this we first imagine that the ACA is a Poisson process. That is, time proceeds continuously, and each cell has a constant probability per unit time of performing its update function, independently of the other cells.

We construct an SCA whose cells each store two things: the state $X_{ij}$ of the cell (i.e. the data that would be stored in each cell in the sequential implementation), and a floating-point number $t_{ij}$ representing the (continuous) time at which it will next update. This continuous time does not correspond to the update steps of the SCA; I will refer to the latter as "logical time". The time values are initialised randomly according to an exponential distribution, $t_{ij}(0) \sim \operatorname{Exp}(\lambda)$, where $\lambda$ is a rate parameter whose value can be chosen arbitrarily.

At each logical time step, the cells of the SCA are updated as follows:

If, for any $k, l$ in the neighbourhood of $i,j$, the time $t_{kl}<t_{ij}$, do nothing.

Otherwise, (1) update the state $X_{ij}$ according to the states $X_{kl}$ of the neighbouring cells, using the same rule as the original ACA; and (2) generate a random value $\Delta t \sim \operatorname{Exp}(\lambda)$ and update $t_{ij}$ to $t_{ij}+\Delta t$.
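The two rules above amount to saying that a cell fires exactly when its clock is a strict local minimum over its neighbourhood. A serial Python sketch of one logical time step, assuming a square lattice with a von Neumann neighbourhood and periodic boundaries (`update_cell` stands in for the ACA's local rule):

```python
import random

def sca_sweep(X, t, update_cell, rng=random, lam=1.0):
    """One logical time step of the SCA.  A cell fires iff its clock t[i][j]
    is strictly smaller than every clock in its neighbourhood; it then
    applies the ACA update rule to its state and draws a fresh Exp(lam)
    increment for its next firing time."""
    n, m = len(t), len(t[0])
    nbrs = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    fired = [(i, j) for i in range(n) for j in range(m)
             if all(t[i][j] < t[(i + di) % n][(j + dj) % m] for di, dj in nbrs)]
    # no two firing cells are neighbours, so their updates cannot conflict
    for i, j in fired:
        update_cell(X, i, j)                 # same local rule as the original ACA
        t[i][j] += rng.expovariate(lam)      # schedule this cell's next update
    return fired
```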

I believe this guarantees that the cells will be updated in an order that can be "decoded" to correspond to the original ACA, while avoiding collisions and allowing some cells to be updated in parallel. However, because of the first rule above, it means that most of the GPU processors will be mostly idle on each time step of the SCA, which is less than ideal.

I need to give some more thought to whether the performance of this algorithm can be improved, and how to extend this algorithm to deal with the case where multiple cells are updated simultaneously in the ACA. However, it looks promising so I thought I would describe it here in case anyone (a) knows of anything similar in the literature, or (b) can offer any insight into these remaining issues.

Perhaps you can formulate your problem in a stencil-based approach. Much software exists for stencil-based problems; you may have a look at libgeodecomp.org/gallery.html (e.g. Conway's Game of Life). This might have some similarities.
– vanCompute, Feb 17 '13 at 10:17

@vanCompute that looks like a fantastic tool, but from my initial (rather cursory) investigation, it looks like the stencil code paradigm is inherently synchronous, so it's probably not well suited to what I'm trying to do. I will look into it further, however.
– Nathaniel, Feb 17 '13 at 11:29

Can you provide a few more details on how you would parallelize this using SIMT? Would you use one thread per pair? Or can the work involved with updating a single pair be spread over 32 or more threads?
– Pedro, Feb 25 '13 at 10:47

@Pedro the work involved in updating a single pair is fairly small (basically just summing over the neighbourhood, plus one iteration of a random number generator and one exp()) so I wouldn't have thought it makes much sense to spread it over multiple threads. I think it's better (and easier for me) to try and update multiple pairs in parallel, with one pair per thread.
– Nathaniel, Feb 25 '13 at 11:29

Ok, and how do you define an overlap between two pair updates? If the pairs themselves overlap, or if their neighbourhoods overlap?
– Pedro, Feb 25 '13 at 12:52

4 Answers

I would use the first option, but run a synchronous CA pass beforehand (on the GPU) to detect collisions: execute one step of a hexagonal CA whose rule is value of the centre cell = Sum(neighbours). This CA needs seven states; it is initialised with the randomly selected cells marked, and their states are checked before each GPU unit runs the update rule.

Sample 1. Two randomly selected cells share a neighbouring cell:

0 0 0 0 0 0 0

0 0 1 0 0 0

0 0 0 0 0 0 0

0 0 0 1 0 0

0 0 0 0 0 0 0

After one step of the hexagonal CA whose rule is central cell = Sum(neighbours):

0 0 1 1 0 0 0

0 1 1 1 0 0

0 0 1 2 1 0 0

0 0 1 1 1 0

0 0 0 1 1 0 0

Sample 2. One cell to be updated is itself counted as a neighbour of the other.
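As I read this proposal, the detection pass might be sketched as follows, on a square lattice with a Moore neighbourhood in place of the hexagonal one (an assumption for brevity); a value of 2 or more anywhere signals that two selected cells' neighbourhoods overlap.

```python
def mark_collisions(marks):
    """One step of the collision-detection CA: every cell becomes the sum of
    its marked neighbours (Moore neighbourhood, periodic boundaries).  A
    value of 2 or more means two selected cells share a neighbour, i.e. a
    collision somewhere on the grid."""
    n, m = len(marks), len(marks[0])
    out = [[sum(marks[(i + di) % n][(j + dj) % m]
                for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0))
            for j in range(m)] for i in range(n)]
    return out, any(v >= 2 for row in out for v in row)
```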

This is an interesting idea, but I'm not convinced it will work very well. The issue is that in order to finish the collision detection step, you're faced with the problem of working out whether the grid contains any 2's. This takes $O(n)$ calculations, where $n$ is the number of grid cells. With some clever trickery, many of these can be done in parallel on the GPU side - but with the most obvious way of doing that you have no way of knowing where on the grid the collision was, so if there is one you have to start again. It's an interesting idea though, and I'll give some thought to it.
– Nathaniel, Feb 18 '13 at 14:56

I think there is much that can be parallelised. Collision processing is carried out entirely on the GPU, as one step of a synchronous CA as shown in the link posted above. For verification you would use a local rule: if Sum(neighbours) = 8 there is no collision, and if Sum(neighbours) > 8 there is a collision. This would be checked before running your update rule to change the cell states, since the two selected cells should lie close to the points being evaluated; if they are not close, they belong to other cells.
– jlopez1967, Feb 18 '13 at 15:57

I understand that, but the problem is, what do you do when you detect a collision? As I explained above, your CA algorithm is only the first step in detecting a collision. The second step is to search the grid for cells with a state >= 2, and this is not trivial.
– Nathaniel, Feb 19 '13 at 0:42

E.g., imagine we want to detect a collision at cell (5,7): on the cellular automaton we execute sum(neighbours of cell (5,7)); if the value is 8 there is no collision, and if it is greater than 8 there is a collision. This should go in the function that evaluates each cell to define its next state in the asynchronous cellular automaton. Collision detection for each cell is a local rule involving only its neighbouring cells.
– jlopez1967, Feb 19 '13 at 11:53

Yes, but the question we need to be able to answer in order to parallelise an asynchronous CA is not "was there a collision in cell (5,7)" but "was there a collision somewhere on the grid, and if so where was it?" That can't be answered without iterating over the grid.
– Nathaniel, Feb 19 '13 at 11:56

Following your answers to my questions in the comments above, I would suggest you try a lock-based approach in which each thread tries to lock-down the neighbourhood it will be updating before computing the actual update.

You can do this using the atomic operations provided for in CUDA, and an array of int containing the locks for each cell, e.g. lock. Each thread then does the following:
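The original code listing appears to be missing here. The following is a reconstructed single-threaded Python mock of the per-thread protocol, in which the plain read-and-set stands in for CUDA's `atomicCAS(&lock[c], 0, 1)`; the function name, cell list, and back-off behaviour are my assumptions, not the answer's actual code.

```python
def try_update_pair(lock, cells, do_update):
    """Per-thread protocol (serial mock): try to acquire the lock on every
    cell in the pair's neighbourhood; if any acquisition fails, release what
    we hold and back off (skip this pair).  Otherwise perform the update and
    release all locks.  In CUDA each acquisition would be
    atomicCAS(&lock[c], 0, 1) == 0 and each release lock[c] = 0."""
    held = []
    for c in cells:
        if lock[c] == 0:          # stands in for a successful atomicCAS
            lock[c] = 1
            held.append(c)
        else:                     # another thread holds this cell: back off
            for h in held:
                lock[h] = 0
            return False
    do_update()                   # the neighbourhood is exclusively ours
    for h in held:
        lock[h] = 0
    return True
```

On the GPU this same sequence runs concurrently in every thread, and correctness rests entirely on the lock acquisitions being atomic.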

Note that this approach is probably not optimal, but it could provide an interesting starting point. If there are a lot of collisions between threads, i.e. one or more per 32 threads (one collision per warp), then there will be a fair bit of branch divergence. Also, atomic operations can be a bit slow, but since you are only doing compare-and-swap operations, it should scale OK.

The locking overhead may look intimidating, but it's really only a few assignments and branches, not much more.

Note also that I'm being fast and loose with notation in the loops of i over the neighbours.

Addendum: I was cavalier enough to assume that you could simply back off when two pairs collide. If this is not the case, then you can wrap everything from the second line onwards in a while-loop and add a break at the end of the final if-statement.

All the threads will then have to wait until the last one is done, but if collisions are rare, you should be able to get away with it.

Addendum 2: Do not be tempted to add calls to __syncthreads() anywhere in this code, especially in the looping version described in the previous addendum! Asynchronicity is essential to avoiding repeated collisions in the latter case.

Thanks, this looks pretty good. Probably better than the complicated idea I was considering, and much easier to implement. I can make collisions rare by using a big enough grid, which is probably fine. If the just-back-off method turns out to be significantly faster I can use it for informally investigating the parameters, and switch to the waiting-for-everyone-else-to-complete method when I need to generate official results. I will give this a try some time soon.
– Nathaniel, Feb 25 '13 at 16:02

I'm the lead developer of LibGeoDecomp. While I agree with vanCompute that you could emulate your ACA with a CA, you're right that this would not be very efficient, as only a few cells in any given step are meant to be updated. This is indeed a very interesting application -- and fun to tinker with!

I'd suggest you combine the solutions proposed by jlopez1967 and Pedro: Pedro's algorithm captures the parallelism well, but those atomic locks are awfully slow. jlopez1967's solution is elegant when it comes to detecting collisions, but checking all $n$ cells when only a smaller subset is active (from now on I'll assume there is some parameter $k$ which denotes the number of cells to be updated simultaneously) is clearly prohibitive.

In the absence of good global synchronization on the GPU, you need to invoke multiple kernels for the different phases. On Nvidia's Kepler you can even move the main loop to the GPU, but I don't expect that to gain much.
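The phases alluded to here might look roughly as follows; this is my reconstruction, not the answer's actual code. Each phase would be a separate kernel launch on the GPU, sketched serially in Python: draw $k$ candidates, mark each candidate's neighbourhood, then update only the candidates whose neighbourhoods nobody else touched.

```python
import random

def combined_step(grid, k, neighbourhood, update_cell, rng=random):
    """One step of the combined scheme.  Phase 1: draw k candidate cells.
    Phase 2: count how many candidate neighbourhoods cover each cell, and
    keep only candidates whose entire neighbourhood is covered exactly once.
    Phase 3: update the surviving (collision-free) candidates."""
    n, m = len(grid), len(grid[0])
    candidates = [(rng.randrange(n), rng.randrange(m)) for _ in range(k)]
    cover = {}
    for c in candidates:
        for cell in neighbourhood(c, n, m):
            cover[cell] = cover.get(cell, 0) + 1
    safe = [c for c in candidates
            if all(cover[cell] == 1 for cell in neighbourhood(c, n, m))]
    for i, j in safe:            # phase 3: these updates cannot conflict
        update_cell(grid, i, j)
    return safe
```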

The algorithm achieves a (configurable) degree of parallelism. I guess the interesting question is whether the collisions will affect your random distribution when you increase $k$.

I suggest that you look at this link, http://www.wolfram.com/training/courses/hpc021.html (approximately 14:15 minutes into the video of the Mathematica training course), where they make an implementation of a cellular automaton using CUDA; from there you can modify it.

Unfortunately that is a synchronous CA, which is a rather different type of beast from the asynchronous ones I'm dealing with. In a synchronous CA, every cell is updated simultaneously, and this is easy to parallelise on a GPU, but in an asynchronous CA a single randomly chosen cell is updated every time step (actually in my case it's two neighbouring cells), and this makes the parallelisation a lot harder. The problems outlined in my question are specific to needing an asynchronous update function.
– Nathaniel, Feb 18 '13 at 13:00