Consider this basic scheme for particle-in-cell simulations (with short-range interactions only):

assign particles to disjoint cells

for each cell $A$, go over its neighboring cells $B$

for each particle $a_i$ in $A$, interact with all particles $b_j$ in $B$

move all particles
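The first three steps of the scheme above can be sketched in plain Python as follows. This is only a minimal serial illustration, not a GPU implementation: the function names (`assign_cells`, `neighbor_cells`, `interacting_pairs`), the 2D setting and the `cutoff` parameter are assumptions made for the example, and it assumes `cell_size >= cutoff` so that checking the 3x3 block of cells around $A$ is enough. The "move all particles" step is left out.

```python
import math
from collections import defaultdict
from itertools import product


def assign_cells(positions, cell_size):
    """Map each cell index (ix, iy) to the list of particle indices inside it."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(i)
    return cells


def neighbor_cells(cell):
    """Return the cell itself plus its 8 neighbors in 2D."""
    ix, iy = cell
    return [(ix + dx, iy + dy) for dx, dy in product((-1, 0, 1), repeat=2)]


def interacting_pairs(positions, cell_size, cutoff):
    """Return the set of index pairs (i, j), i < j, closer than cutoff.

    Assumes cell_size >= cutoff, so only neighboring cells need checking.
    """
    cells = assign_cells(positions, cell_size)
    pairs = set()
    for cell_a, particles_a in cells.items():
        for cell_b in neighbor_cells(cell_a):
            # .get() does not create missing entries, so empty cells are skipped
            particles_b = cells.get(cell_b)
            if not particles_b:
                continue
            for i in particles_a:
                for j in particles_b:
                    if i < j and math.dist(positions[i], positions[j]) < cutoff:
                        pairs.add((i, j))
    return pairs
```

In a GPU version, the two outer loops over $(A, B)$ would correspond to the work-groups and the two inner loops to the work-items within one group.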

Memory locality is very important on a GPU. It therefore makes sense to assign each cell-cell interaction $(A,B)$ to one work-group, which can share a `__local` buffer of the $a_i$. But it may well happen that some cells are empty while others hold widely varying numbers of particles $n$. Each work-group would then have to process a very different number of pair interactions $N = n_A \cdot n_B$ between particles $(a_i, b_j)$, so the work-groups would be badly load-balanced and would have trouble staying synchronized.
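To make the imbalance concrete, here is a small sketch that computes the workload $N = n_A \cdot n_B$ for every cell pair. It assumes a 1D row of cells for brevity, and both the function name `pair_workloads` and the particle counts are made up for illustration.

```python
def pair_workloads(counts):
    """counts[k] is the number of particles in 1D cell k.

    Returns {(A, B): n_A * n_B} for every cell A and each B in {A-1, A, A+1}
    that lies inside the grid.
    """
    work = {}
    for a, n_a in enumerate(counts):
        for b in (a - 1, a, a + 1):
            if 0 <= b < len(counts):
                work[(a, b)] = n_a * counts[b]
    return work


# Hypothetical occupancy: one empty cell, one crowded cell
counts = [0, 3, 40, 5]
work = pair_workloads(counts)
for pair, n in sorted(work.items()):
    print(pair, n)
```

With these counts, the work-group for pair $(2,2)$ would have 1600 particle pairs to process, while every pair involving cell 0 has none.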

I guess this is a common problem in PIC, GPGPU and parallel computing in general. But I have only seen introductory tutorials and codes that pay little attention to optimization. I would be grateful for references to good, concise learning resources.

Perhaps for production, but I would like something for learning the essential strategies ... this Zoltan does not even seem to be open source.
– Prokop Hapala, Mar 27 '17 at 8:38

It'll depend a bit on your personal definition of open source, but if you look at the download page, Zoltan is available under three fairly standard licences (GPL, LGPL and BSD).
– origimbo, Mar 27 '17 at 10:21